Safely aborting long-running shell scripts


Problem statement

Imagine you fire off a long-running bash script consisting of several discrete tasks strung together sequentially, only to find out that you messed up and need to abort it. The script should stop after the line that is currently executing, without abnormally terminating the currently running “worker” command.

How do you do that safely?

Background

It might feel like an academic exercise, but for me (sadly) it isn’t. It might well happen to you, too, that you fire off a bad boy like this:

#!/bin/bash
set -x
f() { ruby worker.rb "$@"; }
f someparam a9072833-ea84-461a-82b1-b112cf190fd3
f someparam db8f9e6e-146c-4ccd-bb15-9b17e38bcbba
f someparam 6e0ee5d3-c94e-43e0-9f75-d0e30b5d34ac
f someparam 4fc9bf4a-e4e5-4f12-8db1-d8950149f79e
# f repeats 50+ more times

thinking that each instance of f will take 0-3 minutes, only to find out that it takes 30+. So running them sequentially is just not feasible[1]. Oops.

But let’s further assume that the worker script is terribly fragile, and simply leaning into ^C with your entire body weight might do more harm than good.

Here are more rules:

  1. Aborting the in-progress worker is unwanted; you want it to finish if at all possible[2].
  2. More importantly: starting the next worker in the script, even for a brief period of time, is verboten[3].
  3. Stopping (SIGSTOP) the current worker briefly is possible, but should preferably be avoided[4].

Those aren’t actually arbitrary limitations. Some processes – when heavily aided by Murphy – are faster than your ^C. And your shell habits should reflect that.

Specific setup

In order to model this, let me introduce the actors in this play in a bit more detail.

Starring role, the script that needs no introduction. The legend, the luminary, abortme.sh (the same as above[5]):

#!/bin/bash
set -x
f() { ruby worker.rb "$@"; }
f someparam a9072833-ea84-461a-82b1-b112cf190fd3
f someparam db8f9e6e-146c-4ccd-bb15-9b17e38bcbba
f someparam 6e0ee5d3-c94e-43e0-9f75-d0e30b5d34ac
f someparam 4fc9bf4a-e4e5-4f12-8db1-d8950149f79e
# f repeats 50+ more times

and his trusty sidekick, worker.rb:

#!/usr/bin/env ruby

$terminate = false

Signal.trap("CONT") { puts "Got CONT signal." }
Signal.trap("USR1") { puts "Got USR1 signal, terminating."; $terminate = true }
Signal.trap("TERM") { puts "Got TERM signal, terminating."; $terminate = true }
Signal.trap("INT") { puts "Got INT signal, terminating."; $terminate = true }

spinloop_time = ARGV.first.to_f
spinloop_time = 0.1 unless (0.001..1).include?(spinloop_time)

puts "Called with: #{ARGV.inspect}"
puts "Spinloop time: #{spinloop_time}"
puts "PIDs: me: #{Process.pid} , parent: #{Process.ppid}"
puts "Doing important work or something."
until $terminate
  sleep spinloop_time
end

puts "All done, very important result: 42."

Discussion

If you feel adventurous, now would be the time to come up with your take on the problem before reading further.

Let’s go through a few obvious attempts together:

First try: ^C while praying

Let me put this to rest right now. Mashing Control-C and praying that it works out might be a reasonable strategy if you don’t know any better.

It’s not very elegant:

$ bash abortme.sh 
+ f someparam a9072833-ea84-461a-82b1-b112cf190fd3
+ ruby worker.rb someparam a9072833-ea84-461a-82b1-b112cf190fd3
Called with: ["someparam", "a9072833-ea84-461a-82b1-b112cf190fd3"]
Spinloop time: 0.1
PIDs: me: 14759 , parent: 14758
Doing important work or something.
^CGot INT signal, terminating.
All done, very important result: 42.
+ f someparam db8f9e6e-146c-4ccd-bb15-9b17e38bcbba
+ ruby worker.rb someparam db8f9e6e-146c-4ccd-bb15-9b17e38bcbba
Called with: ["someparam", "db8f9e6e-146c-4ccd-bb15-9b17e38bcbba"]
Spinloop time: 0.1
PIDs: me: 14789 , parent: 14758
Doing important work or something.
^CGot INT signal, terminating.
^CGot INT signal, terminating.
All done, very important result: 42.
+ f someparam 6e0ee5d3-c94e-43e0-9f75-d0e30b5d34ac
+ ruby worker.rb someparam 6e0ee5d3-c94e-43e0-9f75-d0e30b5d34ac
^C

I tried repeatedly and couldn’t for the life of me get the script to abort before the second worker was up and running.

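It might look like a timing issue, but it is more likely bash doing what its manual says: when the foreground command handles SIGINT itself and exits normally (as worker.rb does), a non-interactive bash does not treat the ^C as fatal and carries on with the next line. Either way, this honestly sucks, and violates the second (most important) rule[6].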
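For scripts that have not been launched yet, the second rule can be protected with a stop flag that is checked before every worker starts. What follows is a minimal sketch under stated assumptions, not something from the run above: the names abort_requested and run_worker, the extra USR1 trap and the return code 111 are all made up for illustration, and run_worker is a stand-in so the snippet runs without Ruby. Bash runs a trap only after the foreground command has finished, so the flag is set by the time the next f is reached.

```shell
#!/bin/bash
# Sketch of an abortme.sh variant that can be told to stop between workers.
# abort_requested, run_worker and the code 111 are illustrative choices.
abort_requested=0

# INT covers ^C. USR1 allows `kill -USR1 <script pid>` from another terminal,
# which reaches only the script and never the worker.
trap 'abort_requested=1' INT USR1

# Stand-in so the sketch runs anywhere; the real thing would be:
#   run_worker() { ruby worker.rb "$@"; }
run_worker() { echo "working on: $*"; }

f() {
  if [ "$abort_requested" -eq 1 ]; then
    echo "abort requested, not starting: $*" >&2
    return 111
  fi
  run_worker "$@"
}

f someparam task-1                 # runs normally
kill -USR1 "${BASHPID:-$$}"        # simulate the abort request arriving
f someparam task-2                 # refused, the worker is never started
refused_status=$?
echo "second call returned $refused_status"
```

In a real script each call would read `f ... || exit`, or f would call exit itself. Note that a ^C typed at the terminal still reaches the worker as well; only a signal sent to the script's PID alone leaves the worker completely untouched.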

Second try: ^Z with a lot of work

So one nice thing about bash is the built-in job control.

Mashing ^Z stops the current job (with SIGTSTP, the terminal’s catchable cousin of SIGSTOP) and gives you back control of the terminal…

With a bit of effort, this can work:

$ bash abortme.sh 
+ f someparam a9072833-ea84-461a-82b1-b112cf190fd3
+ ruby worker.rb someparam a9072833-ea84-461a-82b1-b112cf190fd3
Called with: ["someparam", "a9072833-ea84-461a-82b1-b112cf190fd3"]
Spinloop time: 0.1
PIDs: me: 17217 , parent: 17216
Doing important work or something.
^Z
[1]+  Stopped                 bash abortme.sh

$ jobs -l
[1]+ 17216 Stopped                 bash abortme.sh

$ pstree -ap 17216
bash,17216 abortme.sh
  └─ruby,17217 worker.rb someparam a9072833-ea84-461a-82b1-b112cf190fd3

$ kill -CONT 17217
Got CONT signal.

# Whole lotta waiting here...
$ kill -USR1 17217
Got USR1 signal, terminating.
$ All done, very important result: 42.

$ kill %1
[1]+  Terminated              bash abortme.sh

What exactly happened?

  1. ^Z sends a stop signal (SIGTSTP) to the whole foreground process group: bash and its children.
  2. jobs -l shows the jobs (including pid)
  3. pstree -ap $jobpid shows the tree of processes for given job
  4. kill -CONT $worker makes the worker continue
  5. kill -USR1 $worker is here to simulate the worker finishing (after a time)
  6. kill %1 terminates the abortme.sh

How does it do in terms of the three commandments?

  1. Current worker finishes: ✅
  2. No additional worker started: ✅
  3. Not even brief stops of the current worker: ❌

Not bad, for a built-in. Mostly great, eh?
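A variation that is not in the transcript above: instead of ^Z, which stops the whole process group, one could send SIGSTOP from another terminal to the script's PID only (17216 in the transcript). The worker is a separate process and never gets stopped. The sketch below is self-contained and uses made-up stand-ins (demo.sh, log.txt, a one-second "worker"); it assumes a non-interactive bash with the default SIGTERM disposition.

```shell
#!/bin/bash
# Sketch: freeze only the script, never the worker. File names are made up.
workdir=$(mktemp -d)
script=$workdir/demo.sh
export LOG=$workdir/log.txt

# Stand-in for abortme.sh: one slow "worker", then a line that must never run.
cat > "$script" <<'EOF'
echo "first worker started" >> "$LOG"
sh -c 'sleep 1; echo "first worker done" >> "$LOG"'
echo "second worker started" >> "$LOG"
EOF

bash "$script" &
script_pid=$!

# Demo only: wait until the first worker is definitely running.
until grep -q "first worker started" "$LOG" 2>/dev/null; do sleep 0.1; done
sleep 0.2

# Freeze the script itself. The worker keeps going, unaware.
kill -STOP "$script_pid"

# Let the worker finish at its own pace.
until grep -q "first worker done" "$LOG"; do sleep 0.1; done

# TERM stays pending while the script is stopped; CONT then wakes the script
# only for it to die before it can read its next line. Order matters here.
kill -TERM "$script_pid"
kill -CONT "$script_pid"

wait "$script_pid"
status=$?
echo "script ended with status $status"
cat "$LOG"
```

The price is a second terminal and knowing the script's PID up front. While the script is frozen it cannot reap the finished worker, so the worker lingers as a zombie until the script is gone.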

Can we do better? Yeah.

Solution

The STOP followed by furious job management obviously works well enough.

But if you have the inclination, there’s one nuance of shell execution that can make this nearly flawless[7]:

Since bash reads and executes a script file incrementally, roughly line by line, rather than loading it all up front, editing the script in place at the right spot lets you abort its execution after the current line.

What do I mean by that?

Quite simply, hexedit abortme.sh:

[Screenshot: editing abortme.sh in a hex editor to add “exit 111\n”]

One rewrites part of the next line with exit 111\n in a hex editor[8] and bash obediently aborts the execution after the current worker finishes[9]:

$ bash abortme.sh 
+ f someparam a9072833-ea84-461a-82b1-b112cf190fd3
+ ruby worker.rb someparam a9072833-ea84-461a-82b1-b112cf190fd3
Called with: ["someparam", "a9072833-ea84-461a-82b1-b112cf190fd3"]
Spinloop time: 0.1
PIDs: me: 22991 , parent: 22990
Doing important work or something.
Got USR1 signal, terminating.
All done, very important result: 42.
+ exit 111

No stopping, no process control, just a happy little non-accidental partial script overwrite[10].
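The same overwrite can be reproduced without a hex editor. The sketch below swaps in dd with conv=notrunc, which writes bytes into the existing file at a given offset instead of replacing the file. Everything in it is a stand-in chosen for the demo (demo.sh, log.txt, sleep playing the worker); it illustrates the mechanism and is not the procedure used on abortme.sh above.

```shell
#!/bin/bash
# Sketch: patch a running script in place so it exits after the current command.
# All file names and the sleep-based "workers" are made up for the demo.
workdir=$(mktemp -d)
script=$workdir/demo.sh
export LOG=$workdir/log.txt

cat > "$script" <<'EOF'
echo "first worker started" >> "$LOG"
sleep 1
echo "second worker started" >> "$LOG"
sleep 1
EOF

bash "$script" &
script_pid=$!

# Wait until the first "worker" is running.
until grep -q "first worker started" "$LOG" 2>/dev/null; do sleep 0.1; done

# Byte offset where line 3 begins = total size of lines 1 and 2.
offset=$(( $(head -n 2 "$script" | wc -c) ))

# Overwrite the start of line 3. conv=notrunc keeps the rest of the file,
# and the file itself, intact; bytes are replaced, not inserted.
printf 'exit 111\n' | dd of="$script" bs=1 seek="$offset" conv=notrunc 2>/dev/null

wait "$script_pid"
status=$?
echo "script exited with status $status"
cat "$LOG"
```

The leftover tail of the overwritten line ends up as garbage after the exit, which does not matter because bash never gets that far. Computing the offset by hand for a real script is exactly as error-prone as it sounds.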

Obviously there’s an inherent race condition (the current worker exiting at just the right moment for the next one to start… before you manage to edit the right spot in the file), so I wouldn’t call this a universal solution.

Still, it’s a weird quirk of the bash execution model one should keep in mind.

Word of warning, though: Do not edit the script using a regular high-level editor. You never know whether the editor does the whole write-to-tmp-file-then-rename[11] dance (or similar), which would – at best – lead to this trick not working, because bash keeps reading the old file it already has open.
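One way to tell the two kinds of write apart is to compare the file's inode number before and after. A minimal sketch, with a made-up helper inode_of and throwaway files: the in-place write keeps the inode, while write-to-temp-then-rename puts a brand new file under the old name.

```shell
#!/bin/bash
# Sketch: in-place overwrite vs. write-to-temp-then-rename, judged by inode.
# inode_of and the file names are illustrative.
workdir=$(mktemp -d)
target=$workdir/script.sh
printf 'echo one\necho two\n' > "$target"

inode_of() { ls -i "$1" | awk '{print $1}'; }

before=$(inode_of "$target")

# In-place overwrite: same file, so a running bash would see the change.
printf 'exit 111\n' | dd of="$target" bs=1 seek=0 conv=notrunc 2>/dev/null
after_inplace=$(inode_of "$target")

# Temp file plus rename, as many editors do: a new file takes over the name,
# while a running bash would keep reading the old, now unlinked one.
printf 'exit 111\n' > "$target.tmp"
mv "$target.tmp" "$target"
after_rename=$(inode_of "$target")

echo "before=$before in-place=$after_inplace rename=$after_rename"
```

Running the same check around a save in your editor of choice, on a scratch file, shows which camp it falls into.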

Also, if this leaves you very uneasy, then the second try above is far superior.

Closing words

I have somewhat enjoyed executing this hack. Even more than actually writing it down in article form.

As a bonus, the whole script can be twisted from sequential execution to (unbounded) parallel with a simple change:

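f() {
  # Name the log after a hash of the arguments, so each task gets its own file.
  local LOG
  LOG=$(echo "$*" | sha256sum | awk '{print $1}')
  echo "Logging [$*] to $LOG"
  # Run the worker in the background, appending stdout and stderr to the log.
  ruby worker.rb "$@" &>> "$LOG" &
}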
f someparam a9072833-ea84-461a-82b1-b112cf190fd3
f someparam db8f9e6e-146c-4ccd-bb15-9b17e38bcbba
f someparam 6e0ee5d3-c94e-43e0-9f75-d0e30b5d34ac
f someparam 4fc9bf4a-e4e5-4f12-8db1-d8950149f79e
# f repeats 50+ more times
wait

That’s half of map-reduce, badly reimplemented, for you. The retries are left as an exercise for the reader[12].

  1. Let’s set aside the parallelization for a moment; that’s a solved problem. Ditto for retries. Might come to that as a bonus (and badly reimplement half of map-reduce in sh), but it’s somewhat out of scope here.

  2. After all, it has a very important result to output.

  3. No go. Oopsie the size of a Mars crater. You get the point.

  4. Ever paused some soft real-time and/or network client for tens of seconds and had it barf all over you? Yeah, that kind of situation.

  5. I actually use this structure to run batches of stuff that needs doing more than I’d like to admit in front of strangers. Or maybe not. Now you know my dirty secret.

  6. In some places this approach might happily eat some of your MPA tokens before you managed to abort fully and you can’t have that, right?

  7. If it weren’t for the fact that it’s somewhat tedious, prone to a race condition, and gives off YOLO! vibes.

  8. hexedit is readily available on Devuan & co. And yes, I’m deadly serious, this time.

  9. Not pictured below, the kill -USR1 22991 sent from a different terminal.

  10. Yes, I’ve done this when executing some prod batch jobs. Yee-haw!

  11. Hello rsync, my old friend.

  12. Hint: grep -q ... && return 0 :)