Writing Robust Shell Scripts: Idempotency and Error Handling
Shell scripts have a reputation for being fragile: they fail silently, leave half-finished state behind, and break the moment a path contains a space. Most of that fragility is avoidable. This article covers the defensive techniques that turn a throwaway script into something you can run twice, run in CI, and trust in production.
Start with strict mode: set -euo pipefail
By default, Bash plows ahead after errors, treats unset variables as empty strings, and ignores failures in the middle of a pipeline. The conventional opening of a serious script counteracts all three:
#!/usr/bin/env bash
set -euo pipefail
Each flag does something specific:
- -e (errexit): exit immediately if any command returns a non-zero status, instead of continuing as if nothing happened. This is what stops a script from charging past a failed cd and running rm -rf in the wrong directory.
- -u (nounset): treat references to unset variables as an error rather than expanding them to the empty string. A typo like $HOEM becomes a loud failure instead of a silent empty path.
- -o pipefail: make a pipeline return the exit status of the last command that failed, not just the last command. Without it, grep pattern file | sort reports success even when grep found nothing or the file was missing, because sort happily succeeded.
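The effect of pipefail is easy to see in isolation. The following is a minimal sketch that uses false | true as a stand-in for a real pipeline whose first stage fails; the variable names are illustrative only, and errexit is deliberately left off so the second pipeline does not end the demo.

```shell
#!/usr/bin/env bash
# Compare the exit status of the same pipeline with and without pipefail.
# "false | true" stands in for a pipeline whose first stage fails.

set +o pipefail
false | true
without_pipefail=$?   # status of the last command only

set -o pipefail
false | true
with_pipefail=$?      # status of the rightmost command that failed

echo "without pipefail: $without_pipefail"
echo "with pipefail:    $with_pipefail"
```

Run as is, the first line reports 0 and the second reports 1.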
Strict mode is a sane default, not a silver bullet. -e has surprising edge cases: it does not fire for commands in an if or while condition, for any command in an && / || chain except the last, or for anything inside a function that is itself called from such a tested context. When you genuinely expect a command to fail, handle it explicitly rather than fighting the flag:
# Don't let an expected non-zero exit kill the script
if ! grep -q "ready" status.txt; then
  echo "not ready yet"
fi
# Capture an exit code without tripping errexit
set +e
some_flaky_command
rc=$?
set -e
[[ $rc -eq 0 ]] || echo "command failed with $rc"
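The function edge case of -e is the one most likely to surprise. As a rough sketch (step is a hypothetical function name), the script below calls a function from an if condition; errexit is suspended for the whole function body, so the failing command inside it does not stop anything:

```shell
#!/usr/bin/env bash
set -euo pipefail

# step is a hypothetical function used only for this demonstration.
step() {
  false                  # would abort the script if step were called plainly
  echo "still running"   # reached, because errexit is off inside a condition
}

# Calling step as an if condition disables errexit for its entire body,
# so the function's status is that of its last command (the echo).
if step; then
  result="step reported success"
else
  result="step reported failure"
fi

echo "$result"
```

The practical consequence is that a function relying on -e to stop at its first failure should not be called from if, &&, or ||; have it return explicitly on failure instead.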
With -u enabled, reference variables that may legitimately be unset using a default: "${VAR:-}" expands to empty without error, and "${VAR:?must be set}" aborts with a clear message.
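A short sketch of both expansions under -u follows. APP_ENV and APP_TOKEN are hypothetical variable names, and the :? check runs in a subshell only so that the demonstration can continue after it fails:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical variables, explicitly unset for the demonstration.
unset APP_ENV APP_TOKEN

# ${VAR:-default} supplies a fallback instead of tripping nounset.
env_name="${APP_ENV:-development}"

# ${VAR:?message} aborts the shell that evaluates it. Running it in a
# subshell lets this demo observe the failure without exiting.
token_state="set"
if ! ( : "${APP_TOKEN:?must be set}" ) 2>/dev/null; then
  token_state="missing"
fi

echo "env=$env_name token=$token_state"
```

In a real script the :? expansion would normally sit at the top level, where aborting with the message is exactly the desired outcome.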
Clean up with trap
Scripts that create temp files, lock files, or background processes need to clean them up even when they fail partway through. A trap on the EXIT pseudo-signal runs however the script ends: normal exit, error under -e, or, in Bash, termination by a catchable signal such as INT or TERM. The one case it cannot cover is SIGKILL, which no process can trap.
#!/usr/bin/env bash
set -euo pipefail
workdir="$(mktemp -d)"
cleanup() {
  rm -rf "$workdir"
}
trap cleanup EXIT
# ... do work in "$workdir"; it is removed on any exit
A single EXIT trap is usually enough, because Bash also runs the EXIT trap when the script is terminated by INT, TERM, and other catchable signals (this is Bash behaviour and is not guaranteed by POSIX sh). If you need different behaviour for interruption versus normal completion, trap signals separately. Keep cleanup handlers idempotent and defensive: they may run when setup only partially completed, so guard against missing variables and use rm -f rather than plain rm.
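One possible shape for a defensive handler with a separate interrupt trap is sketched below. run_job is a hypothetical wrapper, written with a subshell body purely so the demonstration can check afterwards that the directory is gone; exit code 130 for interruption is a common convention, not a requirement.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_job is a hypothetical wrapper. Its body is a subshell (note the
# parentheses), so the EXIT trap fires as soon as the job finishes.
run_job() (
  workdir=""

  cleanup() {
    # Defensive: workdir is still empty if mktemp never ran.
    if [[ -n "${workdir:-}" ]]; then
      rm -rf -- "$workdir"
    fi
  }

  trap cleanup EXIT
  # Interruption gets its own handler; exiting from it still runs cleanup.
  trap 'echo "interrupted" >&2; exit 130' INT

  workdir="$(mktemp -d)"
  echo "$workdir"   # report the path so the caller can verify removal
)

created="$(run_job)"
if [[ -e "$created" ]]; then
  echo "workdir still present"
else
  echo "workdir removed on exit"
fi
```

Initialising workdir to an empty string before registering the trap means cleanup is safe even if the script dies before mktemp succeeds.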
Quoting and word-splitting
The single most common source of shell bugs is unquoted expansion. When you write $var without quotes, Bash splits the value into words on the characters in IFS (whitespace by default) and then performs glob expansion on each word. A filename like my report.txt becomes two arguments; a value containing * expands against the current directory.
# Wrong: breaks on spaces, expands globs
cp $src $dst
# Right: each variable is a single, literal argument
cp "$src" "$dst"
Rules of thumb that prevent the majority of quoting bugs:
- Quote every variable and command substitution by default: "$var", "$(cmd)". Only drop the quotes when you intend word-splitting.
- Use "$@" (quoted) to pass through all positional arguments preserving boundaries. Unquoted $* and $@ are re-split and globbed, and "$*" joins everything into a single string.
- When iterating, prefer arrays over space-separated strings: for f in "${files[@]}"; do ...; done.
- Run ShellCheck over your scripts. It flags virtually every unquoted-expansion hazard automatically and is the cheapest reliability win available.
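The difference between these forms shows up as soon as you count arguments. The sketch below uses a hypothetical helper, count_args, and two made-up filenames; the unquoted $* is intentional here, and ShellCheck will rightly flag it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$#"; }

# Two positional arguments, one of which contains a space.
set -- "my report.txt" "notes.txt"

quoted_at="$(count_args "$@")"     # boundaries preserved
unquoted_star="$(count_args $*)"   # re-split on whitespace (deliberately wrong)
quoted_star="$(count_args "$*")"   # joined into one string

echo "\"\$@\": $quoted_at  \$*: $unquoted_star  \"\$*\": $quoted_star"

# Arrays keep each element intact when expanded as "${files[@]}".
files=("my report.txt" "notes.txt")
for f in "${files[@]}"; do
  printf '<%s>\n' "$f"
done
```

Only the quoted "$@" form reports the two arguments that were actually passed; the unquoted form sees three and "$*" sees one.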
Making operations idempotent
An idempotent script produces the same end state whether it runs once or ten times. This matters because real scripts get re-run after partial failures, in retry loops, and during convergence-style provisioning. The pattern is always the same: check the desired state, then act only if needed.
Many standard tools have idempotent flags built in. Reach for them before writing your own checks:
# Creates the directory, succeeds silently if it already exists
mkdir -p /opt/app/config
# Create an empty file or update its timestamp; never errors if present
touch /var/run/app.lock
# Symlink that replaces an existing link rather than failing
ln -sfn /opt/app/releases/current /opt/app/live
For appending configuration, the naive echo "line" >> file is not idempotent: run it twice and you get duplicate lines. Guard the append with a check:
line="export PATH=/opt/app/bin:\$PATH"
file="$HOME/.bashrc"
# Append only if an exact-match line is not already present
grep -qxF "$line" "$file" || printf '%s\n' "$line" >> "$file"
The -q silences output, -x requires a whole-line match, and -F treats the pattern as a fixed string so regex metacharacters in the line are not interpreted. For more complex desired-state logic, check before acting:
# Only create the user if it doesn't exist
if ! id -u appuser >/dev/null 2>&1; then
  useradd --system appuser
fi
# Only download if the artifact is missing
[[ -f "$artifact" ]] || curl -fsSL "$url" -o "$artifact"
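A simple way to gain confidence in a check-before-act step is to run it twice and compare the result. The sketch below wraps the guarded append from earlier in a hypothetical helper, ensure_line, and exercises it against a temporary file rather than a real ~/.bashrc:

```shell
#!/usr/bin/env bash
set -euo pipefail

# ensure_line is a hypothetical helper: append a line only if an identical
# line is not already in the file.
ensure_line() {
  local line="$1" file="$2"
  touch "$file"   # make sure grep has a file to read
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

demo_file="$(mktemp)"

# Run the same operation twice; the second call should change nothing.
ensure_line 'export PATH=/opt/app/bin:$PATH' "$demo_file"
ensure_line 'export PATH=/opt/app/bin:$PATH' "$demo_file"

line_count="$(wc -l < "$demo_file" | tr -d ' ')"
echo "lines after two runs: $line_count"

rm -f "$demo_file"
```

The -- before the pattern is an extra precaution so that a line beginning with a dash is not mistaken for a grep option.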
Exit codes
Exit codes are how scripts communicate success and failure to whatever called them. By convention 0 means success and any non-zero value (1-255) means failure. Always exit non-zero on failure so callers, CI pipelines, and && chains can detect it.
if ! validate_config; then
  echo "config validation failed" >&2
  exit 1
fi
A few details worth knowing:
- Send diagnostic messages to standard error with >&2 so they don't pollute data on stdout that another program may be parsing.
- $? holds the exit status of the most recent command. Capture it into a variable immediately if you need it, because the next command overwrites it.
- Reserve distinct codes for distinct failure modes when callers need to branch (for example, 2 for bad usage, 3 for a missing dependency). Avoid 126 and 127, which the shell uses for "found but not executable" and "command not found", and the range above 128, which it uses to report "killed by signal N" as 128+N (a process killed by SIGTERM/signal 15 exits with 143).
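Put together, these points might look like the following sketch. The code values, the E_ constant names, and the require_cmd helper are all hypothetical, and the example assumes no program called no-such-command-for-demo is installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical exit codes for this sketch.
readonly E_USAGE=2
readonly E_MISSING_DEP=3

# require_cmd is a hypothetical helper: succeed only if the named command
# is available, and report problems on stderr with a distinct status.
require_cmd() {
  if [[ $# -ne 1 ]]; then
    echo "usage: require_cmd <command>" >&2
    return "$E_USAGE"
  fi
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing dependency: $1" >&2
    return "$E_MISSING_DEP"
  fi
}

# Capture the status immediately; "|| rc=$?" also keeps errexit from firing.
rc=0
require_cmd no-such-command-for-demo 2>/dev/null || rc=$?
echo "exit status: $rc"
```

A caller can then branch on the status with a case statement instead of parsing error text.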
Pitfalls of pipelines and command substitution
Even with pipefail, pipelines have traps. The most subtle is that each stage of a pipeline runs in a subshell, so variables assigned inside the loop body don't survive:
# BROKEN: count is modified in a subshell and lost
count=0
find . -name '*.log' | while IFS= read -r f; do
  count=$((count + 1))
done
echo "$count" # prints 0
# FIX: avoid the pipe with process substitution
count=0
while IFS= read -r f; do
  count=$((count + 1))
done < <(find . -name '*.log')
echo "$count" # correct
When reading lines, use while IFS= read -r line. The IFS= prevents leading and trailing whitespace from being stripped, and -r stops backslashes from being interpreted as escapes. For filenames specifically, drive the loop from find -print0 with read -d '' to survive newlines in names.
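The NUL-delimited variant can be sketched as follows. The temporary directory and the three filenames are invented for the demonstration, and one of them deliberately contains a newline:

```shell
#!/usr/bin/env bash
set -euo pipefail

demo_dir="$(mktemp -d)"

# Three awkward names: plain, with a space, and with an embedded newline.
touch "$demo_dir/plain.log" \
      "$demo_dir/with space.log" \
      "$demo_dir/"$'new\nline.log'

count=0
# -print0 ends each name with a NUL byte; read -d '' splits on NUL, so
# spaces and newlines inside names are preserved.
while IFS= read -r -d '' f; do
  count=$((count + 1))
done < <(find "$demo_dir" -name '*.log' -print0)

echo "found $count log files"

rm -rf -- "$demo_dir"
```

A line-based read over the same directory would count four entries, because the name containing a newline would be seen as two.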
Command substitution has its own gotchas. $(...) strips all trailing newlines, which is usually convenient but bites you when trailing newlines are significant. More importantly, an unquoted command substitution used as a command argument or in a for list is word-split and glob-expanded like any other expansion:
# Unquoted: output is split on whitespace and globbed
for f in $(ls); do echo "$f"; done # fragile, and don't parse ls anyway
# Quoted: the result is preserved as a single value
# (quotes are optional on the right-hand side of a plain assignment, but harmless)
config="$(cat "$file")"
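Also remember that under set -e, a failing command substitution can be masked. A plain assignment such as x=$(cmd) does take the substitution's exit status and aborts, but local x=$(cmd), export x=$(cmd), and similar declarations return the status of the builtin, which is 0. A substitution passed as an argument, as in echo "$(cmd)", is likewise hidden behind the outer command's status. If the inner command matters, declare the variable first and assign it on a separate line, or capture the output and test it explicitly.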
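One common way a failing substitution gets masked is a declaration builtin such as local on the same line. The sketch below runs two small scripts in child Bash processes (so an abort in one does not end the demonstration) and compares them; the function name f and the marker text are arbitrary.

```shell
#!/usr/bin/env bash
# Each case runs in a child bash so that an abort there does not end the demo.

# Case 1: declaration and substitution on one line. The exit status of
# local (0) replaces the status of the failing substitution.
combined="$(bash -c '
  set -e
  f() {
    local v="$(false)"
    echo "reached"
  }
  f
' || true)"

# Case 2: declare first, assign separately. The assignment now carries the
# status of the substitution, so errexit stops the function.
separate="$(bash -c '
  set -e
  f() {
    local v
    v="$(false)"
    echo "reached"
  }
  f
' || true)"

echo "combined: ${combined:-<aborted>}"
echo "separate: ${separate:-<aborted>}"
```

The first case prints the marker even though the substitution failed; the second aborts before reaching it. ShellCheck reports the first pattern as SC2155.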
A consolidated template
#!/usr/bin/env bash
set -euo pipefail
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
main() {
  local target="${1:?usage: deploy.sh <target>}"
  mkdir -p "$workdir/build"
  # ... idempotent, well-quoted work here ...
  echo "deployed to $target"
}
main "$@"
Takeaway
Robust shell scripting comes down to a short checklist: start with set -euo pipefail and understand its edge cases, register a trap ... EXIT for cleanup, quote every expansion, make each operation check-before-act or use idempotent flags like mkdir -p and grep -qxF, return meaningful exit codes, and watch for subshell and word-splitting surprises in pipelines and command substitution. Run ShellCheck on everything. None of these techniques is advanced, but applied consistently they are the difference between a script that works once on your machine and one you can rerun anywhere without fear.