Adding retry logic to your shell scripts is a robust strategy for handling transient failures. Instead of terminating on the first error, a script can automatically reattempt a command a specified number of times, or until it succeeds, riding out temporary network issues, unavailable services, or resource contention.
Why Implement Retry Logic?
Integrating retry mechanisms into shell scripts is crucial for several reasons:
- Handling Transient Errors: Many failures are temporary (e.g., network glitches, resource lockouts, busy services). Retrying can overcome these without manual intervention.
- Increased Reliability: Scripts become more fault-tolerant, especially in dynamic or distributed environments.
- Reduced Manual Intervention: Automating retries minimizes the need for human oversight and re-execution of failed tasks.
- Graceful Degradation: Allows systems to recover from minor disruptions without complete outage.
Basic Retry Loop: Infinite Attempts
The simplest form of retry logic involves an infinite loop that continuously reattempts a command until it succeeds. This approach is suitable for critical operations that must eventually succeed, provided there's an expectation that the condition will eventually be met.
The core idea is to execute a command and check its exit status. In shell scripting, the special variable $? holds the exit status of the most recently executed foreground command. An exit status of 0 typically indicates success, while any non-zero value indicates failure.
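A minimal demonstration of this status check (the path used for the failing case is just an illustrative nonexistent name):

```shell
#!/bin/bash
# Run a command and inspect $? immediately afterwards; any later command
# (even echo) overwrites it, so capture it right away if you need it.
ls /tmp >/dev/null 2>&1
status=$?
echo "exit status of 'ls /tmp': $status"   # 0 when /tmp exists

# A failing command yields a non-zero status. The 'if ! ...' form keeps
# this safe even when the script runs under 'set -e'.
if ! ls /nonexistent_path_12345 >/dev/null 2>&1; then
    echo "'ls' on a missing path failed with a non-zero status"
fi
```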
Here's an example demonstrating an infinite retry loop, similar to checking for a file's existence:
#!/bin/bash
echo "Beginning infinite retry attempt..."
# This loop will continuously attempt to find '/tmp/my_retry_file'.
# 'ls /tmp/my_retry_file >/dev/null 2>&1' attempts to list the file,
# redirecting all output to /dev/null to keep the console clean.
# The 'while' loop continues as long as 'ls' fails (exit status is not 0),
# which is explicitly checked by '[[ $? -ne 0 ]]'.
while ls /tmp/my_retry_file >/dev/null 2>&1; [[ $? -ne 0 ]]; do
    echo "Result unsuccessful. File not found. Retrying in 1 second..."
    sleep 1 # Wait for 1 second before the next attempt
done
echo "Result successful: /tmp/my_retry_file found."
In this script:
- ls /tmp/my_retry_file >/dev/null 2>&1: attempts to list /tmp/my_retry_file; >/dev/null 2>&1 ensures that neither standard output nor standard error messages from ls clutter the console.
- [[ $? -ne 0 ]]: a test command that checks whether the exit status ($?) of the preceding ls command is not equal to zero. If ls failed, $? is non-zero, the condition is true, and the while loop continues. If ls succeeds, $? is 0, the condition is false, and the loop terminates.
- sleep 1: pauses the script for 1 second between attempts, preventing the script from aggressively retrying and potentially overwhelming the system or target service.
For more information on Bash scripting fundamentals, refer to the Bash reference manual.
Finite Retry Loop: Limiting Attempts
An infinite retry loop might not always be desirable, as some issues are persistent and won't resolve themselves. A more common and practical approach is to limit the number of retry attempts. This prevents scripts from getting stuck indefinitely and allows for ultimate failure reporting.
To implement finite retries, you introduce a counter that increments with each attempt and a maximum attempt limit.
#!/bin/bash
MAX_ATTEMPTS=5
ATTEMPT=1
RETRY_DELAY=2 # seconds
echo "Beginning finite retry attempt (max $MAX_ATTEMPTS attempts)..."
# This loop retries finding '/tmp/my_retry_file' up to MAX_ATTEMPTS times.
# The attempt counter is checked first, so the command runs at most
# MAX_ATTEMPTS times; '! ls' keeps the loop going while the command fails.
while [[ $ATTEMPT -le $MAX_ATTEMPTS ]] && ! ls /tmp/my_retry_file >/dev/null 2>&1; do
    echo "Attempt $ATTEMPT/$MAX_ATTEMPTS: File not found. Retrying in $RETRY_DELAY seconds..."
    sleep "$RETRY_DELAY"
    ATTEMPT=$((ATTEMPT + 1))
done
# Check why the loop exited: on success the counter is still within the
# limit; if every attempt failed, it has been incremented past MAX_ATTEMPTS.
if [[ $ATTEMPT -le $MAX_ATTEMPTS ]]; then
    echo "Result successful: /tmp/my_retry_file found on attempt $ATTEMPT."
else
    echo "Failed after $MAX_ATTEMPTS attempts: /tmp/my_retry_file not found."
    exit 1 # Indicate script failure
fi
In this enhanced script:
- MAX_ATTEMPTS and ATTEMPT: variables that control the maximum number of retries and track the current attempt number.
- [[ $ATTEMPT -le $MAX_ATTEMPTS ]] && ! ls ...: the loop continues only while the attempt count is within the limit AND the command still fails. Checking the counter first guarantees the command never runs more than MAX_ATTEMPTS times.
- ATTEMPT=$((ATTEMPT + 1)): increments the ATTEMPT counter after each failed try.
- Post-loop if/else: if the loop exited because ls succeeded, ATTEMPT is still at most MAX_ATTEMPTS; otherwise it has passed the limit, so the script reports failure and exits with a non-zero status.
Advanced Retry Strategies
For more sophisticated scenarios, simple fixed delays and attempt limits can be enhanced with advanced strategies.
1. Exponential Backoff
Exponential backoff is a strategy where the delay between retries increases exponentially. This is highly effective in reducing congestion for busy services, as it gives the service more time to recover.
#!/bin/bash
MAX_ATTEMPTS=5
BASE_DELAY=1 # seconds
ATTEMPT=1
# Example unreliable_command function (replace with your actual command).
# It must be defined before the retry loop below calls it.
unreliable_command() {
    # Simulate a command that fails 3 times, then succeeds
    if [[ $ATTEMPT -le 3 ]]; then
        echo "Simulating failure..."
        return 1 # Fail
    else
        echo "Simulating success..."
        return 0 # Succeed
    fi
}
echo "Beginning retry with exponential backoff..."
while [[ $ATTEMPT -le $MAX_ATTEMPTS ]] && ! unreliable_command; do
    DELAY=$((BASE_DELAY * (2 ** (ATTEMPT - 1)))) # Calculate exponential delay: 1, 2, 4, 8, 16...
    echo "Attempt $ATTEMPT/$MAX_ATTEMPTS: Command failed. Retrying in $DELAY seconds..."
    sleep "$DELAY"
    ATTEMPT=$((ATTEMPT + 1))
done
if [[ $ATTEMPT -le $MAX_ATTEMPTS ]]; then
    echo "Command succeeded on attempt $ATTEMPT."
else
    echo "Command failed after $MAX_ATTEMPTS attempts."
    exit 1
fi
2. Custom Delay and Max Attempts
For fine-grained control, you can pass retry parameters as function arguments or environment variables.
#!/bin/bash
# Function to run a command with retry logic
run_with_retry() {
    local CMD="$1"
    local MAX_TRIES="${2:-3}" # Default to 3 attempts
    local DELAY="${3:-5}"     # Default to 5-second delay
    local CURRENT_TRY=1
    echo "Executing: '$CMD' with max $MAX_TRIES tries and ${DELAY}s delay..."
    # 'eval' runs a string as a command; use it only with trusted input,
    # since it re-parses (and will execute) anything in the string.
    while ! eval "$CMD"; do
        if [[ $CURRENT_TRY -ge $MAX_TRIES ]]; then
            echo "Error: Command '$CMD' failed after $MAX_TRIES attempts."
            return 1 # Indicate ultimate failure
        fi
        echo "Command failed (attempt $CURRENT_TRY/$MAX_TRIES). Retrying in $DELAY seconds..."
        sleep "$DELAY"
        CURRENT_TRY=$((CURRENT_TRY + 1))
    done
    echo "Command '$CMD' succeeded."
    return 0 # Indicate success
}
# Example usage:
run_with_retry "ping -c 1 example.com" 10 3 # Try pinging example.com 10 times, 3s delay
run_with_retry "ls /nonexistent_file" 2 1 # This will fail after 2 attempts
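Because eval re-parses its string argument and can misfire on quoting or untrusted input, a common variant passes the command and its arguments directly and runs them via "$@". A sketch of that approach (the function name retry and its argument order are illustrative, not a standard API):

```shell
#!/bin/bash
# Retry wrapper that takes the command as separate arguments instead of a
# single string, avoiding eval's quoting pitfalls.
# Usage: retry <max_tries> <delay_seconds> <command> [args...]
retry() {
    local max_tries="$1" delay="$2"
    shift 2
    local try=1
    until "$@"; do                       # run the command with its own argv
        if [[ $try -ge $max_tries ]]; then
            echo "Error: '$*' failed after $max_tries attempts." >&2
            return 1
        fi
        echo "Attempt $try/$max_tries failed. Retrying in ${delay}s..." >&2
        sleep "$delay"
        try=$((try + 1))
    done
    return 0
}

# Example: succeeds on the first try, so no retries occur.
retry 3 1 test -d /tmp && echo "/tmp exists"
```

Arguments containing spaces survive intact with this form, which eval-based versions often mangle.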
3. Retry with Timeout
Instead of just counting attempts, you can retry for a specific duration. This is useful when you want to wait for an external service to become available within a given timeframe.
#!/bin/bash
TIMEOUT_SECONDS=60
START_TIME=$(date +%s)
RETRY_DELAY=5
# Example your_critical_command function (replace with your actual command).
# It must be defined before the loop below calls it.
your_critical_command() {
    # Simulate success after 30 seconds
    if [[ $(( $(date +%s) - START_TIME )) -ge 30 ]]; then
        echo "Simulating success for critical command..."
        return 0
    else
        echo "Simulating failure for critical command..."
        return 1
    fi
}
echo "Retrying command for a maximum of $TIMEOUT_SECONDS seconds..."
while true; do
    if your_critical_command; then
        echo "Command succeeded."
        break
    fi
    CURRENT_TIME=$(date +%s)
    ELAPSED_TIME=$((CURRENT_TIME - START_TIME))
    if [[ $ELAPSED_TIME -ge $TIMEOUT_SECONDS ]]; then
        echo "Error: Command failed to succeed within $TIMEOUT_SECONDS seconds."
        exit 1
    fi
    echo "Command failed. Retrying in $RETRY_DELAY seconds (elapsed: ${ELAPSED_TIME}s / ${TIMEOUT_SECONDS}s)."
    sleep "$RETRY_DELAY"
done
Practical Examples
1. HTTP Request Retries (using curl)
When making web requests, retries are essential for dealing with network flakiness or server-side issues. curl has built-in retry options.
#!/bin/bash
URL="https://example.com/api/data"
echo "Attempting to fetch data from $URL with retries..."
# -f/--fail: Fail silently (don't output HTML on HTTP errors)
# --retry <num>: Retry up to <num> times
# --retry-delay <seconds>: Wait <seconds> between retries
# --retry-max-time <seconds>: Total time for retries
if curl -f --retry 5 --retry-delay 2 --retry-max-time 30 "$URL" -o data.json; then
    echo "Data fetched successfully and saved to data.json"
else
    echo "Failed to fetch data from $URL after multiple attempts."
    exit 1
fi
For more curl options, consult the curl man page.
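Note that --retry only fires for a limited set of transient failures (timeouts and certain HTTP status codes). If you need to retry on other statuses, one option is to capture the status code yourself with -w '%{http_code}' and loop manually. A sketch under that assumption (the URL, file name, and function names are placeholders):

```shell
#!/bin/bash
URL="https://example.com/api/data"   # placeholder endpoint
MAX_TRIES=5
DELAY=2

# Return 0 only for 2xx status codes; anything else (including curl's
# "000" when no response was received) counts as a failure worth retrying.
is_success() {
    [[ $1 -ge 200 && $1 -lt 300 ]]
}

fetch_with_status_retry() {
    local i status
    for i in $(seq 1 "$MAX_TRIES"); do
        # -s silences progress output, -o saves the body,
        # -w prints only the HTTP status code to stdout.
        status=$(curl -s -o data.json -w '%{http_code}' "$URL")
        if is_success "$status"; then
            echo "Success (HTTP $status) on attempt $i."
            return 0
        fi
        echo "Attempt $i/$MAX_TRIES returned HTTP $status. Retrying in ${DELAY}s..."
        sleep "$DELAY"
    done
    echo "Giving up after $MAX_TRIES attempts."
    return 1
}

# Example usage:
# fetch_with_status_retry || exit 1
```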
2. Database Connection Retries
Connecting to a database often requires retries, especially during application startup or after a database restart.
#!/bin/bash
DB_HOST="localhost"
DB_PORT="5432" # Example for PostgreSQL
MAX_TRIES=10
DELAY=3
echo "Waiting for database on $DB_HOST:$DB_PORT to become available..."
for i in $(seq 1 $MAX_TRIES); do
    # nc -z: zero-I/O mode, just scan for listening daemons
    # -w 1: timeout for connection in 1 second
    if nc -z -w 1 "$DB_HOST" "$DB_PORT"; then
        echo "Database is available."
        exit 0
    fi
    echo "Attempt $i/$MAX_TRIES: Database not yet available. Retrying in $DELAY seconds..."
    sleep "$DELAY"
done
echo "Error: Database on $DB_HOST:$DB_PORT did not become available after $MAX_TRIES attempts."
exit 1
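The nc check only confirms that the port accepts TCP connections; a database can accept connections at the socket level while still rejecting queries during startup. If the PostgreSQL client tools are installed, pg_isready performs a protocol-level readiness check instead. A sketch (the wait_for_postgres wrapper name is illustrative):

```shell
#!/bin/bash
DB_HOST="localhost"
DB_PORT="5432"
MAX_TRIES=10
DELAY=3

wait_for_postgres() {
    local i
    for i in $(seq 1 "$MAX_TRIES"); do
        # pg_isready exits 0 when the server is accepting connections,
        # 1 when it is rejecting them (e.g. during startup), 2 on no response.
        if pg_isready -h "$DB_HOST" -p "$DB_PORT" -t 1 >/dev/null 2>&1; then
            echo "PostgreSQL is accepting connections."
            return 0
        fi
        echo "Attempt $i/$MAX_TRIES: server not ready. Retrying in ${DELAY}s..."
        sleep "$DELAY"
    done
    echo "Error: PostgreSQL did not become ready after $MAX_TRIES attempts."
    return 1
}

# Example usage:
# wait_for_postgres || exit 1
```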
Key Parameters for Retry Logic
| Parameter | Description | Common Values |
|---|---|---|
| Max Attempts | The maximum number of times to reattempt an operation. | 3, 5, 10 |
| Retry Delay | The fixed time to wait between retry attempts. | 1s, 5s, 10s |
| Backoff Factor | Multiplier for increasing delay in exponential backoff. | 2 (doubles delay) |
| Max Delay | Upper limit for the delay between retries (with backoff). | 60s, 300s |
| Timeout | Total time duration allowed for all retry attempts to complete. | 30s, 120s, 300s |
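These parameters can be combined in a single helper. The sketch below (the names are illustrative, not a standard API) applies exponential backoff with a factor of 2, caps the per-attempt delay at MAX_DELAY, and stops when either the attempt limit or the overall timeout is reached:

```shell
#!/bin/bash
# Illustrative helper combining the parameters from the table above.
# Usage: retry_with_backoff <command> [args...]
MAX_ATTEMPTS=5
BASE_DELAY=1      # initial delay in seconds
BACKOFF_FACTOR=2  # delay multiplier per attempt
MAX_DELAY=30      # cap on the per-attempt delay
TIMEOUT=120       # overall time budget in seconds

retry_with_backoff() {
    local attempt=1 delay=$BASE_DELAY start elapsed
    start=$(date +%s)
    until "$@"; do
        elapsed=$(( $(date +%s) - start ))
        if [[ $attempt -ge $MAX_ATTEMPTS || $elapsed -ge $TIMEOUT ]]; then
            echo "Giving up after $attempt attempts (${elapsed}s elapsed)." >&2
            return 1
        fi
        echo "Attempt $attempt failed. Retrying in ${delay}s..." >&2
        sleep "$delay"
        delay=$((delay * BACKOFF_FACTOR))
        if [[ $delay -gt $MAX_DELAY ]]; then
            delay=$MAX_DELAY   # cap the backoff
        fi
        attempt=$((attempt + 1))
    done
    return 0
}

# Example: succeeds immediately, so no retries occur.
retry_with_backoff true && echo "done"
```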
By carefully combining these strategies, you can make your shell scripts exceptionally resilient to transient failures, leading to more stable and reliable automated processes.