Beyond the Crontab
Setting up a cron job is easy. Keeping it running reliably in production for months and years is hard. Many cron failures trace back to skipping one of the practices below.
1. Use Lock Files to Prevent Overlap
If a job takes longer than its interval, cron starts the next run while the previous one is still going. Two overlapping instances can cause data corruption, duplicate processing, and race conditions.
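#!/bin/bash
LOCKFILE=/tmp/my-job.lock
# With noclobber, the redirect fails if the file already exists,
# so checking for the lock and creating it happen in one atomic step
if ! ( set -o noclobber; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
    echo "Job already running, skipping"
    exit 0
fi
# Single quotes: expand $LOCKFILE when the trap fires, and keep it quoted
trap 'rm -f "$LOCKFILE"' EXIT
# Your actual work here
process_data.sh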
Or use flock, which is more robust: the kernel releases the lock when the process exits, so a crashed job cannot leave a stale lock behind:
# In crontab - flock ensures only one instance runs
*/5 * * * * /usr/bin/flock -n /tmp/my-job.lock /path/to/my-job.sh
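Where flock is not available, a lock left behind by a killed job (for example after kill -9 or a power loss) blocks every later run until someone deletes it. The following is a minimal sketch of one way to detect that case, assuming the job always runs as the same user. The names acquire_lock and release_lock and the lock directory layout are hypothetical, not part of any standard tool.

```shell
#!/bin/bash
# Hypothetical helpers: acquire_lock / release_lock are illustrative names.
# Usage: acquire_lock /path/to/lockdir   (returns 0 if the lock was obtained)
acquire_lock() {
    local lockdir=$1 pid
    # mkdir is atomic: only one caller can create the directory
    if mkdir "$lockdir" 2>/dev/null; then
        echo "$$" > "$lockdir/pid"
        return 0
    fi
    # The lock exists: check whether the recorded owner is still alive
    pid=$(cat "$lockdir/pid" 2>/dev/null || true)
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return 1    # owner is still running
    fi
    # Owner is gone: treat the lock as stale and try once to take it over
    rm -rf "$lockdir"
    mkdir "$lockdir" 2>/dev/null || return 1
    echo "$$" > "$lockdir/pid"
}

release_lock() {
    rm -rf "$1"
}
```

A job would call acquire_lock /tmp/my-job.lock.d || exit 0 at the top and release_lock from an EXIT trap. The takeover step still has a small race window (two runs can both decide the lock is stale, and a lock whose PID file has not been written yet looks stale), so flock remains the safer choice where it exists.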
2. Always Use set -euo pipefail
Without this, bash scripts keep executing after errors. After a failed database connection, the rest of the script still runs, with stale or missing data.
#!/bin/bash
set -euo pipefail
# -e: exit when a command fails
# -u: treat unset variables as errors
# -o pipefail: a pipeline fails if any command in it fails
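To make the effect of each option concrete, here is a small demonstration sketch. Each check runs in a child bash so that its failure does not stop the demo itself. UNSET_EXAMPLE_VAR is a hypothetical variable name assumed not to be set in your environment.

```shell
#!/bin/bash
# -e: a failing command ends the script, so "after" is never printed
out_e=$(bash -c 'set -e; false; echo after' || true)

# -u: expanding an unset variable is an error instead of an empty string
# (UNSET_EXAMPLE_VAR is a hypothetical name, assumed to be unset)
bash -c 'set -u; echo "$UNSET_EXAMPLE_VAR"' 2>/dev/null && status_u=0 || status_u=$?

# pipefail: the pipeline fails if any stage fails, not only the last one
bash -c 'false | cat' && status_plain=0 || status_plain=$?
bash -c 'set -o pipefail; false | cat' && status_pipefail=0 || status_pipefail=$?

echo "-e output: '${out_e}'"
echo "-u exit status: ${status_u}"
echo "pipeline status without pipefail: ${status_plain}, with pipefail: ${status_pipefail}"
```

One caveat: -e does not trigger for commands tested by if, while, or until, or for commands on the left side of && and ||, so a check such as if ! some_command; then ... still has to handle the failure explicitly.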
3. Log Everything with Timestamps
When debugging a 3 AM failure at 9 AM, timestamps in logs are the difference between finding the cause in minutes versus hours.
#!/bin/bash
LOG=/var/log/my-job.log
log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"
}
log "Starting backup..."
pg_dump mydb > backup.sql
log "Backup completed: $(du -h backup.sql | cut -f1)"
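A log that shows when a job started but not how it ended is only half useful. As a sketch, an EXIT trap can record the exit status and run time on every exit path, including early exits caused by errors. The log path and the on_exit name below are assumptions chosen for illustration.

```shell
#!/bin/bash
# Assumed log location for this example; override by setting LOG
LOG=${LOG:-/tmp/my-job-example.log}

log() {
    printf '[%s] %s\n' "$(date +'%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG"
}

# Hypothetical handler: runs on every exit path and records how the job ended
on_exit() {
    local status=$?
    # SECONDS is a bash built-in: seconds since the script started
    log "Finished with exit status ${status} after ${SECONDS}s"
}
trap on_exit EXIT

log "Starting job"
# Your actual work here
```

With this in place, a run that dies at 3 AM leaves a final line with a non-zero status and a timestamp, which narrows down where to look.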
4. Set Explicit Timeouts
A job that hangs indefinitely is worse than one that fails. At least a failed job can be detected. A hung job just sits there, holding resources and blocking the next run.
# Kill the job if it runs longer than 30 minutes
timeout 1800 /path/to/my-job.sh
# For curl operations
curl --max-time 30 --connect-timeout 10 https://api.example.com/data
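It also helps to tell a timeout apart from an ordinary failure. GNU timeout exits with status 124 when the time limit was hit, and its -k option sends SIGKILL if the command ignores the initial SIGTERM. A minimal sketch, assuming GNU coreutils timeout is installed; run_with_timeout is a hypothetical wrapper name:

```shell
#!/bin/bash
# Hypothetical wrapper around GNU timeout.
# Usage: run_with_timeout SECONDS command [args...]
run_with_timeout() {
    local limit=$1
    shift
    local status=0
    # -k 10: if the command is still alive 10 seconds after SIGTERM, send SIGKILL
    timeout -k 10 "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "Timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

For example, run_with_timeout 1800 /path/to/my-job.sh behaves like the plain timeout call above but leaves a clear message in the job's error output when the limit is reached.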
5. Handle Temporary Failures with Retries
Network glitches, API rate limits, and temporary database locks are all transient. Retrying with backoff handles these automatically.
#!/bin/bash
MAX_RETRIES=3
RETRY_DELAY=30
for i in $(seq 1 "$MAX_RETRIES"); do
    if curl -fsS https://api.example.com/sync; then
        break
    fi
    if [ "$i" -eq "$MAX_RETRIES" ]; then
        echo "Failed after $MAX_RETRIES attempts" >&2
        exit 1
    fi
    # Back off: wait longer after each failed attempt (30s, then 60s)
    sleep "$RETRY_DELAY"
    RETRY_DELAY=$(( RETRY_DELAY * 2 ))
done
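When several steps in a job need the same treatment, the loop can be wrapped in a reusable function. The sketch below is one possible shape: retry is a hypothetical helper that takes a maximum number of attempts, an initial delay in seconds, and the command to run, and doubles the delay after each failure.

```shell
#!/bin/bash
# Hypothetical helper.
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Failed after ${attempt} attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))        # exponential backoff
        attempt=$(( attempt + 1 ))
    done
}
```

A call such as retry 3 30 curl -fsS https://api.example.com/sync then makes up to three attempts, waiting 30 and then 60 seconds between them. Retries only make sense for operations that are safe to repeat.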
6. Validate Output, Not Just Exit Codes
A backup script that creates a 0-byte file technically "succeeds." Check that the output actually makes sense.
#!/bin/bash
set -euo pipefail
pg_dump mydb > backup.sql
# Validate that the backup is not suspiciously small
MIN_SIZE=1000 # bytes
# Try BSD/macOS stat syntax first, then fall back to GNU stat
ACTUAL_SIZE=$(stat -f%z backup.sql 2>/dev/null || stat -c%s backup.sql)
if [ "$ACTUAL_SIZE" -lt "$MIN_SIZE" ]; then
    echo "Backup suspiciously small: ${ACTUAL_SIZE} bytes" >&2
    exit 1
fi
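A size check catches empty files but not a dump that was cut off halfway. As a sketch, the check can be extended to look for the closing comment that plain-format pg_dump output normally ends with. The function name validate_dump is hypothetical, and the trailer text is an assumption you should confirm against a dump from your own PostgreSQL version.

```shell
#!/bin/bash
# Hypothetical helper.
# Usage: validate_dump FILE MIN_SIZE_BYTES
validate_dump() {
    local file=$1 min_size=$2 size
    if [ ! -f "$file" ]; then
        echo "Backup missing: $file" >&2
        return 1
    fi
    # Arithmetic expansion strips the padding some wc implementations add
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_size" ]; then
        echo "Backup suspiciously small: ${size} bytes" >&2
        return 1
    fi
    # Assumption: a complete plain-format dump ends with this comment;
    # a dump cut short by a crash or a full disk will lack it
    if ! tail -n 20 "$file" | grep -q 'PostgreSQL database dump complete'; then
        echo "Backup looks truncated: $file" >&2
        return 1
    fi
}
```

One way to use it is to dump to a temporary name, run validate_dump on that file, and only then mv it over the previous backup, so a bad run never replaces the last good copy.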
7. Stagger Your Schedules
Scheduling everything at midnight or on the hour creates resource spikes. Stagger jobs to spread the load.
# Bad: everything at midnight
0 0 * * * /path/to/backup.sh
0 0 * * * /path/to/cleanup.sh
0 0 * * * /path/to/report.sh
# Better: staggered
0 0 * * * /path/to/backup.sh
15 0 * * * /path/to/cleanup.sh
30 0 * * * /path/to/report.sh
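Fixed offsets work on a single machine. When the same crontab is deployed to many hosts that all call one shared service, a random delay at the start of the script spreads the load further. A minimal sketch; jitter_seconds is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: prints a random whole number of seconds in [0, max_seconds)
jitter_seconds() {
    local max_seconds=$1
    # RANDOM is a bash built-in that yields an integer from 0 to 32767
    echo $(( RANDOM % max_seconds ))
}
```

A job would start with sleep "$(jitter_seconds 300)" to wait up to five minutes before doing its work. Keeping the arithmetic inside the script also avoids a crontab pitfall: an unescaped % in a crontab command is treated as a newline.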
8. Use Absolute Paths Everywhere
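Cron does not load your shell profile. Jobs run with a minimal environment: aliases and user-specific PATH entries are not available, and relative paths resolve against cron's working directory (usually the user's home directory), not the directory the script lives in.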
# Bad
python backup.py
# Good
/usr/local/bin/python3 /home/deploy/scripts/backup.py
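A script that works in your terminal can still fail under cron because of the missing environment. One way to catch this before deploying is to run the command with a stripped-down environment that approximates what cron provides. This is a sketch under assumptions: run_like_cron is a hypothetical helper, and the PATH value /usr/bin:/bin is a common cron default, not a guarantee for your system.

```shell
#!/bin/bash
# Hypothetical helper: run a command line in a minimal, cron-like environment.
# Usage: run_like_cron 'command line'
run_like_cron() {
    # env -i starts from an empty environment; only the variables listed are set
    env -i \
        HOME="${HOME:-/}" \
        LOGNAME="${LOGNAME:-$(id -un)}" \
        SHELL=/bin/sh \
        PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}
```

For example, run_like_cron '/home/deploy/scripts/backup.sh' will fail with "command not found" for anything that relied on your interactive PATH, which is the same error cron would hit.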
9. Separate Concerns: One Job, One Task
A single cron job that does backups AND sends reports AND cleans up old files is hard to debug, hard to monitor, and hard to maintain. Split them into separate jobs with separate monitoring.
10. Monitor with a Dead Man's Switch
Add a check-in at the end of every critical cron job. If the check-in does not arrive on schedule, you get alerted. This catches any failure that keeps the job from reaching its last line: crashes, hangs, timeouts, permission errors, and jobs that never started at all.
#!/bin/bash
set -euo pipefail
# Do the work
/path/to/actual-work.sh
# Check in with monitoring
curl -fsS --retry 3 https://cronguard.app/api/ping/your-monitor-id
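The same pattern can be wrapped so that the check-in happens only after success and the job's own exit status is preserved. The sketch below assumes nothing about the monitoring service beyond a URL that accepts a plain GET request; run_monitored is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper.
# Usage: run_monitored PING_URL command [args...]
run_monitored() {
    local ping_url=$1
    shift
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 0 ]; then
        # --max-time keeps a slow monitoring endpoint from hanging the job;
        # a failed check-in is reported but does not change the job's status
        curl -fsS --max-time 10 --retry 3 -o /dev/null "$ping_url" \
            || echo "Warning: check-in failed: $ping_url" >&2
    fi
    return "$status"
}
```

Called as run_monitored https://cronguard.app/api/ping/your-monitor-id /path/to/actual-work.sh, a failed job sends no check-in, so the missing ping raises the alert.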
Conclusion
These practices are not complex, but they are easy to skip when you are "just adding a quick cron job." Every production cron job should have lock files, error handling, timeouts, logging, and monitoring. Spend 10 minutes setting these up and save yourself hours of debugging later.