The Invisible Infrastructure
Cron jobs are the workhorses of backend infrastructure. They run backups, process payments, send emails, sync data, and clean up logs. They do it quietly, reliably, and without asking for attention.
Until they stop. And nobody notices.
That is the fundamental problem with cron jobs: they are designed to be invisible. There is no user interface showing their status. No green light that turns red. Just silence where there used to be work getting done.
Common Causes of Silent Failures
1. Environment Variable Differences
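Cron runs jobs in a minimal environment. PATH is typically just /usr/bin:/bin, SHELL defaults to /bin/sh, and nothing from your login shell's profile (.bashrc, .profile) is loaded. Variables and PATH entries your script depends on may simply be missing, so a script that works perfectly from the terminal can fail completely under cron because it cannot find a binary or config file.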
# This works in your terminal
python3 /home/deploy/scripts/backup.py
# But cron might not find python3 because /usr/local/bin isn't in PATH
# Fix: use absolute paths
/usr/local/bin/python3 /home/deploy/scripts/backup.py
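Absolute paths work, but every binary in every script has to be spelled out. A minimal sketch of an alternative, assuming your binaries live under /usr/local/bin, /usr/bin, or /bin: set PATH explicitly at the top of the script and fail fast if something is missing. The require helper is a hypothetical name, not a standard command.

```bash
#!/bin/bash
set -euo pipefail

# Set PATH explicitly instead of relying on cron's minimal default.
# The directories listed here are an assumption; adjust for your system.
export PATH="/usr/local/bin:/usr/bin:/bin"

# Hypothetical helper: require BIN [BIN...]
# Returns non-zero with a clear message if any binary cannot be found.
require() {
  local bin
  for bin in "$@"; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing required binary: $bin" >&2
      return 1
    fi
  done
}
```

A backup script would then call require python3 pg_dump right after these lines, so a missing binary shows up as a clear error message instead of a failure halfway through the job.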
2. Swallowed Exit Codes
Bash scripts with multiple commands can mask failures. If a command in the middle fails but the last command succeeds, the script exits with code 0 and everything looks fine.
#!/bin/bash
# The backup might fail, but the echo always succeeds
pg_dump mydb > backup.sql
echo "Done" # exit code 0, even if pg_dump failed
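Fix this with set -euo pipefail: -e aborts the script as soon as a command fails, -u treats unset variables as errors, and -o pipefail makes a pipeline fail if any command in it fails, not just the last one: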
#!/bin/bash
set -euo pipefail
pg_dump mydb > backup.sql
echo "Done" # only runs if pg_dump succeeded
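The pipefail part matters as soon as a job pipes one command into another, for example pg_dump mydb | gzip > backup.sql.gz. Without it, a pipeline's exit status is that of its last command, so a failing pg_dump hides behind a successful gzip. In this small demonstration, false stands in for the failing command and cat for gzip:

```bash
#!/bin/bash
# Exit status of a pipeline whose first command fails, without pipefail.
without_pipefail() {
  bash -c 'false | cat >/dev/null; echo $?'
}

# The same pipeline with pipefail enabled.
with_pipefail() {
  bash -c 'set -o pipefail; false | cat >/dev/null; echo $?'
}

echo "without pipefail: $(without_pipefail)"   # prints "without pipefail: 0"
echo "with pipefail: $(with_pipefail)"         # prints "with pipefail: 1"
```

With pipefail set, the pipeline reports the failure and set -e stops the script before it can report success.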
3. Disk Space Exhaustion
A backup job that writes to a full disk produces a zero-byte or truncated file. The script might not even error out; it just silently creates a useless backup. Weeks later, when you actually need to restore, you discover every backup since last Tuesday is empty.
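Two cheap guards help here: check free space before starting, and verify the output afterwards. This is a sketch under stated assumptions: has_free_space and verify_backup are hypothetical helper names, the threshold is yours to choose, and the gzip -t integrity check only applies to gzip-compressed files.

```bash
# Hypothetical helper: has_free_space DIR MIN_KB
# Succeeds if the filesystem holding DIR has at least MIN_KB KiB available.
has_free_space() {
  local dir=$1 min_kb=$2 avail_kb
  # -P gives POSIX single-line output; -k reports sizes in 1024-byte blocks.
  # Column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$min_kb" ]
}

# Hypothetical helper: verify_backup FILE
# Fails if FILE is missing or empty; for .gz files, also tests integrity.
verify_backup() {
  local file=$1
  if [ ! -s "$file" ]; then
    echo "backup missing or empty: $file" >&2
    return 1
  fi
  case "$file" in
    *.gz) gzip -t "$file" ;;   # non-zero if the archive is truncated or corrupt
  esac
}
```

For example, has_free_space /backups 5242880 requires at least 5 GiB (5,242,880 KiB) free before the dump starts, and running verify_backup on the compressed file afterwards fails the run instead of leaving a useless backup behind.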
4. Permission Changes
System updates, deploys, or container rebuilds can change file permissions. A cron job that ran fine for months suddenly cannot read its config file or write to its output directory. The crontab still triggers the job, but the job immediately fails.
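A preflight check at the top of the job turns a vague failure into a specific error message. A minimal sketch: preflight and its arguments are hypothetical, and because the -r and -w tests check what the current user can do, the check only means something when it runs as the same user cron uses for the job.

```bash
# Hypothetical helper: preflight CONFIG_FILE OUTPUT_DIR
# Fails with a specific message if the config is unreadable
# or the output directory is missing or not writable.
preflight() {
  local config=$1 outdir=$2
  if [ ! -r "$config" ]; then
    echo "preflight: cannot read config file $config" >&2
    return 1
  fi
  if [ ! -d "$outdir" ] || [ ! -w "$outdir" ]; then
    echo "preflight: cannot write to output directory $outdir" >&2
    return 1
  fi
}
```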
5. Network Timeouts
Jobs that call external APIs or sync with remote services fail when those services are temporarily unavailable. Without timeout handling and retry logic, the job either hangs indefinitely or fails and moves on without completing its work.
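A minimal sketch of timeout-plus-retry handling for a network step, assuming the coreutils timeout command is available. retry_with_timeout is a hypothetical helper; the attempt count, time limit, and backoff are placeholders.

```bash
# Hypothetical helper: retry_with_timeout ATTEMPTS SECONDS COMMAND [ARGS...]
# COMMAND must be an executable (timeout cannot run shell functions).
retry_with_timeout() {
  local attempts=$1 limit=$2 n=1
  shift 2
  # timeout kills the command after $limit seconds and returns 124,
  # so a hang is treated like any other failed attempt.
  until timeout "$limit" "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    sleep $(( n * 2 ))   # linear backoff: 2s, 4s, 6s, ...
    n=$(( n + 1 ))
  done
}
```

For example, retry_with_timeout 3 30 curl -fsS https://api.example.com/sync (a placeholder URL) gives each attempt 30 seconds and tries up to three times before failing the job. When the network step is a single curl call, curl's own --max-time and --retry options cover the same ground.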
6. Resource Contention
Two cron jobs scheduled at the same time can compete for CPU, memory, or database connections. One or both may fail or produce incorrect results under load. This is especially common with jobs scheduled at midnight or on the hour, when many tasks tend to run simultaneously.
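The first fix is simply to move jobs off the busy minutes, for example 17 2 * * * instead of 0 2 * * *. For jobs that must share a resource such as a database, a shared lock makes them run one after another instead of simultaneously. A minimal sketch using flock(1) from util-linux, which is Linux-specific; run_exclusive, the lock path, and the wait time are assumptions.

```bash
# Hypothetical helper: run_exclusive LOCKFILE WAIT_SECONDS COMMAND [ARGS...]
# Jobs that use the same LOCKFILE never run at the same time.
run_exclusive() {
  local lockfile=$1 wait_secs=$2
  shift 2
  (
    # Wait up to wait_secs for an exclusive lock on file descriptor 9.
    if ! flock -w "$wait_secs" 9; then
      echo "could not lock $lockfile within ${wait_secs}s" >&2
      exit 1
    fi
    "$@"
  ) 9>"$lockfile"   # the lock is released when the subshell exits
}
```

Two jobs that both run through run_exclusive /tmp/db-jobs.lock 600 will queue for up to ten minutes rather than compete; if the lock is not free in time, the job exits with a non-zero status instead of running under contention.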
Why Default Cron Monitoring Does Not Work
Email Output
Cron can email a job's output to the address set in the MAILTO variable, but this only works if:
- A mail server is configured on the host
- Someone actually reads those emails
- The job produces output on failure (many do not)
In practice, cron emails either go to /dev/null, pile up in a mailbox nobody checks, or are never configured in the first place.
Log Files
Redirecting output to log files is better, but someone needs to read those logs. Grepping through /var/log/syslog for cron entries is not a monitoring strategy.
The Dead Man's Switch Approach
The most reliable way to monitor cron jobs is the dead man's switch pattern: instead of watching for failure, you watch for the absence of success.
Your cron job "checks in" at the end of a successful run by hitting a URL. If the check-in does not arrive within the expected window, something went wrong. It does not matter why it failed; the absence of a success signal is the alert.
#!/bin/bash
set -euo pipefail
# Do the actual work; compute the filename once so pg_dump and gzip
# always agree, even if the job runs across midnight
backup="/backups/mydb-$(date +%Y%m%d).sql"
pg_dump mydb > "$backup"
gzip "$backup"
# Check in with CronGuard - only runs if everything above succeeded
curl -fsS --retry 3 --max-time 10 https://cronguard.app/api/ping/your-monitor-id
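If the script fails at any point before the curl, the check-in never happens, and you get an alert. Because the alert is triggered by silence, it catches any failure that keeps the script from reaching its last line: crashes, hangs, timeouts, permission errors, disk space issues, and everything else. What it cannot catch on its own is a run that completes but produces bad output, so the script still needs to validate its results before checking in.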
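To monitor existing jobs without editing each script, the check-in can live in a small wrapper instead. A sketch under assumptions: run_and_ping is a hypothetical name, and it simply applies the same rule as the script above, pinging only when the wrapped command exits with status 0.

```bash
# Hypothetical wrapper: run_and_ping PING_URL COMMAND [ARGS...]
# Runs COMMAND and hits PING_URL only if it succeeds.
run_and_ping() {
  local ping_url=$1
  shift
  if "$@"; then
    curl -fsS --retry 3 --max-time 10 "$ping_url" > /dev/null
  else
    local status=$?   # exit status of the failed command
    echo "job failed with exit code $status; not checking in" >&2
    return "$status"
  fi
}
```

Saved as a small script that calls run_and_ping "$@", it lets the crontab line pass the ping URL followed by the original command, so the job itself stays unchanged. Note that a failed ping also makes the wrapper exit non-zero.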
What Good Monitoring Looks Like
- Expected schedule awareness - the monitor knows when the job should run
- Grace periods - jobs can vary in duration without triggering false alarms
- Multiple notification channels - email, Slack, Discord, webhooks
- Historical tracking - see patterns in job duration and failure frequency
- Team visibility - everyone can see what is running and what is failing
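Schedule awareness and grace periods come down to one comparison on the monitoring side: has a check-in arrived within the expected interval plus the grace period? This is a conceptual sketch only, not how CronGuard or any particular service implements it; is_overdue is a hypothetical name and all times are Unix epoch seconds.

```bash
# Hypothetical monitor-side check: is_overdue LAST_PING INTERVAL GRACE [NOW]
# Succeeds (exit 0) when no check-in has arrived within INTERVAL + GRACE seconds.
is_overdue() {
  local last_ping=$1 interval=$2 grace=$3 now=${4:-$(date +%s)}
  [ $(( now - last_ping )) -gt $(( interval + grace )) ]
}
```

For an hourly job with a five-minute grace period, is_overdue "$last_ping" 3600 300 starts succeeding once more than 65 minutes have passed since the last check-in, which is the point at which an alert would go out.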
Prevention Checklist
- Use set -euo pipefail in all bash cron scripts
- Use absolute paths for binaries and files
- Check disk space before writing large files
- Add timeouts to network operations
- Stagger job schedules to avoid resource contention
- Test cron jobs in the cron environment, not just your terminal
- Monitor with a dead man's switch, not just log scraping
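One way to test in something close to the cron environment is to run the job with an emptied environment that holds only what cron typically provides. A sketch: the exact defaults vary between cron implementations, so the values below are assumptions to adjust for your system, and run_like_cron is a hypothetical name.

```bash
# Hypothetical helper: run_like_cron 'COMMAND'
# Runs COMMAND with an emptied environment holding only typical cron defaults.
run_like_cron() {
  env -i \
    HOME="$HOME" \
    LOGNAME="${LOGNAME:-$(id -un)}" \
    SHELL=/bin/sh \
    PATH=/usr/bin:/bin \
    /bin/sh -c "$1"
}
```

Running run_like_cron 'python3 /home/deploy/scripts/backup.py' then fails the same way it would under cron if the script relies on anything from your login environment. It does not change the user, so run it as the account that owns the crontab.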
Conclusion
Cron jobs fail silently because that is their nature. They were designed to run without human interaction, which means they also fail without human notification. The only reliable solution is proactive monitoring that detects the absence of success rather than waiting for visible failure.