Watching for What Does Not Happen
Most monitoring works by watching for bad things: high CPU, error logs, failed health checks. But cron jobs break this model. When a cron job fails, there is often nothing to detect. No error log. No spike in metrics. Just silence.
Dead man's switch monitoring inverts the approach: instead of watching for failure, it watches for the absence of success. If a job does not check in within its expected window, something is wrong.
How It Works
- Create a monitor with an expected schedule (e.g., every hour, every day at 2 AM)
- Configure a grace period (how long to wait after the expected time before alerting)
- Add a check-in call to the end of your cron job script
- If the check-in does not arrive within the grace period, the monitor alerts you
#!/bin/bash
set -euo pipefail
# Do the actual work
/usr/local/bin/process-orders.sh
# Check in - only executes if everything above succeeded
curl -fsS https://cronguard.app/api/ping/abc123
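On the monitoring side, the decision in the last step comes down to one comparison. The sketch below is a hypothetical illustration of that logic, not CronGuard's actual implementation; the function name `is_overdue` and its argument order are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical monitor-side check: has a job missed its window?
# All arguments are whole seconds; timestamps are Unix epoch seconds.
#   $1 last_checkin - time of the most recent successful check-in
#   $2 interval     - expected time between check-ins (3600 for hourly)
#   $3 grace        - extra time allowed before alerting
#   $4 now          - current time (passed in so the logic is easy to test)
is_overdue() {
  local last_checkin="$1"
  local interval="$2"
  local grace="$3"
  local now="$4"
  local deadline=$(( last_checkin + interval + grace ))
  # Exit status 0 (true) means the deadline passed without a new check-in.
  [ "$now" -gt "$deadline" ]
}

# Example call (the value of last_checkin would come from the monitor's storage):
#   if is_overdue "$last_checkin" 3600 300 "$(date +%s)"; then echo "ALERT"; fi
```

Passing the current time as an argument, rather than calling `date` inside the function, keeps the comparison deterministic and easy to test.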
Why This Catches Everything
Log-based monitoring misses, or only sometimes catches, these failure modes:
| Failure Mode | Log Monitoring | Dead Man's Switch |
|---|---|---|
| Script crashes before logging | Misses it | Catches it |
| Cron daemon stops running | Misses it | Catches it |
| Server reboots without restoring crontab | Misses it | Catches it |
| Job hangs indefinitely | Misses it | Catches it |
| Permission error prevents execution | Maybe | Catches it |
| Network failure prevents work | Maybe | Catches it |
| Disk full, can't write output | Maybe | Catches it |
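The pattern is simple: if the check-in does not arrive, something prevented the job from completing successfully. The monitor does not need to know what went wrong. The one case it cannot see is a job that exits successfully after doing the wrong work, so the check-in is only as trustworthy as the script's exit status.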
Grace Periods Matter
Jobs rarely take exactly the same time on every run. A backup that normally takes 5 minutes might take 20 minutes when the database is larger than usual. Without a grace period, you get false alarms every time a job runs longer than expected.
Set grace periods based on the worst-case runtime you have observed, plus a comfortable margin. For a job that normally takes 5 minutes but has peaked at 15, a 30-minute grace period is a reasonable choice.
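The arithmetic behind that rule of thumb can be sketched as a small helper. The function name `suggest_grace_minutes` and the 2x margin are assumptions chosen to match the "peaked at 15, use 30" example; they are a starting point, not a recommendation from any particular monitoring service.

```shell
#!/bin/bash
# suggest_grace_minutes: take observed runtimes (whole minutes) as arguments
# and print a suggested grace period: the worst observed runtime times a margin.
# The 2x margin is an illustrative assumption; tune it for your own jobs.
suggest_grace_minutes() {
  local margin=2
  local worst=0
  local runtime
  for runtime in "$@"; do
    if [ "$runtime" -gt "$worst" ]; then
      worst="$runtime"
    fi
  done
  echo $(( worst * margin ))
}

# Example: suggest_grace_minutes 5 6 15 4   prints 30
```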
What to Include in Check-ins
Beyond just "I'm alive," check-in payloads can include useful context:
# Include status information
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "Processed 1,247 orders in 3m 42s"
This gives you a historical record of what each run accomplished, making it easier to spot trends (processing time increasing, fewer records being processed, etc.).
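In a real job the message would be built from measured values rather than typed in. Here is a minimal sketch of that, assuming the job can report how many records it handled; the helper names `format_duration` and `build_summary` are hypothetical.

```shell
#!/bin/bash
# format_duration: turn a number of seconds into "Xm Ys".
format_duration() {
  local total="$1"
  echo "$(( total / 60 ))m $(( total % 60 ))s"
}

# build_summary: assemble the check-in message from a record count and a runtime.
build_summary() {
  local count="$1"
  local seconds="$2"
  echo "Processed ${count} orders in $(format_duration "$seconds")"
}

# Usage inside a job (assumes process-orders.sh prints its record count,
# which is an assumption made for this sketch; SECONDS is a bash built-in timer):
#   SECONDS=0
#   count=$(/usr/local/bin/process-orders.sh)
#   curl -fsS https://cronguard.app/api/ping/abc123 \
#     -d "$(build_summary "$count" "$SECONDS")"
```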
Failure vs. Missing
Some monitoring systems distinguish between two states:
- Missing - the job did not check in at all (crash, hang, cron not running)
- Failed - the job checked in but reported an error
You can report failures explicitly:
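#!/bin/bash
# Capture the job's exit code directly; inside "if ! cmd", $? is always 0
status=0
/path/to/job.sh || status=$?
if [ "$status" -ne 0 ]; then
  # Report failure with details
  curl -fsS https://cronguard.app/api/ping/abc123/fail \
    -d "Exit code: $status"
  exit 1
fi
# Report success
curl -fsS https://cronguard.app/api/ping/abc123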
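If several jobs need the same success/failure reporting, the pattern could be pulled into a reusable wrapper. The following is one possible sketch, not a CronGuard-provided tool: `run_with_checkin`, `send_ping`, and the `PING_URL` variable are names invented for this example, and the `/fail` suffix is taken from the snippet above.

```shell
#!/bin/bash
# Hypothetical reusable wrapper around the success/failure pattern.
PING_URL="${PING_URL:-https://cronguard.app/api/ping/abc123}"

# send_ping: POST a short message to the given URL.
send_ping() {
  local url="$1"
  local message="$2"
  curl -fsS --max-time 10 -d "$message" "$url"
}

# run_with_checkin: run the given command, then report success or failure.
# Returns the command's own exit status so cron still sees the failure.
run_with_checkin() {
  local status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    # A failed failure report should not mask the job's own exit status.
    send_ping "${PING_URL}/fail" "Exit code: ${status}" || true
    return "$status"
  fi
  send_ping "$PING_URL" "OK"
}

# Example: run_with_checkin /path/to/job.sh --some-flag
```

Capturing the status with `|| status=$?` keeps the real exit code available for both the report and the return value.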
Multiple Environments
Create separate monitors for production, staging, and development. Each environment can have its own expected schedule and its own urgency level. A missing check-in from production needs an immediate alert; staging can wait until morning.
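One way to keep a single script across environments is to look up the ping URL from an environment name. This is a sketch under assumptions: the function `ping_url_for_env` is hypothetical and the monitor IDs are placeholders, not real monitors.

```shell
#!/bin/bash
# ping_url_for_env: map an environment name to its monitor's ping URL.
# The monitor IDs below are placeholders for illustration only.
ping_url_for_env() {
  local base="https://cronguard.app/api/ping"
  case "$1" in
    production)  echo "${base}/prod-abc123" ;;
    staging)     echo "${base}/staging-def456" ;;
    development) echo "${base}/dev-ghi789" ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
}

# Example (APP_ENV is an assumed variable name):
#   curl -fsS "$(ping_url_for_env "$APP_ENV")"
```

Rejecting unknown names means a mistyped environment surfaces as an error instead of silently pinging the wrong monitor.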
Common Implementation Mistakes
1. Checking in Before the Work
# WRONG - checks in before doing anything
curl -fsS https://cronguard.app/api/ping/abc123
/path/to/actual-work.sh # might fail, but we already checked in
Always check in after the work completes successfully.
2. Not Using set -e
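Without set -e, the script continues after a failed command and still reaches the check-in curl, so the monitor records a success that never happened. Adding -o pipefail extends this protection to failures inside pipelines, which set -e alone ignores.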
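The difference is easy to see with a stand-in job. In this sketch, `false` plays the failing step and `echo` plays the check-in; the function names are made up for the demonstration, and each case runs in its own child bash process so the options do not leak between them.

```shell
#!/bin/bash
# No set -e: the failure is ignored and the "check-in" still runs.
without_errexit() {
  bash -c 'false; echo "check-in sent"'
}

# set -e: the script stops at the failing step, so no "check-in" is sent.
with_errexit() {
  bash -c 'set -e; false; echo "check-in sent"'
}

# set -e alone does not notice a failure on the left side of a pipe...
pipe_without_pipefail() {
  bash -c 'set -e; false | cat; echo "check-in sent"'
}

# ...but adding pipefail makes the whole pipeline count as failed.
pipe_with_pipefail() {
  bash -c 'set -eo pipefail; false | cat; echo "check-in sent"'
}
```

Only `without_errexit` and `pipe_without_pipefail` print the "check-in sent" line; the other two exit before reaching it.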
3. Ignoring curl Failures
# -f fails on HTTP errors, -sS stays quiet but still prints errors,
# --retry retries transient failures, --max-time caps each attempt at 10 seconds
curl -fsS --retry 3 --max-time 10 https://cronguard.app/api/ping/abc123
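When the check-in itself cannot be delivered, the monitor will report the job as missing even though the work succeeded. A local log line makes that case easier to tell apart later. The sketch below assumes a hypothetical `checkin` helper; `CURL_BIN` is an override invented here so the function can be exercised without network access.

```shell
#!/bin/bash
# checkin: send the ping, and log to stderr if it could not be delivered.
# CURL_BIN is a hypothetical override for testing; it defaults to the real curl.
checkin() {
  local url="$1"
  if ! "${CURL_BIN:-curl}" -fsS --retry 3 --max-time 10 "$url" >/dev/null; then
    echo "check-in to ${url} failed at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >&2
    return 1
  fi
}

# Example: checkin https://cronguard.app/api/ping/abc123
```

Because cron typically mails or logs anything a job writes to stderr, the message ends up wherever the job's other output goes.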
Conclusion
Dead man's switch monitoring catches cron job failures that log and metric monitoring cannot see, including the ones you cannot predict. It works because it does not need to understand how your job fails; it only needs to know that it did not succeed. Add one curl to the end of your scripts and stop wondering whether your cron jobs are actually running.