Dead Man's Switch Monitoring: The Only Reliable Way to Watch Cron Jobs

Why traditional monitoring misses cron failures and how the dead man's switch pattern catches every type of failure by detecting the absence of success.

CronGuard Team · 6 min read

Watching for What Does Not Happen

Most monitoring works by watching for bad things: high CPU, error logs, failed health checks. But cron jobs break this model. When a cron job fails, there is often nothing to detect. No error log. No spike in metrics. Just silence.

Dead man's switch monitoring inverts the approach: instead of watching for failure, it watches for the absence of success. If a job does not check in within its expected window, something is wrong.

How It Works

  1. Create a monitor with an expected schedule (e.g., every hour, every day at 2 AM)
  2. Configure a grace period (how long to wait after the expected time before alerting)
  3. Add a check-in call to the end of your cron job script
  4. If the check-in does not arrive within the grace period, the monitor alerts you

A minimal job script for step 3 looks like this:

#!/bin/bash
set -euo pipefail

# Do the actual work
/usr/local/bin/process-orders.sh

# Check in - only executes if everything above succeeded
curl -fsS https://cronguard.app/api/ping/abc123
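
On the monitoring side, steps 1, 2, and 4 come down to a timestamp comparison. The sketch below illustrates that comparison under simplifying assumptions: it is not CronGuard's actual implementation, the name `is_overdue` and its arguments are hypothetical, and it measures from the last check-in rather than from a parsed cron schedule.

```shell
# Hypothetical sketch of the monitor-side decision; not CronGuard's real code.
# Arguments: last check-in (epoch seconds), expected interval (seconds),
# grace period (seconds), current time (epoch seconds).
# Exit status 0 means "overdue, send an alert"; 1 means "still within the window".
is_overdue() {
  local last_checkin=$1 interval=$2 grace=$3 now=$4
  [ "$now" -gt $((last_checkin + interval + grace)) ]
}
```

For an hourly job with a 10-minute grace period, `is_overdue "$last" 3600 600 "$(date +%s)"` succeeds once more than 70 minutes have passed since the last check-in.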

Why This Catches Everything

Traditional monitoring misses these failure modes:

| Failure Mode | Log Monitoring | Dead Man's Switch |
| --- | --- | --- |
| Script crashes before logging | Misses it | Catches it |
| Cron daemon stops running | Misses it | Catches it |
| Server reboots without restoring crontab | Misses it | Catches it |
| Job hangs indefinitely | Misses it | Catches it |
| Permission error prevents execution | Maybe | Catches it |
| Network failure prevents work | Maybe | Catches it |
| Disk full, can't write output | Maybe | Catches it |

The pattern is simple: if the check-in does not arrive, something prevented the job from completing successfully. It does not matter what went wrong.

Grace Periods Matter

Jobs rarely take exactly the same time on every run. A backup that normally takes 5 minutes might take 20 minutes when the database is larger than usual. Without a grace period, you get false alarms every time a job runs longer than expected.

Set grace periods based on the worst-case runtime you have observed, plus a comfortable margin. For a job that normally takes 5 minutes but has peaked at 15, a 30-minute grace period is a reasonable choice.
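
If you keep a record of past runtimes, the rule of thumb can be turned into a small calculation. This is only a sketch: `suggest_grace_minutes` is a hypothetical helper, and the 2x margin is an assumption chosen to match the 15-to-30-minute example, not a fixed rule.

```shell
# Hypothetical helper: suggest a grace period in minutes.
# Takes observed runtimes (whole minutes) as arguments, finds the worst one,
# and doubles it. The 2x margin is an assumption; adjust it to taste.
suggest_grace_minutes() {
  local worst=0 runtime
  for runtime in "$@"; do
    if [ "$runtime" -gt "$worst" ]; then
      worst=$runtime
    fi
  done
  echo $((worst * 2))
}
```

Calling `suggest_grace_minutes 5 6 15 4` prints `30`.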

What to Include in Check-ins

Beyond just "I'm alive," check-in payloads can include useful context:

# Include status information
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "Processed 1,247 orders in 3m 42s"

This gives you a historical record of what each run accomplished, making it easier to spot trends (processing time increasing, fewer records being processed, etc.).
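
In a real job the summary would be built from measured values rather than typed by hand. A minimal sketch, assuming the job can report how many records it handled; `format_summary` is a hypothetical helper and the message wording simply mirrors the example above.

```shell
# Hypothetical helper: build a one-line summary for the check-in body.
# Arguments: number of records processed, elapsed time in seconds.
format_summary() {
  local count=$1 elapsed=$2
  printf 'Processed %d orders in %dm %ds\n' \
    "$count" $((elapsed / 60)) $((elapsed % 60))
}

# Inside a job, the pieces might fit together like this. It is left commented
# out because it needs a real monitor URL, and it assumes the job prints a count:
#   start=$(date +%s)
#   count=$(/usr/local/bin/process-orders.sh)
#   curl -fsS https://cronguard.app/api/ping/abc123 \
#     -d "$(format_summary "$count" $(( $(date +%s) - start )))"
```

For example, `format_summary 1247 222` prints `Processed 1247 orders in 3m 42s`.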

Failure vs. Missing

Some monitoring systems distinguish between two states:

  • Missing - the job did not check in at all (crash, hang, cron not running)
  • Failed - the job checked in but reported an error

You can report failures explicitly:

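#!/bin/bash
# Capture the exit code first: inside "if ! cmd", $? is always 0
status=0
/path/to/job.sh || status=$?

if [ "$status" -ne 0 ]; then
  # Report failure with details
  curl -fsS https://cronguard.app/api/ping/abc123/fail \
    -d "Exit code: $status"
  exit "$status"
fi

# Report success
curl -fsS https://cronguard.app/api/ping/abc123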
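
If several jobs need the same treatment, the pattern can be pulled into a reusable wrapper. The following is a sketch, not an official client: `run_and_report` and the `PING_URL` variable are hypothetical names, and the `/fail` suffix is taken from the example above.

```shell
# Hypothetical wrapper: run any command, then report success or failure.
# PING_URL must hold the monitor's check-in URL.
# Returns the wrapped command's exit code (or curl's, if the job succeeded
# but the success ping could not be delivered).
run_and_report() {
  local status=0
  "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    curl -fsS "$PING_URL"
    return
  fi
  # Best effort: a failed failure-report should not mask the job's own exit code
  curl -fsS "$PING_URL/fail" -d "Exit code: $status" || true
  return "$status"
}
```

With `PING_URL` set to your monitor's URL, a job script can then call `run_and_report /path/to/job.sh` instead of repeating the branching logic.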

Multiple Environments

Create separate monitors for production, staging, and development. Each environment has different expected schedules and different urgency levels. A missing check-in from production needs an immediate alert; staging can wait until morning.
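
One way to keep a single job script across environments is to look up the monitor from an environment name. This is a sketch with made-up values: `ping_url_for_env` is a hypothetical helper and the monitor IDs are placeholders for whatever IDs your own monitors have.

```shell
# Hypothetical mapping from environment name to check-in URL.
# The IDs below are placeholders, not real monitors.
ping_url_for_env() {
  local id
  case "$1" in
    production)  id="prod-abc123" ;;
    staging)     id="staging-def456" ;;
    development) id="dev-ghi789" ;;
    *)
      echo "unknown environment: $1" >&2
      return 1
      ;;
  esac
  echo "https://cronguard.app/api/ping/$id"
}
```

A job could then check in with `curl -fsS "$(ping_url_for_env "$APP_ENV")"`, where `APP_ENV` is an assumed variable set differently on each server.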

Common Implementation Mistakes

1. Checking in Before the Work

# WRONG - checks in before doing anything
curl -fsS https://cronguard.app/api/ping/abc123
/path/to/actual-work.sh  # might fail, but we already checked in

Always check in after the work completes successfully.

2. Not Using set -e

Without set -e, the script continues after errors and reaches the check-in curl even though earlier commands failed.
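
The difference is easy to see in isolation. In this small demonstration each function runs a two-step stand-in job in a fresh bash process: a failing step, then an `echo` in place of the real check-in. The function names are made up for illustration.

```shell
# A failing step followed by a stand-in for the check-in call.
job_without_set_e() {
  bash -c 'false; echo "check-in sent"'
}

job_with_set_e() {
  bash -c 'set -e; false; echo "check-in sent"'
}

# set -e alone does not notice a failure early in a pipeline,
# because a pipeline's status is that of its last command.
job_with_pipe() {
  bash -c 'set -e; false | cat; echo "check-in sent"'
}

job_with_pipefail() {
  bash -c 'set -eo pipefail; false | cat; echo "check-in sent"'
}
```

`job_without_set_e` and `job_with_pipe` both print `check-in sent` despite the failure; `job_with_set_e` and `job_with_pipefail` print nothing and exit non-zero. That pipeline case is the reason the first script in this post uses `set -euo pipefail` rather than `set -e` alone.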

3. Ignoring curl Failures

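By default curl exits 0 even when the server responds with an HTTP error, so a rejected check-in goes unnoticed unless you ask curl to fail.

# -f fails on HTTP errors, -sS hides progress output but still prints errors,
# --retry retries transient failures, --max-time caps each attempt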
curl -fsS --retry 3 --max-time 10 https://cronguard.app/api/ping/abc123
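
You may also want a failed check-in to leave a trace on the job's side. A minimal sketch, assuming a warning on stderr is enough; `send_checkin` is a hypothetical helper and the message format is made up.

```shell
# Hypothetical helper: send the check-in and make a failed ping visible.
# The curl flags match the line above; the warning text is an assumption.
send_checkin() {
  local url=$1
  if ! curl -fsS --retry 3 --max-time 10 "$url" >/dev/null; then
    echo "WARNING: check-in to $url failed" >&2
    return 1
  fi
}
```

Cron typically mails a job's output to the crontab owner when mail is configured, so the warning has a chance of being seen even though the monitor will also report the run as missing.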

Conclusion

Dead man's switch monitoring catches any failure that stops a cron job from reaching its check-in, including the ones you cannot predict. It works because it does not need to understand how your job fails; it only needs to know that it did not succeed. Add one curl to the end of your scripts and stop wondering if your cron jobs are actually running.
