Dead Man's Switch Monitoring: The Only Reliable Way to Watch Cron Jobs

Watching for What Does Not Happen

Most monitoring works by watching for bad things: high CPU, error logs, failed health checks. But cron jobs break this model. When a cron job fails, there is often nothing to detect. No error log. No spike in metrics. Just silence.

Dead man's switch monitoring inverts the approach: instead of watching for failure, it watches for the absence of success. If a job does not check in within its expected window, something is wrong.

How It Works

1. Create a monitor with an expected schedule (e.g., every hour, every day at 2 AM).
2. Configure a grace period (how long to wait after the expected time before alerting).
3. Add a check-in call to the end of your cron job script.
4. If the check-in does not arrive within the grace period, the monitor alerts you.

#!/bin/bash
set -euo pipefail

# Do the actual work
/usr/local/bin/process-orders.sh

# Check in - only executes if everything above succeeded
curl -fsS https://cronguard.app/api/ping/abc123
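
On the monitor side, the deadline check comes down to simple arithmetic. The sketch below only illustrates the idea and is not how any particular service implements it; the function name is_overdue and its arguments (Unix timestamps and durations in seconds) are assumptions made for this example.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) when a check-in is overdue.
# Arguments: last check-in time, expected interval, grace period, current time.
# Timestamps are Unix seconds; durations are in seconds.
is_overdue() {
  local last_checkin=$1 interval=$2 grace=$3 now=$4
  local deadline=$(( last_checkin + interval + grace ))
  (( now > deadline ))
}

# Hourly job with a 10-minute grace period, last seen at t=0.
# At t=4300 the deadline (t=4200) has passed, so this branch runs.
if is_overdue 0 3600 600 4300; then
  echo "ALERT: check-in missed"
fi
```

The grace period simply pushes the deadline back: the same job checked at t=4000 would still be considered on time.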

Why This Catches Everything

Traditional monitoring misses these failure modes:

Failure Mode	Log Monitoring	Dead Man's Switch
Script crashes before logging	Misses it	Catches it
Cron daemon stops running	Misses it	Catches it
Server reboots without restoring crontab	Misses it	Catches it
Job hangs indefinitely	Misses it	Catches it
Permission error prevents execution	Maybe	Catches it
Network failure prevents work	Maybe	Catches it
Disk full, can't write output	Maybe	Catches it

The pattern is simple: if the check-in does not arrive, something prevented the job from completing successfully. It does not matter what went wrong.

Grace Periods Matter

Jobs rarely take exactly the same time on every run. A backup that normally takes 5 minutes might take 20 minutes when the database is larger than usual. Without a grace period, you get false alarms every time a job runs longer than expected.

Set grace periods based on the worst-case runtime you have observed, plus a comfortable margin. For a job that normally takes 5 minutes but has peaked at 15, a 30-minute grace period is a reasonable choice.
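
One way to turn that rule into a number is to double the worst runtime you have seen. The doubling factor is an assumption chosen to match the 15-minute peak and 30-minute grace period mentioned above; suggest_grace_minutes is a hypothetical helper, not part of any monitoring tool.

```shell
#!/bin/bash
# Hypothetical helper: given observed runtimes in minutes, suggest a grace
# period of twice the worst case (the 2x margin is an assumption).
suggest_grace_minutes() {
  local max=0 runtime
  for runtime in "$@"; do
    if (( runtime > max )); then
      max=$runtime
    fi
  done
  echo $(( max * 2 ))
}

# Runtimes from recent runs, in minutes; the worst case here is 15
suggest_grace_minutes 5 6 4 15 5
```

If a job's runtime varies more widely, a larger multiplier or a fixed extra margin may suit it better.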

What to Include in Check-ins

Beyond just "I'm alive," check-in payloads can include useful context:

# Include status information
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "Processed 1,247 orders in 3m 42s"

This gives you a historical record of what each run accomplished, making it easier to spot trends (processing time increasing, fewer records being processed, etc.).
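
A summary like that can be built from values the script already has. The sketch below is one possible approach: format_summary and the processed variable are hypothetical names, and the record count is a placeholder rather than real output.

```shell
#!/bin/bash
# Hypothetical helper: format a run summary from a record count and a
# duration in seconds.
format_summary() {
  local count=$1 seconds=$2
  printf 'Processed %d orders in %dm %ds' \
    "$count" $(( seconds / 60 )) $(( seconds % 60 ))
}

# SECONDS is a bash builtin that counts seconds since it was last assigned
SECONDS=0

# ... do the actual work here, counting processed records ...
processed=1247   # placeholder value for illustration

summary=$(format_summary "$processed" "$SECONDS")
echo "$summary"

# The summary can then be sent as the check-in payload:
# curl -fsS https://cronguard.app/api/ping/abc123 -d "$summary"
```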

Failure vs. Missing

Some monitoring systems distinguish between two states:

Missing - the job did not check in at all (crash, hang, cron not running)
Failed - the job checked in but reported an error

You can report failures explicitly:

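#!/bin/bash
# Capture the job's exit code first; inside "if ! cmd", $? is already 0
status=0
/path/to/job.sh || status=$?

if [ "$status" -ne 0 ]; then
  # Report failure with details
  curl -fsS https://cronguard.app/api/ping/abc123/fail \
    -d "Exit code: $status"
  exit 1
fi

# Report success
curl -fsS https://cronguard.app/api/ping/abc123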

Multiple Environments

Create separate monitors for production, staging, and development. Each environment has different expected schedules and different urgency levels. A missing check-in from production needs an immediate alert; staging can wait until morning.
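
If the same script is deployed to every environment, it can pick its monitor at runtime. This is a minimal sketch: the APP_ENV variable name, the monitor_url_for function, and the monitor IDs are all assumptions invented for the example.

```shell
#!/bin/bash
# Hypothetical mapping from environment name to monitor URL.
# The IDs below are placeholders, not real monitors.
monitor_url_for() {
  case "$1" in
    production)  echo "https://cronguard.app/api/ping/prod-id" ;;
    staging)     echo "https://cronguard.app/api/ping/staging-id" ;;
    development) echo "https://cronguard.app/api/ping/dev-id" ;;
    *)           echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# APP_ENV is an assumed variable name; default to development if unset
url=$(monitor_url_for "${APP_ENV:-development}") || exit 1
echo "Would check in at: $url"
```

Rejecting unknown environment names keeps a typo from silently pinging the wrong monitor.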

Common Implementation Mistakes

1. Checking in Before the Work

# WRONG - checks in before doing anything
curl -fsS https://cronguard.app/api/ping/abc123
/path/to/actual-work.sh  # might fail, but we already checked in

Always check in after the work completes successfully.

2. Not Using set -e

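Without set -e, the script continues after errors and reaches the check-in curl even though earlier commands failed. The first example in this post uses set -euo pipefail, which additionally treats unset variables as errors and makes a pipeline fail if any command in it fails.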
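
The difference is easy to see in isolation. In this sketch, false stands in for a failing job step and the echo stands in for the check-in; the function names are made up for the demonstration.

```shell
#!/bin/bash
# Without set -e: the failure is ignored and the "check-in" still runs
without_errexit() {
  bash -c 'false; echo "check-in sent"'
}

# With set -e: the script exits at the failing command, so no check-in
with_errexit() {
  bash -c 'set -e; false; echo "check-in sent"'
}

without_errexit
with_errexit || echo "job stopped before check-in"
```

The first call prints the check-in message despite the failure; the second exits before reaching it, which is the behavior a dead man's switch depends on.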

3. Ignoring curl Failures

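By default, curl exits with status 0 even when the server responds with an HTTP error such as 404 or 500, so a mistyped check-in URL can go unnoticed.

# -f fails on HTTP errors, -sS hides the progress bar but still prints errors,
# --retry 3 retries transient failures, --max-time 10 caps each attempt at 10 seconds
curl -fsS --retry 3 --max-time 10 https://cronguard.app/api/ping/abc123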

Frequently asked questions about dead man's switch monitoring

What is dead man's switch monitoring? It inverts the usual approach to monitoring. Instead of watching for bad things like high CPU, error logs, or failed health checks, it watches for the absence of success: you create a monitor with an expected schedule and a grace period, add a check-in call to the end of your cron job script, and if that check-in does not arrive within the grace period the monitor alerts you.

Why does log monitoring miss so many cron failures? Because a failing cron job often produces nothing to detect — no error log, no spike in metrics, just silence. Log monitoring misses a script that crashes before it logs anything, a cron daemon that stops running, a server that reboots without restoring the crontab, and a job that hangs indefinitely. A dead man's switch catches all of those, because the absence of the check-in is the signal.

How long should the grace period be? Base it on the worst-case runtime you have observed, plus a comfortable margin. Jobs rarely take exactly the same time on every run — a backup that normally takes 5 minutes might take 20 when the database is larger than usual — and without a grace period you get false alarms every time a job runs long. For a job that normally takes 5 minutes but has peaked at 15, set a 30-minute grace period.

Where in the script should the check-in go? Always after the work completes successfully, never before. Checking in first means the check-in still happens even if the actual work then fails. You also need set -e, otherwise the script continues after errors and reaches the check-in even though earlier commands failed.

What is the difference between a missing job and a failed job? Missing means the job did not check in at all — a crash, a hang, or cron not running. Failed means the job checked in but reported an error, which you can do explicitly by calling the fail endpoint with details when your script detects a non-zero exit.
