Beyond "Did It Run?"
Knowing whether a cron job ran is the bare minimum. Real observability means understanding how it ran: how long it took, how much data it processed, whether it is getting slower over time, and what it did during execution.
The Three Pillars for Cron Jobs
1. Metrics
Track quantitative data about each run:
- Execution duration - is the job getting slower?
- Records processed - is throughput consistent?
- Success/failure rate - is reliability declining?
- Resource usage - CPU, memory, disk I/O during execution
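#!/bin/bash
set -euo pipefail
START=$(date +%s)
# Do the work. This assumes process_orders.py prints the processed count as the
# last line of stdout; stderr is not merged in, so warnings cannot pollute the value.
PROCESSED=$(python3 process_orders.py | tail -1)
END=$(date +%s)
DURATION=$((END - START))
# Report metrics with the check-in
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "duration=${DURATION}s, processed=${PROCESSED}"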
2. Logs
Structured logging makes cron output searchable and parseable:
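#!/bin/bash
LOG_FILE="/var/log/cron-jobs/order-processor.log"
# Append one JSON object per line. jq builds the object so that quotes and
# newlines in the message are escaped correctly.
log_json() {
  local level=$1 msg=$2
  jq -cn --arg ts "$(date -Iseconds)" --arg level "$level" --arg msg "$msg" \
    '{timestamp: $ts, level: $level, job: "order-processor", message: $msg}' >> "$LOG_FILE"
}
log_json "info" "Starting order processing"
# Capture stdout and stderr so a failure message ends up in the log
RESULT=$(python3 process_orders.py 2>&1) || {
  log_json "error" "Failed: $RESULT"
  exit 1
}
log_json "info" "Completed: $RESULT"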
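Once every line is a JSON object, the log can be filtered by field instead of grepped as text. The following is a minimal sketch, assuming each log line is a valid JSON object with `timestamp`, `level`, `job`, and `message` fields; the function name `recent_errors` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print error entries from a JSON-lines log.
# Usage: recent_errors LOG_FILE [JOB]
recent_errors() {
  local log_file=$1 job=${2:-}
  # select() keeps only matching objects; an empty job name matches every job.
  jq -r --arg job "$job" '
    select(.level == "error")
    | select($job == "" or .job == $job)
    | "\(.timestamp) \(.job): \(.message)"
  ' "$log_file"
}
```

For example, `recent_errors /var/log/cron-jobs/order-processor.log` would list every logged failure with its timestamp.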
3. Traces (for Complex Jobs)
For jobs that call multiple services or have multiple phases, tracing shows where time is spent and where failures occur:
Phase 1: Fetch data from API [2.3s] OK
Phase 2: Transform records [0.8s] OK
Phase 3: Write to database [1.2s] OK
Phase 4: Send notifications [0.5s] FAILED - SMTP timeout
Phase 5: Upload report to S3 [SKIPPED]
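A full tracing system is not required to get output like this. As a rough sketch, a small wrapper can time each phase and skip the rest after the first failure. The name `run_phase` and the variable `PHASE_FAILED` are hypothetical, and timing is limited to whole seconds because it relies on `date +%s`.

```shell
#!/bin/bash
# Hypothetical helper: run one phase, print its wall-clock time and outcome.
# After the first failure, later phases are reported as SKIPPED.
PHASE_FAILED=0

run_phase() {
  local name=$1
  shift
  if [ "$PHASE_FAILED" -ne 0 ]; then
    printf '%-30s [SKIPPED]\n' "$name"
    return 0
  fi
  local start end
  start=$(date +%s)
  if "$@"; then
    end=$(date +%s)
    printf '%-30s [%ss] OK\n' "$name" "$((end - start))"
  else
    end=$(date +%s)
    PHASE_FAILED=1
    printf '%-30s [%ss] FAILED\n' "$name" "$((end - start))"
  fi
}

# Example wiring (the phase commands are placeholders):
# run_phase "Fetch data from API" python3 fetch.py
# run_phase "Transform records"   python3 transform.py
# exit "$PHASE_FAILED"
```

Call `run_phase` directly rather than inside `$(...)`, since a subshell would not carry the failure flag forward to later phases.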
Duration Tracking
Duration trends are among the most valuable metrics. A backup that normally takes 5 minutes but gradually creeps up to 30 minutes is a warning sign of growing data volume, degrading disk performance, or increasing resource contention.
Setting Duration Alerts
Alert when execution time exceeds historical norms:
- Warning: 2x average duration
- Critical: 5x average duration, or a run long enough to approach the interval between runs (the next scheduled run could then overlap the current one)
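These thresholds can be sketched as a small classifier. The function name `classify_duration` is hypothetical, all values are whole seconds, the average is assumed to be greater than zero, and the 80% cutoff for "approaching the interval" is an assumption chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: classify a run's duration (all values in whole seconds).
# Usage: classify_duration DURATION AVERAGE INTERVAL
classify_duration() {
  local duration=$1 average=$2 interval=$3
  # Critical: 5x the average, or at least 80% of the gap between runs
  # (the 80% cutoff is an assumption for "approaching the interval").
  if [ "$duration" -ge $((average * 5)) ] || [ $((duration * 100)) -ge $((interval * 80)) ]; then
    echo critical
  elif [ "$duration" -ge $((average * 2)) ]; then
    echo warning
  else
    echo ok
  fi
}
```

For an hourly job that averages 5 minutes, `classify_duration 700 300 3600` prints `warning` and `classify_duration 1800 300 3600` prints `critical`.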
Output Validation
Track what your jobs produce, not just whether they ran:
#!/bin/bash
set -euo pipefail
# Run the job and capture its JSON stats from stdout
RESULT=$(python3 sync_orders.py --json-stats)
# Default missing fields to 0 so the numeric test below never sees "null"
SYNCED=$(jq -r '.synced // 0' <<< "$RESULT")
SKIPPED=$(jq -r '.skipped // 0' <<< "$RESULT")
ERRORS=$(jq -r '.errors // 0' <<< "$RESULT")
# Warn if the error count is too high
if [ "$ERRORS" -gt 10 ]; then
  echo "WARNING: $ERRORS errors during sync" >&2
fi
# Check in with metrics
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "synced=$SYNCED, skipped=$SKIPPED, errors=$ERRORS"
Historical Analysis
Store metrics over time to answer questions like:
- Is this job getting slower? (performance degradation)
- Is it processing fewer records? (data pipeline issue)
- Does it fail on specific days? (pattern detection)
- Did the last deploy affect job performance? (change correlation)
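Even a flat file is enough to start answering these questions. The sketch below assumes a hypothetical per-job CSV with the columns `epoch_seconds,duration_seconds,records,exit_code`; the function names `record_run` and `average_duration` are likewise illustrative.

```shell
#!/bin/bash
# Hypothetical history format: epoch_seconds,duration_seconds,records,exit_code

# Append one line per run.
record_run() {
  local file=$1 duration=$2 records=$3 exit_code=$4
  printf '%s,%s,%s,%s\n' "$(date +%s)" "$duration" "$records" "$exit_code" >> "$file"
}

# Average duration of the last N successful runs (default 30), rounded down.
# Prints 0 when there are no successful runs yet.
average_duration() {
  local file=$1 n=${2:-30}
  awk -F, '$4 == 0' "$file" | tail -n "$n" |
    awk -F, '{ sum += $2 } END { if (NR > 0) print int(sum / NR); else print 0 }'
}
```

Comparing the latest duration against `average_duration` gives a baseline for "is this job getting slower?", and the same file can be filtered by exit code or timestamp for the other questions.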
Dashboard Design
A good cron job dashboard shows:
- Status grid - all jobs at a glance (green/yellow/red)
- Last run info - when, how long, what it reported
- Duration chart - execution time over days/weeks
- Failure history - when and why jobs failed
- Upcoming runs - what is scheduled next
Alerting Strategy
Not every metric needs an alert. Focus on actionable signals:
| Signal | Alert Level | Action |
|---|---|---|
| Job did not check in | Critical | Investigate immediately |
| Job reported failure | High | Check logs, may need manual intervention |
| Duration 2x normal | Warning | Monitor, investigate if persistent |
| Output below threshold | Warning | Check data source, may be upstream issue |
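When several signals fire for the same run, only the most severe one should page anyone. A minimal sketch of that precedence follows. The function name `alert_level` is hypothetical, and the slow-run and low-output flags are assumed to be computed elsewhere and passed in as 1 or 0.

```shell
#!/bin/bash
# Hypothetical helper: map one run's signals to the highest applicable alert level.
# Usage: alert_level CHECKED_IN EXIT_CODE SLOW LOW_OUTPUT   (flags are 1 or 0)
alert_level() {
  local checked_in=$1 exit_code=$2 slow=$3 low_output=$4
  if [ "$checked_in" -ne 1 ]; then
    echo critical        # job did not check in
  elif [ "$exit_code" -ne 0 ]; then
    echo high            # job reported failure
  elif [ "$slow" -eq 1 ] || [ "$low_output" -eq 1 ]; then
    echo warning         # slow run or output below threshold
  else
    echo none
  fi
}
```

Checking the signals in order of severity means a job that both failed and ran slowly raises one high-priority alert rather than two.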
Conclusion
Observability for cron jobs means going beyond "did it run?" to understand execution duration, output quality, and failure patterns. Track metrics over time, log structured output, and alert on actionable signals. The goal is to detect problems before they become incidents and to have enough context to fix them quickly when they do.