
Cron Job Observability: Metrics, Logs, and Traces for Scheduled Tasks

How to bring observability practices to your cron jobs. Track execution time, success rates, and output to understand what your scheduled tasks are actually doing.

CronGuard Team · 6 min read

Beyond "Did It Run?"

Knowing whether a cron job ran is the bare minimum. Real observability means understanding how it ran: how long it took, how much data it processed, whether it is getting slower over time, and what it did during execution.

The Three Pillars for Cron Jobs

1. Metrics

Track quantitative data about each run:

  • Execution duration - is the job getting slower?
  • Records processed - is throughput consistent?
  • Success/failure rate - is reliability declining?
  • Resource usage - CPU, memory, disk I/O during execution

#!/bin/bash
set -euo pipefail

START=$(date +%s)

# Do the work (this assumes the script prints the processed count as its last line)
PROCESSED=$(python3 process_orders.py 2>&1 | tail -1)

END=$(date +%s)
DURATION=$((END - START))

# Report metrics with check-in
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "duration=${DURATION}s, processed=${PROCESSED}"
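
The script above reports duration and records processed, but not resource usage. As a minimal sketch, CPU time can be captured with bash's built-in `time` keyword and the `TIMEFORMAT` variable. The helper name `run_with_cpu_stats` is hypothetical, and `sleep 1` stands in for the real job. Memory and disk I/O are not covered here and would need an external tool.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: run a command and print "real user sys" in seconds.
# The command's own stdout and stderr are discarded to keep the sketch short.
run_with_cpu_stats() {
  local TIMEFORMAT='%R %U %S'   # wall-clock, user CPU, system CPU
  { time "$@" >/dev/null 2>&1; } 2>&1
}

# Stand-in for the real job
STATS=$(run_with_cpu_stats sleep 1)
read -r REAL_S USER_S SYS_S <<<"$STATS"

echo "duration=${REAL_S}s, cpu_user=${USER_S}s, cpu_sys=${SYS_S}s"
```

The resulting string could be sent in the same check-in body as the duration. A job whose wall-clock time grows while CPU time stays flat is probably waiting on I/O or a lock rather than doing more work.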

2. Logs

Structured logging makes cron output searchable and parseable:

#!/bin/bash
LOG_FILE="/var/log/cron-jobs/order-processor.log"

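log_json() {
  local level=$1 msg=$2
  # jq builds the JSON so quotes and newlines in the message are escaped correctly
  jq -nc --arg ts "$(date -Iseconds)" --arg level "$level" --arg msg "$msg" \
    '{timestamp: $ts, level: $level, job: "order-processor", message: $msg}' >> "$LOG_FILE"
}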

log_json "info" "Starting order processing"
RESULT=$(python3 process_orders.py 2>&1) || {
  log_json "error" "Failed: $RESULT"
  exit 1
}
log_json "info" "Completed: $RESULT"

3. Traces (for Complex Jobs)

For jobs that call multiple services or have multiple phases, tracing shows where time is spent and where failures occur:

Phase 1: Fetch data from API       [2.3s] OK
Phase 2: Transform records         [0.8s] OK
Phase 3: Write to database         [1.2s] OK
Phase 4: Send notifications        [0.5s] FAILED - SMTP timeout
Phase 5: Upload report to S3       [SKIPPED]
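
A trace like the one above does not require a tracing system. The following is a rough sketch of one way to produce it from a shell script. The helper `run_phase` and the `FAILED` flag are hypothetical names, the phase commands are stand-ins, and `date +%s%N` assumes GNU date.

```shell
#!/bin/bash
set -euo pipefail

FAILED=0   # becomes 1 after the first failing phase

# Hypothetical helper: run one phase, print its duration and outcome,
# and mark every later phase as skipped once a phase has failed.
run_phase() {
  local name=$1; shift
  if [ "$FAILED" -eq 1 ]; then
    printf '%-34s [SKIPPED]\n' "$name"
    return 0
  fi
  local start end ms status
  start=$(date +%s%N)   # nanoseconds since the epoch (GNU date)
  if "$@"; then status="OK"; else status="FAILED"; FAILED=1; fi
  end=$(date +%s%N)
  ms=$(( (end - start) / 1000000 ))
  printf '%-34s [%d.%ds] %s\n' "$name" $((ms / 1000)) $(((ms % 1000) / 100)) "$status"
}

# Stand-in commands; a real job would call its own scripts here.
run_phase "Phase 1: Fetch data from API" sleep 0.2
run_phase "Phase 2: Transform records"   true
run_phase "Phase 3: Send notifications"  false
run_phase "Phase 4: Upload report to S3" true
```

A real job would finish with `exit "$FAILED"` so that the scheduler and any monitoring see the failure.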

Duration Tracking

Duration trends are one of the most valuable metrics. A backup that normally takes 5 minutes but gradually creeps up to 30 minutes is a warning sign of growing data, degrading disk performance, or increasing contention.

Setting Duration Alerts

Alert when execution time exceeds historical norms:

  • Warning: 2x average duration
  • Critical: 5x average duration or approaching the interval between runs
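
The thresholds above can be sketched as a small function. `classify_duration` is a hypothetical helper, all values are whole seconds, and "approaching the interval" is assumed here to mean at least 80% of it. Pick whatever margin suits your schedule.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: classify one run against its historical average.
# Arguments (whole seconds): duration, average duration, interval between runs.
classify_duration() {
  local duration=$1 average=$2 interval=$3
  # Assumption: "approaching the interval" means at least 80% of it.
  if [ "$duration" -ge $((average * 5)) ] || [ $((duration * 100)) -ge $((interval * 80)) ]; then
    echo "critical"
  elif [ "$duration" -ge $((average * 2)) ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

classify_duration 300 300 86400    # a 5-minute backup at its usual speed: ok
classify_duration 700 300 86400    # more than 2x the average: warning
classify_duration 1800 300 86400   # 6x the average: critical
classify_duration 3000 2000 3600   # only 1.5x, but 83% of an hourly interval: critical
```

The interval check matters because a job that runs into its next scheduled start can overlap with itself, even if it is not much slower than usual.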

Output Validation

Track what your jobs produce, not just whether they ran:

#!/bin/bash
set -euo pipefail

# Run the job and capture metrics
RESULT=$(python3 sync_orders.py --json-stats)

SYNCED=$(echo "$RESULT" | jq -r '.synced')
SKIPPED=$(echo "$RESULT" | jq -r '.skipped')
ERRORS=$(echo "$RESULT" | jq -r '.errors')

# Warn if the error count is too high
if [ "$ERRORS" -gt 10 ]; then
  echo "WARNING: $ERRORS errors during sync" >&2
fi

# Check in with metrics
curl -fsS https://cronguard.app/api/ping/abc123 \
  -d "synced=$SYNCED, skipped=$SKIPPED, errors=$ERRORS"
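
One gap in a script like this: if a field is missing from the JSON, `jq -r` prints `null` and the numeric comparison no longer means anything. A possible guard, sketched below, checks that each counter is a non-negative integer before comparing it. The helper names and the minimum-synced floor are assumptions for illustration. The error limit of 10 comes from the script above.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: succeed only if the value is a non-negative integer.
is_count() {
  [[ $1 =~ ^[0-9]+$ ]]
}

# Hypothetical helper: print a verdict for one run's counters.
# The third argument is an assumed floor below which the output looks suspicious.
validate_output() {
  local synced=$1 errors=$2 min_synced=${3:-1}
  if ! is_count "$synced" || ! is_count "$errors"; then
    echo "invalid"            # e.g. jq printed "null" for a missing field
  elif [ "$errors" -gt 10 ]; then
    echo "too_many_errors"
  elif [ "$synced" -lt "$min_synced" ]; then
    echo "below_threshold"
  else
    echo "ok"
  fi
}

validate_output 120 2 50     # healthy run
validate_output null 0 50    # stats field missing from the JSON
validate_output 3 0 50       # ran cleanly but synced almost nothing
```

The last case is the one a plain success check misses: the job exits cleanly while an upstream source has quietly gone empty.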

Historical Analysis

Store metrics over time to answer questions like:

  • Is this job getting slower? (performance degradation)
  • Is it processing fewer records? (data pipeline issue)
  • Does it fail on specific days? (pattern detection)
  • Did the last deploy affect job performance? (change correlation)
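
Answering these questions needs some history to query. As a minimal sketch, each run can append one line to a CSV file that `awk` summarizes later. The file layout and the helper names `record_run`, `average_duration`, and `failure_count` are assumptions for illustration. A temporary file is used here so the sketch runs anywhere; a real job would point `HISTORY_FILE` at a persistent path.

```shell
#!/bin/bash
set -euo pipefail

# Assumed CSV layout: epoch_seconds,duration_seconds,records,status
HISTORY_FILE="${HISTORY_FILE:-$(mktemp)}"

# Hypothetical helper: append one run to the history file.
record_run() {
  local duration=$1 records=$2 status=$3
  printf '%s,%s,%s,%s\n' "$(date +%s)" "$duration" "$records" "$status" >> "$HISTORY_FILE"
}

# Hypothetical helper: average duration of the last N recorded runs.
average_duration() {
  local n=${1:-30}
  tail -n "$n" "$HISTORY_FILE" | awk -F, '{ sum += $2 } END { if (NR) printf "%.1f\n", sum / NR }'
}

# Hypothetical helper: number of failed runs among the last N.
failure_count() {
  local n=${1:-30}
  tail -n "$n" "$HISTORY_FILE" | awk -F, '$4 != "ok" { failed++ } END { print failed + 0 }'
}

record_run 290 1200 ok
record_run 310 1180 ok
record_run 600 1210 failed

average_duration 3   # prints 400.0
failure_count 3      # prints 1
```

Comparing the average of the last few runs with a longer window is a simple way to spot a job that is getting slower.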

Dashboard Design

A good cron job dashboard shows:

  • Status grid - all jobs at a glance (green/yellow/red)
  • Last run info - when, how long, what it reported
  • Duration chart - execution time over days/weeks
  • Failure history - when and why jobs failed
  • Upcoming runs - what is scheduled next

Alerting Strategy

Not every metric needs an alert. Focus on actionable signals:

Signal                   Alert Level   Action
Job did not check in     Critical      Investigate immediately
Job reported failure     High          Check logs, may need manual intervention
Duration 3x normal       Warning       Monitor, investigate if persistent
Output below threshold   Warning       Check data source, may be upstream issue

Conclusion

Observability for cron jobs means going beyond "did it run?" to understand execution duration, output quality, and failure patterns. Track metrics over time, log structured output, and alert on actionable signals. The goal is to detect problems before they become incidents and to have enough context to fix them quickly when they do.
