Reliability

Anatomy of a Cron Job Disaster: A Real-World Incident Postmortem

A detailed walkthrough of a real cron job failure that silently stopped invoice delivery for 72 hours: what went wrong, how it was discovered, and the lessons learned.

CronGuard Team · 8 min read

The Setup

A mid-sized SaaS company runs a nightly cron job that processes customer invoices. Every night at 1 AM, the job pulls pending invoices from the database, generates PDFs, and emails them to customers. The job has been running reliably for 18 months.

There is no external monitoring. The team checks logs occasionally but relies on the assumption that "it has always worked."

Day 0 (Tuesday): The Silent Failure

A routine server update upgrades the system's OpenSSL library. The update looks unrelated to the invoice job, so nobody thinks to test the job afterward.

The update changes the default TLS behavior. The SMTP connection to the email provider now fails with a certificate verification error. But the invoice script wraps the send call in a blanket exception handler that discards the error and continues silently.

# The problematic code
try:
    send_email(invoice_pdf, customer_email)
except Exception:
    pass  # "We'll handle errors later"

The cron job runs. It "succeeds" (exit code 0). Invoices are generated but never sent. No log entry. No alert. No indication anything is wrong.

Day 1 (Wednesday): Business as Usual

The job runs again that night. Same result: invoices generated, emails silently failing. The team works on new features. Nobody checks the invoice system because nobody has reason to.

Day 2 (Thursday): First Customer Complaint

A customer emails support asking about their invoice. Support checks the admin panel, sees the invoice was "generated," and tells the customer it was sent. They assume it went to spam.

The customer does not find it in spam. Support escalates, but the ticket gets queued behind higher-priority items.

Day 3 (Friday): The Discovery

More customer complaints arrive. Support escalates to engineering. A developer checks the SMTP logs and discovers: zero emails sent since Tuesday night.

Three days of invoices. Hundreds of customers. All affected.

The Fix (Friday Afternoon)

  1. Identify the cause: OpenSSL update changed TLS defaults
  2. Update the SMTP configuration to use the correct CA bundle
  3. Re-run the invoice job for all missed invoices (Tuesday through Thursday)
  4. Send apology emails to affected customers
  5. Notify the finance team about delayed invoice processing

The Damage

  • 72 hours of invoices not delivered
  • 347 customers did not receive invoices on time
  • 23 support tickets before the issue was discovered
  • ~$12,000 in delayed payments due to late invoices
  • Customer trust erosion (harder to measure, arguably the worst impact)

Root Cause Analysis

Primary Cause: Silent Exception Handling

The except Exception: pass handler swallowed the error entirely. The script did not log the failure, did not alert, and exited with code 0.
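A minimal sketch of what the handler could look like instead, assuming the script loops over (invoice_pdf, customer_email) pairs and calls the send_email function from the snippet above. The send_all and exit_code names are hypothetical, and the loop shape is a guess at how the real script is organized.

```python
import logging

logger = logging.getLogger("invoice_processor")


def send_all(invoices, send_email):
    """Try every invoice, log each failure, and return (sent, failed) counts.

    `invoices` is an iterable of (invoice_pdf, customer_email) pairs and
    `send_email` is the script's existing send function; both shapes are
    assumptions made for this sketch.
    """
    sent = failed = 0
    for invoice_pdf, customer_email in invoices:
        try:
            send_email(invoice_pdf, customer_email)
            sent += 1
        except Exception:
            # logger.exception records the message plus the full traceback.
            logger.exception("Failed to send invoice to %s", customer_email)
            failed += 1
    return sent, failed


def exit_code(failed):
    """Map the failure count to an exit code that cron and monitors can see."""
    return 1 if failed else 0
```

If cron runs the script directly, finishing with sys.exit(exit_code(failed)) turns a run with failed sends into a non-zero exit. One failed email no longer stops the remaining invoices, but it also no longer disappears.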

Contributing Factor 1: No Monitoring

There was no dead man's switch, no output validation, and no metric tracking. The only way to detect failure was a human manually checking logs (which nobody did).
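A dead man's switch inverts the usual alerting model: the job pings a monitor when it finishes, and the monitor raises an alert when the expected ping does not arrive. A rough sketch of the check-in call, using only the standard library. The URL is the one used in the improved script later in this post; the check_in name, the opener parameter, and the timeout are assumptions for illustration.

```python
import urllib.request

# URL taken from the improved script in this post.
PING_URL = "https://cronguard.app/api/ping/invoice-monitor"


def check_in(message, failed=False, opener=urllib.request.urlopen, timeout=10):
    """POST a short status message to the monitor; use the /fail path on failure.

    `opener` is injectable so the call can be exercised without network access.
    """
    url = (PING_URL + "/fail") if failed else PING_URL
    request = urllib.request.Request(url, data=message.encode("utf-8"), method="POST")
    with opener(request, timeout=timeout) as response:
        return response.status
```

The check-in only helps if it runs after the job has verified its own output. In this incident the job exited with code 0 every night, so a ping sent unconditionally at the end of the run would have reported three healthy nights.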

Contributing Factor 2: No Output Validation

The job was considered successful if it ran without crashing. Nobody checked whether the emails were actually sent. A simple post-run check ("did we send X emails tonight?") would have caught this immediately.
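The post-run check can be very small. A sketch, assuming the job can report how many invoices it generated, sent, and failed to send (the validate_run name is hypothetical):

```python
def validate_run(generated, sent, failed):
    """Return a list of problems; an empty list means the run looks healthy."""
    problems = []
    if failed > 0:
        problems.append(f"{failed} invoices failed to send")
    if sent != generated:
        problems.append(f"sent {sent} emails for {generated} generated invoices")
    return problems
```

Comparing sent against generated matters more than the failure count alone. With the original exception handling, failures were never counted, so a check on failed would have passed while sent stayed at zero.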

Contributing Factor 3: Unrelated Change Impact

The OpenSSL update was not connected to the invoice system in anyone's mental model. Nobody thought to test email sending after a system library update.
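A smoke test run after system updates could exercise the same TLS path the invoice job depends on. The sketch below assumes the email provider accepts STARTTLS on port 587 (the post does not say which mode the job uses), and the smtp_tls_ok name and the smtp_factory parameter are hypothetical.

```python
import smtplib
import ssl


def smtp_tls_ok(host, port=587, smtp_factory=smtplib.SMTP, timeout=10):
    """Return (True, "") if STARTTLS with certificate verification succeeds.

    Returns (False, reason) otherwise. `smtp_factory` is injectable so the
    check can be tested without a real mail server.
    """
    # The default context verifies the server certificate and hostname.
    context = ssl.create_default_context()
    try:
        with smtp_factory(host, port, timeout=timeout) as server:
            server.starttls(context=context)
            server.noop()  # confirm the encrypted session is usable
        return True, ""
    except (ssl.SSLError, smtplib.SMTPException, OSError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

Run against the real provider right after the OpenSSL upgrade, a check like this would likely have surfaced the certificate verification error on Tuesday, hours before the nightly job ran.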

Remediation Actions

Immediate

  1. Replace all except: pass with proper error handling and logging
  2. Add a dead man's switch check-in at the end of the invoice job
  3. Add output validation: verify email send count matches invoice count

Short-term

  1. Add monitoring to all other cron jobs (backup, data sync, cleanup)
  2. Create a runbook for cron job failures
  3. Add the invoice system to the post-deploy smoke test checklist

Long-term

  1. Implement structured logging across all scheduled tasks
  2. Build a dashboard showing cron job health at a glance
  3. Schedule a quarterly "cron job audit" to review all scheduled tasks
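For the structured logging item, one possible shape is a JSON-lines formatter built on the standard logging module, so every scheduled task emits the same fields and a dashboard can parse them. The field names and the JsonFormatter and job_logger helpers are assumptions, not an existing convention of this system.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "job": record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra={"metrics": {...}} become record attributes.
        metrics = getattr(record, "metrics", None)
        if metrics is not None:
            payload["metrics"] = metrics
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def job_logger(job_name, stream=sys.stderr):
    """Return a logger for one scheduled task that writes JSON lines to `stream`."""
    logger = logging.getLogger(job_name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A job would then log its outcome with something like log.info("run complete", extra={"metrics": {"generated": 120, "sent": 120, "failed": 0}}), which gives the dashboard numbers to chart rather than free text to grep.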

The Improved Script

#!/bin/bash
set -euo pipefail

LOG="/var/log/invoice-processor.log"
exec >> "$LOG" 2>&1

echo "[$(date -Iseconds)] Starting invoice processing"

# Run the job and capture metrics
RESULT=$(python3 /app/process_invoices.py --json-output)

# -e makes jq exit non-zero on a missing (null) field, so set -e aborts the run
GENERATED=$(echo "$RESULT" | jq -er '.generated')
SENT=$(echo "$RESULT" | jq -er '.sent')
FAILED=$(echo "$RESULT" | jq -er '.failed')

echo "[$(date -Iseconds)] Generated: $GENERATED, Sent: $SENT, Failed: $FAILED"

# Validate: all generated invoices should be sent
if [ "$FAILED" -gt 0 ] || [ "$SENT" -ne "$GENERATED" ]; then
  echo "[$(date -Iseconds)] ERROR: sent $SENT of $GENERATED invoices ($FAILED failed)" >&2
  # Still check in, but report failure
  curl -fsS --retry 3 https://cronguard.app/api/ping/invoice-monitor/fail \
    -d "Generated: $GENERATED, Sent: $SENT, Failed: $FAILED"
  exit 1
fi

# Success check-in with metrics
curl -fsS --retry 3 https://cronguard.app/api/ping/invoice-monitor \
  -d "Generated: $GENERATED, Sent: $SENT"

echo "[$(date -Iseconds)] Invoice processing complete"
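The wrapper depends on process_invoices.py printing a JSON summary when called with --json-output. The post does not show that side, so the following is only a guess at its shape: the summarize and main names are hypothetical, and the results list of booleans stands in for the real per-invoice processing.

```python
import argparse
import json


def summarize(results):
    """Collapse per-invoice outcomes into the counts the wrapper script reads.

    `results` is a list of booleans, True when that invoice's email was sent.
    """
    sent = sum(1 for ok in results if ok)
    return {"generated": len(results), "sent": sent, "failed": len(results) - sent}


def main(argv, results):
    parser = argparse.ArgumentParser()
    parser.add_argument("--json-output", action="store_true")
    args = parser.parse_args(argv)
    summary = summarize(results)
    if args.json_output:
        # Only the JSON summary goes to stdout, so $(...) captures clean input for jq.
        print(json.dumps(summary))
    # Return 0 after a completed run, even with failed sends: the wrapper
    # decides success from the counts. A crash still exits non-zero.
    return 0
```

Returning 0 after a completed run is a deliberate choice in this sketch. Under set -e, a non-zero exit from the Python process would abort the wrapper before it reaches the validation step and the /fail check-in.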

Lessons

  1. Never use except: pass - always log, always alert, always fail loudly
  2. Monitor outcomes, not just execution - "the job ran" is not the same as "the job did its work"
  3. Assume cron jobs will fail - design for detection, not for perfection
  4. Seemingly unrelated changes break things - system updates affect applications in unpredictable ways
  5. Customer complaints are the worst monitoring system - if customers find the bug, you waited too long

Conclusion

This incident was entirely preventable. Proper exception handling would have logged the error and made the job exit non-zero. Output validation would have caught the zero-email problem on the first run. A dead man's switch that only receives a success check-in when those checks pass would have alerted the team on Tuesday night.

The cost of monitoring: a few minutes of setup. The cost of not monitoring: 72 hours of silent failure, hundreds of angry customers, and a Friday afternoon spent in damage control.
