Anatomy of a Cron Job Disaster: A Real-World Incident Postmortem

The Setup

A mid-sized SaaS company runs a nightly cron job that processes customer invoices. Every night at 1 AM, the job pulls pending invoices from the database, generates PDFs, and emails them to customers. The job has been running reliably for 18 months.

There is no external monitoring. The team checks logs occasionally but relies on the assumption that "it has always worked."

Day 0 (Tuesday): The Silent Failure

A routine server update upgrades the system's OpenSSL library. The update appears unrelated to the invoice job, so nobody thinks to check the job afterward.

The update changes the default TLS behavior. The SMTP connection to the email provider now fails with a certificate verification error. But the invoice script wraps the send call in a blanket exception handler, which catches the error and continues silently.

# The problematic code
try:
    send_email(invoice_pdf, customer_email)
except Exception:
    pass  # "We'll handle errors later"

The cron job runs. It "succeeds" (exit code 0). Invoices are generated but never sent. No log entry. No alert. No indication anything is wrong.

Day 1 (Wednesday): Business as Usual

The job runs again that night. Same result: invoices generated, emails silently failing. The team works on new features. Nobody checks the invoice system because nobody has reason to.

Day 2 (Thursday): First Customer Complaint

A customer emails support asking about their invoice. Support checks the admin panel, sees the invoice was "generated," and tells the customer it was sent. They assume it went to spam.

The customer does not find it in spam. Support escalates, but the ticket gets queued behind higher-priority items.

Day 3 (Friday): The Discovery

More customer complaints arrive. Support escalates to engineering. A developer checks the SMTP logs and discovers: zero emails sent since Tuesday night.

Three days of invoices. Hundreds of customers. All affected.

The Fix (Friday Afternoon)

Identify the cause: OpenSSL update changed TLS defaults
Update the SMTP configuration to use the correct CA bundle
Re-run the invoice job for all missed invoices (Tuesday through Thursday)
Send apology emails to affected customers
Notify the finance team about delayed invoice processing

The Damage

72 hours of invoices not delivered
347 customers did not receive invoices on time
23 support tickets before the issue was discovered
~$12,000 in delayed payments due to late invoices
Customer trust erosion (harder to measure, arguably the worst impact)

Root Cause Analysis

Primary Cause: Silent Exception Handling

The blanket except Exception: pass handler swallowed the error entirely. The script did not log the failure, did not alert anyone, and exited with code 0.
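One way the send loop could fail loudly instead is sketched below. It is a minimal illustration, not the team's actual fix: send_all, exit_code, and the shape of the invoices argument are hypothetical, and send_email stands in for the function used in the problematic snippet.

```python
import logging

logger = logging.getLogger("invoice_job")


def send_all(invoices, send_email):
    """Try every invoice, log each failure with its traceback, and return counts.

    `invoices` is assumed to be an iterable of (invoice_pdf, customer_email)
    pairs; `send_email` is whatever send function the script already uses.
    """
    sent = 0
    failed = 0
    for invoice_pdf, customer_email in invoices:
        try:
            send_email(invoice_pdf, customer_email)
            sent += 1
        except Exception:
            # Record the full traceback instead of discarding it.
            logger.exception("Failed to send invoice to %s", customer_email)
            failed += 1
    return sent, failed


def exit_code(failed):
    """Map the failure count to a process exit status: non-zero if anything failed."""
    return 1 if failed else 0
```

A standalone script could finish with sys.exit(exit_code(failed)) so that cron sees a non-zero status. One failed email no longer hides the rest: the loop keeps going, but every failure is logged and counted.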

Contributing Factor 1: No Monitoring

There was no dead man's switch, no output validation, and no metric tracking. The only way to detect failure was a human manually checking logs (which nobody did).
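The idea behind a dead man's switch can be sketched from the monitor's side: the job checks in after each run, and the monitor raises an alert when a check-in is overdue. The function name, the 24-hour interval, and the one-hour grace period below are illustrative assumptions, not details of any particular monitoring service.

```python
from datetime import datetime, timedelta, timezone


def is_overdue(last_check_in, now,
               expected_every=timedelta(hours=24),
               grace=timedelta(hours=1)):
    """Return True when the job has not checked in within its expected window.

    `last_check_in` and `now` are timezone-aware datetimes. The defaults model
    a nightly job with a one-hour grace period (both values are assumptions).
    """
    return now - last_check_in > expected_every + grace


if __name__ == "__main__":
    last = datetime.now(timezone.utc) - timedelta(hours=30)
    print(is_overdue(last, datetime.now(timezone.utc)))
```

Note that a check-in only proves the job reached the line that sends it, which is why the remediation pairs it with output validation.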

Contributing Factor 2: No Output Validation

The job was considered successful if it ran without crashing. Nobody checked whether the emails were actually sent. A simple post-run check ("did we send X emails tonight?") would have caught this immediately.
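The post-run check described above might look like the following sketch. The function name and messages are hypothetical, the counts are assumed to come from the job itself, and whether a night with zero pending invoices should also alert is a business decision left out here.

```python
def validate_run(generated, sent):
    """Return a list of problems found in a run's counts; empty means healthy."""
    problems = []
    if generated > 0 and sent == 0:
        # The exact failure mode in this incident: PDFs created, nothing delivered.
        problems.append(f"generated {generated} invoices but sent 0 emails")
    elif sent != generated:
        problems.append(f"sent {sent} emails for {generated} invoices")
    return problems


if __name__ == "__main__":
    for problem in validate_run(generated=120, sent=0):
        print("ALERT:", problem)
```

Returning a list rather than a boolean keeps the reason for the alert available to whatever sends the notification.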

Contributing Factor 3: Unrelated Change Impact

The OpenSSL update was not connected to the invoice system in anyone's mental model. Nobody thought to test email sending after a system library update.
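One way to close that gap is a small smoke test that runs after system updates and attempts a verified TLS handshake with the mail server. This is a sketch under assumptions: the function names are hypothetical, port 587 with STARTTLS is assumed rather than taken from the incident, and the host and CA bundle path must come from your own configuration.

```python
import smtplib
import ssl


def build_tls_context(cafile=None):
    """Create a client TLS context that verifies the server certificate and hostname.

    With cafile=None the system's default CA store is used; pass a path to
    pin a specific CA bundle.
    """
    return ssl.create_default_context(cafile=cafile)


def smtp_tls_ok(host, port=587, cafile=None, timeout=10):
    """Return (True, "") if STARTTLS succeeds with verification, else (False, reason)."""
    context = build_tls_context(cafile)
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as server:
            # Raises ssl.SSLCertVerificationError if the certificate cannot be verified.
            server.starttls(context=context)
        return True, ""
    except (ssl.SSLError, smtplib.SMTPException, OSError) as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

Run against the email provider's hostname right after a library upgrade, a check like this would likely surface a certificate verification error before the nightly job does.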

Remediation Actions

Immediate

Replace all except: pass with proper error handling and logging
Add a dead man's switch check-in at the end of the invoice job
Add output validation: verify email send count matches invoice count

Short-term

Add monitoring to all other cron jobs (backup, data sync, cleanup)
Create a runbook for cron job failures
Add the invoice system to the post-deploy smoke test checklist

Long-term

Implement structured logging across all scheduled tasks
Build a dashboard showing cron job health at a glance
Schedule a quarterly "cron job audit" to review all scheduled tasks
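As a starting point for the structured logging item, the sketch below emits one JSON object per log line using only the Python standard library. The class name, function name, and field names are assumptions chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "job": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback so failures are searchable later.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_job_logger(job_name):
    """Return a logger for one scheduled task that writes JSON lines to stderr."""
    logger = logging.getLogger(job_name)
    if not logger.handlers:
        handler = logging.StreamHandler()  # StreamHandler defaults to stderr
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Using the job name as the logger name gives every scheduled task the same fields, which makes a shared dashboard or log query straightforward.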

The Improved Script

#!/bin/bash
set -euo pipefail

LOG="/var/log/invoice-processor.log"
exec >> "$LOG" 2>&1

echo "[$(date -Iseconds)] Starting invoice processing"

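# Run the job and capture metrics. If the script crashes, set -e aborts here
# and the missing check-in is what trips the dead man's switch.
RESULT=$(python3 /app/process_invoices.py --json-output)

# jq -e exits non-zero when a field is missing or null, so malformed output
# stops the run instead of slipping past the numeric checks below
GENERATED=$(echo "$RESULT" | jq -er '.generated')
SENT=$(echo "$RESULT" | jq -er '.sent')
FAILED=$(echo "$RESULT" | jq -er '.failed')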

echo "[$(date -Iseconds)] Generated: $GENERATED, Sent: $SENT, Failed: $FAILED"

# Validate: all generated invoices should be sent
if [ "$FAILED" -gt 0 ] || [ "$SENT" -ne "$GENERATED" ]; then
  echo "[$(date -Iseconds)] ERROR: sent $SENT of $GENERATED invoices ($FAILED failed)" >&2
  # Still check in, but report failure
  curl -fsS --retry 3 https://cronguard.app/api/ping/invoice-monitor/fail \
    -d "Generated: $GENERATED, Sent: $SENT, Failed: $FAILED"
  exit 1
fi

# Success check-in with metrics
curl -fsS --retry 3 https://cronguard.app/api/ping/invoice-monitor \
  -d "Generated: $GENERATED, Sent: $SENT"

echo "[$(date -Iseconds)] Invoice processing complete"

Lessons

Never use except: pass - always log, always alert, always fail loudly
Monitor outcomes, not just execution - "the job ran" is not the same as "the job did its work"
Assume cron jobs will fail - design for detection, not for perfection
Seemingly unrelated changes break things - system updates affect applications in unpredictable ways
Customer complaints are the worst monitoring system - if customers find the bug, you waited too long

Frequently asked questions about this cron job incident

What actually broke the invoice job? A routine server update upgraded the system's OpenSSL library, which changed the default TLS behavior. The SMTP connection to the email provider then failed with a certificate verification error, but the invoice script used a Python library call wrapped in an exception handler that caught the error and continued silently, so the job still exited with code 0.

Why did nobody notice for three days? There was no external monitoring: no dead man's switch, no output validation, and no metric tracking. The only way to detect the failure was a human manually checking logs, which nobody did, so the first signal was a customer emailing support on Thursday and the real discovery came on Friday when a developer looked at the SMTP logs.

What did the outage cost? Seventy-two hours of invoices were never delivered, 347 customers did not receive their invoices on time, 23 support tickets were opened before the issue was found, roughly 12,000 dollars in payments were delayed, and there was customer trust erosion that is harder to measure but arguably the worst impact.

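Would monitoring have caught this? Yes, provided it checked outcomes. Output validation comparing the email send count against the invoice count would have caught the zero-email problem on Tuesday night, and proper exception handling would have logged the error and alerted the team. A dead man's switch on its own would not have fired, because the job still ran to completion and exited with code 0; it only helps here when the check-in is tied to the validation result, as in the improved script.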

Why is "the job ran" not the same as "the job worked"? The job was considered successful simply because it ran without crashing, so nobody checked whether the emails were actually sent. Monitoring outcomes rather than execution means asking a question like "did we send X emails tonight?" after every run, which is exactly the post-run check that was missing here.