This post describes real problems I encountered and solved. Names, specific systems, and identifying details have been changed. The technical problems and approaches are real, but the code snippets are simplified illustrations, not production code.
The Monday Morning Surprise
The script was simple. Pull data from three APIs. Clean it. Update the database. Send a summary email.
It ran every night at 2am. Worked perfectly for weeks.
Until the night API #2 went down.
No email arrived. No error alert. Just silence.
Monday morning: "Why is the dashboard showing last week's data?"
That's when I learned: a script that runs isn't the same as a script that works.
Problem 1: Silent Failures
Celery's default: if a task fails, it logs the error and moves on. At 2am, nobody's watching logs.
The fix: Make failures loud.
```python
from celery.signals import task_failure

@task_failure.connect
def alert_on_failure(sender=None, task_id=None, exception=None, **kwargs):
    send_slack_alert(f"Task {sender.name} failed: {exception}")
    send_email_alert(f"Task {sender.name} failed", str(exception))
```

Now when something breaks at 2am, my phone buzzes. Not ideal for sleep, but better than finding out Monday.
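The snippet leans on two helpers the post doesn't show. Here's a minimal sketch of what they could look like, assuming a Slack incoming webhook and an SMTP relay you can reach; the addresses and environment variable names are placeholders:

```python
import os
import smtplib
from email.message import EmailMessage

import requests

def send_slack_alert(text):
    # Post to a Slack incoming-webhook URL taken from the environment.
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)

def send_email_alert(subject, body):
    # Send through whatever SMTP relay you have; swap in your provider's settings.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP(os.environ.get("SMTP_HOST", "localhost")) as smtp:
        smtp.send_message(msg)
```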
Problem 2: Transient Failures
Not all failures are real failures.
Network blip. API rate limit. Database timeout during restart. These aren't bugs; they're temporary. Retry in a minute and everything works.
The naive fix: Retry every 60 seconds.
The problem: When the API is overloaded, ten tasks all retrying at exactly 60 seconds hit it simultaneously. Another rate limit. Retry storm.
The real fix: Exponential backoff with jitter.
```python
from celery import shared_task

@shared_task(
    bind=True,
    max_retries=5,
    autoretry_for=(ConnectionError, TimeoutError),
    retry_backoff=True,      # 1s, 2s, 4s, 8s, 16s...
    retry_backoff_max=600,   # Cap at 10 minutes
    retry_jitter=True,       # Add randomness
)
def fetch_external_data(self, source):
    return fetch_from_api(source)
```

The jitter is key. Instead of all tasks retrying at exactly 60 seconds, they retry at 57, 63, 61, 59... The randomness spreads the load.
Problem 3: Partial Failures
What if API A succeeds, API B fails, and API C never runs?
With naive error handling, you get nothing. The whole task fails. But API A's data was valid! Why throw it away?
The fix: Independent tasks with graceful merging.
```python
from celery import chord, group, shared_task

def nightly_sync():
    workflow = chord(
        # Run in parallel - each can fail independently
        group(
            fetch_api_a.s(),
            fetch_api_b.s(),
            fetch_api_c.s(),
        ),
        # Merge whatever succeeded
        merge_results.s(),
    )
    workflow.apply_async()

@shared_task
def merge_results(results):
    # results = [data_a, None, data_c] if B failed
    successful = [r for r in results if r is not None]
    if not successful:
        raise ValueError("All sources failed")
    save_to_database(successful)

    failed_sources = [i for i, r in enumerate(results) if r is None]
    if failed_sources:
        send_alert(f"Partial sync - missing sources: {failed_sources}")
```

Now if B fails, A and C still get saved. The email says "Partial sync." Partial data beats no data.
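One wrinkle the snippet glosses over: by default, if a header task raises, Celery marks the chord body as failed instead of calling it with partial results, so merge_results never sees that [data_a, None, data_c] list. For the pattern to work, each fetch task has to catch its own errors and return None. A sketch, using the same hypothetical fetch_from_api helper as in Problem 2:

```python
from celery import shared_task

@shared_task
def fetch_api_b():
    try:
        return fetch_from_api("api_b")  # hypothetical helper, as in Problem 2
    except (ConnectionError, TimeoutError):
        # Swallow the error so the chord body still runs;
        # merge_results will see None for this source.
        return None
```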
Problem 4: Long-Running Tasks That Die
Some tasks take 30+ minutes. Worker restarts. Server runs out of memory. Network drops.
If the task dies at record 5,000 of 10,000, what happens?
Naive: Restart from zero. Redo 5,000 records. Waste time, maybe cause duplicates.
The fix: Checkpoint progress.
```python
import redis

from celery import shared_task

# decode_responses=True so hgetall returns str keys/values instead of bytes
redis_client = redis.Redis(decode_responses=True)

@shared_task(bind=True)
def process_large_batch(self, batch_id):
    checkpoint_key = f"checkpoint:{self.request.id}"

    # Resume from checkpoint if it exists ('last_index' = last completed record)
    checkpoint = redis_client.hgetall(checkpoint_key)
    start_index = int(checkpoint.get('last_index', -1)) + 1

    records = get_batch(batch_id)
    for i, record in enumerate(records[start_index:], start=start_index):
        process_record(record)

        # Checkpoint every 100 records
        if i % 100 == 0:
            redis_client.hset(checkpoint_key, 'last_index', i)
            redis_client.expire(checkpoint_key, 86400)  # 24h TTL

    # Success - clean up checkpoint
    redis_client.delete(checkpoint_key)
```

If the task dies at record 5,000 and restarts, it reads the checkpoint: "Last completed: 4,900." It resumes from 4,901, so at worst the last 99 un-checkpointed records get redone instead of all 5,000.
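Checkpointing only helps if the task actually runs again after the worker dies. Out of the box, Celery acknowledges a task just before executing it, so a task killed mid-run is simply lost. Two settings worth considering (whether they fit depends on your broker and workload; `app` here is your Celery instance):

```python
# Ack only after the task finishes, and re-queue tasks whose worker died mid-run.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
```

The trade-off: a task that reliably kills its worker (say, by running out of memory) will be re-delivered and killed again, so keep the failure alerts from Problem 1 in place.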
Problem 5: Missed Schedules
Celery Beat says "run at 2am." But what if Celery is down at 2am?
Deployment. Server restart. Brief outage. When Celery comes back, it doesn't automatically run missed tasks. They're just gone.
The fix: Idempotent tasks with catchup logic.
```python
from datetime import datetime, timedelta

from celery import shared_task

@shared_task
def daily_sync():
    today = datetime.now().date()

    # Already ran today? (SyncLog: ORM model used as the audit trail - date, status, count)
    if SyncLog.objects.filter(date=today, status='success').exists():
        return {'status': 'skipped', 'reason': 'already_ran'}

    # Missed yesterday?
    yesterday = today - timedelta(days=1)
    if not SyncLog.objects.filter(date=yesterday, status='success').exists():
        run_sync_for_date(yesterday)  # Catchup

    # Run today's sync
    result = run_sync_for_date(today)
    SyncLog.objects.create(date=today, status='success', count=result['count'])
    return result
```

Safe to run multiple times (skips if done). Self-healing (catches up missed runs). Always has an audit trail.
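For completeness, this is roughly what the Celery Beat entry looks like, assuming `app` is your Celery instance and using a hypothetical module path for the task:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    'daily-sync': {
        'task': 'myproject.tasks.daily_sync',   # hypothetical module path
        'schedule': crontab(hour=2, minute=0),  # 2am, matching the post
    },
}
```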
Results
| Metric | Before | After |
|---|---|---|
| Silent failures | Unknown | 0 |
| Time to detect failure | 1-3 days | 5 minutes |
| Data loss from partial failures | 100% | 0% |
| Recovery from transient errors | 0% | ~95% |
I stopped dreading Monday mornings.
Lessons Learned
- Failures are normal. APIs go down. Networks blip. Design for failure from day one.
- Exponential backoff + jitter. Prevents retry storms. Lets overloaded systems recover.
- Checkpoint long tasks. If it takes more than a few minutes, save progress. Resume beats restart.
- Make tasks idempotent. Safe to run twice. Catches up missed runs. Always know what happened.
- Make failures loud. If you can't see it break, you can't fix it.
Related Reading
- Turning a 4-Hour Report Into a Button Click - The automation this scheduling supports
- When the Frontend Sends a Query as a String - Another internal tool
- Dashboards Leadership Actually Uses - Monitoring these jobs
