This post describes real problems I encountered and solved. Names, specific systems, and identifying details have been changed. The technical problems and approaches are real, but the code snippets are simplified illustrations, not production code.
The Monday Morning Surprise
The script was simple. Pull data from three APIs. Clean it. Update the database. Send a summary email.
It ran every night at 2am. Worked perfectly for weeks.
Until the night API #2 went down.
No email arrived. No error alert. Just silence.
Monday morning: "Why is the dashboard showing last week's data?"
That's when I learned: a script that runs isn't the same as a script that works.
Problem 1: Silent Failures
Celery's default: if a task fails, it logs the error and moves on. At 2am, nobody's watching logs.
The fix: Make failures loud.
```python
from celery.signals import task_failure

@task_failure.connect
def alert_on_failure(sender=None, task_id=None, exception=None, **kwargs):
    send_slack_alert(f"Task {sender.name} failed: {exception}")
    send_email_alert(f"Task {sender.name} failed", str(exception))
```

Now when something breaks at 2am, my phone buzzes. Not ideal for sleep, but better than finding out Monday.
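The snippet leans on two helpers the post doesn't show. Here's a minimal sketch of what they could look like, assuming a Slack incoming webhook and an SMTP relay you can reach; the addresses and environment variable names are placeholders:

```python
import os
import smtplib
from email.message import EmailMessage

import requests

def send_slack_alert(text):
    # Post to a Slack incoming-webhook URL taken from the environment.
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": text}, timeout=10)

def send_email_alert(subject, body):
    # Send through whatever SMTP relay you have; swap in your provider's settings.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content(body)
    with smtplib.SMTP(os.environ.get("SMTP_HOST", "localhost")) as smtp:
        smtp.send_message(msg)
```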
Problem 2: Transient Failures
Not all failures are real failures.
Network blip. API rate limit. Database timeout during restart. These aren't bugs; they're temporary. Retry in a minute and everything works.
The naive fix: Retry every 60 seconds.
The problem: When the API is overloaded, ten tasks all retrying at exactly 60 seconds hit it simultaneously. Another rate limit. Retry storm.
The real fix: Exponential backoff with jitter.
```python
from celery import shared_task

@shared_task(
    bind=True,
    max_retries=5,
    autoretry_for=(ConnectionError, TimeoutError),
    retry_backoff=True,      # 1s, 2s, 4s, 8s, 16s...
    retry_backoff_max=600,   # Cap at 10 minutes
    retry_jitter=True,       # Add randomness
)
def fetch_external_data(self, source):
    return fetch_from_api(source)
```

The jitter is key. Instead of all tasks retrying at exactly 60 seconds, they retry at 57, 63, 61, 59... The randomness spreads the load.
Problem 3: Partial Failures
What if API A succeeds, API B fails, and API C never runs?
With naive error handling, you get nothing. The whole task fails. But API A's data was valid! Why throw it away?
The fix: Independent tasks with graceful merging.
```python
from celery import chord, group, shared_task

def nightly_sync():
    workflow = chord(
        # Run in parallel - each can fail independently
        group(
            fetch_api_a.s(),
            fetch_api_b.s(),
            fetch_api_c.s(),
        ),
        # Merge whatever succeeded
        merge_results.s(),
    )
    workflow.apply_async()

@shared_task
def merge_results(results):
    # results = [data_a, None, data_c] if B failed
    successful = [r for r in results if r is not None]
    if not successful:
        raise ValueError("All sources failed")
    save_to_database(successful)

    failed_sources = [i for i, r in enumerate(results) if r is None]
    if failed_sources:
        send_alert(f"Partial sync - missing sources: {failed_sources}")
```

Now if B fails, A and C still get saved. The email says "Partial sync." Partial data beats no data.
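One wrinkle the snippet glosses over: by default, if a header task raises, Celery marks the chord body as failed instead of calling it with partial results, so merge_results never sees that [data_a, None, data_c] list. For the pattern to work, each fetch task has to catch its own errors and return None. A sketch, using the same hypothetical fetch_from_api helper as in Problem 2:

```python
from celery import shared_task

@shared_task
def fetch_api_b():
    try:
        return fetch_from_api("api_b")  # hypothetical helper, as in Problem 2
    except (ConnectionError, TimeoutError):
        # Swallow the error so the chord body still runs;
        # merge_results will see None for this source.
        return None
```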
Problem 4: Long-Running Tasks That Die
Some tasks take 30+ minutes. Worker restarts. Server runs out of memory. Network drops.
If the task dies at record 5,000 of 10,000, what happens?
Naive: Restart from zero. Redo 5,000 records. Waste time, maybe cause duplicates.
The fix: Checkpoint progress.
```python
import redis

from celery import shared_task

# decode_responses=True so hgetall returns str keys/values instead of bytes
redis_client = redis.Redis(decode_responses=True)

@shared_task(bind=True)
def process_large_batch(self, batch_id):
    checkpoint_key = f"checkpoint:{self.request.id}"

    # Resume from checkpoint if it exists ('last_index' = last completed record)
    checkpoint = redis_client.hgetall(checkpoint_key)
    start_index = int(checkpoint.get('last_index', -1)) + 1

    records = get_batch(batch_id)
    for i, record in enumerate(records[start_index:], start=start_index):
        process_record(record)

        # Checkpoint every 100 records
        if i % 100 == 0:
            redis_client.hset(checkpoint_key, 'last_index', i)
            redis_client.expire(checkpoint_key, 86400)  # 24h TTL

    # Success - clean up checkpoint
    redis_client.delete(checkpoint_key)
```

If the task dies at record 5,000 and restarts, it reads the checkpoint: "Last completed: 4,900." It resumes from 4,901, so at worst the last 99 un-checkpointed records get redone instead of all 5,000.
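Checkpointing only helps if the task actually runs again after the worker dies. Out of the box, Celery acknowledges a task just before executing it, so a task killed mid-run is simply lost. Two settings worth considering (whether they fit depends on your broker and workload; `app` here is your Celery instance):

```python
# Ack only after the task finishes, and re-queue tasks whose worker died mid-run.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True
```

The trade-off: a task that reliably kills its worker (say, by running out of memory) will be re-delivered and killed again, so keep the failure alerts from Problem 1 in place.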
Problem 5: Missed Schedules
Celery Beat says "run at 2am." But what if Celery is down at 2am?
Deployment. Server restart. Brief outage. When Celery comes back, it doesn't automatically run missed tasks. They're just gone.
The fix: Idempotent tasks with catchup logic.
```python
from datetime import datetime, timedelta

from celery import shared_task

@shared_task
def daily_sync():
    today = datetime.now().date()

    # Already ran today? (SyncLog: ORM model used as the audit trail - date, status, count)
    if SyncLog.objects.filter(date=today, status='success').exists():
        return {'status': 'skipped', 'reason': 'already_ran'}

    # Missed yesterday?
    yesterday = today - timedelta(days=1)
    if not SyncLog.objects.filter(date=yesterday, status='success').exists():
        run_sync_for_date(yesterday)  # Catchup

    # Run today's sync
    result = run_sync_for_date(today)
    SyncLog.objects.create(date=today, status='success', count=result['count'])
    return result
```

Safe to run multiple times (skips if done). Self-healing (catches up missed runs). Always has an audit trail.
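For completeness, this is roughly what the Celery Beat entry looks like, assuming `app` is your Celery instance and using a hypothetical module path for the task:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    'daily-sync': {
        'task': 'myproject.tasks.daily_sync',   # hypothetical module path
        'schedule': crontab(hour=2, minute=0),  # 2am, matching the post
    },
}
```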
Results
| Metric | Before | After |
|---|---|---|
| Silent failures | Unknown | 0 |
| Time to detect failure | 1-3 days | 5 minutes |
| Data loss from partial failures | 100% | 0% |
| Recovery from transient errors | 0% | ~95% |
I stopped dreading Monday mornings.
Lessons Learned
- Failures are normal. APIs go down. Networks blip. Design for failure from day one.
- Exponential backoff + jitter. Prevents retry storms. Lets overloaded systems recover.
- Checkpoint long tasks. If it takes more than a few minutes, save progress. Resume beats restart.
- Make tasks idempotent. Safe to run twice. Catches up missed runs. Always know what happened.
- Make failures loud. If you can't see it break, you can't fix it.
Related Reading
- Turning a 4-Hour Report Into a Button Click - The automation this scheduling supports
- When the Frontend Sends a Query as a String - Another internal tool
- Dashboards Leadership Actually Uses - Monitoring these jobs
