The Mystery
Every morning for a week, I woke up to alerts. My background workers were dead. Thousands of emails stuck in a queue. Users complaining.
The weird part? It always happened around 3am. Never during the day.
Let me explain what was happening and why - in simple terms.
What Is a "Worker" Anyway?
Think of a worker like a factory employee.
YOUR APP                                    THE WORKER
┌──────────┐                               ┌──────────┐
│          │    "Send this email"          │          │
│ Website  │ ────────────────────────────► │  Worker  │
│          │                               │          │
└──────────┘                               └──────────┘
You (the website) give tasks to the worker.
The worker does them in the background.
You don't wait - you move on immediately.

Why use workers?
Imagine you're at a restaurant:
- Without workers: You order food, then stand at the kitchen door waiting 20 minutes
- With workers: You order food, sit down, and someone brings it when ready
Workers let your website stay fast while slow stuff (emails, reports, image processing) happens in the background.
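If you're curious what that hand-off looks like in code, here's a minimal sketch with Celery (the task queue this whole story is about). The Redis URL and the task itself are just placeholders:

# tasks.py - a minimal sketch; the broker URL and task body are placeholders
from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')

@app.task
def send_welcome_email(user_id):
    # The slow stuff happens here, in the background
    print(f"Sending welcome email to user {user_id}")

# In your website code, you hand off the task and move on immediately:
send_welcome_email.delay(42)  # returns instantly; a worker picks it up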
The Problem - Workers Get "Tired"
Here's what I didn't understand at first:
A worker is just a program running on a computer. And programs use memory.
Think of memory like a desk:
FRESH WORKER (just started)
┌─────────────────────────────────────────────┐
│                                             │
│                Clean desk!                  │
│          Plenty of space to work.           │
│                                             │
│                    📋                       │
│                (one task)                   │
│                                             │
└─────────────────────────────────────────────┘

Every time the worker does a task, it uses some desk space. When it's done, it should clean up.
But here's the problem: It doesn't clean up perfectly.
AFTER 1,000 TASKS
┌─────────────────────────────────────────────┐
│  📄 📋 📝 📄 📋 📄 📝 📋 📄 📝 📋 📄 📋    │
│  📝 📄 📋 📝 📄 📋 📝 📄 📋 📝 📄 📋 📝    │
│                                             │
│                Messy desk!                  │
│           Old papers piling up.             │
│          Still works, but slower.           │
│                                             │
└─────────────────────────────────────────────┘
AFTER 100,000 TASKS (3am)
┌─────────────────────────────────────────────┐
│ 📄📋📝📄📋📄📝📋📄📝📋📄📋📝📄📋📝📄📋📝 │
│ 📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝 │
│ 📄📋📝📄📋📄📝📋📄📝📋📄📋📝📄📋📝📄📋📝 │
│ 📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝 │
│                                             │
│          💀 DESK OVERFLOW! CRASH!           │
│                                             │
└─────────────────────────────────────────────┘

Why 3am? Because by then, the worker had been running for ~18 hours, processing hundreds of thousands of tasks. The "desk" finally overflowed.
Why Doesn't Python Clean Up?
You might think: "But Python has garbage collection! It should clean up automatically!"
You're right. But here's the thing most people don't realize:
Python cleans up for Python. Not for the operating system.
Let me explain with an analogy:
IMAGINE A FILING CABINET

┌─────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │ │ B │ │ C │ │   │ │  ← Cabinet has 4 drawers
│ └───┘ └───┘ └───┘ └───┘ │    (memory from the operating system)
└─────────────────────────┘

You store files A, B, C.

Now delete file B:

┌─────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │ │ 🗑 │ │ C │ │   │ │  ← Drawer B is now EMPTY
│ └───┘ └───┘ └───┘ └───┘ │    but still EXISTS
└─────────────────────────┘

Python: "Great! I can reuse drawer B!"
Operating System: "I still see 4 drawers being used."

The drawer is empty, but the cabinet is still there. Python won't return that drawer to the OS. It keeps it "just in case."
Over time:
- Python keeps requesting more drawers
- Python empties old drawers but doesn't return them
- Eventually, the room fills with empty cabinets
- Crash!
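You can watch this happen yourself. Here's a small experiment using psutil (the same library I use for monitoring later). Exact numbers vary by platform and Python version, but the pattern holds:

import gc
import psutil

def rss_mb():
    # Resident Set Size: memory the OS says this process is using
    return psutil.Process().memory_info().rss / 1024 / 1024

print(f"Baseline:   {rss_mb():.0f} MB")

drawers = [object() for _ in range(5_000_000)]
print(f"Allocated:  {rss_mb():.0f} MB")

keep = drawers[::100]  # hold on to 1 object in 100 (files left in drawers)
del drawers
gc.collect()           # Python has now "emptied the drawers"...

# ...but the OS still sees most of that memory in use, because
# partially-occupied cabinets can't be returned.
print(f"After free: {rss_mb():.0f} MB (nowhere near baseline)")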
What Was Actually Leaking?
In my case, three things were piling up:
Database Connections
Every time my worker talked to the database, it opened a connection. Like making a phone call.
TASK 1: "Hey database, get me email #123"
        → Opens phone line 1 ☎️

TASK 2: "Hey database, get me email #456"
        → Opens phone line 2 ☎️ (didn't hang up line 1!)

TASK 3: "Hey database, get me email #789"
        → Opens phone line 3 ☎️ (still not hanging up!)

...after 1000 tasks...

☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️
ALL LINES BUSY! CAN'T MAKE NEW CALLS!

Cached Data
My code was "remembering" things to be faster:
# This looks innocent
saved_templates = {}

def get_template(name):
    if name not in saved_templates:
        saved_templates[name] = load_from_disk(name)
    return saved_templates[name]

But over 100,000 tasks with different templates:
Hour 1: saved_templates = {template1}
Hour 5: saved_templates = {template1, template2, ..., template50}
Hour 10: saved_templates = {template1, template2, ..., template200}
Hour 18: saved_templates = {template1, template2, ..., template2000} 💀

It never forgot anything!
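To make that scale concrete, here's a rough simulation of the same cache. The 100 KB template size is an assumption I picked for illustration:

import sys

saved_templates = {}

def get_template(name):
    if name not in saved_templates:
        saved_templates[name] = "x" * 100_000  # pretend 100 KB template body
    return saved_templates[name]

# Simulate 2,000 distinct template names arriving over 18 hours
for i in range(2000):
    get_template(f"template{i}")

size_mb = sum(sys.getsizeof(v) for v in saved_templates.values()) / 1024 / 1024
print(f"{len(saved_templates)} templates cached, ~{size_mb:.0f} MB and climbing")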
Memory Fragmentation
Even when Python "freed" memory, the space was fragmented:
CLEAN MEMORY:
[████████████████████████████████]
One big block - nice!

FRAGMENTED MEMORY (after many tasks):
[█░█░░█░█░░░█░█░░█░░░█░█░░█░█░░█]
  ↑   ↑     ↑
 Used Empty Used

Python: "I have lots of empty space!"
Reality: "It's all tiny holes. Can't fit anything big."

The Solution (It's Embarrassingly Simple)
After a week of debugging, the fix was one line of configuration (see the official Celery documentation on worker options):
# Before (workers live forever, accumulate garbage)
celery -A myapp worker --concurrency=4

# After (workers restart fresh every 1000 tasks)
celery -A myapp worker --concurrency=4 --max-tasks-per-child=1000

That's it.
What does this do?
WITHOUT max-tasks-per-child:

Worker 1: ═══════════════════════════════════════════► 💀 CRASH
          Start                                    3am

WITH max-tasks-per-child=1000:

Worker 1: ════════► 🔄 (restart fresh!)
           1000 tasks
Worker 1: ════════► 🔄 (restart fresh!)
           1000 tasks
Worker 1: ════════► 🔄 (restart fresh!)
           1000 tasks
...forever, always healthy!

Think of it like shifts at a factory:
- Before: One employee works 24 hours straight until they collapse
- After: Employees work 8-hour shifts, go home, fresh employee takes over
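Under the hood, that's really all max-tasks-per-child is. Here's a minimal sketch of the shift pattern in plain Python - this is not Celery's actual pool code, and handle() is just a stand-in for real work:

import multiprocessing

MAX_TASKS_PER_CHILD = 1000

def handle(task):
    print(f"processing {task}")  # stand-in for real work

def worker_shift(queue):
    # One "shift": process at most N tasks, then clock out.
    # Exiting the process returns ALL of its memory to the OS.
    for _ in range(MAX_TASKS_PER_CHILD):
        task = queue.get()
        if task is None:
            return
        handle(task)

def supervisor(queue):
    # Keep replacing finished workers with fresh ones, forever.
    while True:
        shift = multiprocessing.Process(target=worker_shift, args=(queue,))
        shift.start()
        shift.join()  # shift over → loop around and start a fresh worker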
The Complete Fix
Here's my final configuration:
# celery.py
from celery import Celery

app = Celery('myapp')

app.conf.update(
    # 🔄 RESTART workers every 1000 tasks (prevents memory buildup)
    worker_max_tasks_per_child=1000,

    # 📥 Don't grab too many tasks at once (prevents memory spikes)
    worker_prefetch_multiplier=4,

    # ✅ Don't say "done" until actually done (prevents lost tasks)
    task_acks_late=True,

    # 🔁 If a worker crashes, retry the task (nothing gets lost)
    task_reject_on_worker_lost=True,
)

And to fix the database connection problem (using Django's close_old_connections):
from celery import shared_task
from django.db import close_old_connections

@shared_task
def process_email(email_id):
    # Hang up any old phone calls first
    close_old_connections()
    try:
        email = Email.objects.get(id=email_id)
        send_email(email)
    finally:
        # Hang up when done
        close_old_connections()

How Do I Know It's Working?
Now I monitor memory like this:
import psutil

def check_worker_health():
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
    if memory_mb > 300:
        print(f"⚠️ Warning: Using {memory_mb:.0f}MB")
    if memory_mb > 400:
        print(f"🚨 Critical: Using {memory_mb:.0f}MB!")
    return memory_mb

Before the fix:
Memory over time:
100MB → 200MB → 300MB → 400MB → 500MB → 💀
Always going UP until crash.

After the fix:
Memory over time:
100MB → 150MB → 🔄 restart → 100MB → 150MB → 🔄 restart → 100MB
Stays flat! Never crashes!

Key Lessons
Lesson 1: Long-Running Processes Accumulate Garbage
Python programs aren't meant to run forever. They slowly fill up with:
- Old database connections
- Cached data
- Memory fragments
Solution: Let them restart periodically.
Lesson 2: "Per-Task" Memory Is Misleading
Each task might only use 1MB. But after 100,000 tasks, your process uses 500MB.
The math doesn't add up because of hidden accumulation.
Solution: Monitor total process memory, not per-task memory.
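One easy way to do that with Celery: hook the health check from earlier into the task_postrun signal, so it runs after every task. The 300MB threshold is just my pick - tune it for your workload:

import psutil
from celery.signals import task_postrun

@task_postrun.connect
def log_worker_memory(task_id=None, **kwargs):
    # Runs inside the worker after every task finishes
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
    if memory_mb > 300:  # same threshold as check_worker_health above
        print(f"⚠️ Worker at {memory_mb:.0f}MB after task {task_id}")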
Lesson 3: The Fix Is Often Configuration, Not Code
I spent a week reading code, looking for memory leaks. The fix was adding one flag:
--max-tasks-per-child=1000

Solution: Know your tools' configuration options.
Summary
THE PROBLEM:
Workers ran forever → Memory accumulated → 3am crash
THE CAUSE:
- Database connections not closing
- Caches growing forever
- Memory fragmentation
THE FIX:
1. Restart workers every N tasks (max-tasks-per-child)
2. Close database connections properly
3. Monitor total memory, not per-task memory
THE RESULT:
No more 3am crashes. I sleep peacefully now.

Quick Reference
If your Celery workers crash after running for hours:
# Add this to your celery config
app.conf.worker_max_tasks_per_child = 1000

If you use Django ORM in tasks:
from celery import shared_task
from django.db import close_old_connections

@shared_task
def my_task():
    close_old_connections()
    try:
        ...  # your code here
    finally:
        close_old_connections()

If you have caches that grow forever (use Python's built-in lru_cache):
from functools import lru_cache

@lru_cache(maxsize=100)  # Limit to 100 items!
def get_template(name):
    return load_template(name)

That's it. Simple fixes for a problem that cost me a week of sleep.
Related Reading
- Retry Storms - Another cause of worker crashes: cascading failures from retries
- Queue Sizing and Backpressure - How queues fill up when workers can't keep up
- Batch to Event-Driven - Alternative architectures that handle failures more gracefully
