The Mystery
Every morning for a week, I woke up to alerts. My background workers were dead. Thousands of emails stuck in a queue. Users complaining.
The weird part? It always happened around 3am. Never during the day.
Let me explain what was happening and why - in simple terms.
What Is a "Worker" Anyway?
Think of a worker like a factory employee.
YOUR APP                                    THE WORKER
┌──────────┐                               ┌──────────┐
│          │    "Send this email"          │          │
│ Website  │ ────────────────────────────► │  Worker  │
│          │                               │          │
└──────────┘                               └──────────┘
You (the website) give tasks to the worker.
The worker does them in the background.
You don't wait - you move on immediately.

Why use workers?
Imagine you're at a restaurant:
- Without workers: You order food, then stand at the kitchen door waiting 20 minutes
- With workers: You order food, sit down, and someone brings it when ready
Workers let your website stay fast while slow stuff (emails, reports, image processing) happens in the background.
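If you're curious what that hand-off looks like in code, here's a minimal sketch with Celery (the task queue this whole story is about). The Redis URL and the task itself are just placeholders:

# tasks.py - a minimal sketch; the broker URL and task body are placeholders
from celery import Celery

app = Celery('myapp', broker='redis://localhost:6379/0')

@app.task
def send_welcome_email(user_id):
    # The slow stuff happens here, in the background
    print(f"Sending welcome email to user {user_id}")

# In your website code, you hand off the task and move on immediately:
send_welcome_email.delay(42)  # returns instantly; a worker picks it up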
The Problem - Workers Get "Tired"
Here's what I didn't understand at first:
A worker is just a program running on a computer. And programs use memory.
Think of memory like a desk:
FRESH WORKER (just started)
┌─────────────────────────────────────────────┐
│                                             │
│                Clean desk!                  │
│          Plenty of space to work.           │
│                                             │
│                    📋                       │
│                (one task)                   │
│                                             │
└─────────────────────────────────────────────┘

Every time the worker does a task, it uses some desk space. When it's done, it should clean up.
But here's the problem: It doesn't clean up perfectly.
AFTER 1,000 TASKS
┌─────────────────────────────────────────────┐
│  📄 📋 📝 📄 📋 📄 📝 📋 📄 📝 📋 📄 📋    │
│  📝 📄 📋 📝 📄 📋 📝 📄 📋 📝 📄 📋 📝    │
│                                             │
│                Messy desk!                  │
│           Old papers piling up.             │
│          Still works, but slower.           │
│                                             │
└─────────────────────────────────────────────┘
AFTER 100,000 TASKS (3am)
┌─────────────────────────────────────────────┐
│ 📄📋📝📄📋📄📝📋📄📝📋📄📋📝📄📋📝📄📋📝 │
│ 📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝 │
│ 📄📋📝📄📋📄📝📋📄📝📋📄📋📝📄📋📝📄📋📝 │
│ 📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝📄📋📝 │
│                                             │
│          💀 DESK OVERFLOW! CRASH!           │
│                                             │
└─────────────────────────────────────────────┘

Why 3am? Because by then, the worker had been running for ~18 hours, processing hundreds of thousands of tasks. The "desk" finally overflowed.
Why Doesn't Python Clean Up?
You might think: "But Python has garbage collection! It should clean up automatically!"
You're right. But here's the thing most people don't realize:
Python cleans up for Python. Not for the operating system.
Let me explain with an analogy:
IMAGINE A FILING CABINET

┌─────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │ │ B │ │ C │ │   │ │  ← Cabinet has 4 drawers
│ └───┘ └───┘ └───┘ └───┘ │    (memory from the operating system)
└─────────────────────────┘

You store files A, B, C.

Now delete file B:

┌─────────────────────────┐
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐ │
│ │ A │ │ 🗑 │ │ C │ │   │ │  ← Drawer B is now EMPTY
│ └───┘ └───┘ └───┘ └───┘ │    but still EXISTS
└─────────────────────────┘

Python: "Great! I can reuse drawer B!"
Operating System: "I still see 4 drawers being used."

The drawer is empty, but the cabinet is still there. Python won't return that drawer to the OS. It keeps it "just in case."
Over time:
- Python keeps requesting more drawers
- Python empties old drawers but doesn't return them
- Eventually, the room fills with empty cabinets
- Crash!
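You can watch this happen yourself. Here's a small experiment using psutil (the same library I use for monitoring later). Exact numbers vary by platform and Python version, but the pattern holds:

import gc
import psutil

def rss_mb():
    # Resident Set Size: memory the OS says this process is using
    return psutil.Process().memory_info().rss / 1024 / 1024

print(f"Baseline:   {rss_mb():.0f} MB")

drawers = [object() for _ in range(5_000_000)]
print(f"Allocated:  {rss_mb():.0f} MB")

keep = drawers[::100]  # hold on to 1 object in 100 (files left in drawers)
del drawers
gc.collect()           # Python has now "emptied the drawers"...

# ...but the OS still sees most of that memory in use, because
# partially-occupied cabinets can't be returned.
print(f"After free: {rss_mb():.0f} MB (nowhere near baseline)")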
What Was Actually Leaking?
In my case, three things were piling up:
Database Connections
Every time my worker talked to the database, it opened a connection. Like making a phone call.
TASK 1: "Hey database, get me email #123"
        → Opens phone line 1 ☎️

TASK 2: "Hey database, get me email #456"
        → Opens phone line 2 ☎️ (didn't hang up line 1!)

TASK 3: "Hey database, get me email #789"
        → Opens phone line 3 ☎️ (still not hanging up!)

...after 1000 tasks...

☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️☎️
ALL LINES BUSY! CAN'T MAKE NEW CALLS!

Cached Data
My code was "remembering" things to be faster:
# This looks innocent
saved_templates = {}

def get_template(name):
    if name not in saved_templates:
        saved_templates[name] = load_from_disk(name)
    return saved_templates[name]

But over 100,000 tasks with different templates:
Hour 1: saved_templates = {template1}
Hour 5: saved_templates = {template1, template2, ..., template50}
Hour 10: saved_templates = {template1, template2, ..., template200}
Hour 18: saved_templates = {template1, template2, ..., template2000} 💀

It never forgot anything!
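To make that scale concrete, here's a rough simulation of the same cache. The 100 KB template size is an assumption I picked for illustration:

import sys

saved_templates = {}

def get_template(name):
    if name not in saved_templates:
        saved_templates[name] = "x" * 100_000  # pretend 100 KB template body
    return saved_templates[name]

# Simulate 2,000 distinct template names arriving over 18 hours
for i in range(2000):
    get_template(f"template{i}")

size_mb = sum(sys.getsizeof(v) for v in saved_templates.values()) / 1024 / 1024
print(f"{len(saved_templates)} templates cached, ~{size_mb:.0f} MB and climbing")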
Memory Fragmentation
Even when Python "freed" memory, the space was fragmented:
CLEAN MEMORY:
[████████████████████████████████]
One big block - nice!

FRAGMENTED MEMORY (after many tasks):
[█░█░░█░█░░░█░█░░█░░░█░█░░█░█░░█]
  ↑   ↑     ↑
 Used Empty Used

Python: "I have lots of empty space!"
Reality: "It's all tiny holes. Can't fit anything big."

The Solution (It's Embarrassingly Simple)
After a week of debugging, the fix was one line of configuration (see the official Celery documentation on worker options):
# Before (workers live forever, accumulate garbage)
celery -A myapp worker --concurrency=4

# After (workers restart fresh every 1000 tasks)
celery -A myapp worker --concurrency=4 --max-tasks-per-child=1000

That's it.
What does this do?
WITHOUT max-tasks-per-child:

Worker 1: ═══════════════════════════════════════════► 💀 CRASH
          Start                                    3am

WITH max-tasks-per-child=1000:

Worker 1: ════════► 🔄 (restart fresh!)
           1000 tasks
Worker 1: ════════► 🔄 (restart fresh!)
           1000 tasks
Worker 1: ════════► 🔄 (restart fresh!)
           1000 tasks
...forever, always healthy!

Think of it like shifts at a factory:
- Before: One employee works 24 hours straight until they collapse
- After: Employees work 8-hour shifts, go home, fresh employee takes over
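Under the hood, that's really all max-tasks-per-child is. Here's a minimal sketch of the shift pattern in plain Python - this is not Celery's actual pool code, and handle() is just a stand-in for real work:

import multiprocessing

MAX_TASKS_PER_CHILD = 1000

def handle(task):
    print(f"processing {task}")  # stand-in for real work

def worker_shift(queue):
    # One "shift": process at most N tasks, then clock out.
    # Exiting the process returns ALL of its memory to the OS.
    for _ in range(MAX_TASKS_PER_CHILD):
        task = queue.get()
        if task is None:
            return
        handle(task)

def supervisor(queue):
    # Keep replacing finished workers with fresh ones, forever.
    while True:
        shift = multiprocessing.Process(target=worker_shift, args=(queue,))
        shift.start()
        shift.join()  # shift over → loop around and start a fresh worker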
The Complete Fix
Here's my final configuration:
# celery.py
from celery import Celery

app = Celery('myapp')

app.conf.update(
    # 🔄 RESTART workers every 1000 tasks (prevents memory buildup)
    worker_max_tasks_per_child=1000,

    # 📥 Don't grab too many tasks at once (prevents memory spikes)
    worker_prefetch_multiplier=4,

    # ✅ Don't say "done" until actually done (prevents lost tasks)
    task_acks_late=True,

    # 🔁 If a worker crashes, retry the task (nothing gets lost)
    task_reject_on_worker_lost=True,
)

And to fix the database connection problem (using Django's close_old_connections):
from celery import shared_task
from django.db import close_old_connections

@shared_task
def process_email(email_id):
    # Hang up any old phone calls first
    close_old_connections()
    try:
        email = Email.objects.get(id=email_id)
        send_email(email)
    finally:
        # Hang up when done
        close_old_connections()

How Do I Know It's Working?
Now I monitor memory like this:
import psutil

def check_worker_health():
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
    if memory_mb > 300:
        print(f"⚠️ Warning: Using {memory_mb:.0f}MB")
    if memory_mb > 400:
        print(f"🚨 Critical: Using {memory_mb:.0f}MB!")
    return memory_mb

Before the fix:
Memory over time:
100MB → 200MB → 300MB → 400MB → 500MB → 💀
Always going UP until crash.

After the fix:
Memory over time:
100MB → 150MB → 🔄 restart → 100MB → 150MB → 🔄 restart → 100MB
Stays flat! Never crashes!

Key Lessons
Lesson 1: Long-Running Processes Accumulate Garbage
Python programs aren't meant to run forever. They slowly fill up with:
- Old database connections
- Cached data
- Memory fragments
Solution: Let them restart periodically.
Lesson 2: "Per-Task" Memory Is Misleading
Each task might only use 1MB. But after 100,000 tasks, your process uses 500MB.
The math doesn't add up because of hidden accumulation.
Solution: Monitor total process memory, not per-task memory.
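One easy way to do that with Celery: hook the health check from earlier into the task_postrun signal, so it runs after every task. The 300MB threshold is just my pick - tune it for your workload:

import psutil
from celery.signals import task_postrun

@task_postrun.connect
def log_worker_memory(task_id=None, **kwargs):
    # Runs inside the worker after every task finishes
    memory_mb = psutil.Process().memory_info().rss / 1024 / 1024
    if memory_mb > 300:  # same threshold as check_worker_health above
        print(f"⚠️ Worker at {memory_mb:.0f}MB after task {task_id}")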
Lesson 3: The Fix Is Often Configuration, Not Code
I spent a week reading code, looking for memory leaks. The fix was adding one flag:
--max-tasks-per-child=1000

Solution: Know your tools' configuration options.
Summary
THE PROBLEM:
Workers ran forever → Memory accumulated → 3am crash
THE CAUSE:
- Database connections not closing
- Caches growing forever
- Memory fragmentation
THE FIX:
1. Restart workers every N tasks (max-tasks-per-child)
2. Close database connections properly
3. Monitor total memory, not per-task memory
THE RESULT:
No more 3am crashes. I sleep peacefully now.

Quick Reference
If your Celery workers crash after running for hours:
# Add this to your celery config
app.conf.worker_max_tasks_per_child = 1000

If you use Django ORM in tasks:
from celery import shared_task
from django.db import close_old_connections

@shared_task
def my_task():
    close_old_connections()
    try:
        ...  # your code here
    finally:
        close_old_connections()

If you have caches that grow forever (use Python's built-in lru_cache):
from functools import lru_cache

@lru_cache(maxsize=100)  # Limit to 100 items!
def get_template(name):
    return load_template(name)

That's it. Simple fixes for a problem that cost me a week of sleep.
Related Reading
- Retry Storms - Another cause of worker crashes: cascading failures from retries
- Queue Sizing and Backpressure - How queues fill up when workers can't keep up
- Batch to Event-Driven - Alternative architectures that handle failures more gracefully
