Why Did Our Queue Crash on Black Friday? (Understanding Queue Sizing and Backpressure)

#RabbitMQ #Architecture #Celery #Queues #Python #Performance

The Meltdown

Our email system handled 50,000 emails per hour. Easy.

Then marketing launched a Black Friday campaign. Without telling engineering.

Suddenly we had 500,000 emails to send. Ten times our normal load.

The queue backed up. Memory spiked. RabbitMQ started dropping connections. Workers couldn't even report they were done.

The system didn't slow down gracefully. It collapsed.


What Is a Queue?

Think of a queue like a waiting line at a coffee shop:

CUSTOMERS                    BARISTA
   ↓                            ↓
  ☕ ☕ ☕ ☕ ☕ → [WAITING LINE] → 👨‍🍳 → ☕ Ready!
 
Customers arrive.
They wait in line.
Barista makes one drink at a time.

In software:

REQUESTS                    WORKER
   ↓                           ↓
  📧 📧 📧 📧 📧 → [QUEUE] → 🤖 → ✓ Sent!
 
Email requests arrive.
They wait in the queue.
Worker sends one at a time.

The queue is just a waiting room. It holds work until workers are ready.


The Problem - When Lines Get Too Long

Queues work great when customers arrive at roughly the same rate workers process them.

Normal day:

ARRIVAL RATE:   100 emails/minute
PROCESSING:     100 emails/minute
QUEUE LENGTH:   ~0 (no backup)
 
Life is good!

Black Friday:

ARRIVAL RATE:   1000 emails/minute
PROCESSING:      100 emails/minute
DEFICIT:         900 emails/minute pile up
 
After 1 hour:   54,000 emails waiting
After 2 hours: 108,000 emails waiting
After 4 hours: 216,000 emails (SYSTEM CRASHES 💥)

The queue keeps growing until something breaks.
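
The numbers above are easy to check. Here is a minimal sketch, assuming constant arrival and processing rates (real traffic is burstier). The helper name `backlog_after` is made up for this example.

```python
def backlog_after(minutes, arrival_per_min, processed_per_min, starting_backlog=0):
    """Return how many messages are waiting after `minutes` at constant rates."""
    net_growth = arrival_per_min - processed_per_min
    # A queue cannot hold a negative number of messages.
    return max(0, starting_backlog + net_growth * minutes)


for hours in (1, 2, 4):
    waiting = backlog_after(hours * 60, arrival_per_min=1000, processed_per_min=100)
    print(f"After {hours} hour(s): {waiting:,} emails waiting")
```

The growth is linear, so the only open question is which limit (memory, disk, connections) gets hit first.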

Think of it like a restaurant:

NORMAL NIGHT:
10 customers/hour arrive
10 customers/hour seated
→ No wait
 
BLACK FRIDAY:
100 customers/hour arrive
10 customers/hour seated
→ Line goes out the door
→ Around the block
→ People start leaving (or worse, passing out)

What Happens When a Queue Gets Too Big

When the queue fills up, bad things happen:

STAGE 1: Queue Growing
─────────────────────────────
Queue: ████░░░░░░ (40% full)
Status: "Hmm, getting busy"
 
STAGE 2: Queue Large
─────────────────────────────
Queue: ████████░░ (80% full)
Status: "Workers can't keep up"
Memory: Climbing
 
STAGE 3: Queue Critical
─────────────────────────────
Queue: ██████████ (100% full)
Status: "No room for new messages"
Memory: Almost maxed
 
STAGE 4: Collapse
─────────────────────────────
Queue: 💥💥💥💥💥
Status: "RabbitMQ out of memory"
         "Workers can't connect"
         "New messages rejected"
         "Everything stops"

The First Question - What Should Happen When It's Full?

This is the key decision. When the queue is full, you have three options:

Option 1: Drop Old Messages (Oldest Dropped First)

QUEUE FULL:
[Old] [Old] [Old] [Old] [Old] → NEW MESSAGE ARRIVES
 
ACTION: Drop oldest message
 
[Dropped] ← [Old] [Old] [Old] [Old] [New]
 
"Sorry, that message waited too long. It's gone."

Good for: Notifications, status updates, things that expire anyway

Bad for: Important emails, financial transactions

Option 2: Reject New Messages

QUEUE FULL:
[Old] [Old] [Old] [Old] [Old] → NEW MESSAGE ARRIVES
 
ACTION: Reject the new message
 
NEW MESSAGE: "Sorry, queue is full. Try later."
 
"Come back when we have room."

Good for: When you want to tell senders to slow down

Bad for: Fire-and-forget systems

Option 3: Move to Overflow (Dead Letter Queue)

QUEUE FULL:
[Old] [Old] [Old] [Old] [Old] → NEW MESSAGE ARRIVES
 
ACTION: Send to overflow queue for later
 
MAIN QUEUE: [Old] [Old] [Old] [Old] [Old]
OVERFLOW:   [New]
 
"We'll get to you, just not right now."

Good for: When you can't lose messages but can delay them
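
To make the three options concrete, here is a small in-memory sketch. It is not how RabbitMQ implements them. The class `BoundedQueue` and the policy names `"drop-oldest"`, `"reject-new"`, and `"overflow"` are hypothetical names chosen for this example.

```python
from collections import deque


class BoundedQueue:
    """In-memory queue with a size limit and a configurable overflow policy."""

    def __init__(self, limit, policy):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        if policy not in ("drop-oldest", "reject-new", "overflow"):
            raise ValueError(f"unknown policy: {policy}")
        self.limit = limit
        self.policy = policy
        self.items = deque()
        self.overflow = []  # plays the role of the dead letter queue

    def publish(self, message):
        """Add a message. Return True if it was accepted into the main queue."""
        if len(self.items) < self.limit:
            self.items.append(message)
            return True
        if self.policy == "drop-oldest":
            self.items.popleft()  # Option 1: the oldest message is lost
            self.items.append(message)
            return True
        if self.policy == "overflow":
            self.overflow.append(message)  # Option 3: kept, but delayed
        return False  # Options 2 and 3: the message is not in the main queue
```

The return value is what makes Option 2 useful: the sender finds out immediately that the queue is full and can slow down.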


The Better Solution - Don't Let It Get Full

The real fix isn't "what to do when full." It's "how to prevent it from getting full."

This is called backpressure. It means telling senders to slow down.

Think of it like a bouncer at a club:

WITHOUT BACKPRESSURE:
Club capacity: 100 people
People trying to enter: 1000
Result: Chaos, crushing, fire hazard
 
WITH BACKPRESSURE (Bouncer):
Club capacity: 100 people
People trying to enter: 1000
Bouncer: "Sorry, we're at capacity. Line forms here."
Result: Orderly line, people wait their turn

In code, backpressure means:

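class BackpressureError(Exception):
    """Raised when the queue is too deep to accept more work."""


def send_email(email_data, queue, get_queue_depth, max_depth=50000):
    # queue: any client object with a send() method
    # get_queue_depth: callable returning the current number of queued messages
    current_depth = get_queue_depth()

    if current_depth > max_depth:
        # DON'T add more to the queue
        # Either:
        # 1. Return an error
        # 2. Wait until there's room
        # 3. Save for later
        raise BackpressureError("Queue full, try again later")

    # Queue has room - proceed
    queue.send(email_data)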
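
The "wait until there's room" choice can be sketched with Python's standard `queue.Queue`, which blocks producers once it reaches `maxsize`. This is an in-process illustration only. With a broker like RabbitMQ the depth check would have to go through the broker instead. The function name `enqueue_with_backpressure` is hypothetical.

```python
import queue


def enqueue_with_backpressure(work_queue, email_data, wait_seconds=0.5):
    """Try to enqueue; block briefly if full, then report failure to the caller.

    Returns True if the message was queued, False if the queue stayed full.
    """
    try:
        # put() blocks while the queue is at maxsize, up to `wait_seconds`.
        work_queue.put(email_data, timeout=wait_seconds)
        return True
    except queue.Full:
        # The producer now knows to slow down, retry later, or save the message.
        return False


emails = queue.Queue(maxsize=2)
results = [
    enqueue_with_backpressure(emails, f"email-{i}", wait_seconds=0.01)
    for i in range(3)
]
print(results)
```

Nothing consumes from the queue here, so the third call waits out its timeout and gives up. That refusal is the backpressure signal.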

Priority - Not All Messages Are Equal

Here's a key insight: When overloaded, focus on what matters most.

EMAIL TYPES:
 
CRITICAL: Password resets, security alerts
          → Never drop. Ever.
 
HIGH: Order confirmations, receipts
      → Should send, but can wait
 
NORMAL: Marketing campaigns
        → Nice to send, okay to drop some
 
LOW: Newsletters, digests
     → Can wait hours, can be dropped

The strategy:

NORMAL DAY:
Critical: ██ (handled immediately)
High:     ████ (handled quickly)
Normal:   ████████ (bulk of traffic)
Low:      ██ (background)
 
 
OVERLOADED (Black Friday):
Critical: ██ (STILL handled immediately)
High:     ████ (still priority)
Normal:   ████░░░░ (some dropped)
Low:      ░░ (mostly dropped)
 
Critical emails ALWAYS go through.
Low priority emails can wait or be dropped.

Think of it like a hospital emergency room:

NORMAL DAY:
Heart attack patient → Immediate attention
Broken arm → Wait 30 minutes
Cold symptoms → Wait 2 hours
 
MASS CASUALTY EVENT:
Heart attack patient → STILL immediate attention
Broken arm → Wait longer
Cold symptoms → "Please come back tomorrow"
 
You don't ignore heart attacks because there's a crowd.
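
The triage idea can be sketched as a worker that always takes from the highest-priority queue that has work. This is a simplified in-memory version, assuming one deque per priority. The names `PRIORITIES` and `next_message` are made up for the example.

```python
from collections import deque

PRIORITIES = ("critical", "high", "normal", "low")  # highest first


def next_message(queues):
    """Return (priority, message) from the highest-priority non-empty queue.

    `queues` maps each priority name to a deque. Returns None when all are empty.
    """
    for priority in PRIORITIES:
        if queues[priority]:
            return priority, queues[priority].popleft()
    return None


queues = {name: deque() for name in PRIORITIES}
queues["low"].append("weekly digest")
queues["normal"].append("black friday promo")
queues["critical"].append("password reset")

order = []
while (item := next_message(queues)) is not None:
    order.append(item[0])
print(order)
```

Strict ordering like this can starve the low queue for as long as higher queues have work. Here that is the intended trade-off.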

The Visual Flow

Here's how I handle incoming email requests now:

EMAIL REQUEST ARRIVES
         │
         ▼
┌─────────────────────────────┐
│  What's the priority?       │
└─────────────────────────────┘
         │
    ┌────┴────┬────────┬────────┐
    ▼         ▼        ▼        ▼
CRITICAL    HIGH    NORMAL     LOW
    │         │        │        │
    ▼         ▼        ▼        ▼
┌───────┐ ┌──────┐ ┌──────┐ ┌──────┐
│Queue  │ │Queue │ │Queue │ │Queue │
│Limit: │ │Limit:│ │Limit:│ │Limit:│
│Unlim. │ │50000 │ │100K  │ │10000 │
└───────┘ └──────┘ └──────┘ └──────┘
    │         │        │        │
    ▼         ▼        ▼        ▼
 Always    Usually   Maybe    Often
  Sent      Sent     Sent    Dropped
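
The admission step in that flow can be sketched in a few lines. The limits come from the diagram. Everything else (the `admit` function, the dict-of-deques layout) is an assumption for illustration, not the production code.

```python
from collections import deque

# Limits taken from the diagram above; None means unlimited.
QUEUE_LIMITS = {"critical": None, "high": 50_000, "normal": 100_000, "low": 10_000}


def admit(queues, priority, message, limits=QUEUE_LIMITS):
    """Queue the message if its priority queue has room. Return True if queued."""
    limit = limits[priority]
    if limit is not None and len(queues[priority]) >= limit:
        return False  # shed: the caller may drop it or save it for later
    queues[priority].append(message)
    return True


queues = {name: deque() for name in QUEUE_LIMITS}
print(admit(queues, "critical", "password reset"))
```

Because each priority has its own limit, a flood of marketing email can fill its own queue without taking room from password resets.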

Rate Limiting Yourself

Another way to prevent overload: Control how fast you add to the queue.

WITHOUT RATE LIMITING:
Marketing: "Send 500,000 emails NOW!"
System: *tries to queue 500,000 at once*
System: 💥
 
WITH RATE LIMITING:
Marketing: "Send 500,000 emails NOW!"
System: "I'll add 100 per second to the queue"
System: "That's 6,000 per minute"
System: "All 500,000 will be queued in ~83 minutes"
System: ✓ (Still running smoothly)

Think of it like filling a bathtub:

WITHOUT RATE LIMITING:
You: *turns faucet to maximum*
Bathtub: *overflows*
 
WITH RATE LIMITING:
You: *turns faucet to medium*
Bathtub: *fills slowly but doesn't overflow*
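
Here is a minimal sketch of the "medium faucet": hand messages to the queue in fixed-size batches with a pause between them. `enqueue_paced` is a hypothetical helper, and `enqueue` stands in for whatever actually publishes to your queue. The `sleep` parameter exists only so the pacing can be tested without waiting.

```python
import time


def enqueue_paced(messages, enqueue, per_second=100, sleep=time.sleep):
    """Hand messages to `enqueue` in batches of `per_second`, one batch per second.

    Returns the number of pauses taken.
    """
    pauses = 0
    for start in range(0, len(messages), per_second):
        if start > 0:
            sleep(1.0)  # wait before every batch except the first
            pauses += 1
        for message in messages[start:start + per_second]:
            enqueue(message)
    return pauses
```

This ignores the time spent publishing each batch, so the real rate ends up slightly under `per_second`. For a safety valve, erring on the slow side is fine.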

Dead Letter Queues - The Safety Net

What happens to messages that can't be delivered?

They go to a Dead Letter Queue (DLQ) - a special place for problem messages. See RabbitMQ's documentation on Dead Letter Exchanges for the implementation details.

NORMAL MESSAGE:
Request → Queue → Worker → ✓ Done!
 
PROBLEM MESSAGE:
Request → Queue → Worker → ✗ Failed
                    │
                    ▼
           Dead Letter Queue
                    │
                    ▼
         "We'll investigate later"

The DLQ is like a lost and found:

NORMAL PACKAGES:
Mail arrives → Delivered to address → Done!
 
PROBLEM PACKAGES:
Mail arrives → Wrong address!
                   │
                   ▼
            Lost and Found
                   │
                   ▼
        "Someone will figure this out"

What goes to the DLQ:

  • Messages that failed too many times
  • Messages that sat in queue too long (expired)
  • Messages rejected because queue was full
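
The first case (too many failures) can be sketched without a broker. This is an in-memory illustration, not RabbitMQ's dead-lettering mechanism. The function `process_with_dlq` and the shape of the dead-letter records are assumptions made for the example.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Run `handler` on each message; after `max_attempts` failures, dead-letter it.

    Returns (done, dead_letters). Each dead letter keeps the last error
    so someone can investigate later.
    """
    done, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(message)
                done.append(message)
                break
            except Exception as error:
                if attempt == max_attempts:
                    dead_letters.append(
                        {"message": message, "error": str(error), "attempts": attempt}
                    )
    return done, dead_letters
```

Keeping the error next to the message is the useful part. A DLQ full of messages with no reason attached is much harder to work through.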

Monitoring - Know Before It Breaks

You need to see the queue filling up before it crashes.

QUEUE HEALTH DASHBOARD:
 
emails.critical:  ██░░░░░░░░ (20%)  ✓ Healthy
emails.high:      ████░░░░░░ (40%)  ✓ Healthy
emails.normal:    ████████░░ (80%)  ⚠️ Warning!
emails.low:       ██████████ (100%) 🚨 FULL!
 
ALERTS:
⚠️ emails.normal at 80% capacity for 5 minutes
🚨 emails.low queue full, messages being dropped

What to alert on:

GREEN (0-60%):     No action needed
YELLOW (60-80%):   "Getting busy, watching it"
ORANGE (80-95%):   "Add more workers or reduce load"
RED (95-100%):     "Messages being dropped!"
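
Those bands translate directly into a check you could run against each queue. A minimal sketch: the function name `queue_health` is made up, and since the bands above overlap at 60, 80, and 95, I assume a boundary value belongs to the more severe level.

```python
def queue_health(depth, limit):
    """Map queue fill level to the alert colours used above."""
    percent_full = 100 * depth / limit
    if percent_full >= 95:
        return "RED"
    if percent_full >= 80:
        return "ORANGE"
    if percent_full >= 60:
        return "YELLOW"
    return "GREEN"


print(queue_health(depth=8_000, limit=10_000))
```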

The Key Insight

Here's what I learned from that Black Friday disaster:

Queues don't solve capacity problems. They defer them.

WITHOUT QUEUE:
1000 requests → System handles 100, 900 immediately fail
 
WITH QUEUE:
1000 requests → Queue holds 900...then what?
 
If you can't process them EVENTUALLY, they still fail.
Just later.

A queue is temporary storage, not magic. If producers consistently outpace consumers, the queue will fill up. Every time.

The real solutions are:

  1. More consumers (process faster)
  2. Backpressure (slow down producers)
  3. Load shedding (drop low-priority work)
  4. Rate limiting (spread the load over time)
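
For the first item, the sizing math is simple enough to script. A rough sketch, assuming constant rates and that every worker processes at the same speed. Both function names are hypothetical.

```python
import math


def workers_needed(arrival_per_min, per_worker_per_min):
    """Smallest worker count whose combined rate at least matches arrivals."""
    return math.ceil(arrival_per_min / per_worker_per_min)


def minutes_to_drain(backlog, arrival_per_min, processed_per_min):
    """Minutes until the backlog is empty, or None if it never drains."""
    spare_capacity = processed_per_min - arrival_per_min
    if spare_capacity <= 0:
        return None  # producers still match or outpace consumers
    return backlog / spare_capacity


print(workers_needed(arrival_per_min=1000, per_worker_per_min=100))
print(minutes_to_drain(backlog=216_000, arrival_per_min=1000, processed_per_min=100))
```

The `None` result is the key insight in code form: without spare capacity, no queue size is ever big enough.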

Key Lessons

Lesson 1: Design for Overload First

Don't wait until Black Friday to think about capacity. Plan for 10x traffic on day one.

Lesson 2: Not All Messages Are Equal

Critical messages should never be dropped. Low-priority messages can wait or be dropped when needed.

Lesson 3: Backpressure Is Your Friend

Telling producers to slow down is better than crashing when they don't.

Lesson 4: Monitor Before You Crash

If you can see the queue filling up, you can react before it's too late.


Quick Reference

Queue strategies:

When full, what happens to new messages?
 
drop-head:           Drop oldest (for expiring data)
reject-publish:      Reject new (for backpressure)
reject-publish-dlx:  Reject new and move it to the DLQ (for investigation)
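
In RabbitMQ, these behaviours are set through optional queue arguments: `x-max-length`, `x-overflow`, `x-dead-letter-exchange`, and `x-message-ttl`. The sketch below only builds the arguments dict, so it runs without a broker. The helper name `bounded_queue_arguments` and the exchange name `emails.dlx` are made up for this example.

```python
def bounded_queue_arguments(max_length, overflow="reject-publish",
                            dead_letter_exchange=None, message_ttl_ms=None):
    """Build the optional-arguments dict for declaring a size-limited queue."""
    if overflow not in ("drop-head", "reject-publish", "reject-publish-dlx"):
        raise ValueError(f"unsupported x-overflow value: {overflow}")
    arguments = {"x-max-length": max_length, "x-overflow": overflow}
    if dead_letter_exchange is not None:
        arguments["x-dead-letter-exchange"] = dead_letter_exchange
    if message_ttl_ms is not None:
        arguments["x-message-ttl"] = message_ttl_ms
    return arguments


args = bounded_queue_arguments(
    100_000, overflow="reject-publish-dlx", dead_letter_exchange="emails.dlx"
)
print(args)
```

With the pika client, a dict like this would be passed as `channel.queue_declare(queue="emails.normal", durable=True, arguments=args)`. Check the RabbitMQ docs for which overflow modes your queue type supports.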

Priority levels:

CRITICAL: Never drop, unlimited queue
HIGH:     Rarely drop, generous queue
NORMAL:   Can drop when overloaded
LOW:      Aggressive dropping okay

Warning signs:

⚠️ Queue depth steadily increasing
⚠️ Memory usage climbing
⚠️ Workers falling behind
🚨 Messages being rejected
🚨 Dead letter queue growing
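
"Steadily increasing" is the sign that is easiest to miss on a dashboard, so it is worth automating. A minimal sketch, assuming you already collect queue depth at regular intervals. The function name and the five-sample window are arbitrary choices for illustration.

```python
def is_steadily_increasing(depth_samples, min_samples=5):
    """True if the last `min_samples` depth readings each exceed the one before."""
    recent = depth_samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough data to call it a trend
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


print(is_steadily_increasing([100, 200, 400, 800, 1600]))
```

A single dip resets the trend, which keeps normal fluctuations from triggering alerts.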

Summary

THE PROBLEM:
Marketing launched a 10x campaign without warning
Queue filled up, system crashed
 
WHY IT HAPPENED:
- No queue limits (grew until out of memory)
- No backpressure (kept accepting work)
- No priority (marketing emails treated same as password resets)
- No monitoring (didn't see it coming)
 
THE FIX:
1. Set queue size limits
2. Implement backpressure (slow down when busy)
3. Use priorities (critical emails always go through)
4. Add dead letter queues (don't lose messages)
5. Monitor queue depth (alert before full)
 
THE RESULT:
10x traffic → Graceful degradation
Critical emails always sent
Low-priority emails delayed (not lost)
System stays running

The question isn't "what's the right queue size?" It's "what happens when you exceed it?" Design for that, and the queue size takes care of itself.


Aamir Shahzad

Author

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.