When I started building my email generation platform, I thought model selection was simple: use the best model for everything.
GPT-4 for email generation. GPT-4 for extraction. GPT-4 for classification.
Then I ran the numbers:
- 5,000 emails/day
- ~800 tokens average per email (prompt + response)
- GPT-4: $0.03/1K input + $0.06/1K output
- Daily cost: ~$180
That's $5,400/month just for LLM calls. Not sustainable.
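For reference, here's the back-of-the-envelope version of that math. The even prompt/response split is my assumption for the estimate, not something pulled from the bill:

EMAILS_PER_DAY = 5_000
INPUT_TOKENS = 400    # ~800 tokens per email, split roughly in half
OUTPUT_TOKENS = 400
GPT4_INPUT_PER_1K = 0.03
GPT4_OUTPUT_PER_1K = 0.06

daily = EMAILS_PER_DAY * (
    INPUT_TOKENS / 1000 * GPT4_INPUT_PER_1K
    + OUTPUT_TOKENS / 1000 * GPT4_OUTPUT_PER_1K
)
print(f"${daily:.0f}/day → ${daily * 30:,.0f}/month")  # $180/day → $5,400/month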
I needed to actually think about which model to use for each task.
The Task Breakdown
My platform has different types of LLM tasks:
TASK REQUIREMENTS:

Task                       Frequency    Latency      Quality
──────────────────────────────────────────────────────────────
Email preview              100/day      <3 sec       Good enough
Final email generation     5,000/day    Don't care   High
Subject line generation    5,000/day    Don't care   Medium
Business info extraction   500/day      Don't care   High accuracy
Lead classification        5,000/day    Don't care   Medium
Content cleaning           500/day      Don't care   Low

Different requirements → different models.
Why I Didn't Self-Host Small Models
First thought: "I'll run Llama 7B locally. Free inference!"
Tried it. Here's what I learned:
The Infrastructure Math
To run a 7B parameter model with decent speed:
- Minimum GPU: RTX 3090 or better (24GB VRAM)
- Server cost: ~$150/month (cloud) or $1,500+ upfront (own hardware)
- Inference speed: ~20 tokens/second on good hardware
- Concurrent requests: 1-2 per GPU
For 5,000 emails/day with ~200 tokens output each:
SELF-HOSTING MATH:
1,000,000 output tokens/day
÷ 20 tokens/second
= 50,000 seconds ≈ 14 hours of GPU time
Need multiple GPUs for reasonable throughput.
CLOUD GPU COSTS:
AWS g4dn.xlarge: ~$0.50/hour × 14 hours = $7/day
Plus setup, maintenance, scaling headaches.
API COSTS FOR SAME WORKLOAD:
GPT-3.5: ~$2/day
Claude Haiku: ~$1.50/day

The API was cheaper AND I didn't have to manage infrastructure.
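Same comparison as a quick script. The GPT-3.5 output price here ($0.002/1K) is my assumption that reproduces the ~$2/day figure, and prompt tokens are ignored on both sides, as in the numbers above:

OUTPUT_TOKENS_PER_DAY = 5_000 * 200                 # 1,000,000 tokens
gpu_hours = OUTPUT_TOKENS_PER_DAY / 20 / 3600       # 20 tokens/sec → ~13.9 hours
self_hosted = gpu_hours * 0.50                      # g4dn.xlarge at ~$0.50/hour → ~$7/day
gpt35_api = OUTPUT_TOKENS_PER_DAY / 1000 * 0.002    # $0.002/1K output → ~$2/day
print(f"self-hosted ≈ ${self_hosted:.2f}/day, GPT-3.5 API ≈ ${gpt35_api:.2f}/day")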
The Quality Gap
I tested Llama 7B, Mistral 7B, and Phi-2 against GPT-3.5, GPT-4, and Claude Sonnet for email generation:
QUALITY COMPARISON (out of 10):

Model            Coherence   Tone Match   Instructions   Personalization
──────────────────────────────────────────────────────────────────────
GPT-4            9           9            10             9
GPT-3.5          8           8            9              7
Claude Sonnet    9           9            9              8
Llama 7B         6           5            6              4
Mistral 7B       7           6            7              5

The small models struggled with:
- Maintaining consistent tone across paragraphs
- Following complex formatting instructions
- Incorporating multiple context pieces naturally
- Not sounding robotic
For a product where email quality directly impacts results, "good enough" wasn't good enough.
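I scored the outputs by hand on those four dimensions. If you want to run a similar bake-off, a minimal sketch (the helper below is hypothetical; plug in whatever generation functions you use) is to generate the same briefs with every candidate and review the results blind:

import csv
import random

def dump_comparison(briefs: list[str], generators: dict, path: str = 'model_comparison.csv') -> None:
    """Generate one email per (brief, model) pair and write a shuffled CSV.
    `generators` maps a model label to a callable taking a brief and returning email text.
    Hide the model column while scoring to keep the review blind."""
    rows = [
        {'brief': brief, 'model': label, 'email': generate(brief)}
        for brief in briefs
        for label, generate in generators.items()
    ]
    random.shuffle(rows)  # don't review one model's outputs back to back
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['brief', 'model', 'email'])
        writer.writeheader()
        writer.writerows(rows)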
When Small Models Make Sense
They're not useless. They work for:
- Simple classification (spam/not spam)
- Basic extraction with clear patterns
- Internal tooling where the quality bar is lower
- High-volume, low-stakes tasks
- When you have fine-tuning data for your specific use case
For my use case — cold emails that represent a brand — the quality gap was too visible.
The Model Selection I Landed On
class ModelRouter:
    TASK_MODELS = {
        # User-facing, needs speed
        'email_preview': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.7,
            'max_tokens': 500,
            'timeout': 10,
        },
        # Final output, quality matters
        'email_final': {
            'model': 'gpt-4',
            'temperature': 0.7,
            'max_tokens': 800,
            'timeout': 30,
            'fallback': 'claude-3-sonnet',
        },
        # Short, creative
        'subject_line': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.9,  # More creative
            'max_tokens': 50,
            'timeout': 5,
        },
        # Accuracy critical, structured output
        'extract_business_info': {
            'model': 'gpt-4',
            'temperature': 0.1,  # Deterministic
            'max_tokens': 300,
            'timeout': 15,
        },
        # Simple yes/no type tasks
        'classify_lead': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.1,
            'max_tokens': 10,
            'timeout': 5,
        },
        # Bulk cleaning, low stakes
        'clean_content': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.1,
            'max_tokens': 200,
            'timeout': 10,
        },
    }

    def get_config(self, task_type: str) -> dict:
        return self.TASK_MODELS.get(
            task_type,
            self.TASK_MODELS['email_preview']
        )
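Here's roughly how a config gets consumed at the call site. This is a minimal sketch assuming the OpenAI Python SDK v1; the Anthropic fallback path and error handling come later in this post:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY
router = ModelRouter()

def run_task(task_type: str, prompt: str) -> str:
    cfg = router.get_config(task_type)
    resp = client.chat.completions.create(
        model=cfg['model'],
        temperature=cfg['temperature'],
        max_tokens=cfg['max_tokens'],
        timeout=cfg['timeout'],  # per-request timeout, in seconds
        messages=[{'role': 'user', 'content': prompt}],
    )
    return resp.choices[0].message.content

subject = run_task('subject_line', 'Write a cold email subject line for a logistics SaaS lead.')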
Temperature: The Setting I Ignored Too Long
I left temperature at default (0.7) for everything. Big mistake.
What Temperature Actually Does
TEMPERATURE GUIDE:

0.0 - 0.3   Deterministic. Same input → nearly same output.
            Good for: extraction, classification.
0.4 - 0.6   Balanced. Some variation, still focused.
            Good for: structured generation.
0.7 - 0.8   Creative. Natural variation.
            Good for: emails, content.
0.9 - 1.0   Very creative. Risk of going off-track.
            Good for: brainstorming, subject lines.

Real Example: Subject Lines
Temperature 0.3:
Email 1: "Quick question about your logistics software"
Email 2: "Quick question about your logistics platform"
Email 3: "Quick question about your logistics solution"All similar. Boring. Users noticed.
Temperature 0.9:
Email 1: "The hidden cost in your current logistics setup"
Email 2: "What I noticed about [Company]'s shipping approach"
Email 3: "A 3-minute read that might save you 30 hours"More variety. Occasionally weird ones, but mostly better.
The Problem: Batch Emails to the Same Company
50 leads at one company. Same context. Same prompt structure.
With fixed temperature, emails were too similar. User complained: "These all sound the same."
Fix: Vary temperature slightly in batch processing
def get_temperature_for_batch(
    base_temp: float,
    index: int,
    batch_size: int
) -> float:
    """
    Vary temperature across a batch to get natural variation.
    Base 0.7 → actual range roughly 0.63 to 0.77.
    """
    variation = 0.15  # total spread, ±0.075 from base
    offset = (index / batch_size) * variation - (variation / 2)
    return max(0.1, min(1.0, base_temp + offset))

Now emails to the same company have natural variation without going off-brand.
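A quick sanity check of the spread for a batch of 50 at base 0.7:

temps = [get_temperature_for_batch(0.7, i, 50) for i in range(50)]
print(round(min(temps), 3), round(max(temps), 3))  # 0.625 0.772

Enough drift to vary the phrasing between emails, not enough to change the overall tone.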
Response Time Reality
I measured P95 response times across models:
LATENCY (seconds):

Model             P50    P95    P99
─────────────────────────────────────
GPT-3.5 Turbo     1.1    2.8    5.2
GPT-4             3.2    8.5    15.0
GPT-4 Turbo       2.1    5.2    9.0
Claude 3 Haiku    0.8    1.5    2.5
Claude 3 Sonnet   1.8    4.2    8.0

Implications:
- Preview (user waiting): GPT-3.5 or Claude Haiku. 8 seconds is too long.
- Batch processing: GPT-4 fine. Nobody watching.
- Set realistic timeouts: GPT-4 with 5s timeout = lots of false failures.
def get_timeout(model: str) -> int:
    if 'gpt-4' in model and 'turbo' not in model:
        return 30  # It's slow, accept it
    elif 'gpt-4-turbo' in model:
        return 15
    elif 'gpt-3.5' in model:
        return 10
    elif 'haiku' in model:
        return 8
    elif 'sonnet' in model:
        return 15
    else:
        return 20

Fallback Chains: When Primary Fails
OpenAI goes down. What happens?
Before: Everything fails. Queue backs up. Users angry.
After: Automatic fallback.
FALLBACK_CHAINS = {
    'email_final': [
        {'provider': 'openai', 'model': 'gpt-4'},
        {'provider': 'anthropic', 'model': 'claude-3-sonnet'},
        {'provider': 'openai', 'model': 'gpt-3.5-turbo'},
    ],
    'email_preview': [
        {'provider': 'openai', 'model': 'gpt-3.5-turbo'},
        {'provider': 'anthropic', 'model': 'claude-3-haiku'},
    ],
    'extract_business_info': [
        {'provider': 'openai', 'model': 'gpt-4'},
        {'provider': 'anthropic', 'model': 'claude-3-sonnet'},
        # No GPT-3.5 fallback - quality too important
    ],
}

async def call_with_fallback(task_type: str, prompt: str) -> dict:
    chain = FALLBACK_CHAINS.get(
        task_type,
        FALLBACK_CHAINS['email_preview']
    )
    for option in chain:
        try:
            return await call_provider(
                option['provider'],
                option['model'],
                prompt,
                timeout=get_timeout(option['model'])
            )
        except (ProviderError, TimeoutError) as e:
            logger.warning(
                f"{option['model']} failed: {e}, trying next"
            )
            continue
    raise AllProvidersFailed(f"All models failed for {task_type}")

Key insight: Fallback quality should match task importance.
For extraction (accuracy critical), I'd rather fail than use GPT-3.5 and get wrong data.
For previews (speed critical), degraded quality is fine — user just wants to see something.
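The helpers referenced above (call_provider, ProviderError, AllProvidersFailed) aren't shown in full. Here's a minimal sketch of the dispatch, assuming the official OpenAI and Anthropic async SDKs and using asyncio.wait_for to enforce the timeout; note that real Anthropic model ids carry a version suffix (e.g. claude-3-sonnet-20240229), unlike the shorthand in the chains above:

import asyncio
import logging

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

logger = logging.getLogger(__name__)

class ProviderError(Exception):
    pass

class AllProvidersFailed(Exception):
    pass

openai_client = AsyncOpenAI()        # reads OPENAI_API_KEY
anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY

async def call_provider(provider: str, model: str, prompt: str, timeout: int) -> dict:
    """Dispatch to the right SDK, normalize the response, and map errors."""
    try:
        if provider == 'openai':
            resp = await asyncio.wait_for(
                openai_client.chat.completions.create(
                    model=model,
                    messages=[{'role': 'user', 'content': prompt}],
                ),
                timeout=timeout,
            )
            text = resp.choices[0].message.content
        else:  # anthropic
            resp = await asyncio.wait_for(
                anthropic_client.messages.create(
                    model=model,
                    max_tokens=800,
                    messages=[{'role': 'user', 'content': prompt}],
                ),
                timeout=timeout,
            )
            text = resp.content[0].text
    except asyncio.TimeoutError:
        raise TimeoutError(f"{model} timed out after {timeout}s")
    except Exception as exc:
        # Collapse SDK-specific exceptions into the one type the fallback loop expects
        raise ProviderError(str(exc)) from exc
    return {'provider': provider, 'model': model, 'text': text}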
The Cost Breakdown After Optimization
DAILY COSTS:

Task             Model     Volume   Cost
─────────────────────────────────────────
Email preview    GPT-3.5   100      $0.02
Email final      GPT-4     5,000    $15.00
Subject lines    GPT-3.5   5,000    $0.50
Extraction       GPT-4     500      $1.50
Classification   GPT-3.5   5,000    $0.25
─────────────────────────────────────────
TOTAL                               ~$17/day

Down from $180/day with "GPT-4 for everything."
Still not cheap, but sustainable for a product that needs to scale.
Key Takeaways
WHAT I LEARNED:

Match model to task           GPT-4 for classification is like hiring
                              a lawyer to sort mail.
Small models aren't free      Infrastructure costs add up. Quality
                              gaps are real. APIs are often cheaper.
Temperature is a tool         Extraction needs 0.1. Creative needs
                              0.9. Stop using defaults.
Measure latency               Users don't care how smart GPT-4 is if
                              they're waiting 10 seconds.
Build fallback chains         Provider outages happen. Have a plan.
Vary temperature in batches   Same prompt + same temperature =
                              similar outputs. Users notice.
Quality requirements drive    Not every task needs the best model.
model selection               But some do.

Related Reading
- Controlling 20,000 Requests Without Burning Money - The infrastructure side of LLM platforms
The real insight: model selection is a product decision, not a technical one. How long is the user willing to wait? What quality level do they actually need? Answer those questions first, then pick the model.
