When I started building my email generation platform, I thought model selection was simple: use the best model for everything.
GPT-4 for email generation. GPT-4 for extraction. GPT-4 for classification.
Then I ran the numbers:
- 5,000 emails/day
- ~800 tokens average per email (prompt + response)
- GPT-4: $0.03/1K input + $0.06/1K output
- Daily cost: ~$180
That's $5,400/month just for LLM calls. Not sustainable.
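For reference, here's the back-of-the-envelope version of that math. The even prompt/response split is my assumption for the estimate, not something pulled from the bill:

EMAILS_PER_DAY = 5_000
INPUT_TOKENS = 400    # ~800 tokens per email, split roughly in half
OUTPUT_TOKENS = 400
GPT4_INPUT_PER_1K = 0.03
GPT4_OUTPUT_PER_1K = 0.06

daily = EMAILS_PER_DAY * (
    INPUT_TOKENS / 1000 * GPT4_INPUT_PER_1K
    + OUTPUT_TOKENS / 1000 * GPT4_OUTPUT_PER_1K
)
print(f"${daily:.0f}/day → ${daily * 30:,.0f}/month")  # $180/day → $5,400/month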
I needed to actually think about which model to use for each task.
The Task Breakdown
My platform has different types of LLM tasks:
TASK REQUIREMENTS:

Task                       Frequency    Latency      Quality
──────────────────────────────────────────────────────────────
Email preview              100/day      <3 sec       Good enough
Final email generation     5,000/day    Don't care   High
Subject line generation    5,000/day    Don't care   Medium
Business info extraction   500/day      Don't care   High accuracy
Lead classification        5,000/day    Don't care   Medium
Content cleaning           500/day      Don't care   Low

Different requirements → different models.
Why I Didn't Self-Host Small Models
First thought: "I'll run Llama 7B locally. Free inference!"
Tried it. Here's what I learned:
The Infrastructure Math
To run a 7B parameter model with decent speed:
- Minimum GPU: RTX 3090 or better (24GB VRAM)
- Server cost: ~$150/month (cloud) or $1,500+ upfront (own hardware)
- Inference speed: ~20 tokens/second on good hardware
- Concurrent requests: 1-2 per GPU
For 5,000 emails/day with ~200 tokens output each:
SELF-HOSTING MATH:
1,000,000 output tokens/day
÷ 20 tokens/second
= 50,000 seconds ≈ 14 hours of GPU time
Need multiple GPUs for reasonable throughput.
CLOUD GPU COSTS:
AWS g4dn.xlarge: ~$0.50/hour × 14 hours = $7/day
Plus setup, maintenance, scaling headaches.
API COSTS FOR SAME WORKLOAD:
GPT-3.5: ~$2/day
Claude Haiku: ~$1.50/day

The API was cheaper AND I didn't have to manage infrastructure.
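Same comparison as a quick script. The GPT-3.5 output price here ($0.002/1K) is my assumption that reproduces the ~$2/day figure, and prompt tokens are ignored on both sides, as in the numbers above:

OUTPUT_TOKENS_PER_DAY = 5_000 * 200                 # 1,000,000 tokens
gpu_hours = OUTPUT_TOKENS_PER_DAY / 20 / 3600       # 20 tokens/sec → ~13.9 hours
self_hosted = gpu_hours * 0.50                      # g4dn.xlarge at ~$0.50/hour → ~$7/day
gpt35_api = OUTPUT_TOKENS_PER_DAY / 1000 * 0.002    # $0.002/1K output → ~$2/day
print(f"self-hosted ≈ ${self_hosted:.2f}/day, GPT-3.5 API ≈ ${gpt35_api:.2f}/day")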
The Quality Gap
I tested Llama 7B, Mistral 7B, and Phi-2 against GPT-3.5, GPT-4, and Claude Sonnet for email generation:
QUALITY COMPARISON (out of 10):

Model            Coherence   Tone Match   Instructions   Personalization
──────────────────────────────────────────────────────────────────────
GPT-4            9           9            10             9
GPT-3.5          8           8            9              7
Claude Sonnet    9           9            9              8
Llama 7B         6           5            6              4
Mistral 7B       7           6            7              5

The small models struggled with:
- Maintaining consistent tone across paragraphs
- Following complex formatting instructions
- Incorporating multiple context pieces naturally
- Not sounding robotic
For a product where email quality directly impacts results, "good enough" wasn't good enough.
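I scored the outputs by hand on those four dimensions. If you want to run a similar bake-off, a minimal sketch (the helper below is hypothetical; plug in whatever generation functions you use) is to generate the same briefs with every candidate and review the results blind:

import csv
import random

def dump_comparison(briefs: list[str], generators: dict, path: str = 'model_comparison.csv') -> None:
    """Generate one email per (brief, model) pair and write a shuffled CSV.
    `generators` maps a model label to a callable taking a brief and returning email text.
    Hide the model column while scoring to keep the review blind."""
    rows = [
        {'brief': brief, 'model': label, 'email': generate(brief)}
        for brief in briefs
        for label, generate in generators.items()
    ]
    random.shuffle(rows)  # don't review one model's outputs back to back
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['brief', 'model', 'email'])
        writer.writeheader()
        writer.writerows(rows)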
When Small Models Make Sense
They're not useless. They work for:
- Simple classification (spam/not spam)
- Basic extraction with clear patterns
- Internal tooling where the quality bar is lower
- High-volume, low-stakes tasks
- When you have fine-tuning data for your specific use case
For my use case — cold emails that represent a brand — the quality gap was too visible.
The Model Selection I Landed On
class ModelRouter:
    TASK_MODELS = {
        # User-facing, needs speed
        'email_preview': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.7,
            'max_tokens': 500,
            'timeout': 10,
        },
        # Final output, quality matters
        'email_final': {
            'model': 'gpt-4',
            'temperature': 0.7,
            'max_tokens': 800,
            'timeout': 30,
            'fallback': 'claude-3-sonnet',
        },
        # Short, creative
        'subject_line': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.9,  # More creative
            'max_tokens': 50,
            'timeout': 5,
        },
        # Accuracy critical, structured output
        'extract_business_info': {
            'model': 'gpt-4',
            'temperature': 0.1,  # Deterministic
            'max_tokens': 300,
            'timeout': 15,
        },
        # Simple yes/no type tasks
        'classify_lead': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.1,
            'max_tokens': 10,
            'timeout': 5,
        },
        # Bulk cleaning, low stakes
        'clean_content': {
            'model': 'gpt-3.5-turbo',
            'temperature': 0.1,
            'max_tokens': 200,
            'timeout': 10,
        },
    }

    def get_config(self, task_type: str) -> dict:
        return self.TASK_MODELS.get(
            task_type,
            self.TASK_MODELS['email_preview']
        )
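Here's roughly how a config gets consumed at the call site. This is a minimal sketch assuming the OpenAI Python SDK v1; the Anthropic fallback path and error handling come later in this post:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY
router = ModelRouter()

def run_task(task_type: str, prompt: str) -> str:
    cfg = router.get_config(task_type)
    resp = client.chat.completions.create(
        model=cfg['model'],
        temperature=cfg['temperature'],
        max_tokens=cfg['max_tokens'],
        timeout=cfg['timeout'],  # per-request timeout, in seconds
        messages=[{'role': 'user', 'content': prompt}],
    )
    return resp.choices[0].message.content

subject = run_task('subject_line', 'Write a cold email subject line for a logistics SaaS lead.')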
Temperature: The Setting I Ignored Too Long
I left temperature at default (0.7) for everything. Big mistake.
What Temperature Actually Does
TEMPERATURE GUIDE:

0.0 - 0.3   Deterministic. Same input → nearly same output.
            Good for: extraction, classification.
0.4 - 0.6   Balanced. Some variation, still focused.
            Good for: structured generation.
0.7 - 0.8   Creative. Natural variation.
            Good for: emails, content.
0.9 - 1.0   Very creative. Risk of going off-track.
            Good for: brainstorming, subject lines.

Real Example: Subject Lines
Temperature 0.3:
Email 1: "Quick question about your logistics software"
Email 2: "Quick question about your logistics platform"
Email 3: "Quick question about your logistics solution"All similar. Boring. Users noticed.
Temperature 0.9:
Email 1: "The hidden cost in your current logistics setup"
Email 2: "What I noticed about [Company]'s shipping approach"
Email 3: "A 3-minute read that might save you 30 hours"More variety. Occasionally weird ones, but mostly better.
The Problem: Batch Emails to the Same Company
50 leads at one company. Same context. Same prompt structure.
With fixed temperature, emails were too similar. User complained: "These all sound the same."
Fix: Vary temperature slightly in batch processing
def get_temperature_for_batch(
    base_temp: float,
    index: int,
    batch_size: int
) -> float:
    """
    Vary temperature across a batch to get natural variation.
    Base 0.7 → actual range roughly 0.63 to 0.77.
    """
    variation = 0.15  # total spread, ±0.075 from base
    offset = (index / batch_size) * variation - (variation / 2)
    return max(0.1, min(1.0, base_temp + offset))

Now emails to the same company have natural variation without going off-brand.
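A quick sanity check of the spread for a batch of 50 at base 0.7:

temps = [get_temperature_for_batch(0.7, i, 50) for i in range(50)]
print(round(min(temps), 3), round(max(temps), 3))  # 0.625 0.772

Enough drift to vary the phrasing between emails, not enough to change the overall tone.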
Response Time Reality
I measured P95 response times across models:
LATENCY (seconds):

Model             P50    P95    P99
─────────────────────────────────────
GPT-3.5 Turbo     1.1    2.8    5.2
GPT-4             3.2    8.5    15.0
GPT-4 Turbo       2.1    5.2    9.0
Claude 3 Haiku    0.8    1.5    2.5
Claude 3 Sonnet   1.8    4.2    8.0

Implications:
- Preview (user waiting): GPT-3.5 or Claude Haiku. 8 seconds is too long.
- Batch processing: GPT-4 fine. Nobody watching.
- Set realistic timeouts: GPT-4 with 5s timeout = lots of false failures.
def get_timeout(model: str) -> int:
    if 'gpt-4' in model and 'turbo' not in model:
        return 30  # It's slow, accept it
    elif 'gpt-4-turbo' in model:
        return 15
    elif 'gpt-3.5' in model:
        return 10
    elif 'haiku' in model:
        return 8
    elif 'sonnet' in model:
        return 15
    else:
        return 20

Fallback Chains: When Primary Fails
OpenAI goes down. What happens?
Before: Everything fails. Queue backs up. Users angry.
After: Automatic fallback.
FALLBACK_CHAINS = {
    'email_final': [
        {'provider': 'openai', 'model': 'gpt-4'},
        {'provider': 'anthropic', 'model': 'claude-3-sonnet'},
        {'provider': 'openai', 'model': 'gpt-3.5-turbo'},
    ],
    'email_preview': [
        {'provider': 'openai', 'model': 'gpt-3.5-turbo'},
        {'provider': 'anthropic', 'model': 'claude-3-haiku'},
    ],
    'extract_business_info': [
        {'provider': 'openai', 'model': 'gpt-4'},
        {'provider': 'anthropic', 'model': 'claude-3-sonnet'},
        # No GPT-3.5 fallback - quality too important
    ],
}

async def call_with_fallback(task_type: str, prompt: str) -> dict:
    chain = FALLBACK_CHAINS.get(
        task_type,
        FALLBACK_CHAINS['email_preview']
    )
    for option in chain:
        try:
            return await call_provider(
                option['provider'],
                option['model'],
                prompt,
                timeout=get_timeout(option['model'])
            )
        except (ProviderError, TimeoutError) as e:
            logger.warning(
                f"{option['model']} failed: {e}, trying next"
            )
            continue
    raise AllProvidersFailed(f"All models failed for {task_type}")

Key insight: Fallback quality should match task importance.
For extraction (accuracy critical), I'd rather fail than use GPT-3.5 and get wrong data.
For previews (speed critical), degraded quality is fine — user just wants to see something.
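The helpers referenced above (call_provider, ProviderError, AllProvidersFailed) aren't shown in full. Here's a minimal sketch of the dispatch, assuming the official OpenAI and Anthropic async SDKs and using asyncio.wait_for to enforce the timeout; note that real Anthropic model ids carry a version suffix (e.g. claude-3-sonnet-20240229), unlike the shorthand in the chains above:

import asyncio
import logging

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

logger = logging.getLogger(__name__)

class ProviderError(Exception):
    pass

class AllProvidersFailed(Exception):
    pass

openai_client = AsyncOpenAI()        # reads OPENAI_API_KEY
anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY

async def call_provider(provider: str, model: str, prompt: str, timeout: int) -> dict:
    """Dispatch to the right SDK, normalize the response, and map errors."""
    try:
        if provider == 'openai':
            resp = await asyncio.wait_for(
                openai_client.chat.completions.create(
                    model=model,
                    messages=[{'role': 'user', 'content': prompt}],
                ),
                timeout=timeout,
            )
            text = resp.choices[0].message.content
        else:  # anthropic
            resp = await asyncio.wait_for(
                anthropic_client.messages.create(
                    model=model,
                    max_tokens=800,
                    messages=[{'role': 'user', 'content': prompt}],
                ),
                timeout=timeout,
            )
            text = resp.content[0].text
    except asyncio.TimeoutError:
        raise TimeoutError(f"{model} timed out after {timeout}s")
    except Exception as exc:
        # Collapse SDK-specific exceptions into the one type the fallback loop expects
        raise ProviderError(str(exc)) from exc
    return {'provider': provider, 'model': model, 'text': text}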
The Cost Breakdown After Optimization
DAILY COSTS:

Task             Model     Volume   Cost
─────────────────────────────────────────
Email preview    GPT-3.5   100      $0.02
Email final      GPT-4     5,000    $15.00
Subject lines    GPT-3.5   5,000    $0.50
Extraction       GPT-4     500      $1.50
Classification   GPT-3.5   5,000    $0.25
─────────────────────────────────────────
TOTAL                               ~$17/day

Down from $180/day with "GPT-4 for everything."
Still not cheap, but sustainable for a product that needs to scale.
Key Takeaways
WHAT I LEARNED:

Match model to task           GPT-4 for classification is like hiring
                              a lawyer to sort mail.
Small models aren't free      Infrastructure costs add up. Quality
                              gaps are real. APIs are often cheaper.
Temperature is a tool         Extraction needs 0.1. Creative needs
                              0.9. Stop using defaults.
Measure latency               Users don't care how smart GPT-4 is if
                              they're waiting 10 seconds.
Build fallback chains         Provider outages happen. Have a plan.
Vary temperature in batches   Same prompt + same temperature =
                              similar outputs. Users notice.
Quality requirements drive    Not every task needs the best model.
model selection               But some do.

Related Reading
- Controlling 20,000 Requests Without Burning Money - The infrastructure side of LLM platforms
The real insight: model selection is a product decision, not a technical one. How long is the user willing to wait? What quality level do they actually need? Answer those questions first, then pick the model.
