The Disaster
We built an email system that used AI to personalize product recommendations. It worked beautifully in testing.
Then a customer complained: "Your email says this product costs $29.99. I can't find it anywhere on your site."
Because it didn't exist. The AI made it up.
Why Do LLMs "Lie"?
First, let's understand what's actually happening. LLMs don't "lie" on purpose. They have a different problem:
They don't know the difference between "true" and "sounds true".
Think of an LLM like a very confident friend who's great at conversation but terrible at facts:
YOU: "Hey, what's a good laptop under $500?"
CONFIDENT FRIEND: "Oh, the Dell XPS 15 is great at $499!"
YOU: "Really? Let me check... it's actually $1,299."
FRIEND: "Oh. Well, it SOUNDS like it could be $499, right?"

The friend isn't lying. They're pattern-matching: "Dell XPS" + "laptop recommendation" + "under $500 context" = plausible-sounding answer.
LLMs work the same way. They predict what text should come next based on patterns. Not based on truth.
What Went Wrong in Our System
Here's what our email system was doing:
STEP 1: Get user preferences
"Sarah likes home office equipment"
STEP 2: Ask LLM to write personalized email
"Write an email recommending products for Sarah"
STEP 3: Send whatever the LLM says
🚨 NO CHECKING! 🚨

The LLM saw "home office" + "product recommendation" and generated something plausible:
"Check out our ErgoPro Monitor Stand at just $29.99!"Sounds real. Looks real. Completely made up.
The Solution - Never Trust, Always Verify
The fix is simple in concept:
Treat LLM output like user input. Never trust it. Always validate.
Think of it like a newspaper editor:
REPORTER (LLM): "The mayor said taxes will drop 50%!"
EDITOR (Your code): "Source?"
REPORTER: "Uh... it sounds like something he'd say?"
EDITOR: "REJECTED. Get me a real quote."Your code needs to be the editor. Every single time.
How to Validate LLM Output
Step 1: Force Structure
Don't let the LLM write free-form text. Make it fill out a form.
Before (dangerous):
prompt = "Write an email recommending products to Sarah"
email_text = llm.generate(prompt)
send_email(email_text)  # 🚨 Could contain anything!

After (safe), forcing the model to fill out specific fields (OpenAI also supports structured outputs natively; see the sketch at the end of this step):
prompt = """
Return a JSON object with these exact fields:
- subject: string (max 60 characters)
- product_id: string (must be from the provided catalog)
- product_name: string (exact name from catalog)
- price: number (exact price from catalog)
"""
response = llm.generate(prompt)
data = json.loads(response)  # Now we can check each field!

It's like the difference between:
- "Tell me about yourself" (could say anything)
- "Fill out this application form" (must answer specific questions)
Step 2: Check Against Reality
Once you have structured data, verify every fact.
def validate_product(llm_response, real_catalog):
    """
    Check if the LLM's claims match reality.
    """
    # Does this product actually exist?
    product_id = llm_response['product_id']
    if product_id not in real_catalog:
        return False, "Product doesn't exist!"
    real_product = real_catalog[product_id]

    # Is the name correct?
    if llm_response['product_name'] != real_product['name']:
        return False, "Wrong product name!"

    # Is the price correct? (For real money, compare Decimals, not floats.)
    if llm_response['price'] != real_product['price']:
        return False, "Wrong price!"

    return True, "All checks passed!"

Visual:
LLM says: product_id = "SKU-123"
          product_name = "ErgoPro Stand"
          price = 29.99
     ↓
  VALIDATOR
     ↓
Real catalog: SKU-123 → "Monitor Arm Pro" @ $49.99
     ↓
  ❌ REJECTED!
  Name doesn't match.
  Price doesn't match.
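Plugged in with the exact data from the visual (catalog shape as used throughout this post):

catalog = {"SKU-123": {"name": "Monitor Arm Pro", "price": 49.99}}
llm_claim = {
    "product_id": "SKU-123",
    "product_name": "ErgoPro Stand",
    "price": 29.99,
}

ok, reason = validate_product(llm_claim, catalog)
print(ok, reason)  # False "Wrong product name!" - the hallucination is caught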
Step 3: Use Pydantic for Automatic Validation

Pydantic is a Python library that checks data automatically. Think of it as a bouncer at a club:
# Pydantic v2 syntax; in v1 this was `validator` instead of `field_validator`
from pydantic import BaseModel, field_validator

class ProductRecommendation(BaseModel):
    product_id: str
    product_name: str
    price: float

    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('product_name')
    @classmethod
    def name_not_empty(cls, v):
        if len(v.strip()) == 0:
            raise ValueError('Product name cannot be empty')
        return v

Now when you try to create a recommendation:
# This works ✅
rec = ProductRecommendation(
    product_id="SKU-123",
    product_name="Monitor Stand",
    price=49.99
)

# This fails automatically ❌ (raises ValidationError)
rec = ProductRecommendation(
    product_id="SKU-123",
    product_name="",   # Empty name - REJECTED!
    price=-10          # Negative price - REJECTED!
)

The bouncer (Pydantic) checks everyone at the door. Bad data doesn't get in.
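In the real pipeline you point Pydantic straight at the raw model output. A quick sketch (model_validate_json is the Pydantic v2 entry point; llm.generate is the same placeholder client as above):

from pydantic import ValidationError

raw = llm.generate(prompt)  # raw JSON string from the model
try:
    rec = ProductRecommendation.model_validate_json(raw)
except ValidationError as e:
    rec = None  # malformed JSON or bad fields - treat it like any failed attempt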
What If Validation Fails?
Sometimes the LLM will return garbage. You need a backup plan.
Option 1: Retry
Ask again, maybe with a clearer prompt:
def get_recommendation(user, catalog, max_retries=3):
    prompt = build_prompt(user, catalog)  # your prompt-construction helper
    for attempt in range(max_retries):
        response = llm.generate(prompt)
        if validate(response, catalog):
            return response
        print(f"Attempt {attempt + 1} failed, retrying...")
    # All retries failed
    return fallback_recommendation(user, catalog)
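A refinement worth trying: feed the validation error back into the next attempt so the model can self-correct. A sketch (the error-feedback wiring here is a suggestion, not something our original system did; build_prompt is the same hypothetical helper as above):

import json

def get_recommendation_with_feedback(user, catalog, max_retries=3):
    prompt = build_prompt(user, catalog)
    for attempt in range(max_retries):
        response = llm.generate(prompt)
        ok, error = validate_product(json.loads(response), catalog)
        if ok:
            return response
        # Tell the model exactly what it got wrong before the next attempt
        prompt += f"\nYour previous answer was rejected: {error} Try again."
    return fallback_recommendation(user, catalog)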
Option 2: Try a Different LLM

Different models fail in different ways:
def get_recommendation_with_fallback(user, catalog):
    prompt = build_prompt(user, catalog)

    # Try OpenAI first
    response = openai.generate(prompt)
    if validate(response, catalog):
        return response

    # OpenAI failed? Try Gemini
    response = gemini.generate(prompt)
    if validate(response, catalog):
        return response

    # Gemini failed too? Try Claude
    response = claude.generate(prompt)
    if validate(response, catalog):
        return response

    # Everyone failed - use safe template
    return safe_template(user, catalog)

Visual:
Request
   ↓
┌────────┐
│ OpenAI │ ── Validate ── ❌ Failed
└────────┘
   ↓
┌────────┐
│ Gemini │ ── Validate ── ❌ Failed
└────────┘
   ↓
┌────────┐
│ Claude │ ── Validate ── ✅ Passed!
└────────┘
   ↓
Send Email

Option 3: Safe Fallback Template
When all else fails, use a pre-written template that can't hallucinate:
def safe_template(user, catalog):
    """
    No LLM involved. Just fill in the blanks.
    Can't hallucinate because there's nothing to make up.
    """
    # Pick top 3 products based on simple rules
    top_products = get_popular_products(catalog, limit=3)
    return {
        "subject": f"Products you might like, {user.first_name}",
        "body": f"""
Hi {user.first_name},

Here are some popular items:

1. {top_products[0].name} - ${top_products[0].price}
2. {top_products[1].name} - ${top_products[1].price}
3. {top_products[2].name} - ${top_products[2].price}

Best regards
""",
        "is_personalized": False  # Flag so we know this wasn't AI-generated
    }

Not as personalized, but guaranteed accurate.
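get_popular_products does the real ranking here and never appears above. A minimal version, assuming each catalog entry carries a sales_count popularity field (that field, and the Product shape, are assumptions for illustration):

from collections import namedtuple

# Objects with the attributes safe_template() reads; sales_count is assumed
Product = namedtuple("Product", ["name", "price", "sales_count"])

def get_popular_products(catalog, limit=3):
    # Best sellers first; swap in any popularity signal you actually trust
    ranked = sorted(catalog.values(), key=lambda p: p.sales_count, reverse=True)
    return ranked[:limit]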
The Complete System
Here's how it all fits together:
def generate_email(user, catalog):
    """
    Generate a personalized email with multiple safety layers.
    """
    prompt = build_prompt(user, catalog)

    # Layer 1: Try to get AI-generated content
    for provider in [openai, gemini, claude]:
        try:
            response = provider.generate(prompt)

            # Layer 2: Parse into structured format
            data = ProductRecommendation(**json.loads(response))

            # Layer 3: Validate against real catalog
            if validate_against_catalog(data, catalog):
                return data
        except Exception as e:
            log_error(f"{provider} failed: {e}")
            continue

    # Layer 4: All AI attempts failed, use safe fallback
    log_warning("All LLMs failed, using template")
    return safe_template(user, catalog)

Think of it like airport security:
LLM Response
     ↓
          ┌─────────────────────────────┐
Layer 1:  │ Can we parse it as JSON?    │ ── No → Try another LLM
          └─────────────────────────────┘
     ↓ Yes
          ┌─────────────────────────────┐
Layer 2:  │ Does it match our schema?   │ ── No → Try another LLM
          │ (Pydantic validation)       │
          └─────────────────────────────┘
     ↓ Yes
          ┌─────────────────────────────┐
Layer 3:  │ Do the facts check out?     │ ── No → Try another LLM
          │ (Catalog validation)        │
          └─────────────────────────────┘
     ↓ Yes

  APPROVED
Send to customer
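The payoff is that the send path needs no special cases; every branch of generate_email has either passed validation or come from the static template (send_email is the same placeholder from the opening example):

content = generate_email(user, catalog)
send_email(content)  # nothing unvalidated can reach this line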
Monitoring Hallucinations

You should track how often the LLM fails:
# Simple counter
hallucination_count = 0
total_requests = 0

def track_validation(passed):
    global hallucination_count, total_requests
    total_requests += 1
    if not passed:
        hallucination_count += 1

    # Alert if too many failures
    failure_rate = hallucination_count / total_requests
    if failure_rate > 0.10:  # More than 10% failing
        send_alert("LLM hallucination rate is high!")

(One caveat: these counters never reset, so old traffic dilutes new failures. A rolling-window variant follows the list below.)

What's normal?
- 1-5% failure rate: Normal, your fallbacks handle it
- 5-10% failure rate: Worth investigating, maybe prompt needs work
- 10%+ failure rate: Something's wrong, check your prompts/model
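A rolling window keeps the alert sensitive to what's happening right now instead of averaging over your whole history. A sketch (the window size and 100-request warm-up are arbitrary choices):

from collections import deque

recent = deque(maxlen=500)  # outcomes of the last 500 requests

def track_validation(passed):
    recent.append(passed)
    if len(recent) < 100:
        return  # not enough data for a meaningful rate yet
    failure_rate = recent.count(False) / len(recent)
    if failure_rate > 0.10:
        send_alert(f"LLM hallucination rate at {failure_rate:.0%} over last {len(recent)} requests!")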
Key Lessons
Lesson 1: LLMs Are Confident Guessers
They don't know facts. They know patterns. "Sounds right" is not "is right".
Lesson 2: Treat LLM Output Like User Input
You wouldn't let users submit a form and trust whatever they typed. Don't do it with LLMs either.
Lesson 3: Structured Output Is Your Friend
Free-form text is hard to validate. JSON with specific fields is easy to validate.
Lesson 4: Always Have a Fallback
LLMs will fail. Not if, but when. Have a plan B that doesn't involve AI at all.
Quick Reference
Force structured output:
response_format = {
    "type": "json_schema",  # "json_object" alone can't enforce a schema
    "json_schema": {
        "name": "product_rec",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"product_id": {"type": "string"},
                           "product_name": {"type": "string"},
                           "price": {"type": "number"}},
            "required": ["product_id", "product_name", "price"],
            "additionalProperties": False,
        },
    },
}

Validate with Pydantic:
from pydantic import BaseModel, field_validator

class ProductRec(BaseModel):
    product_id: str
    price: float

    @field_validator('price')
    @classmethod
    def must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Invalid price')
        return v

Check against real data:
def validate(llm_data, real_catalog):
    if llm_data.product_id not in real_catalog:
        return False
    real = real_catalog[llm_data.product_id]
    return llm_data.price == real['price']

Have a fallback:
def get_content(user):
    for provider in [openai, gemini, claude]:
        result = try_provider(provider)
        if result:
            return result
    return safe_template(user)  # No AI, no hallucination

Summary
THE PROBLEM:
LLM made up a product that didn't exist
WHY IT HAPPENED:
LLMs predict "plausible" text, not "true" text
THE FIX:
1. Force structured output (JSON, not free text)
2. Validate every field (Pydantic)
3. Check against real data (catalog lookup)
4. Have fallbacks (multiple LLMs + templates)
5. Monitor failure rates
THE RESULT:
Zero hallucinated products reaching customers

The LLM is still doing the creative work. But now it has an editor checking every word.
Related Reading
- Optimizing LLM Validation Latency - How to validate without killing response times
- Retry Storms - What happens when LLM fallback chains trigger cascading retries
- Linear Algebra for Web Developers - Understanding the math behind embeddings and AI systems
