
How Do You Stop an LLM from Inventing Prices? (Preventing AI Hallucinations in Production)

#LLM #Python #AI #Pydantic #Production #OpenAI

The Disaster

We built an email system that used AI to personalize product recommendations. It worked beautifully in testing.

Then a customer complained: "Your email says this product costs $29.99. I can't find it anywhere on your site."

Because it didn't exist. The AI made it up.


Why Do LLMs "Lie"?

First, let's understand what's actually happening. LLMs don't "lie" on purpose. They have a different problem:

They don't know the difference between "true" and "sounds true".

Think of an LLM like a very confident friend who's great at conversation but terrible at facts:

YOU: "Hey, what's a good laptop under $500?"
 
CONFIDENT FRIEND: "Oh, the Dell XPS 15 is great at $499!"
 
YOU: "Really? Let me check... it's actually $1,299."
 
FRIEND: "Oh. Well, it SOUNDS like it could be $499, right?"

The friend isn't lying. They're pattern-matching. "Dell XPS" + "laptop recommendation" + "under $500 context" = plausible-sounding answer.

LLMs work the same way. They predict what text should come next based on patterns. Not based on truth.


What Went Wrong in Our System

Here's what our email system was doing:

STEP 1: Get user preferences
        "Sarah likes home office equipment"
 
STEP 2: Ask LLM to write personalized email
        "Write an email recommending products for Sarah"
 
STEP 3: Send whatever the LLM says
        🚨 NO CHECKING! 🚨

The LLM saw "home office" + "product recommendation" and generated something plausible:

"Check out our ErgoPro Monitor Stand at just $29.99!"

Sounds real. Looks real. Completely made up.


The Solution: Never Trust, Always Verify

The fix is simple in concept:

Treat LLM output like user input. Never trust it. Always validate.

Think of it like a newspaper editor:

REPORTER (LLM): "The mayor said taxes will drop 50%!"
 
EDITOR (Your code): "Source?"
 
REPORTER: "Uh... it sounds like something he'd say?"
 
EDITOR: "REJECTED. Get me a real quote."

Your code needs to be the editor. Every single time.


How to Validate LLM Output

Step 1: Force Structure

Don't let the LLM write free-form text. Make it fill out a form.

Before (dangerous):

prompt = "Write an email recommending products to Sarah"
email_text = llm.generate(prompt)
send_email(email_text)  # 🚨 Could contain anything!

After (safe): ask for structured JSON instead. (OpenAI also supports structured outputs natively; see the sketch below.)

prompt = """
Return a JSON object with these exact fields:
- subject: string (max 60 characters)
- product_id: string (must be from the provided catalog)
- product_name: string (exact name from catalog)
- price: number (exact price from catalog)
"""
 
response = llm.generate(prompt)
data = json.loads(response)  # Now we can check each field!

It's like the difference between:

  • "Tell me about yourself" (could say anything)
  • "Fill out this application form" (must answer specific questions)

Step 2: Check Against Reality

Once you have structured data, verify every fact.

def validate_product(llm_response, real_catalog):
    """
    Check if the LLM's claims match reality.
    """
 
    # Does this product actually exist?
    product_id = llm_response['product_id']
 
    if product_id not in real_catalog:
        return False, "Product doesn't exist!"
 
    real_product = real_catalog[product_id]
 
    # Is the name correct?
    if llm_response['product_name'] != real_product['name']:
        return False, "Wrong product name!"
 
    # Is the price correct?
    if llm_response['price'] != real_product['price']:
        return False, "Wrong price!"
 
    return True, "All checks passed!"

Visual:

LLM says:         product_id = "SKU-123"
                  product_name = "ErgoPro Stand"
                  price = 29.99
                        ↓
                   VALIDATOR
                        ↓
Real catalog:     SKU-123 → "Monitor Arm Pro" @ $49.99
                        ↓
                   ❌ REJECTED!
 
                   Name doesn't match.
                   Price doesn't match.
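
To make the visual concrete, here's the same check as running code, with a tiny in-memory catalog and a made-up claim:

real_catalog = {
    "SKU-123": {"name": "Monitor Arm Pro", "price": 49.99}
}

llm_claim = {
    "product_id": "SKU-123",
    "product_name": "ErgoPro Stand",
    "price": 29.99
}

ok, reason = validate_product(llm_claim, real_catalog)
print(ok, reason)  # False "Wrong product name!" -- the email never goes out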

Step 3: Use Pydantic for Automatic Validation

Pydantic is a Python library that checks data automatically. Think of it as a bouncer at a club:

from pydantic import BaseModel, field_validator

class ProductRecommendation(BaseModel):
    product_id: str
    product_name: str
    price: float

    @field_validator('price')
    @classmethod
    def price_must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('product_name')
    @classmethod
    def name_not_empty(cls, v):
        if len(v.strip()) == 0:
            raise ValueError('Product name cannot be empty')
        return v

Now when you try to create a recommendation:

# This works ✓
rec = ProductRecommendation(
    product_id="SKU-123",
    product_name="Monitor Stand",
    price=49.99
)
 
# This fails automatically ✗
rec = ProductRecommendation(
    product_id="SKU-123",
    product_name="",          # Empty name - REJECTED!
    price=-10                 # Negative price - REJECTED!
)

The bouncer (Pydantic) checks everyone at the door. Bad data doesn't get in.
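
In code, that rejection shows up as a ValidationError you can catch and route into your retry or fallback logic:

from pydantic import ValidationError

try:
    rec = ProductRecommendation(
        product_id="SKU-123",
        product_name="",
        price=-10
    )
except ValidationError as e:
    # Both validators fire; e lists every field that failed and why
    print(e)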


What If Validation Fails?

Sometimes the LLM will return garbage. You need a backup plan.

Option 1: Retry

Ask again, maybe with a clearer prompt:

def get_recommendation(user, catalog, max_retries=3):
    for attempt in range(max_retries):
        response = llm.generate(prompt)
 
        if validate(response, catalog):
            return response
 
        print(f"Attempt {attempt + 1} failed, retrying...")
 
    # All retries failed
    return fallback_recommendation(user, catalog)
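
One way to make the retry "clearer" is to feed the rejection reason back into the prompt. A sketch, assuming the response is already parsed to a dict, the validate_product function from Step 2, and a hypothetical build_prompt helper:

def get_recommendation_with_feedback(user, catalog, max_retries=3):
    prompt = build_prompt(user, catalog)  # hypothetical: renders user preferences + the real catalog

    for attempt in range(max_retries):
        response = llm.generate(prompt)  # assumed to return a parsed dict, as above
        ok, reason = validate_product(response, catalog)
        if ok:
            return response

        # Tell the model exactly why it was rejected before asking again
        prompt += f"\nYour previous answer was rejected: {reason} Use only products from the catalog."

    return fallback_recommendation(user, catalog)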

Option 2: Try a Different LLM

Different models fail in different ways:

def get_recommendation_with_fallback(user, catalog):
 
    # Try OpenAI first
    response = openai.generate(prompt)
    if validate(response, catalog):
        return response
 
    # OpenAI failed? Try Gemini
    response = gemini.generate(prompt)
    if validate(response, catalog):
        return response
 
    # Gemini failed too? Try Claude
    response = claude.generate(prompt)
    if validate(response, catalog):
        return response
 
    # Everyone failed - use safe template
    return safe_template(user, catalog)

Visual:

                    Request
                       ↓
                    ┌───────┐
                    │OpenAI │ ──→ Validate ──→ ✗ Failed
                    └───────┘
                        ↓
                    ┌───────┐
                    │Gemini │ ──→ Validate ──→ ✗ Failed
                    └───────┘
                        ↓
                    ┌───────┐
                    │Claude │ ──→ Validate ──→ ✓ Passed!
                    └───────┘
                       ↓
                   Send Email

Option 3: Safe Fallback Template

When all else fails, use a pre-written template that can't hallucinate:

def safe_template(user, catalog):
    """
    No LLM involved. Just fill in the blanks.
    Can't hallucinate because there's nothing to make up.
    """
 
    # Pick top 3 products based on simple rules
    top_products = get_popular_products(catalog, limit=3)
 
    return {
        "subject": f"Products you might like, {user.first_name}",
        "body": f"""
        Hi {user.first_name},
 
        Here are some popular items:
 
        1. {top_products[0].name} - ${top_products[0].price}
        2. {top_products[1].name} - ${top_products[1].price}
        3. {top_products[2].name} - ${top_products[2].price}
 
        Best regards,
        """,
        "is_personalized": False  # Flag so we know this wasn't AI-generated
    }

Not as personalized, but guaranteed accurate.


The Complete System

Here's how it all fits together:

def generate_email(user, catalog):
    """
    Generate a personalized email with multiple safety layers.
    """
 
    # Layer 1: Try to get AI-generated content
    for provider in [openai, gemini, claude]:
        try:
            response = provider.generate(prompt)
 
            # Layer 2: Parse into structured format
            data = ProductRecommendation(**response)
 
            # Layer 3: Validate against real catalog
            if validate_against_catalog(data, catalog):
                return data
 
        except Exception as e:
            log_error(f"{provider} failed: {e}")
            continue
 
    # Layer 4: All AI attempts failed, use safe fallback
    log_warning("All LLMs failed, using template")
    return safe_template(user, catalog)

Think of it like airport security:

                    LLM Response
                         ↓
         ┌───────────────────────────────┐
Layer 1: │   Can we parse it as JSON?    │ ──→ No → Try another LLM
         └───────────────────────────────┘
                         ↓ Yes
         ┌───────────────────────────────┐
Layer 2: │   Does it match our schema?   │ ──→ No → Try another LLM
         │   (Pydantic validation)       │
         └───────────────────────────────┘
                         ↓ Yes
         ┌───────────────────────────────┐
Layer 3: │   Do the facts check out?     │ ──→ No → Try another LLM
         │   (Catalog validation)        │
         └───────────────────────────────┘
                         ↓ Yes
                    ✅ APPROVED
                    Send to customer

Monitoring Hallucinations

You should track how often the LLM fails:

# Simple counter
hallucination_count = 0
total_requests = 0
 
def track_validation(passed):
    global hallucination_count, total_requests
    total_requests += 1
    if not passed:
        hallucination_count += 1
 
    # Alert if too many failures
    failure_rate = hallucination_count / total_requests
    if failure_rate > 0.10:  # More than 10% failing
        send_alert("LLM hallucination rate is high!")

What's normal?

  • 1-5% failure rate: Normal, your fallbacks handle it
  • 5-10% failure rate: Worth investigating, maybe prompt needs work
  • 10%+ failure rate: Something's wrong, check your prompts/model
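
One refinement: the counter above tracks the all-time rate, so a sudden spike gets diluted by months of history. A rolling window over the most recent requests reacts faster (a sketch reusing the send_alert helper from above):

from collections import deque

recent_results = deque(maxlen=200)  # keep only the last 200 validation outcomes

def track_validation_windowed(passed):
    recent_results.append(passed)

    # Wait for a minimum sample size before alerting
    if len(recent_results) >= 50:
        failure_rate = recent_results.count(False) / len(recent_results)
        if failure_rate > 0.10:
            send_alert("LLM hallucination rate is high over recent requests!")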

Key Lessons

Lesson 1: LLMs Are Confident Guessers

They don't know facts. They know patterns. "Sounds right" is not "is right".

Lesson 2: Treat LLM Output Like User Input

You wouldn't let users submit a form and trust whatever they typed. Don't do it with LLMs either.

Lesson 3: Structured Output Is Your Friend

Free-form text is hard to validate. JSON with specific fields is easy to validate.

Lesson 4: Always Have a Fallback

LLMs will fail. Not if, but when. Have a plan B that doesn't involve AI at all.


Quick Reference

Force structured output (OpenAI's structured outputs response_format for chat completions):

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "product_recommendation",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string"},
                "product_name": {"type": "string"},
                "price": {"type": "number"}
            },
            "required": ["product_id", "product_name", "price"],
            "additionalProperties": False
        }
    }
}

Validate with Pydantic:

from pydantic import BaseModel, field_validator

class ProductRec(BaseModel):
    product_id: str
    price: float

    @field_validator('price')
    @classmethod
    def must_be_positive(cls, v):
        if v <= 0:
            raise ValueError('Invalid price')
        return v

Check against real data:

def validate(llm_data, real_catalog):
    if llm_data.product_id not in real_catalog:
        return False
    real = real_catalog[llm_data.product_id]
    return llm_data.price == real['price']

Have a fallback:

def get_content(user):
    for provider in [openai, gemini, claude]:
        result = try_provider(provider)
        if result:
            return result
    return safe_template(user)  # No AI, no hallucination

Summary

THE PROBLEM:
LLM made up a product that didn't exist
 
WHY IT HAPPENED:
LLMs predict "plausible" text, not "true" text
 
THE FIX:
1. Force structured output (JSON, not free text)
2. Validate every field (Pydantic)
3. Check against real data (catalog lookup)
4. Have fallbacks (multiple LLMs + templates)
5. Monitor failure rates
 
THE RESULT:
Zero hallucinated products reaching customers

The LLM is still doing the creative work. But now it has an editor checking every word.


Aamir Shahzad

Author

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.