
How Do You Deduplicate 3 Million Records Without Running Out of Memory?

#Python #ETL #Data #Performance #Miller

The Problem That Almost Broke Our Pipeline

Every morning at 8:30 AM, our system downloaded fresh vehicle listings from MarketCheck. The files looked innocent enough: just CSV files.

Then I looked at the numbers:

Daily files: 3-4 CSV files
Rows per file: 10-15 million each
Total rows: 30-40 MILLION
Valid records after cleaning: 3-4 million

30-40 million rows. Every single day.

And here's the kicker: the same car could appear 5-10 times in a single file. Different dealers listing the same VIN, scraped at different times.

We needed to keep only the most recent listing for each VIN.


My First Attempt (The One That Crashed)

I reached for the obvious tool: Pandas.

import pandas as pd
 
# This looks reasonable, right?
df = pd.read_csv('mc_us_used_20240115.csv.gz')
df = df.sort_values('scraped_at', ascending=False)
df = df.drop_duplicates(subset='vin', keep='first')

Clean. Pythonic. What I learned in every tutorial.

I ran it locally on a test file. Worked fine.

Then I ran it on the real file:

Reading file...
Memory usage: 2GB
Memory usage: 4GB
Memory usage: 8GB
Memory usage: 12GB
Memory usage: 16GB
MemoryError: Unable to allocate 4.2 GB for an array

Our EC2 instance had 32GB of RAM. A single CSV file was trying to eat all of it.


Why Pandas Loads Everything

Here's what most people don't realize about Pandas:

WHEN YOU DO THIS:
df = pd.read_csv('huge_file.csv')
 
PANDAS DOES THIS:
 
┌─────────────────────────────────────────────┐
│                  YOUR RAM                   │
│                                             │
│  ┌───────────────────────────────────────┐  │
│  │            ENTIRE CSV FILE            │  │
│  │                                       │  │
│  │   Row 1, Row 2, Row 3, ... Row 40M    │  │
│  │                                       │  │
│  │             ALL IN MEMORY             │  │
│  │                AT ONCE                │  │
│  └───────────────────────────────────────┘  │
│                                             │
└─────────────────────────────────────────────┘

It loads EVERYTHING first, then lets you work on it.

For 40 million rows with 91 columns each? That's a lot of "everything."
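
One habit that would have saved me the crash: measure before you load. Read a small slice of the real file, check its in-memory footprint, and extrapolate. A quick sanity-check sketch (the 40 million row count is from our files; nrows is arbitrary):

import pandas as pd
 
# Sanity check before a full read: measure a small slice and extrapolate.
sample = pd.read_csv('huge_file.csv', nrows=100_000)
 
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
estimated_gb = bytes_per_row * 40_000_000 / 1e9
 
print(f"~{bytes_per_row:,.0f} bytes per row in memory")
print(f"~{estimated_gb:,.1f} GB estimated for the full file")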


The Chunking Attempt

"Okay," I thought. "I'll read in chunks."

# Read 100,000 rows at a time
chunks = pd.read_csv('huge_file.csv', chunksize=100000)
 
seen_vins = {}
for chunk in chunks:
    for _, row in chunk.iterrows():
        vin = row['vin']
        if vin not in seen_vins:
            seen_vins[vin] = row
        elif row['scraped_at'] > seen_vins[vin]['scraped_at']:
            seen_vins[vin] = row

This was slower but didn't crash immediately.

New problem: it took 4 hours to process one file.

We had 3-4 files per day. And they needed to be processed before 9 AM.

Math doesn't work.
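
For what it's worth, the chunked loop can be made much faster by letting Pandas do the per-chunk work instead of iterrows. Here's a rough sketch of that idea; it's not what we shipped, and it still has to keep one candidate row per distinct VIN in memory, which for us was millions of rows:

import pandas as pd
 
# Sketch: vectorized per-chunk dedupe. Faster than iterrows, but the 'best'
# frame still grows with the number of distinct VINs.
best = None
 
for chunk in pd.read_csv('huge_file.csv', chunksize=500_000,
                         usecols=['vin', 'scraped_at', 'id']):
    # Keep the most recent row per VIN within this chunk
    chunk = chunk.sort_values('scraped_at', ascending=False)
    chunk = chunk.drop_duplicates(subset='vin', keep='first')
 
    # Merge with the best-so-far rows and dedupe again
    best = chunk if best is None else (
        pd.concat([best, chunk])
          .sort_values('scraped_at', ascending=False)
          .drop_duplicates(subset='vin', keep='first')
    )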


Then I Found Miller

A colleague mentioned Miller (mlr), a command-line tool for CSV processing. I'd never heard of it.

The key difference: Miller streams data. It never loads the whole file.

PANDAS APPROACH:
 
File ──► [Load ALL 40M rows into RAM] ──► Process ──► Output
              Memory explodes
 
 
MILLER APPROACH:
 
File ──► [Read row 1] ──► [Process] ──► [Write] ──► [Forget]
     ──► [Read row 2] ──► [Process] ──► [Write] ──► [Forget]
     ──► [Read row 3] ──► [Process] ──► [Write] ──► [Forget]
     ...
              Constant memory usage

Miller processes row by row, never holding more than a few rows in memory at once.
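
You can get the same pattern in plain Python, by the way. Here's a minimal streaming sketch (the file names and the price filter are just for illustration): read a row, decide, write, forget.

import csv
 
# Streaming filter: one row in memory at a time, flat memory usage regardless of file size
with open('input.csv', newline='') as src, open('filtered.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
 
    for row in reader:
        if row['price'] and float(row['price']) > 0:
            writer.writerow(row)  # write it out, then let it go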


The Solution That Actually Worked

Here's what I ended up building:

Step 1: Extract Only What We Need

The CSV had 91 columns. For deduplication, I only needed 3:

mlr --icsv --ocsv cut -f vin,scraped_at,id huge_file.csv > minimal.csv

This reduced file size by 95%.

Step 2: Sort by VIN and Timestamp

mlr --csv sort -r vin,scraped_at minimal.csv > sorted.csv

The -r flag sorts both fields in descending order, so rows stay grouped by VIN and, within each VIN, the most recent scraped_at comes first.
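
To make that ordering concrete, here's the equivalent in plain Python on a few made-up rows (toy data, not from the real feed):

# Toy illustration: descending sort on (vin, scraped_at)
rows = [
    {'vin': 'VIN_A', 'scraped_at': '2024-01-14', 'id': 1},
    {'vin': 'VIN_B', 'scraped_at': '2024-01-15', 'id': 2},
    {'vin': 'VIN_A', 'scraped_at': '2024-01-15', 'id': 3},
]
 
rows.sort(key=lambda r: (r['vin'], r['scraped_at']), reverse=True)
 
# Result: VIN_B (2024-01-15), VIN_A (2024-01-15), VIN_A (2024-01-14)
# Every VIN's most recent scrape is now the first row in its group.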

Step 3: Keep First Occurrence of Each VIN

Here's where it gets clever. After sorting, I just needed the FIRST row for each VIN:

import csv
import io
import subprocess
 
def fetch_latest_vins(file):
    """Stream through sorted output, keep first occurrence of each VIN"""
 
    # Miller command: sort by vin and scraped_at (descending)
    miller_command = ['mlr', '--csv', 'sort', '-r', 'vin,scraped_at', file]
 
    process = subprocess.Popen(
        miller_command,
        stdout=subprocess.PIPE
    )
 
    # Parse Miller's CSV output as it streams in, one row at a time
    reader = csv.DictReader(io.TextIOWrapper(process.stdout, encoding='utf-8'))
 
    latest_vins = {}
    last_vin = None
 
    for row in reader:
        # If this is a new VIN, it's the latest (because we sorted descending)
        if row['vin'] != last_vin:
            latest_vins[row['vin']] = row['id']
            last_vin = row['vin']
        # Skip duplicate VINs - we already have the latest
 
    process.wait()
    return latest_vins
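
To close the loop, here's roughly how the mapping gets used afterwards; treat this as a sketch with illustrative file names, not the exact production step. A second streaming pass over the full file keeps only the winning rows, and the only thing held in memory is the set of winning IDs:

import csv
 
latest_vins = fetch_latest_vins('minimal.csv')
keep_ids = set(latest_vins.values())  # one id per VIN, a few million strings
 
# Second streaming pass: keep only the rows whose id won the dedupe
with open('huge_file.csv', newline='') as src, open('deduped.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
 
    for row in reader:
        if row['id'] in keep_ids:
            writer.writerow(row)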

The Numbers

Before (Pandas):

File size: 15GB uncompressed
Memory needed: 32GB+ (crashed)
Time: Never finished

After (Miller):

File size: 15GB uncompressed
Memory used: ~200MB (constant)
Time: 8 minutes per file

For the full daily pipeline:

Files: 3-4 per day
Total rows: 30-40 million
Clean records: 3-4 million
Total time: ~35 minutes
Memory: Never exceeded 500MB

We went from "impossible" to "done before coffee."


Why Miller Is So Fast

Three reasons:

1. Streaming, Not Loading

Miller reads and processes line by line. It never holds the whole file.

Memory usage over time:
 
Pandas:   ▁▂▃▄▅▆▇█ (keeps growing until crash)
Miller:   ─────────────────── (flat line, constant)

2. Compiled, Not Interpreted

Pandas is Python wrapping NumPy wrapping C. Miller is a single compiled binary (C historically, Go since Miller 6). There's no Python interpreter sitting in the per-row loop.

3. Purpose-Built for CSVs

Pandas does everything: DataFrames, time series, plotting, statistics. Miller does one thing: process delimited text files. It's optimized for exactly this use case.


When to Use What

USE PANDAS WHEN:
✓ File fits in memory (< 1GB usually safe)
✓ You need complex analysis, groupby, joins
✓ You're doing data science exploration
✓ You need the result as a DataFrame for further work
 
USE MILLER WHEN:
✓ Files are huge (multi-GB)
✓ You're doing simple transforms: filter, sort, cut, dedupe
✓ Memory is limited
✓ You need to process files in a pipeline (stdin/stdout)
✓ Speed matters more than flexibility
 
USE DATABASE WHEN:
✓ You need to join multiple large datasets
✓ Data needs to persist for queries
✓ Multiple processes need access
✓ You need ACID guarantees

Key Lessons

Lesson 1: Pandas Isn't Always the Answer

I defaulted to Pandas because that's what I knew. But the right tool depends on the job. For huge file transforms, streaming tools like Miller, awk, or sed often beat Pandas.

Lesson 2: Streaming > Loading

If you don't need the whole file in memory, don't put it there. Process line by line when you can.

Lesson 3: Do Less Work

We only needed 3 columns for deduplication. Extracting just those columns before sorting made everything faster.

Lesson 4: Cache Intermediate Results

Computing the VIN→ID mapping once and saving it to JSON meant we didn't redo that work during main processing.
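
A minimal version of that cache, using the fetch_latest_vins helper from earlier (the cache path and function name here are made up for illustration):

import json
import os
 
CACHE_PATH = 'latest_vins.json'  # illustrative location for the cached mapping
 
def load_or_build_vin_map(file):
    # Reuse the cached VIN -> id mapping if it exists; otherwise compute and save it
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
 
    mapping = fetch_latest_vins(file)
    with open(CACHE_PATH, 'w') as f:
        json.dump(mapping, f)
    return mapping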


Quick Reference

Install Miller:

# Ubuntu/Debian
apt-get install miller
 
# Mac
brew install miller

Common Miller Commands:

# Extract columns
mlr --csv cut -f col1,col2,col3 input.csv > output.csv
 
# Filter rows
mlr --csv filter '$price > 10000' input.csv
 
# Sort
mlr --csv sort -f year -r price input.csv  # year asc, price desc
 
# Deduplicate (keep first record per VIN)
mlr --csv head -n 1 -g vin input.csv
 
# Chain operations
mlr --csv cut -f vin,price then filter '$price > 0' then sort -f price input.csv

That's how you process 40 million rows on a machine with 32GB of RAM. No magic, just the right tool for the job.


Aamir Shahzad

Author

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.