The Problem That Almost Broke Our Pipeline
Every morning at 8:30 AM, our system downloaded fresh vehicle listings from MarketCheck. The files looked innocent enough: just CSV files.
Then I looked at the numbers:
Daily files: 3-4 CSV files
Rows per file: 10-15 million each
Total rows: 30-40 MILLION
Valid records after cleaning: 3-4 million
30-40 million rows. Every single day.
And here's the kicker: the same car could appear 5-10 times in a single file. Different dealers listing the same VIN, scraped at different times.
We needed to keep only the most recent listing for each VIN.
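To make the goal concrete, here's a toy version of the problem in plain Python (the VIN, timestamps, and prices are made up):

listings = [
    {"vin": "1HGCM82633A004352", "scraped_at": "2024-01-15T06:10:00", "price": 18500},
    {"vin": "1HGCM82633A004352", "scraped_at": "2024-01-15T07:45:00", "price": 18200},
    {"vin": "1HGCM82633A004352", "scraped_at": "2024-01-14T22:30:00", "price": 18900},
]
# Keep only the most recently scraped listing for this VIN.
# ISO-8601 timestamps compare correctly as plain strings.
latest = max(listings, key=lambda r: r["scraped_at"])
print(latest["scraped_at"])  # 2024-01-15T07:45:00

Easy for one VIN in a list. Not so easy for 30-40 million rows that don't fit in memory.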
My First Attempt (The One That Crashed)
I reached for the obvious tool: Pandas.
import pandas as pd
# This looks reasonable, right?
df = pd.read_csv('mc_us_used_20240115.csv.gz')
df = df.sort_values('scraped_at', ascending=False)
df = df.drop_duplicates(subset='vin', keep='first')
Clean. Pythonic. What I learned in every tutorial.
I ran it locally on a test file. Worked fine.
Then I ran it on the real file:
Reading file...
Memory usage: 2GB
Memory usage: 4GB
Memory usage: 8GB
Memory usage: 12GB
Memory usage: 16GB
MemoryError: Unable to allocate 4.2 GB for an array
Our EC2 instance had 32GB of RAM. A single CSV file was trying to eat all of it.
Why Pandas Loads Everything
Here's what most people don't realize about Pandas:
WHEN YOU DO THIS:
df = pd.read_csv('huge_file.csv')
PANDAS DOES THIS:
┌────────────────────────────────────────────┐
│                  YOUR RAM                  │
│                                            │
│  ┌──────────────────────────────────────┐  │
│  │           ENTIRE CSV FILE            │  │
│  │                                      │  │
│  │   Row 1, Row 2, Row 3, ... Row 40M   │  │
│  │                                      │  │
│  │            ALL IN MEMORY             │  │
│  │               AT ONCE                │  │
│  └──────────────────────────────────────┘  │
│                                            │
└────────────────────────────────────────────┘
It loads EVERYTHING first, then lets you work on it.
For 40 million rows with 91 columns each? That's a lot of "everything."
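A rough back-of-envelope check makes the crash unsurprising (a sketch assuming an optimistic 8 bytes per cell; string columns stored as Python objects cost several times more):

rows = 40_000_000
cols = 91
bytes_per_cell = 8  # optimistic: pretends every column is a small number

approx_bytes = rows * cols * bytes_per_cell
print(f"~{approx_bytes / 1e9:.0f} GB before any string overhead")  # ~29 GB

And that's the optimistic case: the real files mix in dozens of string columns, which cost several times more once Pandas stores them as Python objects.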
The Chunking Attempt
"Okay," I thought. "I'll read in chunks."
# Read 100,000 rows at a time
chunks = pd.read_csv('huge_file.csv', chunksize=100000)

seen_vins = {}
for chunk in chunks:
    for _, row in chunk.iterrows():
        vin = row['vin']
        if vin not in seen_vins:
            seen_vins[vin] = row
        elif row['scraped_at'] > seen_vins[vin]['scraped_at']:
            seen_vins[vin] = row
This was slower but didn't crash immediately.
New problem: it took 4 hours to process one file.
We had 3-4 files per day. And they needed to be processed before 9 AM.
Math doesn't work.
Then I Found Miller
A colleague mentioned Miller (mlr), a command-line tool for CSV processing. I'd never heard of it.
The key difference: Miller streams data. It never loads the whole file.
PANDAS APPROACH:
File ──► [Load ALL 40M rows into RAM] ──► Process ──► Output
Memory explodes
MILLER APPROACH:
File ──► [Read row 1] ──► [Process] ──► [Write] ──► [Forget]
     ──► [Read row 2] ──► [Process] ──► [Write] ──► [Forget]
     ──► [Read row 3] ──► [Process] ──► [Write] ──► [Forget]
     ...
Constant memory usage
Miller processes row by row, never holding more than a few rows in memory at once.
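The same idea in plain Python, as a minimal sketch (the file name is the example from earlier; this obviously isn't Miller's implementation):

import csv
import gzip

# Streaming: one row is parsed, handled, and forgotten before the next
# is read, so memory stays flat no matter how long the file is.
row_count = 0
with gzip.open('mc_us_used_20240115.csv.gz', 'rt', newline='') as f:
    for row in csv.DictReader(f):
        row_count += 1  # ...do whatever per-row work you need here
print(row_count)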
The Solution That Actually Worked
Here's what I ended up building:
Step 1: Extract Only What We Need
The CSV had 91 columns. For deduplication, I only needed 3:
mlr --icsv --ocsv cut -f vin,scraped_at,id huge_file.csv > minimal.csv
This reduced file size by 95%.
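If you'd rather drive this step from Python than from the shell, a minimal sketch looks like this (file names are placeholders, and it assumes mlr is on the PATH):

import subprocess

# Keep only the three columns needed for dedup before doing anything else.
with open("minimal.csv", "w") as out:
    subprocess.run(
        ["mlr", "--icsv", "--ocsv", "cut", "-f", "vin,scraped_at,id",
         "huge_file.csv"],
        stdout=out,
        check=True,
    )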
Step 2: Sort by VIN and Timestamp
mlr --csv sort -r vin,scraped_at minimal.csv > sorted.csv
The -r means reverse (descending) sort, so for each VIN the most recent scraped_at comes first.
Step 3: Keep First Occurrence of Each VIN
Here's where it gets clever. After sorting, I just needed the FIRST row for each VIN:
import csv
import subprocess

def fetch_latest_vins(file):
    """Stream through the sorted file, keeping the first occurrence of each VIN."""
    # Miller command: sort by vin and scraped_at (descending)
    miller_command = f"mlr --csv sort -r vin,scraped_at {file}"
    process = subprocess.Popen(
        miller_command,
        shell=True,
        stdout=subprocess.PIPE,
        text=True,
    )

    latest_vins = {}
    last_vin = None
    for row in csv.DictReader(process.stdout):
        # If this is a new VIN, it's the latest (because we sorted descending)
        if row['vin'] != last_vin:
            latest_vins[row['vin']] = row['id']
            last_vin = row['vin']
        # Skip duplicate VINs - we already have the latest

    process.wait()
    return latest_vins
The Numbers
Before (Pandas):
File size: 15GB uncompressed
Memory needed: 32GB+ (crashed)
Time: Never finished
After (Miller):
File size: 15GB uncompressed
Memory used: ~200MB (constant)
Time: 8 minutes per file
For the full daily pipeline:
Files: 3-4 per day
Total rows: 30-40 million
Clean records: 3-4 million
Total time: ~35 minutes
Memory: Never exceeded 500MB
We went from "impossible" to "done before coffee."
Why Miller Is So Fast
Three reasons:
1. Streaming, Not Loading
Miller reads and processes line by line. It never holds the whole file.
Memory usage over time:
Pandas: ╱╱╱╱╱╱╱╱╱╱╱╱╱╱╱ (keeps growing until crash)
Miller: ─────────────────── (flat line, constant)
2. Compiled, Not Interpreted
Pandas is Python wrapping NumPy wrapping C. Miller is a single compiled binary (written in C through version 5, rewritten in Go for Miller 6). No interpreter overhead in the hot loop.
3. Purpose-Built for CSVs
Pandas does everything: DataFrames, time series, plotting, statistics. Miller does one thing: process delimited text files. It's optimized for exactly this use case.
When to Use What
USE PANDAS WHEN:
✓ File fits in memory (< 1GB usually safe)
✓ You need complex analysis, groupby, joins
✓ You're doing data science exploration
✓ You need the result as a DataFrame for further work
USE MILLER WHEN:
✓ Files are huge (multi-GB)
✓ You're doing simple transforms: filter, sort, cut, dedupe
✓ Memory is limited
✓ You need to process files in a pipeline (stdin/stdout)
✓ Speed matters more than flexibility
USE DATABASE WHEN:
✓ You need to join multiple large datasets
✓ Data needs to persist for queries
✓ Multiple processes need access
✓ You need ACID guarantees
Key Lessons
Lesson 1: Pandas Isn't Always the Answer
I defaulted to Pandas because that's what I knew. But the right tool depends on the job. For huge file transforms, streaming tools like Miller, awk, or sed often beat Pandas.
Lesson 2: Streaming > Loading
If you don't need the whole file in memory, don't put it there. Process line by line when you can.
Lesson 3: Do Less Work
We only needed 3 columns for deduplication. Extracting just those columns before sorting made everything faster.
Lesson 4: Cache Intermediate Results
Computing the VIN→ID mapping once and saving it to JSON meant we didn't redo that work during main processing.
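A minimal sketch of that caching step, reusing the fetch_latest_vins function from earlier (the cache path is a placeholder):

import json
import os

CACHE_PATH = 'latest_vins.json'  # placeholder path

def load_or_build_vin_map(csv_file):
    """Reuse the VIN->ID mapping if this run already computed it."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    vin_map = fetch_latest_vins(csv_file)
    with open(CACHE_PATH, 'w') as f:
        json.dump(vin_map, f)
    return vin_map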
Quick Reference
Install Miller:
# Ubuntu/Debian
apt-get install miller
# Mac
brew install miller
Common Miller Commands:
# Extract columns
mlr --csv cut -f col1,col2,col3 input.csv > output.csv
# Filter rows
mlr --csv filter '$price > 10000' input.csv
# Sort
mlr --csv sort -f year -nr price input.csv # year ascending, price descending (numeric)
# Deduplicate (keep first)
mlr --csv head -n 1 -g vin input.csv
# Chain operations
mlr --csv cut -f vin,price then filter '$price > 0' then sort -nf price input.csv
That's how you process 40 million rows on a machine with 32GB of RAM. No magic, just the right tool for the job.
Related Reading
- Processing 200GB of Vehicle Data Daily - The full ETL pipeline that uses this deduplication
- Why Did Our Search Get Slower? - When we outgrew Postgres for searching these records
- Elasticsearch Routing by Make - How we made single-make searches 5x faster
