Back to Notes
·5 min read

The XLSX From Hell: When Shapes Break Everything

#Python#Excel#DataEngineering#Automation
The XLSX From Hell: When Shapes Break Everything

This post describes a real project I worked on. Names, specific systems, and identifying details have been changed. The technical problems and approaches are real, but code snippets are simplified illustrations not production code.


The "Simple" Request

"Can you auto-fill this Excel template with data from the database?"

I looked at the file. Beautiful report template. Corporate colors. Professional layout. And shapes rounded rectangles with KPIs, summary boxes with dynamic values, decorative elements throughout.

"Sure," I said. "Should be straightforward."

It was not straightforward.


What I Didn't Know About Excel Files

An .xlsx file is a ZIP containing XML. Unzip it and you'll find:

  • xl/worksheets/sheet1.xml the cells
  • xl/sharedStrings.xml text content (sometimes)
  • xl/drawings/drawing1.xml shapes and their text
  • Various _rels folders relationships between files

The problem: shapes and cells store text in completely different places.

When someone types {{total}} inside a rounded rectangle shape, that text isn't in the cells. It's buried in drawing XML with a different namespace entirely.


Attempt 1: openpyxl

openpyxl is the standard Python library for Excel. Obvious choice.

Result:

  • Cell text: Replaced
  • Shape text: Still showing {{placeholder}}

openpyxl handles cells beautifully. It doesn't touch drawing content. The shapes were invisible to it.


Attempt 2: Patch the Drawing XML Manually

Fine. Use openpyxl for cells, then edit the drawing XML directly. Unzip, find-replace, rezip.

Result:

  • Cell text: Replaced
  • Shape text: Replaced
  • Shapes: GONE. All shapes vanished.

What happened?


The Relationship Web

Excel files have internal references everywhere. The worksheet says "my drawings are in drawing1.xml." When openpyxl saved the file, it rewrote the relationship files but it didn't know about my manual edits. The link got severed.

The shapes were still in the ZIP. Excel just didn't know they existed anymore.


Attempt 3: Skip openpyxl, Pure XML

Don't use any Excel library. Open ZIP, modify XML directly, close ZIP.

New problem: Some cell text wasn't replacing.

The culprit: shared strings. Excel doesn't always store text in cells. Often it stores an index, and the actual text lives in sharedStrings.xml. My placeholder {{region}} wasn't in the cell it was at index 7 in the shared strings table.

I had to search multiple XML files.


Attempt 4: Proper XML Parsing with lxml

Parse the XML properly, find text nodes, replace content, serialize back.

Result:

  • Shapes shifted position
  • Some text cut off
  • Formatting broken in random places

What now?


The XML Formatting Trap

When you parse XML and serialize it back, parsers "help" by:

  • Normalizing whitespace
  • Reordering attributes
  • Adding XML declarations
  • Reformatting indentation

The XML is semantically identical but byte-different.

Excel is picky. Some positioning depends on exact byte offsets. Some attributes have implicit dependencies. The "helpful" reformatting broke things in subtle ways.

I spent two days debugging why shapes moved 3 pixels left after processing.


The Breakthrough: Don't Parse At All

After a week of fighting XML parsers, I had a realization:

I don't need to understand the XML. I just need to swap bytes.

def fill_template(template_path, data, output_path):
    # Read the ZIP
    with zipfile.ZipFile(template_path, 'r') as zin:
        files = {name: zin.read(name) for name in zin.namelist()}
 
    # Build replacements as bytes
    replacements = {
        f'{{{{{key}}}}}'.encode('utf-8'): escape_xml(str(value)).encode('utf-8')
        for key, value in data.items()
    }
 
    # Surgical replacement - no parsing
    for filename in files:
        if filename.endswith('.xml'):
            content = files[filename]
            for placeholder, value in replacements.items():
                content = content.replace(placeholder, value)
            files[filename] = content
 
    # Write the ZIP - structure unchanged
    with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zout:
        for name, content in files.items():
            zout.writestr(name, content)

That's it. The dumbest possible solution.

Cell text: Works Shape text: Works Shapes: Still there Formatting: Untouched Relationships: Never modified

By never parsing the XML, I never gave any tool the chance to "fix" it. The file structure stayed exactly as Excel created it, with only my placeholder bytes swapped out.


The Gotcha: XML Escaping

One more trap. Values with special characters need escaping:

def escape_xml(text):
    return (text
        .replace('&', '&')
        .replace('<', '&lt;')
        .replace('>', '&gt;')
        .replace('"', '&quot;')
        .replace("'", '&apos;'))

Without this, a value like Sales & Marketing becomes invalid XML and corrupts the file.


What I Learned the Hard Way

Excel files are fragile. They're not just "XML in a ZIP." They're a carefully balanced web where touching one thing breaks another.

Libraries can be too clever. openpyxl, lxml they try to help. For Excel files, that help is often destructive. Sometimes the dumb solution wins.

Test with the real application. The file might open in Python. Might even open in LibreOffice. Doesn't mean Excel accepts it.

Shapes are second-class citizens. Most Excel libraries focus on cells. If your template uses shapes, you're largely on your own.


When to Use This Approach

Do this when:

  • Template has shapes, charts, or complex formatting
  • You just need to swap placeholder text for values
  • Preserving exact formatting is critical

Don't do this when:

  • You need to add/remove rows dynamically
  • You're generating Excel from scratch
  • The template structure changes frequently

The Results

MetricBeforeAfter
Report generation2-3 hours (manual)30 seconds
Formatting issuesFrequentZero
Shape corruptionN/AZero

The finance team was happy. Their beautiful templates stayed beautiful.


Aamir Shahzad

Aamir Shahzad

Author

Software Engineer with 7+ years of experience building scalable data systems. Specializing in Django, Python, and applied AI.