In Part 2, I covered vectors and similarity. This is Week 2, where I finally understood what neural networks are actually doing when they process data.
The Question That Haunted Me
Every tutorial says "neural networks have layers." But what IS a layer?
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(128, activation='relu'),    # What does this DO?
    Dense(64, activation='relu'),     # And this?
    Dense(10, activation='softmax')
])
```

I could write this code, but I couldn't explain what was happening inside. Week 2 fixed that.
Matrices Are Transformations (Not Just Grids)
This was the biggest mindset shift.
Don't think: "A matrix is a grid of numbers."
Think: "A matrix is a machine that transforms vectors."
MATRIX AS A TRANSFORMATION:
Input vector: [3, 2]
Matrix: | 2 0 |
| 0 1 |
Output vector: [6, 2]
What happened?
The x-coordinate doubled.
The y-coordinate stayed the same.
BEFORE: AFTER:
| |
2 | * 2 | *
| / | /
1 | / 1 | /
| / | /
+-------- x +------------ x
0 1 2 3 0 1 2 3 4 5 6
The matrix STRETCHED space horizontally!

3Blue1Brown's visualization made this click. A matrix doesn't just contain numbers - it defines how to warp and reshape space.
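You can verify the numbers above in a couple of lines of NumPy:

```python
import numpy as np

# The matrix that doubles x and leaves y alone
A = np.array([[2, 0],
              [0, 1]])

v = np.array([3, 2])   # input vector
print(A @ v)           # [6 2] - x stretched, y unchanged
```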
Neural Network Layers Are Just Matrix Multiplications
Here's what a dense layer actually does:
DENSE LAYER:
Input: x = [0.5, 0.3, 0.8] (3 features)
Weights: W = | 0.2 0.1 0.4 |
| 0.3 0.5 0.1 | (2 neurons, 3 inputs)
Output: y = W * x
= | 0.2*0.5 + 0.1*0.3 + 0.4*0.8 |
| 0.3*0.5 + 0.5*0.3 + 0.1*0.8 |
= | 0.1 + 0.03 + 0.32 |
| 0.15 + 0.15 + 0.08 |
= | 0.45 |
| 0.38 |
3 dimensions → 2 dimensions!

Each row in the weight matrix is a "feature detector."
ROW 1: [0.2, 0.1, 0.4]
"I care a lot about feature 3 (0.4),
a little about feature 1 (0.2),
barely about feature 2 (0.1)"
ROW 2: [0.3, 0.5, 0.1]
"I care most about feature 2 (0.5),
somewhat about feature 1 (0.3),
barely about feature 3 (0.1)"

This is why neural networks can learn patterns: they're learning WHICH combinations of inputs matter!
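Here's that exact arithmetic in NumPy. (A real Dense layer also adds a bias vector and applies an activation; this is just the matrix multiply at its core.)

```python
import numpy as np

x = np.array([0.5, 0.3, 0.8])         # 3 input features
W = np.array([[0.2, 0.1, 0.4],        # neuron 1: its row of weights
              [0.3, 0.5, 0.1]])       # neuron 2

y = W @ x                             # (2, 3) @ (3,) -> (2,)
print(y)                              # [0.45 0.38]
```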
Eigenvalues - "What Does This Matrix Want to Do?"
I struggled with eigenvalues until I found the right question:
"If I apply this matrix over and over, what happens?"
Most vectors get rotated AND scaled. But eigenvectors are special - they only get scaled:
REGULAR VECTOR (gets rotated):
Matrix A applied to [1, 0]:
BEFORE: AFTER:
| |
| | *
*---- | /
| |/
+---- +----
Direction changed!
EIGENVECTOR (only scaled):
Matrix A applied to [1, 1]:
BEFORE: AFTER:
| * | *
| / | /
| / | /
| / | /
+---- +-----
Same direction, just longer!
This vector is special to the matrix.

The eigenvalue tells you HOW MUCH it scales.
A * v = λ * v
Where:
A = the matrix
v = eigenvector (special direction)
λ = eigenvalue (scale factor)
If λ = 2: The eigenvector gets doubled
If λ = 0.5: The eigenvector gets halved
If λ = 1: Nothing changes
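NumPy can find these special directions for you. Here's a quick sketch with an illustrative matrix of my own choosing (the diagrams above don't pin down a specific A), picked so that [1, 1] happens to be an eigenvector:

```python
import numpy as np

# Illustrative matrix - chosen so that [1, 1] is an eigenvector
A = np.array([[2, 1],
              [1, 2]])

print(A @ np.array([1, 0]))          # [2 1] - direction changed
print(A @ np.array([1, 1]))          # [3 3] - same direction, scaled by 3

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                   # 3 and 1 (order may vary)
print(eigenvectors)                  # columns are the eigenvector directions
```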
Why Eigenvalues Matter for Training
This connected to a real problem I'd seen: exploding and vanishing gradients.
TRAINING = REPEATED MATRIX MULTIPLICATIONS
Gradient flows backward through layers:
Layer 5 → Layer 4 → Layer 3 → Layer 2 → Layer 1
At each step, multiply by weight matrix.
IF EIGENVALUES > 1:
Iteration 1: gradient = 10
Iteration 2: gradient = 20
Iteration 3: gradient = 40
Iteration 4: gradient = 80
...
Iteration 50: gradient = INFINITY
"EXPLODING GRADIENTS!" - NaN errors, training crashes
IF EIGENVALUES < 1:
Iteration 1: gradient = 10
Iteration 2: gradient = 5
Iteration 3: gradient = 2.5
Iteration 4: gradient = 1.25
...
Iteration 50: gradient = 0.0000001
"VANISHING GRADIENTS!" - Early layers don't learn
IF EIGENVALUES ≈ 1:
Iteration 1: gradient = 10
Iteration 2: gradient = 10
Iteration 3: gradient = 10
...
STABLE TRAINING!

This is why weight initialization matters! Techniques like Xavier/Glorot initialization specifically try to keep those eigenvalues near 1.
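A toy NumPy experiment makes the three regimes concrete (made-up numbers, not real gradients - each "layer" here is just a scaled identity matrix):

```python
import numpy as np

def repeated_multiply(eigenvalue, steps=50):
    """Push a pretend gradient through `steps` layers, each scaling by `eigenvalue`."""
    W = eigenvalue * np.eye(2)
    g = np.array([10.0, 0.0])
    for _ in range(steps):
        g = W @ g
    return np.linalg.norm(g)

print(repeated_multiply(2.0))   # explodes: ~1e16
print(repeated_multiply(0.5))   # vanishes: ~9e-15
print(repeated_multiply(1.0))   # stable:   10.0
```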
SVD - The Compression Superpower
Singular Value Decomposition was the last piece. Any matrix can be broken into three parts:
SVD:
M = U * Σ * V^T
Where:
- U: rotation in output space
- Σ: scaling (diagonal matrix)
- V: rotation in input space
The key insight: Σ is DIAGONAL.
Σ = | σ1 0 0 0 |
| 0 σ2 0 0 |
| 0 0 σ3 0 |
| 0 0 0 σ4 |
These σ values (singular values) tell you
how "important" each dimension is.Here's what blew my mind: real data has structure.
RANDOM DATA:
σ1 = 10.2
σ2 = 9.8
σ3 = 10.1
σ4 = 9.9
All similar. Every dimension matters equally.
Can't compress.
REAL-WORLD DATA:
σ1 = 100 ← This dimension captures most variance!
σ2 = 50 ← This one captures some
σ3 = 10 ← Diminishing returns
σ4 = 0.01 ← Basically noise
First 2 dimensions capture 93% of the information!
Can compress MASSIVELY.
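You can see this in NumPy with synthetic data - build a matrix with low-rank structure plus a little noise, then reconstruct it from just the top singular values (all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real-world-ish" matrix: rank-2 structure plus a little noise
M = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 100)) \
    + 0.01 * rng.normal(size=(100, 100))

U, S, Vt = np.linalg.svd(M)
print(S[:4])          # the first two dominate, the rest are tiny

k = 2                 # keep only the top-k singular values
M_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(np.linalg.norm(M - M_approx) / np.linalg.norm(M))   # tiny relative error
```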
This Is How LoRA Works!
This is where everything connected to something practical: LoRA fine-tuning.
TRADITIONAL FINE-TUNING:
Original weights: W (4096 x 4096 = 16 million parameters)
Fine-tune ALL of them.
Problem: Need to store 16M parameters per fine-tuned model!
LoRA'S INSIGHT:
The CHANGE to weights (ΔW) has low rank!
ΔW ≈ A * B
Where:
A: (4096 x 8) = 32,768 parameters
B: (8 x 4096) = 32,768 parameters
Total: 65,536 parameters (0.4% of original!)
WHY IT WORKS (SVD thinking):
The update ΔW has structure.
It's not random - it's learning specific patterns.
Those patterns live in a low-dimensional space.
SVD tells us: low-rank ≈ compressible!

When I read the LoRA paper after learning SVD, it actually made sense. They're exploiting the fact that fine-tuning updates are low-rank.
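A minimal sketch of the shapes involved, using the 4096-dimensional layer and rank 8 from the example above (this shows the idea, not an actual LoRA implementation):

```python
import numpy as np

d, r = 4096, 8

W = np.zeros((d, d))               # stand-in for the frozen pretrained weights

A = 0.01 * np.random.randn(d, r)   # trainable low-rank factor
B = np.zeros((r, d))               # trainable low-rank factor (starts at zero)

delta_W = A @ B                    # a full 4096 x 4096 update...
W_adapted = W + delta_W            # ...built from far fewer parameters

print(A.size + B.size)             # 65,536 trainable parameters
print(W.size)                      # 16,777,216 in the full weight matrix
```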
PCA - Same Idea, Different Application
Principal Component Analysis uses the same insight:
HIGH-DIMENSIONAL DATA:
100 features per sample.
But maybe only 10 "real" patterns.
PCA PROCESS:
1. Compute covariance matrix
2. Find eigenvectors (principal components)
3. Find eigenvalues (how important each one is)
4. Keep top K eigenvectors, discard the rest
RESULT:
100 dimensions → 10 dimensions
90% of information preserved
10x compression!

This is the same idea as SVD: find the "important directions" and ignore the rest.
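Here's the recipe as a rough NumPy sketch on synthetic data (in practice you'd likely reach for sklearn.decomposition.PCA, but the steps are the same):

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 samples, 100 features, but only ~2 underlying patterns plus noise
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 100)) \
    + 0.1 * rng.normal(size=(500, 100))

# 1. Covariance matrix of the centered data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# 2-3. Eigenvectors (principal components) and eigenvalues (their importance)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]             # sort largest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the top K components and project onto them
K = 10
X_reduced = Xc @ eigenvectors[:, :K]
print(X_reduced.shape)                            # (500, 10)
print(eigenvalues[:K].sum() / eigenvalues.sum())  # fraction of variance kept
```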
My Mental Model Now
COMPLETE PICTURE:
VECTORS: Data as points in space
|
v
MATRICES: Transformations between spaces
(Neural network layers!)
|
v
EIGENVALUES: "What does this transformation naturally do?"
(Training stability)
|
v
SVD: "What's the simple structure in this data?"
(Compression, LoRA, PCA)

Everything connects. Vectors flow through matrix transformations. Eigenvalues tell us if training will be stable. SVD finds the hidden structure we can exploit.
What Actually Helped Me Learn
- 3Blue1Brown's Essence of Linear Algebra - Videos 3, 7, and 14 were game-changers.
- The LoRA paper - Reading it AFTER understanding SVD was revelatory.
- NumPy experiments - I decomposed real weight matrices and looked at the singular values. Seeing that most were tiny made SVD real.
```python
import numpy as np

# Grab the weight matrix of the first layer from a trained Keras model
W = model.layers[0].get_weights()[0]

# SVD decomposition
U, S, Vt = np.linalg.svd(W)

# Look at the singular values (returned sorted largest to smallest)
print(S[:10])    # Large!
print(S[-10:])   # Tiny!
# Most of the information lives in the top-k singular values
```

- Explaining it to others - Writing these posts forced me to truly understand.
The Practical Payoff
After two weeks:
- LoRA makes sense: I understand WHY low-rank works, not just THAT it works.
- Training bugs: When gradients explode, I think about eigenvalues and initialization.
- Dimensionality reduction: I can reason about PCA, t-SNE, embeddings - they're all related.
- Reading research: Papers that seemed impenetrable now have structure. "Apply SVD to the weight matrix" - got it!
Key Takeaways
MATRICES: Transformations, not just grids.
Each row = a feature detector.
EIGENVALUES: What the matrix "wants to do."
>1 = explode, <1 = vanish, ≈1 = stable.
SVD: Find the simple structure in any matrix.
Large singular values = important.
Small singular values = noise.
LoRA: Fine-tuning updates are low-rank.
Exploit this for massive compression.
PCA: Find the directions that matter.
Same eigenvalue/SVD math underneath.
What's Next
I'm continuing to Week 3: Optimization. How gradient descent actually works, why Adam beats SGD, and what momentum really means.
But honestly? These two weeks changed how I think about AI systems. I went from "magic black boxes" to "I see what you're doing there."
If you're building with AI and treating it as magic, take two weeks. Learn the math. It's not as hard as it looks, and the payoff is huge.
Related Reading
- Part 1: Why I'm Learning Linear Algebra - The motivation that started this journey
- Part 2: Vectors, Dot Products & Similarity - Week 1: The foundations
- Preventing LLM Hallucinations - Applying this knowledge to production AI systems
- PostgreSQL Full-Text Search Limits - When you need vector search instead
