In Part 1, I explained why I started learning linear algebra. This is the story of Week 1 when vectors stopped being abstract math and became tools I actually use.
The Confusion
I kept seeing code like this:
similarity = cosine_similarity(embedding1, embedding2)

And I'd nod along, pretending I understood. But I didn't. Not really.
What IS an embedding? Why does "similarity" involve cosine? What are we actually measuring?
I decided to stop pretending and actually learn it. Here's what clicked.
Vectors Are Just Lists (But Think of Them as Points)
The first thing 3Blue1Brown taught me: a vector is just a list of numbers. But the magic is in how you visualize it.
A VECTOR IS A POINT IN SPACE:
[3, 2] means:
  y
  |
2 |           * (3, 2)
  |         / |
1 |       /   |
  |     /     |
  +-----------+---- x
  0   1   2   3
Go 3 steps right, 2 steps up.
That's your point.

Simple enough in 2D. But here's what broke my brain at first:
AI vectors have HUNDREDS of dimensions.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("I love Python")
print(len(embedding))  # 384 dimensions!

384 numbers. Each one captures something about "I love Python." I can't visualize 384 dimensions, but I can understand the principle:
Similar meanings = nearby points.
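You can check this with real embeddings. Here's a minimal sketch reusing the model above (my word choices are illustrative; the exact scores will vary, but the ordering is what matters):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
emb = model.encode(["Python", "JavaScript", "Dog"])

# Programming words should be closer to each other than to "Dog"
print(cosine_similarity([emb[0]], [emb[1]]))  # Python vs JavaScript: higher
print(cosine_similarity([emb[0]], [emb[2]]))  # Python vs Dog: lower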
WORDS IN EMBEDDING SPACE (simplified to 2D):
        |
        |  "Python" *     * "JavaScript"
        |
        |       * "Programming"
        |
   -----+----------------------------
        |
        |     "Cat" *     * "Dog"
        |
        |         * "Pet"
        |
Programming words cluster together.
Animal words cluster together.
THAT'S why embeddings work!

Dot Product = "How Much Do These Point the Same Way?"
This was my first real aha moment.
The dot product is dead simple to calculate:
DOT PRODUCT:
[2, 3] . [4, 1]
= 2*4 + 3*1
= 8 + 3
= 11
Just multiply matching positions and add.

But WHY do we care? Because it measures similarity.
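Before getting to why, here's that arithmetic as a quick NumPy sketch:

import numpy as np

a = np.array([2, 3])
b = np.array([4, 1])

# Multiply matching positions, then add them up
print(sum(x * y for x, y in zip(a, b)))  # 11
print(np.dot(a, b))                      # 11 — NumPy's one-liner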
SAME DIRECTION = HIGH DOT PRODUCT:
Vector A: [1, 0] (pointing right)
Vector B: [1, 0] (pointing right)
A . B = 1*1 + 0*0 = 1 (maximum!)
-----> A
-----> B
They agree completely.
OPPOSITE DIRECTION = NEGATIVE DOT PRODUCT:
Vector A: [1, 0] (pointing right)
Vector B: [-1, 0] (pointing left)
A . B = 1*(-1) + 0*0 = -1 (minimum!)
-----> A
<----- B
They completely disagree.
PERPENDICULAR = ZERO DOT PRODUCT:
Vector A: [1, 0] (pointing right)
Vector B: [0, 1] (pointing up)
A . B = 1*0 + 0*1 = 0 (neutral)
|
| B
|
+-----> A
They're unrelated.

This is why attention mechanisms use dot products!
When a transformer asks "which words should I pay attention to?", it's computing dot products between query and key vectors. High dot product = "these are related, pay attention."
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

# This is basically what attention does
def simple_attention(query, keys):
    scores = np.array([np.dot(query, key) for key in keys])
    # Higher scores = more attention
    return softmax(scores)

Why We Normalize (Cosine Similarity)
There's a problem with raw dot products:
LONGER VECTORS GET BIGGER DOT PRODUCTS:
A: [1, 1]
B: [100, 100]
Both point in the same direction!
But B is 100x longer.
A . B = 100 + 100 = 200
A . A = 1 + 1 = 2
Does this mean B is "more similar" to A than A is to itself?
That's nonsense.

The fix: normalize first (make all vectors length 1).
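The fix in code — a minimal NumPy sketch using the [1, 1] vs [100, 100] example:

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)  # rescale to length 1

a = normalize(np.array([1.0, 1.0]))
b = normalize(np.array([100.0, 100.0]))
print(np.dot(a, b))  # ~1.0 — once lengths are equalized, they're identical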
This is cosine similarity:
COSINE SIMILARITY:
cos(A, B) = (A . B) / (|A| * |B|)
Where |A| = length of A = sqrt(a1^2 + a2^2 + ...)

Now we're measuring the angle between vectors, not their lengths.
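The formula translates almost directly to code (a minimal NumPy sketch):

import numpy as np

def cosine(a, b):
    # Dot product divided by both lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(np.array([1, 0]), np.array([1, 0])))   #  1.0 — same direction
print(cosine(np.array([1, 0]), np.array([0, 1])))   #  0.0 — perpendicular
print(cosine(np.array([1, 0]), np.array([-1, 0])))  # -1.0 — opposite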
COSINE SIMILARITY RESULTS:
cos = 1.0 → Same direction (identical meaning)
cos = 0.0 → Perpendicular (unrelated)
cos = -1.0 → Opposite direction (opposite meaning)

This is what I was blindly using in my code!
from sklearn.metrics.pairwise import cosine_similarity
# Now I actually understand what this does
sim = cosine_similarity([embedding1], [embedding2])
# It's measuring the angle between two meaning-vectors!

Norms = "How Big Is This?"
The last piece: norms measure the "size" of a vector.
L2 NORM (Euclidean Length):
||[3, 4]|| = sqrt(3^2 + 4^2)
= sqrt(9 + 16)
= sqrt(25)
= 5
This is the Pythagorean theorem!
Just the distance from origin to the point.

Why do we care? Loss functions use norms to measure error.
PREDICTION ERROR:
Prediction: [7, 3, 5]
Truth: [5, 8, 6]
Error: [2, -5, -1]
How "wrong" are we?
L2 Loss = ||(error)||^2
= 2^2 + (-5)^2 + (-1)^2
= 4 + 25 + 1
= 30
This is the sum of squared errors — divide by the number of entries (30 / 3 = 10) and you get Mean Squared Error!

That thing I'd been using for years, loss = MSE(predictions, targets), is just the squared L2 norm of the error vector, averaged over its entries.
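The same arithmetic in NumPy (a minimal sketch with the numbers above):

import numpy as np

print(np.linalg.norm([3, 4]))  # 5.0 — the Pythagorean check from earlier

prediction = np.array([7, 3, 5])
truth = np.array([5, 8, 6])
error = prediction - truth      # [ 2, -5, -1]

print(np.sum(error ** 2))       # 30 — squared L2 norm (sum of squared errors)
print(np.mean(error ** 2))      # 10.0 — divide by n to get MSE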
Putting It Together
Here's my mental model now:
THE AI PIPELINE:
1. CONVERT TO VECTORS
"I love Python" → [0.2, 0.8, 0.1, ..., 0.5]
Text becomes a point in high-dimensional space.
2. MEASURE SIMILARITY (DOT PRODUCT / COSINE)
"Which stored documents are near this query?"
Dot product finds related content.
3. MEASURE ERROR (NORMS)
"How wrong was the prediction?"
Norms quantify the mistake.
4. UPDATE (Next part!)
"Adjust to make less wrong next time"What Actually Helped Me Learn
I tried reading textbooks. Didn't work. Here's what did:
- 3Blue1Brown's Essence of Linear Algebra - The visual intuition I was missing. Watch the first 4 videos.
- Playing with real embeddings - I loaded Sentence Transformers and just... experimented. Computed similarities between random sentences. Saw the numbers.
- Drawing it out - I literally sketched 2D vectors on paper. Made the abstract concrete.
- Connecting to code I already write - Every time I learned a concept, I found it in my existing codebase. "Oh, THAT'S what that function does!"
The Practical Payoff
After one week of actually learning this:
- Debugging RAG: When retrieval returns wrong documents, I can check the actual similarity scores and understand WHY.
- Choosing embedding models: I understand why different models have different dimension sizes and what that means.
- Reading papers: ML papers stopped being incomprehensible. "Compute the dot product of Q and K" - got it!
Key Takeaways
VECTORS: Data as points in space.
Similar things = nearby points.
DOT PRODUCT: "How much do these agree?"
High = similar, Zero = unrelated, Negative = opposite
COSINE SIM: Dot product, but ignoring length.
Measures direction only.
NORMS: "How big is this?"
Used to measure error in loss functions.

None of this is hard. It's just... nobody explains it this way. The math notation makes it look scary, but the concepts are intuitive once you see them visually.
Next up: Part 3 - Matrices, Eigenvalues, and SVD - where I learned how neural networks actually transform data, and why LoRA fine-tuning works.
Related Reading
- Part 1: Why I'm Learning Linear Algebra - The motivation that started this journey
- Part 3: Matrices, Eigenvalues & SVD - How neural networks transform and compress data
- Preventing LLM Hallucinations - Applying this knowledge to production AI systems
