Swizzle by Hand, Swizzle by Algebra

§ 01 · The setup

One tensor, two opinions about it

The running example: one warp of 32 threads transposes a 16×32 tensor of 32-bit floats through shared memory. Shared memory is 32 banks, 4 bytes wide, dealt out round-robin: bank = (addr / 4) % 32. Within one warp access, a bank can serve one address per cycle — if two threads want different addresses in the same bank, the access serializes into extra wavefronts.

Global memory dictates both access patterns, and they disagree about the tensor:

The store thinks in rows. After a coalesced load, thread t holds column t; at register step r the warp writes row r — element (m=r, n=t), all 32 columns at once.
The read thinks in columns. To write the transposed result coalesced, thread t must fetch element (m = t mod 16, n = 2r + ⌊t/16⌋) at step r: threads 0–15 walk column 2r, threads 16–31 walk column 2r+1. (A column has only 16 elements, so the warp takes columns two at a time.)

We get to choose exactly one thing: where in shared memory element (m, n) lives. The whole game is finding an arrangement that keeps both the row-writers and the column-readers out of each other's banks.

Part I — By hand

§ 02 · The pileup

Why columns collide

row-major: offset(m, n) = 32·m + n ⇒ bank = n

Store row-major and the bank of an element is just its column index. Rows are horizontal, banks are vertical — so the store sweeps all 32 banks per step (perfect), while a logical column stacks all 16 of its elements into one bank. Reading the pair {2r, 2r+1} jams sixteen threads onto one bank and sixteen onto its neighbor: a 16-way conflict, thirty banks idle.

§ 03 · The derivation

Deriving the swizzle, step by step

Find the freedom. The store writes one row per step, and a row occupies one 32-slot stripe of shared memory. Permute the slots within each row however you like — the store still touches 32 distinct banks, because a permutation of 32 slots is still 32 distinct slots. So we own sixteen free knobs: one within-row permutation per row. Nothing we do with them can ever hurt the store.
Make the rows disagree. A column collides because all sixteen rows agree to put position n in the same place. So the per-row permutations must disagree on every position. The oldest trick is a shift: rotate row m by m slots, so a column hits banks n+m mod 32 — sixteen different banks. (Padding each row by one element does this by accident: offset = 33m + n gives bank = (n + m) mod 32.) It works, but rotation drags in adds and modulos, padding burns shared memory, and neither composes nicely with the power-of-two world everything else lives in.
Shift with XOR instead. Replace the rotation with a bit-flip: send position n of row m to slot n ⊕ m. For a fixed row this is a permutation (XOR is its own inverse), different rows get different permutations (n⊕m = n⊕m′ forces m = m′), and it costs one instruction. Watch it build: row 1 swaps neighbors (⊕1), row 2 swaps across pairs (⊕2), row 3 does both (⊕3) — the staircase below. A single column now lands in sixteen distinct banks: {n ⊕ m : m = 0…15}. By all appearances, we're done.
Run the actual read — and find the leftover. Our read is not one column; it's a pair. Column 2r occupies banks {2r ⊕ m} and column 2r+1 occupies {2r ⊕ m ⊕ 1} — and flipping the last bit of a set that already contains both parities gives back the same sixteen banks. The two half-warps double-book every bank in one half of shared memory while the other half idles: a 2-way conflict on every read step. Set the workbench below to Classic XOR m and watch the trace.
Diagnose it. The read pattern already spent the column's lowest bit: parity (n₀) is what separates thread 0–15 from thread 16–31 within the same cycle. Our swizzle XORs m₀ into that very bit — it scrambles the one coordinate the read was using to keep its threads apart. The swizzle stepped on the read's toes.
Respect the read's bits. The fix writes itself: fold the row index into column bits the read isn't using for simultaneous threads — bits 1 through 4. Shift the mask up by one:

offset(m, n) = 32·m + (n ⊕ 2m) ⇒ bank = n ⊕ 2m

Now the swizzle moves column pairs instead of columns. Parity survives: column 2r spreads over the sixteen even banks, column 2r+1 over the sixteen odd banks — disjoint by construction. Store: still a permutation per row, still perfect. Read: 32 banks, one hit each, every step. Conflict-free both ways.

So the hand-derivation distills to three rules: permute within rows (store-safe by construction), make the rows disagree (columns spread), and leave alone the bits the read assigns to simultaneous threads (pairs stay apart). Try to break the rules below.

THE SWIZZLE WORKBENCH design your own: row m's slots get XORed by mask(m)

Fold row bit … into column bits … (this is your swizzle)

Presets

Access

r = 0

show the algebra (Part II)

accessed by thread 0–15 accessed by thread 16–31 cell hue = bank ÷ 2 · darker = odd bank · hover for details

Bank hits this step

The workbench restricts each row's permutation to "XOR by a mask", with the mask depending linearly on the row bits — one toggle per (row bit, column bit) pair. That restriction is not innocent, and Part II is about why it's exactly the right one. The store verdict, you'll notice, is green no matter what you do: rule one, enforced by construction.

Part II — By algebra

§ 04 · From knobs to vectors

What we actually just did

Step back from the derivation and look at the moves. Every index in sight is a power of two — 16 rows, 32 columns, 32 threads, 32 banks — so every index is a bit-vector: the element coordinate (m, n) is 9 bits, a thread id is 5, a shared-memory offset is 9 (5 bank bits + 4 segment bits, where a segment is one 128-byte stripe of all 32 banks). And every map we built had the same special shape: each input bit has one fixed job. Flip thread bit 2 of the store and the element's n₂ flips — always, regardless of the other bits. Flip segment bit 0 of our final swizzle and m₀ and n₁ flip together. Nothing carries, nothing depends on context.

A map where flipping one input bit always XORs the same fixed pattern into the output is exactly a linear map over F₂ — the field {0, 1} where addition is XOR and multiplication is AND. Such a map is completely determined by where it sends each single-bit input, and those images, stacked side by side, are the columns of a binary matrix. This is the entire idea of a linear layout: a layout is a matrix; applying it is a matrix–vector product over F₂.

Here are our two distributed layouts, each shown twice: as the matrix, and as the layout it encodes — every tensor cell colored by its owning thread and numbered by its register. Same information, two notations; read each matrix column as "this input bit lands here", and check it against the picture.

A — the store. Color = owning thread (hue walks t₀…t₃₁; odd threads darker), number = register. Threads run across the columns — the Thr→n identity block — and registers run down the rows — the Reg→m block. Flip thread bit t₂ and ownership jumps 4 columns right, exactly the n₂ entry in column t₂. Thread 9's sixteen elements are ringed: one column, registers 0–15.

B — the read. Same encoding. Now the low thread bits run down the rows (t₀…t₃→m), the top thread bit selects column parity (the lone t₄→n₀ entry — threads 0–15 own even columns, 16–31 own odd ones), and registers stride across column pairs (r_i→n_{i+1}). Thread 9's elements are ringed: row 9 of every even column, registers 0–15.

And the memory layout — offset bits in, logical bits out. The bank bits map straight onto the column bits; the segment bits carry the row plus whatever masks you toggled in the workbench. That toggle grid was never a toy: it is, cell for cell, the off-diagonal block of this matrix. The figure below is live — go flip a workbench toggle and watch a dark cell appear here.

M — your current swizzle, live. Blue cells are the fixed identity structure (banks→columns, segments→rows); dark cells are the workbench's C block — which column bits each row bit folds into. All three named swizzles share this [[I, C], [0, I]] shape; they differ only in C.

Why bother with the matrix view? Because the operations we were doing informally become mechanical: composing two layouts is matrix multiplication, inverting a layout is Gaussian elimination over F₂, and — the payoff — "which accesses collide" becomes a question about subspaces.

§ 05 · Conflicts as geometry

A bank conflict is a subspace intersection

Two facts, both immediate from linearity:

Two elements sit in the same bank (in different segments) exactly when their offsets differ only in segment bits — that is, when their logical coordinates differ by a vector in span(M_Seg), the span of the segment columns of M.
Two elements are touched by different threads of the same access exactly when their logical coordinates differ by a vector in the span of that layout's thread columns — span(A_Thr) for the store, span(B_Thr) for the read.

So a conflict is a nonzero vector that lives in both spans: a direction that simultaneously changes which thread is asking and which segment of the same bank answers. The paper's Lemma 9.4 sharpens this into a count: the access takes exactly |span(M_Seg) ∩ span(Thr)| wavefronts — the size of the intersection subspace.

Our whole story compresses into one table:

Swizzle	M_Seg basis	∩ span(B_Thr) = span{m₀…m₃, n₀}	Read wavefronts
Unswizzled	m₀, m₁, m₂, m₃	the whole 16-element span — every seg vector is a thread vector	2⁴ = 16
XOR m	m₀⊕n₀, m₁⊕n₁, m₂⊕n₂, m₃⊕n₃	{0, m₀⊕n₀} — one bad direction survives	2¹ = 2
XOR 2m	m₀⊕n₁, m₁⊕n₂, m₂⊕n₃, m₃⊕n₄	{0} — trivial	2⁰ = 1

Sixteen, two, one: the conflict counts we measured by brute force are powers of two because they were dimensions all along. The store column is omitted because it's always trivial: every nonzero combination of seg vectors carries some m bits, and span(A_Thr) has none — the algebraic form of "within-row permutations can't hurt the store."

Don't take the table's word for it. Flip on "show the algebra" in the workbench: it lists your current M_Seg basis, computes the intersection with span(B_Thr), and prints 2^dim next to the simulated wavefront count. They agree for every one of the 2²⁰ possible toggle configurations — the lemma holds exactly, as an equality.

One more dividend before we move on: the geometry also tells us the rule of the game. A combination of seg vectors ⊕(m_j ⊕ c_j) lands in span(B_Thr) precisely when the mask part ⊕c_j equals 0 or n₀. So your swizzle reads conflict-free iff the four masks, together with n₀, are linearly independent. The classic swizzle fails because its mask set {n₀, n₁, n₂, n₃} contains n₀; the optimal one's {n₁, n₂, n₃, n₄} doesn't — and neither do plenty of others you can find in the workbench. Optimal swizzles come as a whole family. Which raises the real question: how do you construct a member of that family for layouts you can't eyeball?

Part III — The algorithm

§ 06 · The construction, replayed

Optimal swizzling as subspace search

Restate the task in the new language. We must build the memory matrix M — that is, decide which logical directions become Vec bits (contiguous, for vectorized access), Bank bits, and Seg bits — such that:

M_Vec ⊆ A_Reg ∩ B_Reg — only directions that are register directions for both layouts may be vectorized;
span(M_Vec ∪ M_Seg) meets span(A_Thr) and span(B_Thr) trivially — by Lemma 9.4, that's conflict-free in both directions.

The hard part is the seg bits: we need s independent vectors avoiding the union of two thread spans. How big can such a subspace be? The paper's Lemma 9.5 answers in general: inside F₂ᵈ, the largest subspace avoiding span(U) ∪ span(V) has dimension d − max(dim U, dim V). Here: 9 − max(5, 5) = 4 — exactly the four seg bits we need. Conflict-free is achievable, with zero slack. The algorithm of §5.4 is a recipe for manufacturing that subspace:

Vectorization

Take V = basis of A_Reg ∩ B_Reg. Vec bits are the most valuable real estate (wider loads, fewer instructions), so they're claimed first. Here span{m₀…m₃} ∩ span{n₁…n₄} = {0}: A's registers walk rows, B's walk column pairs, no shared direction — v = 0, scalar accesses. The algorithm maximizes what the layouts permit; they permit none.

Space sizes

b = log₂(128/(2ᵛ·w)) = 5 bank bits, s = 9 − 0 − 5 = 4 seg bits. The arithmetic just says: one segment = one full stripe of the 32 banks.

Transaction split

If accesses were vectorized beyond 4 bytes, the hardware would split each warp request into multiple transactions, and the thread bits it splits on get a free pass — they're removed before the conflict analysis (A_Bank, B_Bank). Here 2ᵛ·w = 4 bytes: no split, no free pass, A_Bank = A_Thr, B_Bank = B_Thr.

Sort the thread bits

Split the two thread sets three ways — shared, A-only, B-only:

I = A_Thr ∩ B_Thr = n₀ — a thread direction for both sides. Stepping a segment along n₀ would put two simultaneous threads (of either access) into the same bank. It is set aside, untouchable. This is rule three of the hand-derivation — "leave the read's bits alone" — discovered mechanically.

E = A_Thr ∖ B_Thr = n₁n₂n₃n₄ · F = B_Thr ∖ A_Thr = m₀m₁m₂m₃

Pair and XOR

Pair E with F and XOR: H = {e_i ⊕ f_i}. Why does this dodge both spans? An H vector flips a thread bit that only A uses and a thread bit that only B uses, together. Seen from A, it has an F-component A's threads can't produce; seen from B, an E-component B's threads can't produce. Stepping one segment changes the writing thread and the reading thread — which is exactly the meaning of "same bank, different segment, different thread": no conflict, by construction, for both accesses at once.

Top up and assemble

If the two thread spans don't fill the whole space, any complement basis C is also fair game (those directions are thread-invisible to both sides). Here the spans cover everything: C = ∅. We need s = 4 vectors and have |H| + |C| = 4: take all of H. M_Seg = H. (When |H| + |C| < s, conflict-free is impossible and the algorithm pads Seg from A_Bank, accepting the minimum unavoidable conflicts.)

Banks last

Complete to a basis of F₂⁹ for the bank bits; the canonical pick is n₀n₁n₂n₃n₄. Bank choice can't create conflicts — only Vec and Seg appear in the lemma — so any completion works.

The pairing, at a glance

A_Thrn₀ n₁ n₂ n₃ n₄

B_Thrn₀ m₀ m₁ m₂ m₃

↑ n₀ is in both → protected, never folded (I)

n₁⊕m₀→n₁⊕m₀

n₂⊕m₁→n₂⊕m₁

n₃⊕m₂→n₃⊕m₂

n₄⊕m₃→n₄⊕m₃

H = M_Seg · segment bit j flips m_j and n_j+1 together ⇒ offset = 32·m + (n ⊕ 2m)

And there it is — the same n ⊕ 2m we reached by staring at parities in Part I, only now nothing was stared at. Every move was forced: the intersection told us what to protect, the set differences told us what to pair, the XOR made each pair invisible to both thread spans, and a dimension count certified the result optimal before we ever simulated a single access. Note the algorithm returns one canonical member of the family ("pair in ascending order" is a convention); the workbench's independence rule describes the whole family, and the recipe is simply a constructive proof that the family is non-empty whenever Lemma 9.5 says it can be.

§ 07 · What scales

Why the algebra wins

For this example, intuition got there. But every crutch we leaned on was specific: two tidy layouts, one warp, scalar accesses, a conflict you could see in a 16×32 picture. Production kernels hand the compiler an mma fragment layout on one side and a vectorized 128-bit blocked layout on the other, across eight warps, in fp8 — and "which bits does the read use for simultaneous threads" is no longer something anyone eyeballs. The linear-layout formulation never needed the picture:

Any distributed layout — blocked, mma, wgmma, mfma, sliced — is just another matrix, so the same seven steps run unchanged on inputs no human pattern-matches.
Vectorization is just step 1, claiming A_Reg ∩ B_Reg before the seg search starts, with hardware transaction splits handled by deleting the bits hardware serializes anyway.
When conflict-free is impossible, the dimension count says so and says how close you can get — the algorithm degrades to provably-minimal conflicts instead of silently shipping a 2-way leftover.

That's the real moral of the one-bit shift. The classic swizzle is fine as far as it goes — one point in a family, picked without consulting the read layout. Linear layouts make the family — and the consultation — computable.