First we derive a conflict-free shared-memory layout for a 16×32 transpose with nothing but intuition and bit-flips. Then we rebuild the very same answer with linear layouts over F₂ — and watch the hand-derivation turn into an algorithm that works for layouts no one could eyeball.
The running example: one warp of 32 threads transposes a 16×32 tensor of 32-bit floats
through shared memory. Shared memory is 32 banks, 4 bytes wide, dealt out round-robin:
bank = (addr / 4) % 32. Within one warp access, a bank can serve one address per cycle —
if two threads want different addresses in the same bank, the access serializes into extra
wavefronts.
Global memory dictates both access patterns, and they disagree about the tensor:
We get to choose exactly one thing: where in shared memory element (m, n) lives. The whole game is finding an arrangement that keeps both the row-writers and the column-readers out of each other's banks.
Store row-major and the bank of an element is just its column index. Rows are horizontal, banks are vertical — so the store sweeps all 32 banks per step (perfect), while a logical column stacks all 16 of its elements into one bank. Reading the pair {2r, 2r+1} jams sixteen threads onto one bank and sixteen onto its neighbor: a 16-way conflict, thirty banks idle.
Now the swizzle moves column pairs instead of columns. Parity survives: column 2r spreads over the sixteen even banks, column 2r+1 over the sixteen odd banks — disjoint by construction. Store: still a permutation per row, still perfect. Read: 32 banks, one hit each, every step. Conflict-free both ways.
So the hand-derivation distills to three rules: permute within rows (store-safe by construction), make the rows disagree (columns spread), and leave alone the bits the read assigns to simultaneous threads (pairs stay apart). Try to break the rules below.
The workbench restricts each row's permutation to "XOR by a mask", with the mask depending linearly on the row bits — one toggle per (row bit, column bit) pair. That restriction is not innocent, and Part II is about why it's exactly the right one. The store verdict, you'll notice, is green no matter what you do: rule one, enforced by construction.
Step back from the derivation and look at the moves. Every index in sight is a power of two — 16 rows, 32 columns, 32 threads, 32 banks — so every index is a bit-vector: the element coordinate (m, n) is 9 bits, a thread id is 5, a shared-memory offset is 9 (5 bank bits + 4 segment bits, where a segment is one 128-byte stripe of all 32 banks). And every map we built had the same special shape: each input bit has one fixed job. Flip thread bit 2 of the store and the element's n₂ flips — always, regardless of the other bits. Flip segment bit 0 of our final swizzle and m₀ and n₁ flip together. Nothing carries, nothing depends on context.
A map where flipping one input bit always XORs the same fixed pattern into the output is exactly a linear map over F₂ — the field {0, 1} where addition is XOR and multiplication is AND. Such a map is completely determined by where it sends each single-bit input, and those images, stacked side by side, are the columns of a binary matrix. This is the entire idea of a linear layout: a layout is a matrix; applying it is a matrix–vector product over F₂.
Here are our two distributed layouts, each shown twice: as the matrix, and as the layout it encodes — every tensor cell colored by its owning thread and numbered by its register. Same information, two notations; read each matrix column as "this input bit lands here", and check it against the picture.
And the memory layout — offset bits in, logical bits out. The bank bits map straight onto the column bits; the segment bits carry the row plus whatever masks you toggled in the workbench. That toggle grid was never a toy: it is, cell for cell, the off-diagonal block of this matrix. The figure below is live — go flip a workbench toggle and watch a dark cell appear here.
Why bother with the matrix view? Because the operations we were doing informally become mechanical: composing two layouts is matrix multiplication, inverting a layout is Gaussian elimination over F₂, and — the payoff — "which accesses collide" becomes a question about subspaces.
Two facts, both immediate from linearity:
So a conflict is a nonzero vector that lives in both spans: a direction that simultaneously changes which thread is asking and which segment of the same bank answers. The paper's Lemma 9.4 sharpens this into a count: the access takes exactly |span(M_Seg) ∩ span(Thr)| wavefronts — the size of the intersection subspace.
Our whole story compresses into one table:
| Swizzle | M_Seg basis | ∩ span(B_Thr) = span{m₀…m₃, n₀} | Read wavefronts |
|---|---|---|---|
| Unswizzled | m₀, m₁, m₂, m₃ | the whole 16-element span — every seg vector is a thread vector | 2⁴ = 16 |
| XOR m | m₀⊕n₀, m₁⊕n₁, m₂⊕n₂, m₃⊕n₃ | {0, m₀⊕n₀} — one bad direction survives | 2¹ = 2 |
| XOR 2m | m₀⊕n₁, m₁⊕n₂, m₂⊕n₃, m₃⊕n₄ | {0} — trivial | 2⁰ = 1 |
Sixteen, two, one: the conflict counts we measured by brute force are powers of two because they were dimensions all along. The store column is omitted because it's always trivial: every nonzero combination of seg vectors carries some m bits, and span(A_Thr) has none — the algebraic form of "within-row permutations can't hurt the store."
Don't take the table's word for it. Flip on "show the algebra" in the workbench: it lists your current M_Seg basis, computes the intersection with span(B_Thr), and prints 2dim next to the simulated wavefront count. They agree for every one of the 2²⁰ possible toggle configurations — the lemma holds exactly, as an equality.
One more dividend before we move on: the geometry also tells us the rule of the game. A combination of seg vectors ⊕(m_j ⊕ c_j) lands in span(B_Thr) precisely when the mask part ⊕c_j equals 0 or n₀. So your swizzle reads conflict-free iff the four masks, together with n₀, are linearly independent. The classic swizzle fails because its mask set {n₀, n₁, n₂, n₃} contains n₀; the optimal one's {n₁, n₂, n₃, n₄} doesn't — and neither do plenty of others you can find in the workbench. Optimal swizzles come as a whole family. Which raises the real question: how do you construct a member of that family for layouts you can't eyeball?
Restate the task in the new language. We must build the memory matrix M — that is, decide which logical directions become Vec bits (contiguous, for vectorized access), Bank bits, and Seg bits — such that:
The hard part is the seg bits: we need s independent vectors avoiding the union of two thread spans. How big can such a subspace be? The paper's Lemma 9.5 answers in general: inside F₂ᵈ, the largest subspace avoiding span(U) ∪ span(V) has dimension d − max(dim U, dim V). Here: 9 − max(5, 5) = 4 — exactly the four seg bits we need. Conflict-free is achievable, with zero slack. The algorithm of §5.4 is a recipe for manufacturing that subspace:
Take V = basis of A_Reg ∩ B_Reg. Vec bits are the most valuable real estate (wider loads, fewer instructions), so they're claimed first. Here span{m₀…m₃} ∩ span{n₁…n₄} = {0}: A's registers walk rows, B's walk column pairs, no shared direction — v = 0, scalar accesses. The algorithm maximizes what the layouts permit; they permit none.
b = log₂(128/(2ᵛ·w)) = 5 bank bits, s = 9 − 0 − 5 = 4 seg bits. The arithmetic just says: one segment = one full stripe of the 32 banks.
If accesses were vectorized beyond 4 bytes, the hardware would split each warp request into multiple transactions, and the thread bits it splits on get a free pass — they're removed before the conflict analysis (A_Bank, B_Bank). Here 2ᵛ·w = 4 bytes: no split, no free pass, A_Bank = A_Thr, B_Bank = B_Thr.
Split the two thread sets three ways — shared, A-only, B-only:
I = A_Thr ∩ B_Thr = n₀ — a thread direction for both sides. Stepping a segment along n₀ would put two simultaneous threads (of either access) into the same bank. It is set aside, untouchable. This is rule three of the hand-derivation — "leave the read's bits alone" — discovered mechanically.
E = A_Thr ∖ B_Thr = n₁n₂n₃n₄ · F = B_Thr ∖ A_Thr = m₀m₁m₂m₃
Pair E with F and XOR: H = {e_i ⊕ f_i}. Why does this dodge both spans? An H vector flips a thread bit that only A uses and a thread bit that only B uses, together. Seen from A, it has an F-component A's threads can't produce; seen from B, an E-component B's threads can't produce. Stepping one segment changes the writing thread and the reading thread — which is exactly the meaning of "same bank, different segment, different thread": no conflict, by construction, for both accesses at once.
If the two thread spans don't fill the whole space, any complement basis C is also fair game (those directions are thread-invisible to both sides). Here the spans cover everything: C = ∅. We need s = 4 vectors and have |H| + |C| = 4: take all of H. M_Seg = H. (When |H| + |C| < s, conflict-free is impossible and the algorithm pads Seg from A_Bank, accepting the minimum unavoidable conflicts.)
Complete to a basis of F₂⁹ for the bank bits; the canonical pick is n₀n₁n₂n₃n₄. Bank choice can't create conflicts — only Vec and Seg appear in the lemma — so any completion works.
And there it is — the same n ⊕ 2m we reached by staring at parities in Part I, only now nothing was stared at. Every move was forced: the intersection told us what to protect, the set differences told us what to pair, the XOR made each pair invisible to both thread spans, and a dimension count certified the result optimal before we ever simulated a single access. Note the algorithm returns one canonical member of the family ("pair in ascending order" is a convention); the workbench's independence rule describes the whole family, and the recipe is simply a constructive proof that the family is non-empty whenever Lemma 9.5 says it can be.
For this example, intuition got there. But every crutch we leaned on was specific: two tidy layouts, one warp, scalar accesses, a conflict you could see in a 16×32 picture. Production kernels hand the compiler an mma fragment layout on one side and a vectorized 128-bit blocked layout on the other, across eight warps, in fp8 — and "which bits does the read use for simultaneous threads" is no longer something anyone eyeballs. The linear-layout formulation never needed the picture:
That's the real moral of the one-bit shift. The classic swizzle is fine as far as it goes — one point in a family, picked without consulting the read layout. Linear layouts make the family — and the consultation — computable.