The classic XOR swizzle still bank-conflicts a 16×32 transpose. The optimal-swizzling algorithm from the Linear Layouts paper finds the fix — and it differs from the textbook answer by a single bit shift.
We have a 16×32 tensor of 32-bit floats that one warp (32 threads) needs to transpose.
Shared memory on NVIDIA and AMD gfx9 GPUs is organized as 32 banks of 4 bytes, assigned by
bank = (addr / 4) % 32: consecutive 32-bit elements land in consecutive banks, and if two
threads of a warp touch different addresses in the same bank during one access, the hardware serializes
them into extra wavefronts.
The transpose uses two access patterns, fixed by what global memory wants:
The only thing left to choose is the shared-memory layout: at which offset does element (m, n) live? That choice decides whether the read is fast or sixteen times too slow.
Row-major storage makes the store perfect — each row sweep touches all 32 banks once. But every element of column n sits in bank n. Reading the column pair {2r, 2r+1} puts sixteen threads on one bank and sixteen on its neighbor: a 16-way conflict, sixteen wavefronts where one should do, with thirty banks idle.
The textbook swizzle — XOR each row's addresses by the row index — is what Lei's post illustrates. It keeps the store conflict-free (XOR by a constant permutes a row's banks) and scatters each column across sixteen distinct banks. The post is careful to say the read now has "much less" bank conflicts. Less — not none.
Watch what happens to the pair: column 2r lands in banks {2r ⊕ m} and column 2r+1 in {(2r+1) ⊕ m}, with m ∈ [0,16). Since 2r and 2r+1 share their upper bits, both sets are the same sixteen banks — the half-warps collide pairwise, every bank in that half is hit twice, and the other sixteen banks sit idle. A 2-way conflict on every single read step.
Try it yourself. Set the scope below to XOR m and Read, scrub r, and watch the trace: two hits per bank, sixteen banks dark. Then switch to XOR 2m.
In the shared memory view, rows are 128-byte segments and columns are the physical banks — conflicts appear literally, as two rings stacked in one column. The cell label is the element's column index n (the segment index already equals its row m under all three maps).
The paper's §5.4 algorithm takes the two distributed layouts — write layout A, read layout B — and mechanically produces a memory layout M that maximizes vectorization and provably minimizes bank conflicts for both. Everything happens in F₂⁹: flatten the 16×32 tensor with n minor, so the basis vectors are the column bits n0n1n2n3n4 and the row bits m0m1m2m3.
Written as the sets of matrix columns acting on each hardware dimension:
A (store: thread → column, register → row):
A_Reg = m0m1m2m3
· A_Thr = n0n1n2n3n4
B (read: low thread bits → row, top thread bit → column parity, registers → column pairs):
B_Reg = n1n2n3n4
· B_Thr = m0m1m2m3n0
Note the highlighted vector: the read's top thread bit maps to n0, the column parity. That one detail turns out to be the whole story.
V is a basis of A_Reg ∩ B_Reg = span{m0…m3} ∩ span{n1…n4} = {0}, so v = 0. A's registers run down rows, B's run across column pairs — there is no shared register direction, so accesses are scalar 32-bit. The algorithm can't vectorize what the layouts forbid; it maximizes the rest.
Bank bits b = log₂(128 / (2ᵛ·w)) = log₂(128/4) = 5 (32 banks), segment bits s = d − v − b = 9 − 0 − 5 = 4 (16 segments of 128 bytes — one row of banks each).
Remove the last log₂ max(1, 2ᵛ·w/4) = log₂ 1 = 0 vectors from the thread sets: A_Bank = A_Thr, B_Bank = B_Thr. Each thread's access is exactly 4 bytes, so the hardware never splits a warp request here — unlike the vectorized toy example in the paper's Figure 5, nothing gets a free pass.
I = A_Bank ∩ B_Bank = n0 — a thread bit of both layouts, so it is set aside and protected.
E = A_Bank ∖ B_Bank = n1n2n3n4 · F = B_Bank ∖ A_Bank = m0m1m2m3
Pair them in ascending order and XOR: H = n1⊕m0n2⊕m1n3⊕m2n4⊕m3. Each H vector mixes an A-only thread direction with a B-only one, so stepping along it changes thread identity in both layouts at once — the algebraic recipe for "different threads, different segments."
P = span(A_Bank) ∪ span(B_Bank), and together those bases span all of F₂⁹, so the complementing basis is empty: C = ∅.
|H| + |C| = 4 + 0 = s exactly — conflict-free is achievable with zero slack. Take all of it: M_Seg = H, i.e. segment bit j ↦ nj+1 ⊕ mj.
Complete M_Seg to a basis of F₂⁹; the canonical choice is M_Bank = n0n1n2n3n4 (bank bit i ↦ ni). M_Vec is empty.
Reading the map off: at (bank k, segment σ), the segment bits contribute m = σ plus a column contribution of 2σ, the bank bits contribute k; so the stored element is (m, n) = (σ, k ⊕ 2σ). Inverting:
Full pass over the tensor: 16 store wavefronts + 16 read wavefronts. The data-volume floor. Nothing left on the table.
Why does the textbook n ⊕ m fall short here? Write it as a linear layout and its segment basis is {m0⊕n0, m1⊕n1, m2⊕n2, m3⊕n3} — and the first vector, m0⊕n0, lies inside span(B_Thr) = span{m0…m3, n0}. By the paper's Lemma 9.4 that is precisely a guaranteed extra wavefront on every read: two read threads that differ by exactly that vector land in different segments of the same bank.
The algorithm can never make this mistake, structurally: n0 sits in I, the intersection of both thread sets, so it is excluded from E and F and never gets XORed into anything. Only A-unique thread bits get paired with B-unique ones. Intuitively — the read assigns column parity to a thread bit, so a good swizzle must preserve column parity. Folding the row index into n's bits 1–4 instead of bits 0–3 does exactly that, and that one-bit shift is the entire distance between "much less" conflicts and none.
| Shared-memory map | Bank of (m, n) | Store row | Read column pair | Full pass |
|---|---|---|---|---|
| Unswizzled | n | conflict-free | 16-way | 16 + 256 wavefronts |
| XOR m (classic) | n ⊕ m | conflict-free | 2-way | 16 + 32 wavefronts |
| XOR 2m (algorithm) | n ⊕ 2m | conflict-free | conflict-free | 16 + 16 wavefronts |
That is the quiet argument for doing layouts with linear algebra over F₂ rather than by pattern-matching: the classic swizzle is a fine shape of answer, but the optimal instance depends on both access patterns at once — and the algebra finds it mechanically, for any pair of layouts you hand it.