One Bit Off — deriving the optimal swizzle for a 16×32 transpose

§ 01 · The setup

A transpose through shared memory

We have a 16×32 tensor of 32-bit floats that one warp (32 threads) needs to transpose. Shared memory on NVIDIA and AMD gfx9 GPUs is organized as 32 banks of 4 bytes, assigned by bank = (addr / 4) % 32: consecutive 32-bit elements land in consecutive banks, and if two threads of a warp touch different addresses in the same bank during one access, the hardware serializes them into extra wavefronts.

The transpose uses two access patterns, fixed by what global memory wants:

Store, row-wise. Coming from a coalesced global load, thread t owns column t; register step r walks down the rows. Each step, the warp writes one full row of 32 elements: element (m=r, n=t).
Read, column pairs. To write the transposed result coalesced, thread t must read element (m = t mod 16, n = 2r + ⌊t/16⌋): threads 0–15 take column 2r, threads 16–31 take column 2r+1.

The only thing left to choose is the shared-memory layout: at which offset does element (m, n) live? That choice decides whether the read is fast or sixteen times too slow.

§ 02 · The problem

Unswizzled: a sixteen-way pileup

offset(m, n) = 32·m + n ⇒ bank = n

Row-major storage makes the store perfect — each row sweep touches all 32 banks once. But every element of column n sits in bank n. Reading the column pair {2r, 2r+1} puts sixteen threads on one bank and sixteen on its neighbor: a 16-way conflict, sixteen wavefronts where one should do, with thirty banks idle.

§ 03 · The classic fix — almost

XOR the row in: from 16-way to 2-way

offset(m, n) = 32·m + (n ⊕ m) ⇒ bank = n ⊕ m

The textbook swizzle — XOR each row's addresses by the row index — is what Lei's post illustrates. It keeps the store conflict-free (XOR by a constant permutes a row's banks) and scatters each column across sixteen distinct banks. The post is careful to say the read now has "much less" bank conflicts. Less — not none.

Watch what happens to the pair: column 2r lands in banks {2r ⊕ m} and column 2r+1 in {(2r+1) ⊕ m}, with m ∈ [0,16). Since 2r and 2r+1 share their upper bits, both sets are the same sixteen banks — the half-warps collide pairwise, every bank in that half is hit twice, and the other sixteen banks sit idle. A 2-way conflict on every single read step.

Try it yourself. Set the scope below to XOR m and Read, scrub r, and watch the trace: two hits per bank, sixteen banks dark. Then switch to XOR 2m.

THE BANK SCOPE 16×32 · fp32 · 32 banks × 4 B · one warp

Swizzle

Access

r = 0

View

accessed by thread 0–15 accessed by thread 16–31 cell label = bank · hue = bank ÷ 2 · darker = odd bank

Bank hits this step (height = threads landing on the bank)

In the shared memory view, rows are 128-byte segments and columns are the physical banks — conflicts appear literally, as two rings stacked in one column. The cell label is the element's column index n (the segment index already equals its row m under all three maps).

§ 04 · The construction

Running the optimal-swizzling algorithm

The paper's §5.4 algorithm takes the two distributed layouts — write layout A, read layout B — and mechanically produces a memory layout M that maximizes vectorization and provably minimizes bank conflicts for both. Everything happens in F₂⁹: flatten the 16×32 tensor with n minor, so the basis vectors are the column bits n₀n₁n₂n₃n₄ and the row bits m₀m₁m₂m₃.

Written as the sets of matrix columns acting on each hardware dimension:

A (store: thread → column, register → row):
A_Reg = m₀m₁m₂m₃ · A_Thr = n₀n₁n₂n₃n₄

B (read: low thread bits → row, top thread bit → column parity, registers → column pairs):
B_Reg = n₁n₂n₃n₄ · B_Thr = m₀m₁m₂m₃n₀

Note the highlighted vector: the read's top thread bit maps to n₀, the column parity. That one detail turns out to be the whole story.

Vectorization

V is a basis of A_Reg ∩ B_Reg = span{m₀…m₃} ∩ span{n₁…n₄} = {0}, so v = 0. A's registers run down rows, B's run across column pairs — there is no shared register direction, so accesses are scalar 32-bit. The algorithm can't vectorize what the layouts forbid; it maximizes the rest.

Space sizes

Bank bits b = log₂(128 / (2ᵛ·w)) = log₂(128/4) = 5 (32 banks), segment bits s = d − v − b = 9 − 0 − 5 = 4 (16 segments of 128 bytes — one row of banks each).

Transaction split

Remove the last log₂ max(1, 2ᵛ·w/4) = log₂ 1 = 0 vectors from the thread sets: A_Bank = A_Thr, B_Bank = B_Thr. Each thread's access is exactly 4 bytes, so the hardware never splits a warp request here — unlike the vectorized toy example in the paper's Figure 5, nothing gets a free pass.

Shared vs. unique

I = A_Bank ∩ B_Bank = n₀ — a thread bit of both layouts, so it is set aside and protected.

E = A_Bank ∖ B_Bank = n₁n₂n₃n₄ · F = B_Bank ∖ A_Bank = m₀m₁m₂m₃

Pair them in ascending order and XOR: H = n₁⊕m₀n₂⊕m₁n₃⊕m₂n₄⊕m₃. Each H vector mixes an A-only thread direction with a B-only one, so stepping along it changes thread identity in both layouts at once — the algebraic recipe for "different threads, different segments."

Complement

P = span(A_Bank) ∪ span(B_Bank), and together those bases span all of F₂⁹, so the complementing basis is empty: C = ∅.

Segment bits

|H| + |C| = 4 + 0 = s exactly — conflict-free is achievable with zero slack. Take all of it: M_Seg = H, i.e. segment bit j ↦ n_j+1 ⊕ m_j.

Bank bits

Complete M_Seg to a basis of F₂⁹; the canonical choice is M_Bank = n₀n₁n₂n₃n₄ (bank bit i ↦ n_i). M_Vec is empty.

§ 05 · The result

offset = 32·m + (n ⊕ 2m)

Reading the map off: at (bank k, segment σ), the segment bits contribute m = σ plus a column contribution of 2σ, the bank bits contribute k; so the stored element is (m, n) = (σ, k ⊕ 2σ). Inverting:

offset(m, n) = 32·m + (n ⊕ 2m) ⇒ bank = n ⊕ 2m

The memory layout M as a 9×9 matrix over F₂: columns are the offset bits (Bank k₀–k₄, Segment s₀–s₃), rows are the logical bits (n₀–n₄, m₀–m₃). The identity block sends banks to columns; the segment block carries each H vector — one foot in n, one in m, shifted by a bit.

Verification

Store, step r: the warp writes row r into segment r at banks {n ⊕ 2r : n = 0…31} — all 32 banks, exactly once.
Read, step r: threads 0–15 read (m, 2r) at banks 2r ⊕ 2m — the sixteen even banks, once each — while threads 16–31 read (m, 2r+1) at banks (2r+1) ⊕ 2m — the sixteen odd banks. All 32 banks, every cycle. Conflict-free.
Lemma 9.4 check: span(M_Seg) intersects both span(A_Thr) and span(B_Thr) trivially — every nonzero combination of H vectors carries an m-component outside span{n₀} and an n-component outside span{m₀…m₃, n₀}.
Legacy parameters: this M is exactly the mma swizzle with vec=2, per_phase=1, max_phase=16 — Lei's version is its vec=1 sibling.

Full pass over the tensor: 16 store wavefronts + 16 read wavefronts. The data-volume floor. Nothing left on the table.

§ 06 · Why one bit matters

The shared thread bit n₀

Why does the textbook n ⊕ m fall short here? Write it as a linear layout and its segment basis is {m₀⊕n₀, m₁⊕n₁, m₂⊕n₂, m₃⊕n₃} — and the first vector, m₀⊕n₀, lies inside span(B_Thr) = span{m₀…m₃, n₀}. By the paper's Lemma 9.4 that is precisely a guaranteed extra wavefront on every read: two read threads that differ by exactly that vector land in different segments of the same bank.

The algorithm can never make this mistake, structurally: n₀ sits in I, the intersection of both thread sets, so it is excluded from E and F and never gets XORed into anything. Only A-unique thread bits get paired with B-unique ones. Intuitively — the read assigns column parity to a thread bit, so a good swizzle must preserve column parity. Folding the row index into n's bits 1–4 instead of bits 0–3 does exactly that, and that one-bit shift is the entire distance between "much less" conflicts and none.

Shared-memory map	Bank of (m, n)	Store row	Read column pair	Full pass
Unswizzled	n	conflict-free	16-way	16 + 256 wavefronts
XOR m (classic)	n ⊕ m	conflict-free	2-way	16 + 32 wavefronts
XOR 2m (algorithm)	n ⊕ 2m	conflict-free	conflict-free	16 + 16 wavefronts

That is the quiet argument for doing layouts with linear algebra over F₂ rather than by pattern-matching: the classic swizzle is a fine shape of answer, but the optimal instance depends on both access patterns at once — and the algebra finds it mechanically, for any pair of layouts you hand it.