Linear layouts · bank conflicts · F₂

One Bit Off

The classic XOR swizzle still bank-conflicts a 16×32 transpose. The optimal-swizzling algorithm from the Linear Layouts paper finds the fix — and it differs from the textbook answer by a single bit shift.

Where columns 0 and 1 live in shared memory, by bank (one square per row, colored by bank; darker = odd bank). Unswizzled, each column piles into one bank. XOR m spreads each column — but both columns into the same sixteen banks. XOR 2m sends column 0 to the even banks and column 1 to the odd banks: thirty-two banks, no overlap.
§ 01 · The setup

A transpose through shared memory

We have a 16×32 tensor of 32-bit floats that one warp (32 threads) needs to transpose. Shared memory on NVIDIA and AMD gfx9 GPUs is organized as 32 banks of 4 bytes, assigned by bank = (addr / 4) % 32: consecutive 32-bit elements land in consecutive banks, and if two threads of a warp touch different addresses in the same bank during one access, the hardware serializes them into extra wavefronts.

The transpose uses two access patterns, fixed by what global memory wants:

  • Store, row-wise. Coming from a coalesced global load, thread t owns column t; register step r walks down the rows. Each step, the warp writes one full row of 32 elements: element (m=r, n=t).
  • Read, column pairs. To write the transposed result coalesced, thread t must read element (m = t mod 16, n = 2r + ⌊t/16⌋): threads 0–15 take column 2r, threads 16–31 take column 2r+1.

The only thing left to choose is the shared-memory layout: at which offset does element (m, n) live? That choice decides whether the read is fast or sixteen times too slow.

§ 02 · The problem

Unswizzled: a sixteen-way pileup

offset(m, n) = 32·m + n    ⇒  bank = n

Row-major storage makes the store perfect — each row sweep touches all 32 banks once. But every element of column n sits in bank n. Reading the column pair {2r, 2r+1} puts sixteen threads on one bank and sixteen on its neighbor: a 16-way conflict, sixteen wavefronts where one should do, with thirty banks idle.

§ 03 · The classic fix — almost

XOR the row in: from 16-way to 2-way

offset(m, n) = 32·m + (n ⊕ m)    ⇒  bank = n ⊕ m

The textbook swizzle — XOR each row's addresses by the row index — is what Lei's post illustrates. It keeps the store conflict-free (XOR by a constant permutes a row's banks) and scatters each column across sixteen distinct banks. The post is careful to say the read now has "much less" bank conflicts. Less — not none.

Watch what happens to the pair: column 2r lands in banks {2r ⊕ m} and column 2r+1 in {(2r+1) ⊕ m}, with m ∈ [0,16). Since 2r and 2r+1 share their upper bits, both sets are the same sixteen banks — the half-warps collide pairwise, every bank in that half is hit twice, and the other sixteen banks sit idle. A 2-way conflict on every single read step.

Try it yourself. Set the scope below to XOR m and Read, scrub r, and watch the trace: two hits per bank, sixteen banks dark. Then switch to XOR 2m.

THE BANK SCOPE 16×32 · fp32 · 32 banks × 4 B · one warp
Swizzle
Access
Register step r
r = 0
View
accessed by thread 0–15 accessed by thread 16–31 cell label = bank · hue = bank ÷ 2 · darker = odd bank
Bank hits this step (height = threads landing on the bank)

In the shared memory view, rows are 128-byte segments and columns are the physical banks — conflicts appear literally, as two rings stacked in one column. The cell label is the element's column index n (the segment index already equals its row m under all three maps).

§ 04 · The construction

Running the optimal-swizzling algorithm

The paper's §5.4 algorithm takes the two distributed layouts — write layout A, read layout B — and mechanically produces a memory layout M that maximizes vectorization and provably minimizes bank conflicts for both. Everything happens in F₂⁹: flatten the 16×32 tensor with n minor, so the basis vectors are the column bits n0n1n2n3n4 and the row bits m0m1m2m3.

Written as the sets of matrix columns acting on each hardware dimension:

A (store: thread → column, register → row):
A_Reg = m0m1m2m3  ·  A_Thr = n0n1n2n3n4

B (read: low thread bits → row, top thread bit → column parity, registers → column pairs):
B_Reg = n1n2n3n4  ·  B_Thr = m0m1m2m3n0

Note the highlighted vector: the read's top thread bit maps to n0, the column parity. That one detail turns out to be the whole story.

Vectorization

V is a basis of A_Reg ∩ B_Reg = span{m0…m3} ∩ span{n1…n4} = {0}, so v = 0. A's registers run down rows, B's run across column pairs — there is no shared register direction, so accesses are scalar 32-bit. The algorithm can't vectorize what the layouts forbid; it maximizes the rest.

Space sizes

Bank bits b = log₂(128 / (2ᵛ·w)) = log₂(128/4) = 5 (32 banks), segment bits s = d − v − b = 9 − 0 − 5 = 4 (16 segments of 128 bytes — one row of banks each).

Transaction split

Remove the last log₂ max(1, 2ᵛ·w/4) = log₂ 1 = 0 vectors from the thread sets: A_Bank = A_Thr, B_Bank = B_Thr. Each thread's access is exactly 4 bytes, so the hardware never splits a warp request here — unlike the vectorized toy example in the paper's Figure 5, nothing gets a free pass.

Shared vs. unique

I = A_Bank ∩ B_Bank = n0 — a thread bit of both layouts, so it is set aside and protected.

E = A_Bank ∖ B_Bank = n1n2n3n4  ·  F = B_Bank ∖ A_Bank = m0m1m2m3

Pair them in ascending order and XOR: H = n1⊕m0n2⊕m1n3⊕m2n4⊕m3. Each H vector mixes an A-only thread direction with a B-only one, so stepping along it changes thread identity in both layouts at once — the algebraic recipe for "different threads, different segments."

Complement

P = span(A_Bank) ∪ span(B_Bank), and together those bases span all of F₂⁹, so the complementing basis is empty: C = ∅.

Segment bits

|H| + |C| = 4 + 0 = s exactly — conflict-free is achievable with zero slack. Take all of it: M_Seg = H, i.e. segment bit jnj+1 ⊕ mj.

Bank bits

Complete M_Seg to a basis of F₂⁹; the canonical choice is M_Bank = n0n1n2n3n4 (bank bit ini). M_Vec is empty.

§ 05 · The result

offset = 32·m + (n ⊕ 2m)

Reading the map off: at (bank k, segment σ), the segment bits contribute m = σ plus a column contribution of , the bank bits contribute k; so the stored element is (m, n) = (σ, k ⊕ 2σ). Inverting:

offset(m, n) = 32·m + (n ⊕ 2m)    ⇒  bank = n ⊕ 2m
The memory layout M as a 9×9 matrix over F₂: columns are the offset bits (Bank k0–k4, Segment s0–s3), rows are the logical bits (n0–n4, m0–m3). The identity block sends banks to columns; the segment block carries each H vector — one foot in n, one in m, shifted by a bit.

Verification

  • Store, step r: the warp writes row r into segment r at banks {n ⊕ 2r : n = 0…31} — all 32 banks, exactly once.
  • Read, step r: threads 0–15 read (m, 2r) at banks 2r ⊕ 2m — the sixteen even banks, once each — while threads 16–31 read (m, 2r+1) at banks (2r+1) ⊕ 2m — the sixteen odd banks. All 32 banks, every cycle. Conflict-free.
  • Lemma 9.4 check: span(M_Seg) intersects both span(A_Thr) and span(B_Thr) trivially — every nonzero combination of H vectors carries an m-component outside span{n0} and an n-component outside span{m0…m3, n0}.
  • Legacy parameters: this M is exactly the mma swizzle with vec=2, per_phase=1, max_phase=16 — Lei's version is its vec=1 sibling.

Full pass over the tensor: 16 store wavefronts + 16 read wavefronts. The data-volume floor. Nothing left on the table.

§ 06 · Why one bit matters

The shared thread bit n0

Why does the textbook n ⊕ m fall short here? Write it as a linear layout and its segment basis is {m0⊕n0, m1⊕n1, m2⊕n2, m3⊕n3} — and the first vector, m0⊕n0, lies inside span(B_Thr) = span{m0…m3, n0}. By the paper's Lemma 9.4 that is precisely a guaranteed extra wavefront on every read: two read threads that differ by exactly that vector land in different segments of the same bank.

The algorithm can never make this mistake, structurally: n0 sits in I, the intersection of both thread sets, so it is excluded from E and F and never gets XORed into anything. Only A-unique thread bits get paired with B-unique ones. Intuitively — the read assigns column parity to a thread bit, so a good swizzle must preserve column parity. Folding the row index into n's bits 1–4 instead of bits 0–3 does exactly that, and that one-bit shift is the entire distance between "much less" conflicts and none.

Shared-memory mapBank of (m, n)Store rowRead column pairFull pass
Unswizzlednconflict-free16-way16 + 256 wavefronts
XOR m  (classic)n ⊕ mconflict-free2-way16 + 32 wavefronts
XOR 2m  (algorithm)n ⊕ 2mconflict-freeconflict-free16 + 16 wavefronts

That is the quiet argument for doing layouts with linear algebra over F₂ rather than by pattern-matching: the classic swizzle is a fine shape of answer, but the optimal instance depends on both access patterns at once — and the algebra finds it mechanically, for any pair of layouts you hand it.