The Villa Straylight Papers

Jensen's Razor and The Malevolent Combinatorics of CTA1/CTA2

"While you were micro-tuning online softmax for cash-furnace LLMs, I studied the blade. And now that AI inference costs money again you dare to come to me for help?"


There is always a point at which the terrorist ceases to manipulate the media gestalt. A point at which the violence may well escalate, but beyond which the terrorist has become symptomatic of the media gestalt itself. Terrorism as we ordinarily understand it is innately media-related. The Panther Moderns differ from other terrorists precisely in their degree of self-consciousness, in their awareness of the extent to which media divorce the act of terrorism from the original sociopolitical intent.

"Skip it," Case said.

By The Standards of the Archipelago

"Essay of 3Jane's," the Finn said, producing his Partagas. "Wrote that when she was twelve. Semiotics course."

NVIDIA documented the true names.

Not in marketing materials or even CUTLASS example 77. In doc/reading/tma-modeling-in-depth.md and doc/math/integer-division.md and thirty other files released under BSD-3-Clause, written by engineers who needed to navigate the labyrinth they built. If you want to know what NVIDIA really thinks about something, watch the nvfuser repository. It's as close as you'll get to a source of truth this side of carrying the Jetson phone around in an iPhone case.

The Fundamental Theorem of TMA Correctness. The IterDomain transformation algebra. The divisibility invariants that determine whether your kernel silently corrupts memory or merely crashes. Thirty-five theorems about Euclidean division. Recursive definitions of "TMA-protected" domains.

They proved these by hand. They wrote them in markdown and SVG.

We encoded them as types. The somewhat alarming synergy of frontier LLMs and Lean 4 is what Terrence Tao means when he talks about AI-assisted math. It's mostly the Lean 4.



// the stack //

Lean 4 for the proofs. The polyhedral model is lattices and affine spaces. Lean's type system expresses "this tiling is valid for all M ≥ 128 where M mod 64 = 0" as a type, not an assert.
Haskell for the glue. Algebraic data types for CuTe layouts. QuickCheck against Lean specs. The polyhedral scanner, the MILP solver, the PTX pretty-printer.
The blade for everything else.
// jensen's razor //

The Malevolent Combinatorics of The Polyhedral Villa Straylight


I. The Semiotics of the Villa

The Villa Straylight is a body grown in upon itself, a Gothic folly. Each space in Straylight is in some way secret, this endless series of chambers linked by passages, by stairwells vaulted like intestines, where the eye is trapped in narrow curves, carried past ornate screens, empty alcoves.

NVIDIA's CUDA stack is Villa Straylight.

Not metaphorically. Architecturally. A body grown in upon itself over four decades, each space in some way secret, linked by passages the eye is trapped in. PTX to SASS. CUTLASS to cuBLAS. nvfuser to TensorRT to Myelin. Stairwells vaulted like intestines, where you're carried past ornate screens (the documentation that exists) and empty alcoves (the documentation that doesn't).

Gibson continues:

The architects of Freeside went to great pains to conceal the fact that the interior of the spindle is arranged with the banal precision of furniture in a hotel room. In Straylight, the hull's inner surface is overgrown with a desperate proliferation of structures, forms flowing, interlocking, rising toward a solid core of microcircuitry.

Freeside is CUTLASS examples. The templates that compile. The GEMMs that work on problem sizes divisible by 128. Furniture in a hotel room. "Just tile it, bro."

Straylight is the actual scheduling logic. The IterDomain transformation algebra. The Fundamental Theorem of TMA Correctness. A desperate proliferation of structures rising toward a solid core of silicon.

And at that core:

At the Villa's silicon core is a small room, the only rectilinear chamber in the complex.

The tensor core. The one thing that follows clean, orthogonal rules. M×N×K. The actual hardware, executing warp-synchronous matrix operations with mathematical precision.

Everything else—the entire CUDA ecosystem—is the Gothic folly grown around it.


II. The Shape of the Problem

Here's the thing about NVIDIA's stack past "hello world": it's not complex, it's combinatorially malevolent. You're not writing code. You're specifying a point in a search space so vast that brute force would outlive the heat death of the universe.

Consider what happens when you try to schedule a single TMA load. Not a GEMM. Just loading a tile.

From NVIDIA's own internal documentation—which I obtained, studied, and will now quote liberally:

Step 1: define TMA domain

When a user is ready to schedule the consumer of the TMA expression, the user should already have an idea of how the problem should be viewed.

Already we're in trouble. "Should already have an idea" is doing heavy lifting. The document continues:

Step 2: define box — There are two ways of defining box: partitioning and compositing.

And then we learn about "boxing splits," "partitioned IterDomains," "box IterDomains," "coordinate IterDomains," the difference between them, and why merging discontiguous IterDomains is illegal unless your split size divides the inner extent.

This is page one of the TMA tutorial. We haven't touched swizzling yet.


III. The Fundamental Theorem of TMA Correctness

Deep in the nvfuser documentation, there's a theorem that should terrify you. They call it "The Fundamental Theorem of TMA Correctness," or FTTC for short:

Theorem 6 describes when strong correctness is achievable. It is so powerful that it deserves a fancier name: "The fundamental theorem of TMA correctness." It is powerful because the condition in it is both necessary and sufficient, so we have a clear idea on when we can achieve strong correctness, and when we have no choice but to raise an error.

Let me translate. "Strong correctness" means: when you do an indivisible split on your iteration domain, you create holes—indices that are visited during iteration but don't correspond to valid data. If those holes get read by an unpredicated reduction (like tensor core operations), you need them filled with the correct neutral element (zero for sum, one for product).

The FTTC tells you exactly when this is possible. The conditions involve:

Miss any of this and you don't get "wrong answer." You get silent corruption or, if you're lucky, "illegal memory access."


IV. We See You, NVIDIA

Here's what I find fascinating: NVIDIA wrote this documentation. They know this is a theorem. They formalized the conditions. They proved it. They even use notation like ÷ for true division and / for Euclidean division, with careful distinctions about negative number behavior.

From their integer division document:

We learnt arithmetic from as early as elementary school, and have been used to simplify expressions using rules like a(b+c) = ab+ac, but extra care is needed when dealing with integer divisions... Because unlike real numbers, integer division can be anti-intuitive:

  • (a + b) / c ≠ a/c + b/c
  • (a / b) × c ≠ a / (b / c)

They go on to prove 35+ theorems about Euclidean division, truncation division, and ceil division. They derive properties like:

Theorem 2.10: If b is a multiple of c, then a × (b/c) = (a × b)/c.

And they need these theorems. They use them to prove that merge-then-split equals split-then-merge only when the split is divisible:

Theorem 5: The following two combinations of merge and split are equivalent if and only if the split is divisible.

This is the math behind why you can't just casually tile your loops. The IterDomain transformation algebra is not a group under composition. It's not even a monoid in all cases. Every transformation has preconditions, and violating them doesn't throw an error—it silently corrupts your indexing.


V. The IterDomain Calculus

NVIDIA defines a formal algebra for iteration domains:

Definition 1.1 (IterDomain Transformation): An IterDomain transformation of rank (m, n) is a pair of two mappings ⟨ℤᵐ→ℤⁿ, ℤᵐ→ℤⁿ⟩, called extent mapping and index mapping.

They define inner split, outer split, merge, resize, swizzle, and reorder. They prove equivalence relations between compositions of these transformations. They prove theorems about when transformations commute.

For example, the Split-Split equivalence:

Theorem 2.1 (Equivalence of Split-Split): Let m, n ∈ ℤ, we have:

InnerSplit(m)[0] ∘ InnerSplit(n) = InnerSplit(n)[1] ∘ InnerSplit(m·n)

This theorem is used in TMA scheduling for "rotation"—restructuring your transformation tree to match hardware requirements. If you don't know this theorem exists, you'll cargo-cult from examples until something works.

The nvfuser code I examined shows this in practice. Here's their block quantization kernel:

// Number of threads involved in computing one block scaling factor
constexpr int THREADS_PER_SCALING_FACTOR = 16 / ITEMS_PER_THREAD;

// Compute the max across 16/ITEMS_PER_THREAD threads
reduceAcrossThreads<THREADS_PER_SCALING_FACTOR>(local_max);

The reduction pattern via __shfl_xor_sync expects exactly this decomposition. The swizzle logic for scaling factor storage reveals the exact layout tensor cores expect:

// The allocation domain swizzle logic is:
// m, k -> m, k/4, 4
// m, k/4, 4 -> m/128, 128, k/4, 4 ->
// m/128, 4(m), 32, k/4, 4(k) ->
// m/128, k/4, 32, 4(m), 4(k)

This is NVIDIA showing their hand. The transformation sequence that produces bank-conflict-free access patterns for NVFP4 quantization. Documented in code comments, invisible unless you read the source.


VI. The Correctness Model

Here's where the documentation gets genuinely profound. They define two levels of correctness:

Definition 1 (Weak Correctness): A schedule/lowering strategy is weakly correct if all the valid items in the consumer's allocation domain are filled with the correct value, and there is no error raised in the kernel.

Definition 2 (Strong Correctness): A schedule/lowering strategy is strongly correct if all the valid items in the consumer's allocation domain are filled with the correct value, and all holes are filled with a desired filling value.

Weak correctness means "doesn't crash, produces right answer." Strong correctness means "tensor cores work."

The gap between these matters enormously. TMA has builtin boundary checking:

Theorem 1: TMA provides weak correctness regardless of how we schedule, how we initialize the buffer, and how we generate predicates.

But for strong correctness, you need "TMA-protected" IterDomains:

Definition 1 (TMA-protected IterDomain): An IterDomain is TMA-protected if it satisfies any of the following conditions:

  1. It is a partitioned IterDomain.
  2. It is the outer output of a split of a TMA-protected IterDomain.
  3. It is the output of a merge whose outer input is a TMA-protected IterDomain.
  4. It is the output of a resize whose input is a TMA-protected IterDomain and left_expand >= 0 && right_expand >= 0.
  5. It is the X output of a swizzle whose X input is a TMA-protected IterDomain.

This is a recursive definition. You have to trace your transformation tree back to the TMA domain and check each step. The documentation includes diagrams showing which configurations achieve strong correctness and which don't.


VII. The Semiotics Bespeak a Turning In

The semiotics of the Villa bespeak a turning in, a denial of the bright void beyond the hull. Tessier and Ashpool climbed the well of gravity to discover that they loathed space. They built Freeside to tap the wealth of the new islands, grew rich and eccentric, and began the construction of an extended body in Straylight.

Jensen climbed the well of Moore's Law to discover he loathed general-purpose computing.

The bright void beyond the hull is open standards. Portable code. Hardware abstraction. The things NVIDIA claims to support while building an increasingly Gothic labyrinth of proprietary tooling.

We have sealed ourselves away behind our money, growing inward, generating a seamless universe of self.

This is the Myelin problem.

The polyhedral model is open source. ISL is open source. MLIR's affine dialect is open source. NVIDIA's own documentation about IterDomain transformations, TMA correctness, and swizzle patterns is BSD-3-Clause licensed.

And you will never, ever match cuBLAS.

Because cuBLAS doesn't use the polyhedral model to search. It uses it to verify. The actual kernel selection comes from Myelin, NVIDIA's internal autotuning database—years of MCTS-derived results, benchmarked on actual silicon, encoding which schedule points win on which GPU at which problem size.

That database is the moat. Not the math. The data.


VIII. The Blade

The standard approach is to cargo-cult from CUTLASS examples. To read the nvfuser source, find something similar, and hope. To test on one GPU and pray it works on the next.

I did this for years. It works until it doesn't.

The breaking point was implementing an NVFP4 matmul with mixed-precision accumulation and a fused dequantization epilogue. The schedule space wasn't just large—it had structure. There were invariants. If the K-tile divides by 128, then you can use a certain async copy pattern. If the warp arrangement is 4×1, then these register layouts are legal.

These aren't suggestions. They're logical propositions. And NVIDIA wrote them down. They proved theorems about them. They just didn't give us a language to express them as types.

So I built one.

Lean 4 for the heavy lifting. The polyhedral model is fundamentally about lattices, affine spaces, and integer linear programming. Lean's type system can express "this tiling is valid for all input dimensions ≥ 128 where M mod 64 = 0" as a type. Not a runtime check. Not an assert. A type that the compiler enforces.

When I changed my warp tile from 64×64 to 32×128, Lean didn't give me a segfault three days later. It gave me:

error: cannot synthesize instance
  ValidWarpTile M N
where
  M = 32
  N = 128
  constraint: N ≤ 64 ∨ M ≥ 64

That error message is almost the patch.

TLA+ for concurrency. The nvfuser documentation describes producer-consumer pipelines between warps:

Step 5: schedule the consumer tensor... the transformation and parallelization here defines which CTA gets which tiles, and within each CTA, how do we want to transfer these tiles?

These are distributed systems problems wearing a CUDA costume. I write the spec, run TLC, and it tells me whether my pipeline can deadlock when the tile doesn't divide evenly and one warp finishes early.

Haskell for the glue. Algebraic data types for representing CUTE layouts natively. Strong typing that catches dimension mismatches at compile time. QuickCheck for property-based testing against Lean specs. The polyhedral scanning library, the MILP solver bindings, the CUDA PTX pretty-printer—Haskell holds it together.


IX. The Actual Error Message

I promised you error messages that are patches. Here's a real one, from a TMA scheduling issue:

error: failed to prove 'StrongCorrectnessAchievable'
  given:
    element_stride = 3
    box_size = 8
    tensor_size = 16
  
  condition of FTTC violated:
    element_stride (3) does not divide box_size (8)
    AND box_size (8) < tensor_size (16)
    AND element_stride (3) < box_size (8)
  
  therefore: strong correctness is UNACHIEVABLE
  
  suggestions:
    1. Change element_stride to 1, 2, 4, or 8 (divisors of box_size)
    2. Change box_size to 3, 6, 9, or 12 (multiples of element_stride)
    3. Use weak correctness (requires: no unpredicated reductions)

That's not an error message. That's a consultation from the Fundamental Theorem of TMA Correctness.

Compare to what nvfuser would actually tell you: RuntimeError: invalid memory access or, if you're lucky, an assertion failure somewhere in the indexing code that you'd have to trace back through the transformation tree manually.


X. Jensen's Razor

Here's the principle:

Never attribute to search what can be proven by construction.

NVIDIA's approach: generate candidate schedules, benchmark on silicon, accumulate winners in Myelin. Repeat for years across GPU generations. Build an unassailable moat of empirical data.

The razorgirl approach: encode the invariants as types. Constrain the search space to provably-legal points. Trade breadth for depth. Win on your workload—the weird one, the fused one, the one with batch size 1 and sequence length 7777—by knowing things about it that MCTS never will.

The MCTS doesn't know that my workload always has batch size 1. My types do.

The MCTS doesn't know that my epilogue is always ReLU. My types do.

I can't beat Myelin at its game. So I play a different game: constrained optimization over a space I can actually reason about.


XI. The Ducal Palace at Mantua

'The Ducal Palace at Mantua,' she said, 'contains a series of increasingly smaller rooms. They twine around the grand apartments, beyond beautifully carved doorframes one stoops to enter. They housed the court dwarfs.'

The IterDomain transformation algebra is the Ducal Palace. Every split creates smaller rooms. Every merge creates doorframes you stoop to enter. The TMA-protected domains are the grand apartments; the holes in indivisible splits are where the dwarfs live.

3Jane continues:

'I might aspire to that, I suppose, but in a sense my family has already accomplished a grander version of the same scheme...'

NVIDIA has accomplished a grander version. The Villa Straylight of CUDA. Rooms within rooms within rooms. And at the center, the tensor core, "the only rectilinear chamber in the complex."

But then she says:

'Take your word, thief.'

The word is three tones—a true name. The specification that unlocks everything.

NVIDIA wrote down the true names. The FTTC. The IterDomain algebra. The divisibility invariants. The swizzle patterns. They wrote them in markdown, proved them by hand, and released them under BSD-3-Clause.

They just didn't think anyone would read the blueprints.


XII. razorgirl

We're calling the software razorgirl.

We believe the following:

The polyhedral model offers correctness by construction. Tilings that are proven valid. Schedules that are proven deadlock-free. Memory layouts that are proven conflict-free. Error messages that cite theorem numbers and suggest fixes.


XIII. Coda

We have sealed ourselves away behind our money, growing inward, generating a seamless universe of self. The Villa Straylight knows no sky, recorded or otherwise.

NVIDIA's seamless universe of self has one flaw: they documented it. Not for us—for their own engineers, trying to navigate the labyrinth they built. But documentation is documentation. BSD-3-Clause is BSD-3-Clause.

The bright crabs burrow in the maintenance tunnels, alert for micromechanical decay or sabotage. We're not saboteurs. We're readers. We read the semiotics of the Villa. We traced the stairwells vaulted like intestines. We found the small room at the silicon core, the only rectilinear chamber in the complex.

And we took the word.


I studied the blade. The blade studied me back. It cited Theorem 6, noted that my element stride didn't divide my box size, and suggested I either change my striding or accept weak correctness.

It was right.


// appendix: key documents referenced //

The nvfuser documentation studied includes:

  • doc/reading/tma-modeling-in-depth.md — The Fundamental Theorem of TMA Correctness
  • doc/reading/divisibility-of-split.md — Weak and strong correctness models
  • doc/reading/iterdomain.md — Mathematical theory of IterDomain transformations
  • doc/dev/tma.md — TMA scheduling tutorial with box/tile definitions
  • doc/dev/cute_tv_layout.md — CuTe Thread-Value layout construction
  • doc/math/integer-division.md — 35+ theorems about Euclidean and truncation division
  • doc/math/abstract-algebra.md — Algebraic structures underlying integer arithmetic

All BSD-3-Clause licensed. All waiting to be encoded as types.

"Take your word, thief."

He jacked.

// part i // 0x02 //

The Rectilinear Chamber

Layouts, Coordinate Spaces, and the CuTe Contract

At the Villa's silicon core is a small room, the only rectilinear chamber in the complex.


Companion note: if you read Jensen's Razor, this is the appendix: the invariants, the algebra, and the Lean types that turn "GPU folklore" into checkable statements.

The maxim still stands: Never attribute to search what can be proven by construction.

Modern CUDA is (still) the Villa: a Gothic overgrowth around a small rectilinear room. But the "room" is real—and, crucially, it has rules. CuTe's layout algebra is one of the few places those rules are explicit and composable.

This series does two things:

  1. Restate CuTe's layout model as clean math (coordinate space → offset).
  2. Encode the invariants in Lean 4, so an illegal layout becomes a type error—not a silent corruption.

And we keep it CCCL-forward in the sense that "vocabulary types" (integers, products, mdspan-style views) should feel like core CUDA C++, while the layout algebra stays CuTe-shaped.


1. Layouts as coordinate → offset maps

A layout has two parts:

with the same "profile" (same rank / structure).

The coordinate space is the finite product

Coord(S) = [0,M₀) × … × [0,M₍ₙ₋₁₎)

The semantics of the layout is the dot-product map

eval_L(x₀,…,x₍ₙ₋₁₎) = Σᵢ xᵢ · dᵢ

That's the thing CuTe means by "a layout maps coordinate space(s) defined by Shape into an index space defined by Stride."

Row-major / column-major are just special cases

For a 4×8 matrix:


2. Size and cosize (tight definition)

Two numbers matter constantly:

size
how many logical coordinates exist: size(L) = ∏ᵢ Mᵢ
cosize
how far the layout's image reaches in memory: cosize(L) = 1 + max{ eval_L(c) | c ∈ Coord(S) }

This definition is important: cosize is not "eval at the last linear index + 1" unless you already know the layout is monotone in that linearization.

Cosize is a maximum over the whole coordinate space.


3. Compact vs contiguous

Two properties you want to name cleanly:

Compact (injective)
no two coordinates collide in memory: eval_L(c)=eval_L(c') ⇒ c=c'
Contiguous (a permutation of a block)
compact and it fills exactly [0,size) (no gaps, no overshoot). A convenient sufficient characterization: compact, and cosize(L) = size(L).

(You can strengthen this with "minimum is 0" if you ever allow negative strides; we won't.)


4. Coordinate isomorphism (why 1D indices still show up)

CuTe "thinks in coordinates". Humans (and many APIs) still "think in linear indices".

Given just a shape S, there is a standard mixed-radix bijection between:

with

x₀ = x mod M₀
x₁ = ⌊x / M₀⌋ mod M₁

xᵢ = ⌊x / (∏₍ⱼ<ᵢ₎ Mⱼ)⌋ mod Mᵢ

This is not the layout yet; it's the coordinate system induced by the shape.

Once you have a coordinate, the layout semantics is just the dot product with strides.


5. Lean 4: make the semantics the center

The biggest Lean cleanup is: don't define the meaning of a layout as Nat → Nat first. Define it as a function on bounded coordinates (like CuTe does), then optionally add a linearization layer.

Lean: core data types

import Mathlib.Algebra.BigOperators.Basic
import Mathlib.Data.Fin.Basic
import Mathlib.Data.Finset.Basic

open scoped BigOperators

/-- One dimension ("mode") of a layout. -/
structure Mode where
  shape : Nat
  stride : Nat
  shape_pos : 0 < shape
  stride_pos : 0 < stride

/-- A rank-`n` layout is a tuple of `n` modes. -/
structure Layout (n : Nat) where
  mode : Fin n → Mode

namespace Layout

variable {n : Nat} (L : Layout n)

def shape  (i : Fin n) : Nat := (L.mode i).shape
def stride (i : Fin n) : Nat := (L.mode i).stride

theorem shape_pos  (i : Fin n) : 0 < L.shape i  := (L.mode i).shape_pos
theorem stride_pos (i : Fin n) : 0 < L.stride i := (L.mode i).stride_pos

/-- Total number of logical coordinates: ∏ shapeᵢ. -/
def size : Nat :=
  (Finset.univ : Finset (Fin n)).prod (fun i => L.shape i)

/-- A coordinate is per-dimension, bounded by `shape`. -/
def Coord := ∀ i : Fin n, Fin (L.shape i)

/-- Semantics of a layout: dot-product `coord • stride`. -/
def eval (c : L.Coord) : Nat :=
  (Finset.univ : Finset (Fin n)).sum (fun i => (c i).val * L.stride i)

/-- `cosize` = 1 + max offset over all valid coordinates. -/
noncomputable def cosize : Nat :=
  (Finset.univ : Finset L.Coord).sup (fun c => L.eval c) + 1

/-- Compact = injective on coordinates. -/
def IsCompact : Prop := Function.Injective L.eval

/-- Contiguous = compact and exactly fills a block of length `size`. -/
def IsContiguous : Prop := L.IsCompact ∧ L.cosize = L.size

end Layout

Lean: linearization layer (kept explicit)

namespace MixedRadix

open scoped BigOperators

variable {n : Nat} (shape : Fin n → Nat)

/-- Exclusive prefix product of `shape` up to (but not including) `k`. -/
def prefixProd (k : Fin n) : Nat :=
  (Finset.Iio k).prod shape

/-- Left-to-right mixed-radix decomposition (LayoutLeft-style). -/
def decompose (x : Nat) : Fin n → Nat :=
  fun k => (x / prefixProd shape k) % shape k

/-- Recompose coordinates back into a linear index. -/
def recompose (coord : Fin n → Nat) : Nat :=
  (Finset.univ : Finset (Fin n)).sum (fun k => coord k * prefixProd shape k)

/-- Standard theorem: mixed-radix recompose ∘ decompose = id on [0, ∏ shape). -/
axiom recompose_decompose
  (shape_pos : ∀ k, 0 < shape k)
  (x : Nat)
  (hx : x < (Finset.univ : Finset (Fin n)).prod shape) :
  recompose shape (decompose shape x) = x

end MixedRadix

The point: the meaning of the layout is Coord → Nat. Linear indices are an interface, not the foundation.


6. What's next

Next: Part II — The Sense/Net Pyramid and The Blue Nine

// part ii // 0x03 //

The Sense/Net Pyramid and The Blue Nine

Coalescence as normalization (terminating, but not magic)

Nine Moderns, scattered along two hundred miles of the Sprawl, had simultaneously dialed MAX EMERG from pay phones. Each Modern delivered a short set speech, hung up, and drifted out into the night, peeling off surgical gloves. Nine different police departments and public security agencies were absorbing the information that an obscure subsect of militant Christian fundamentalists had just taken credit for having introduced clinical levels of an outlawed psychoactive agent known as Blue Nine into the ventilation system of the Sense/Net Pyramid. Blue Nine, known in California as Grievous Angel, had been shown to produce acute paranoia and homicidal psychosis in eighty-five percent of experimental subjects.


Companion note: this is still the formal appendix to Jensen's Razor.

We're taking the "layout folklore" and turning it into a terminating normalization pass with a clean invariant.

In Part I we defined a layout's meaning as a map

eval_L : Coord(L) → ℕ

Now we simplify representations without changing meaning.

Coalescence is the operation that removes "fake structure":

This is not aesthetic. Coalesced layouts:


1. The three local rewrite rules

Write a mode as (N):(d).

Coalescence is driven by three local rules:

1) Right unit

(N):(d) (1):(d₁) (N):(d)

2) Left unit

(1):(d₀) (N):(d) (N):(d)

3) Packed-merge

(N₀):(d₀) (N₁):(d₁) (N₀·N₁):(d₀)    if d₁ = N₀·d₀

Rule (3) is the heart: it recognizes that the second mode is exactly "the next block" after the first.


2. Lean: Mode + tryCoalesce

import Mathlib.Data.Nat.Basic

/-- A single mode: (shape, stride), both positive. -/
structure Mode where
  shape : Nat
  stride : Nat
  shape_pos : 0 < shape
  stride_pos : 0 < stride

inductive CoalesceResult
  | merged (m : Mode)
  | unchanged
deriving DecidableEq

/-- Attempt to coalesce two adjacent modes. -/
def tryCoalesce (m0 m1 : Mode) : CoalesceResult :=
  if h1 : m1.shape = 1 then
    .merged m0
  else if h0 : m0.shape = 1 then
    .merged m1
  else if h : m1.stride = m0.shape * m0.stride then
    .merged
      { shape := m0.shape * m1.shape
        stride := m0.stride
        shape_pos := Nat.mul_pos m0.shape_pos m1.shape_pos
        stride_pos := m0.stride_pos }
  else
    .unchanged

3. Soundness: coalescence preserves semantics

For the packed-merge rule, the key arithmetic identity is:

x₀·d₀ + x₁·(N₀·d₀) = (x₀ + x₁·N₀)·d₀
theorem packed_merge_arith
    (N0 d0 x0 x1 : Nat) :
    x0*d0 + x1*(N0*d0) = (x0 + x1*N0)*d0 := by
  calc
    x0*d0 + x1*(N0*d0)
        = x0*d0 + (x1*N0)*d0 := by
            simp [Nat.mul_assoc, Nat.mul_left_comm, Nat.mul_comm]
    _   = (x0 + x1*N0)*d0 := by
            simp [Nat.add_mul]

4. Termination (what "Noetherian" should mean here)

What we actually need is the term-rewriting notion:

A reduction relation is Noetherian iff it has no infinite descending chains.

For coalescence, termination is elementary: each successful rewrite reduces the number of modes. Length is a natural number. Therefore: no infinite chain.


5. A concrete example

Take the (already packed) layout

(8,4):(1,8)

Since 8 = 8·1, the two modes are compatible and coalesce to

(32):(1)

What looked like a 2D tile is actually a single contiguous span.


// exercises //

  1. Coalesce by hand: (2,4):(1,2) → what happens?
  2. Find a layout with many unit modes and show they vanish under normalization.
  3. If you allow stride=0, what breaks? (Hint: compactness.)

Next: Part III — Built Him up From Nothing in France

// part iii // 0x04 //

Built Him up From Nothing in France

Complementation, tilings, and the real FTTC hook

'Yeah. Built him up from go. Yeah...' She turned and they walked on. 'It figures. You know, the guy doesn't have any life going, in private. Not as far as I can tell. You see a guy like that, you figure there's something he does when he's alone. But not Armitage. Sits and stares at the wall, man. Then something clicks and he goes into high gear and wheels for Wintermute.'


Companion note: this part is where the Jensen's Razor rhetoric cashes out.

Complementation is the "division-with-remainder shaped" operation that shows up in real TMA correctness constraints.

Now we define the operation that answers:

"Given a layout A that touches some addresses in a region, what layout B completes the coverage—no overlaps, no gaps?"

That's complementation.


1. What complementation should mean

Fix a target memory region [0, M). Given a layout A, we seek B such that the combined layout C = (A,B) is a tiling of [0,M): every address hit exactly once.


2. Single-mode complement

Start with A = (N):(d). Assume M is a multiple of N·d.

Define:

B = (d, M/(N·d)) : (1, N·d)

Then the combined layout

C = (N, d, M/(N·d)) : (d, 1, N·d)

hits every address in [0,M) exactly once. Pure Euclidean division (twice), no mystery CRT required.


3. Lean: complement₁

/-- Complement of a single-mode layout (N):(d) inside [0,M),
    assuming N*d ∣ M and M>0. Returns two modes. -/
def complement1 (m : Mode) (M : Nat)
    (hM : m.shape * m.stride ∣ M)
    (hM0 : 0 < M) : Mode × Mode :=
  let Nd : Nat := m.shape * m.stride
  let mR : Mode :=
    { shape := m.stride
      stride := 1
      shape_pos := m.stride_pos
      stride_pos := Nat.succ_pos 0 }
  let q : Nat := M / Nd
  have q_pos : 0 < q := by
    have Nd_pos : 0 < Nd := Nat.mul_pos m.shape_pos m.stride_pos
    exact Nat.div_pos hM0 Nd_pos
  let mB : Mode :=
    { shape := q
      stride := Nd
      shape_pos := q_pos
      stride_pos := Nat.mul_pos m.shape_pos m.stride_pos }
  (mR, mB)

4. Where FTTC actually enters

In nvFuser's TMA correctness model, "strong correctness" is governed by a recursive notion of TMA-protected IterDomains. The documentation explicitly defines this closure and builds theorems on top of it, including Theorem 6 characterizing when strong correctness is unachievable.

Complementation is the layout-level move that lines up with the divisibility / hole-filling constraints that appear in TMA lowering.

(You do not want to cargo-cult this. You want it as a type.)


5. Application: thread/value tilings (TV layouts)

32 threads, 4 values per thread, total 128 values.

A = (32):(4) — thread t owns block at 4t
B = (4):(1) — value v steps by 1 within block
(thread, value) ↦ 4·thread + value

The simplest TV layout, provably collision-free.


// exercises //

  1. Compute complement((4):(2), 16) and verify the combined tiling.
  2. Show why the hypothesis N*d ∣ M is necessary (give a counterexample).
  3. (Stretch) Generalize complement to a 2-mode packed layout.

Next: Part IV — Take Your Word, Thief

// part iv // 0x05 //

Take Your Word, Thief

Composition, divisibility invariants, and the tensor-core cathedral

'The Ducal Palace at Mantua,' she said, 'contains a series of increasingly smaller rooms. They twine around the grand apartments, beyond beautifully carved doorframes one stoops to enter. They housed the court dwarfs.'

'Take your word, thief.'

He jacked.


Never attribute to search what can be proven by construction.

Now we add the operation that makes CuTe a real algebra: Composition.


1. Composition: semantics first

At the semantic level, composition is just function composition. Two common forms:

  1. Index substitution: A : Coord_A → ℕ, B : Coord_B → Coord_A, then A ∘ B : Coord_B → ℕ.
  2. Extended-index composition: B produces indices beyond size(A), so A must be interpreted on a larger domain.

2. The divisibility invariant

A huge fraction of "why did this schedule break" bugs reduce to:

certain rearrangements (merge/split reorderings) are valid iff a split size divides an inner extent.

nvFuser: "merge then split equals split then merge iff the split is divisible."

This is the invariant you want Lean to enforce.


3. Lean: declare admissibility as data

/-- A sufficient admissibility predicate for composing a stride-mode into a shape. -/
structure LeftDivisible (shape : Nat → Nat) (d : Nat) : Prop :=
  (witness : Nat)
  (witness_pos : 0 < witness)
  (divides : witness ∣ d)

/-- Composition spec: if admissible, produce a new 1-mode layout. -/
structure CompositionSpec where
  outShape  : Nat
  outStride : Nat
  outShape_pos : 0 < outShape
  outStride_pos : 0 < outStride

The point: admissibility must be explicit.


4. Where the cathedral shows up

Real kernels are stacked mappings:

global tensor → shared tile layout → register fragment layout → MMA atom mapping

When this is correct, you get the cathedral: predictable, repeatable, portable across shapes.

When it's wrong, you get the Villa again: stairwells vaulted like intestines, and an illegal memory access three days later.

Composition is the last primitive you need to build the stack out of proofs.


// exercises //

  1. Write down a precise admissibility predicate for "prefix-product divisibility" composition.
  2. Give an example where admissibility fails and show the two evaluation orders differ.
  3. (Stretch) Connect your predicate to a merge/split equivalence condition.

Fin.

I studied the blade. The blade studied me back. It cited Theorem 6, noted that my element stride didn't divide my box size, and suggested I either change my striding or accept weak correctness.

It was right.