The Malevolent Combinatorics of The Polyhedral Villa Straylight
I. The Semiotics of the Villa
The Villa Straylight is a body grown in upon itself, a Gothic folly. Each space in Straylight is in some way secret, this endless series of chambers linked by passages, by stairwells vaulted like intestines, where the eye is trapped in narrow curves, carried past ornate screens, empty alcoves.
NVIDIA's CUDA stack is Villa Straylight.
Not metaphorically. Architecturally. A body grown in upon itself over four decades, each space in some way secret, linked by passages the eye is trapped in. PTX to SASS. CUTLASS to cuBLAS. nvfuser to TensorRT to Myelin. Stairwells vaulted like intestines, where you're carried past ornate screens (the documentation that exists) and empty alcoves (the documentation that doesn't).
Gibson continues:
The architects of Freeside went to great pains to conceal the fact that the interior of the spindle is arranged with the banal precision of furniture in a hotel room. In Straylight, the hull's inner surface is overgrown with a desperate proliferation of structures, forms flowing, interlocking, rising toward a solid core of microcircuitry.
Freeside is CUTLASS examples. The templates that compile. The GEMMs that work on problem sizes divisible by 128. Furniture in a hotel room. "Just tile it, bro."
Straylight is the actual scheduling logic. The IterDomain transformation algebra. The Fundamental Theorem of TMA Correctness. A desperate proliferation of structures rising toward a solid core of silicon.
And at that core:
At the Villa's silicon core is a small room, the only rectilinear chamber in the complex.
The tensor core. The one thing that follows clean, orthogonal rules. M×N×K. The actual hardware, executing warp-synchronous matrix operations with mathematical precision.
Everything else—the entire CUDA ecosystem—is the Gothic folly grown around it.
II. The Shape of the Problem
Here's the thing about NVIDIA's stack past "hello world": it's not complex, it's combinatorially malevolent. You're not writing code. You're specifying a point in a search space so vast that brute force would outlive the heat death of the universe.
Consider what happens when you try to schedule a single TMA load. Not a GEMM. Just loading a tile.
From NVIDIA's own internal documentation—which I obtained, studied, and will now quote liberally:
Step 1: define TMA domain
When a user is ready to schedule the consumer of the TMA expression, the user should already have an idea of how the problem should be viewed.
Already we're in trouble. "Should already have an idea" is doing heavy lifting. The document continues:
Step 2: define box — There are two ways of defining box: partitioning and compositing.
And then we learn about "boxing splits," "partitioned IterDomains," "box IterDomains," "coordinate IterDomains," the difference between them, and why merging discontiguous IterDomains is illegal unless your split size divides the inner extent.
This is page one of the TMA tutorial. We haven't touched swizzling yet.
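To make "boxing splits" concrete, here is a toy one-dimensional model of the partitioning path. The function name and return shape are mine, not nvfuser's; the real IR tracks far more than extents:

```python
import math

def boxing_split(tensor_extent, box_size):
    """Toy model of a 1-D boxing split: partition a tensor extent into boxes.

    Returns (partitioned_extent, box_extent, holes), where holes counts
    the indices the box iteration visits that fall outside the tensor.
    """
    partitioned_extent = math.ceil(tensor_extent / box_size)  # number of boxes
    holes = partitioned_extent * box_size - tensor_extent     # padded-out slots
    return partitioned_extent, box_size, holes

# Divisible: 256 elements in boxes of 128 leaves no holes.
assert boxing_split(256, 128) == (2, 128, 0)
# Indivisible: 200 elements in boxes of 128 leaves 56 holes.
assert boxing_split(200, 128) == (2, 128, 56)
```

Those holes are exactly what the correctness definitions later in this piece worry about.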
III. The Fundamental Theorem of TMA Correctness
Deep in the nvfuser documentation, there's a theorem that should terrify you. They call it "The Fundamental Theorem of TMA Correctness," or FTTC for short:
Theorem 6 describes when strong correctness is achievable. It is so powerful that it deserves a fancier name: "The fundamental theorem of TMA correctness." It is powerful because the condition in it is both necessary and sufficient, so we have a clear idea on when we can achieve strong correctness, and when we have no choice but to raise an error.
Let me translate. "Strong correctness" means: when you do an indivisible split on your iteration domain, you create holes—indices that are visited during iteration but don't correspond to valid data. If those holes get read by an unpredicated reduction (like tensor core operations), you need them filled with the correct neutral element (zero for sum, one for product).
The FTTC tells you exactly when this is possible. The conditions involve:
- Whether your element stride divides your box size
- Whether your box size is smaller than your tensor size
- Whether certain IterDomains are "TMA-protected" through a recursive definition involving splits, merges, resizes, and swizzles
Miss any of this and you don't get "wrong answer." You get silent corruption or, if you're lucky, "illegal memory access."
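The conditions are mechanically checkable. A minimal sketch, paraphrasing the stride/box/tensor conditions the way the error message later in this essay states them (this is my reading, not NVIDIA's formal statement of the theorem):

```python
def strong_correctness_achievable(element_stride, box_size, tensor_size):
    """Toy FTTC-style check: strong correctness fails when the stride does
    not divide the box, the box is smaller than the tensor, and the stride
    is smaller than the box -- the combination that leaves unfillable holes.
    Paraphrased from the conditions quoted in this essay, not NVIDIA's code.
    """
    violated = (box_size % element_stride != 0
                and box_size < tensor_size
                and element_stride < box_size)
    return not violated

assert not strong_correctness_achievable(3, 8, 16)  # unfillable holes
assert strong_correctness_achievable(2, 8, 16)      # stride divides box
assert strong_correctness_achievable(3, 9, 16)      # box is a multiple of stride
```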
IV. We See You, NVIDIA
Here's what I find fascinating: NVIDIA wrote this documentation. They know this is a theorem. They formalized the conditions. They proved it. They even use notation like ÷ for true division and / for Euclidean division, with careful distinctions about negative number behavior.
From their integer division document:
We learnt arithmetic from as early as elementary school, and have been used to simplify expressions using rules like a(b+c) = ab+ac, but extra care is needed when dealing with integer divisions... Because unlike real numbers, integer division can be anti-intuitive:
- (a + b) / c ≠ a/c + b/c
- (a / b) × c ≠ a / (b / c)
They go on to prove 35+ theorems about Euclidean division, truncation division, and ceil division. They derive properties like:
Theorem 2.10: If b is a multiple of c, then a × (b/c) = (a × b)/c.
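Both claims are easy to watch hold and fail in a few lines. Python's floor division agrees with Euclidean division for non-negative operands, which is enough for a sanity check:

```python
# Real arithmetic says (a + b)/c == a/c + b/c; integer division disagrees.
a, b, c = 3, 3, 4
assert (a + b) // c == 1           # 6 // 4
assert a // c + b // c == 0        # 0 + 0

# Theorem 2.10 in action: a * (b/c) == (a*b)/c when c divides b...
assert 7 * (6 // 3) == (7 * 6) // 3    # b = 6 is a multiple of c = 3
# ...and it fails when c does not divide b:
assert 7 * (5 // 3) != (7 * 5) // 3    # 7 on the left, 11 on the right
```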
And they need these theorems. They use them to prove that merge-then-split equals split-then-merge only when the split is divisible:
Theorem 5: The following two combinations of merge and split are equivalent if and only if the split is divisible.
This is the math behind why you can't just casually tile your loops. The IterDomain transformation algebra is not a group under composition. It's not even a monoid in all cases. Every transformation has preconditions, and violating them doesn't throw an error—it silently corrupts your indexing.
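One face of that divisibility requirement fits in a toy index model (names mine): an inner split undoes a merge only when the split size equals the merged inner extent, so that it divides evenly. An indivisible split scrambles the coordinates instead, and no error tells you so:

```python
def merge_then_split(i, j, inner_extent, s):
    """Merge axes (i, j) into one linear index, then inner-split by s."""
    x = i * inner_extent + j
    return x // s, x % s

# Divisible case: splitting the merged [2, 3] domain by 3 recovers (i, j).
assert all(merge_then_split(i, j, 3, 3) == (i, j)
           for i in range(2) for j in range(3))

# Indivisible case: splitting by 4 silently remaps coordinates.
# (i, j) = (1, 0) lands at merged index 3, which splits to (0, 3).
assert merge_then_split(1, 0, 3, 4) == (0, 3)
```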
V. The IterDomain Calculus
NVIDIA defines a formal algebra for iteration domains:
Definition 1.1 (IterDomain Transformation): An IterDomain transformation of rank (m, n) is a pair of two mappings ⟨ℤᵐ→ℤⁿ, ℤᵐ→ℤⁿ⟩, called extent mapping and index mapping.
They define inner split, outer split, merge, resize, swizzle, and reorder. They prove equivalence relations between compositions of these transformations. They prove theorems about when transformations commute.
For example, the Split-Split equivalence:
Theorem 2.1 (Equivalence of Split-Split): Let m, n ∈ ℤ, we have:
InnerSplit(m)[0] ∘ InnerSplit(n) = InnerSplit(n)[1] ∘ InnerSplit(m·n)
This theorem is used in TMA scheduling for "rotation"—restructuring your transformation tree to match hardware requirements. If you don't know this theorem exists, you'll cargo-cult from examples until something works.
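The quoted equivalence can also be checked numerically. Under one natural reading of the index mappings (split by n and then split the outer result by m, versus split by m·n and then split the inner result by n), the final coordinates agree for every non-negative index. This is my own rendering, not nvfuser's formalization, which also tracks extents:

```python
def inner_split(x, s):
    """InnerSplit: one index becomes (outer, inner) = (x // s, x % s)."""
    return x // s, x % s

def split_outer_then(x, n, m):
    """Split by n, then split the outer output by m."""
    outer, inner = inner_split(x, n)
    oo, oi = inner_split(outer, m)
    return oo, oi, inner

def split_inner_then(x, n, m):
    """Split by m*n, then split the inner output by n."""
    outer, inner = inner_split(x, m * n)
    io, ii = inner_split(inner, n)
    return outer, io, ii

# The two schedules index identically for every non-negative x.
m, n = 3, 4
assert all(split_outer_then(x, n, m) == split_inner_then(x, n, m)
           for x in range(100))
```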
The nvfuser code I examined shows this in practice. Here's their block quantization kernel:
```cpp
// Number of threads involved in computing one block scaling factor
constexpr int THREADS_PER_SCALING_FACTOR = 16 / ITEMS_PER_THREAD;
// Compute the max across 16/ITEMS_PER_THREAD threads
reduceAcrossThreads<THREADS_PER_SCALING_FACTOR>(local_max);
```

The reduction pattern via __shfl_xor_sync expects exactly this decomposition. The swizzle logic for scaling factor storage reveals the exact layout tensor cores expect:
```cpp
// The allocation domain swizzle logic is:
// m, k -> m, k/4, 4
// m, k/4, 4 -> m/128, 128, k/4, 4 ->
// m/128, 4(m), 32, k/4, 4(k) ->
// m/128, k/4, 32, 4(m), 4(k)
```

This is NVIDIA showing their hand. The transformation sequence that produces bank-conflict-free access patterns for NVFP4 quantization. Documented in code comments, invisible unless you read the source.
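Replaying that comment as arithmetic makes the layout tangible. This is a plain reading of the transformation sequence; the function name and coordinate order are my own illustration:

```python
def swizzled_coords(m, k):
    """Replay the allocation-domain transform from the nvfuser comment:
    m, k -> m, k/4, 4 -> m/128, 128, k/4, 4
         -> m/128, 4(m), 32, k/4, 4(k) -> m/128, k/4, 32, 4(m), 4(k)
    """
    k_outer, k_inner = k // 4, k % 4      # k -> k/4, 4
    m_outer, m_rem = m // 128, m % 128    # m -> m/128, 128
    m4, m32 = m_rem // 32, m_rem % 32     # 128 -> 4(m), 32
    # Final reorder: m/128, k/4, 32, 4(m), 4(k)
    return m_outer, k_outer, m32, m4, k_inner

# Element (m=200, k=10): 200 = 1*128 + 2*32 + 8, and 10 = 2*4 + 2.
assert swizzled_coords(200, 10) == (1, 2, 8, 2, 2)
```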
VI. The Correctness Model
Here's where the documentation gets genuinely profound. They define two levels of correctness:
Definition 1 (Weak Correctness): A schedule/lowering strategy is weakly correct if all the valid items in the consumer's allocation domain are filled with the correct value, and there is no error raised in the kernel.
Definition 2 (Strong Correctness): A schedule/lowering strategy is strongly correct if all the valid items in the consumer's allocation domain are filled with the correct value, and all holes are filled with a desired filling value.
Weak correctness means "doesn't crash, produces right answer." Strong correctness means "tensor cores work."
The gap between these matters enormously. TMA has built-in boundary checking:
Theorem 1: TMA provides weak correctness regardless of how we schedule, how we initialize the buffer, and how we generate predicates.
But for strong correctness, you need "TMA-protected" IterDomains:
Definition 1 (TMA-protected IterDomain): An IterDomain is TMA-protected if it satisfies any of the following conditions:
- It is a partitioned IterDomain.
- It is the outer output of a split of a TMA-protected IterDomain.
- It is the output of a merge whose outer input is a TMA-protected IterDomain.
- It is the output of a resize whose input is a TMA-protected IterDomain and left_expand >= 0 && right_expand >= 0.
- It is the X output of a swizzle whose X input is a TMA-protected IterDomain.
This is a recursive definition. You have to trace your transformation tree back to the TMA domain and check each step. The documentation includes diagrams showing which configurations achieve strong correctness and which don't.
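The recursion is mechanical once written down. A sketch of the checker, using a toy dictionary tree in place of nvfuser's IR (the node encoding is mine):

```python
def tma_protected(node):
    """Recursive check mirroring the quoted definition of TMA-protected.

    A node is a dict with an 'op' key and its inputs; 'partitioned' is the
    base case. Toy encoding for illustration, not nvfuser's data structures.
    """
    op = node["op"]
    if op == "partitioned":
        return True
    if op == "split_outer":               # outer output of a split
        return tma_protected(node["input"])
    if op == "merge":                     # merge whose OUTER input is protected
        return tma_protected(node["outer"])
    if op == "resize":                    # only non-negative expansions qualify
        return (node["left_expand"] >= 0 and node["right_expand"] >= 0
                and tma_protected(node["input"]))
    if op == "swizzle_x":                 # X output of a swizzle
        return tma_protected(node["x_input"])
    return False                          # e.g. the inner output of a split

root = {"op": "partitioned"}
ok = {"op": "merge", "outer": {"op": "split_outer", "input": root}}
bad = {"op": "resize", "input": root, "left_expand": -1, "right_expand": 0}
assert tma_protected(ok)
assert not tma_protected(bad)
```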
VII. The Semiotics Bespeak a Turning In
The semiotics of the Villa bespeak a turning in, a denial of the bright void beyond the hull. Tessier and Ashpool climbed the well of gravity to discover that they loathed space. They built Freeside to tap the wealth of the new islands, grew rich and eccentric, and began the construction of an extended body in Straylight.
Jensen climbed the well of Moore's Law to discover he loathed general-purpose computing.
The bright void beyond the hull is open standards. Portable code. Hardware abstraction. The things NVIDIA claims to support while building an increasingly Gothic labyrinth of proprietary tooling.
We have sealed ourselves away behind our money, growing inward, generating a seamless universe of self.
This is the Myelin problem.
The polyhedral model is open source. ISL is open source. MLIR's affine dialect is open source. NVIDIA's own documentation about IterDomain transformations, TMA correctness, and swizzle patterns is BSD-3-Clause licensed.
And you will never, ever match cuBLAS.
Because cuBLAS doesn't use the polyhedral model to search. It uses it to verify. The actual kernel selection comes from Myelin, NVIDIA's internal autotuning database—years of MCTS-derived results, benchmarked on actual silicon, encoding which schedule points win on which GPU at which problem size.
That database is the moat. Not the math. The data.
VIII. The Blade
The standard approach is to cargo-cult from CUTLASS examples. To read the nvfuser source, find something similar, and hope. To test on one GPU and pray it works on the next.
I did this for years. It works until it doesn't.
The breaking point was implementing an NVFP4 matmul with mixed-precision accumulation and a fused dequantization epilogue. The schedule space wasn't just large—it had structure. There were invariants. If the K-tile size is divisible by 128, then you can use a certain async copy pattern. If the warp arrangement is 4×1, then these register layouts are legal.
These aren't suggestions. They're logical propositions. And NVIDIA wrote them down. They proved theorems about them. They just didn't give us a language to express them as types.
So I built one.
Lean 4 for the heavy lifting. The polyhedral model is fundamentally about lattices, affine spaces, and integer linear programming. Lean's type system can express "this tiling is valid for all input dimensions ≥ 128 where M mod 64 = 0" as a type. Not a runtime check. Not an assert. A type that the compiler enforces.
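A sketch of what that looks like, with hypothetical names (`ValidTiling` and `schedule` are mine, not razorgirl's actual API):

```lean
-- Hypothetical sketch: "valid for all input dimensions ≥ 128 where
-- M mod 64 = 0" as a structure of propositions the compiler must see proved.
structure ValidTiling (M : Nat) : Prop where
  bigEnough : M ≥ 128
  aligned   : M % 64 = 0

-- A schedule cannot be constructed without the proof; for concrete
-- dimensions, `by decide` discharges it at compile time.
def schedule (M : Nat) (_h : ValidTiling M) : String :=
  s!"tile M={M}"

#eval schedule 256 ⟨by decide, by decide⟩  -- 200 would fail to elaborate
```

The point is the failure mode: pass a dimension that violates the invariant and elaboration stops at the call site, not three days later in a kernel.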
When I changed my warp tile from 64×64 to 32×128, Lean didn't give me a segfault three days later. It gave me:
ValidWarpTile M N
where
M = 32
N = 128
constraint: N ≤ 64 ∨ M ≥ 64

That error message is almost the patch.
TLA+ for concurrency. The nvfuser documentation describes producer-consumer pipelines between warps:
Step 5: schedule the consumer tensor... the transformation and parallelization here defines which CTA gets which tiles, and within each CTA, how do we want to transfer these tiles?
These are distributed systems problems wearing a CUDA costume. I write the spec, run TLC, and it tells me whether my pipeline can deadlock when the tile doesn't divide evenly and one warp finishes early.
Haskell for the glue. Algebraic data types for representing CUTE layouts natively. Strong typing that catches dimension mismatches at compile time. QuickCheck for property-based testing against Lean specs. The polyhedral scanning library, the MILP solver bindings, the CUDA PTX pretty-printer—Haskell holds it together.
IX. The Actual Error Message
I promised you error messages that are patches. Here's a real one, from a TMA scheduling issue:
given:
element_stride = 3
box_size = 8
tensor_size = 16
condition of FTTC violated:
element_stride (3) does not divide box_size (8)
AND box_size (8) < tensor_size (16)
AND element_stride (3) < box_size (8)
therefore: strong correctness is UNACHIEVABLE
suggestions:
1. Change element_stride to 1, 2, 4, or 8 (divisors of box_size)
2. Change box_size to 3, 6, 9, or 12 (multiples of element_stride)
3. Use weak correctness (requires: no unpredicated reductions)

That's not an error message. That's a consultation from the Fundamental Theorem of TMA Correctness.
Compare to what nvfuser would actually tell you: RuntimeError: illegal memory access or, if you're lucky, an assertion failure somewhere in the indexing code that you'd have to trace back through the transformation tree manually.
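The suggestion lists themselves are one-liners. A sketch of how the repair hints above could be generated (illustrative only, not razorgirl's implementation):

```python
def fttc_suggestions(element_stride, box_size):
    """Sketch of the repair hints: divisors of the box size are candidate
    strides; the first few multiples of the stride are candidate box sizes.
    """
    stride_candidates = [d for d in range(1, box_size + 1) if box_size % d == 0]
    box_candidates = [element_stride * i for i in range(1, 5)]
    return stride_candidates, box_candidates

strides, boxes = fttc_suggestions(3, 8)
assert strides == [1, 2, 4, 8]   # matches suggestion 1 above
assert boxes == [3, 6, 9, 12]    # matches suggestion 2 above
```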
X. Jensen's Razor
Here's the principle:
Never attribute to search what can be proven by construction.
NVIDIA's approach: generate candidate schedules, benchmark on silicon, accumulate winners in Myelin. Repeat for years across GPU generations. Build an unassailable moat of empirical data.
The razorgirl approach: encode the invariants as types. Constrain the search space to provably-legal points. Trade breadth for depth. Win on your workload—the weird one, the fused one, the one with batch size 1 and sequence length 7777—by knowing things about it that MCTS never will.
The MCTS doesn't know that my workload always has batch size 1. My types do.
The MCTS doesn't know that my epilogue is always ReLU. My types do.
I can't beat Myelin at its game. So I play a different game: constrained optimization over a space I can actually reason about.
XI. The Ducal Palace at Mantua
'The Ducal Palace at Mantua,' she said, 'contains a series of increasingly smaller rooms. They twine around the grand apartments, beyond beautifully carved doorframes one stoops to enter. They housed the court dwarfs.'
The IterDomain transformation algebra is the Ducal Palace. Every split creates smaller rooms. Every merge creates doorframes you stoop to enter. The TMA-protected domains are the grand apartments; the holes in indivisible splits are where the dwarfs live.
3Jane continues:
'I might aspire to that, I suppose, but in a sense my family has already accomplished a grander version of the same scheme...'
NVIDIA has accomplished a grander version. The Villa Straylight of CUDA. Rooms within rooms within rooms. And at the center, the tensor core, "the only rectilinear chamber in the complex."
But then she says:
'Take your word, thief.'
The word is three tones—a true name. The specification that unlocks everything.
NVIDIA wrote down the true names. The FTTC. The IterDomain algebra. The divisibility invariants. The swizzle patterns. They wrote them in markdown, proved them by hand, and released them under BSD-3-Clause.
They just didn't think anyone would read the blueprints.
XII. razorgirl
We're calling the software razorgirl.
We believe the following:
- nvfuser is the only way to know what NVIDIA really thinks about something. Not the marketing. Not the examples. The actual scheduling logic they use internally.
- Myelin is too important to throw away. The accumulated autotuning wisdom matters. But it's not the only path.
- Tile IR probably fits in here somehow. Maybe we use it to guide MLIR integration. Maybe we parse it to extract the transformations NVIDIA thinks are important.
- You have to read arbitrary CUTE/CUTLASS. The layout algebra is the lingua franca. If you can't parse a Thread-Value decomposition, you can't reason about register allocation.
- You have to prove it all in Lean 4. The theorems exist. NVIDIA proved them by hand. We prove them mechanically.
- You have to lower those proofs to C++ and Haskell, and bind them to Python. Proofs that don't generate code are philosophy. We're building infrastructure.
The polyhedral model offers correctness by construction. Tilings that are proven valid. Schedules that are proven deadlock-free. Memory layouts that are proven conflict-free. Error messages that cite theorem numbers and suggest fixes.
XIII. Coda
We have sealed ourselves away behind our money, growing inward, generating a seamless universe of self. The Villa Straylight knows no sky, recorded or otherwise.
NVIDIA's seamless universe of self has one flaw: they documented it. Not for us—for their own engineers, trying to navigate the labyrinth they built. But documentation is documentation. BSD-3-Clause is BSD-3-Clause.
The bright crabs burrow in the maintenance tunnels, alert for micromechanical decay or sabotage. We're not saboteurs. We're readers. We read the semiotics of the Villa. We traced the stairwells vaulted like intestines. We found the small room at the silicon core, the only rectilinear chamber in the complex.
And we took the word.
I studied the blade. The blade studied me back. It cited Theorem 6, noted that my element stride didn't divide my box size, and suggested I either change my striding or accept weak correctness.
It was right.
// appendix: key documents referenced //
The nvfuser documentation studied includes:
- doc/reading/tma-modeling-in-depth.md — The Fundamental Theorem of TMA Correctness
- doc/reading/divisibility-of-split.md — Weak and strong correctness models
- doc/reading/iterdomain.md — Mathematical theory of IterDomain transformations
- doc/dev/tma.md — TMA scheduling tutorial with box/tile definitions
- doc/dev/cute_tv_layout.md — CuTe Thread-Value layout construction
- doc/math/integer-division.md — 35+ theorems about Euclidean and truncation division
- doc/math/abstract-algebra.md — Algebraic structures underlying integer arithmetic
All BSD-3-Clause licensed. All waiting to be encoded as types.
"Take your word, thief."
He jacked.