Kernel DSL

Target version: v0.16.

What

A second surface for writing performance-sensitive numeric code, with explicit blocking, axis-aware reductions, and per-axis loops. The kernel DSL is the place where you would write a matmul or a stencil by hand.

kernel matmul[M, K, N](
    a: f32[M, K], b: f32[K, N]
) -> f32[M, N] {
    # tile the loops
    for i in tile(M, 32) {
        for j in tile(N, 32) {
            for k in tile(K, 8) {
                # inner block uses element-wise on f32[32, 32], f32[32, 8]
                ...
            }
        }
    }
}

The kernel DSL re-introduces for, but only inside kernel declarations, where it denotes a tight loop over a compile-time-known iteration space. Pure-language code outside a kernel keeps using the loop primitives (tabulate, scan, iterate_until, each).
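To make the loop structure concrete, here is a rough Python sketch of what a tiled nest like the one above computes. It is not esque code and says nothing about how the compiler would lower a kernel. The helper tile and the function matmul_tiled are hypothetical names chosen for illustration, and the sketch assumes that tile(M, 32) walks consecutive blocks of at most 32 indices, with the last block clipped when the size does not divide the extent.

```python
def tile(extent, size):
    """Yield (start, stop) bounds of consecutive blocks covering range(extent).

    Assumption: this stands in for what `tile(M, 32)` iterates over; the
    last block is clipped when `size` does not divide `extent`.
    """
    for start in range(0, extent, size):
        yield start, min(start + size, extent)


def matmul_tiled(a, b, tm=32, tn=32, tk=8):
    """Multiply a (M x K) by b (K x N), both lists of rows, block by block."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0, i1 in tile(m, tm):
        for j0, j1 in tile(n, tn):
            for k0, k1 in tile(k, tk):
                # Inner block: c[i0:i1, j0:j1] += a[i0:i1, k0:k1] @ b[k0:k1, j0:j1]
                for i in range(i0, i1):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, k1):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, j1):
                            row_c[j] += a_ik * row_b[j]
    return c
```

Every bound in the nest depends only on M, K, N and the tile sizes, which is the sense in which the iteration space is known ahead of time.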

Why a separate surface

esque's main pitch is that programs read top-down as data transformations. That is exactly the opposite of how a fast matmul is written: a matmul is its loop nest, and the imperative loop reads better than the equivalent fold. Putting that style behind a kernel keyword tells the reader (and the compiler) which mode they are in.
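The contrast can be sketched in Python as a stand-in (this is not esque syntax, and matmul_fold and matmul_loops are hypothetical names). Both functions compute the same product: the first expresses each output element as a fold, the second as an explicit loop nest.

```python
from functools import reduce


def matmul_fold(a, b):
    """Each output element is a fold over one row of a and one column of b."""
    cols = list(zip(*b))  # columns of b
    return [
        [reduce(lambda acc, xy: acc + xy[0] * xy[1], zip(row, col), 0.0)
         for col in cols]
        for row in a
    ]


def matmul_loops(a, b):
    """The same product written as an explicit i/j/k loop nest."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for kk in range(k):
                c[i][j] += a[i][kk] * b[kk][j]
    return c
```

In the fold version the iteration order is implicit, so there is no obvious place to say "process K in blocks of 8". In the loop version each loop is a handle for that kind of decision, which is the property the kernel surface is meant to expose.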

Why not today

Two reasons. First, several supporting pieces have to land beforehand, most importantly the GPU backend (kernels target both CPU and GPU) and the linear-types story (kernel-local buffers in shared memory are linear). Second, the design is genuinely hard, and shipping it before the rest of the language is solid would constrain it.

Relationship to existing primitives

The pure-language loop primitives (tabulate, scan, iterate_until, each) cover most things you need. The kernel DSL is for the small fraction of code where those are not enough: matmul, conv, attention, custom kernels.