x86 backend

internal/backend/cpu/x86. Hand-written instruction selection, register allocation, and encoding for x86-64 Linux. No LLVM, no MC, no third-party encoders.

Instruction selection

isel.go. Tree-pattern matching on MIR instructions. Each MIR op has one or more emit sites that pick an x86 form based on operand shape and types. Highlights:

Element-wise float ops on f32 tensors are emitted as 256-bit AVX2 (vaddps, vmulps, etc.) when the run length is a multiple of 8, 128-bit SSE (addps, mulps) when it is a multiple of 4 but not 8, and a hybrid tail (SSE + scalar) for irregular shapes. Strip-mining is done at isel time so the encoder sees concrete instruction types.
Reductions (+/) emit horizontal-add chains: vhaddps/haddps to collapse SIMD lanes, plus a scalar tail.
OpRodataPtr emits a single lea sym(%rip), reg and the ELF writer fixes it up via R_X86_64_PC32.
OpCall dispatches separately for float-arg and int-arg forms (the System V AMD64 split).
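
The strip-mining rule for element-wise f32 ops can be sketched as below. This is a minimal illustration, not the code in isel.go: the names Width, Chunk, and stripMine are hypothetical, and it assumes the literal reading of the rule above (all-ymm for multiples of 8, all-xmm for other multiples of 4, xmm chunks plus a scalar tail otherwise).

```go
package main

// Width is the number of f32 lanes one emitted instruction covers.
// The type and constant names are illustrative, not taken from isel.go.
type Width int

const (
	Scalar Width = 1 // one element per instruction (e.g. addss)
	SSE    Width = 4 // four lanes in an xmm register (e.g. addps)
	AVX    Width = 8 // eight lanes in a ymm register (e.g. vaddps)
)

// Chunk is one planned instruction covering Lanes elements from Offset.
type Chunk struct {
	Offset int
	Lanes  Width
}

// stripMine splits a run of n f32 elements into concrete chunks.
// Assumption: a run that is a multiple of 8 is covered entirely by
// 256-bit chunks; every other run uses 128-bit chunks for as long as
// four elements remain, then finishes with scalar chunks.
func stripMine(n int) []Chunk {
	body := SSE
	if n%8 == 0 {
		body = AVX
	}
	var plan []Chunk
	off := 0
	for ; off+int(body) <= n; off += int(body) {
		plan = append(plan, Chunk{Offset: off, Lanes: body})
	}
	for ; off < n; off++ {
		plan = append(plan, Chunk{Offset: off, Lanes: Scalar})
	}
	return plan
}
```

Under these assumptions a run of 16 yields two 256-bit chunks, a run of 12 yields three 128-bit chunks, and a run of 7 yields one 128-bit chunk followed by three scalar ones. Because the plan is fixed before encoding, each chunk maps to exactly one instruction form.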

Register allocation

regalloc.go. Linear-scan over MIR.

Caller-saved: RAX, RCX, RDX, RSI, RDI, R8–R11 (the full System V caller-saved set). A call may clobber any of them, so they only hold values that are not live across a call. The allocator's general-purpose pool is a subset of these (RCX and R8–R11), because RAX is reserved for the return value, RDX for high-half results / division, and RDI/RSI are reused for argument shuffling at call sites.
Callee-saved: RBX, R12–R15 (plus RBP, which is reserved as the frame pointer and not in the allocator pool). Used for values live across calls.
SIMD pool: 15 slots, xmm1–xmm15 (xmm0 is reserved for the float return value, per the System V ABI). When CEIR produces more simultaneously-live float values than the pool has slots, the allocator spills to stack slots; this is what TestFoldAdd16 exercises. Each xmmN is the low 128 bits of ymmN, so allocating either alias marks the other busy.
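
For readers unfamiliar with linear scan, here is a minimal sketch of the general technique with a fixed-size pool and spilling. It is not the code in regalloc.go: Interval and linearScan are hypothetical names, the spill heuristic (evict the live interval that ends last) is the textbook choice and an assumption here, and the sketch ignores register classes, calls, and xmm/ymm aliasing.

```go
package main

import "sort"

// Interval is the live range of one value, [Start, End] in instruction
// indices. Reg is the assigned pool index, or -1 once spilled.
// (Hypothetical type for illustration.)
type Interval struct {
	Start, End int
	Reg        int
}

// linearScan assigns each interval one of numRegs registers and returns
// how many intervals were spilled to the stack.
func linearScan(ivs []*Interval, numRegs int) int {
	sort.Slice(ivs, func(i, j int) bool { return ivs[i].Start < ivs[j].Start })

	free := make([]int, 0, numRegs)
	for r := numRegs - 1; r >= 0; r-- {
		free = append(free, r) // popped from the end, so register 0 goes first
	}
	var active []*Interval // intervals currently holding a register
	spills := 0

	for _, cur := range ivs {
		// Expire intervals that ended before cur starts; reclaim their registers.
		kept := active[:0]
		for _, a := range active {
			if a.End < cur.Start {
				free = append(free, a.Reg)
			} else {
				kept = append(kept, a)
			}
		}
		active = kept

		if len(free) > 0 {
			cur.Reg = free[len(free)-1]
			free = free[:len(free)-1]
			active = append(active, cur)
			continue
		}

		// Pool exhausted: spill whichever candidate ends last.
		spills++
		victim, vi := cur, -1
		for i, a := range active {
			if a.End > victim.End {
				victim, vi = a, i
			}
		}
		if vi >= 0 {
			cur.Reg = victim.Reg // cur takes over the victim's register
			victim.Reg = -1
			active[vi] = cur
		} else {
			cur.Reg = -1 // cur itself ends last, so it is the one spilled
		}
	}
	return spills
}
```

With a pool of 15 registers and 16 intervals that are all live at once, this sketch spills exactly one interval, which is the kind of pressure the SIMD pool description above refers to.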

Encoder

encode.go. Emits raw bytes for each instruction form.

Every instruction form encodes:

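REX prefix (if needed; legacy and SSE forms only).
VEX prefix for AVX forms. It carries the REX extension bits and the opcode-map escape, so these forms take no separate REX byte.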
Opcode bytes.
ModR/M byte.
SIB byte (for memory operands with index).
Displacement / immediate.

The encoder is intentionally narrow: only the forms the codegen actually emits are implemented. A new instruction form is an explicit addition; there is no general assembler.
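
As a concrete instance of one narrow form, the sketch below encodes the lea sym(%rip), reg that OpRodataPtr emits. It is an illustration, not the code in encode.go: encodeLeaRIP and the Reloc struct are hypothetical names, and register numbering follows the hardware encoding (RAX=0 through R15=15). The byte layout itself (REX.W, opcode 0x8D, ModR/M with mod=00 and rm=101, then a disp32) is the standard x86-64 encoding.

```go
package main

// Reloc tells the ELF writer where to patch a 32-bit PC-relative field.
// The struct and its field names are hypothetical.
type Reloc struct {
	Offset int    // byte offset of the disp32 field within the instruction
	Sym    string // symbol the displacement refers to
	Addend int64  // -4: RIP-relative addressing counts from the end of the field
}

// encodeLeaRIP encodes `lea sym(%rip), reg` for a 64-bit destination.
// The displacement is left as zero for an R_X86_64_PC32 fix-up.
func encodeLeaRIP(reg byte, sym string) ([]byte, Reloc) {
	rex := byte(0x48) // 0100 W000 with W=1: 64-bit operand size
	if reg >= 8 {
		rex |= 0x04 // REX.R extends the ModR/M reg field to reach R8-R15
	}
	modrm := (reg&7)<<3 | 0x05 // mod=00, reg=low 3 bits, rm=101 (RIP + disp32)

	out := []byte{rex, 0x8D, modrm} // 0x8D is the LEA opcode
	out = append(out, 0, 0, 0, 0)   // disp32 placeholder, patched at link time
	return out, Reloc{Offset: 3, Sym: sym, Addend: -4}
}
```

For RAX this produces 48 8D 05 followed by four zero bytes; for R9 the first three bytes become 4C 8D 0D. No SIB byte or immediate is needed for this form, which is why each form is written out explicitly rather than derived from a general assembler.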

ABI

abi.go. System V AMD64.

Class	Registers
Integer args	`RDI`, `RSI`, `RDX`, `RCX`, `R8`, `R9`
Float args	`XMM0`–`XMM7`
Integer return	`RAX`
Float return	`XMM0`
Stack alignment	16 bytes at every call boundary
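
The table can be read as two independent register sequences, one per class. The sketch below shows that assignment for scalar arguments only. It is not the code in abi.go: ArgClass and assignArgs are hypothetical names, and aggregates, varargs, and the layout of stack-passed arguments are left out.

```go
package main

// ArgClass is the System V class of a scalar argument.
// (Hypothetical type for illustration.)
type ArgClass int

const (
	Int ArgClass = iota
	Float
)

var intArgRegs = []string{"RDI", "RSI", "RDX", "RCX", "R8", "R9"}
var floatArgRegs = []string{"XMM0", "XMM1", "XMM2", "XMM3", "XMM4", "XMM5", "XMM6", "XMM7"}

// assignArgs returns a location for each argument: a register name, or
// "stack" once the registers for that argument's class are used up.
// Integer and float arguments are counted separately.
func assignArgs(classes []ArgClass) []string {
	locs := make([]string, len(classes))
	nextInt, nextFloat := 0, 0
	for i, c := range classes {
		switch {
		case c == Int && nextInt < len(intArgRegs):
			locs[i] = intArgRegs[nextInt]
			nextInt++
		case c == Float && nextFloat < len(floatArgRegs):
			locs[i] = floatArgRegs[nextFloat]
			nextFloat++
		default:
			locs[i] = "stack"
		}
	}
	return locs
}
```

For a signature of (int, float, int) this gives RDI, XMM0, RSI: the float argument does not consume an integer slot. A seventh integer argument falls through to the stack.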

Runtime intrinsics

runtime.go. Hand-written x86-64 implementations of print_i32, print_f32, and print_str. Each is a small block of bytes that does decimal/string formatting and a Linux write(2) syscall directly. The driver appends them to the user's .o only when an unresolved relocation refers to one of these names and the user has not defined a function with the same name. Each intrinsic returns its argument unchanged so calls compose in expression position; all three carry the @io effect.
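
The append rule can be sketched as a simple filter. This is an illustration of the condition described above, not the driver's code: intrinsicsToAppend and its set-valued parameters are hypothetical.

```go
package main

// intrinsics lists the runtime routines named in this section.
var intrinsics = []string{"print_i32", "print_f32", "print_str"}

// intrinsicsToAppend returns the intrinsics the driver would append.
// unresolved holds symbol names referenced by unresolved relocations;
// defined holds function names the user has defined.
// (Hypothetical helper; the real driver's data structures may differ.)
func intrinsicsToAppend(unresolved, defined map[string]bool) []string {
	var out []string
	for _, name := range intrinsics {
		if unresolved[name] && !defined[name] {
			out = append(out, name)
		}
	}
	return out
}
```

A program that calls print_i32 and defines its own print_str would get only print_i32 appended, so unused intrinsics add no bytes to the object file and user definitions take precedence.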

Tests

cpu/x86/encode_test.go — encoder unit tests, byte-exact.
cpu/x86/isel_test.go — isel pattern-match tests.
cpu/x86/regalloc_test.go — register allocation, spills.
tests/e2e/* — end-to-end fixtures that compile and run.
