x86 backend
internal/backend/cpu/x86. Hand-written instruction selection,
register allocation, and encoding for x86-64 Linux. No LLVM, no MC,
no third-party encoders.
Instruction selection
isel.go. Tree-pattern matching on MIR instructions. Each MIR op
has one or more emit sites that pick an x86 form based on operand
shape and types. Highlights:
- Element-wise float ops on f32 tensors are emitted as 256-bit AVX2 (vaddps, vmulps, etc.) when the run length is a multiple of 8, 128-bit SSE (addps, mulps) when it is a multiple of 4 but not 8, and a hybrid tail (SSE + scalar) for irregular shapes. Strip-mining is done at isel time so the encoder sees concrete instruction types.
- Reductions (+/) emit horizontal-add chains: vhaddps/haddps to collapse SIMD lanes, plus a scalar tail.
- OpRodataPtr emits a single lea sym(%rip), reg and the ELF writer fixes it up via R_X86_64_PC32.
- OpCall dispatches separately for float-arg and int-arg forms (the System V AMD64 split).
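The width selection for element-wise ops can be sketched as follows. This is a minimal illustration of the three cases listed above, not the code in isel.go; the names stripPlan and planStrip are hypothetical.

```go
package main

import "fmt"

// stripPlan is a hypothetical record of how many ops of each width
// cover one run of f32 elements.
type stripPlan struct {
	YMM    int // 256-bit ops, 8 f32 lanes each
	XMM    int // 128-bit ops, 4 f32 lanes each
	Scalar int // single-lane scalar ops
}

// planStrip chooses vector widths for a run of n f32 elements,
// following the three cases described in the list above.
func planStrip(n int) stripPlan {
	switch {
	case n%8 == 0:
		// Multiple of 8: 256-bit ops only.
		return stripPlan{YMM: n / 8}
	case n%4 == 0:
		// Multiple of 4 but not 8: 128-bit ops only.
		return stripPlan{XMM: n / 4}
	default:
		// Irregular shape: 128-bit ops plus a scalar tail.
		return stripPlan{XMM: n / 4, Scalar: n % 4}
	}
}

func main() {
	for _, n := range []int{16, 12, 7} {
		fmt.Printf("n=%d -> %+v\n", n, planStrip(n))
	}
}
```

Because the plan is fixed before encoding, each entry maps to a concrete instruction form and the encoder never has to reason about run lengths.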
Register allocation
regalloc.go. Linear-scan over MIR.
- Caller-saved: RAX, RCX, RDX, RSI, RDI, R8–R11 (the full System V caller-saved set). Calls clobber them, so across a call they are usable only as scratch. The allocator's general-purpose pool is a subset of these (RCX and R8–R11) because RAX is reserved for the return value, RDX for high-half results / division, and RDI/RSI are reused for argument shuffling at call sites.
- Callee-saved: RBX, R12–R15 (plus RBP, which is reserved as the frame pointer and not in the allocator pool). Used for values live across calls.
- SIMD pool: 15 slots, xmm1–xmm15 (xmm0 is reserved for the float return value, per the System V ABI). When CEIR produces more simultaneously-live float values than the pool has slots, the allocator spills to stack slots; this is what TestFoldAdd16 exercises. xmm/ymm share the physical SIMD register file, so allocating one alias marks the other busy.
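The spill behaviour can be illustrated with a textbook linear-scan loop. This is a sketch under stated assumptions, not the contents of regalloc.go: the interval type, the "spill the interval that ends last" heuristic, and the use of -1 to mark a spilled value are all hypothetical choices made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// interval is a hypothetical live range [Start, End) for one virtual register.
type interval struct {
	VReg, Start, End int
}

// linearScan assigns numRegs physical registers to the intervals.
// The result maps each virtual register to a register index,
// or to -1 when the value was spilled to a stack slot.
func linearScan(ivs []interval, numRegs int) map[int]int {
	sorted := append([]interval(nil), ivs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })

	out := map[int]int{}
	free := make([]int, 0, numRegs)
	for r := numRegs - 1; r >= 0; r-- {
		free = append(free, r)
	}
	var active []interval // intervals currently holding a register

	for _, cur := range sorted {
		// Expire intervals that ended at or before cur starts.
		kept := active[:0]
		for _, a := range active {
			if a.End <= cur.Start {
				free = append(free, out[a.VReg])
			} else {
				kept = append(kept, a)
			}
		}
		active = kept

		if len(free) > 0 {
			out[cur.VReg] = free[len(free)-1]
			free = free[:len(free)-1]
			active = append(active, cur)
			continue
		}

		// Pool exhausted: find the active interval that ends last,
		// provided it outlives cur.
		victim := -1
		for i, a := range active {
			if a.End > cur.End && (victim < 0 || a.End > active[victim].End) {
				victim = i
			}
		}
		if victim < 0 {
			out[cur.VReg] = -1 // cur ends last, so cur is spilled
			continue
		}
		// Steal the victim's register and spill the victim instead.
		out[cur.VReg] = out[active[victim].VReg]
		out[active[victim].VReg] = -1
		active[victim] = cur
	}
	return out
}

func main() {
	// Sixteen values live at once against a 15-slot pool.
	ivs := make([]interval, 16)
	for i := range ivs {
		ivs[i] = interval{VReg: i, Start: 0, End: 10}
	}
	spills := 0
	for _, r := range linearScan(ivs, 15) {
		if r < 0 {
			spills++
		}
	}
	fmt.Println("spilled:", spills)
}
```

The example in main mirrors the situation described for the SIMD pool: one more simultaneously-live value than there are slots forces a spill.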
Encoder
encode.go. Emits raw bytes for each instruction form.
Each instruction form is emitted as a sequence of the following fields:
- REX prefix (if needed).
- Optional opcode-extension bytes for VEX/AVX.
- Opcode bytes.
- ModR/M byte.
- SIB byte (for memory operands with index).
- Displacement / immediate.
The encoder is intentionally narrow: only the forms the codegen actually emits are implemented. A new instruction form is an explicit addition; there is no general assembler.
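As an illustration of what one narrow instruction form looks like, the sketch below encodes the 64-bit register-to-register MOV (opcode 0x89 /r). The function name encodeMovRR and the register constants are hypothetical and are not taken from encode.go; the byte layout follows the standard x86-64 encoding.

```go
package main

import "fmt"

// Hypothetical register numbers, in hardware encoding order.
const (
	RAX = 0
	RCX = 1
	R8  = 8
)

// encodeMovRR encodes the 64-bit form of MOV dst, src (opcode 0x89 /r):
// a REX prefix, the opcode byte, then a ModR/M byte in register-direct mode.
func encodeMovRR(dst, src int) []byte {
	rex := byte(0x48) // 0100 with W=1: 64-bit operand size
	if src >= 8 {
		rex |= 0x04 // REX.R extends the ModR/M reg field
	}
	if dst >= 8 {
		rex |= 0x01 // REX.B extends the ModR/M rm field
	}
	// mod=11 (register direct), reg=src, rm=dst.
	modrm := byte(0xC0) | byte(src&7)<<3 | byte(dst&7)
	return []byte{rex, 0x89, modrm}
}

func main() {
	fmt.Printf("mov rax, rcx: % x\n", encodeMovRR(RAX, RCX))
	fmt.Printf("mov r8, rax:  % x\n", encodeMovRR(R8, RAX))
}
```

This form needs no SIB byte, displacement, or immediate, which is why a per-form encoder can stay this small.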
ABI
abi.go. System V AMD64.
| Class | Registers |
|---|---|
| Integer args | RDI, RSI, RDX, RCX, R8, R9 |
| Float args | XMM0–XMM7 |
| Integer return | RAX |
| Float return | XMM0 |
| Stack alignment | 16 bytes at every call boundary |
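The table can be read as two independent register sequences, one for integer arguments and one for float arguments. The sketch below shows that assignment; assignArgs is a hypothetical helper, not a function from abi.go, and it labels every overflow argument simply as "stack" without modelling stack layout.

```go
package main

import "fmt"

var intArgRegs = []string{"RDI", "RSI", "RDX", "RCX", "R8", "R9"}

var floatArgRegs = []string{
	"XMM0", "XMM1", "XMM2", "XMM3", "XMM4", "XMM5", "XMM6", "XMM7",
}

// assignArgs returns the location of each argument, where isFloat[i]
// says whether argument i is a float. Integer and float arguments
// consume their register sequences independently; once a sequence is
// exhausted, further arguments of that class go to the stack.
func assignArgs(isFloat []bool) []string {
	locs := make([]string, len(isFloat))
	nextInt, nextFloat := 0, 0
	for i, f := range isFloat {
		switch {
		case f && nextFloat < len(floatArgRegs):
			locs[i] = floatArgRegs[nextFloat]
			nextFloat++
		case !f && nextInt < len(intArgRegs):
			locs[i] = intArgRegs[nextInt]
			nextInt++
		default:
			locs[i] = "stack"
		}
	}
	return locs
}

func main() {
	// f(int, float, int, float)
	fmt.Println(assignArgs([]bool{false, true, false, true}))
}
```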
Runtime intrinsics
runtime.go. Hand-written x86-64 implementations of print_i32,
print_f32, and print_str. Each is a small block of bytes that
does decimal/string formatting and a Linux write(2) syscall
directly. The driver appends them to the user's .o only when an
unresolved relocation refers to one of these names and the user
has not defined a function with the same name. Each intrinsic
returns its argument unchanged so calls compose in expression
position; all three carry the @io effect.
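The append rule can be expressed as a small filter over symbol names. This is only a sketch of the stated condition; intrinsicsToAppend and its parameters are hypothetical and do not describe the driver's actual data structures.

```go
package main

import "fmt"

// intrinsicNames lists the runtime intrinsics named in this section.
var intrinsicNames = []string{"print_i32", "print_f32", "print_str"}

// intrinsicsToAppend returns the intrinsics to append to the object file.
// unresolved holds the symbol names targeted by unresolved relocations;
// userDefined holds the names of functions the user defined.
func intrinsicsToAppend(unresolved, userDefined []string) []string {
	need := map[string]bool{}
	for _, s := range unresolved {
		need[s] = true
	}
	have := map[string]bool{}
	for _, s := range userDefined {
		have[s] = true
	}
	var out []string
	for _, name := range intrinsicNames {
		// Append only when referenced and not shadowed by a user definition.
		if need[name] && !have[name] {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	fmt.Println(intrinsicsToAppend([]string{"print_i32", "print_str"}, []string{"print_str"}))
}
```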
Tests
- cpu/x86/encode_test.go: encoder unit tests, byte-exact.
- cpu/x86/isel_test.go: isel pattern-match tests.
- cpu/x86/regalloc_test.go: register allocation, spills.
- tests/e2e/*: end-to-end fixtures that compile and run.