Skip to main content

Lexer

internal/lex. A hand-written byte-level UTF-8 lexer.

Token shape

type Token struct {
Kind Kind // TkInt, TkIdent, TkPlus, ...
Lit string // raw lexeme
Span diag.Span // source location
}

Kind is an enum (internal/lex/token.go). Lit is the raw lexeme — for an IntLit like 1_000_i32 it is the literal text including underscores and suffix; semantic interpretation happens in the parser.

Categories

CategoryExamples
Keywordsfn, return, let, if, else, match, true, false, in, as, mut
Identifiersfoo, _bar, MyType. (for, while, break, continue lex as identifiers — not keywords.)
Literals42, 1_000, 42_i32, 'A', 3.14, "...", true
Operators+, .+, |>, ==, .., ..=, =>, ->
Punctuation(, ), [, ], {, }, ;, ,, :, |

Notable design choices

  • Byte-level for ASCII, validated for non-ASCII inside string and char literals. This avoids runtime regex but accepts the full Unicode codepoint range inside '...'.
  • Line comments use #, not //. // is the divide-reduction operator (TkSlashSlash); -/ is subtract-reduction (TkMinusSlash). The line-comment lead-in moved to # in v0.14 so the operator surface could complete.
  • Block comments are nestable: /* /* nested */ still inside */. This is a deliberate divergence from C.
  • Spans are byte offsets into the original source, plus a 1-based line/column for diagnostics. The diagnostic renderer maps them back to the source line.

Why for, while, break, continue are not keywords

esque does not have C-style control flow; the loop primitives (tabulate, scan, iterate, iterate_until, each) replace them. To keep the surface small and let those identifiers serve as ordinary variable names, the lexer treats for, while, break, and continue as plain identifiers.