Tokenization

How Galore handles lexical analysis with TLEX

Galore uses TLEX for lexical analysis. This page covers the basics of token definition in Galore. For advanced lexer features, see the TLEX documentation.

Defining Tokens

%token Directive

Define named tokens with regex patterns:

%token NUMBER /[0-9]+/
%token IDENT /[a-zA-Z_][a-zA-Z0-9_]*/
%token STRING /"([^"\\]|\\.)*"/
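
Before wiring patterns like these into a grammar, it can help to check what they accept. The TypeScript sketch below runs the same three regexes as plain `RegExp` objects. It does not use Galore or TLEX, and `fullMatch` is a hypothetical helper introduced only for this illustration.

```typescript
// The same patterns as the %token definitions above, as plain RegExp objects.
const NUMBER_RE = /[0-9]+/;
const IDENT_RE = /[a-zA-Z_][a-zA-Z0-9_]*/;
const STRING_RE = /"([^"\\]|\\.)*"/;

// Hypothetical helper (not part of Galore or TLEX):
// true when `pattern` matches the whole of `text`.
function fullMatch(pattern: RegExp, text: string): boolean {
  const anchored = new RegExp("^(?:" + pattern.source + ")$");
  return anchored.test(text);
}

console.log(fullMatch(NUMBER_RE, "2024"));          // true
console.log(fullMatch(IDENT_RE, "_tmp1"));          // true
console.log(fullMatch(IDENT_RE, "1abc"));           // false: cannot start with a digit
console.log(fullMatch(STRING_RE, '"a \\" b"'));     // true: the escaped quote stays inside
console.log(fullMatch(STRING_RE, '"unterminated')); // false: no closing quote
```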

Inline Tokens

String literals and regex patterns used directly in rules become tokens automatically:

// These create tokens automatically:
Stmt -> "if" "(" Expr ")" Stmt ;
Number -> /[0-9]+/ ;

Skipping Whitespace and Comments

Use %skip to define patterns that are consumed but don't produce tokens:

// Skip whitespace
%skip /[ \t\n\r]+/

// Skip line comments
%skip /\/\/.*/

// Skip block comments
%skip /\/\*[\s\S]*?\*\//
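
To illustrate what "consumed but don't produce tokens" means, here is a small TypeScript sketch that advances past any run of skippable text using the three patterns above. It is a simplified model, not TLEX's implementation. The `skipIgnored` helper is a hypothetical name, and each pattern is anchored with `^` so it can be tested against the remaining input.

```typescript
// The three %skip patterns above, anchored to the start of the remaining input.
const SKIP_PATTERNS: RegExp[] = [
  /^[ \t\n\r]+/,        // whitespace
  /^\/\/.*/,            // line comment
  /^\/\*[\s\S]*?\*\//,  // block comment
];

// Hypothetical helper: return the first position at or after `pos`
// that is not covered by a skip pattern.
function skipIgnored(input: string, pos: number): number {
  let advanced = true;
  while (advanced) {
    advanced = false;
    for (const re of SKIP_PATTERNS) {
      const m = re.exec(input.slice(pos));
      if (m !== null && m[0].length > 0) {
        pos += m[0].length;
        advanced = true;
      }
    }
  }
  return pos;
}

const skipSample = "  // note\n/* a\nb */ 42";
const firstTokenAt = skipIgnored(skipSample, 0);
console.log(skipSample.slice(firstTokenAt)); // 42
```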

Regex Pattern Syntax

By default, Galore uses JavaScript regex syntax:

Pattern          Meaning
[0-9]+           One or more digits
[a-zA-Z_]\w*     Identifier
"[^"]*"          Double-quoted string (simple)
\/\/.*           Line comment
\s+              Whitespace

Regex Flags

Append flags after the closing /:

%skip /\/\*.*?\*\//s   // 's' flag: dot matches newlines
%skip /\/\/.*$/m       // 'm' flag: multiline mode
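
Since the default syntax is JavaScript regex, the effect of these flags can be checked with plain `RegExp` objects. The TypeScript sketch below assumes Galore passes the flags through with their usual JavaScript meaning; the sample strings are made up for illustration.

```typescript
const blockComment = "/* line one\n   line two */";

// Without 's', '.' cannot cross the newline, so the comment never closes.
const blockWithoutS = new RegExp("/\\*.*?\\*/");
// With 's', '.' also matches newline characters.
const blockWithS = new RegExp("/\\*.*?\\*/", "s");

console.log(blockWithoutS.test(blockComment)); // false
console.log(blockWithS.test(blockComment));    // true

const twoLines = "// first\nx = 1";

// Without 'm', '$' only matches at the very end of the input.
console.log(/\/\/.*$/.test(twoLines));  // false
// With 'm', '$' also matches just before each newline.
console.log(/\/\/.*$/m.test(twoLines)); // true
```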

Reusable Patterns

Use %define to create named patterns for composition:

%define DIGIT [0-9]
%define ALPHA [a-zA-Z]

%token NUMBER /{DIGIT}+/
%token IDENT /{ALPHA}({ALPHA}|{DIGIT})*/
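
One way to picture `%define` is as textual substitution of `{NAME}` references. The TypeScript sketch below models that idea. It is an assumption about the behavior, not a description of how Galore or TLEX implement it, and `expandPattern` is a hypothetical helper.

```typescript
// Named fragments, mirroring the %define lines above.
const defines: { [name: string]: string } = {
  DIGIT: "[0-9]",
  ALPHA: "[a-zA-Z]",
};

// Hypothetical helper: replace each {NAME} reference with its definition.
// Names that are not defined are left untouched.
function expandPattern(pattern: string): string {
  return pattern.replace(
    /\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (whole: string, name: string): string =>
      Object.prototype.hasOwnProperty.call(defines, name) ? defines[name] : whole
  );
}

const numberPattern = expandPattern("{DIGIT}+");
const identPattern = expandPattern("{ALPHA}({ALPHA}|{DIGIT})*");

console.log(numberPattern); // [0-9]+
console.log(identPattern);  // [a-zA-Z]([a-zA-Z]|[0-9])*
```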

Flex-Style Syntax

Switch to flex-style patterns with %resyntax:

%resyntax flex

%token NUMBER [0-9]+
%skip [ \t\n]+

In flex mode, patterns extend to end of line without delimiters.

Token Priority

When multiple patterns match, TLEX uses priority rules:

  1. Longer matches win over shorter matches
  2. Earlier-defined patterns win over later ones
  3. String literals have higher priority than regex patterns

// "if" will match as keyword, not as IDENT
%token IDENT /[a-zA-Z]+/
Stmt -> "if" Expr "then" Stmt ;
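
The rules above can be modeled in a few lines of TypeScript. This is a simplified sketch, not TLEX's algorithm: `pickRule` and the `Rule` shape are hypothetical, and placing the literals ahead of `IDENT` in the list is an assumption that lets rule 2 stand in for rule 3.

```typescript
interface Rule {
  tag: string;
  pattern: RegExp;
}

// Literals come first so that, on a tie, "earlier wins" favors them
// (an assumption of this sketch that models rule 3 through rule 2).
const priorityRules: Rule[] = [
  { tag: "if", pattern: /^if/ },
  { tag: "then", pattern: /^then/ },
  { tag: "IDENT", pattern: /^[a-zA-Z]+/ },
];

// Choose the rule for the text at the start of `input`:
// the longest match wins, and ties keep the earlier rule.
function pickRule(input: string): { tag: string; text: string } | null {
  let best: { tag: string; text: string } | null = null;
  for (const rule of priorityRules) {
    const m = rule.pattern.exec(input);
    // Only a strictly longer match replaces the current best.
    if (m !== null && (best === null || m[0].length > best.text.length)) {
      best = { tag: rule.tag, text: m[0] };
    }
  }
  return best;
}

console.log(JSON.stringify(pickRule("if x"))); // {"tag":"if","text":"if"}
console.log(JSON.stringify(pickRule("iffy"))); // {"tag":"IDENT","text":"iffy"}
```

In the second call the longer `IDENT` match beats the `"if"` literal, which is why `iffy` is not split into a keyword followed by an identifier.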

Token Handlers

Attach handlers to tokens for custom processing:

%token NUMBER /[0-9]+/ { parseNumber }

Provide the handler when loading the grammar:

import { DSL } from "galore";

const [grammar, tokenFunc] = DSL.load(grammarString, {
  tokenHandlers: {
    parseNumber: (token, tape, owner) => {
      token.value = parseInt(token.value, 10);  // explicit radix: always parse as decimal
      return token;
    }
  }
});
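
Because a handler is an ordinary function, its logic can be tried in isolation before it is registered. The TypeScript sketch below does that with a minimal stand-in token. `TokenLike` is a hypothetical type that models only the `value` field used above, and the `tape` and `owner` arguments are omitted because the handler does not use them.

```typescript
// Minimal stand-in for a token; only the `value` field is modeled here.
interface TokenLike {
  value: unknown;
}

// Same logic as the parseNumber handler above, written as a standalone function.
function parseNumberHandler(token: TokenLike): TokenLike {
  token.value = parseInt(String(token.value), 10);
  return token;
}

const handled = parseNumberHandler({ value: "042" });
console.log(handled.value);        // 42
console.log(typeof handled.value); // number
```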

Custom Tokenizers

Provide your own tokenizer instead of using the auto-generated one:

import { newParser } from "galore";
import * as TLEX from "tlex";

// Create custom tokenizer
const myTokenizer = new TLEX.Tokenizer();
myTokenizer.add(/[0-9]+/, { tag: "NUMBER" });
myTokenizer.add(/[a-z]+/, { tag: "IDENT" });
myTokenizer.add(/\s+/, {}, () => null);  // Skip whitespace

const [parser] = newParser(grammar, {
  tokenizer: myTokenizer.next.bind(myTokenizer)
});

TLEX Documentation

See the TLEX documentation for advanced features such as:

  • Lexer modes and state management
  • Token priority and conflict resolution
  • Custom token transformations
  • Debugging and visualization