Tokenization
How Galore handles lexical analysis with TLEX
Galore uses TLEX for lexical analysis. This page covers the basics of token definition in Galore. For advanced lexer features, see the TLEX documentation.
Defining Tokens
%token Directive
Define named tokens with regex patterns:
%token NUMBER /[0-9]+/
%token IDENT /[a-zA-Z_][a-zA-Z0-9_]*/
%token STRING /"([^"\\]|\\.)*"/
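Since Galore uses JavaScript regex syntax by default, the three patterns above can be sanity-checked in plain TypeScript before they go into a grammar. The sketch below does not call Galore or TLEX; `fullMatch` is a hypothetical helper introduced only for this illustration.

```typescript
// Hypothetical helper (not part of Galore or TLEX): true when `pattern`
// matches the whole of `input`. Flags on `pattern` are ignored for simplicity.
function fullMatch(pattern: RegExp, input: string): boolean {
  const anchored = new RegExp("^(?:" + pattern.source + ")$");
  return anchored.test(input);
}

// The same patterns as in the %token directives above.
const NUMBER = /[0-9]+/;
const IDENT = /[a-zA-Z_][a-zA-Z0-9_]*/;
const STRING = /"([^"\\]|\\.)*"/;

const checks = {
  number: fullMatch(NUMBER, "42"),                    // true
  identifier: fullMatch(IDENT, "_tmp1"),              // true
  leadingDigit: fullMatch(IDENT, "1abc"),             // false: must not start with a digit
  escapedQuote: fullMatch(STRING, '"say \\"hi\\""'),  // true: \" stays inside the string
  unterminated: fullMatch(STRING, '"oops'),           // false: no closing quote
};

console.log(checks);
```

The STRING pattern accepts any backslash escape (`\\.`), which is why an escaped quote does not end the token early.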
Inline Tokens
String literals and regex patterns in rules are automatically tokenized:
// These create tokens automatically:
Stmt -> "if" "(" Expr ")" Stmt ;
Number -> /[0-9]+/ ;
Skipping Whitespace and Comments
Use %skip to define patterns that are consumed but don't produce tokens:
// Skip whitespace
%skip /[ \t\n\r]+/
// Skip line comments
%skip /\/\/.*/
// Skip block comments
%skip /\/\*[\s\S]*?\*\//
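The effect of %skip can be pictured with a small stand-alone scanner. This is a conceptual model only, not the TLEX implementation: the `Rule` shape and the `scan` function are assumptions made for the example, and the scanner simply takes the first rule that matches at the current position.

```typescript
// Conceptual model of skip rules; none of these names come from Galore or TLEX.
interface Rule {
  tag: string;
  pattern: RegExp;
  skip: boolean; // true: consume the match but emit no token
}

const rules: Rule[] = [
  { tag: "WS", pattern: /[ \t\n\r]+/, skip: true },
  { tag: "LINE_COMMENT", pattern: /\/\/.*/, skip: true },
  { tag: "BLOCK_COMMENT", pattern: /\/\*[\s\S]*?\*\//, skip: true },
  { tag: "NUMBER", pattern: /[0-9]+/, skip: false },
  { tag: "IDENT", pattern: /[a-zA-Z_][a-zA-Z0-9_]*/, skip: false },
];

function scan(input: string): string[] {
  const out: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    const rest = input.slice(pos);
    let matched = false;
    for (const rule of rules) {
      // Anchor each pattern at the current position.
      const m = new RegExp("^(?:" + rule.pattern.source + ")").exec(rest);
      if (m && m[0].length > 0) {
        if (!rule.skip) out.push(rule.tag + ":" + m[0]);
        pos += m[0].length; // skipped text still advances the input
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unexpected character at offset " + pos);
  }
  return out;
}

const tokens = scan("x /* note */ 12 // trailing\ny");
console.log(tokens); // [ 'IDENT:x', 'NUMBER:12', 'IDENT:y' ]
```

Whitespace and both comment forms advance the input position but never reach the output list, which is the behavior the parser relies on.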
Regex Pattern Syntax
By default, Galore uses JavaScript regex syntax:
| Pattern | Meaning |
|---|---|
| [0-9]+ | One or more digits |
| [a-zA-Z_]\w* | Identifier |
| "[^"]*" | Double-quoted string (simple) |
| \/\/.* | Line comment |
| \s+ | Whitespace |
Regex Flags
Append flags after the closing /:
%skip /\/\*.*?\*\//s // 's' flag: dot matches newlines
%skip /\/\/.*$/m // 'm' flag: multiline mode
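The difference these two flags make can be seen with ordinary JavaScript regular expressions. The sketch below assumes the flags behave in Galore as they do in JavaScript's `RegExp`; it does not use Galore or TLEX.

```typescript
// 's' flag: without it, '.' stops at a newline, so a block comment
// that spans two lines is not matched.
const multiLine = "/* first\nsecond */";
const blockSrc = "\\/\\*.*?\\*\\/";
const withoutS = new RegExp(blockSrc).test(multiLine);    // false
const withS = new RegExp(blockSrc, "s").test(multiLine);  // true

// 'm' flag: without it, '$' matches only at the end of the input,
// so only the comment on the last line is found.
const source = "a // one\nb // two";
const lineSrc = "\\/\\/.*$";
const withoutM = source.match(new RegExp(lineSrc, "g")) || [];  // ["// two"]
const withM = source.match(new RegExp(lineSrc, "gm")) || [];    // ["// one", "// two"]

console.log(withoutS, withS, withoutM.length, withM.length); // false true 1 2
```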
Reusable Patterns
Use %define to create named patterns for composition:
%define DIGIT [0-9]
%define ALPHA [a-zA-Z]
%token NUMBER /{DIGIT}+/
%token IDENT /{ALPHA}({ALPHA}|{DIGIT})*/
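One way to picture %define is as textual substitution of each {NAME} reference. The sketch below assumes exactly that; how TLEX actually expands definitions (for example, whether it wraps them in groups) is not stated here, and `expandDefines` is a hypothetical helper.

```typescript
// Hypothetical helper: replace every {NAME} reference with its definition.
// Quantifiers such as {2,3} are left alone because names cannot start with a digit.
function expandDefines(pattern: string, defs: { [name: string]: string }): string {
  return pattern.replace(/\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_whole: string, name: string) => {
    const value = Object.prototype.hasOwnProperty.call(defs, name) ? defs[name] : undefined;
    if (value === undefined) {
      throw new Error("Undefined pattern name: " + name);
    }
    return value;
  });
}

const defs = { DIGIT: "[0-9]", ALPHA: "[a-zA-Z]" };

const numberSrc = expandDefines("{DIGIT}+", defs);
const identSrc = expandDefines("{ALPHA}({ALPHA}|{DIGIT})*", defs);

console.log(numberSrc); // [0-9]+
console.log(identSrc);  // [a-zA-Z]([a-zA-Z]|[0-9])*
```

Under this model, the expanded IDENT pattern accepts a letter followed by any mix of letters and digits, the same language as a hand-written character class.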
Flex-Style Syntax
Switch to flex-style patterns with %resyntax:
%resyntax flex
%token NUMBER [0-9]+
%skip [ \t\n]+
In flex mode, patterns extend to end of line without delimiters.
Token Priority
When multiple patterns match, TLEX uses priority rules:
- Longer matches win over shorter matches
- Earlier-defined patterns win over later ones
- String literals have higher priority than regex patterns
// "if" matches as a keyword, not as IDENT, because string literals outrank regex patterns
%token IDENT /[a-zA-Z]+/
Stmt -> "if" Expr "then" Stmt ;
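The three rules can be modeled as a comparison between candidate matches at one input position. The order in which the rules are applied below (length first, then literal over regex, then definition order) is an assumption inferred from the example above, not a statement of how TLEX is implemented; `Candidate`, `beats`, and `pickWinner` are hypothetical names.

```typescript
// A candidate match at the current input position (illustrative shape only).
interface Candidate {
  tag: string;
  text: string;       // the text this pattern would consume
  isLiteral: boolean; // true for string literals such as "if"
  order: number;      // definition order, lower means defined earlier
}

// True when `a` should be preferred over `b`.
function beats(a: Candidate, b: Candidate): boolean {
  if (a.text.length !== b.text.length) return a.text.length > b.text.length; // longer wins
  if (a.isLiteral !== b.isLiteral) return a.isLiteral;                       // literal wins
  return a.order < b.order;                                                  // earlier wins
}

function pickWinner(candidates: Candidate[]): Candidate {
  // reduce without an initial value throws on an empty list, which is intended here.
  return candidates.reduce((best, c) => (beats(c, best) ? c : best));
}

// Input "if": both patterns match two characters, so the literal wins.
const atIf = pickWinner([
  { tag: "IDENT", text: "if", isLiteral: false, order: 0 },
  { tag: '"if"', text: "if", isLiteral: true, order: 1 },
]);

// Input "iffy": IDENT matches four characters, so the longer match wins.
const atIffy = pickWinner([
  { tag: "IDENT", text: "iffy", isLiteral: false, order: 0 },
  { tag: '"if"', text: "if", isLiteral: true, order: 1 },
]);

console.log(atIf.tag, atIffy.tag); // "if" IDENT
```

Under these assumptions, an identifier such as `iffy` is never split into a keyword plus leftover text, while a bare `if` is still recognized as the keyword even though IDENT was defined first.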
Token Handlers
Attach handlers to tokens for custom processing:
%token NUMBER /[0-9]+/ { parseNumber }
Provide the handler when loading the grammar:
import { DSL } from "galore";
const [grammar, tokenFunc] = DSL.load(grammarString, {
  tokenHandlers: {
    // Convert the matched text to a number (base 10)
    parseNumber: (token, tape, owner) => {
      token.value = parseInt(token.value, 10);
      return token;
    }
  }
});
Custom Tokenizers
Provide your own tokenizer instead of using the auto-generated one:
import { newParser } from "galore";
import * as TLEX from "tlex";
// Create custom tokenizer
const myTokenizer = new TLEX.Tokenizer();
myTokenizer.add(/[0-9]+/, { tag: "NUMBER" });
myTokenizer.add(/[a-z]+/, { tag: "IDENT" });
myTokenizer.add(/\s+/, {}, () => null); // Skip whitespace
// Pass the tokenizer's next function to the parser
const [parser] = newParser(grammar, {
  tokenizer: myTokenizer.next.bind(myTokenizer)
});
TLEX Documentation
See the TLEX documentation for advanced features, including:
- Lexer modes and state management
- Token priority and conflict resolution
- Custom token transformations
- Debugging and visualization