in chapter 16, the scanner for the lox vm takes a completely different approach from jlox. in jlox, the scanner made a full pass over the source file and built the complete list of tokens in memory before the parsing phase began.
in the C implementation, the file is still read, but we don't scan all of it up front into a separate list of tokens. instead the scanner points directly into the source text and produces a token only when the compiler asks for one. no more than 2 tokens are needed at any time (the current one and the previous one), because lox's grammar only requires a single token of lookahead to decide how to parse what comes next. this is a lazier and more memory-efficient approach.
for example, here's the scanner struct and how it's initialized:

typedef struct {
  const char* start;
  const char* current;
  int line;
} Scanner;

Scanner scanner;

void initScanner(const char* source) {
  scanner.start = source;
  scanner.current = source;
  scanner.line = 1;
}
- start refers to the beginning of a lexeme (say, an identifier)
- current is the current character being scanned
- there's also some additional metadata, like the line number, for debugging support
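to make the lazy approach concrete, here's a minimal sketch of a scanner that hands out one lexeme per call instead of building a list. this is not the scanner from the book: MiniScanner, Lexeme, miniInit and miniNext are made-up names, and it only splits on whitespace, but the start/current/line bookkeeping follows the same idea.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical mini scanner (not the book's): splits on whitespace only. */
typedef struct {
  const char* start;   /* first character of the lexeme being scanned */
  const char* current; /* character about to be consumed */
  int line;            /* current line, for error reporting */
} MiniScanner;

typedef struct {
  const char* start; /* points into the source, nothing is copied */
  int length;
  int line;
} Lexeme;

static void miniInit(MiniScanner* s, const char* source) {
  assert(source != NULL);
  s->start = source;
  s->current = source;
  s->line = 1;
}

/* Scans exactly one lexeme per call; a length of 0 signals end of input. */
static Lexeme miniNext(MiniScanner* s) {
  /* Skip whitespace, counting newlines as we go. */
  while (isspace((unsigned char)*s->current)) {
    if (*s->current == '\n') s->line++;
    s->current++;
  }
  s->start = s->current;
  /* Consume until the next whitespace or the terminating NUL. */
  while (*s->current != '\0' && !isspace((unsigned char)*s->current)) {
    s->current++;
  }
  Lexeme lexeme = { s->start, (int)(s->current - s->start), s->line };
  return lexeme;
}
```

the caller decides how far the scan goes: each call to miniNext moves the two pointers past one lexeme and stops, so nothing beyond the lexeme that was asked for is ever touched.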
and this is the Token struct for representing a complete lexeme:

typedef struct {
  TokenType type;
  const char* start;
  int length;
  int line;
} Token;
- start is a pointer into the source; again, we're not allocating additional memory to hold token information
- type is our special enum of token kinds, with values like TOKEN_IDENTIFIER
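since a token is just a pointer plus a length, its lexeme is not NUL-terminated, so printing or comparing it has to go through the length. here's a small sketch of what that could look like. the Token struct matches the one above, but the TokenType enum is trimmed to two values for the example, and lexemeToString and lexemeEquals are hypothetical helpers, not functions from the book.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Trimmed-down enum for illustration; the real one has many more values. */
typedef enum { TOKEN_IDENTIFIER, TOKEN_NUMBER } TokenType;

typedef struct {
  TokenType type;
  const char* start; /* points into the source text */
  int length;        /* number of characters in the lexeme */
  int line;
} Token;

/* Hypothetical helper: copies the lexeme into buf as a C string.
   Returns what snprintf returns (the lexeme length if it fits). */
static int lexemeToString(const Token* token, char* buf, size_t bufSize) {
  assert(bufSize > 0);
  /* "%.*s" prints at most token->length characters from token->start. */
  return snprintf(buf, bufSize, "%.*s", token->length, token->start);
}

/* Hypothetical helper: true if the lexeme is exactly the given text. */
static int lexemeEquals(const Token* token, const char* text) {
  size_t n = strlen(text);
  return (size_t)token->length == n && memcmp(token->start, text, n) == 0;
}
```

for the source "var answer = 42;", a token with start at offset 4 and length 6 covers "answer" without copying a single character out of the source.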
with the scanner and token structs in place, the compiler drives all the changes to these objects: it scans only as much of the source code as it needs, constructing one token at a time, and emits bytecode as it goes
ObjFunction* compile(const char* source) {
  initScanner(source);
  Compiler compiler;
  initCompiler(&compiler, TYPE_SCRIPT);

  parser.hadError = false;
  parser.panicMode = false;

  int line = -1;  // not used anywhere in this function as shown
  advance();      // prime the parser with the first token

  // keep compiling declarations until the scanner reports end of file
  while (!match(TOKEN_EOF)) {
    declaration();
  }

  ObjFunction* function = endCompiler();
  return parser.hadError ? NULL : function;
}
calls to advance and declaration will both eventually call out to scanToken, which makes use of the scanner to read and construct the next token. for example, if the token is a number, the compiler emits two bytes of bytecode via a call to emitConstant(NUMBER_VAL(value)): in the book's version, that's the OP_CONSTANT opcode followed by the index of the value in the chunk's constant table.
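here's a rough sketch of why a number constant ends up as two bytes. it's a simplification and makes several assumptions: MiniChunk, miniEmitByte and miniEmitConstant are made-up names, the arrays are fixed-size instead of growable, values are plain doubles rather than the vm's Value type, and the opcode number is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_CONSTANT = 0 }; /* opcode value chosen arbitrarily for the sketch */

/* Hypothetical fixed-size chunk: a bytecode array plus a constant table. */
typedef struct {
  uint8_t code[256];
  int count;
  double constants[256];
  int constantCount;
} MiniChunk;

static void miniEmitByte(MiniChunk* chunk, uint8_t byte) {
  assert(chunk->count < 256);
  chunk->code[chunk->count++] = byte;
}

/* Stores the value in the constant table, then emits two bytes:
   the opcode and the one-byte index of the stored value. */
static void miniEmitConstant(MiniChunk* chunk, double value) {
  assert(chunk->constantCount < 256); /* the index must fit in one byte */
  int index = chunk->constantCount;
  chunk->constants[chunk->constantCount++] = value;
  miniEmitByte(chunk, OP_CONSTANT);
  miniEmitByte(chunk, (uint8_t)index);
}
```

the value itself never appears in the bytecode stream. only its index does, which is what keeps the instruction at a fixed two bytes no matter how large the constant is.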
the entire bytecode sequence is built this way: the compiler drives the scanner forward and emits bytecode on the fly.
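the single-pass shape can be sketched with a toy "compiler" that keeps only a two-token window (previous and current) and emits bytes as soon as each token is consumed. everything here is invented for illustration: Word, MiniParser, scanWord, miniAdvance, miniCompile and OP_PUSH are hypothetical, the input is just whitespace-separated integers from 0 to 255, and the value is written straight into the byte stream instead of going through a constant table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

enum { OP_PUSH = 1 }; /* made-up opcode for the sketch */

typedef struct { const char* start; int length; } Word;

typedef struct {
  const char* cursor; /* scanner position in the source */
  Word current;       /* one token of lookahead */
  Word previous;      /* token that was just consumed */
} MiniParser;

/* Scans one whitespace-separated word; a length of 0 means end of input. */
static Word scanWord(const char** cursor) {
  const char* c = *cursor;
  while (isspace((unsigned char)*c)) c++;
  const char* start = c;
  while (*c != '\0' && !isspace((unsigned char)*c)) c++;
  *cursor = c;
  Word w = { start, (int)(c - start) };
  return w;
}

/* Slides the two-token window forward by one token. */
static void miniAdvance(MiniParser* p) {
  p->previous = p->current;
  p->current = scanWord(&p->cursor);
}

/* Emits [OP_PUSH, value] per number. Returns bytes written, or -1 if
   the output buffer is too small. No token list is ever built. */
static int miniCompile(const char* source, unsigned char* out, int capacity) {
  MiniParser p = { source, { source, 0 }, { source, 0 } };
  int count = 0;
  miniAdvance(&p); /* prime the lookahead */
  while (p.current.length > 0) {
    miniAdvance(&p); /* consume: previous is now the token to compile */
    if (count + 2 > capacity) return -1;
    out[count++] = OP_PUSH;
    out[count++] = (unsigned char)strtol(p.previous.start, NULL, 10);
  }
  return count;
}
```

at no point does the parser hold more than two tokens, and each one is only a pointer and a length into the original source string.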