Rusty compiler: Writing a lexer in Rust
In this post we will implement a lexer in Rust. A lexer is a tool used in compiler design to read characters from the input and group them into tokens. These tokens are then passed forward to a parser, which will be the topic of the second part of this series.
Our lexer will be able to group characters into 15 different tokens. Let us enumerate these tokens in `lexer.rs`:
- `Num` corresponds to a number and has type `f64`. In our implementation all numbers are represented as 64-bit floating point numbers.
- `True` and `False` correspond to the boolean values `true` and `false`. They are given `String` values `"true"` and `"false"` respectively.
- `If`, `Else` and `While` correspond to control flow keywords. They are given `String` values `"if"`, `"else"` and `"while"` respectively.
- `And` and `Or` correspond to logical operators. They are given `String` values `"&&"` and `"||"` respectively.
- `Eql`, `Ne`, `Le`, `Ge`, `Lt` and `Gt` correspond to relational operators. They are given `String` values `"=="`, `"!="`, `"<="`, `">="`, `"<"` and `">"` respectively.
- `Id` corresponds to a sequence of characters beginning with an alphabetic character that did not classify as any other token above.
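Based on the descriptions above, a sketch of the `Token` enum might look like this (the derived traits are an assumption, added so tokens can be compared and printed):

```rust
// Sketch of the Token enum: Num carries an f64, every other
// variant carries its String spelling.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Num(f64),
    True(String),
    False(String),
    If(String),
    Else(String),
    While(String),
    And(String),
    Or(String),
    Eql(String),
    Ne(String),
    Le(String),
    Ge(String),
    Lt(String),
    Gt(String),
    Id(String),
}
```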
Now that we have our `Token` enum we can move on to implementing the actual `lex` function that does all the heavy lifting. The function receives a reference to the input as a parameter and returns a vector of tokens if there are no errors. In `lexer.rs` we first write:
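A sketch of this setup might look like the following (the `Result<Vec<Token>, String>` return type is an assumption; the post only says tokens are returned on success, and the `Token` enum is abbreviated here so the snippet compiles on its own):

```rust
use std::collections::HashMap;

// Abbreviated Token enum for this sketch; the full version has all 15 variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Num(f64),
    Id(String),
}

pub fn lex(input: &str) -> Result<Vec<Token>, String> {
    let mut result: Vec<Token> = Vec::new();                // tokens found so far
    let mut words: HashMap<String, Token> = HashMap::new(); // words already assigned a token
    let mut it = input.chars().peekable();                  // peekable character iterator
    let mut _lineno = 1;                                    // current line, for debugging

    // The while loop described below consumes characters here;
    // the match arms are filled in over the following sections.
    while let Some(&_c) = it.peek() {
        it.next();
    }

    Ok(result)
}
```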
In the snippet above we create a vector of type `Vec<Token>` for the tokens to be saved in. We also create a hashmap for words that have already been assigned a token. To process the input we create a peekable iterator `it`. To help with debugging we also keep track of the line number in the variable `_lineno`.
Now on to analyzing the input. Using a `while` loop, a `match` expression and the iterator `it` we can read the input character by character. Let us first make the `lex` function ignore spaces `' '` and tabs `'\t'`, and increase the line number whenever it comes across a newline `'\n'`. In `lexer.rs`:
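A sketch of those first match arms, wrapped in a small helper so it runs on its own (returning the line count is just for demonstration):

```rust
// Sketch: the whitespace arms of the match. Spaces and tabs are
// consumed silently; a newline also bumps the line counter.
pub fn count_lines(input: &str) -> usize {
    let mut _lineno = 1;
    let mut it = input.chars().peekable();
    while let Some(&c) = it.peek() {
        match c {
            ' ' | '\t' => { it.next(); }          // ignore spaces and tabs
            '\n' => { _lineno += 1; it.next(); }  // newline: a new line begins
            _ => { it.next(); }                   // everything else, handled later
        }
    }
    _lineno
}
```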
Next up are the logical operators. If a character matches `'&'`, we move on to the next character. If that character also matches `'&'`, we save an `And` token with the value `"&&"` to the `result` vector. If it does not match `'&'`, we save the previous character (`'&'`) as an `Id` token and keep the current character for the next iteration of the `while` loop. The same applies to the `Or` token. In `lexer.rs`:
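A self-contained sketch of these two arms (isolated into a helper function here; in the real lexer they sit inside the main `match`):

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    And(String),
    Or(String),
    Id(String),
}

// Sketch of the '&' and '|' arms. If the peeked character completes the
// operator we emit And/Or; otherwise the lone character becomes an Id and
// the peeked character is left for the next iteration of the loop.
pub fn lex_logical(input: &str) -> Vec<Token> {
    let mut result = Vec::new();
    let mut it = input.chars().peekable();
    while let Some(c) = it.next() {
        match c {
            '&' => {
                if it.peek() == Some(&'&') {
                    it.next();
                    result.push(Token::And("&&".to_string()));
                } else {
                    result.push(Token::Id("&".to_string()));
                }
            }
            '|' => {
                if it.peek() == Some(&'|') {
                    it.next();
                    result.push(Token::Or("||".to_string()));
                } else {
                    result.push(Token::Id("|".to_string()));
                }
            }
            _ => {} // other characters ignored in this sketch
        }
    }
    result
}
```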
The relational operators are quite similar to the logical operators as far as our lexer is concerned. First we match the first character; if the second character also matches the token’s second character, the token is saved with the corresponding value. Otherwise the first character is saved as an `Id`, `Lt` or `Gt` token and we move on, keeping track of the second character. In `lexer.rs`:
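These arms might be sketched as follows (again pulled into a standalone helper; whitespace handling is omitted apart from a catch-all):

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    Eql(String),
    Ne(String),
    Le(String),
    Ge(String),
    Lt(String),
    Gt(String),
    Id(String),
}

// Sketch of the relational-operator arms. A two-character match emits the
// combined token; otherwise '<' and '>' stand alone as Lt/Gt, while a lone
// '=' or '!' falls back to an Id token.
pub fn lex_relational(input: &str) -> Vec<Token> {
    let mut result = Vec::new();
    let mut it = input.chars().peekable();
    while let Some(c) = it.next() {
        match c {
            '=' => {
                if it.peek() == Some(&'=') { it.next(); result.push(Token::Eql("==".to_string())); }
                else { result.push(Token::Id("=".to_string())); }
            }
            '!' => {
                if it.peek() == Some(&'=') { it.next(); result.push(Token::Ne("!=".to_string())); }
                else { result.push(Token::Id("!".to_string())); }
            }
            '<' => {
                if it.peek() == Some(&'=') { it.next(); result.push(Token::Le("<=".to_string())); }
                else { result.push(Token::Lt("<".to_string())); }
            }
            '>' => {
                if it.peek() == Some(&'=') { it.next(); result.push(Token::Ge(">=".to_string())); }
                else { result.push(Token::Gt(">".to_string())); }
            }
            _ => {} // whitespace and other characters ignored in this sketch
        }
    }
    result
}
```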
Things get a bit more complicated when it comes to numbers. If the character matches any digit between 0 and 9, it is turned into a floating point number and saved as a `Num` token. Whether the number has decimals is checked by inspecting whether the next character is a `'.'` followed by more digits. In `lexer.rs`:
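A sketch of the digit arm follows. Cloning the iterator to look two characters ahead (past the `'.'`) is an assumption on my part; the original code may have peeked differently:

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    Num(f64),
}

// Sketch of the digit arm: consecutive digits are collected, and a '.'
// counts as part of the number only when more digits follow it.
pub fn lex_numbers(input: &str) -> Vec<Token> {
    let mut result = Vec::new();
    let mut it = input.chars().peekable();
    while let Some(&c) = it.peek() {
        if c.is_ascii_digit() {
            let mut number = String::new();
            while let Some(&d) = it.peek() {
                if d.is_ascii_digit() { number.push(d); it.next(); } else { break; }
            }
            if it.peek() == Some(&'.') {
                // look one character past the '.' by cloning the iterator
                let mut ahead = it.clone();
                ahead.next();
                if ahead.peek().map_or(false, |d| d.is_ascii_digit()) {
                    number.push('.');
                    it.next(); // consume the '.'
                    while let Some(&d) = it.peek() {
                        if d.is_ascii_digit() { number.push(d); it.next(); } else { break; }
                    }
                }
            }
            result.push(Token::Num(number.parse::<f64>().unwrap()));
        } else {
            it.next(); // other characters ignored in this sketch
        }
    }
    result
}
```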
Next up we have words. If the character is alphabetic, we advance the iterator and group the characters into a word until the iterator points to a character that is neither alphabetic nor a digit. We then check whether a token with the word as its value already exists in the `words` hashmap. If it does not, we insert an `Id` token with the word as its value into the hashmap and push it to the `result` vector. Otherwise we push the token returned from `words` to the `result` vector. In `lexer.rs`:
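A sketch of the word arm, with `words` preloaded with two keyword tokens for demonstration (which keywords are preloaded, and the use of the `entry` API, are assumptions here):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    True(String),
    If(String),
    Id(String),
}

// Sketch of the word arm. Known words come back from the hashmap; a new
// word is stored as an Id so later occurrences reuse the same token.
pub fn lex_words(input: &str) -> Vec<Token> {
    let mut result = Vec::new();
    let mut words: HashMap<String, Token> = HashMap::new();
    words.insert("true".to_string(), Token::True("true".to_string()));
    words.insert("if".to_string(), Token::If("if".to_string()));

    let mut it = input.chars().peekable();
    while let Some(&c) = it.peek() {
        if c.is_alphabetic() {
            let mut word = String::new();
            // collect characters while they are alphabetic or digits
            while let Some(&d) = it.peek() {
                if d.is_alphanumeric() { word.push(d); it.next(); } else { break; }
            }
            // reuse the existing token, or insert a fresh Id
            let token = words
                .entry(word.clone())
                .or_insert(Token::Id(word.clone()))
                .clone();
            result.push(token);
        } else {
            it.next(); // other characters ignored in this sketch
        }
    }
    result
}
```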
Lastly we have the characters that did not match any of the conditions above. In these cases an `Id` token with the character as its value is saved to the `result` vector. In `lexer.rs`:
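This catch-all arm might be sketched as:

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    Id(String),
}

// Sketch of the catch-all arm: any character with no rule of its own
// becomes an Id token holding that single character.
pub fn lex_other(input: &str) -> Vec<Token> {
    let mut result = Vec::new();
    for c in input.chars() {
        match c {
            ' ' | '\t' | '\n' => {} // handled earlier in the real lexer
            other => result.push(Token::Id(other.to_string())),
        }
    }
    result
}
```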
We finalize the code by writing a couple of tests. In `lexer.rs`:
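Such tests might take the following shape. A minimal stand-in `lex` (numbers and single-character ids only) is included here so the example runs on its own; the real tests would of course exercise the full lexer:

```rust
#[derive(Debug, PartialEq)]
pub enum Token {
    Num(f64),
    Id(String),
}

// Minimal stand-in lexer so the test module below is self-contained.
pub fn lex(input: &str) -> Result<Vec<Token>, String> {
    let mut result = Vec::new();
    let mut it = input.chars().peekable();
    while let Some(&c) = it.peek() {
        match c {
            '0'..='9' => {
                let mut n = String::new();
                while let Some(&d) = it.peek() {
                    if d.is_ascii_digit() { n.push(d); it.next(); } else { break; }
                }
                result.push(Token::Num(n.parse().unwrap()));
            }
            ' ' => { it.next(); }
            _ => { result.push(Token::Id(c.to_string())); it.next(); }
        }
    }
    Ok(result)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lexes_numbers() {
        assert_eq!(lex("1 23").unwrap(), vec![Token::Num(1.0), Token::Num(23.0)]);
    }

    #[test]
    fn lexes_ids() {
        assert_eq!(lex("+").unwrap(), vec![Token::Id("+".to_string())]);
    }
}
```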
We now have a working lexer. The next step is to implement a parser that analyzes the tokens returned by the lexer. Stay tuned.