Kinanee Samson

Building A Tokenizer in Typescript

A tokenizer, also known as a lexical analyzer or lexer, is the first phase of a compiler. At this stage the source code is scanned and broken down into a sequence of tokens. The tokens produced by the tokenizer are defined by the syntax of the programming language being processed, and they represent the basic building blocks of the language, such as keywords, identifiers, operators, and literals.

The primary job of the tokenizer is to strip superfluous information from the source code and transform it into a format that is simpler for the compiler's later stages to process. Tokenization is typically accomplished with regular expressions, which specify the patterns for identifying the various kinds of tokens in the source code. The tokenizer may also handle whitespace and comments, detect and report errors, and build symbol tables for the parser.

In this article, we are going to build a tokenizer function using Typescript! Our tokenizer will tokenize a string of simple arithmetic operations like addition, subtraction, division and multiplication, which could form part of a calculator program. Sample inputs would be "2+2" or "2 * (2 + 4)".

// Regular expressions that recognize each kind of token in an arithmetic expression
const tokenRef = {
  number: /\d+/,
  operand: /[\+\-\*\/]/,
  openBracket: /\(/,
  closeBracket: /\)/,
} as const;

We create an object that stores a group of regular expressions for tokenizing a mathematical expression. The expression has different token kinds, including numbers, operators, and parentheses, and these regular expressions will be used to identify the specific tokens that match our language definition.

// Maps each tokenRef key to the name of the token type it produces
const types = {
  number: 'Number',
  operand: 'Operand',
  openBracket: 'OpenBracket',
  closeBracket: 'CloseBracket',
} as const;

The types object provides more detail about the kinds of tokens the tokenizer can recognize. It maps the regular expressions in the tokenRef object to the specific token types they represent: the number regular expression maps to the Number token type, the operand regular expression maps to the Operand token type, and so on. Notice that the two objects share the same keys. Since we are writing Typescript, we'll create a utility function that returns a strongly typed array of the keys of any object we pass to it.

// Returns the keys of an object as a strongly typed array instead of string[]
const objKeys = function <T extends object>(obj: T): (keyof T)[] {
  return Object.keys(obj) as (keyof T)[];
};


We want the returned array to be strongly typed, with the type of its items inferred from the keys of the object we pass in.

const ALLOWED_TOKENS = objKeys(tokenRef);

const ALLOWED_TYPES = [
  types.number,
  types.operand,
  types.closeBracket,
  types.openBracket,
] as const;

type token = {
  token: string;
  type: typeof ALLOWED_TYPES[number];
};

The token type describes a single token object, while the ALLOWED_TYPES constant stores the array of token types that the tokenizer will recognize. This gives the tokenizer a clear set of rules to follow, allowing it to recognize only tokens of the specified types.
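To see what this buys us, here is a quick illustration; the bad constant is hypothetical and exists only to show the compile-time error:

const ok: token = { token: '+', type: 'Operand' }; // compiles: 'Operand' is one of the ALLOWED_TYPES
// const bad: token = { token: '+', type: 'Plus' }; // error: 'Plus' is not an allowed token type

With these types in place, let's write the function that will produce the tokens.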

const generateToken = function* (str: string): Generator<token> {
  const charArr = Array.from(str);
  let cursor: number = 0;
  const tokenKeys = objKeys(tokenRef);
  while (cursor !== charArr.length) {
    // skip whitespace so inputs like "2 * (2 + 4)" don't stall the loop
    if (/\s/.test(charArr[cursor])) {
      cursor++;
      continue;
    }
    for (let i = 0; i < tokenKeys.length; i++) {
      if (tokenRef[tokenKeys[i]].test(charArr[cursor])) {
        // yield the matching character together with its token type
        yield { token: charArr[cursor], type: types[tokenKeys[i]] };
        cursor++;
        // stop checking the remaining patterns once this character is matched
        break;
      }
    }
  }
};

Our generateToken function above takes a string as an argument and returns a generator that yields the tokens of the string. The tokenKeys array contains the keys of the tokenRef object, and the types object maps each regular expression to the token type it represents. The function walks through the characters of the string, skipping whitespace, and tests each character against the regular expressions; when a character matches one of them, it is yielded together with its associated token type and the cursor moves on. This allows the tokenizer to accurately tokenize a simple mathematical expression.
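Because generateToken is a generator, the tokens can also be consumed lazily, one at a time, with a for...of loop. A small sketch:

for (const t of generateToken('2 * (2 + 4)')) {
  console.log(t.type, t.token); // Number 2, Operand *, OpenBracket (, ...
}

Below, we instead spread the generator into an array to collect every token at once and log the result.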

const str = `2+2`;
let _tokens: token[] = [];
const tokenIterator = generateToken(str);
_tokens = [...tokenIterator];
console.log(_tokens);
/* [ 
  { token: '2', type: 'Number' },
  { token: '+', type: 'Operand' },
  { token: '2', type: 'Number' },
] */

We have successfully created a tokenizer function; however, there are some drawbacks associated with our code. We can improve its efficiency by checking for overlapping regular expressions and preferring the longest matching expression, so that, for example, a multi-digit number becomes a single token. We can also add error handling to the generateToken function so that it fails gracefully when an invalid token is encountered, and we can let it skip over tokens that have already been processed. These optimizations will make the tokenizer more accurate and efficient.
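As a rough preview of what that might look like, here is a minimal sketch, not the final implementation; anchoredRef and generateTokenV2 are names invented for this example, and it reuses the types object, the token type and the objKeys helper from above. The idea is to anchor each pattern to the start of the remaining input, take the longest match, and throw on unrecognized characters:

// Anchor each pattern to the start of the remaining input so exec
// returns the longest run it can (e.g. a whole multi-digit number).
const anchoredRef = {
  number: /^\d+/,
  operand: /^[\+\-\*\/]/,
  openBracket: /^\(/,
  closeBracket: /^\)/,
} as const;

const generateTokenV2 = function* (str: string): Generator<token> {
  let cursor = 0;
  while (cursor < str.length) {
    if (/\s/.test(str[cursor])) {
      cursor++; // ignore whitespace between tokens
      continue;
    }
    let matched = false;
    for (const key of objKeys(anchoredRef)) {
      const match = anchoredRef[key].exec(str.slice(cursor));
      if (match) {
        yield { token: match[0], type: types[key] };
        cursor += match[0].length; // jump past everything we just consumed
        matched = true;
        break;
      }
    }
    if (!matched) {
      // fail gracefully instead of looping forever on invalid input
      throw new Error(`Unexpected character "${str[cursor]}" at position ${cursor}`);
    }
  }
};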

See you in the next article, where we'll optimize the tokenizer and move on to the next stage of compilation.
