Andrey Frolov

Posted on Apr 16, 2022

High Level Overview of Compiler Design

#computerscience #parsers #compilers #engineering

Intro

So you have 10 years’ experience with Senior JavaScript web3.0 steampunk software, you’re an entrepreneur, consulting star, and an engineer… but do you ever consider how your computer "reads", "interprets" and "executes" your program? Here's my attempt to make this crucial part of CS easy to understand.

I realize that I'm simplifying a bit and possibly making some inappropriate comparisons, but I'm just trying to give a mental model to myself and other people. Now the disclaimers are done, we can start digging into the rabbit hole, take my hand and let's fly!

High level overview

So, time for a very high level overview of compilers:

High level language — it can be a JavaScript, Python, C++, the name of your lovely language, etc. So, in a nutshell, it's just a string.

Compiler — based on the previous step, you already have in mind the idea that the compiler input is a string, and the result is assembly language. So, we can think about compilers as a simple function that just takes a string and produces another string.



Assembly code string = compiler(high level language string)

The last question is — what is assembly code?

Assembler — is like a manual for different operating systems; in other words, it's a set of instructions for a machine with some specific architecture (e.g. x86 architecture)

And here is our last move:

Machine code — actually just zeroes and ones

But you can mention that I place in the last block two pieces — Loader/Linker.

Just for your information, let's cover them a little bit.

A linker — is a program that combines the object files, generated by the compiler/assembler, and other bits of code to originate an executable file with an .exe extension. In the object file, the linker searches and append all libraries needed for the execution of files. It regulates the memory space that will hold the code from each module.

And the loader is a special program that takes an input of executable files from the linker, loads it to the main memory, and prepares this code to be executed by the computer. The loader allocates memory space to the program. Even it settles down symbolic reference between objects. It is in charge of loading programs and libraries in the operating system. The embedded computer systems don’t have loaders.

(I copied these two definitions from this site)

Deeper into the rabbit hole

So let's take another step and dig into compiler parts, which are much more interesting for us right now.

And another one image to explore:

Lexical analysis — removes whitespaces, useless symbols, and returns stream of tokens. For example:

Input string



SELECT * FROM "docs";

You can think about the stream as an iterator that you can pull for the next token. Here's some pseudocode:



Lexer.next() -> { kind: Keyword, type: "SELECT" }
Lexer.next() -> { kind: Asterisk  }
Lexer.next() -> { kind: Keyword, type: "FROM" }
Lexer.next() -> { kind: Char, value: "\"" }
Lexer.next() -> { kind: String, value: "docs" }
Lexer.next() -> { kind: Char, value: "\"" }

So the lexer represents valid characters from our language or alphabet.

To understand better, let's start parsing a simple string:



x = a + b * c;

So as humans, we can easily parse and understand such statements, but it's not so easy for the computer. We've already covered the purpose of the parser, and let's parse the above statement. Let's try to simplify this:



id = id + id * id

Where's id is an identifier, and if we try to represent it as an array, it can look something like this:



[
  { type: 'id', value: x },
  { type: 'symbol', value: '=' },
  { type: 'id', value: 'a' },
  { type: 'symbol', value: '+' }, 
  { type: 'id', value: 'b' },
  { type: 'symbol', value: '*' },
  { type: 'id', value: 'c' },
  { type: 'symbol', value: ';' },
]

Parser — uses context-free grammar (we will cover this definition in the next part) to create a parse tree.

There's no to be nervous, we're going to introduce a lot of complex definitions and terms, and if you find you're getting a bit lost, it's ok. We will cover everything related to parsing in the next chapters.

So, based on language grammar parser creates parse tree, which helps us find the right execution order of statement.

Let's take a look at an example of some grammar in BNF notation.



S ⭢ id = E
E ⭢ E + T
  |  T
T ⭢ T * F
  | F
F ⭢ id

Based on this grammar, we can build a parse tree:

Semantic analyzer — validates the parse tree to find grammar errors.

Intermediate code generator — produces three-address code

For example, it can look something like this:



t1 = b * c;
t2 = a + t1;
x = t2;

Code optimizer (CO) — optimizes three-address code.



t1 = b * c;
x = a + t1;

Target code generator — creates a code that the assembler can understand.

Some random pseudo architecture:



mul R1, R2
add R0, R2
mov R2, x

where a = R0, b = R1, c = R2

Feel free to ask questions, express any opinions or concerns, or discuss your point of view. Share, subscribe, and make code, not war. ❤️

If you'll find an error, I'll be glad to fix it or learn from you - just let me know.

You can DM me on Twitter or connect on LinkedIn. I'm always open to new connections, people, and opportunities.

Top comments (2)

Max Korolev • Apr 18 '22

Woah!
Thank you, it was awesome

Andrey Frolov • Apr 18 '22

hooray

DEV Community

High Level Overview of Compiler Design

Intro

High level overview

Deeper into the rabbit hole

Top comments (2)

Read next

You shouldn't use AI for programming

Reverse engineering TP-Link Tapo's REST API - part 1

DSA: Tree - questions

A Preview of Dragonfly Cluster