DEV Community

Andrey Frolov
Andrey Frolov

Posted on

High Level Overview of Compiler Design

Intro

So you have 10 years’ experience with Senior JavaScript web3.0 steampunk software, you’re an entrepreneur, consulting star, and an engineer… but do you ever consider how your computer "reads", "interprets" and "executes" your program? Here's my attempt to make this crucial part of CS easy to understand.

I realize that I'm simplifying a bit and possibly making some inappropriate comparisons, but I'm just trying to give a mental model to myself and other people. Now the disclaimers are done, we can start digging into the rabbit hole, take my hand and let's fly!

High level overview

So, time for a very high level overview of compilers:

Image description

High level language — it can be a JavaScript, Python, C++, the name of your lovely language, etc. So, in a nutshell, it's just a string.

Compiler — based on the previous step, you already have in mind the idea that the compiler input is a string, and the result is assembly language. So, we can think about compilers as a simple function that just takes a string and produces another string.



Assembly code string = compiler(high level language string)


Enter fullscreen mode Exit fullscreen mode

The last question is — what is assembly code?

Assembler — is like a manual for different operating systems; in other words, it's a set of instructions for a machine with some specific architecture (e.g. x86 architecture)

And here is our last move:

Machine code — actually just zeroes and ones

But you can mention that I place in the last block two pieces — Loader/Linker.

Just for your information, let's cover them a little bit.

A linker — is a program that combines the object files, generated by the compiler/assembler, and other bits of code to originate an executable file with an .exe extension. In the object file, the linker searches and append all libraries needed for the execution of files. It regulates the memory space that will hold the code from each module.

And the loader is a special program that takes an input of executable files from the linker, loads it to the main memory, and prepares this code to be executed by the computer. The loader allocates memory space to the program. Even it settles down symbolic reference between objects. It is in charge of loading programs and libraries in the operating system. The embedded computer systems don’t have loaders.

(I copied these two definitions from this site)

Deeper into the rabbit hole

So let's take another step and dig into compiler parts, which are much more interesting for us right now.

And another one image to explore:

Image description

Lexical analysis — removes whitespaces, useless symbols, and returns stream of tokens. For example:

Input string



SELECT * FROM "docs";


Enter fullscreen mode Exit fullscreen mode

You can think about the stream as an iterator that you can pull for the next token. Here's some pseudocode:



Lexer.next() -> { kind: Keyword, type: "SELECT" }
Lexer.next() -> { kind: Asterisk  }
Lexer.next() -> { kind: Keyword, type: "FROM" }
Lexer.next() -> { kind: Char, value: "\"" }
Lexer.next() -> { kind: String, value: "docs" }
Lexer.next() -> { kind: Char, value: "\"" }


Enter fullscreen mode Exit fullscreen mode

So the lexer represents valid characters from our language or alphabet.


To understand better, let's start parsing a simple string:



x = a + b * c;


Enter fullscreen mode Exit fullscreen mode

So as humans, we can easily parse and understand such statements, but it's not so easy for the computer. We've already covered the purpose of the parser, and let's parse the above statement. Let's try to simplify this:



id = id + id * id


Enter fullscreen mode Exit fullscreen mode

Where's id is an identifier, and if we try to represent it as an array, it can look something like this:



[
  { type: 'id', value: x },
  { type: 'symbol', value: '=' },
  { type: 'id', value: 'a' },
  { type: 'symbol', value: '+' }, 
  { type: 'id', value: 'b' },
  { type: 'symbol', value: '*' },
  { type: 'id', value: 'c' },
  { type: 'symbol', value: ';' },
]


Enter fullscreen mode Exit fullscreen mode

Parser — uses context-free grammar (we will cover this definition in the next part) to create a parse tree.

There's no to be nervous, we're going to introduce a lot of complex definitions and terms, and if you find you're getting a bit lost, it's ok. We will cover everything related to parsing in the next chapters.

So, based on language grammar parser creates parse tree, which helps us find the right execution order of statement.

Let's take a look at an example of some grammar in BNF notation.



S ⭢ id = E
E ⭢ E + T
  |  T
T ⭢ T * F
  | F
F ⭢ id



Enter fullscreen mode Exit fullscreen mode

Based on this grammar, we can build a parse tree:

Image description

Semantic analyzer — validates the parse tree to find grammar errors.

Intermediate code generator — produces three-address code

For example, it can look something like this:



t1 = b * c;
t2 = a + t1;
x = t2;


Enter fullscreen mode Exit fullscreen mode

Code optimizer (CO) — optimizes three-address code.



t1 = b * c;
x = a + t1;


Enter fullscreen mode Exit fullscreen mode

Target code generator — creates a code that the assembler can understand.

Some random pseudo architecture:



mul R1, R2
add R0, R2
mov R2, x


Enter fullscreen mode Exit fullscreen mode

where a = R0, b = R1, c = R2


Feel free to ask questions, express any opinions or concerns, or discuss your point of view. Share, subscribe, and make code, not war. ❤️

If you'll find an error, I'll be glad to fix it or learn from you - just let me know.

You can DM me on Twitter or connect on LinkedIn. I'm always open to new connections, people, and opportunities.

Top comments (2)

Collapse
 
nrjshka profile image
Max Korolev

Woah!
Thank you, it was awesome

Collapse
 
frolovdev profile image
Andrey Frolov

hooray