Akira

Posted on Nov 20, 2024

Creating a Toy Solidity compiler and running it in a Toy EVM

#solidity #evm #ethereum #rust

I wanted to try making my own compiler and runtime environment, so I gave it a shot. It doesn’t cover all the syntax or opcodes yet, but I’m satisfied as long as it can run basic functionality. I’ll make some improvements if I feel like it.

https://github.com/akira-19/toythereum

Demo

contract TestContract {
    uint256 a = 1 + 2 * 3 + 4;

    function increment() returns (uint256) {
        a = a + 1;
        return a;
    }

    function compare(uint256 b) returns (bool) {
        return a < b;
    }
}

When you compile the code, you can get the following bytecode.

7f00000000000000000000000000000000000000000000000000000000000000807f0000000000000000000000000000000000000000000000000000000000000040527f00000000000000000000000000000000000000000000000000000000000000047f00000000000000000000000000000000000000000000000000000000000000037f00000000000000000000000000000000000000000000000000000000000000027f00000000000000000000000000000000000000000000000000000000000000017f000000000000000000000000000000000000000000000000000000000000000b7f0000000000000000000000000000000000000000000000000000000000000000557f000000000000000000000000000000000000000000000000000000000000032f807f00000000000000000000000000000000000000000000000000000000000001515f395ff37f00000000000000000000000000000000000000000000000000000000000000807f0000000000000000000000000000000000000000000000000000000000000040527f00000000000000000000000000000000000000000000000000000000000000e07f0000000000000000000000000000000000000000000000000000000000000000351c7f00000000000000000000000000000000000000000000000000000000d09de08a147f0000000000000000000000000000000000000000000000000000000000000196577f00000000000000000000000000000000000000000000000000000000000000e07f0000000000000000000000000000000000000000000000000000000000000000351c7f00000000000000000000000000000000000000000000000000000000c4b8c257147f0000000000000000000000000000000000000000000000000000000000000283577f00000000000000000000000000000000000000000000000000000000000000007f0000000000000000000000000000000000000000000000000000000000000000fd5b7f00000000000000000000000000000000000000000000000000000000000000017f000000000000000000000000000000000000000000000000000000000000000054017f0000000000000000000000000000000000000000000000000000000000000000557f0000000000000000000000000000000000000000000000000000000000000000547f0000000000000000000000000000000000000000000000000000000000000080527f00000000000000000000000000000000000000000000000000000000000000207f0000000000000000000000000000000000000000000000000000000000000080f35b7f0000000000000000000000000000000000000000000000000000000000000004357f000000000000000000000000000000000000000000000000000000000000000054107f0000000000000000000000000000000000000000000000000000000000000080527f00000000000000000000000000000000000000000000000000000000000000207f0000000000000000000000000000000000000000000000000000000000000080f3

demo is here. (I don't know why, but I can embed gif file on the article)
https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8l5vmjmwwkp2iq85s7gd.gif

In the demo above, the following actions are performed:

Run the EVM.
Deploy the compiled bytecode (returns the contract code).
Execute compare with argument b as 12 (the initial value of a is 11, so it returns true, which is 1).
Execute increment (returns 12 in hexadecimal).
Execute compare again with argument b as 12 (since 12 == 12, it returns false, which is 0).

BTW, if you make a mistake in the return type as follows:

contract TestContract {
    uint256 a = 1 + 2 * 3 + 4;

    function compare(uint256 b) returns (uint256) {
        return a < b; // Compilation error occurs here because the return type is mismatched
    }
}

The error message looks like:

EVM

The EVM is a stack-based virtual machine.

Memory

Pre-allocated Memory

While inspecting compiled bytecode, I noticed that mstore often takes 0x40, so I looked into it.

Depending on the version, this might differ, but here is the reference I checked:

https://docs.soliditylang.org/en/v0.8.27/internals/layout_in_memory.html

The first 128 bytes of memory have specific purposes:

0x00 - 0x3f (64 bytes): Scratch space for hashing methods.
0x40 - 0x5f (32 bytes): Current memory size (referred to as the free memory pointer).
0x60 - 0x7f (32 bytes): Zero slot.
The scratch space is reserved for temporary use in hashing methods. If the allocated space is insufficient, it will use free memory.
The free memory pointer specifies the next available memory address. This means that the first mstore operation must store data at 0x80 since the first 128 bytes (up to 0x7f) are already reserved. The typical first instruction in a contract is to mstore 0x80 at address 0x40.
The zero slot is used as the default value for dynamic arrays. It remains filled with 0x00 and can be directly copied for initialization purposes.

Memory Allocation

Memory is allocated in 32-byte chunks.

Since the first 128 bytes are pre-allocated, the first writable memory address is 0x80.

Memory expands dynamically when needed. For example, if a 30-byte string is replaced with a 40-byte string, memory will automatically expand from 0x80-0x9f to 0x80-0xbf.

However, memory expansion incurs additional gas costs, so it should be managed carefully. Also, memory does not get freed automatically, even when it is no longer in use.

https://learnevm.com/chapters/evm/memory#memory-expansion

If the next memory space is already occupied, new memory is allocated at the end of the current memory.

https://medium.com/@PaulElisha1/layout-of-storage-and-memory-management-in-low-level-solidity-feb41c3de140

Storage

Slots

Unlike memory, storage is pre-allocated with 2^256-1 slots, each 32 bytes in size.

For fixed-size data types of 32 bytes or less, data is stored sequentially. For dynamically sized types, the following rules apply:

The slot stores the length of the data.
The actual data is stored in the slot calculated by keccak256(slot).

For example, if slot 3 is used:

Slot (3) stores the data length.
The actual data is stored at keccak256(3), e.g., 55b2c6.... If the data exceeds 32 bytes, subsequent slots after 55b2c6... are used. Arrays follow a similar pattern.

My implementation works only for strings of 32 bytes or less at the moment.

https://docs.alchemy.com/docs/smart-contract-storage-layout

For multidimensional arrays or mappings, slot numbers and key values are factored into the calculations.

https://docs.soliditylang.org/en/latest/internals/layout_in_storage.html

Currently, the implementation maps slots and values using Rust's HashMap, but in practice, a Merkle Patricia Trie is used.

Function Definitions

Functions are identified using a function selector, derived by taking the first 4 bytes of the keccak256 hash of the function signature.

https://docs.soliditylang.org/en/latest/abi-spec.html

For example, for a function like function mint(uint256),

keccak256(mint(uint256)) = 0xa0712d680358d64f694230b7cc0b277c73686bdf768385d55cd7c547d0ffd30e.

The function selector is 0xa0712d68.

When a transaction is sent, the first 4 bytes of the calldata contain the function selector. The contract compares this selector to the stored selector and jumps to the corresponding function if they match.

If you compile a simple contract in Remix and extract the opcodes for the function call, you will see something like this:

- CALLDATALOAD
- PUSH1 0xe0
- SHR
- DUP1
- PUSH4 0x6057361d
- EQ
- PUSH1 0x2a
- JUMPI

Explanation:

CALLDATALOAD, PUSH1 0xe0, and SHR: Extract the first 4 bytes of the calldata (the function selector).
PUSH4 0x6057361d, EQ: Compare the extracted function selector with the stored selector.
If they match, PUSH1 0x2a and JUMPI jump to the function (0x2a is the program counter in the bytecode).

Calldata

The first 4 bytes of the calldata represent the function selector. The arguments are stored in 32-byte chunks following the selector.

For simplicity, this assumes all arguments fit within 32 bytes, so the number of arguments multiplied by 32 bytes follows the function selector.

Solidity Compiler

Syntax

Unfortunately, there are no public or private keywords in my implementation at the moment.

In fact, there are no if statements or for loops either. Additionally, functions cannot call other functions within them.

Types

Not all types are supported at the moment. Currently, only string, uint256, and bool are available.

While string is defined, it only works if it is 256 bits or smaller.

Constant Folding

Constant folding is a process where constant expressions are precomputed at compile time instead of being evaluated at runtime on the EVM.

For example, in the statement uint256 a = 1 + 2;, instead of executing 1 + 2 on the EVM, the compiler calculates it during compilation and outputs uint256 a = 3;.

This is straightforward for simple cases, but it gets tricky when the expression includes unknown values, such as function arguments.

For instance, in uint256 a = 1 + arg + 2; (where arg is a function argument), the ideal approach would be to compute uint256 a = 3 + arg; at compile time and execute only 3 + arg on the EVM. However, due to complexity, if an unknown value is encountered, the entire expression is left for the EVM to calculate.

Deployment Code

In the EVM, there are two types of bytecode: deployment bytecode and deployed contract bytecode.

The deployed contract bytecode is included within the deployment bytecode and is eventually deployed using the codecopy opcode (technically, it is deployed at the point of RETURN).

When compiled with solc, the bytecode contains additional code before and after the contract code. In toythereum, the deployment process follows these steps:

Temporarily store all contract code without compiling functions until the entire contract code is scanned.
Compile non-function code (e.g., initializing storage variables).
Add deployment-specific instructions such as codecopy (use placeholders for contract code length as it is unknown at this stage).
Compile the functions (this becomes the contract code).
Calculate the length of the compiled functions and insert the length into the placeholders in the deployment instructions.

As a result, the final bytecode consists of two parts:

The first part handles initialization (e.g., storage setup) and deployment.
The second part contains the actual contract bytecode.

DEV Community

Creating a Toy Solidity compiler and running it in a Toy EVM

Demo

EVM

Memory

Pre-allocated Memory

Memory Allocation

Storage

Slots

Function Definitions

Calldata

Solidity Compiler

Syntax

Types

Constant Folding

Deployment Code

References

Top comments (0)

Read next

"Unlocking the Power of LLMs: Enhancing Indic Language Understanding"

E-commerce Category Schema: MongoDB, Translations, Breadcrumbs

Configure self-hosted runner for GitHub Actions

"Unlocking Stellar Secrets: Tidal Forces and Filament Formation in Space"