Introduction
Hi everybody! In this blog post, I document my experience exploring 6502 assembly language as part of my work for SPO600. Lab 1 is about writing code that fills a bitmap display, calculating its execution time, and optimizing parts of it. This is Part 1 of Lab 1.
6502 Emulator
To write and test my assembly programs, I started by accessing the 6502 Emulator.
📌 Initial Code
We will start off with this code. Essentially, it fills the entire bitmap display with one color. The Y register holds the index within the current page, and the X register holds the current page number so that we can compare it against an end value and stop the loop.
lda #$00 ; set a pointer in memory location $40 to point to $0200
sta $40 ; ... low byte ($00) goes in address $40
lda #$02
sta $41 ; ... high byte ($02) goes into address $41
lda #$07 ; colour number
ldy #$00 ; set index to 0
loop: sta ($40),y ; set pixel colour at the address (pointer)+Y
iny ; increment index
bne loop ; continue until done the page (256 pixels)
inc $41 ; increment the page
ldx $41 ; get the current page number
cpx #$06 ; compare with 6
bne loop ; continue until done all pages
There are two loops in the code. The inner loop fills one page (256 bytes) with the color. On its first run, it fills the entire $0200 page. Note that this address is split into two bytes ($00 and $02), which are stored in memory locations $40 and $41 to form the starting pointer.
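To make the pointer layout concrete, here is a minimal Python sketch of how a 16-bit address is split into low and high bytes and how an `($40),Y` operand resolves to a final address. The helper names `split_pointer` and `indirect_indexed` are made up for illustration; they are not part of the emulator.

```python
def split_pointer(address):
    """Split a 16-bit address into (low, high) bytes, the order the 6502 stores them."""
    return address & 0xFF, (address >> 8) & 0xFF


def indirect_indexed(memory, zp, y):
    """Resolve the effective address of an ($zp),Y operand."""
    base = memory[zp] | (memory[(zp + 1) & 0xFF] << 8)
    return (base + y) & 0xFFFF


# Model the full 64 KB address space as a flat list (an assumption of this sketch).
memory = [0] * 0x10000
memory[0x40], memory[0x41] = split_pointer(0x0200)

print(hex(indirect_indexed(memory, 0x40, 0x10)))  # prints 0x210
```

With Y = $10 the store lands at $0210, which is the 17th pixel of the first page.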
loop: sta ($40),y ; set pixel colour at the address (pointer)+Y
iny ; increment index
bne loop ; continue until done the page (256 pixels)
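After the inner loop finishes a page, the outer loop increments the high byte of the pointer and repeats until it reaches page $06 (that is, until the high byte in $41 equals the immediate value $06). Pages $02 through $05 get filled, which covers all 1,024 pixels.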
inc $41 ; increment the page
ldx $41 ; get the current page number
cpx #$06 ; compare with 6
bne loop ; continue until done all pages
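The two loops can be traced with a small Python model. This is only a sketch that mirrors the control flow above, not an emulator: `fill_display` is a hypothetical name, and memory is modelled as a flat 64 KB list.

```python
def fill_display(colour=0x07, start_page=0x02, end_page=0x06):
    memory = [0] * 0x10000
    memory[0x40], memory[0x41] = 0x00, start_page  # pointer low/high bytes
    y = 0
    stores = 0
    while True:
        # Inner loop: sta ($40),y / iny / bne loop
        while True:
            address = ((memory[0x41] << 8) | memory[0x40]) + y
            memory[address] = colour
            stores += 1
            y = (y + 1) & 0xFF  # INY wraps from $FF back to $00
            if y == 0:          # BNE falls through once Y wraps to zero
                break
        memory[0x41] = (memory[0x41] + 1) & 0xFF  # inc $41
        x = memory[0x41]                          # ldx $41
        if x == end_page:                         # cpx #$06 / bne loop
            break
    return memory, stores


memory, stores = fill_display()
print(stores)  # prints 1024
```

The inner loop runs 256 times per page and the outer loop runs 4 times, which is where the 1,024 stores come from.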
📌 Calculating Performance
Now, we will calculate how long the code takes to execute, assuming a 1 MHz clock speed. The 6502 Reference Sheet is used to count the number of cycles.
The table below details the calculation. The Alt Cycle and Alt Count columns cover the case where a branch is not taken, which costs 2 cycles instead of 3.
Instruction | Cycles | Count | Alt Cycle | Alt Count | Total Cycles |
---|---|---|---|---|---|
LDA #$00 | 2 | 1 | - | - | 2 |
STA $40 | 3 | 1 | - | - | 3 |
LDA #$02 | 2 | 1 | - | - | 2 |
STA $41 | 3 | 1 | - | - | 3 |
LDA #$07 | 2 | 1 | - | - | 2 |
LDY #$00 | 2 | 1 | - | - | 2 |
STA ($40),Y | 6 | 1,024 | - | - | 6,144 |
INY | 2 | 1,024 | - | - | 2,048 |
BNE loop (inner) | 3 | 1,020 | 2 | 4 | 3,068 |
INC $41 | 5 | 4 | - | - | 20 |
LDX $41 | 3 | 4 | - | - | 12 |
CPX #$06 | 2 | 4 | - | - | 8 |
BNE loop (outer) | 3 | 3 | 2 | 1 | 11 |
Total | | | | | 11,325 |
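The total can be double-checked with a few lines of Python. This is just a sketch that re-adds the rows of the table; the tuple layout and the name `total_cycles` are my own choices for illustration.

```python
# (instruction, cycles, count, alt_cycles, alt_count)
ROWS = [
    ("LDA #$00", 2, 1, 0, 0),
    ("STA $40", 3, 1, 0, 0),
    ("LDA #$02", 2, 1, 0, 0),
    ("STA $41", 3, 1, 0, 0),
    ("LDA #$07", 2, 1, 0, 0),
    ("LDY #$00", 2, 1, 0, 0),
    ("STA ($40),Y", 6, 1024, 0, 0),
    ("INY", 2, 1024, 0, 0),
    ("BNE loop (inner)", 3, 1020, 2, 4),  # taken 1,020 times, falls through 4 times
    ("INC $41", 5, 4, 0, 0),
    ("LDX $41", 3, 4, 0, 0),
    ("CPX #$06", 2, 4, 0, 0),
    ("BNE loop (outer)", 3, 3, 2, 1),     # taken 3 times, falls through once
]


def total_cycles(rows):
    """Sum cycles * count plus the alternate (branch not taken) contribution."""
    return sum(c * n + alt_c * alt_n for _, c, n, alt_c, alt_n in rows)


print(total_cycles(ROWS))  # prints 11325
```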
Execution Timing Details
Parameter | Value |
---|---|
Total Cycles | 11,325 cycles |
Clock Speed | 1 MHz |
Cycle Time | 1 µs per cycle |
Execution Time | |
   • Seconds (s) | 0.011325 s |
   • Milliseconds (ms) | 11.325 ms |
   • Microseconds (µs) | 11,325 µs |
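The conversion from cycles to time is a single division by the clock frequency. Here is a minimal sketch, assuming the 1 MHz clock stated above; `execution_time` is a hypothetical helper name.

```python
def execution_time(cycles, clock_hz=1_000_000):
    """Convert a cycle count into seconds, milliseconds, and microseconds."""
    return {
        "s": cycles / clock_hz,
        "ms": cycles * 1_000 / clock_hz,
        "us": cycles * 1_000_000 / clock_hz,
    }


timing = execution_time(11_325)
print(timing["ms"])  # prints 11.325
```

At exactly 1 MHz one cycle takes 1 µs, so the microsecond figure always equals the cycle count.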
Memory Usage
I will also count the number of bytes for each operation from this link and calculate the memory usage.
Component | Bytes |
---|---|
Program Code | 25 |
Pointers/Variables | 2 |
Total Memory Usage | 27 |
Program Code
Instruction | Bytes |
---|---|
LDA #$00 | 2 |
STA $40 | 2 |
LDA #$02 | 2 |
STA $41 | 2 |
LDA #$07 | 2 |
LDY #$00 | 2 |
STA ($40),Y | 2 |
INY | 1 |
BNE loop (inner) | 2 |
INC $41 | 2 |
LDX $41 | 2 |
CPX #$06 | 2 |
BNE loop (outer) | 2 |
Subtotal | 25 |

The instructions that touch $40 and $41 use zero-page addressing, so each one takes 2 bytes (opcode plus a one-byte address) rather than the 3 bytes an absolute address would need.
Pointers/Variables
Pointer/Variable | Address | Bytes |
---|---|---|
Pointer Low Byte | $40 | 1 |
Pointer High Byte | $41 | 1 |
Subtotal | | 2 |
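As a cross-check on the program size, here is a small Python sketch that tallies bytes per instruction from its addressing mode. It assumes the assembler picks the zero-page encoding for $40 and $41 (two bytes each), which matches the 3-cycle and 5-cycle timings used for those instructions earlier. The mode labels and `program_size` are names I made up for this sketch.

```python
# Instruction length in bytes for each 6502 addressing mode used here.
SIZE_BY_MODE = {
    "implied": 1,
    "immediate": 2,
    "zero_page": 2,
    "indirect_indexed": 2,
    "relative": 2,
}

PROGRAM = [
    ("LDA #$00", "immediate"),
    ("STA $40", "zero_page"),
    ("LDA #$02", "immediate"),
    ("STA $41", "zero_page"),
    ("LDA #$07", "immediate"),
    ("LDY #$00", "immediate"),
    ("STA ($40),Y", "indirect_indexed"),
    ("INY", "implied"),
    ("BNE loop", "relative"),
    ("INC $41", "zero_page"),
    ("LDX $41", "zero_page"),
    ("CPX #$06", "immediate"),
    ("BNE loop", "relative"),
]


def program_size(program):
    """Total assembled size in bytes."""
    return sum(SIZE_BY_MODE[mode] for _, mode in program)


print(program_size(PROGRAM))      # prints 25
print(program_size(PROGRAM) + 2)  # prints 27 (code plus the two pointer bytes)
```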
📌 Let's Optimize it!
Now, the provided code works fine but I believe we can make it run faster! There are three things that we are going to change:
- Change Addressing Mode
Indirect indexed addressing (STA ($40),Y) takes 6 cycles per store. Absolute indexed addressing (STA $0200,Y) is cheaper, so we can switch to that instead.
- Loop Adjustment
Instead of writing one byte per iteration and then changing the page, we can write four bytes in each iteration, one to each page. This cuts the number of iterations from 1,024 to 256.
- Remove Pointer
Following from the first change, absolute addressing means we no longer need to manage a pointer at addresses $40 and $41.
With all these changes in mind, this is what the final optimized code looks like:
LDA #$07 ; Load accumulator with color number $07
LDY #$00 ; Initialize Y register to 0
fill_screen:
STA $0200,Y ; Store $07 at $0200 + Y
STA $0300,Y ; Store $07 at $0300 + Y
STA $0400,Y ; Store $07 at $0400 + Y
STA $0500,Y ; Store $07 at $0500 + Y
INY ; Increment Y
BNE fill_screen ; Branch to fill_screen if Y != $00 (256 iterations)
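To convince ourselves that the unrolled loop still touches every pixel exactly once, here is a minimal Python model of it. As before, this is a sketch rather than an emulator, and `fill_unrolled` is a hypothetical name.

```python
def fill_unrolled(colour=0x07):
    memory = [0] * 0x10000
    writes = 0
    iterations = 0
    y = 0
    while True:
        # Four stores per iteration, one into each display page.
        for base in (0x0200, 0x0300, 0x0400, 0x0500):
            memory[base + y] = colour  # STA base,Y
            writes += 1
        iterations += 1
        y = (y + 1) & 0xFF             # INY wraps from $FF back to $00
        if y == 0:                     # BNE fill_screen is not taken
            break
    return memory, iterations, writes


memory, iterations, writes = fill_unrolled()
print(iterations, writes)  # prints 256 1024
```

The loop body runs 256 times instead of 1,024, but the number of stores stays the same, so the savings come from the cheaper store and from executing INY and BNE a quarter as often.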
Calculating Performance for Optimized Code
Instruction | Cycles | Count | Alt Cycle | Alt Count | Total Cycles |
---|---|---|---|---|---|
LDA #$07 | 2 | 1 | - | - | 2 |
LDY #$00 | 2 | 1 | - | - | 2 |
STA $0200,Y | 5 | 256 | - | - | 1,280 |
STA $0300,Y | 5 | 256 | - | - | 1,280 |
STA $0400,Y | 5 | 256 | - | - | 1,280 |
STA $0500,Y | 5 | 256 | - | - | 1,280 |
INY | 2 | 256 | - | - | 512 |
BNE fill_screen | 3 | 255 | 2 | 1 | 767 |
Total | | | | | 6,403 |

STA in absolute,Y mode always takes 5 cycles, and the final BNE is not taken, so it costs 2 cycles instead of 3.
Execution Timing Details
Parameter | Value |
---|---|
Total Cycles | 6,403 cycles |
Clock Speed | 1 MHz |
Cycle Time | 1 µs per cycle |
Execution Time | |
   • Seconds (s) | 0.006403 s |
   • Milliseconds (ms) | 6.403 ms |
   • Microseconds (µs) | 6,403 µs |
As you can see, the changes I made cut the execution time by about 43%, from 11.325 ms down to 6.403 ms!
In the next blog post, we will modify the code to change the colors and run some experiments.