Exploring 6502 Assembly Language - Lab 1 (Part One)

Introduction

Hi everybody! In this blog post, I document my experience exploring 6502 assembly language as part of my work for SPO600. Lab 1 is about implementing bitmap code, calculating its execution time, and optimizing parts of it. This is Part 1 of Lab 1.

6502 Emulator

To write and test my assembly programs, I started by accessing the 6502 Emulator.

📌 Initial Code

We will start with this code. Essentially, it fills the entire bitmap display with one color. The Y register holds the index within the current page, and the X register is used to end the outer loop by comparing the current page number with a fixed value.

            lda #$00        ; set a pointer in memory location $40 to point to $0200
            sta $40         ; ... low byte ($00) goes in address $40
            lda #$02
            sta $41         ; ... high byte ($02) goes into address $41
            lda #$07        ; colour number
            ldy #$00        ; set index to 0
    loop:   sta ($40),y     ; set pixel colour at the address (pointer)+Y
            iny             ; increment index
            bne loop        ; continue until done the page (256 pixels)
            inc $41         ; increment the page
            ldx $41         ; get the current page number
            cpx #$06        ; compare with 6
            bne loop        ; continue until done all pages

There are two loops in the code. The inner loop fills one page (256 bytes) with the color. On its first run, it fills the entire $0200 page.

Note that this address is split into two bytes ($00 and $02), which are stored in memory locations $40 and $41 to form the pointer.

    loop:   sta ($40),y     ; set pixel colour at the address (pointer)+Y
            iny             ; increment index
            bne loop        ; continue until done the page (256 pixels)

Once the inner loop finishes a page, the outer loop increments the high byte of the pointer and branches back to loop. This repeats until the high byte reaches page $06 (the value in $41 equals the immediate value $06).

     inc $41        ; increment the page
     ldx $41        ; get the current page number
     cpx #$06    ; compare with 6
     bne loop    ; continue until done all pages
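
To double-check which addresses the two loops touch, here is a small Python sketch that mimics the pointer in $40/$41 and the Y index. It is only a model of the addressing, not of the emulator, and the names `simulate_fill` and `memory` are made up for this illustration.

```python
def simulate_fill(colour=0x07, start_page=0x02, end_page=0x06):
    """Model the nested loops: pointer at $40/$41 plus the Y index."""
    memory = {}                  # hypothetical sparse memory: address -> byte
    lo, hi = 0x00, start_page    # contents of $40 (low byte) and $41 (high byte)
    y = 0x00
    while True:
        memory[((hi << 8) | lo) + y] = colour   # sta ($40),y
        y = (y + 1) & 0xFF                      # iny wraps from $FF back to $00
        if y != 0:                              # bne loop (inner) is taken
            continue
        hi = (hi + 1) & 0xFF                    # inc $41
        if hi == end_page:                      # ldx $41 / cpx #$06
            break                               # bne loop (outer) falls through
    return memory


written = simulate_fill()
print(len(written), hex(min(written)), hex(max(written)))  # 1024 0x200 0x5ff
```

So the code writes 1,024 bytes in total, from $0200 through $05FF, which is four full pages.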

📌 Calculating Performance

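Now, we will calculate how long the code takes to execute, assuming a 1 MHz clock speed. The 6502 Reference Sheet is used to count the number of cycles.

Below is a table detailing how the performance is calculated. For the branch instructions, Cycles and Count cover the case where the branch is taken (3 cycles), while Alt Cycle and Alt Count cover the case where it is not taken (2 cycles). This assumes the branches do not cross a page boundary.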

Instruction	Cycles	Count	Alt Cycle	Alt Count	Total Cycles
`LDA #$00`	2	1	-	-	2
`STA $40`	3	1	-	-	3
`LDA #$02`	2	1	-	-	2
`STA $41`	3	1	-	-	3
`LDA #$07`	2	1	-	-	2
`LDY #$00`	2	1	-	-	2
`STA ($40),Y`	6	1,024	-	-	6,144
`INY`	2	1,024	-	-	2,048
`BNE loop`	3	1,020	2	4	3,068
`INC $41`	5	4	-	-	20
`LDX $41`	3	4	-	-	12
`CPX #$06`	2	4	-	-	8
`BNE loop`	3	3	2	1	11
Total					11,325
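
As a sanity check on the arithmetic, the table can be tallied with a few lines of Python. The numbers are copied from the table above; the `rows` layout and the `total_cycles` helper are just names I picked for this sketch.

```python
# (instruction, cycles, count, alt_cycles, alt_count), copied from the table above
rows = [
    ("LDA #$00",         2, 1,    0, 0),
    ("STA $40",          3, 1,    0, 0),
    ("LDA #$02",         2, 1,    0, 0),
    ("STA $41",          3, 1,    0, 0),
    ("LDA #$07",         2, 1,    0, 0),
    ("LDY #$00",         2, 1,    0, 0),
    ("STA ($40),Y",      6, 1024, 0, 0),
    ("INY",              2, 1024, 0, 0),
    ("BNE loop (inner)", 3, 1020, 2, 4),  # taken 1,020 times, not taken 4 times
    ("INC $41",          5, 4,    0, 0),
    ("LDX $41",          3, 4,    0, 0),
    ("CPX #$06",         2, 4,    0, 0),
    ("BNE loop (outer)", 3, 3,    2, 1),  # taken 3 times, not taken once
]


def total_cycles(rows):
    """Sum cycles*count plus the alternate (branch not taken) cycles."""
    return sum(c * n + ac * an for _, c, n, ac, an in rows)


total = total_cycles(rows)
print(total)  # 11325
```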

Execution Timing Details

Parameter	Value
Total Cycles	11,325 cycles
Clock Speed	1 MHz
Cycle Time	1 µs per cycle
Execution Time
• Seconds (s)	0.011325 s
• Milliseconds (ms)	11.325 ms
• Microseconds (µs)	11,325 µs
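
The conversion from cycles to time is just a division by the clock frequency. Here is a minimal sketch, assuming a fixed clock and no other delays; `execution_time` is a hypothetical helper name, not something from the emulator.

```python
def execution_time(cycles, clock_hz=1_000_000):
    """Return (seconds, milliseconds, microseconds) for a given cycle count."""
    seconds = cycles / clock_hz
    milliseconds = cycles * 1_000 / clock_hz
    microseconds = cycles * 1_000_000 / clock_hz
    return seconds, milliseconds, microseconds


s, ms, us = execution_time(11_325)
print(f"{s:.6f} s, {ms:.3f} ms, {us:.0f} µs")
```

At 1 MHz each cycle is 1 µs, so the microsecond figure is always equal to the cycle count.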

Memory Usage

I will also count the number of bytes for each operation from this link and calculate the memory usage.

Component	Bytes
Program Code	25
Pointers/Variables	2
Total Memory Usage	27

Program Code

Instruction	Bytes
`LDA #$00`	2
`STA $40`	2
`LDA #$02`	2
`STA $41`	2
`LDA #$07`	2
`LDY #$00`	2
`STA ($40),Y`	2
`INY`	1
`BNE loop (inner)`	2
`INC $41`	2
`LDX $41`	2
`CPX #$06`	2
`BNE loop (outer)`	2
Subtotal	25

Pointers/Variables

Pointer/Variable	Address	Bytes
Pointer Low Byte	`$40`	1
Pointer High Byte	`$41`	1
Subtotal		2

📌 Let's Optimize it!

Now, the provided code works fine but I believe we can make it run faster! There are three things that we are going to change:

Change Addressing Mode

Indirect indexed addressing (STA ($40),Y) takes 6 cycles per store. We can switch to absolute indexed addressing (STA $0200,Y), which is cheaper.

Loop Adjustment

Instead of writing one byte per iteration and then changing the page, we can write four bytes in each iteration, one to each page. This reduces the number of loop iterations from 1,024 to 256.

Remove Pointer

This follows from the first change: with absolute addressing, we no longer need to maintain a pointer at addresses $40 and $41, so the setup instructions and the page-increment logic can go.

With all these changes in mind, this is what the final optimized code looks like:

        LDA #$07        ; Load accumulator with color number $07
        LDY #$00        ; Initialize Y register to 0

fill_screen:
        STA $0200,Y     ; Store $07 at $0200 + Y
        STA $0300,Y     ; Store $07 at $0300 + Y
        STA $0400,Y     ; Store $07 at $0400 + Y
        STA $0500,Y     ; Store $07 at $0500 + Y
        INY             ; Increment Y
        BNE fill_screen ; Branch to fill_screen if Y != $00 (256 iterations)
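
To convince myself that the unrolled loop still covers the whole display, here is a quick Python model of it. As before, this only mimics the addressing; `simulate_unrolled_fill` and its parameters are names invented for this sketch.

```python
def simulate_unrolled_fill(colour=0x07, bases=(0x0200, 0x0300, 0x0400, 0x0500)):
    """Model the optimized loop: four absolute,Y stores per iteration."""
    memory = {}        # hypothetical sparse memory: address -> byte
    iterations = 0
    y = 0x00
    while True:
        for base in bases:
            memory[base + y] = colour   # STA base,Y
        y = (y + 1) & 0xFF              # INY wraps from $FF back to $00
        iterations += 1
        if y == 0:                      # BNE fill_screen falls through
            break
    return memory, iterations


filled, iterations = simulate_unrolled_fill()
print(len(filled), iterations)  # 1024 256
```

The same 1,024 bytes from $0200 through $05FF get written, but in 256 iterations instead of 1,024.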

Calculating Performance for Optimized Code

Instruction	Cycles	Count	Alt Cycle	Alt Count	Total Cycles
`LDA #$07`	2	1	-	-	2
`LDY #$00`	2	1	-	-	2
`STA $0200,Y`	5	256	-	-	1,280
`STA $0300,Y`	5	256	-	-	1,280
`STA $0400,Y`	5	256	-	-	1,280
`STA $0500,Y`	5	256	-	-	1,280
`INY`	2	256	-	-	512
`BNE fill_screen`	3	255	2	1	767
Total					6,403

Execution Timing Details

Parameter	Value
Total Cycles	6,403 cycles
Clock Speed	1 MHz
Cycle Time	1 µs per cycle
Execution Time
• Seconds (s)	0.006403 s
• Milliseconds (ms)	6.403 ms
• Microseconds (µs)	6,403 µs