Small Code Size on RP2040 Forth
Every since watching this nifty video on boot sector games, I've been obsessed with the idea of optimizing for code size, i.e., making the compiled code as few bytes as possible.
2022-06-28 - Boot Sector Games
The short summary of the video is that there are some games that have been written which fit entirely within the 512 KB boot sector of a floppy disk, and run without an operating system first being loaded. Not great games, but working games.
Something which challenged me recently was trying to reduce the compiled code size of a small program I wrote in Mecrisp Stellaris Forth. The program was supposed to read the function bits of the RP2040 GPIO CTRL registers, and output the current function that is set for each GPIO pin, in a human readable format. There are nine function options for each pin, and there are 29 pins. The exact function in a slot varies, however, for each pin, e.g., function 1 for GPIO 0 is SPIO RX, but for GPIO 2 it is SPIO SCK, and for GPIO 8 it is SPI1 RX.
Screenshot of GPIO functions
Aiming for small code size, it was out of the question to simply have a table of strings, with a full description for each slot, which would require over 1KB of space just for the string table. There is quite a bit of pattern and partial repetition, however, so that it is possible to build the string for a given slot using logic and comparisions, based on the bits in the pin number. This saves a lot of code space in terms of string data, but that benefit is offset somewhat by extra comparisons and bitwise operations that must be performed.
In my initial solution, I tried to generate long, easy to read strings, such as "UART0_RX", but in the end the compiled code size came out to somewhere around 900-1000 bytes, which seemed much too large for a program with such a humble purpose. The compromise I settled on, to avoid abondoning the "human-readable" goal completely, was to have abbreviated strings limited to three characters each. So, "UART0_RX" becomes "U0R". That is not self-explanatory, but it is much easier to associate this with "UART0_RX" than if you were just looking at the raw number code from the register.
Here is the program I came up with:
\ ******************************************************************
\ gpio_diag.fs is part of the mf-rp2040 project
\ SPDX-FileCopyrightText: 2022 Christopher Howard <christopher@librehacker.com>
\ SPDX-License-Identifier: GPL-3.0-or-later
\
\ Currently provides just the gpio_fns word.
\
\ public words:
\ gpio_fns
\ ******************************************************************
gpio_diag
marker gpio_diag
: pr1/0 if ." 1" else ." 0" then ;
$40014004 constant IO_BANK0_GPIO0_CTRL
\ ******************************************************************
\ This must be a separate word because a do loop cannot jump around
\ this much code (error "jump too far")
\
\ Legend:
\ SXR: SPI RX
\ SXT: SPI TX
\ SXC: SPI CSn
\ SXS: SPI SCK
\ UXR: UART RX
\ UXT: UART TX
\ UXC: UART CTS
\ UXS: UART RTS
\ IXC: I2C SCL
\ IXD: I2C SDA
\ PXA: PWM A
\ PXB: PWM B
\ SIO: SIO
\ PIX : PIO
\ CIX: CLOCK GPIN
\ COX: CLOCK GPOUT
\ NFC: NO FUNCTION
\ UOD: USB OVCUR DET
\ UVD: USB VBUS DET
\ UVE: USB VBUS EN
\ ******************************************************************
: gpio_fns' ( pin fn# -- )
case 1 of ." S" dup %1000 and pr1/0
%11 and case 0 of ." R"
endof 1 of ." C" endof %10 of ." S" endof ." T"
endcase endof
2 of ." U" dup 3 > over 12 < and over dup 19 > swap 28 <
and or pr1/0 %11 and case 0 of ." T" endof
1 of ." R" endof %10 of ." C" endof ." S"
endcase endof
3 of ." I" dup %10 and pr1/0
%1 and if ." C" else ." D" then endof
4 of ." P" dup 1 rshift %111 and $30 + emit
%1 and if ." B" else ." A" then endof
5 of drop ." SIO" endof
6 of drop ." PI0" endof
7 of drop ." PI1" endof
8 of ." C"
case 20 of ." I0" endof
21 of ." O0" endof
22 of ." I1" endof
23 of ." O1" endof
24 of ." O2" endof
25 of ." O3" endof
." NFC" endcase endof
9 of ." U" 3 mod case
0 of ." OD" endof
1 of ." VD" endof
." VE" endcase endof endcase ;
\ ******************************************************************
\ Print out the function currently set for each pin
\ ******************************************************************
: gpio_fns ( -- )
30 0 do
i . i 8 * IO_BANK0_GPIO0_CTRL + @ %11111 and i swap
gpio_fns' cr loop ;
behead pr1/0
behead IO_BANK0_GPIO0_CTRL
behead gpio_fns'
The two main words are GPIO_FNS' and GPIO_FNS, which compile to 582 bytes and 80 bytes, with a few more bytes for the other words, as well as the overhead for each entry in the dictionary. That was not a satisfying result, as I feel like this sort of program shouldn't be needing more than 200 or 300 bytes. But I couldn't think of any practical ways to reduce the code size further. I asked on #mecrisp, but nobody there had any ideas either, although somebody asked if Mecrisp Forth had some kind of BETWEEN word which could replace this code:
3 > over 12 < and over dup 19 > swap 28 < and or
But sadly, no such word was available. I could of course define such a word, but since (at present) I need to use it in only one place, it would actually increase the code size to factor it out, due to the dictionary entry overhead.
Something I was really wondering about was if there was some other way to print out the characters which would be more code-size efficient. However, it seemed that using the ." and " words took up less space than all the other methods I knew of. To explain how that works, see this disassembly:
Here is a word which prints out just the letter A, using the ." and " words:
: foo ." A" ; ok.
see foo
20020392: B500 push { lr }
20020394: F7E2 bl 20002878 --> .' A'
20020396: FA70
20020398: 4101
2002039A: BD00 pop { pc }
ok.
Not counting the push and pop, that is four bytes for a branch instruction (no doubt the code which handles the following string). Then one byte for the length of the string (01h) and then the character itself (41h). Here is the same word defined using a simple emit:
: foo $41 emit ; Redefine foo. ok.
see foo
200203A6: 3F04 subs r7 #4
200203A8: 603E str r6 [ r7 #0 ]
200203AA: 2641 movs r6 #41
200203AC: B500 push { lr }
200203AE: F7E2 bl 20002480 --> emit
200203B0: F867
200203B2: BD00 pop { pc }
ok.
Not counting the push or pop, we have four bytes for the branch instruction to the EMIT word, and there are six more bytes worth of instructions for dropping $41 on the top of the stack. So, the first approach will be better in terms of code size, even for a single-character string.
I should think that this would be a sensible place for the Mecrisp compiler to do an optimization: say, just put the number in some other scratch register, and call a special version of EMIT which uses that value. But I imagine that optimization is easier to talk about than to code into the compiler.
Another optimization I was wondering about, was if it would be possible to replace the case statement with something that has a smaller footprint. The case statement works by comparing Top Of Stack (TOS) with a case value. If it doesn't match, branch to the next comparison. If it does, keep running code, and eventually there is a branch instruction which takes you to the end of the whole construction. For example:
20020598: 2E06 cmp r6 #6
2002059A: D106 bne 200205AA
2002059C: CF40 ldmia r7 { r6 }
2002059E: CF40 ldmia r7 { r6 }
200205A0: F7E2 bl 20002878 --> .' PI0'
200205A2: F96A
200205A4: 5003
200205A6: 3049
200205A8: E066 b 20020678
However, in my situation, the case numbers are all in a whole number sequence. So, I should be able to just store an array of vectors to the different code blocks, then use the appropriate vector. I believe this would lower code size overall, since I would basically be adding only a single address for each two CMP and BNE instructions which I removed.
That sounds good, but I don't know how to accomplish that without either rewriting the whole GPIO_FNS' word in assembly, or coming up with a fancy Forth compile-time word which somehow builds the array and all the other branching code. Both ideas sound daunting, but perhaps this could be my next mini-project...
Alaskalinuxuser, 2022-07-05
I just looked up bootchess, it is only 487kb. Amazing how small a program can be!