Pico Bootloader
I recently bought a few Raspberry Pi Pico and Pico 2 boards and wanted to understand how to write my own firmware without using any SDK. No SDK means I had to write my own bootloader to make it work, and it was a fascinating journey that I will recount in this post.
Overview
The Pico board is interesting because its microcontroller datasheet 1 is really nice, making it beginner-friendly for those who want to get into the technical details.
The RP2040 microcontroller used in the Pico features two ARM Cortex-M0+ cores and supports execute-in-place from an external flash memory, which is where the firmware is stored.
Memory
There are three main memory components:
- ROM: read-only memory containing the first stage bootloader and some utility functions.
- SRAM: volatile memory used at runtime.
- Flash: non-volatile external memory used to store firmware.
Flash memory is located outside of the RP2040 package and requires a specific component, namely the SSI controller, to communicate with it.
XIP
There are two ways to run the firmware:
- Preload it from flash into SRAM, and then execute it from there.
- Execute it directly from flash, a process known as eXecute-In-Place (XIP).
The first method is faster, but takes longer to set up because we need to copy the entire firmware at boot. It is also limited by the SRAM size (264kB), which is significantly smaller than the flash size (2MB).
The second method is slower because flash memory is inherently slower than SRAM and is physically farther away. However, XIP can benefit from a dedicated 16kB cache that is as fast as SRAM, and there are methods to reduce SPI instruction overhead, as we will see.
Memory mapping
Memory and registers are mapped into a single 32-bit address space.
| Region name | Base address |
|---|---|
| ROM | 0x00000000 |
| XIP | 0x10000000 |
| SRAM | 0x20000000 |
| Cortex-M0+ Registers | 0xe0000000 |
For example, 0xe000ed00 maps to the CPUID register, which contains the core identifier.
XIP provides a high level of abstraction to access flash with memory mapping.
SSI ↔ SPI
At a lower level, the SSI controller has transmit and receive FIFOs for storing dataframes to be sent and received via SPI.
On the RP2040, each of these FIFOs has a capacity of 16 × 32-bit dataframes.
QSPI
Serial Peripheral Interface (SPI) is a synchronous serial communication standard that defines how the SSI controller and flash memory communicate with each other.
Quad SPI (QSPI) is a variant with four data lines, allowing for a larger bandwidth.
The W25Q16JV flash memory used by the Pico supports six different read instructions 2:
| Name | Code |
|---|---|
| Read Data | 03h |
| Fast Read | 0Bh |
| Fast Read Dual Output | 3Bh |
| Fast Read Quad Output | 6Bh |
| Fast Read Dual I/O | BBh |
| Fast Read Quad I/O | EBh |
Read Data (03h)
The simplest form of reading proceeds as follows:
- Transmission is initiated by driving CS (Chip Select) low.
- The 8-bit instruction code is transmitted on IO0.
- The 24-bit address is transmitted on IO0.
- Data from this address is received on IO1.
- Transmission is terminated by driving CS high.
[Figure: 03h]
Transmission is made one bit per clock cycle through the serial channel.
The address is automatically incremented after each byte, so multiple contiguous bytes can be read with a single instruction until CS is driven high.
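As a rough illustration of the sequence above, the four bytes clocked out on IO0 for a Read Data transfer can be assembled as follows. The helper name `build_read_data_cmd` is made up for this sketch; the layout (one instruction byte followed by the 24-bit address, most significant byte first) follows the description above.

```c
#include <stdint.h>

// Hypothetical helper: fills `frame` with the 4 bytes sent on IO0
// for a Read Data (03h) transfer: instruction, then A23..A0, MSB first.
void build_read_data_cmd(uint32_t addr, uint8_t frame[4]) {
    frame[0] = 0x03;                  // Read Data instruction code
    frame[1] = (uint8_t)(addr >> 16); // A23..A16
    frame[2] = (uint8_t)(addr >> 8);  // A15..A8
    frame[3] = (uint8_t)addr;         // A7..A0
}
```

These 32 bits are what the flash must receive before it starts returning data on IO1.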
Fast Read (0Bh)
The Fast Read instruction can operate at a higher frequency (133MHz instead of 50MHz), but requires waiting eight "dummy" clock cycles after the address before data is sent.
[Figure: 0Bh]
Data transmission starts after 40 clock cycles instead of 32, but cycles are shorter.
Fast Read Dual Output (3Bh)
Pins can operate in half-duplex mode, allowing the IO0 line to transmit output as well. This results in two output bits being transferred per clock cycle.
[Figure: 3Bh]
Fast Read Quad Output (6Bh)
QSPI devices can send four bits per clock cycle thanks to the two additional pins.
[Figure: 6Bh]
Fast Read Quad I/O (EBh)
The address can also be transmitted four bits at a time.
[Figure: EBh]
The continuation code A0h 1 is sent after the address to allow omitting the instruction code on the next read instruction (after CS is driven high and then low), which saves clock cycles and is particularly suitable for XIP.
[Figure: EBh (continuous mode)]
This instruction is considerably faster than the others, as subsequent reads only take 20 clock cycles to transmit a 32-bit word, thanks to the continuous mode.
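The cycle counts quoted in this section can be reproduced with a small model. This is only a sketch: the struct and function names are invented here, the instruction is assumed to always travel on a single line, and the four dummy cycles for EBh are taken from the WAIT_CYCLES value used later in this post.

```c
#include <stdint.h>

// Hypothetical description of one read transfer with a 24-bit address.
typedef struct {
    uint8_t instr_bits;   // 8, or 0 when the instruction is omitted
    uint8_t addr_lines;   // lines used for the address and mode bits
    uint8_t mode_bits;    // 8 when a continuation code follows the address
    uint8_t dummy_cycles; // wait cycles before data
    uint8_t data_lines;   // lines used for data
} read_xfer_t;

// Clock cycles elapsed before the first data bit is transferred.
uint32_t cycles_before_data(read_xfer_t x) {
    return x.instr_bits + (24u + x.mode_bits) / x.addr_lines + x.dummy_cycles;
}

// Total clock cycles needed to read `data_bits` bits.
uint32_t cycles_total(read_xfer_t x, uint32_t data_bits) {
    return cycles_before_data(x) + data_bits / x.data_lines;
}
```

Under these assumptions, 03h needs 32 cycles before data, 0Bh needs 40, and EBh in continuous mode needs 6 + 2 + 4 + 8 = 20 cycles for a 32-bit word.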
Bootloader
The bootloader is the program that runs when powering on the Pico.
It is made of two parts:
- The bootrom: stored in the ROM, so cannot be modified.
- The second stage bootloader: loaded from flash memory and programmable.
The role of the second stage bootloader is either to load the firmware into SRAM or to configure the SSI controller to enable XIP. The optimal SSI configuration depends on the specific flash device and how it is connected to the RP2040 on the PCB. For W25Q16JV, the best read performance is achieved using the EBh instruction.
Different bootloaders are available in the pico-sdk repository 3:
- generic_03h: works with any flash supporting the 03h read instruction.
- w25q080: the best configuration for the Pico.
In the next sections, we will discuss how the generic bootloader works and how to configure SSI so that XIP performs as well as possible.
Generic bootloader
Overview:
- Configure SSI to use the widely available 03h instruction
- Set up vector table
- Run firmware
Disable SSI
SSI configuration requires disabling the SSI controller by setting the SSIENR register to 0.
.set XIP_SSI_BASE, 0x18000000
.set SSI_SSIENR, 0x08
ldr r3, =XIP_SSI_BASE
// Disable SSI
movs r1, #0
str r1, [r3, #SSI_SSIENR]
Equivalent C:
*((volatile uint32_t*)0x18000008) = 0;
At the end, SSI must be re-enabled by setting SSIENR = 1.
Baud rate
The clock used for SPI is disabled by default because the BAUDR register is set to 0.
This register contains the clock divider SCKDV, which is used to set the serial clock 1:
f_{\text{sclk}} = \frac{f_{\text{sys}}}{\text{SCKDV}} \le f_{\text{max}}
Where:
- f_{\text{sys}} = 125MHz
- f_{\text{max}} = 50MHz when using the 03h instruction
The SDK defines SCKDV = 4, therefore f_{\text{sclk}} = 31.25MHz \le 50MHz.
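The divider choice can be sketched as a small calculation. The function name `ssi_sckdv` is hypothetical, and the sketch assumes that SCKDV must be an even value of at least 2, which is worth checking against the RP2040 datasheet.

```c
#include <stdint.h>

// Smallest even divider >= 2 such that f_sys / SCKDV <= f_max.
// Assumption: the SSI controller only accepts even SCKDV values.
uint32_t ssi_sckdv(uint32_t f_sys_hz, uint32_t f_max_hz) {
    // ceil(f_sys / f_max), computed in 64 bits to avoid overflow
    uint32_t div = (uint32_t)(((uint64_t)f_sys_hz + f_max_hz - 1) / f_max_hz);
    if (div < 2) div = 2;  // lowest divider assumed to be allowed
    if (div & 1) div += 1; // round up to the next even value
    return div;
}
```

With a 125MHz system clock this yields 4 for a 50MHz limit, which agrees with the value used by the SDK.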
SSI configuration
Here is an overview of the generic configuration:
| Name | Value | Description |
|---|---|---|
| SPI_FRF | [STD] DUAL QUAD | SPI frame format |
| DFS_32 | 31 | Dataframe size = n+1 |
| TMOD | TX_RX TX RX [EEPROM] | Transfer mode |
| NDF | 0 | Number of dataframes = n+1 |
| XIP_CMD | 0x03 | SPI command to use with XIP |
| WAIT_CYCLES | 0 | Number of dummy cycles |
| INST_L | NONE 4b [8b] 16b | Instruction length |
| ADDR_L | 6 | Address length = 4n |
| TRANS_TYPE | [1C1A] 1C2A 2C2A | Instruction & address format |
We want to read a single 32-bit dataframe containing the instruction to execute.
In the EEPROM transfer mode, the SSI controller sends an address and receives data in response. It may seem confusing at first that there are other transfer modes, but keep in mind that XIP is just one example of SSI usage. For example, we could use SRAM as external memory for another microcontroller and transmit data from SSI.
We will discuss the TRANS_TYPE value later.
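To make the table more concrete, here is a sketch of how the selected values could be packed into the CTRLR0 and SPI_CTRLR0 registers. The bit offsets and field encodings below are assumptions based on the SSI register description in the RP2040 datasheet and should be verified there; the function names are invented for this example. NDF lives in a separate register (CTRLR1) and is left out.

```c
#include <stdint.h>

// Field offsets (assumed, to be checked against the datasheet).
enum {
    CTRLR0_SPI_FRF_LSB = 21,
    CTRLR0_DFS_32_LSB = 16,
    CTRLR0_TMOD_LSB = 8,
    SPI_CTRLR0_XIP_CMD_LSB = 24,
    SPI_CTRLR0_WAIT_CYCLES_LSB = 11,
    SPI_CTRLR0_INST_L_LSB = 8,
    SPI_CTRLR0_ADDR_L_LSB = 2,
    SPI_CTRLR0_TRANS_TYPE_LSB = 0
};

// Field encodings (assumed) for the values selected in the table.
enum { SPI_FRF_STD = 0, TMOD_EEPROM = 3, INST_L_8B = 2, TRANS_TYPE_1C1A = 0 };

uint32_t ssi_ctrlr0(uint32_t spi_frf, uint32_t dfs_32, uint32_t tmod) {
    return (spi_frf << CTRLR0_SPI_FRF_LSB) |
           (dfs_32 << CTRLR0_DFS_32_LSB) |
           (tmod << CTRLR0_TMOD_LSB);
}

uint32_t ssi_spi_ctrlr0(uint32_t xip_cmd, uint32_t wait_cycles,
                        uint32_t inst_l, uint32_t addr_l,
                        uint32_t trans_type) {
    return (xip_cmd << SPI_CTRLR0_XIP_CMD_LSB) |
           (wait_cycles << SPI_CTRLR0_WAIT_CYCLES_LSB) |
           (inst_l << SPI_CTRLR0_INST_L_LSB) |
           (addr_l << SPI_CTRLR0_ADDR_L_LSB) |
           (trans_type << SPI_CTRLR0_TRANS_TYPE_LSB);
}
```

With these assumed offsets, `ssi_ctrlr0(SPI_FRF_STD, 31, TMOD_EEPROM)` packs to 0x001f0300 and `ssi_spi_ctrlr0(0x03, 0, INST_L_8B, 6, TRANS_TYPE_1C1A)` packs to 0x03000218.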
ROM helper
The bootloader in the SDK works well, but it is worth noting that the SSI configuration may be reduced to a single call to the _flash_enter_cmd_xip ROM helper function.
However, the address of the function is not fixed. We need to locate it in the function lookup table by passing the function code 'C' 'X' to the built-in lookup function 1.
.set ROM_BASE, 0x00000000
.set func_table, 0x14
.set table_lookup, 0x18
.set flash_enter_cmd_xip, ('X' << 8) + 'C'
movs r2, #ROM_BASE
ldrh r0, [r2, #func_table]
ldr r1, =flash_enter_cmd_xip
// r0 = table_lookup(r0, r1)
ldrh r2, [r2, #table_lookup]
blx r2
// flash_enter_cmd_xip()
blx r0
The parameters and return value of table_lookup can be deduced from sources 4:
- r0: pointer to the function table
- r1: the function code
- Returned function pointer is in r0
Note that this is primarily for learning purposes, as the resulting binary is only 8 bytes smaller, at the cost of extra overhead due to the lookup.
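The lookup can be modelled on a host machine to see what the assembly does. This is a sketch only: it assumes the function table is a list of 16-bit (code, pointer) pairs terminated by a zero code, and both function names are hypothetical.

```c
#include <stdint.h>

// Two-character function code: first character in the low byte,
// matching ('X' << 8) + 'C' for the 'C' 'X' code.
uint16_t rom_table_code(char c1, char c2) {
    return (uint16_t)(((uint8_t)c2 << 8) | (uint8_t)c1);
}

// Host-side model of the lookup (assumed table layout, see above).
// Returns the 16-bit pointer stored next to `code`, or 0 if not found.
uint16_t rom_table_lookup_model(const uint16_t *table, uint16_t code) {
    for (; table[0] != 0; table += 2) {
        if (table[0] == code) {
            return table[1];
        }
    }
    return 0;
}
```

On the real chip, the table pointer and the lookup function pointer are themselves read as halfwords from fixed ROM offsets, as the assembly above shows.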
Vector table
The vector table 5 is used by exceptions and contains:
- The reset value of the Main Stack Pointer msp
- Pointers to exception handlers
The bootloader first sets the vector table address — the address just after the bootloader — into the VTOR register, then it sets msp and triggers the Reset exception to start firmware.
.set XIP_BASE, 0x10000000
.set PPB_BASE, 0xe0000000
.set VTOR, 0xed08
vector_table:
ldr r0, =(XIP_BASE + 0x100)
ldr r1, =(PPB_BASE + VTOR)
str r0, [r1]
ldmia r0, {r0, r1}
msr msp, r0
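bx r1
Equivalent pseudo-C:
uint32_t* vec_table = (uint32_t*)0x10000100;
volatile uint32_t* vtor = (volatile uint32_t*)0xe000ed08;
*vtor = (uint32_t)vec_table;
msp = vec_table[0];  // msp is a core register, written with msr
goto vec_table[1];   // branch to the reset handler
Firmware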
We can test our bootloader with a firmware that blinks an LED if booted successfully 6.
You can find a tiny example written in Rust here. Please note that some components may not function properly because the system is not fully initialized.
Memory layout
The bootrom expects the second stage bootloader to:
- Start at the beginning of flash memory
- Be exactly 252 bytes long
- End with a valid 4-byte CRC32 checksum of the first 252 bytes
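The checksum requirement can be sketched as follows. The CRC parameters (polynomial 0x04c11db7, initial value 0xffffffff, MSB-first, no final XOR) and the little-endian storage of the result are assumptions based on the RP2040 documentation and the SDK tooling, so they should be double-checked before relying on them. The function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

// Bitwise CRC32, MSB-first, no reflection, no final XOR (assumed parameters).
uint32_t boot2_crc32(const uint8_t *data, size_t len) {
    uint32_t crc = 0xffffffffu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= (uint32_t)data[i] << 24;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04c11db7u : (crc << 1);
        }
    }
    return crc;
}

// Writes the checksum of the first 252 bytes into the last 4 bytes
// of a 256-byte image, least significant byte first (assumed order).
void boot2_append_crc(uint8_t image[256]) {
    uint32_t crc = boot2_crc32(image, 252);
    for (int i = 0; i < 4; ++i) {
        image[252 + i] = (uint8_t)(crc >> (8 * i));
    }
}
```

A step like this normally runs on the host at build time, after the bootloader code has been padded to 252 bytes.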
Our bootloader assumes the vector table is at 0x10000100 and contains:
- An initial value for msp
- Pointers to exception handlers 5
Linking
A linker script 7 allows setting up the memory layout of a program.
In our case, we want the bootloader to be placed at the origin of flash memory, immediately followed by the vector table and then the firmware code.
The contents of the .boot2 and .vector_table sections are explicitly defined in our code.
MEMORY {
  BOOT2 (rx) : ORIGIN = 0x10000000, LENGTH = 0x100
  FLASH (rx) : ORIGIN = 0x10000100, LENGTH = 2048K - 0x100
  SRAM (rwx) : ORIGIN = 0x20000000, LENGTH = 256K
}
SECTIONS {
  .boot2 ORIGIN(BOOT2) :
  {
    KEEP(*(.boot2));
  } > BOOT2
  .vector_table ORIGIN(FLASH) :
  {
    LONG(ORIGIN(SRAM) + LENGTH(SRAM));
    KEEP(*(.vector_table.*));
  } > FLASH
  .text :
  {
    LONG(SIZEOF(.vector_table));
    *(.text .text*);
  } > FLASH
}
Note that this example is minimal and we might want to add other sections like .data or .bss.
Flash
To flash the firmware on the board, it is more convenient to use the SWD port rather than copying the .uf2 file to the USB Mass Storage Device.
If you have a debug probe, I recommend using a tool like probe-rs 8 or openocd 9.
cargo flash --chip rp2040 --bin blinky
XIP optimization
Fast Read
| Name | Value |
|---|---|
| XIP_CMD | 0x0b |
| WAIT_CYCLES | 8 |
| SCKDV | 2 |
The SCKDV divider can be lowered because the 0Bh instruction now supports up to 133MHz.
Note that 2 is the lowest divider allowed by the SSI controller 1.
However, this configuration doesn't work.
We can inspect XIP memory and compare with expected results using probe-rs 8:
$ probe-rs read --chip rp2040 b32 0x10000000 8
0cb50000 99210000 59210200 19490a00
59210000 f4490900 01509900 01609900
$ arm-none-eabi-objdump -sj .boot2 bin/firmware
bin/firmware: file format elf32-littlearm
Contents of section .boot2:
10000000 00b50c4b 00219960 02215961 0a491960 ...K.!.`.!Ya.I.`
10000010 00215960 0949f422 99500121 996001bc .!Y`.I.".P.!.`..
[...]
We are close to the expected results, except that we lost the first byte in each 32-bit word. This is precisely what we were trying to prevent by inserting eight cycles before the data.
In fact, WAIT_CYCLES appears to be ignored in standard SPI mode, so 0Bh may not be supported.
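One way to check this reading of the dumps is to model the corruption on the host. The sketch below assumes that probe-rs prints each 32-bit word as a little-endian value while objdump prints bytes in memory order, and that sampling eight clocks too early amounts to shifting each word left by one byte. Both helper names are made up.

```c
#include <stdint.h>

// Builds a little-endian 32-bit word from 4 bytes in memory order,
// as listed by objdump.
uint32_t le32(const uint8_t b[4]) {
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

// Model of the observed read: the 8 bits sampled during the ignored
// dummy cycles come first, so the last byte of the real data is dropped.
uint32_t read_without_dummy_cycles(uint32_t expected) {
    return expected << 8;
}
```

Applying this to the first objdump word (00 b5 0c 4b) gives 0x0cb50000, and to the second (00 21 99 60) gives 0x99210000, which matches the first two words of the probe-rs dump.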
Quad Output
Quad output requires setting the QE (Quad Enable) bit in Status Register-2 to 1.
However, this is the default for the variant on the Pico. We can confirm this by identifying the part number using its top-side marking and checking the datasheet 2.
| Name | Value |
|---|---|
| SPI_FRF | STD DUAL [QUAD] |
| XIP_CMD | 0x6b |
| WAIT_CYCLES | 8 |
We should also adjust RX_SAMPLE_DLY to prevent early sampling caused by the round-trip delay.
[Figure: sampling at dly=0 vs delayed sampling (dly=4)]
In this (hypothetical) example, data arrives four cycles later, so we need to set DLY = 4.
Quad I/O
Quad I/O requires sending the continuation code 0xa0 after the address. Unfortunately, this cannot be done using control registers only, so we need to do it manually.
First, we manually write 0xeb, a dummy address, and mode 0xa0 as raw data in the TX FIFO.
movs r1, #XIP_CMD_V
str r1, [r3, #DR0]
movs r1, #0x000000a0 // Address|Mode
str r1, [r3, #DR0]
We need to wait until both dataframes are sent.
1:
// Wait for TX FIFO empty
ldr r1, [r3, #SR]
movs r0, #SR_TFE
tst r1, r0
beq 1b
// Wait for SSI inactive
movs r0, #SR_BUSY
tst r1, r0
bne 1b
Then, we configure XIP_CMD to 0xa0 and INST_L to 0. This special setting instructs the SSI controller to place this value after the address, as specified in the datasheet 1.
| Name | Value |
|---|---|
| SPI_FRF | STD DUAL [QUAD] |
| XIP_CMD | 0xa0 |
| WAIT_CYCLES | 4 |
| INST_L | [NONE] 4b 8b 16b |
| ADDR_L | 8 |
| TRANS_TYPE | 1C1A [1C2A] 2C2A |
The ADDR_L field is set to 8 because it must cover the 8 mode bits as well as the 24 address bits: (24 + 8) / 4 = 8.
The TRANS_TYPE field is set to 1C2A, which means the address format follows SPI_FRF.
Electrical considerations
Disclaimer: My knowledge in electronics is rather limited. I tried to be as exact as possible, but this section may lack precision.
Drive strength
Drive strength defines how much current a GPIO pin can support at the expected voltage.
The QSPI clock drive strength can be configured through the QSPI_SCLK register.
[Figure: QSPI_x registers]
According to the W25Q16JV specification, the typical drive for high frequencies is 8mA 2.
Clock slew
Slew rate measures how quickly a signal voltage changes between logic levels.
A slow slew rate reduces electromagnetic interference (EMI) and undesired peak frequencies 10, but it negatively impacts the duty cycle and may cause timing violations.
If timing requirements are not met, which may happen at high frequencies, the slew rate can be increased by setting the SLEWFAST bit in the QSPI_SCLK control register.
| Name | Value | Description |
|---|---|---|
| DRIVE | 2mA 4mA [8mA] 12mA | Drive strength |
| SLEWFAST | Slow [Fast] | Slew rate control |
Schmitt trigger
A Schmitt trigger converts a noisy analog signal into a clean digital output by introducing hysteresis. Hysteresis means that the thresholds T for rising and falling transitions are distinct.
At an intermediate voltage, the output keeps its current state until a threshold is reached.
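The behaviour can be illustrated with a tiny software model. This is purely a sketch of the hysteresis idea; the function name and the threshold values used with it are hypothetical and do not come from the RP2040 electrical specification.

```c
#include <stdbool.h>

// Returns the new digital output given the previous output and the
// input voltage. t_low and t_high (t_low < t_high) are the falling
// and rising thresholds.
bool schmitt_step(bool prev, double v, double t_low, double t_high) {
    if (v >= t_high) return true;  // rising threshold reached
    if (v <= t_low) return false;  // falling threshold reached
    return prev;                   // in between: keep the current state
}
```

With thresholds of 1.0 and 2.0, an input of 1.5 leaves the output unchanged whichever state it was in, so noise around that level does not toggle it.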
A Schmitt trigger enhances signal stability by filtering out small fluctuations. However, it draws current and introduces latency, which can cause timing violations at high frequencies.
If timing requirements are not met, Schmitt triggers on the QSPI data pins can be disabled by clearing the SCHMITT bit in the QSPI_SDn registers.
| Name | Value | Description |
|---|---|---|
| SCHMITT | [Disabled] Enabled | Enable Schmitt trigger |
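The pad settings from this section could be applied with a read-modify-write helper along these lines. The bit positions (SLEWFAST at bit 0, SCHMITT at bit 1, DRIVE at bits 5:4) and the drive encodings are assumptions about the RP2040 pad control registers that should be confirmed in the datasheet; `pad_config` is a hypothetical name.

```c
#include <stdint.h>

// Bit positions assumed from the RP2040 pad control register layout.
#define PAD_SLEWFAST   (1u << 0)
#define PAD_SCHMITT    (1u << 1)
#define PAD_DRIVE_LSB  4
#define PAD_DRIVE_MASK (3u << PAD_DRIVE_LSB)

// Assumed encodings of the DRIVE field.
enum pad_drive { DRIVE_2MA = 0, DRIVE_4MA = 1, DRIVE_8MA = 2, DRIVE_12MA = 3 };

// Returns `reg` with DRIVE, SLEWFAST and SCHMITT updated and all
// other bits left untouched.
uint32_t pad_config(uint32_t reg, enum pad_drive drive,
                    int slew_fast, int schmitt) {
    reg &= ~(PAD_DRIVE_MASK | PAD_SLEWFAST | PAD_SCHMITT);
    reg |= (uint32_t)drive << PAD_DRIVE_LSB;
    if (slew_fast) reg |= PAD_SLEWFAST;
    if (schmitt) reg |= PAD_SCHMITT;
    return reg;
}
```

On hardware, the input would be the current value read from QSPI_SCLK or a QSPI_SDn register, and the result would be written back to the same register.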
Benchmarking
We can benchmark XIP by measuring the latency of uncached reads using the SysTick timer 1.
#include <stddef.h>
#include <stdint.h>

uint32_t bench_uncached_read(void) {
  volatile uint32_t* SYST_CSR = (volatile uint32_t*)0xe000e010;
  volatile uint32_t* SYST_RVR = (volatile uint32_t*)0xe000e014;
  volatile uint32_t* SYST_CVR = (volatile uint32_t*)0xe000e018;
  const uint32_t CSR_CLKSOURCE = 1 << 2; // count at the processor clock
  const uint32_t CSR_ENABLE = 1;
  volatile uint32_t* XIP_NOCACHE_ADDR = (volatile uint32_t*)0x13000000;
  const uint32_t RVR = 0xffffff; // 24-bit reload value
  *SYST_RVR = RVR;
  *SYST_CSR |= CSR_ENABLE | CSR_CLKSOURCE;
  uint32_t n_cycles = 0;
  #pragma GCC unroll 0
  for (size_t i = 0; i != 2; ++i) {
    *SYST_CVR = 1; // any write clears the counter, which then reloads from RVR
    const uint32_t _ = *XIP_NOCACHE_ADDR; // uncached flash read being timed
    (void)_;
    n_cycles = RVR - (*SYST_CVR);
  }
  return n_cycles;
}
We configure the SysTick timer to start at RVR and operate at the sys_clk frequency.
The first iteration loads instructions into the cache to remove fetch latency in the second. We also need to disable loop unrolling, as it would prevent instructions from being cached.
| Instruction | #cycles |
|---|---|
| 03h | 275 |
| 0Bh | N/A |
| 3Bh | 123 |
| 6Bh | 107 |
| BBh | 75 |
| EBh | 51 |
We see that EBh read is about 5 times faster than 03h read with our configuration.
Note that we might expect different results for different clock sources. These results are for the default on-chip Ring Oscillator (ROSC), which runs at a nominal 6.5MHz.
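To put the table into wall-clock terms, the cycle counts can be converted to time at a given clock frequency. This is only a rough sketch: the helper name is made up, and the 6.5MHz figure is nominal, so the resulting durations are approximate.

```c
#include <stdint.h>

// Converts a cycle count into nanoseconds at clock frequency f_hz,
// rounding down.
uint64_t cycles_to_ns(uint32_t cycles, uint32_t f_hz) {
    return (uint64_t)cycles * 1000000000ull / f_hz;
}
```

At 6.5MHz, the 51 cycles measured for EBh correspond to about 7.8µs per uncached read, and the 275 cycles measured for 03h to about 42.3µs.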
Instead of a conclusion
By examining the bootloader code in depth, I learned about many exciting topics: QSPI, linker scripts, and even some electrical details. It was also a great opportunity to use ARM assembly and memory-mapped I/O in a practical way.
I'm a bit surprised by the benchmark results, which I cannot explain analytically. But maybe it is due to the clock source being quite unstable.
The next step is to use more components of the system without the SDK to continue learning about this architecture. It could also be very instructive to write a program to flash the firmware on the board using the SWD port, rather than relying on any tool.
Again, I'm a little annoyed that the clock source isn't properly configured, as it could be much faster and more precise. I think I'll look into it again soon.