Pico Bootloader

I recently bought a few Raspberry Pi Pico and Pico 2 boards and wanted to understand how to write my own firmware without using any SDK. No SDK means I had to write my own bootloader to make it work, and it was a fascinating journey that I will recount in this post.

Overview

The Pico board is interesting because its microcontroller datasheet 1 is really nice, making it beginner-friendly for those who want to get into the technical details.

The RP2040 microcontroller used in the Pico features two ARM Cortex-M0+ cores and supports execute-in-place from an external flash memory, which is where the firmware is stored.

Memory

There are three main memory components:

Flash memory is located outside of the RP2040 package and requires a specific component, namely the SSI controller, to communicate with it.

SSI controller & Flash device

XIP

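There are two ways to run the firmware:

  1. Copy it from flash into SRAM at boot and execute it from there.
  2. Execute it directly from flash (execute-in-place, XIP).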

The first method is faster, but takes longer to set up because we need to copy the entire firmware at boot. It is also limited by the SRAM size (264kB), which is significantly smaller than the flash size (2MB).

The second method is slower because flash memory is inherently slower than SRAM and is physically farther away. However, XIP can benefit from a dedicated 16kB cache that is as fast as SRAM, and there are methods to reduce SPI instruction overhead, as we will see.

Memory mapping

Memory and registers are mapped into a single 32-bit address space.

| Region name          | Base address |
|----------------------|--------------|
| ROM                  | 0x00000000   |
| XIP                  | 0x10000000   |
| SRAM                 | 0x20000000   |
| Cortex-M0+ Registers | 0xe0000000   |

For example, 0xe000ed00 maps to the CPUID register, which contains the core identifier.
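To make the "base address plus offset" idea concrete, here is a minimal sketch in C. The macro names and the `cpuid_fields` struct are my own, and the field layout follows my reading of the ARMv6-M reference manual, so treat it as an assumption to verify rather than a reference.

```c
#include <stdint.h>

#define PPB_BASE     0xe0000000u  /* Cortex-M0+ registers, from the table above */
#define CPUID_OFFSET 0x0000ed00u  /* CPUID offset within that region */

/* Hypothetical container for the decoded CPUID fields. */
struct cpuid_fields {
    uint32_t implementer;   /* bits 31:24 */
    uint32_t variant;       /* bits 23:20 */
    uint32_t architecture;  /* bits 19:16 */
    uint32_t partno;        /* bits 15:4  */
    uint32_t revision;      /* bits 3:0   */
};

/* Split a raw CPUID value into its fields (assumed ARMv6-M layout). */
static struct cpuid_fields decode_cpuid(uint32_t cpuid)
{
    struct cpuid_fields f;
    f.implementer  = (cpuid >> 24) & 0xffu;
    f.variant      = (cpuid >> 20) & 0xfu;
    f.architecture = (cpuid >> 16) & 0xfu;
    f.partno       = (cpuid >> 4) & 0xfffu;
    f.revision     = cpuid & 0xfu;
    return f;
}

/* On the board, the raw value would be read with:
 *   uint32_t raw = *(volatile uint32_t *)(PPB_BASE + CPUID_OFFSET);
 */
```

The decoding function is pure, so it can be checked on a host machine before touching the hardware.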

XIP provides a high level of abstraction to access flash with memory mapping.

Memory mapping

SSI ↔ SPI

At a lower level, the SSI controller has transmit and receive FIFOs for storing dataframes to be sent and received via SPI.

Transmit & Receive FIFOs

On the RP2040, each of these FIFOs has a capacity of 16 × 32-bit dataframes.

QSPI

Serial Peripheral Interface (SPI) is a synchronous serial communication standard that defines how the SSI controller and flash memory communicate with each other.

Quad SPI (QSPI) is a variant with four data lines, allowing for a larger bandwidth.

SPI devices (left) & QSPI devices (right)

The W25Q16JV flash memory used by the Pico supports six different read instructions 2:

| Name                  | Code |
|-----------------------|------|
| Read Data             | 03h  |
| Fast Read             | 0Bh  |
| Fast Read Dual Output | 3Bh  |
| Fast Read Quad Output | 6Bh  |
| Fast Read Dual I/O    | BBh  |
| Fast Read Quad I/O    | EBh  |

Read Data (03h)

The simplest form of reading proceeds as follows:

  1. Transmission is initiated by driving CS (Chip Select) low.
  2. The 8-bit instruction code is transmitted on IO0.
  3. The 24-bit address is transmitted on IO0.
  4. Data from this address is received on IO1.
  5. Transmission is terminated by driving CS high.
Timing diagram of 03h

Transmission is made one bit per clock cycle through the serial channel.

The address is automatically incremented after each byte, so multiple contiguous bytes can be read with a single instruction until CS is driven high.
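The first 32 clock cycles of a 03h transfer can be sketched as a single word shifted out on IO0. This is only an illustration: the helper names are mine, and it assumes the flash expects the most significant bit first.

```c
#include <stdint.h>

#define CMD_READ_DATA 0x03u  /* Read Data instruction code */

/* Build the 32 bits clocked out on IO0:
 * the 8-bit instruction followed by the 24-bit address. */
static uint32_t read_data_frame(uint32_t addr)
{
    return ((uint32_t)CMD_READ_DATA << 24) | (addr & 0x00ffffffu);
}

/* Bit driven on IO0 at a given clock cycle (0 = first cycle after CS goes low),
 * assuming MSB-first transmission. */
static unsigned frame_bit(uint32_t frame, unsigned cycle)
{
    return (frame >> (31u - cycle)) & 1u;
}
```

After these 32 cycles, the flash starts answering on IO1, one bit per cycle, until CS is driven high.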

Fast Read (0Bh)

The Fast Read instruction can operate at a higher frequency (133MHz instead of 50MHz), but requires waiting eight "dummy" clock cycles after the address before data is sent.

Timing diagram of 0Bh

Data transmission starts after 40 clock cycles instead of 32, but cycles are shorter.

Fast Read Dual Output (3Bh)

Pins can operate in half-duplex mode, allowing the IO0 line to transmit output as well. This results in two output bits being transferred per clock cycle.

Timing diagram of 3Bh

Fast Read Quad Output (6Bh)

QSPI devices can send four bits per clock cycle thanks to the two additional pins.

Timing diagram of 6Bh

Fast Read Quad I/O (EBh)

The address can also be transmitted four bits at a time.

Timing diagram of EBh

The continuation code A0h 1 is sent after the address to allow omitting the instruction code on the next read instruction — after CS is driven high and then low — which saves clock cycles and is particularly suitable for XIP.

Timing diagram of EBh (continuous mode)

This instruction is considerably faster than others, as subsequent reads will only take 20 clock cycles to transmit a 32-bit word, thanks to the continuous mode.
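The cycle counts can be checked with a little arithmetic. The sketch below models each read instruction with a hypothetical `read_timing` struct (my own naming) and counts the clock cycles needed to fetch one 32-bit word. The eight dummy cycles for 3Bh are an assumption taken from my reading of the flash datasheet, since they are not stated above.

```c
#include <stdint.h>

/* Hypothetical description of one read instruction, for illustration only. */
struct read_timing {
    unsigned inst_bits;   /* instruction bits, always sent on one line */
    unsigned addr_lines;  /* lines used for the address (and mode) phase */
    unsigned mode_bits;   /* continuation bits sent right after the address */
    unsigned dummy_clks;  /* dummy clock cycles before data */
    unsigned data_lines;  /* lines used for the data phase */
};

/* Clock cycles needed to read one 32-bit word at a 24-bit address. */
static unsigned cycles_per_word(struct read_timing t)
{
    return t.inst_bits
         + (24u + t.mode_bits) / t.addr_lines
         + t.dummy_clks
         + 32u / t.data_lines;
}

static const struct read_timing READ_03H      = { 8, 1, 0, 0, 1 };
static const struct read_timing READ_0BH      = { 8, 1, 0, 8, 1 };
static const struct read_timing READ_3BH      = { 8, 1, 0, 8, 2 };  /* dummy count assumed */
static const struct read_timing READ_6BH      = { 8, 1, 0, 8, 4 };
static const struct read_timing READ_EBH      = { 8, 4, 8, 4, 4 };
static const struct read_timing READ_EBH_CONT = { 0, 4, 8, 4, 4 };  /* instruction omitted */
```

With these parameters, the continuous EBh case works out to 6 address cycles, 2 mode cycles, 4 dummy cycles and 8 data cycles, which is the 20 cycles quoted above.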

Bootloader

The bootloader is the program that runs when powering on the Pico.

It is made of two parts:

  1. The bootrom: stored in the ROM, so cannot be modified.
  2. The second stage bootloader: loaded from flash memory and programmable.

The role of the second stage bootloader is either to load the firmware into SRAM or to configure the SSI controller to enable XIP. The optimal SSI configuration depends on the specific flash device and how it is connected to the RP2040 on the PCB. For W25Q16JV, the best read performance is achieved using the EBh instruction.

Different bootloaders are available on the pico-sdk repository 3:

In the next sections, we will discuss how the generic bootloader works and how to configure SSI so that XIP performs as well as possible.

Generic bootloader

Overview:

  1. Configure SSI to use the widely-available 03h instruction
  2. Set up vector table
  3. Run firmware

Disable SSI

SSI configuration requires disabling the SSI controller by setting the SSIENR register to 0.

.set XIP_SSI_BASE, 0x18000000
.set SSI_SSIENR, 0x08

ldr r3, =XIP_SSI_BASE

// Disable SSI
movs r1, #0
str r1, [r3, #SSI_SSIENR]

Equivalent C:

*((volatile uint32_t*)0x18000008) = 0;

At the end, SSI must be re-enabled by setting SSIENR = 1.

Baud rate

The clock used for SPI is disabled by default because the BAUDR register is set to 0.

This register contains the clock divider SCKDV, which is used to set the serial clock 1:

f_{\text{sclk}} = \frac{f_{\text{sys}}}{\text{SCKDV}} \le f_{\text{max}}

Where:

The SDK defines SCKDV = 4. Assuming f_{\text{sys}} = 125MHz, this gives f_{\text{sclk}} = 31.25MHz \le 50MHz.
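As a quick check of the formula, here is a small helper that picks the lowest usable divider for a given system clock and flash limit. It is a sketch with a function name of my own, and it assumes that SCKDV must be an even value of at least 2, which is how I read the datasheet.

```c
#include <stdint.h>

/* Smallest divider such that f_sys / SCKDV <= f_max.
 * Assumption: the SSI only accepts even dividers, with a minimum of 2. */
static uint32_t sckdv_for(uint32_t f_sys_hz, uint32_t f_max_hz)
{
    uint32_t div = (f_sys_hz + f_max_hz - 1u) / f_max_hz;  /* ceil(f_sys / f_max) */
    if (div < 2u)
        div = 2u;
    return (div + 1u) & ~1u;  /* round up to the next even value */
}
```

For a 125MHz system clock, `sckdv_for(125000000, 50000000)` gives 4, which matches the SDK value for the 03h instruction.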

SSI configuration

Here is an overview of the generic configuration:

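Brackets mark the selected option.

| Name        | Value                | Description                  |
|-------------|----------------------|------------------------------|
| SPI_FRF     | [STD] DUAL QUAD      | SPI frame format             |
| DFS_32      | 31                   | Dataframe size = n+1         |
| TMOD        | TX_RX TX RX [EEPROM] | Transfer mode                |
| NDF         | 0                    | Number of dataframes = n+1   |
| XIP_CMD     | 0x03                 | SPI command to use with XIP  |
| WAIT_CYCLES | 0                    | Number of dummy cycles       |
| INST_L      | NONE 4b [8b] 16b     | Instruction length           |
| ADDR_L      | 6                    | Address length = 4n          |
| TRANS_TYPE  | [1C1A] 1C2A 2C2A     | Instruction & address format |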

We want to read a single 32-bit dataframe containing the instruction to execute.

In the EEPROM transfer mode, the SSI controller sends an address and receives data in response. It may seem confusing at first that there are other transfer modes, but keep in mind that XIP is just one example of SSI usage. For example, we could use SRAM as external memory for another microcontroller and transmit data from SSI.

We will discuss the TRANS_TYPE value later.
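To show how these fields end up in registers, here is a hedged sketch that packs them into the CTRLR0 and SPI_CTRLR0 values. The bit positions and enum encodings are written from my reading of the RP2040 datasheet and should be double-checked against it. The function names are my own. NDF lives in a separate register (CTRLR1) and is left out.

```c
#include <stdint.h>

/* Field positions (assumed, verify against the datasheet). */
#define CTRLR0_SPI_FRF_LSB          21
#define CTRLR0_DFS_32_LSB           16
#define CTRLR0_TMOD_LSB              8
#define SPI_CTRLR0_XIP_CMD_LSB      24
#define SPI_CTRLR0_WAIT_CYCLES_LSB  11
#define SPI_CTRLR0_INST_L_LSB        8
#define SPI_CTRLR0_ADDR_L_LSB        2
#define SPI_CTRLR0_TRANS_TYPE_LSB    0

/* Field encodings (assumed, same order as in the table above). */
enum { FRF_STD = 0, FRF_DUAL = 1, FRF_QUAD = 2 };
enum { TMOD_TX_RX = 0, TMOD_TX = 1, TMOD_RX = 2, TMOD_EEPROM = 3 };
enum { INST_L_NONE = 0, INST_L_4B = 1, INST_L_8B = 2, INST_L_16B = 3 };
enum { TRANS_1C1A = 0, TRANS_1C2A = 1, TRANS_2C2A = 2 };

/* Frame format, dataframe size (n+1 bits) and transfer mode. */
static uint32_t ctrlr0_value(uint32_t frf, uint32_t dfs_32, uint32_t tmod)
{
    return (frf << CTRLR0_SPI_FRF_LSB)
         | (dfs_32 << CTRLR0_DFS_32_LSB)
         | (tmod << CTRLR0_TMOD_LSB);
}

/* XIP command, dummy cycles, instruction/address lengths and format. */
static uint32_t spi_ctrlr0_value(uint32_t xip_cmd, uint32_t wait_cycles,
                                 uint32_t inst_l, uint32_t addr_l,
                                 uint32_t trans_type)
{
    return (xip_cmd << SPI_CTRLR0_XIP_CMD_LSB)
         | (wait_cycles << SPI_CTRLR0_WAIT_CYCLES_LSB)
         | (inst_l << SPI_CTRLR0_INST_L_LSB)
         | (addr_l << SPI_CTRLR0_ADDR_L_LSB)
         | (trans_type << SPI_CTRLR0_TRANS_TYPE_LSB);
}
```

For the generic configuration, the calls would be `ctrlr0_value(FRF_STD, 31, TMOD_EEPROM)` and `spi_ctrlr0_value(0x03, 0, INST_L_8B, 6, TRANS_1C1A)`.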

Control registers

ROM helper

The bootloader in the SDK works well, but it is worth noting that the SSI configuration may be reduced to a single call to the _flash_enter_cmd_xip ROM helper function.

However, the address of the function is not fixed. We need to locate it in the function lookup table by passing the function code 'C' 'X' to the built-in lookup function 1.

.set ROM_BASE, 0x00000000
.set func_table, 0x14
.set table_lookup, 0x18
.set flash_enter_cmd_xip, ('X' << 8) + 'C'

movs r2, #ROM_BASE

ldrh r0, [r2, #func_table]
ldr r1, =flash_enter_cmd_xip

// r0 = table_lookup(r0, r1)
ldrh r2, [r2, #table_lookup]
blx r2

// flash_enter_cmd_xip()
blx r0

The parameters and return value of table_lookup can be deduced from sources 4:

Note that this is primarily for learning purposes: the resulting binary is only 8 bytes smaller, at the cost of extra overhead from the lookup.
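To get a feel for what the lookup does, here is a host-side model in C. It is not the bootrom code: it assumes the table is a list of 16-bit (code, address) pairs terminated by a zero code, and both function names are hypothetical.

```c
#include <stdint.h>

/* Two-character function code, built like ('X' << 8) + 'C' in the assembly. */
static uint16_t rom_code(char c1, char c2)
{
    return (uint16_t)(((unsigned)(unsigned char)c2 << 8) | (unsigned char)c1);
}

/* Model of the lookup: walk the (code, address) halfword pairs until a
 * zero code is found. Returns the 16-bit address, or 0 if not found. */
static uint16_t table_lookup_model(const uint16_t *table, uint16_t code)
{
    for (; table[0] != 0; table += 2) {
        if (table[0] == code)
            return table[1];
    }
    return 0;
}
```

On the board, the table pointer and the lookup function are the two halfwords read from offsets 0x14 and 0x18 of the ROM, as in the assembly version.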

Vector table

The vector table 5 is used by exceptions and contains:

  1. The reset value of the Main Stack Pointer msp
  2. Pointers to exception handlers
Vector table

The bootloader first writes the vector table address (the address just after the bootloader) into the VTOR register, then sets msp and jumps to the Reset handler to start the firmware.

.set XIP_BASE, 0x10000000
.set PPB_BASE, 0xe0000000
.set VTOR, 0xed08

vector_table:
    ldr r0, =(XIP_BASE + 0x100)
    ldr r1, =(PPB_BASE + VTOR)
    str r0, [r1]
    ldmia r0, {r0, r1}
    msr msp, r0
    bx r1
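
Equivalent C-like pseudocode:

uint32_t* vec_table = (uint32_t*)0x10000100;
volatile uint32_t* vtor = (volatile uint32_t*)0xe000ed08;

*vtor = (uint32_t)vec_table;  // relocate the vector table
msp = vec_table[0];           // pseudocode for: msr msp, r0
goto *vec_table[1];           // pseudocode for: bx r1 (jump to the Reset handler)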

Firmware

We can test our bootloader with a firmware that blinks an LED if booted successfully 6.

You can find a tiny example written in Rust here. Please note that some components may not function properly because the system is not fully initialized.

Memory layout

The bootrom expects the second stage bootloader to:

Our bootloader assumes the vector table is at 0x10000100 and contains:

Memory Layout

Linking

A linker script 7 allows setting up the memory layout of a program.

In our case, we want the bootloader to be placed at the origin of flash memory, immediately followed by the vector table and then the firmware code.

The contents of the .boot2 and .vector_table sections are explicitly defined in our code.

MEMORY {
  BOOT2 (rx)  : ORIGIN = 0x10000000, LENGTH = 0x100
  FLASH (rx)  : ORIGIN = 0x10000100, LENGTH = 2048K - 0x100
  SRAM  (rwx) : ORIGIN = 0x20000000, LENGTH = 256K
}

SECTIONS {
  .boot2 ORIGIN(BOOT2) :
  {
    KEEP(*(.boot2));
  } > BOOT2

  .vector_table ORIGIN(FLASH) :
  {
    LONG(ORIGIN(SRAM) + LENGTH(SRAM));
    KEEP(*(.vector_table.*));
  } > FLASH

  .text :
  {
    LONG(SIZEOF(.vector_table));
    *(.text .text*);
  } > FLASH
}

Note that this example is minimal and we might want to add other sections like .data or .bss.

Flash

To flash the firmware on the board, it is more convenient to use the SWD port rather than copying the .uf2 file to the USB Mass Storage Device.

If you have a debug probe, I recommend using a tool like probe-rs 8 or openocd 9.

cargo flash --chip rp2040 --bin blinky

XIP optimization

Fast Read

| Name        | Value |
|-------------|-------|
| XIP_CMD     | 0x0b  |
| WAIT_CYCLES | 8     |
| SCKDV       | 2     |

The SCKDV divider can be lowered because the 0Bh instruction now supports up to 133MHz.

f_{\text{sclk}} = \frac{f_{\text{sys}}}{\text{SCKDV}} = \frac{125\,\text{MHz}}{2} = 62.5\,\text{MHz} \le 133\,\text{MHz}

Note that 2 is the lowest divider allowed by the SSI controller 1.

However, this configuration doesn't work.

We can inspect XIP memory and compare with expected results using probe-rs 8:

$ probe-rs read --chip rp2040 b32 0x10000000 8
0cb50000 99210000 59210200 19490a00
59210000 f4490900 01509900 01609900

$ arm-none-eabi-objdump -sj .boot2 bin/firmware

bin/firmware:     file format elf32-littlearm

Contents of section .boot2:
 10000000 00b50c4b 00219960 02215961 0a491960  ...K.!.`.!Ya.I.`
 10000010 00215960 0949f422 99500121 996001bc  .!Y`.I.".P.!.`..
 [...]

We are close to the expected results, except that we lost the first byte in each 32-bit word. This is precisely what we were trying to prevent by inserting eight cycles before the data.

In fact, WAIT_CYCLES appears to be ignored in standard SPI mode, so 0Bh may not be supported.

Quad Output

Quad output requires setting the QE (Quad Enable) bit in Status Register-2 to 1.

However, this is the default for the variant on the Pico. We can confirm this by identifying the part number using its top-side marking and checking the datasheet 2.

| Name        | Value           |
|-------------|-----------------|
| SPI_FRF     | STD DUAL [QUAD] |
| XIP_CMD     | 0x6b            |
| WAIT_CYCLES | 8               |

We should also adjust RX_SAMPLE_DLY to prevent early sampling caused by the round-trip delay.

Early sampling (dly=0) vs Delayed sampling (dly=4)

In this (hypothetical) example, data arrives four cycles later, so we need to set RX_SAMPLE_DLY = 4.

Quad I/O

Quad I/O requires sending the continuation code 0xa0 after the address. Unfortunately, this cannot be done using control registers only, so we need to do it manually.

First, we manually write 0xeb, a dummy address, and mode 0xa0 as raw data in the TX FIFO.

movs r1, #XIP_CMD_V
str r1, [r3, #DR0]
movs r1, #0x000000a0  // Address|Mode
str r1, [r3, #DR0]

We need to wait until both dataframes are sent.

1:
    // Wait for TX FIFO empty
    ldr r1, [r3, #SR]
    movs r0, #SR_TFE
    tst r1, r0
    beq 1b
    // Wait for SSI inactive
    movs r0, #SR_BUSY
    tst r1, r0
    bne 1b
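
The same wait condition can be sketched in C. The bit positions of TFE and BUSY are assumptions taken from my reading of the datasheet, and the function names are mine. The status register is passed as a pointer so the logic can be exercised on a host with a fake register.

```c
#include <stdbool.h>
#include <stdint.h>

#define SR_BUSY (1u << 0)  /* SSI busy flag (assumed bit position) */
#define SR_TFE  (1u << 2)  /* transmit FIFO empty flag (assumed bit position) */

/* True once the TX FIFO is empty and the SSI is no longer busy. */
static bool ssi_transfer_done(uint32_t sr)
{
    return (sr & SR_TFE) != 0 && (sr & SR_BUSY) == 0;
}

/* Spin until the transfer is done. On hardware, `sr` would point at the
 * SSI status register. */
static void ssi_wait_done(const volatile uint32_t *sr)
{
    while (!ssi_transfer_done(*sr)) {
        /* busy-wait */
    }
}
```

Unlike the assembly, this version re-reads the register for both flags on every iteration, which costs nothing here and keeps the condition in one place.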

Then, we configure XIP_CMD to 0xa0 and INST_L to 0. This special setting instructs the SSI controller to place this value after the address, as specified in the datasheet 1.

| Name        | Value              |
|-------------|--------------------|
| SPI_FRF     | STD DUAL [QUAD]    |
| XIP_CMD     | 0xa0               |
| WAIT_CYCLES | 4                  |
| INST_L      | [NONE] 4b 8b 16b   |
| ADDR_L      | 8                  |
| TRANS_TYPE  | 1C1A [1C2A] 2C2A   |

The ADDR_L field is set to 8 because it must cover the mode bits too: 24 address bits plus 8 mode bits make 32 bits, i.e. 8 × 4.

The TRANS_TYPE field is set to 1C2A, which means the address format follows SPI_FRF.
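The relationship between the address, the mode bits and ADDR_L can be sketched as follows. The helper names are hypothetical, and the word layout simply mirrors the "Address|Mode" value written to the TX FIFO earlier.

```c
#include <stdint.h>

#define MODE_CONTINUOUS 0xa0u  /* continuation code sent after the address */

/* 32-bit word sent after the instruction: 24-bit address, then 8 mode bits. */
static uint32_t addr_mode_word(uint32_t addr)
{
    return ((addr & 0x00ffffffu) << 8) | MODE_CONTINUOUS;
}

/* ADDR_L counts groups of 4 bits and must cover address and mode bits. */
static uint32_t addr_l_field(unsigned addr_bits, unsigned mode_bits)
{
    return (addr_bits + mode_bits) / 4u;
}
```

With a 24-bit address and no mode bits this gives 6, the value used for 03h. Adding the 8 mode bits gives 8.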

Electrical considerations

Disclaimer: My knowledge in electronics is rather limited. I tried to be as exact as possible, but this section may lack precision.

Drive strength

Drive strength defines how much current a GPIO pin can support at the expected voltage.

Low & High drive strength IV curves

The QSPI clock drive strength can be configured through the QSPI_SCLK register.

QSPI_x registers

According to the W25Q16JV specification, the typical drive for high frequencies is 8mA 2.

Clock slew

Slew rate measures how quickly a signal voltage changes between logic levels.

Ideal clock vs Real clock

A slow slew rate reduces electromagnetic interference (EMI) and undesired peak frequencies 10, but it negatively impacts the duty cycle and may cause timing violations.

Expected timing vs Timing violation due to low duty cycle

If timing requirements are not met, which may happen at high frequencies, the slew rate can be increased by setting the SLEWFAST bit in the QSPI_SCLK control register.

| Name     | Value               | Description       |
|----------|---------------------|-------------------|
| DRIVE    | 2mA 4mA [8mA] 12mA  | Drive strength    |
| SLEWFAST | Slow [Fast]         | Slew rate control |

Schmitt trigger

A Schmitt trigger converts a noisy analog signal into a clean digital output by introducing hysteresis. Hysteresis is when thresholds T for rising and falling transitions are distinct.

Transfer function of a Schmitt trigger

At an intermediate voltage, the output keeps its current state until a threshold is reached.

A Schmitt trigger enhances signal stability by filtering out small fluctuations. However, it draws current and introduces latency, which can cause timing violations at high frequencies.

If timing requirements are not met, Schmitt triggers on the QSPI data pins can be disabled by clearing the SCHMITT bit in the QSPI_SDn registers.

| Name    | Value              | Description            |
|---------|--------------------|------------------------|
| SCHMITT | [Disabled] Enabled | Enable Schmitt trigger |
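
The pad settings from this section boil down to a few bit operations on the pad control registers. Here is a sketch under the assumption that SLEWFAST is bit 0, SCHMITT is bit 1 and DRIVE occupies bits 5:4, which is how I read the datasheet. The helper names are my own.

```c
#include <stdint.h>

#define PAD_SLEWFAST   (1u << 0)               /* assumed bit position */
#define PAD_SCHMITT    (1u << 1)               /* assumed bit position */
#define PAD_DRIVE_LSB  4                       /* assumed field position */
#define PAD_DRIVE_MASK (3u << PAD_DRIVE_LSB)

/* Assumed DRIVE encodings, in the order listed in the table above. */
enum pad_drive { DRIVE_2MA = 0, DRIVE_4MA = 1, DRIVE_8MA = 2, DRIVE_12MA = 3 };

/* New QSPI_SCLK value: chosen drive strength and fast slew rate,
 * other bits left unchanged. */
static uint32_t sclk_pad_value(uint32_t current, enum pad_drive drive)
{
    uint32_t v = current & ~PAD_DRIVE_MASK;
    v |= (uint32_t)drive << PAD_DRIVE_LSB;
    return v | PAD_SLEWFAST;
}

/* New QSPI_SDn value: Schmitt trigger disabled, other bits unchanged. */
static uint32_t sd_pad_value(uint32_t current)
{
    return current & ~PAD_SCHMITT;
}
```

Both helpers take the current register value and return the new one, so the read-modify-write on the actual registers stays explicit at the call site.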

Benchmarking

We can benchmark XIP by measuring the latency of uncached reads using the SysTick timer 1.

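#include <stddef.h>
#include <stdint.h>

// SysTick registers
volatile uint32_t* const SYST_CSR = (volatile uint32_t*)0xe000e010;
volatile uint32_t* const SYST_RVR = (volatile uint32_t*)0xe000e014;
volatile uint32_t* const SYST_CVR = (volatile uint32_t*)0xe000e018;

const uint32_t CSR_CLKSOURCE = 1 << 2;  // count at the processor clock
const uint32_t CSR_ENABLE = 1;

// XIP alias that bypasses the cache
volatile uint32_t* const XIP_NOCACHE_ADDR = (volatile uint32_t*)0x13000000;
const uint32_t RVR = 0xffffff;  // 24-bit reload value

uint32_t bench_uncached_read(void) {
  *SYST_RVR = RVR;
  *SYST_CSR |= CSR_ENABLE | CSR_CLKSOURCE;

  uint32_t n_cycles = 0;

  #pragma GCC unroll 0
  for (size_t i = 0; i != 2; ++i) {
    *SYST_CVR = 1;                 // any write clears the counter, which then reloads from RVR
    (void)*XIP_NOCACHE_ADDR;       // uncached 32-bit read from flash
    n_cycles = RVR - (*SYST_CVR);  // elapsed cycles, measurement overhead included
  }
  return n_cycles;
}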

We configure the SysTick timer to start at RVR and operate at the sys_clk frequency.

The first iteration loads instructions into the cache to remove fetch latency in the second. We also need to disable loop unrolling, as it would prevent instructions from being cached.

| Instruction | #cycles |
|-------------|---------|
| 03h         | 275     |
| 0Bh         | N/A     |
| 3Bh         | 123     |
| 6Bh         | 107     |
| BBh         | 75      |
| EBh         | 51      |

We see that EBh read is about 5 times faster than 03h read with our configuration.

Note that we might expect different results for different clock sources. These results are for the default on-chip Ring Oscillator (ROSC), which runs at a nominal 6.5MHz.
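
To put these cycle counts in perspective, they can be converted into a rough uncached read bandwidth. This is only a back-of-the-envelope sketch: the function name is mine, it assumes back-to-back 32-bit reads, and the measured counts include the overhead of the measurement itself.

```c
#include <stdint.h>

/* Approximate bytes per second for back-to-back uncached 32-bit reads,
 * given the system clock and the measured cycles per read. */
static uint32_t xip_bytes_per_second(uint32_t f_sys_hz, uint32_t cycles_per_read)
{
    return (uint32_t)(((uint64_t)f_sys_hz * 4u) / cycles_per_read);
}
```

For example, `xip_bytes_per_second(6500000, 275)` estimates the 03h case at the nominal ROSC frequency, and replacing 275 with 51 gives the EBh case.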

Instead of conclusion

By examining the bootloader code in depth, I learned many exciting aspects of it: QSPI, linker scripts, and even some electrical details. It was also a great opportunity to use ARM assembly and memory-mapped I/Os in a practical way.

I'm a bit surprised by the benchmark results, which I cannot explain analytically. But maybe it is due to the clock source being quite unstable.

The next step is to use more components of the system without the SDK to continue learning about this architecture. It could also be very instructive to write a program to flash the firmware on the board using the SWD port, rather than relying on any tool.

Again, I'm a little annoyed that the clock source isn't properly configured, as it could be much faster and more precise. I think I'll look into it again soon.