$ whoami
Levi Puckett
firmware engineer
Most of what I've built is covered by NDAs. I can't show you the code, the products, or the names of the clients — and screenshots of firmware aren't exactly riveting anyway.
What I can do is walk you through a couple of the interesting problems I've faced, and how I solved them.
edge inference
A disposable device — used once, then discarded. This constrains the design: the smaller the MCU, the lower the bill-of-materials cost, so the memory budget was as tight as it could be while still doing the job. We had 128 kB of RAM and five sensor channels sampling at 250 Hz, which we needed to process in near-real-time. The inference pipeline included a Fourier transform over a 10-second window and a neural network that output a confidence score.
reference → implementation
The algorithm existed first as a Python reference model. The task was to produce a C implementation that ran on the device with exactly the same outputs.
I used a unit test framework to validate the C implementation incrementally — each stage of the pipeline tested in isolation against the Python reference. Once the tests passed, I built a hardware-in-the-loop rig: pipe canned sensor data into the physical device over serial, collect the output, and assert it matches what the Python model produces for the same input.
┌──────────────┐
│  /\      /\  │
│ /  \__/  \_  │
│              │
└──────┬───────┘
       │
     ──┴──
  ┌─────────┐
──┤         ├──
──┤   MCU   ├──
──┤         ├──
  └─────────┘
fitting in memory
The first major hurdle was that the large FFT window didn't fit in RAM. The solution was to split it into several smaller FFTs and average the results — a reasonable approximation which reduced memory usage significantly.
We were still running into space and time challenges, so I went further and replaced the FFT with the Goertzel algorithm — a specialized DFT that computes only the specific bins you care about. The unit tests confirmed it matched the full DFT on those bins, while cutting both the RAM and the CPU cycles the long transform had been costing.
reclaiming cpu
With the algorithm fitting in memory, the remaining target was CPU headroom. I migrated all the peripherals to DMA to eliminate software polling loops, and squeezed out further gains through compiler options and by trimming idle work. This freed roughly 20% of CPU cycles, which could then go toward running the inference pipeline closer to real time.
FFT · DMA · unit testing · hardware-in-the-loop
industry secrets
A battery-powered peripheral meant to be taken home and used for several weeks at a time. The device ran a proprietary algorithm — one the client considered too sensitive to share with anyone, including us. We received a compiled binary blob and nothing else.
The challenge: integrate this algorithm into firmware we'd own and maintain, without any visibility into what it was doing.
co-located binaries
After considering several alternatives (the MPU, or a second MCU), the solution we landed on was co-located binaries — two independent processes sharing a single MCU. A “frontend” (our code) and a “backend” (the blob), each given its own memory regions, peripherals, and interrupt vectors, with a strictly defined interface between them.
the interface contract
We defined a limited set of API calls the frontend could make into the backend, each with an explicit timing allowance — the frontend could not block indefinitely on a response from an opaque binary.
Watchdog timers enforced those contracts at runtime: if the CPU spent too long inside the backend blob, the watchdog would fire. Peripherals and interrupt vectors were partitioned between the two sides so neither could interfere with the other's hardware.
Making all of this work meant getting very comfortable with the linking process. I wrote a custom linker script to lay out both binaries in memory — assigning address ranges, aligning sections, and keeping the frontend and backend regions cleanly separated.
    flash            sram
┌───────────┐    ┌───────────┐
│           │    │           │
│  frontend │    │  frontend │
│           │    │           │
├───────────┤    ├───────────┤
│           │    │           │
│  backend  │    │  backend  │
│   blob    │    │           │
│           │    │           │
└───────────┘    └───────────┘
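The layout above maps directly onto a linker script. An illustrative GNU ld fragment — the addresses, region sizes, and section names are invented, not the real part's memory map:

```
MEMORY
{
  FLASH_FRONT (rx)  : ORIGIN = 0x08000000, LENGTH = 128K
  FLASH_BACK  (rx)  : ORIGIN = 0x08020000, LENGTH = 128K
  RAM_FRONT   (rwx) : ORIGIN = 0x20000000, LENGTH = 32K
  RAM_BACK    (rwx) : ORIGIN = 0x20008000, LENGTH = 32K
}

SECTIONS
{
  /* Frontend code and data land in the frontend regions. */
  .text_front : { *(.text*) } > FLASH_FRONT
  .bss_front  : { *(.bss*)  } > RAM_FRONT

  /* The backend blob is placed, opaque and unmodified,
     into its own flash region. */
  .backend : { KEEP(*(.backend_blob)) } > FLASH_BACK
}
```

Fixed, non-overlapping regions are what make the rest possible: each side can be built, flashed, and reasoned about without the other.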
dev infrastructure
Two independent binaries meant the tooling had to keep up. I built automation to flash either side independently, which allowed quick iteration when only the frontend was changing.
Before the real backend arrived, I built test stubs: stub implementations of the backend interface with known, predictable behaviour. These let us develop and validate the frontend in isolation, and gave us confidence in the API contract itself before the opaque binary was ever in the picture.
custom linker scripts · watchdog timers · peripheral partitioning · test stubs
always learning
My hobby work is available on GitHub. When I spend time building something, it's usually to get to know a technology better.