Tracking Down a Rogue EINVAL Error in Firmware

Dion Dokter writing on the Tweede golf blog about what must have been an incredibly frustrating bug:

We had not gotten further in a long time. At this point, we had already spent around 20 days on this issue collectively and we had no clear direction to go in. Maybe we could take a more brute-force approach, but that was infeasible due to the modem jail.

Why was the RPC call made with length 0 and a null pointer?

We tried using watchpoints at the start of the adventure, but not all hardware supports it. So when we didn’t get it to work, we assumed this microcontroller also had no support for it. But simply updating the debugging stack to the newest versions did the trick. So if you’re ever debugging something and the experience is kind of bad, then make sure to download the newest versions! Even when that means going outside of your OS package manager.

Once Wouter got things going, the whole debugging experience became much nicer.

Things went fast from here. He was able to find where exactly the rpc length was written. He noticed that the length was lowered from the 44 input to the function to 0. Why?

It’s a long read but they did track down the source of the bug in the end.

Posts like this can be good to keep in the back of your mind whenever you’re tackling a gnarly bug. Perhaps there’s an insight here that would help shortcut the process if you ever find yourself in a similar position.

Branchless UTF-8 Decoding

Charles Eckman:

In a Recurse chat, Nathan Goldbaum asked:

I know how to decode UTF-8 using bitmath and some LUTs (see https://github.com/skeeto/branchless-utf8), but if I want to go from a codepoint to UTF-8, is there a way to do it without branches?

To start with, is there a way to write this C function, which returns the number of bytes needed to store the UTF-8 bytes for the codepoint without any branches? Or would I need a huge look-up-table?

A neat problem to explore.

No if statements, loops, or other conditionals. So, branchless, right?

…well, no. If we peek at the (optimized) code in Compiler Explorer, we can see the x86_64 assembly has two different kinds of branches.

I tinkered with this initial implementation a little. If you compile it with target-cpu=x86-64-v3 (circa 2013 CPUs), the test and je from the leading_zeros call are eliminated, leaving just the bounds check ja.
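
To make the problem concrete, here is a minimal sketch of two ways to compute the encoded length with no conditionals in the source. It is not the code from Charles's post: the function names and the table are my own, and it assumes the input is already a valid codepoint (no surrogate or range checks). The first version indexes a table by the leading zero count; the table has 33 entries so that every possible result of leading_zeros on a u32 is in range. The second sums comparison results. Whether either one compiles without branches depends on the compiler version and target, so it is worth checking the output in Compiler Explorer rather than taking it on trust.

```rust
/// UTF-8 length indexed by `leading_zeros()` of the codepoint.
/// 33 entries cover every result a u32 can produce (0..=32).
/// A 0 entry marks values needing more than 21 bits, which UTF-8 cannot encode.
const LEN_BY_LEADING_ZEROS: [u8; 33] = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0..=10 leading zeros: more than 21 bits
    4, 4, 4, 4, 4,                   // 11..=15: 17 to 21 significant bits
    3, 3, 3, 3, 3,                   // 16..=20: 12 to 16 significant bits
    2, 2, 2, 2,                      // 21..=24: 8 to 11 significant bits
    1, 1, 1, 1, 1, 1, 1, 1,          // 25..=32: 0 to 7 significant bits
];

/// Table lookup version. Assumes `cp` is a valid Unicode scalar value.
pub fn utf8_len_table(cp: u32) -> usize {
    LEN_BY_LEADING_ZEROS[cp.leading_zeros() as usize] as usize
}

/// Comparison-sum version: each boundary crossed adds one byte.
/// Assumes `cp` is a valid Unicode scalar value.
pub fn utf8_len_compare(cp: u32) -> usize {
    1 + (cp > 0x7F) as usize + (cp > 0x7FF) as usize + (cp > 0xFFFF) as usize
}
```

For valid codepoints both functions should agree with the standard library's char::len_utf8, which makes for an easy exhaustive check.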

Using the Rust Type System to Prevent Bugs in the Fuchsia Network Stack

Daroc Alden at LWN covering Joshua Liebow-Feeser at RustConf:

Netstack3 encompasses 63 crates and 60 developer-years of code. It contains more code than the top ten crates on crates.io combined. Over the last year, the code has finally become ready to deploy. But deploying it to production requires care — networking code can be hard to test, and the developers have to assume it has bugs. For the past eleven months, they have been running the new networking stack on 60 devices, full time. In that time, Liebow-Feeser said, most code would have been expected to show “mountains of bugs”. Netstack3 had only three; he attributed that low number to the team’s approach of encoding as many important invariants in the type system as possible.

A remarkable result. Great to see signs that Fuchsia is still alive too.
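
If "encoding invariants in the type system" sounds abstract, here is a small sketch of one common form of it, the typestate pattern. This is not Netstack3 code; Socket, Unbound, and Bound are hypothetical names I made up for illustration. The invariant is "you can only ask for the local port of a socket that has been bound", and the compiler enforces it because the method simply does not exist on an unbound socket.

```rust
/// Marker state: the socket has no local port yet.
pub struct Unbound;

/// State carrying the data that only exists once the socket is bound.
pub struct Bound {
    port: u16,
}

/// A hypothetical socket whose state is tracked in its type parameter.
pub struct Socket<S> {
    state: S,
}

impl Socket<Unbound> {
    pub fn new() -> Self {
        Socket { state: Unbound }
    }

    /// Consumes the unbound socket and returns a bound one,
    /// so the old unbound value cannot be used afterwards.
    pub fn bind(self, port: u16) -> Socket<Bound> {
        Socket { state: Bound { port } }
    }
}

impl Socket<Bound> {
    /// Only available after `bind`; calling it on `Socket<Unbound>`
    /// is a compile error rather than a runtime failure.
    pub fn local_port(&self) -> u16 {
        self.state.port
    }
}
```

With this shape, Socket::<Unbound>::new().bind(8080).local_port() compiles, while calling local_port before bind does not, so that whole class of mistake never needs a test.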

Debugging an arm64 Segfault in the PostgreSQL JIT Compiler

Anthonin Bonnefoy, Mitch Ward, and Gillian McGarvey on the Datadog blog:

Ultimately, by successfully isolating and debugging the crashes down to the assembly level, we were able to identify an issue in Postgres JIT compilation, and more specifically a bug in LLVM for Arm64 architectures. This post describes our investigation into the root cause of these crashes, which took us on a deep dive into JIT compilation and led us to an upstream fix resolution.

A crashing database must be a pretty stressful scenario when working at the sort of scale Datadog is operating at. Great detective work getting to the bottom of the issue. Also interesting to learn that they’re running the db on ARM servers.

With JIT disabled, the query ran without triggering a segfault. As an immediate mitigation, we disabled JIT for the entire cluster, which stopped the crashes completely and without any noticeable impact on query latencies. It was time to relax and grab a cup of coffee.

If turning the JIT off had no noticeable impact on query latencies, you have to wonder whether it's worth the trouble of using it.

NixOS Paper Cuts

Jono Finger:

This is my perspective on using Nix (the OS, the package manager, and the language) as a main driver for the past 2 years. I have gone to conferences, engaged the community, donated, submitted bug reports, converted my home servers, and probably spent hundreds of hours in Nix configs. I consider myself well versed, but certainly no expert.

TLDR: In its current state (2025), I don’t generally recommend desktop use of Nix(OS), even for seasoned Linux users.

That’s a pretty strong summary, but I’m not sure I’d take this post as general advice. It’s detailed, and documents Jono’s experience, but with any niche system like NixOS you’re going to run into paper cuts. The specific paper cuts will vary by person based on what they do with their computer, as will their threshold for tolerating them.

To be clear, I love Nix and have learned a lot from it. I am not giving up on it, but it's time for me to take a break and scale back my all-in attitude.

In this case it seems Jono reached their paper cut threshold, which is totally reasonable. Some people will push through because they want the benefits despite the friction; others will drop off earlier. If you're thinking of trying out NixOS I think this post is worth a read, but I wouldn't let it stop you from trying it out.