Debugging an arm64 Segfault in the PostgreSQL JIT Compiler

Anthonin Bonnefoy, Mitch Ward, and Gillian McGarvey on the Datadog blog:

Ultimately, by successfully isolating and debugging the crashes down to the assembly level, we were able to identify an issue in Postgres JIT compilation, and more specifically a bug in LLVM for Arm64 architectures. This post describes our investigation into the root cause of these crashes, which took us on a deep dive into JIT compilation and led us to an upstream fix resolution.

A crashing database must be a pretty stressful scenario at the sort of scale Datadog operates at. Great detective work getting to the bottom of the issue. Also interesting to learn that they’re running the db on ARM servers.

With JIT disabled, the query ran without triggering a segfault. As an immediate mitigation, we disabled JIT for the entire cluster, which stopped the crashes completely and without any noticeable impact on query latencies. It was time to relax and grab a cup of coffee.

If turning the JIT off had no noticeable impact on query latencies, you have to wonder whether it’s worth the trouble of using it.