We Had to Build It Ourselves

Disaggregated inference is a great idea. Separate the routing from the compute. Let the scheduler see the full cluster. Move KV context instead of recomputing it. The architecture is sound.

We thought it would be even greater if it were built simply.

NVIDIA’s Dynamo is the reference implementation — six services, a ZMQ event plane, Kubernetes for orchestration, and Python-based inference backends in the hot path. It wraps vLLM, SGLang, or TRT-LLM and coordinates them. It’s well-engineered, well-documented, and for AI factories with dedicated platform teams, it works.

But every time we looked at it, something felt off.

Not wrong — heavy. Too many services between a request arriving and a GPU doing useful work. Too many processes to deploy, monitor, restart. Too many places where latency hides. We’ve spent two decades operating production infrastructure, compiling custom Linux kernels, building network security, chasing down failures in systems that looked fine on paper. We know what operational weight feels like. And this carried more of it than we were comfortable with.

We wanted the architecture without the weight. Two binaries and nothing in between.

So we started writing

We opened an empty Rust workspace.

The forward pass came first. Then attention. Then the KV cache. Then continuous batching. Then the token sampler. Then the JIT kernel compiler — NVRTC for NVIDIA, hipRTC for AMD. Every piece written from scratch in Rust, calling directly into CUDA and HIP. No Python runtime sitting between us and the hardware. No borrowed engine underneath.
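
To make the "no Python between us and the hardware" point concrete, here is a minimal sketch of what JIT-compiling a kernel through NVRTC from Rust can look like. The FFI declarations mirror the public NVRTC C API (the hipRTC calls are analogous); the function `compile_to_ptx` and its error handling are illustrative, not Origon's actual bindings.

```rust
// Hand-written FFI surface for NVRTC; error handling trimmed for brevity.
use std::ffi::CString;
use std::os::raw::{c_char, c_int};
use std::ptr;

#[repr(C)]
struct NvrtcProgramOpaque { _private: [u8; 0] }
type NvrtcProgram = *mut NvrtcProgramOpaque;
type NvrtcResult = c_int; // 0 == NVRTC_SUCCESS

#[link(name = "nvrtc")]
extern "C" {
    fn nvrtcCreateProgram(prog: *mut NvrtcProgram, src: *const c_char, name: *const c_char,
                          num_headers: c_int, headers: *const *const c_char,
                          include_names: *const *const c_char) -> NvrtcResult;
    fn nvrtcCompileProgram(prog: NvrtcProgram, num_options: c_int,
                           options: *const *const c_char) -> NvrtcResult;
    fn nvrtcGetPTXSize(prog: NvrtcProgram, size: *mut usize) -> NvrtcResult;
    fn nvrtcGetPTX(prog: NvrtcProgram, ptx: *mut c_char) -> NvrtcResult;
    fn nvrtcDestroyProgram(prog: *mut NvrtcProgram) -> NvrtcResult;
}

/// Compile a CUDA C kernel source string to PTX at runtime.
fn compile_to_ptx(kernel_src: &str, arch_flag: &str) -> Result<Vec<u8>, NvrtcResult> {
    let src = CString::new(kernel_src).unwrap();
    let name = CString::new("kernel.cu").unwrap();
    let arch = CString::new(arch_flag).unwrap(); // e.g. "--gpu-architecture=compute_90"
    let opts = [arch.as_ptr()];

    unsafe {
        let mut prog: NvrtcProgram = ptr::null_mut();
        let rc = nvrtcCreateProgram(&mut prog, src.as_ptr(), name.as_ptr(), 0, ptr::null(), ptr::null());
        if rc != 0 { return Err(rc); }

        let rc = nvrtcCompileProgram(prog, opts.len() as c_int, opts.as_ptr());
        if rc != 0 { nvrtcDestroyProgram(&mut prog); return Err(rc); }

        let mut size = 0usize;
        nvrtcGetPTXSize(prog, &mut size);
        let mut ptx = vec![0u8; size];
        nvrtcGetPTX(prog, ptx.as_mut_ptr() as *mut c_char);
        nvrtcDestroyProgram(&mut prog);
        Ok(ptx) // loadable with cuModuleLoadData, launchable via the driver API
    }
}
```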

It’s a strange feeling, writing your own attention kernels when perfectly good ones exist in open source. There were mornings where it felt like stubbornness. But there’s a clarity that comes from owning the full path — from knowing exactly what happens between the moment a request arrives and the moment a token leaves the GPU. That clarity compounds. Every optimization, every debug session, every production incident — you’re never reading someone else’s abstractions. You’re reading your own code.

The routing layer grew out of necessity. Requests need to land on the GPU that already holds the right KV context. That means radix-tree lookup across the cluster, cost-function dispatch, session affinity, backpressure. We put all of it in one binary. All state in-memory. No etcd. No NATS. No Kafka. The routing binary is the control plane. Nothing else needs to coordinate.
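
A toy sketch of the dispatch idea, assuming nothing about the real implementation: a flat scan over cached prefixes stands in for the radix tree, and the worker struct, field names, and cost weights are invented for illustration.

```rust
// Toy prefix-aware dispatch: score each worker by how much of the request's
// token prefix it already holds in KV cache, penalized by its current load.
struct Worker {
    id: usize,
    cached_prefixes: Vec<Vec<u32>>, // token prefixes resident in this worker's KV cache
    queued_tokens: usize,           // backpressure signal: work already queued
}

fn longest_cached_prefix(worker: &Worker, request_tokens: &[u32]) -> usize {
    worker.cached_prefixes.iter()
        .map(|p| p.iter().zip(request_tokens).take_while(|(a, b)| a == b).count())
        .max()
        .unwrap_or(0)
}

/// Pick the worker with the best trade-off between KV reuse and queue depth.
fn dispatch(workers: &[Worker], request_tokens: &[u32]) -> Option<usize> {
    workers.iter()
        .map(|w| {
            let reuse = longest_cached_prefix(w, request_tokens) as f64;
            let recompute = (request_tokens.len() as f64 - reuse).max(0.0);
            // Illustrative cost: tokens we would have to recompute, plus a
            // load penalty so one hot worker does not absorb every request.
            let cost = recompute + 0.1 * w.queued_tokens as f64;
            (w.id, cost)
        })
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(id, _)| id)
}
```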

For a single GPU, the compute binary serves HTTP on its own. One process, one card. For clusters, the two binaries talk over ORPC — our binary RPC protocol over QUIC, designed for exactly this workload. We’ll cover ORPC, our QUIC stack, XDP, and the HTTP framework in a dedicated post.

The full decode pass is captured into a CUDA graph at startup. At runtime, it replays. The CPU steps aside entirely.
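
Here is what the capture-once, replay-many pattern looks like in miniature. The types and functions below are stubs standing in for CUDA graph calls (cuStreamBeginCapture, cuStreamEndCapture plus cuGraphInstantiate, cuGraphLaunch); they are not Origon's types, just the shape of the technique.

```rust
// Stand-in stubs; in a real build these wrap the CUDA driver API.
struct Stream;
struct Graph;
struct DecodeBatch { steps_left: u32 }

impl Stream {
    fn begin_capture(&self) { /* cuStreamBeginCapture */ }
    fn end_capture(&self) -> Graph { /* cuStreamEndCapture + cuGraphInstantiate */ Graph }
    fn synchronize(&self) { /* cuStreamSynchronize */ }
}

impl Graph {
    fn launch(&self, _s: &Stream) { /* cuGraphLaunch: replays every captured kernel */ }
}

fn decode_step(_s: &Stream, _b: &DecodeBatch) {
    // Enqueue every kernel of one decode pass: attention, MLP, sampling, ...
}

fn main() {
    let stream = Stream;
    let mut batch = DecodeBatch { steps_left: 128 };

    // Startup: run one decode pass under capture and freeze it into a graph.
    stream.begin_capture();
    decode_step(&stream, &batch);
    let graph = stream.end_capture();

    // Runtime: no per-kernel launches from the CPU, just graph replays.
    while batch.steps_left > 0 {
        graph.launch(&stream);  // one call replays the whole decode pass
        stream.synchronize();
        batch.steps_left -= 1;  // inputs and KV pointers are updated in device memory
    }
}
```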

The parts nobody talks about

Every inference engine announcement focuses on throughput and model support. Nobody talks about what holds it all together.

We built an HTTP framework. HTTP/1.1, HTTP/2, HTTP/3 — our own transport, our own TLS termination, our own io_uring async runtime. No tokio. Both binaries share the same serving crate. Identical behavior whether you’re running standalone or through the routing layer.
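
To ground the "io_uring, no tokio" point, here is a minimal sketch of a submit-and-complete cycle using the open-source io-uring crate. It is one accepted connection and one queued read, not Origon's framework; a real runtime keeps thousands of operations in flight on one ring and multiplexes completions by user_data.

```rust
// One io_uring read of an incoming request, for flavor only.
use io_uring::{opcode, types, IoUring};
use std::net::TcpListener;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(256)?;
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    let (conn, _) = listener.accept()?; // accept kept blocking here for brevity

    let mut buf = vec![0u8; 4096];
    // Queue an asynchronous read of the request bytes on the ring.
    let read = opcode::Read::new(types::Fd(conn.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
        .build()
        .user_data(1);
    unsafe { ring.submission().push(&read).expect("submission queue full") };

    // One syscall submits queued work and waits for at least one completion.
    ring.submit_and_wait(1)?;
    let cqe = ring.completion().next().expect("completion queue empty");
    let n = cqe.result().max(0) as usize;
    println!("read {n} request bytes: {:?}", String::from_utf8_lossy(&buf[..n]));
    Ok(())
}
```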

We built a hardware abstraction layer. The same model code, the same scheduler, the same KV cache runs on NVIDIA and AMD. One feature flag at compile time selects the backend. Not a fork. Not a separate codebase. We decided this on day one. We’ve never had a reason to revisit it.
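
A hedged sketch of what a compile-time backend switch can look like in Rust. The trait, module names, and method set are illustrative, not the actual crate layout; the point is that one Cargo feature decides what the shared code compiles against.

```rust
// One trait, two cfg-gated backend modules, one type alias chosen by a
// Cargo feature (e.g. `--features cuda` or `--features rocm`).
pub trait GpuBackend {
    fn alloc(&self, bytes: usize) -> *mut u8;
    fn launch_kernel(&self, name: &str, grid: (u32, u32, u32), block: (u32, u32, u32));
    fn synchronize(&self);
}

#[cfg(feature = "cuda")]
mod cuda {
    pub struct Cuda;
    impl super::GpuBackend for Cuda {
        fn alloc(&self, _bytes: usize) -> *mut u8 { /* cuMemAlloc */ std::ptr::null_mut() }
        fn launch_kernel(&self, _n: &str, _g: (u32, u32, u32), _b: (u32, u32, u32)) { /* cuLaunchKernel */ }
        fn synchronize(&self) { /* cuStreamSynchronize */ }
    }
}

#[cfg(feature = "rocm")]
mod rocm {
    pub struct Rocm;
    impl super::GpuBackend for Rocm {
        fn alloc(&self, _bytes: usize) -> *mut u8 { /* hipMalloc */ std::ptr::null_mut() }
        fn launch_kernel(&self, _n: &str, _g: (u32, u32, u32), _b: (u32, u32, u32)) { /* hipModuleLaunchKernel */ }
        fn synchronize(&self) { /* hipStreamSynchronize */ }
    }
}

// Model code, scheduler, and KV cache talk to `Backend`; the feature flag
// decides what that name resolves to at compile time.
#[cfg(feature = "cuda")]
pub type Backend = cuda::Cuda;
#[cfg(feature = "rocm")]
pub type Backend = rocm::Rocm;
```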

None of these show up in a feature comparison. All of them are why the system works the way it does.

What it costs

We own everything. That means we implement everything.

When a new model architecture lands, we don’t pip install. We write the attention variant, the positional encoding, the sampling strategy, the memory layout. A wrapped engine gets broad model support the day a paper drops. We earn it over weeks.

That’s real. We accepted it because the alternative — inheriting complexity we can’t see through, can’t debug, can’t change — felt worse. Fewer processes. Fewer failure modes. A smaller surface to operate and secure. And when something breaks at any layer, we can read every line between the request and the kernel.

What comes next

The inference engine is one of three systems we built from scratch. The other two — a unified AI data layer and a speech runtime — will get their own posts.

Benchmarks are coming. Measured on our production hardware, under real multi-turn agentic workloads. Not synthetic numbers on machines nobody deploys.

Architecture deep dives on the KV cache hierarchy, the transport layer, and ORPC will follow. This was the introduction. The details come next.