OpenURMA is a research project that provides the first open-source, clean-room implementation of the Unified Bus (UB) protocol, a 2025 specification designed to overcome the performance bottlenecks of modern datacenter networking. Traditional RDMA (Remote Direct Memory Access) technologies such as RoCE and InfiniBand treat network interface cards (NICs) as peripheral devices connected via PCIe, an approach that adds significant latency and memory overhead. OpenURMA demonstrates that moving the NIC onto the CPU’s on-chip bus and changing how connection state is managed yields significantly faster data transfers for AI-scale workloads.
The Problem with Current RDMA
Modern datacenter RDMA relies on a "Queue Pair" abstraction that forces NICs to act as PCIe peripherals. This design creates two major issues. First, the NIC must store connection state for every pairing of a local application with a remote endpoint, which consumes hundreds of megabytes of memory at scale. When this state exceeds the NIC's on-chip memory, the NIC must repeatedly fetch it from host memory, adding delay. Second, every network operation must cross the PCIe boundary four times, which consumes roughly 1.5 microseconds of a 2-microsecond round trip. These structural costs mean that even as network links get faster, the host interface remains a bottleneck.
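The latency budget described above can be checked with a few lines of arithmetic. This is a minimal sketch, not a measurement: it assumes the 1.5 microseconds is split evenly across the four PCIe crossings (the text gives only the total), and the variable names are illustrative.

```python
# Figures quoted in the text above (approximate).
ROUND_TRIP_NS = 2000   # roughly a 2-microsecond round trip
PCIE_TOTAL_NS = 1500   # roughly 1.5 microseconds spent crossing PCIe
PCIE_CROSSINGS = 4     # boundary crossings per network operation

# Assumption: the PCIe cost is divided evenly across the crossings.
per_crossing_ns = PCIE_TOTAL_NS / PCIE_CROSSINGS

# Share of the round trip spent on PCIe, and what is left for everything else.
pcie_share = PCIE_TOTAL_NS / ROUND_TRIP_NS
remaining_ns = ROUND_TRIP_NS - PCIE_TOTAL_NS

print(f"per crossing: {per_crossing_ns:.0f} ns")
print(f"PCIe share of round trip: {pcie_share:.0%}")
print(f"left for wire, switch, and NIC: {remaining_ns} ns")
```

Under these figures, three quarters of the round trip is spent on the host interface rather than on the network itself, which is why a faster wire alone does little to reduce latency.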
How Unified Bus Changes the Architecture
Unified Bus (UB) replaces the peripheral-based model with a more efficient design centered on three key architectural shifts:
Decoupled State Management: Instead of binding applications to specific remote endpoints, UB separates application identity (the "Jetty") from transport reliability (the "TP Channel"). This allows connection state to grow additively rather than multiplicatively, reducing the memory footprint from hundreds of megabytes to roughly 110 kilobytes for large-scale deployments.
On-Chip Bus Integration: Because the connection state is now small enough to fit in on-chip memory, the NIC controller can move from behind the PCIe bus directly onto the CPU’s on-chip bus.
Load/Store Data Path: With the controller on the on-chip bus, the CPU can access remote memory using standard load and store instructions. This eliminates the need for complex work-queue entries and PCIe-based signaling, collapsing the four-traversal round trip into a single on-chip bus crossing.
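The difference between multiplicative and additive state growth can be illustrated with a small model. This is a hedged sketch: it assumes one Queue Pair entry per (application, remote endpoint) pair, and one Jetty per application plus one TP Channel per remote endpoint. The function names, counts, and per-entry sizes are placeholders chosen only to show the shape of the scaling. They are not OpenURMA's figures and do not reproduce the numbers quoted above.

```python
def qp_state_bytes(apps: int, remotes: int, bytes_per_qp: int) -> int:
    """Queue Pair model: one entry per (application, remote endpoint) pair."""
    return apps * remotes * bytes_per_qp


def ub_state_bytes(apps: int, remotes: int,
                   bytes_per_jetty: int, bytes_per_tp: int) -> int:
    """UB model (assumed): one Jetty per application, one TP Channel per remote."""
    return apps * bytes_per_jetty + remotes * bytes_per_tp


# Hypothetical inputs, for illustration only.
APPS, REMOTES, ENTRY_BYTES = 32, 512, 64

qp = qp_state_bytes(APPS, REMOTES, ENTRY_BYTES)
ub = ub_state_bytes(APPS, REMOTES, ENTRY_BYTES, ENTRY_BYTES)

print(f"Queue Pair state: {qp} bytes")
print(f"Jetty + TP Channel state: {ub} bytes")
print(f"reduction factor: {qp / ub:.1f}x")
```

Doubling the number of applications doubles the Queue Pair total, but in the additive model it adds only the new Jetty entries, so the gap widens as the deployment grows.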
Performance Results
OpenURMA provides a multi-tier evaluation infrastructure, including synthesizable RTL for the Alveo U50 FPGA, a cycle-level SystemC simulator, and a gem5 full-system scaffold. For a 64-byte remote fetch, the UB load/store path achieved an end-to-end latency of approximately 500 nanoseconds, 4.37 times faster than the matched RoCEv2 baseline (2186 nanoseconds). The implementation is also compact, using only about 14% of the Alveo U50’s available logic resources (LUTs) while sustaining 2.80 times higher throughput than the baseline.
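The reported latency figures are mutually consistent, as a quick check shows. This sketch uses only the two numbers quoted above, the RoCEv2 baseline and the speedup factor; the variable names are illustrative.

```python
ROCE_BASELINE_NS = 2186   # matched RoCEv2 baseline, from the text
SPEEDUP = 4.37            # reported latency improvement factor

# Latency implied for the UB load/store path.
ub_latency_ns = ROCE_BASELINE_NS / SPEEDUP

# Absolute time saved per 64-byte remote fetch.
saved_ns = ROCE_BASELINE_NS - ub_latency_ns

print(f"implied UB latency: {ub_latency_ns:.1f} ns")
print(f"time saved per fetch: {saved_ns:.1f} ns")
```

The implied latency lands within a nanosecond of the "approximately 500 nanoseconds" stated in the text.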
Opt-in Ordering
Beyond latency and throughput, UB introduces "opt-in ordering." Traditional RDMA enforces strict, total ordering on all operations, so a single slow path can stall every operation queued behind it. UB instead lets applications choose their ordering requirements across four axes. Because this gating logic reuses the counters already provisioned by the Jetty/TP Channel split, it adds zero pipeline cycles to operations that do not require strict ordering, so workloads pay only for the ordering guarantees they actually need.