OpenURMA: A Clean-Room Open Implementation of the U...

Key Takeaways

  • OpenURMA is a research project that provides the first open-source, clean-room implementation of the Unified Bus (UB) protocol, a 2025 specification designed...
  • Modern datacenter RDMA is bottlenecked at the network interface, not the wire.
  • Both the per-connection state growth and the PCIe round-trip cost follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand.
  • UB ships in Huawei's closed Ascend 950 silicon.
  • The contribution is the implementation, harness, and controlled comparison closed silicon does not admit.
Paper Abstract

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.

OpenURMA is a research project that provides the first open-source, clean-room implementation of the Unified Bus (UB) protocol, a 2025 specification designed to overcome the performance bottlenecks of modern datacenter networking. Traditional RDMA (Remote Direct Memory Access) technologies such as RoCE and InfiniBand treat the network interface card (NIC) as a peripheral device attached over PCIe, which adds significant latency and per-connection memory overhead. OpenURMA demonstrates that by moving the NIC controller onto the CPU’s on-chip bus and changing how connection state is managed, it is possible to achieve significantly faster remote-memory access for AI-scale workloads: in the project's evaluation, a 64-byte remote fetch completes in roughly 500 ns versus 2186 ns for a matched RoCEv2 baseline.

The Problem with Current RDMA

RDMA as deployed today relies on a "Queue Pair" abstraction, inherited from InfiniBand, that places the NIC behind PCIe as a peripheral. This design creates two major issues. First, the NIC must hold connection state for every (application, remote-endpoint) pair, which adds up to hundreds of megabytes at 1024-application fanout. When this state exceeds the NIC's on-chip memory, the NIC must fetch the overflow from host memory, which adds delay. Second, a 64-byte operation crosses the PCIe boundary four times, which accounts for roughly 1.5 microseconds of a 2-microsecond round trip. Because both costs are structural, the interface remains the bottleneck even as network links get faster.
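The latency half of this argument is simple arithmetic, sketched below in Python. The two timing constants are the rounded figures quoted above. The function names and the 10x speedup factor are illustrative assumptions, and the model optimistically treats everything outside PCIe as scaling with wire speed.

```python
# Rounded figures from the text: ~1.5 us of PCIe crossings in a ~2 us round trip.
ROUND_TRIP_NS = 2000.0
PCIE_NS = 1500.0


def pcie_share(pcie_ns, round_trip_ns):
    """Fraction of the round trip spent crossing PCIe."""
    return pcie_ns / round_trip_ns


def round_trip_with_faster_wire(pcie_ns, round_trip_ns, wire_speedup):
    """Round trip if only the non-PCIe part becomes `wire_speedup` times faster.

    Assumption (illustrative): all non-PCIe time scales with wire speed,
    which is the best case for a faster link.
    """
    other_ns = round_trip_ns - pcie_ns
    return pcie_ns + other_ns / wire_speedup


share = pcie_share(PCIE_NS, ROUND_TRIP_NS)
faster = round_trip_with_faster_wire(PCIE_NS, ROUND_TRIP_NS, wire_speedup=10.0)

print(f"PCIe share of round trip: {share:.0%}")                   # 75%
print(f"Round trip with a 10x faster wire: {faster:.0f} ns")      # 1550 ns
```

Even in this best case, a tenfold faster wire only moves the round trip from 2000 ns to 1550 ns, because the PCIe crossings are untouched.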

How Unified Bus Changes the Architecture

Unified Bus (UB) replaces the peripheral-based model with a more efficient design centered on three key architectural shifts:

  • Decoupled State Management: Instead of binding applications to specific remote endpoints, UB separates application identity (the "Jetty") from transport reliability (the "TP Channel"). This allows connection state to grow additively rather than multiplicatively, reducing the memory footprint from hundreds of megabytes to roughly 110 kilobytes for large-scale deployments.

  • On-Chip Bus Integration: Because the connection state is now small enough to fit in on-chip memory, the NIC controller can move from behind the PCIe bus directly onto the CPU’s on-chip bus.

  • Load/Store Data Path: With the controller on the on-chip bus, the CPU can access remote memory using standard load and store instructions. This eliminates the need for complex work-queue entries and PCIe-based signaling, collapsing the four-traversal round trip into a single on-chip bus crossing.
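The difference between multiplicative and additive state growth can be sketched in a few lines of Python. Every size and count below is a placeholder chosen for illustration (`QP_CONTEXT_BYTES`, `JETTY_BYTES`, `TP_CHANNEL_BYTES`, and the endpoint counts are assumptions, not figures from the paper or the UB specification), so the absolute outputs should not be compared with the numbers quoted above. Only the shape of the two formulas matters.

```python
# Hypothetical per-entry sizes, for illustration only.
QP_CONTEXT_BYTES = 256   # assumed size of one Queue Pair context
JETTY_BYTES = 64         # assumed size of one Jetty (application identity) entry
TP_CHANNEL_BYTES = 64    # assumed size of one TP Channel (transport) entry


def qp_state_bytes(n_apps, n_remotes, qp_bytes=QP_CONTEXT_BYTES):
    """Queue Pair model: one context per (application, remote-endpoint) pair."""
    return n_apps * n_remotes * qp_bytes


def ub_state_bytes(n_jetties, n_tp_channels,
                   jetty_bytes=JETTY_BYTES, tp_bytes=TP_CHANNEL_BYTES):
    """Decoupled model: application entries and transport entries are summed."""
    return n_jetties * jetty_bytes + n_tp_channels * tp_bytes


if __name__ == "__main__":
    for n in (64, 256, 1024):
        qp = qp_state_bytes(n, n)
        ub = ub_state_bytes(n, n)
        print(f"n={n:5d}  QP model: {qp / 2**20:10.2f} MiB   "
              f"decoupled model: {ub / 2**10:8.2f} KiB")
```

Doubling `n` quadruples the Queue Pair total but only doubles the decoupled total, which is why the decoupled state can stay within on-chip memory as fanout grows.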

Performance Results

OpenURMA provides a three-tier evaluation infrastructure: synthesisable RTL for the Alveo U50 FPGA, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold, each paired with a matched OpenRoCE (RoCEv2 RC) baseline. On a 64-byte remote fetch (a LOAD on UB, a READ on RoCEv2 RC), the UB load/store path achieves an end-to-end latency of approximately 500 nanoseconds, 4.37 times lower than the matched baseline's 2186 nanoseconds. It also sustains 2.80 times higher throughput than the baseline while occupying only about 14% of the Alveo U50’s lookup tables (LUTs).
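As a quick consistency check on the reported latencies, the ratio can be recomputed from the two quoted numbers. The UB figure is given only as "approximately 500 ns", so the snippet below uses 500 as a rounded stand-in, and the variable and function names are illustrative.

```python
UB_LOAD_NS = 500.0      # "~500 ns" in the text; rounded, so treat as approximate
ROCE_READ_NS = 2186.0   # matched RoCEv2 RC baseline from the text


def latency_ratio(baseline_ns, candidate_ns):
    """How many times lower the candidate latency is than the baseline."""
    return baseline_ns / candidate_ns


ratio = latency_ratio(ROCE_READ_NS, UB_LOAD_NS)
saved_ns = ROCE_READ_NS - UB_LOAD_NS

print(f"latency ratio: {ratio:.2f}x")         # 4.37x
print(f"latency saved: {saved_ns:.0f} ns")    # 1686 ns
```

With the rounded 500 ns input the ratio comes out at 4.372, which agrees with the reported 4.37x to two decimal places.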

Opt-in Ordering

Beyond latency and throughput, UB makes ordering opt-in. Traditional RDMA enforces strict, total ordering on all operations, so a single slow operation can stall the ones queued behind it. UB instead lets applications choose their ordering requirements along four axes. Because the gating logic reuses the counters already provisioned by the Jetty/TP Channel split, it adds zero pipeline cycles to operations that do not require strict ordering, so workloads pay only for the guarantees they actually need.
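The cost that opt-in ordering avoids can be illustrated with a toy model. This is not UB's actual ordering semantics: the four axes are not modelled, and the single `ordered` flag, the `Op` type, and the timings are all hypothetical. The sketch only shows how forcing every operation to wait for its predecessors lets one slow operation delay the rest.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    done_ns: int     # time at which the operation itself completes
    ordered: bool    # hypothetical flag: must wait for all earlier operations


def release_times(ops, total_order):
    """Return {name: earliest time the result can be released}.

    An operation that must be ordered is released no earlier than the latest
    completion among the operations issued before it. With total_order=True
    every operation is treated as ordered.
    """
    released = {}
    latest_prior = 0  # latest completion time among earlier operations
    for op in ops:
        must_wait = total_order or op.ordered
        released[op.name] = max(op.done_ns, latest_prior) if must_wait else op.done_ns
        latest_prior = max(latest_prior, op.done_ns)
    return released


ops = [
    Op("a", done_ns=900, ordered=False),  # slow operation issued first
    Op("b", done_ns=300, ordered=False),  # independent, does not need ordering
    Op("c", done_ns=400, ordered=True),   # explicitly asks to follow a and b
]

strict = release_times(ops, total_order=True)
opt_in = release_times(ops, total_order=False)

print("total order:", strict)   # {'a': 900, 'b': 900, 'c': 900}
print("opt-in:     ", opt_in)   # {'a': 900, 'b': 300, 'c': 900}
```

In this model only `c`, which requested ordering, waits for the slow operation `a`, while `b` is released as soon as it completes.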
