QoS In Infiniband Via RDMA-CM
Note: this entire post is a lightly edited checkpoint of a conversation with some LLM. Part 1 here.
Introduction
RDMA-CM (RDMA Connection Manager) is the control-plane layer for RDMA. It exists to make connection setup look and feel like sockets: applications can bind to addresses, listen on ports, and accept or connect without worrying about raw InfiniBand identifiers. Underneath, RDMA-CM exchanges the connection metadata, moves the QP through its connection states, and hands back an endpoint ready for verbs. This is distinct from the verbs API, which covers resource setup and the data path but leaves connection establishment to the application.
Background: QoS in InfiniBand
InfiniBand provides Quality of Service using Service Levels (SLs). An SL is a 4-bit value (0–15) carried in each packet’s link-level header. The subnet manager maps SLs onto Virtual Lanes (VLs), which are independent link-level flows with separate buffering and flow control. By assigning flows to different SLs, the fabric can prioritize some traffic or keep classes of traffic from interfering with each other. For a connected QP, the SL is part of the path attributes programmed into the QP when the connection is established, and it stays fixed for that QP’s traffic.
RDMA-CM Basics
RDMA-CM exposes a socket-like API:
- The central handle is an rdma_cm_id, analogous to a socket fd.
- Each cm_id belongs to a port space, such as RDMA_PS_TCP (connection-oriented) or RDMA_PS_UDP (datagrams).
- Applications use rdma_bind_addr, rdma_listen, rdma_connect, etc., in the same way they would with sockets.
What looks like an IP+port pair in RDMA-CM is just a wrapper. With IPoIB enabled, the “IP” is the interface address (e.g. 10.94.3.105), which RDMA-CM resolves to a GID. Without IPoIB, you can still create connections, but the address is directly a GID (usually expressed in IPv6 link-local form). In both cases, the binding resolves to a specific HCA port and GID.
The handshake itself is a control-plane exchange: on InfiniBand (and RoCE) it uses CM MADs sent over QP1, while on iWARP it rides on the TCP connection setup. This is how QP numbers, PSNs, MTU, and the path's SL are exchanged.
RDMA-CM and Service IDs
The key to QoS integration is that RDMA-CM maps the familiar notion of port numbers into InfiniBand Service IDs. A Service ID is a 64‑bit identifier carried in the CM request and in SA path queries, so the Subnet Manager can match on it. In RDMA_PS_TCP, the 16-bit port number you bind is encoded into the low bits of the Service ID automatically (0x0000000001060000 | port). This provides a clean classification hook: flows created on different port ranges correspond to different Service ID ranges.
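The encoding can be sketched in a few lines. This is a minimal illustration, assuming the RDMA IP CM service ID layout (a fixed prefix, then the IP protocol number, then the 16-bit port); the helper name `service_id` and the constants are hypothetical names chosen for this example, not part of librdmacm.

```python
# Sketch of the port -> Service ID encoding, assuming the RDMA IP CM layout:
#   bits 63..24: fixed prefix 0x0000000001
#   bits 23..16: IP protocol number (0x06 for RDMA_PS_TCP, 0x11 for RDMA_PS_UDP)
#   bits 15..0 : port number
# All names below are illustrative, not a library API.
RDMA_IP_CM_PREFIX = 0x0000000001000000
PROTOCOL = {"tcp": 0x06, "udp": 0x11}


def service_id(port: int, port_space: str = "tcp") -> int:
    """Return the 64-bit Service ID for a port in the given port space."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError(f"port out of range: {port}")
    return RDMA_IP_CM_PREFIX | (PROTOCOL[port_space] << 16) | port


if __name__ == "__main__":
    print(f"{service_id(0x3010):#018x}")  # prints 0x0000000001063010
```

It is worth confirming the result against the service_id your own stack reports during connection setup before writing policy rules around it.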
Subnet Manager QoS Policies & Setup (OpenSM)
OpenSM can enforce QoS by mapping Service IDs (derived from RDMA-CM port numbers) to Service Levels (SLs) and then to Virtual Lanes (VLs).
1) Enable QoS in OpenSM
Edit /etc/opensm/opensm.conf (or the file you pass with -F; lowercase -f selects the log file) and set:
qos TRUE
# adjust to your ASIC/link support (e.g., 4 or 8)
qos_max_vls 8
qos_policy_file /etc/opensm/qos-policy.conf
The SL→VL mapping and VL arbitration tables are also set in this options file, through qos_sl2vl, qos_vlarb_high, qos_vlarb_low, and qos_high_limit (see step 3).
Restart or start OpenSM with QoS active:
systemctl restart opensm # distro services
# or manually
opensm -Q -F /etc/opensm/opensm.conf
# (-Q/--qos is equivalent to qos TRUE; -Y sets an explicit policy file path)
2) Define QoS policy (port→ServiceID→SL)
Create /etc/opensm/qos-policy.conf with policy rules. A pragmatic pattern is to carve the RDMA-CM port space into classes and assign SLs per Service ID range. Example with three classes:
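# --- Classes ---------------------------------------------------
# Simple qos-ulps syntax. RDMA_PS_TCP Service ID = 0x0000000001060000 | port.
qos-ulps
    default : 0
    # Control plane / RPC (low bandwidth, low latency), ports 0x1000-0x1fff
    any, service-id 0x0000000001061000-0x0000000001061fff : 4
    # Bulk telemetry / background, ports 0x2000-0x2fff
    any, service-id 0x0000000001062000-0x0000000001062fff : 1
    # Data path / latency-critical (highest priority), ports 0x3000-0x3fff
    any, service-id 0x0000000001063000-0x0000000001063fff : 6
end-qos-ulps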
Notes:
- RDMA-CM encodes RDMA_PS_TCP + port into a 64-bit Service ID. Use contiguous Service ID ranges to represent your port ranges. Confirm the exact mapping in your stack by printing the CM service_id on connection or using SA queries (see verification below).
- If you use IPoIB, classification still hinges on Service ID, not L3 QoS.
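Writing the Service ID ranges by hand is error-prone, so the rules can be generated from port ranges instead. The sketch below assumes OpenSM's simple qos-ulps policy syntax and the RDMA_PS_TCP base of 0x0000000001060000; the class table and the `qos_ulps` helper are hypothetical names for illustration, and the output should be checked against the QoS documentation shipped with your OpenSM version.

```python
# Sketch: turn (name, first_port, last_port, sl) classes into a qos-ulps stanza.
# Assumes the simple OpenSM policy syntax "any, service-id LO-HI : SL" and the
# RDMA_PS_TCP Service ID base; names here are illustrative only.
PS_TCP_BASE = 0x0000000001060000

CLASSES = [
    ("control plane / RPC", 0x1000, 0x1FFF, 4),
    ("bulk telemetry", 0x2000, 0x2FFF, 1),
    ("latency-critical data path", 0x3000, 0x3FFF, 6),
]


def qos_ulps(classes, default_sl=0):
    """Render a qos-ulps block with one service-id range rule per class."""
    lines = ["qos-ulps", f"    default : {default_sl}"]
    for name, first_port, last_port, sl in classes:
        if not 0 <= first_port <= last_port <= 0xFFFF:
            raise ValueError(f"bad port range for {name}")
        if not 0 <= sl <= 15:
            raise ValueError(f"SL out of range for {name}")
        lo = PS_TCP_BASE | first_port
        hi = PS_TCP_BASE | last_port
        lines.append(f"    # {name}")
        lines.append(f"    any, service-id {lo:#018x}-{hi:#018x} : {sl}")
    lines.append("end-qos-ulps")
    return "\n".join(lines)


if __name__ == "__main__":
    print(qos_ulps(CLASSES))
```

Generating the file keeps the port plan and the policy in one place, which helps when a class needs to grow or move.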
3) Map SLs to VLs (isolation / priority)
In opensm.conf, define SL→VL and VL arbitration so the fabric actually separates traffic:
# Map each SL to a VL: 16 comma-separated entries, one per SL 0..15.
# Here SL0→VL0, SL1→VL1, SL4→VL2, SL6→VL3; unused SLs fall back to VL0.
qos_sl2vl 0,1,0,0,2,0,3,0,0,0,0,0,0,0,0,0
# Basic VL arbitration as VL:weight pairs. Give higher-priority VLs larger weights.
qos_vlarb_high 0:64,1:64,2:96,3:128
qos_vlarb_low 0:32,1:32,2:48,3:64
Guidelines:
- Keep the number of active VLs modest (4–8). Ensure all HCAs/links support the count you choose.
- Use distinct VLs for classes that must not interfere. Coalesce low-priority classes on the same VL if you’re VL-limited.
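To make the SL→VL and arbitration tables above concrete, here is a small sketch that renders them in the comma-separated form used by the opensm.conf options qos_sl2vl and qos_vlarb_low, and estimates each VL's relative share of a saturated link. The share estimate assumes plain weighted round-robin within one arbitration table (share ≈ weight / sum of weights), which ignores the high/low table split and qos_high_limit; all helper names are hypothetical.

```python
# Sketch: render SL->VL and VL arbitration tables and estimate relative shares.
# Assumption: within a single arbitration table, a VL's share under saturation
# is roughly weight / total weight. Helper names are illustrative only.
SL_TO_VL = {0: 0, 1: 1, 4: 2, 6: 3}
VLARB_LOW = {0: 32, 1: 32, 2: 48, 3: 64}


def sl2vl_line(mapping, default_vl=0):
    """Render 16 entries (SL 0..15); unmapped SLs use default_vl."""
    return "qos_sl2vl " + ",".join(str(mapping.get(sl, default_vl)) for sl in range(16))


def vlarb_line(option, weights):
    """Render VL:weight pairs for a qos_vlarb_* option."""
    return f"{option} " + ",".join(f"{vl}:{w}" for vl, w in sorted(weights.items()))


def relative_shares(weights):
    """Approximate fraction of arbitration slots each VL receives."""
    total = sum(weights.values())
    return {vl: w / total for vl, w in weights.items()}


if __name__ == "__main__":
    print(sl2vl_line(SL_TO_VL))
    print(vlarb_line("qos_vlarb_low", VLARB_LOW))
    for vl, share in sorted(relative_shares(VLARB_LOW).items()):
        print(f"VL{vl}: {share:.1%}")
```

With the low-priority weights above, VL3 gets twice the weight of VL0, so it should see roughly twice the bandwidth when both are backlogged.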
4) Apply & persist
Ensure your distro loads the same opensm.conf and qos-policy.conf on boot. On systems with multiple HCAs/SM instances, anchor each OpenSM to a GUID/port and reuse the same policy file.
Verification / Troubleshooting
- Check SM picked up QoS: review opensm.log for policy parsing, SL2VL, and VLArb tables.
- Resolve paths with SA:
  - Get a PathRecord with an explicit Service ID to see the returned SL (and MTU/rate). Using saquery (part of infiniband-diags; option names vary by version, so check saquery --help):
    saquery --sgid-to-dgid <sgid>:<dgid> --service-id 0x0000000001063010 --pkey 0x7fff --sl
- Confirm on endpoints: log the CM service_id and the final QP SL your app sees during connection events.
- Port/VL counters: use perfquery/ibqueryerrors to confirm traffic hits the expected VLs; congestion on the wrong VL hints at SL2VL mismatch.
Putting It Together
The path from application to QoS is:
RDMA-CM port → Service ID → Subnet Manager policy → SL → VL
The application simply binds to a port; RDMA-CM turns that into a Service ID; the SM maps Service IDs to SLs; the fabric maps SLs to VLs. QoS enforcement is transparent to the app.
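The whole chain can be traced for a single port with a toy model. This sketch mirrors the example classes used in this post and assumes the RDMA_PS_TCP Service ID base of 0x0000000001060000; the tables and the `trace` function are hypothetical stand-ins for what the Subnet Manager and the fabric do, not an interface to either.

```python
# Toy model of the chain: port -> Service ID -> policy -> SL -> VL.
# POLICY and SL2VL mirror the example classes in this post; everything here
# is an illustrative stand-in, not an OpenSM or librdmacm API.
PS_TCP_BASE = 0x0000000001060000
DEFAULT_SL = 0

# (first_port, last_port, sl)
POLICY = [
    (0x1000, 0x1FFF, 4),
    (0x2000, 0x2FFF, 1),
    (0x3000, 0x3FFF, 6),
]

# Index is the SL (0..15), value is the VL.
SL2VL = [0, 1, 0, 0, 2, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def trace(port: int):
    """Return (service_id, sl, vl) for an RDMA_PS_TCP port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError(f"port out of range: {port}")
    sid = PS_TCP_BASE | port
    sl = DEFAULT_SL  # used when no policy rule matches
    for first_port, last_port, rule_sl in POLICY:
        if (PS_TCP_BASE | first_port) <= sid <= (PS_TCP_BASE | last_port):
            sl = rule_sl
            break
    return sid, sl, SL2VL[sl]


if __name__ == "__main__":
    for port in (0x1001, 0x3010, 80):
        sid, sl, vl = trace(port)
        print(f"port {port:#06x} -> service id {sid:#018x} -> SL {sl} -> VL {vl}")
```

A port outside every configured range, such as 80 here, falls through to the default SL and therefore shares VL0 with other unclassified traffic.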
Practical Notes
- With IPoIB: CM addresses look like normal IPs, but they resolve to GIDs under the hood.
- Without IPoIB: you can still use RDMA-CM with GID-based addresses; port numbers still become Service IDs.
- If no policy rule matches a connection's Service ID, the default SL (often 0) is used.
Conclusion
RDMA-CM is the bridge between socket-like connection setup and InfiniBand’s fabric-level QoS. It hides GIDs and QP details from the application, while exposing a port number abstraction that the Subnet Manager can map to Service Levels. This is what allows administrators to classify traffic by port ranges and enforce fabric QoS without requiring applications to manipulate QP attributes directly.