These are some sloppy notes on my attempts to figure out how to configure QoS in InfiniBand-based stacks. They are in rough shape. My conclusion so far is that libfabric hard-codes a service level and doesn’t expose it as configurable; I don’t know whether it’s possible to call ibv_modify_qp (or something similar) to change it after verbs has initialized.

Edit: post now ends on an optimistic note.

Basic IB Concepts

Simple Case: One Subnet

  • Subnet Manager (SM): manages the subnet (thx). Like DHCP + ARP + more things.
  • lid: like IP address, assigned by subnet manager, you can ping it, send to it, whatever.
$ sudo ibstat
CA 'qib0'
		...
        Port 1:
                State: Active
                Physical state: LinkUp
                Base lid: 475
                SM lid: 10
				...

SM is lid=10, and this node has an HCA (NIC) with lid=475.

Complex Case: Multiple Subnets

Here routing comes into the picture. Won’t go into details, but simple/sloppy version:

  • guid: u64: like MAC address, unique per device.
  • gid: u128: globally unique ID, subnet prefix: u64 + guid: u64

Routers maintain routes indexed by the subnet prefix; hosts send packets to GIDs, and routers look up the relevant destination subnet and forward. Not relevant to us going forward, but the layout is sketched below.
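
For concreteness, libibverbs exposes exactly this layout; a tiny sketch of my own (port_guid is a placeholder, and 0xfe80... is just the default link-local subnet prefix):

#include <endian.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* union ibv_gid is literally "subnet prefix (u64) + GUID (u64)" */
static union ibv_gid make_gid(uint64_t port_guid)
{
    union ibv_gid gid;
    gid.global.subnet_prefix = htobe64(0xfe80000000000000ULL); /* default prefix */
    gid.global.interface_id  = htobe64(port_guid);             /* the port GUID */
    return gid;
}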

Miscellaneous

  • Service Level (SL): a 4-bit priority carried by a packet
  • Virtual Lane (VL): hardware-level egress lanes on a link; these are what actually provide QoS isolation
  • SL2VL: a table mapping SLs to VLs. SLs confer QoS properties only if they are mapped to separate VLs and the VL arbitration tables (VLArb) are configured accordingly
  • Subnet Management Packet (SMP): used by SM to configure and query fabric components (ports and switches)
  • Management Datagram (MAD): management message format, UMAD is the linux interface to send MAD packets
  • Subnet Administrator (SA): a management service that answers queries (part of the Subnet Manager?)
  • OpenSM: an open-source software-based subnet manager (like dnsmasq for DHCP). Fabrics may have proprietary ones.
  • VL15: a dedicated VL for SMP traffic (QP0: a special queue pair dedicated for it?)

How Many VLs Do I Have?

$ smpquery portinfo <lid> <portnum>
ibwarn: [1006476] mad_rpc_open_port: can't open UMAD port ((null):0)
smpquery: iberror: failed: Failed to open '(null)' port '0'

$ sudo smpquery portinfo <lid> <portnum>
CapMask:.........................0x7610868
                                IsTrapSupported
                                IsAutomaticMigrationSupported
                                IsSLMappingSupported
                                IsSystemImageGUIDsupported
                                IsCommunicatonManagementSupported
                                IsDRNoticeSupported
                                IsCapabilityMaskNoticeSupported
                                IsLinkRoundTripLatencySupported
                                IsClientRegistrationSupported
                                IsOtherLocalChangesNoticeSupported
VLCap:...........................VL0-1
VLHighLimit:.....................0
VLArbHighCap:....................16
VLArbLowCap:.....................16
VLStallCount:....................0
OperVLs:.........................VL0

Takeaways:

  1. SLs/VLs are supported (CapMask has IsSLMappingSupported).
  2. VLCap says that two virtual lanes are possible (VL0 and VL1).
  3. OperVLs says that only one VL (VL0) is actually operational; VL1 is not enabled.
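
The same information is available programmatically; a minimal libibverbs sketch (first device and port 1 are assumptions; link with -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0)
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* e.g. qib0 */
    struct ibv_port_attr pa;
    if (ctx && ibv_query_port(ctx, 1, &pa) == 0)           /* port 1 */
        printf("lid=%u sm_lid=%u sm_sl=%u max_vl_num=%u\n",
               pa.lid, pa.sm_lid, pa.sm_sl, pa.max_vl_num); /* max_vl_num is VLCap-encoded */

    if (ctx)
        ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}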

What Is The Configured QoS State?

$ sudo smpquery sl2vl 475 1
# SL2VL table: Lid 475
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|

$ sudo smpquery vlarb 475 1
# VLArbitration tables: Lid 475 port 1 LowCap 16 HighCap 16
# Low priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

Takeaways:

  1. sl2vl says that we have 16 SLs, but all are mapped to VL0, so no QoS.
  2. vlarb is all VL0. There are two levels of arbitration: the first level is between the high-priority and low-priority tables (vlarb_high vs vlarb_low), governed by VLHighLimit; the second level is within each table, between its (VL, weight) entries (toy sketch below).
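
To make the intra-table level concrete, a toy calculation (my own sketch, not OpenSM code): within one table, when all listed VLs have traffic queued, each entry's share of that table's bandwidth is roughly weight / sum(weights).

#include <stdio.h>

struct vlarb_entry { unsigned vl, weight; };

int main(void)
{
    /* hypothetical high-priority table */
    struct vlarb_entry high[] = { { 1, 192 }, { 2, 128 }, { 3, 64 } };
    unsigned i, total = 0;

    for (i = 0; i < 3; i++)
        total += high[i].weight;
    for (i = 0; i < 3; i++)   /* prints 50.0%, 33.3%, 16.7% */
        printf("VL%u: %.1f%% of this table\n",
               high[i].vl, 100.0 * high[i].weight / total);
    return 0;
}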

Configuring QoS in OpenSM

You edit /etc/opensm/opensm.conf with something like this:

# enable QoS
qos TRUE

# all prefixes are qos_
# specific sub-prefixes: qos_ca_, qos_rtr_, qos_sw0_, qos_swe_
# (for specific config for CAs, routers, switch port 0's, and switches)

qos_max_vls 2
# send this many from high-priority first (255 ~ infinite priority)
qos_high_limit 255
# intra-high priority arbitration (note: entries for VL2/VL3 would need qos_max_vls >= 4)
qos_vlarb_high 1:192, 2:128, 3:64
qos_vlarb_low 0:64
# SLs [0, 4) map to VL0, [4, 8) to VL1 (opensm expects 16 entries, one per SL)
qos_sl2vl 0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0

The policy for relative weights between the high-priority table and the low-priority table is not clear to me; that is, what exactly does qos_vlarb_high do?

Note: https://www.mail-archive.com/[email protected]/msg04092.html is a good description of the qos_high_limit. It seems to me that the high class preempts the low class up to the high limit (preempts not mid-packet, but otherwise).

There’s also ULP-based QoS, I think (ULP: Upper Layer Protocol), which can presumably target service IDs for MPI, Lustre, etc.

Service IDs

  • Lustre etc. can register a Service Record with SA, containing (Service ID, server LID, service name)
  • RDMA-CM (Connection Manager) is involved somehow
  • RDMA-CM divides the SID space into port spaces. RDMA_PS_TCP is a 16-bit namespace.

So SID can be RDMA_PS_TCP << 16 | service_port, or RDMA_PS_IB << 16 | qpn.
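
The same construction in C, just to pin down the constants (a sketch; RDMA_PS_TCP comes from <rdma/rdma_cma.h>, and the port is a placeholder):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    uint16_t port = 987;                                   /* placeholder service port */
    uint64_t sid  = ((uint64_t)RDMA_PS_TCP << 16) | port;  /* RDMA_PS_TCP == 0x0106 */
    printf("0x%016" PRIx64 "\n", sid);
    return 0;
}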

$ sudo saquery -S
<nothing>
$ sudo saquery
<bunch of node records>
# On Lustre root
$ cat /sys/module/ko2iblnd/parameters/service
987

So it seems that Lustre’s SID is RDMA_PS_TCP << 16 | 987

$ rdma resource show cm_id | grep LISTEN
link qib0/1 cm-idn 0 state LISTEN ps TCP comm [ko2iblnd] src-addr 10.94.xxx.yyy:987 dst-addr 0.0.0.0:0
# P=987 is the service port; 0x106 is RDMA_PS_TCP
$ P=987; printf '0x%016x\n' $(( (0x106<<16) | P ))
0x00000000010603db # Service ID for Lustre

This also shows port 987 for Lustre. The following qos-policy.conf then maps Lustre’s Service ID to SL 1:

# /var/cache/opensm/qos-policy.conf

qos-ulps
  default : 0
  any, service-id 0x00000000010603DB : 1
end-qos-ulps
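
As far as I can tell, OpenSM is pointed at this file via the qos_policy_file option in opensm.conf (hedged; the path below is just where this box keeps it):

qos_policy_file /var/cache/opensm/qos-policy.conf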

QoS in PSM

  • Use env variable IPATH_SL.
if (!psmi_getenv("IPATH_SL", "IB outging ServiceLevel number (default 0)",
                 PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
                 (union psmi_envvar_val) PSMI_SL_DEFAULT,
                 &env_sl)) {
    opts.outsl = env_sl.e_long;
}

// This seems to be older code, for PSM v1.07 to 1.10 (head is 1.16).
// No need to set the VL explicitly in head; presumably it comes from the SL2VL mapping.
#if (PSM_VERNO >= 0x0107) && (PSM_VERNO <= 0x010a)
{
    union psmi_envvar_val env_vl;
    if (!psmi_getenv("IPATH_VL", "IB outging VirtualLane (default 0)",
                     PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
                     (union psmi_envvar_val)0,
                     &env_vl)) {
        opts.outvl = env_vl.e_long;
    }
}
#endif
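
In practice this just means setting the variable at launch, e.g. IPATH_SL=1 ./my_psm_app (my_psm_app is a placeholder; the exact launcher incantation depends on the MPI).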

QoS in libfabric/Mercury

Traffic classes are in fi_tx_attr->tclass (u32): https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html

Mercury (since v2.4):

hg_opts.na_init_info.traffic_class = (uint8_t)(desired_sl << 5); // SL -> tclass
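
In context that would sit roughly like this. A sketch only: HG_Init_opt and HG_INIT_INFO_INITIALIZER are standard Mercury, but the traffic_class field and the SL<<5 encoding are taken from the line above and I haven't verified how na_ofi interprets them; the address string and SL are placeholders.

#include <stdint.h>
#include <mercury.h>

static hg_class_t *init_with_sl(uint8_t desired_sl)
{
    struct hg_init_info hg_opts = HG_INIT_INFO_INITIALIZER;
    /* SL -> tclass encoding as per the note above (unverified) */
    hg_opts.na_init_info.traffic_class = (uint8_t)(desired_sl << 5);
    return HG_Init_opt("ofi+verbs;ofi_rxm://10.94.3.29:56504", HG_TRUE, &hg_opts);
}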

Maybe something here: https://docs.nvidia.com/doca/sdk/rdma%2Bover%2Bconverged%2Bethernet/index.html

Things I’ve established so far:

  1. I’m interested in verbs;ofi_rxm.
  2. ofi_rxm just passes tclass down to the core provider (verbs).
  3. I can’t figure out what verbs does with it. It uses both ibv APIs and rdma-cm APIs; it looks like I may be looking for rdma_set_option(RDMA_OPTION_ID_TOS) (sketch below), but nothing calls that.
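
For reference, this is the call I mean, as one would issue it directly on an rdma_cm_id before connecting (a sketch; 'id' and 'tos' are caller-supplied, and nothing in the verbs provider appears to do this):

#include <stdint.h>
#include <rdma/rdma_cma.h>

/* Set the ToS byte on a cm_id; the fabric/driver then maps it to an SL or priority. */
static int set_tos(struct rdma_cm_id *id, uint8_t tos)
{
    return rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS,
                           &tos, sizeof(tos));
}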

Update: 20250923

QoS in libfabric verbs provider: The verbs provider implements QoS through InfiniBand’s native Service Level (SL) mechanism. However, unlike other libfabric providers (e.g., CXI), verbs has no traffic class (tclass) support. The SL is essentially fixed at endpoint creation time and cannot be modified at runtime without recreating the address handle, which would require modifying the libfabric source code.

  • SL set in AH: prov/verbs/src/verbs_dgram_av.c:65 - where ah_attr.sl gets the SL value during address handle creation (see the sketch after this list)
  • SL source: prov/verbs/src/verbs_ep.c:840 - where endpoint gets SL from subnet manager default port_attr.sm_sl
  • No tclass support: prov/verbs/src/verbs_ep.c:358 - vrb_ep_setopt() function only handles CUDA options, no tclass
  • Compare CXI: prov/cxi/src/cxip_ep.c:1148 - shows how other providers handle FI_OPT_CXI_SET_TCLASS
  • Traffic class constants: include/rdma/fabric.h:357 - the FI_TC_* enums that verbs doesn’t use
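
To make the first bullet concrete, this is the mechanism in plain verbs terms: for UD traffic the SL lives in the address handle, so whoever creates the AH picks it. A sketch (pd, dlid and port are placeholders; in libfabric this happens inside verbs_dgram_av.c, not in user code):

#include <stdint.h>
#include <infiniband/verbs.h>

static struct ibv_ah *make_ah_with_sl(struct ibv_pd *pd, uint16_t dlid,
                                      uint8_t port, uint8_t sl)
{
    struct ibv_ah_attr attr = {
        .dlid     = dlid,
        .sl       = sl,        /* the 4-bit Service Level */
        .port_num = port,
    };
    return ibv_create_ah(pd, &attr);
}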

Update: 20250923 +10 mins

So libfabric has betrayed us, but can we assign the flow a service level anyway, using other classification mechanisms, like that ULP stuff?

Our flow is that we use Mercury to bind to an ofi+verbs;ofi_rxm endpoint and exchange those addresses with friends. What does that address look like?

Our mercury address: ofi+verbs;ofi_rxm://10.94.3.29:56504

Spicy.

$ rdma resource show cm_id | grep LISTEN
...
link qib0/1 cm-idn 17325 state LISTEN ps TCP pid 1344261 comm controller_main src-addr 10.94.3.29:56504 dst-addr 0.0.0.0:0

So this does go through RDMA-CM. If we ask Mercury to bind on a specific port range, I think we can still assign those flows an SL using a service-id (port) range in the ULP-based QoS policy (see the sketch below).
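
A sketch of what that could look like, assuming the qos-ulps section accepts service-id ranges (I believe it does; the port range 56000-56999 and the SL value 2 are arbitrary placeholders, with SIDs computed as 0x0106 << 16 | port):

qos-ulps
  default : 0
  any, service-id 0x00000000010603DB : 1
  any, service-id 0x000000000106DAC0-0x000000000106DEA7 : 2
end-qos-ulps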

Also Read

Lustre discussions:

Intel Fabric Suite (confirms Lustre ServiceID)