QoS in InfiniBand
These are some sloppy notes on my attempts to figure out how to configure QoS in InfiniBand-based stacks. They are in rough shape; my conclusion so far is that libfabric hard-sets a service level and doesn't make it configurable. I don't know if it's possible to call ibv_modify_qp or something to change it after verbs has initialized.
Edit: post now ends on an optimistic note.
Basic IB Concepts
Simple Case: One Subnet
- Subnet Manager (SM): manages the subnet (thx). Like DHCP + ARP + more things.
- lid: like an IP address, assigned by the subnet manager; you can ping it, send to it, whatever.
$ sudo ibstat
CA 'qib0'
...
Port 1:
State: Active
Physical state: LinkUp
Base lid: 475
SM lid: 10
...
The SM is at lid=10, and this node's HCA (NIC) has lid=475.
Complex Case: Multiple Subnets
Here routing comes into the picture. Won’t go into details, but simple/sloppy version:
- guid (u64): like a MAC address, unique per device.
- gid (u128): a globally unique ID; sm_id (u64) + guid (u64).
Switches maintain routes indexed by sm_id, hosts send packets to gids, switches look up the relevant destination and forward. Not relevant to us going forward.
Miscellaneous
- Service Level (SL): a 4-bit priority carried by each packet (see the verbs sketch after this list)
- Virtual Lane (VL): hardware-level egress lanes over a link; these are what actually provide QoS isolation
- SL2VL: a table mapping SLs to VLs. SLs confer QoS properties only if they are mapped to separate VLs, and only as per the specific configuration in VLArb (the VL arbitration tables)
- Subnet Management Packet (SMP): used by the SM to configure and query fabric components (ports and switches)
- Management Datagram (MAD): the management message format; UMAD is the Linux interface for sending MAD packets
- Subnet Administrator (SA): a management service that answers queries (part of the Subnet Manager?)
- OpenSM: an open-source, software-based subnet manager (like dnsmasq for DHCP). Fabrics may have proprietary ones.
- VL15: a dedicated VL for SMP traffic (QP0: a special queue pair dedicated for it?)
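To make the SL concept concrete: in the verbs API, the Service Level is just a field in the address vector that the application supplies when it connects a QP. Here is a minimal sketch of my own (not taken from any of the stacks discussed below), assuming a plain RC QP on port 1 and made-up destination parameters:
/* Hedged sketch: where the 4-bit SL lives in the verbs API. The SL is set
 * in ah_attr when the QP is transitioned to RTR; the fabric's SL2VL and
 * VLArb tables then decide what QoS (if any) it actually gets. */
#include <infiniband/verbs.h>

static int move_to_rtr(struct ibv_qp *qp, uint16_t dest_lid,
                       uint32_t dest_qpn, uint8_t sl /* 0..15 */)
{
    struct ibv_qp_attr attr = {
        .qp_state           = IBV_QPS_RTR,
        .path_mtu           = IBV_MTU_2048,
        .dest_qp_num        = dest_qpn,
        .rq_psn             = 0,
        .max_dest_rd_atomic = 1,
        .min_rnr_timer      = 12,
        .ah_attr = {
            .dlid          = dest_lid,
            .sl            = sl,      /* <-- the Service Level */
            .src_path_bits = 0,
            .port_num      = 1,
        },
    };
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                         IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                         IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}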
How Many VLs Do I Have?
$ smpquery portinfo <lid> <portnum>
ibwarn: [1006476] mad_rpc_open_port: can't open UMAD port ((null):0)
smpquery: iberror: failed: Failed to open '(null)' port '0'
$ sudo smpquery portinfo <lid> <portnum>
CapMask:.........................0x7610868
IsTrapSupported
IsAutomaticMigrationSupported
IsSLMappingSupported
IsSystemImageGUIDsupported
IsCommunicatonManagementSupported
IsDRNoticeSupported
IsCapabilityMaskNoticeSupported
IsLinkRoundTripLatencySupported
IsClientRegistrationSupported
IsOtherLocalChangesNoticeSupported
VLCap:...........................VL0-1
VLHighLimit:.....................0
VLArbHighCap:....................16
VLArbLowCap:.....................16
VLStallCount:....................0
OperVLs:.........................VL0
Takeaways:
- SLs/VLs are supported (CapMask has IsSLMappingSupported).
- VLCap suggests that two virtual lanes are possible (VL0 and VL1).
- OperVLs suggests that only one VL (VL0) is operational/active; VL1 is not configured. (A programmatic way to read the same port attributes is sketched below.)
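The same information can also be read from userspace without smpquery. A small sketch of my own using ibv_query_port (note that max_vl_num is an enum code rather than a literal VL count); link with -libverbs:
/* Hedged sketch: read the port's lid, SM lid, default SL, and VL capability
 * through libibverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0)
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr pa;
    if (ctx && ibv_query_port(ctx, 1, &pa) == 0)
        printf("lid=%d sm_lid=%d sm_sl=%d max_vl_num(enum)=%d\n",
               pa.lid, pa.sm_lid, pa.sm_sl, pa.max_vl_num);

    if (ctx)
        ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}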
What Is The Configured QoS State?
$ sudo smpquery sl2vl 475 1
# SL2VL table: Lid 475
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
$ sudo smpquery vlarb 475 1
# VLArbitration tables: Lid 475 port 1 LowCap 16 HighCap 16
# Low priority VL Arbitration Table:
VL : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
Takeaways:
- sl2vl says that we have 16 SLs, but all are mapped to VL0, so no QoS.
- vlarb is all VL0. There are two levels of priority classes: the first level is vlarb_high vs. vlarb_low; the second level is intra-table, between (vl, weight) entries. The relative weight of vlarb high vs. low is controlled by VLHighLimit.
Configuring QoS in OpenSM
You edit /etc/opensm/opensm.conf with something like this:
# enable QoS
qos TRUE
# all prefixes are qos_
# specific sub-prefixes: qos_ca_, qos_rtr_, qos_sw0_, qos_swe_
# (for specific config for CAs, routers, switch port 0's, and switches)
qos_max_vls 2
# send this many from high-priority first (255 ~ infinite priority)
qos_high_limit 255
# intra-high priority
qos_vlarb_high 1:192, 2:128, 3:64
qos_vlarb_low 0:64
# SLs [0, 4) are VL0, [4, 8) are VL1
qos_sl2vl 0,0,0,0,1,1,1,1
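As a worked example of the intra-table weights above (my own arithmetic; I believe each weight unit is worth 64 bytes of credit per pass, but the proportional split is what matters): under saturation, the entries inside one VLArb table share bandwidth roughly in proportion to their weights.
/* Back-of-the-envelope split inside the high-priority table for the
 * qos_vlarb_high example above (1:192, 2:128, 3:64). */
#include <stdio.h>

int main(void)
{
    struct { int vl, weight; } high[] = { {1, 192}, {2, 128}, {3, 64} };
    int total = 0;
    for (int i = 0; i < 3; i++)
        total += high[i].weight;
    for (int i = 0; i < 3; i++)
        printf("VL%d: %.1f%% of the high-priority bandwidth\n",
               high[i].vl, 100.0 * high[i].weight / total);
    return 0;   /* prints 50.0%, 33.3%, 16.7% */
}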
The policy for relative weights between the high-priority table and the low-priority table is not clear to me. That is, what exactly does qos_vlarb_high do?
Note: https://www.mail-archive.com/[email protected]/msg04092.html is a good description of qos_high_limit. It seems to me that the high class preempts the low class up to the high limit (preempting at packet boundaries, not mid-packet).
There’s also ULP-based QoS I think (ULP: Upper Layer Protocol). Can maybe target service IDs for MPI, Lustre etc.
Service IDs
- Lustre etc. can register a Service Record with SA, containing (Service ID, server LID, service name)
- RDMA-CM (Connection Manager) is involved somehow
- RDMA-CM divides the SID space into port spaces; RDMA_PS_TCP is a 16-bit namespace.
So the SID can be RDMA_PS_TCP << 16 | service_port, or RDMA_PS_IB << 16 | qpn (a minimal rdma-cm listener sketch follows).
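For intuition, here is roughly what a ULP listener like the ko2iblnd one further below boils down to at the rdma-cm level (a sketch of my own, reusing port 987 from the Lustre example): binding an RDMA_PS_TCP cm_id to a well-known port is what pins down the Service ID.
/* Hedged sketch: a bare rdma-cm listener. Once bound and listening it shows
 * up in `rdma resource show cm_id`, and its Service ID is derived from the
 * port space and port number. */
#include <stdio.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(987) };  /* example port */

    if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP) ||
        rdma_bind_addr(id, (struct sockaddr *)&addr) ||
        rdma_listen(id, 8)) {
        perror("rdma-cm setup");
        return 1;
    }
    printf("listening on port 987 over RDMA_PS_TCP\n");
    /* ... a real server would now sit in an rdma_get_cm_event() loop ... */
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return 0;
}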
$ sudo saquery -S
<nothing>
$ sudo saquery
<bunch of node records>
# On Lustre root
$ cat /sys/module/ko2iblnd/parameters/service
987
So it seems that Lustre’s SID is RDMA_PS_TCP << 16 | 987
$ rdma resource show cm_id | grep LISTEN
link qib0/1 cm-idn 0 state LISTEN ps TCP comm [ko2iblnd] src-addr 10.94.xxx.yyy:987 dst-addr 0.0.0.0:0
# P=987, 0x106 is the `RDMA_PS_TCP`
$ P=987; printf '0x%016x\n' $(( (0x106<<16) | P ))
0x00000000010603db # Service ID for Lustre
This also shows 987 for Lustre. The qos-policy.conf snippet below maps Lustre's Service ID to SL 1:
# /var/cache/opensm/qos-policy.conf
qos-ulps
default : 0
any, service-id 0x00000000010603DB : 1
end-qos-ulps
QoS in PSM
- Use the env variable IPATH_SL:
if (!psmi_getenv("IPATH_SL", "IB outging ServiceLevel number (default 0)",
PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
(union psmi_envvar_val) PSMI_SL_DEFAULT,
&env_sl)) {
opts.outsl = env_sl.e_long;
}
// This seems to be older code, for PSM v1.07 to 1.10. Head is 1.16.
// No need to set the VL in head; presumably it is derived from the SL via SL2VL.
#if (PSM_VERNO >= 0x0107) && (PSM_VERNO <= 0x010a)
{
union psmi_envvar_val env_vl;
if (!psmi_getenv("IPATH_VL", "IB outging VirtualLane (default 0)",
PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
(union psmi_envvar_val)0,
&env_vl)) {
opts.outvl = env_vl.e_long;
}
}
#endif
QoS in libfabric/Mercury
Traffic classes are in fi_tx_attr->tclass (u32): https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html
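For reference, this is how an application would request a traffic class through the generic libfabric API (a sketch of my own; whether the provider actually honors it is exactly the question below):
/* Hedged sketch: ask for a traffic class via fi_getinfo() hints and print
 * back what the matched provider reports. */
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    if (!hints)
        return 1;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG;
    hints->tx_attr->tclass = FI_TC_LOW_LATENCY;   /* requested traffic class */

    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info);
    if (ret)
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
    else
        printf("provider %s reports tclass 0x%x\n",
               info->fabric_attr->prov_name,
               (unsigned) info->tx_attr->tclass);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}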
Mercury (since v2.4):
hg_opts.na_init_info.traffic_class = (uint8_t)(desired_sl << 5); // SL -> tclass
Maybe something here: https://docs.nvidia.com/doca/sdk/rdma%2Bover%2Bconverged%2Bethernet/index.html
Things I’ve established so far:
- I'm interested in verbs;ofi_rxm.
- ofi_rxm just passes tclass to the core provider (which is verbs).
- I can't find out what verbs does with it. It seems to use both ibv APIs and rdma-cm APIs. Looks like I may be looking for rdma_set_option(RDMA_OPTION_ID_TOS) (sketched just below), but nothing calls that.
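For reference, that rdma-cm knob looks like this (a sketch of my own; on RoCE it sets the IP ToS/DSCP, and how or whether it maps back to an IB SL is up to the provider/driver):
/* The option the verbs provider apparently never calls. */
#include <stdint.h>
#include <rdma/rdma_cma.h>

static int set_tos(struct rdma_cm_id *id, uint8_t tos)
{
    return rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS,
                           &tos, sizeof(tos));
}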
Update: 20250923
QoS in libfabric verbs provider: The verbs provider implements QoS through InfiniBand’s native Service Level (SL) mechanism. However, unlike other libfabric providers (e.g., CXI), verbs has no traffic class (tclass) support. The SL is essentially fixed at endpoint creation time and cannot be modified at runtime without recreating the address handle, which would require modifying the libfabric source code.
- SL set in AH: prov/verbs/src/verbs_dgram_av.c:65, where ah_attr.sl gets the SL value during address handle creation
- SL source: prov/verbs/src/verbs_ep.c:840, where the endpoint gets its SL from the subnet manager default port_attr.sm_sl
- No tclass support: prov/verbs/src/verbs_ep.c:358, the vrb_ep_setopt() function only handles CUDA options, no tclass
- Compare CXI: prov/cxi/src/cxip_ep.c:1148 shows how other providers handle FI_OPT_CXI_SET_TCLASS
- Traffic class constants: include/rdma/fabric.h:357, the FI_TC_* enums that verbs doesn't use
Update: 20250923 +10 mins
So libfabric has betrayed us, but can we assign the flow a service level anyway, using other classification mechanisms, like that ULP stuff?
Our flow is that we use Mercury to bind to a verbs;ofi_rxm endpoint and exchange those addresses with friends. What does that address look like?
Our Mercury address: ofi+verbs;ofi_rxm://10.94.3.29:56504
Spicy.
$ rdma resource show cm_id | grep LISTEN
...
link qib0/1 cm-idn 17325 state LISTEN ps TCP pid 1344261 comm controller_main src-addr 10.94.3.29:56504 dst-addr 0.0.0.0:0
So it seems that this does go through RDMA-CM. So if we ask Mercury to bind to specific port ranges, I think we can still assign them a QoS class using a port range in the ULP-based policy.
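If that works, the policy entry would look something like the Lustre one above, but with a service-id range covering the chosen port range. An untested sketch, assuming the range syntax of the OpenSM QoS policy file and a made-up Mercury port range of 56000-56999:
# /var/cache/opensm/qos-policy.conf (hypothetical addition)
qos-ulps
default : 0
# RDMA_PS_TCP (0x106) << 16 | ports 56000-56999
any, service-id 0x000000000106DAC0-0x000000000106DEA7 : 1
end-qos-ulps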
Also Read
- https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1178075141/Understanding+Basic+InfiniBand+QoS
- https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1177878529/Getting+Started+with+InfiniBand+QoS
- https://web-docs.gsi.de/~vpenso/notes/posts/hpc/network/infiniband/subnet-manager.html
- https://docs.nvidia.com/networking/display/mlnxofedv51258060/opensm
- https://docs.nvidia.com/networking/display/mlnxofedv461000/qos+-+quality+of+service
Lustre discussions:
- https://wiki.whamcloud.com/display/LNet/Lustre+QoS – see this!
- https://groups.google.com/g/lustre-discuss-list/c/n6sdj-e5LNA
- https://www.mail-archive.com/[email protected]/msg04092.html
- https://admire-eurohpc.eu/wp-content/uploads/2023/12/Lustre-QoS-Barcelona-GA-Dec-2023.pptx.pdf
- Intel Fabric Suite (confirms Lustre ServiceID)