# QoS in InfiniBand

## Basic IB Concepts

### Simple Case: One Subnet
- Subnet Manager (SM): manages the subnet. Like DHCP + ARP + more things.
- LID: like an IP address, assigned by the subnet manager; you can ping it, send to it, whatever.
```
$ sudo ibstat
CA 'qib0'
        ...
        Port 1:
                State: Active
                Physical state: LinkUp
                Base lid: 475
                SM lid: 10
                ...
```
The SM is at lid=10, and this node's HCA (the IB NIC) has lid=475.
### Complex Case: Multiple Subnets
Here routing comes into the picture. Won't go into details, but the simple/sloppy version:

- `guid: u64`: like a MAC address, unique per device.
- `gid: u128`: globally unique ID; subnet prefix (`u64`) + `guid` (`u64`).

Routers maintain routes indexed by subnet prefix; hosts send packets to `gid`s, routers look up the relevant destination and forward. Not relevant to us going forward.
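The prefix+GUID layout can be sketched in a few lines (toy code, not any real API; the GUID value below is made up, the prefix is IB's well-known link-local subnet prefix):

```python
# Toy illustration of the GID layout: 64-bit subnet prefix || 64-bit GUID.
def make_gid(subnet_prefix: int, guid: int) -> int:
    """Compose a 128-bit GID from a u64 subnet prefix and a u64 GUID."""
    assert 0 <= subnet_prefix < 1 << 64 and 0 <= guid < 1 << 64
    return (subnet_prefix << 64) | guid

# 0xfe80... is the default (link-local) subnet prefix; the GUID is made up.
gid = make_gid(0xFE80000000000000, 0x0011750000770E8C)
print(f"{gid:032x}")  # fe800000000000000011750000770e8c
```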
## Miscellaneous
- Service Level (SL): a 4-bit priority carried by a packet.
- Virtual Lane (VL): hardware-level egress lanes over a link; these actually provide QoS isolation.
- SL2VL: a table mapping SLs to VLs. SLs confer QoS properties only if they are mapped to separate VLs, and as per the specific configuration in VLArb (the VL arbitration tables).
- Subnet Management Packet (SMP): used by the SM to configure and query fabric components (ports and switches).
- Management Datagram (MAD): management message format; UMAD is the Linux interface for sending MAD packets.
- Subnet Administrator (SA): a management service that answers queries (part of the Subnet Manager?).
- OpenSM: an open-source software-based subnet manager (like dnsmasq for DHCP). Fabrics may have proprietary ones.
- VL15: a dedicated VL for SMP traffic (QP0: a special queue pair dedicated to it?).

## How Many VLs Do I Have?
```
$ smpquery portinfo
ibwarn: [1006476] mad_rpc_open_port: can't open UMAD port ((null):0)
smpquery: iberror: failed: Failed to open '(null)' port '0'

$ sudo smpquery portinfo
...
```
Takeaways:
1. SLs/VLs are supported (`CapMask` has `IsSLMappingSupported`).
2. `VLCap` suggests that two virtual lanes are possible (`VL0` and `VL1`)
3. `OperVLs` suggests that only one VL (`VL0`) is configured and operational; `VL1` is not configured.
## What Is The Configured QoS State?
```
$ sudo smpquery sl2vl 475 1
SL2VL table: Lid 475
SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
```
```
$ sudo smpquery vlarb 475 1
VLArbitration tables: Lid 475 port 1 LowCap 16 HighCap 16
Low priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
High priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
```
Takeaways:
1. `sl2vl` says that we have 16 SLs, but all are mapped to VL0, so no QoS.
2. `vlarb` is all `VL0`. There are two levels of priority classes: the first level is between the `vlarb_high` and `vlarb_low` tables; the second level is intra-table, between `(vl, weight)` entries. The relative weight of vlarb-high vs. vlarb-low is controlled by `VLHighLimit`.
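My mental model of the two arbitration levels, as a toy simulation (assumptions: entries are `(vl, weight)` pairs, weights count packets rather than 64-byte units, and the high table fully preempts the low one, i.e. `VLHighLimit` ~ 255):

```python
# Toy simulation of the two-level VL arbitration (my reading, not real
# firmware behavior). Entries are (vl, weight) pairs; weights are in
# 64-byte units on real hardware, but here they just count packets.
def drain(high_table, low_table, queues):
    """Return the order in which queued packets get sent.
    queues maps vl -> pending packet count. Weighted round-robin runs
    within each table; the high table preempts the low table entirely
    (i.e. VLHighLimit ~ 255, "infinite priority")."""
    order = []

    def run(table):
        while True:
            sent = False
            for vl, weight in table:
                take = min(weight, queues.get(vl, 0))
                order.extend([vl] * take)
                queues[vl] = queues.get(vl, 0) - take
                sent = sent or take > 0
            if not sent:
                return

    run(high_table)  # drained first: high preempts low
    run(low_table)
    return order

# vlarb_high = 1:2, vlarb_low = 0:1; three packets queued on each VL.
print(drain([(1, 2)], [(0, 1)], {0: 3, 1: 3}))  # [1, 1, 1, 0, 0, 0]
```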
## Configuring QoS in OpenSM
You edit `/etc/opensm/opensm.conf` with something like this:
```bash
# enable QoS
qos TRUE
# all prefixes are qos_
# specific sub-prefixes: qos_ca_, qos_rtr_, qos_sw0_, qos_swe_
# (for specific config for CAs, routers, switch port 0's, and switches)
qos_max_vls 2

# send this many from the high-priority table first (255 ~ infinite priority)
qos_high_limit 255

# (vl, weight) arbitration entries; with qos_max_vls 2, only VL0/VL1 exist
qos_vlarb_high 1:192
qos_vlarb_low 0:64

# SLs [0, 4) map to VL0, SLs [4, 8) to VL1; 16 entries, one per SL
qos_sl2vl 0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0
```
The policy for relative weights between the high-priority table and the low-priority table is not clear to me; that is, what exactly `qos_vlarb_high` does.
Note: https://www.mail-archive.com/[email protected]/msg04092.html is a good description of `qos_high_limit`. It seems to me that the high class preempts the low class up to the high limit (not mid-packet, but otherwise yes).
There's also ULP-based QoS, I think (ULP: Upper Layer Protocol). It can maybe target service IDs for MPI, Lustre, etc.
## Service IDs
- Lustre etc. can register a Service Record with the SA, containing (Service ID, server LID, service name).
- RDMA-CM (Connection Manager) is involved somehow.
- RDMA-CM divides the SID space into port spaces. `RDMA_PS_TCP` is a 16-bit namespace. So a SID can be `RDMA_PS_TCP << 16 | service_port`, or `RDMA_PS_IB << 16 | qpn`.
```
$ sudo saquery -S
<nothing>

$ sudo saquery
<bunch of node records>

# On Lustre root
$ cat /sys/module/ko2iblnd/parameters/service
987
```
So it seems that Lustre's SID is `RDMA_PS_TCP << 16 | 987`.
```
$ rdma resource show cm_id | grep LISTEN
link qib0/1 cm-idn 0 state LISTEN ps TCP comm [ko2iblnd] src-addr 10.94.xxx.yyy:987 dst-addr 0.0.0.0:0

# P=987, 0x106 is RDMA_PS_TCP
$ P=987; printf '0x%016x\n' $(( (0x106<<16) | P ))
0x00000000010603db   # Service ID for Lustre
```
This also shows 987 for Lustre. The following `qos-policy.conf` snippet maps Lustre's traffic to SL1:
```
# /var/cache/opensm/qos-policy.conf
qos-ulps
    default : 0
    any, service-id 0x00000000010603DB : 1
end-qos-ulps
```
## QoS in PSM

- Use the environment variable `IPATH_SL`:
```c
if (!psmi_getenv("IPATH_SL", "IB outging ServiceLevel number (default 0)",
                 PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
                 (union psmi_envvar_val) PSMI_SL_DEFAULT,
                 &env_sl)) {
    opts.outsl = env_sl.e_long;
}

// This seems to be older code, for PSM v1.07 to 1.10. Head is 1.16.
// No need to set VL in head, will use
#if (PSM_VERNO >= 0x0107) && (PSM_VERNO <= 0x010a)
{
    union psmi_envvar_val env_vl;
    if (!psmi_getenv("IPATH_VL", "IB outging VirtualLane (default 0)",
                     PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
                     (union psmi_envvar_val)0,
                     &env_vl)) {
        opts.outvl = env_vl.e_long;
    }
}
#endif
```
## QoS in libfabric/Mercury

Traffic classes are in `fi_tx_attr->tclass` (u32): https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html
Mercury (since v2.4):

```c
hg_opts.na_init_info.traffic_class = (uint8_t)(desired_sl << 5); // SL -> tclass
```
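Why `sl << 5`? My guess (an assumption, not something the Mercury docs state): the traffic-class byte follows the IP TOS layout, whose top 3 bits are the precedence field, and legacy RoCE derives the SL from exactly those bits (`sl = tos >> 5`):

```python
# Assumption: tclass follows the IP TOS byte layout; the SL lands in
# the top 3 bits (the precedence field), matching sl = tos >> 5.
def sl_to_tclass(sl: int) -> int:
    assert 0 <= sl < 8, "SL is 4 bits, but only 3 fit in the precedence field"
    return sl << 5

print(sl_to_tclass(1))  # 32
```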
Maybe something here: https://docs.nvidia.com/doca/sdk/rdma%2Bover%2Bconverged%2Bethernet/index.html
Things I've established so far:

- I'm interested in `verbs;ofi_rxm`.
- `ofi_rxm` just passes `tclass` to the core provider (which is verbs).
- I can't find out what verbs does with it. It seems to use both `ibv` APIs and `rdma-cm` APIs. Looks like I may be looking for `rdma_set_option(RDMA_OPTION_ID_TOS)`. But nothing calls that.
To be continued.
## Also Read
- https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1178075141/Understanding+Basic+InfiniBand+QoS
- https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1177878529/Getting+Started+with+InfiniBand+QoS
- https://web-docs.gsi.de/~vpenso/notes/posts/hpc/network/infiniband/subnet-manager.html
- https://docs.nvidia.com/networking/display/mlnxofedv51258060/opensm
- https://docs.nvidia.com/networking/display/mlnxofedv461000/qos+-+quality+of+service
Lustre discussions:
- https://wiki.whamcloud.com/display/LNet/Lustre+QoS – see this!
- https://groups.google.com/g/lustre-discuss-list/c/n6sdj-e5LNA
- https://www.mail-archive.com/[email protected]/msg04092.html
- https://admire-eurohpc.eu/wp-content/uploads/2023/12/Lustre-QoS-Barcelona-GA-Dec-2023.pptx.pdf
Intel Fabric Suite (confirms Lustre ServiceID):
- https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_FabricSuite_Fabric_Manager_UG_H76468_v15_0.pdf