Basic IB Concepts

Simple Case: One Subnet

  • Subnet Manager (SM): manages the subnet. Think DHCP + ARP + more.
  • LID: like an IP address; assigned by the subnet manager. You can ping it, send to it, whatever.
```
$ sudo ibstat
CA 'qib0'
        ...
        Port 1:
                State: Active
                Physical state: LinkUp
                Base lid: 475
                SM lid: 10
                ...
```

The SM is at lid=10, and this node has an HCA (NIC) whose port has lid=475.

Complex Case: Multiple Subnets

Here routing comes into the picture. Won’t go into details, but simple/sloppy version:

  • GUID (u64): like a MAC address, unique per device.
  • GID (u128): globally unique ID: subnet_prefix: u64 + guid: u64

Routers maintain routes indexed by subnet prefix; hosts send packets to GIDs, routers look up the relevant destination subnet and forward. Not relevant to us going forward.
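The GID layout above (64-bit subnet prefix on top of a 64-bit GUID) can be sketched in a few lines. This is a toy illustration, not an API; the prefix shown is the well-known default `fe80::/64` link-local prefix, and the GUID value is made up.

```python
# Sketch: a GID is (64-bit subnet prefix) << 64 | (64-bit port GUID).
def make_gid(subnet_prefix: int, guid: int) -> int:
    assert subnet_prefix < 1 << 64 and guid < 1 << 64
    return (subnet_prefix << 64) | guid

def gid_str(gid: int) -> str:
    # Format as 8 colon-separated 16-bit groups, most significant first
    # (the same shape ibstat/ibv_devinfo print).
    return ":".join(f"{(gid >> s) & 0xffff:04x}" for s in range(112, -1, -16))

# Default (unrouted) subnet prefix + a hypothetical port GUID:
gid = make_gid(0xfe80000000000000, 0x0011750000ffd1c1)
print(gid_str(gid))  # fe80:0000:0000:0000:0011:7500:00ff:d1c1
```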

Miscellaneous

  • Service Level (SL): a 4-bit priority carried by a packet
  • Virtual Lane (VL): hardware-level egress lanes over a link; these actually provide QoS isolation
  • SL2VL: a table mapping SLs to VLs. SLs confer QoS properties only if they are mapped to separate VLs, and then only as configured in VLArb (the VL arbitration tables)
  • Subnet Management Packet (SMP): used by the SM to configure and query fabric components (ports and switches)
  • Management Datagram (MAD): management message format; UMAD is the Linux interface for sending MAD packets
  • Subnet Administrator (SA): a management service that answers queries (part of the Subnet Manager?)
  • OpenSM: an open-source software-based subnet manager (like dnsmasq for DHCP). Fabrics may run proprietary ones.
  • VL15: a dedicated VL for SMP traffic (QP0: a special queue pair dedicated to it?)
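The SL-to-VL relationship above is just a table lookup on egress. A toy sketch (the table contents here are hypothetical; a real port holds one such table per in/out port pair):

```python
# A port's SL2VL table: index = SL (4 bits), value = VL.
SL2VL = [0] * 16

def egress_vl(sl: int) -> int:
    """VL a packet with the given SL is placed on at egress."""
    return SL2VL[sl & 0xF]

# With the all-zero default table, every SL lands on VL0: no isolation.
print({sl: egress_vl(sl) for sl in range(16)})  # all zeros

# A QoS-enabled table might map SLs 4-7 to VL1:
SL2VL[4:8] = [1, 1, 1, 1]
print(egress_vl(0), egress_vl(5))  # 0 1
```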

## How Many VLs Do I Have?

```
$ smpquery portinfo
ibwarn: [1006476] mad_rpc_open_port: can't open UMAD port ((null):0)
smpquery: iberror: failed: Failed to open '(null)' port '0'

$ sudo smpquery portinfo
CapMask:.........................0x7610868
                IsTrapSupported
                IsAutomaticMigrationSupported
                IsSLMappingSupported
                IsSystemImageGUIDsupported
                IsCommunicatonManagementSupported
                IsDRNoticeSupported
                IsCapabilityMaskNoticeSupported
                IsLinkRoundTripLatencySupported
                IsClientRegistrationSupported
                IsOtherLocalChangesNoticeSupported
VLCap:...........................VL0-1
VLHighLimit:.....................0
VLArbHighCap:....................16
VLArbLowCap:.....................16
VLStallCount:....................0
OperVLs:.........................VL0
```


Takeaways:
1. SLs/VLs are supported (`CapMask` has `IsSLMappingSupported`).
2. `VLCap` says that two virtual lanes are possible (`VL0` and `VL1`).
3. `OperVLs` says that only one VL (`VL0`) is operational; `VL1` is not configured.
## What Is The Configured QoS State?

```
$ sudo smpquery sl2vl 475 1
SL2VL table: Lid 475
SL:                 | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
```

```
$ sudo smpquery vlarb 475 1
VLArbitration tables: Lid 475 port 1 LowCap 16 HighCap 16
Low priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
High priority VL Arbitration Table:
VL    : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
```


Takeaways:
1. `sl2vl` says that we have 16 SLs, but all are mapped to VL0, so no QoS.
2. `vlarb` is all `VL0`. There are two levels of priority classes: the first level is between `vlarb_high` and `vlarb_low`; the second level is intra-table, between `(vl, weight)` entries. The relative weight of vlarb-high vs. vlarb-low is controlled by `VLHighLimit`.
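The intra-table level can be sketched as weighted round robin over `(vl, weight)` entries. This is a toy model under my reading of the tables: weights are really in units of 64 bytes, but I use packet counts here for simplicity.

```python
def wrr(table, queues):
    """One pass over a VLArb table: each (vl, weight) entry may send
    up to `weight` packets from that VL's queue before the arbiter
    moves to the next entry."""
    sent = []
    for vl, weight in table:
        for _ in range(weight):
            if queues.get(vl):
                sent.append((vl, queues[vl].pop(0)))
    return sent

queues = {0: ["a", "b", "c"], 1: ["x", "y", "z"]}
table = [(1, 2), (0, 1)]   # like a table configured as "1:2, 0:1"
print(wrr(table, queues))  # [(1, 'x'), (1, 'y'), (0, 'a')]
```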

## Configuring QoS in OpenSM

You edit `/etc/opensm/opensm.conf` with something like this:

```bash
# enable QoS
qos TRUE

# all prefixes are qos_
# specific sub-prefixes: qos_ca_, qos_rtr_, qos_sw0_, qos_swe_
# (for specific config for CAs, routers, switch port 0's, and switches)

qos_max_vls 2
# send this many from high-priority first (255 ~ infinite priority)
qos_high_limit 255
# intra-high priority
qos_vlarb_high 1:192, 2:128, 3:64
qos_vlarb_low 0:64
# SLs [0, 4) are VL0, [4, 8) are VL1
qos_sl2vl 0,0,0,0,1,1,1,1
```

The policy for relative weights between the high-priority table and the low-priority table is not clear to me; that is, what exactly does `qos_vlarb_high` do?

Note: https://www.mail-archive.com/[email protected]/msg04092.html is a good description of the qos_high_limit. It seems to me that the high class preempts the low class up to the high limit (preempts not mid-packet, but otherwise).
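Under that reading, a toy model of `qos_high_limit` looks like the following. Everything here is my own guess at the semantics, simplified to packet counts (the real limit is in units of high-priority data), so treat it as a sketch, not a hardware-accurate simulator.

```python
def schedule(high, low, high_limit, n):
    """Toy link scheduler: the high table preempts the low table at
    packet boundaries, but after `high_limit` consecutive high-priority
    packets, one low-priority packet is granted a slot."""
    out, burst = [], 0
    for _ in range(n):
        if high and burst < high_limit:
            out.append(("hi", high.pop(0)))
            burst += 1
        elif low:
            out.append(("lo", low.pop(0)))
            burst = 0
        elif high:                     # low queue empty: keep draining high
            out.append(("hi", high.pop(0)))
        else:
            break
    return out

print(schedule(high=list("AAAA"), low=list("bb"), high_limit=2, n=6))
# -> [('hi', 'A'), ('hi', 'A'), ('lo', 'b'), ('hi', 'A'), ('hi', 'A'), ('lo', 'b')]
```

With `high_limit` at 255 (as in the config above), the low table is effectively starved whenever high-priority traffic is backlogged.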

There’s also ULP-based QoS, I think (ULP: Upper Layer Protocol). Can maybe target service IDs for MPI, Lustre, etc.

Service IDs

  • Lustre etc. can register a Service Record with SA, containing (Service ID, server LID, service name)
  • RDMA-CM (Connection Manager) is involved somehow
  • RDMA-CM divides the SID space into port spaces. RDMA_PS_TCP is a 16-bit namespace.

So the SID can be `RDMA_PS_TCP << 16 | service_port`, or `RDMA_PS_IB << 16 | qpn`.
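A quick worked example of that composition. The port-space constants below are what I believe `rdma/rdma_cma.h` defines (`RDMA_PS_TCP = 0x0106`, `RDMA_PS_IB = 0x013F`); double-check against your headers.

```python
# Port-space identifiers from rdma_cma.h (to the best of my knowledge):
RDMA_PS_TCP = 0x0106
RDMA_PS_IB  = 0x013F

def sid(port_space: int, low16: int) -> int:
    """Compose an RDMA-CM Service ID: port space in the upper bits,
    port number / QPN in the low 16 bits."""
    return (port_space << 16) | (low16 & 0xFFFF)

print(hex(sid(RDMA_PS_TCP, 987)))  # 0x10603db
```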

```
$ sudo saquery -S
<nothing>
$ sudo saquery
<bunch of node records>
# On Lustre root
$ cat /sys/module/ko2iblnd/parameters/service
987
```

So it seems that Lustre’s SID is `RDMA_PS_TCP << 16 | 987`.

```
$ rdma resource show cm_id | grep LISTEN
link qib0/1 cm-idn 0 state LISTEN ps TCP comm [ko2iblnd] src-addr 10.94.xxx.yyy:987 dst-addr 0.0.0.0:0
# P=987, 0x106 is RDMA_PS_TCP
$ P=987; printf '0x%016x\n' $(( (0x106<<16) | P ))
0x00000000010603db   # Service ID for Lustre
```

This also shows 987 for Lustre. The qos-policy below maps Lustre to SL1.

```
# /var/cache/opensm/qos-policy.conf
qos-ulps
    default : 0
    any, service-id 0x00000000010603DB : 1
end-qos-ulps
```

QoS in PSM

  • Use the env variable `IPATH_SL`.

```c
if (!psmi_getenv("IPATH_SL", "IB outging ServiceLevel number (default 0)",
                 PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
                 (union psmi_envvar_val) PSMI_SL_DEFAULT,
                 &env_sl)) {
        opts.outsl = env_sl.e_long;
}
```

```c
// This seems to be older code, for PSM v1.07 to 1.10. Head is 1.16.
// No need to set VL in head, will use 
#if (PSM_VERNO >= 0x0107) && (PSM_VERNO <= 0x010a)
{
        union psmi_envvar_val env_vl;
        if (!psmi_getenv("IPATH_VL", "IB outging VirtualLane (default 0)",
                         PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_LONG,
                         (union psmi_envvar_val)0,
                         &env_vl)) {
                opts.outvl = env_vl.e_long;
        }
}
#endif
```

QoS in libfabric/Mercury

Traffic classes are in fi_tx_attr->tclass (u32): https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html

Mercury (since v2.4):

```c
hg_opts.na_init_info.traffic_class = (uint8_t)(desired_sl << 5); // SL -> tclass
```

Maybe something here: https://docs.nvidia.com/doca/sdk/rdma%2Bover%2Bconverged%2Bethernet/index.html

Things I’ve established so far:

  1. I’m interested in verbs + ofi_rxm.
  2. ofi_rxm just passes tclass to the core provider (which is verbs).
  3. I can’t find out what verbs does with it. It seems to use both ibv APIs and rdma-cm APIs. It looks like I may be looking for `rdma_set_option(RDMA_OPTION_ID_TOS)`, but nothing calls that.

To be continued.

Also Read

  • https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1178075141/Understanding+Basic+InfiniBand+QoS
  • https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1177878529/Getting+Started+with+InfiniBand+QoS
  • https://web-docs.gsi.de/~vpenso/notes/posts/hpc/network/infiniband/subnet-manager.html
  • https://docs.nvidia.com/networking/display/mlnxofedv51258060/opensm
  • https://docs.nvidia.com/networking/display/mlnxofedv461000/qos+-+quality+of+service

Lustre discussions:

  • https://wiki.whamcloud.com/display/LNet/Lustre+QoS – see this!
  • https://groups.google.com/g/lustre-discuss-list/c/n6sdj-e5LNA
  • https://www.mail-archive.com/[email protected]/msg04092.html
  • https://admire-eurohpc.eu/wp-content/uploads/2023/12/Lustre-QoS-Barcelona-GA-Dec-2023.pptx.pdf

Intel Fabric Suite (configures Lustre ServiceID)

  • https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_FabricSuite_Fabric_Manager_UG_H76468_v15_0.pdf