<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://ankushja.in/feed.xml" rel="self" type="application/atom+xml"/><link href="https://ankushja.in/" rel="alternate" type="text/html" hreflang="en"/><updated>2025-10-14T05:44:09+00:00</updated><id>https://ankushja.in/feed.xml</id><title type="html">Ankush Jain</title><subtitle>Personal website and blog of Ankush Jain, PhD Student at CMU PDL. Topics include systems, storage, networks, HPC etc. </subtitle><entry><title type="html">Is RDMA Misunderstood?</title><link href="https://ankushja.in/blog/2025/rdma-misunderstood/" rel="alternate" type="text/html" title="Is RDMA Misunderstood?"/><published>2025-10-06T14:51:13+00:00</published><updated>2025-10-06T14:51:13+00:00</updated><id>https://ankushja.in/blog/2025/rdma-misunderstood</id><content type="html" xml:base="https://ankushja.in/blog/2025/rdma-misunderstood/"><![CDATA[<p>In this post, I explore the thesis that we keep redoing certain things in fabric design because the history is not properly documented. It is an ambitious claim, and the understanding I need to build to make it comprehensively and conclusively remains elusive, and this is an attempt at a sloppier fail-fast version. Let me reiterate the <em>fail-fast</em> bit: this post was written with an appetite for my own hat. It also forms a loose series with <a href="https://ankushja.in/blog/2023/infiniband-flavors/">this</a>, <a href="https://ankushja.in/blog/2024/network-tradeoffs/">this</a>, and <a href="https://ankushja.in/blog/2024/credits-flow-congestion/">this</a>.</p> <p>Edit: <em>I prompt-engineered some LLMs into generating <a href="https://users.ece.cmu.edu/~ankushj/cbfc.pdf">this</a>. 
Same reliability caveats as the rest of the post apply.</em></p> <h2 id="losslessless-the-bedrock-of-rdma">Losslessness: The Bedrock of RDMA</h2> <p>RDMA originated on a natively lossless fabric (Infiniband). Since then, we have tried to graft it onto inherently lossy fabrics, with brittle outcomes. We like losslessness for the same reason we like strong consistency: it is nice and simple if things just work and are predictable.</p> <ol> <li>Every network becomes lossless at low utilization.</li> <li>CBFC is the best way we know to build a lossless network at higher traffic volumes. PFC is very fragile, intrinsically so.</li> <li>CBFC and PFC represent two fundamentally differing approaches to utilization management: proactive and reactive, with their own tradeoffs. The cost of a proactive approach is <em>“imposing flow control when it was unnecessary”</em>, leading to potential underutilization.</li> <li>There is a fundamental tradeoff between network utilization and loss guarantees. For any given lossless network, you can afford to be a little speculative and increase median usage. The null hypothesis is that this increases loss probability. Unless that increase is reasoned all the way through and shown not to exist or not to matter, we must assume it exists.</li> <li>We like lossless networks because in certain conditions, the cost of loss to higher layers is prohibitive. 
We should not design lossless schemes that dispatch speculatively without reasoning about why we wanted a lossless scheme in the first place, and why that speculative dispatch does not create problems we wanted to avoid<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</li> </ol> <h2 id="everyone-likes-most-of-infiniband">Everyone Likes Most of Infiniband</h2> <p>Let us disaggregate infiniband into a series of features.</p> <p>Likes:</p> <ol> <li>Everyone likes CBFC[^1,2,3]</li> <li>Everyone likes the performance benefits of RDMA, if not the complexity of memory registration and coding verbs.</li> <li>We like the cache-friendly access patterns enabled by verbs as an interface. The TCP/IP stack, as implemented, has been noted to force cache-unfriendly behavior, creating utilization bubbles in the precious PCIe link<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>. <em>verbs</em>, used along with capabilities such as <em>SRQs</em>, <em>multi-receive</em>, and <em>hugepages</em>, enable intrinsically contiguous accesses.</li> </ol> <p>Dislikes:</p> <ol> <li>Everyone dislikes its difficult relationship with the broader datacenter network. It is not a part of the broader network, but a separate sub-network like NVLink or CXL.</li> <li>Many folks dislike its scalability constraints (connections need QPs, which need NIC SRAM).</li> </ol> <h2 id="what-we-really-need-rd--ethernet-compatibility">What We Really Need: RD + Ethernet Compatibility</h2> <h3 id="case-study-qlogic">Case Study: QLogic</h3> <p>QLogic already had this insight 15 years <a href="https://www.youtube.com/watch?v=E0uSl_gyZnI">ago</a>. QLogic is not super well known, which I think is unfortunate. Many aspects of their approach are highly relevant to some debates we have had since.</p> <p>QLogic Truescale was the precursor to Intel Omnipath, which was the precursor to Cornelis Networks. 
Instead of offload to a smart NIC, they wanted <em>onload</em>: CPU-mediated communication for scalability. It was a verbs-compliant fabric that natively spoke a different interface called <a href="https://github.com/pdlfs/psm">psm</a>.</p> <p>Truescale was a messaging-oriented fabric. I think of it as RDMA UD with large datagram sizes. It injected small payloads inline into a packet, and large payloads were DMA’ed on both the send and receive paths using transient connection mechanisms called TID/TIDflows.</p> <p>From what I understand, PSM had CBFC but no <em>link-level retry</em> (LLR). It had a software-based go-back-N-style recovery. I think link-level retry requires a smarter HCA, which they were trying to avoid, at least for the first generation. I have observed software-based retries in that fabric add an extraordinary amount of jitter when triggered, but they are triggered by some residual bugs and high timeouts, not by packet drops.</p> <h3 id="case-study-srd-in-amazon-efa">Case Study: SRD in Amazon EFA</h3> <p>The variable-sized datagram design point remains compelling. Amazon built a fabric called <em>Elastic Fabric Adapter (EFA)</em> that presumably adds LLR to UD. I don’t know much about it so I’ll keep it short.</p> <h2 id="rdma-vs-rpcs-a-false-dichotomy">RDMA vs RPCs: A False Dichotomy?</h2> <p>I’m probably wading into a decade-old debate at this point, but it’s not my fault that I was just a swaddling infant a decade ago.</p> <p>Let us take eRPC<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>, an iconic paper at this point. The argument I will explore is whether eRPC implements Infiniband in software over lossy ethernet, and whether its design, while a solid step forward for lossy ethernet, just strengthens the argument for <em>real RDMA/infiniband</em>.</p> <ol> <li>eRPC likes losslessness. They create those conditions by having session credits and limiting dispatched traffic to network BDP. 
Session credits resemble an end-to-end version of CBFC, something also tried in <sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>.</li> <li>eRPC likes contiguous message buffers, an event loop for short messages, and memcpy’ed contiguous buffers for large messages, with memcpy orchestrated by CPU. This feels very close to the workflow I employ in my current codebase using the RDMA-capable Mercury RPC library, except using CPU cycles instead of RDMA.</li> </ol> <p>I wonder what is the difference between “<em>we found a lossless regime in the network and designed a system optimized for a lossless common-case</em>” and “<em>we engineered a lossless regime in the network, and this proves the value of a genuinely lossless fabric.</em>”</p> <p>I also wonder about the value of pushing retries to the application layer when the ultimate behavior you want is lossless. There are plenty of applications that tolerate lossiness end-to-end, and it makes sense for the lowest common denominator to not include retries for general-purpose networks. But datacenters are a controlled high-density environment, and maybe it is easier to permit lossy flows as a second-class citizen in a lossless fabric than the other way round?</p> <p>Also, I think scalability criticisms of infiniband are hardly the killer argument they are made out to be. Truescale is an example of a production fabric that already solved those challenges, and software-level solutions such as <code class="language-plaintext highlighter-rouge">RxD</code> exist in libfabric (multiplexing RDMA over UD). RDMA is not infiniband, and there are multiple CBFC-based approaches to lossless fabrics.</p> <p>Next, I think RDMA doesn’t preclude the ability to do two-sided communication. I think it is not completely accurate to think of RDMA as disaggregated memory. That is, RDMA is not CXL. An RDMA-capable node is essentially still a request/response-based server. 
Maybe it is useful to mentally decouple operations into metadata and data, and use RDMA for the data path (common criticism is unnecessary RTTs for pointer chasing). Mercury<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>, with providers for libfabric etc, provides both messaging-style semantics for small messages and verbs-enabled RDMA via a <em>bulk</em> interface for large messages.</p> <p>Finally, I think the end-to-end argument does not preclude merging of layers for performance. I think it is a neat guiding principle for what to do in the absence of competing concerns, but it does not preemptively override the presence of competing concerns.</p> <h2 id="whats-next">What’s Next?</h2> <p>I want to understand HPE Slingshot and UltraEthernet better, and maybe also start dabbling with CXL.</p> <p>Side note: do you know PCIe uses CBFC? I think the gap between an interconnect, a fabric, and a network is smaller than we think!</p> <h2 id="references">References</h2> <p><em>Updated on 20251008 and later on 20251014 to present a version that may age better.</em></p> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:2"> <p>Homa: A receiver-driven low-latency transport protocol using network priorities, <em>SIGCOMM ‘18: ACM SIGCOMM 2018 Conference</em>, https://dl.acm.org/doi/10.1145/3230543.3230564 <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:6"> <p>Ensō: A Streaming Interface for NIC-Application Communication, <em>17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)</em>, https://www.usenix.org/conference/osdi23/presentation/sadok <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:1"> <p>Datacenter RPCs can be General and Fast, <em>16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19)</em>, https://www.usenix.org/conference/nsdi19/presentation/kalia <a href="#fnref:1" 
class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Harmony: A Congestion-free Datacenter Architecture, <em>21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)</em>, https://www.usenix.org/conference/nsdi24/presentation/agarwal-saksham <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>Mercury: Enabling remote procedure call for high-performance computing, <em>2013 IEEE International Conference on Cluster Computing (CLUSTER)</em>, http://ieeexplore.ieee.org/document/6702617/ <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="systems"/><category term="infiniband"/><summary type="html"><![CDATA[Do we keep repeating certain things in academic work on HPC fabrics?]]></summary></entry><entry><title type="html">QoS In Infiniband Via RDMA-CM</title><link href="https://ankushja.in/blog/2025/ibqos-2/" rel="alternate" type="text/html" title="QoS In Infiniband Via RDMA-CM"/><published>2025-09-23T15:35:14+00:00</published><updated>2025-09-23T15:35:14+00:00</updated><id>https://ankushja.in/blog/2025/ibqos-2</id><content type="html" xml:base="https://ankushja.in/blog/2025/ibqos-2/"><![CDATA[<p><em>Note: this entire post is a lightly edited checkpoint of a conversation with some LLM.</em> <a href="https://ankushja.in/blog/2025/ibqos/">Part 1</a> here.</p> <h2 id="introduction">Introduction</h2> <p>RDMA-CM (RDMA Connection Manager) is the control-plane layer for RDMA. It exists to make connection setup look and feel like sockets: applications can bind to addresses, listen on ports, and accept or connect without worrying about raw InfiniBand identifiers. Underneath, RDMA-CM exchanges the connection metadata, programs the QPs, and hands back an endpoint ready for verbs. 
This is distinct from the verbs API, which only covers the data plane.</p> <h2 id="background-qos-in-infiniband">Background: QoS in InfiniBand</h2> <p>InfiniBand provides Quality of Service using <strong>Service Levels (SLs)</strong>. An SL is a 4-bit value (0–15) carried in each packet’s header. The subnet manager maps SLs onto <strong>Virtual Lanes (VLs)</strong>, which are independent link-level flows with separate buffering and flow control. By assigning flows to different SLs, the fabric can prioritize some traffic or keep classes of traffic from interfering with each other. For a connected QP, the SL is set in the path attributes when the connection is established, and remains fixed for that QP’s traffic.</p> <h2 id="rdma-cm-basics">RDMA-CM Basics</h2> <p>RDMA-CM exposes a socket-like API:</p> <ul> <li> <p>The central handle is an <code class="language-plaintext highlighter-rouge">rdma_cm_id</code>, analogous to a socket fd.</p> </li> <li> <p>Each cm_id belongs to a <strong>port space</strong>, such as <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP</code> (connection-oriented) or <code class="language-plaintext highlighter-rouge">RDMA_PS_UDP</code> (datagrams).</p> </li> <li> <p>Applications use <code class="language-plaintext highlighter-rouge">rdma_bind_addr</code>, <code class="language-plaintext highlighter-rouge">rdma_listen</code>, <code class="language-plaintext highlighter-rouge">rdma_connect</code>, etc., in the same way they would with sockets.</p> </li> </ul> <p>What looks like an IP+port pair in RDMA-CM is just a wrapper. With IPoIB enabled, the “IP” is the interface address (e.g. <code class="language-plaintext highlighter-rouge">10.94.3.105</code>), which RDMA-CM resolves to a GID. Without IPoIB, you can still create connections, but the address is directly a GID (usually expressed in IPv6 link-local form). 
In both cases, the binding resolves to a specific HCA port and GID.</p> <p>The handshake itself is a control-plane exchange: on InfiniBand it uses CM MADs over QP1; on RoCE/iWARP it uses UDP/TCP messages. This is how QP numbers, PSNs, MTU, and SL hints are exchanged.</p> <h2 id="rdma-cm-and-service-ids">RDMA-CM and Service IDs</h2> <p>The key to QoS integration is that RDMA-CM maps the familiar notion of <strong>port numbers</strong> into InfiniBand <strong>Service IDs</strong>. A Service ID is a 64‑bit identifier that the Subnet Manager understands. In <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP</code>, the port number you bind is encoded into the Service ID automatically. This provides a clean classification hook: flows created on different port ranges correspond to different Service IDs.</p> <h2 id="subnet-manager-qos-policies--setup-opensm">Subnet Manager QoS Policies &amp; Setup (OpenSM)</h2> <p>OpenSM can enforce QoS by mapping <strong>Service IDs</strong> (derived from RDMA-CM port numbers) to <strong>Service Levels (SLs)</strong> and then to <strong>Virtual Lanes (VLs)</strong>.</p> <h3 id="1-enable-qos-in-opensm">1) Enable QoS in OpenSM</h3> <p>Edit <code class="language-plaintext highlighter-rouge">/etc/opensm/opensm.conf</code> (or the file you pass with <code class="language-plaintext highlighter-rouge">-F</code>) and set:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>qos TRUE
qos_max_vls 8              # adjust to your ASIC/link support (e.g., 4 or 8)
qos_policy_file /etc/opensm/qos-policy.conf
</code></pre></div></div> <p>You may additionally pin SL→VL and VL arbitration globally here if you want static tables in the options file.</p> <p>Start or restart OpenSM with QoS active:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>systemctl restart opensm    # distro services
# or manually
opensm -Q -F /etc/opensm/opensm.conf
# (-F/--config names the config file, -f is the log file; -Q/--qos is equivalent to qos TRUE; -Y sets an explicit policy file path)
</code></pre></div></div> <h3 id="2-define-qos-policy-portserviceidsl">2) Define QoS policy (port→ServiceID→SL)</h3> <p>Create <code class="language-plaintext highlighter-rouge">/etc/opensm/qos-policy.conf</code> with policy rules. A pragmatic pattern is to carve the RDMA-CM <strong>port space</strong> into classes and assign SLs per <strong>Service ID range</strong>. Example with three classes:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># --- Classes ---------------------------------------------------
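# Rules use OpenSM's simplified policy syntax inside a qos-ulps block: ULP, match : SL
# The Service ID ranges are illustrative; confirm what your stack generates.
qos-ulps
    default : 0
    # Control plane / RPC (low bandwidth, low latency)
    any, service-id 0x0000000000001000-0x0000000000001fff : 4
    # Bulk telemetry / background
    any, service-id 0x0000000000002000-0x0000000000002fff : 1
    # Data path / latency-critical (highest priority)
    any, service-id 0x0000000000003000-0x0000000000003fff : 6
end-qos-ulps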
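</code></pre></div></div> <p>The port-to-Service-ID step can be sketched in a few lines of Python. This is only a sketch, and it rests on an assumption worth verifying on your own stack: that the Service ID follows the Linux <code class="language-plaintext highlighter-rouge">rdma_cm</code> convention of a port-space value (<code class="language-plaintext highlighter-rouge">0x0106</code> for <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP</code>) in the upper bytes followed by the 16-bit port. The helper name <code class="language-plaintext highlighter-rouge">service_id</code> is hypothetical. If the assumption holds, policy ranges may need that port-space prefix, so it is worth printing the IDs for your ports before writing rules.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumed port-space values, taken from the Linux rdma_cm headers.
RDMA_PS_TCP = 0x0106
RDMA_PS_UDP = 0x0111


def service_id(port, port_space=RDMA_PS_TCP):
    """Return the 64-bit Service ID: 6 bytes of port space, then 2 bytes of port."""
    # to_bytes raises OverflowError if the port does not fit in 16 bits.
    raw = port_space.to_bytes(6, "big") + port.to_bytes(2, "big")
    return int.from_bytes(raw, "big")


# Print the Service IDs for a few example ports (hypothetical values).
for port in (0x1000, 0x2000, 0x3010):
    print(f"port {port:#06x} maps to service id {service_id(port):#018x}")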
</code></pre></div></div> <p>Notes:</p> <ul> <li> <p>RDMA-CM encodes <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP</code> + <strong>port</strong> into a 64-bit <strong>Service ID</strong>. Use contiguous <strong>Service ID ranges</strong> to represent your <strong>port ranges</strong>. Confirm the exact mapping in your stack by printing the CM <code class="language-plaintext highlighter-rouge">service_id</code> on connection or using SA queries (see verification below).</p> </li> <li> <p>If you use IPoIB, classification still hinges on <strong>Service ID</strong>, not L3 QoS.</p> </li> </ul> <h3 id="3-map-sls-to-vls-isolation--priority">3) Map SLs to VLs (isolation / priority)</h3> <p>In <code class="language-plaintext highlighter-rouge">opensm.conf</code> (OpenSM reads these tables from its options file rather than the policy file), define SL→VL and VL arbitration so the fabric actually separates traffic:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Map each SL to a VL: 16 comma-separated VLs, indexed by SL. Keep hot classes on distinct VLs.
# Here SL0 uses VL0, SL1 uses VL1, SL4 uses VL2, SL6 uses VL3; every other SL falls back to VL0.
qos_sl2vl 0,1,0,0,2,0,3,0,0,0,0,0,0,0,0,0

# Basic VL arbitration (VL:weight pairs). Give the highest-priority VLs larger weights.
qos_vlarb_high 0:64,1:64,2:96,3:128
qos_vlarb_low 0:32,1:32,2:48,3:64
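</code></pre></div></div> <p>A full SL→VL table has 16 entries, one per SL, and writing it by hand from a sparse mapping is easy to get wrong. The sketch below expands a sparse mapping into the 16 comma-separated values that, as far as I know, OpenSM's <code class="language-plaintext highlighter-rouge">qos_sl2vl</code> option expects. The helper name <code class="language-plaintext highlighter-rouge">build_sl2vl</code> and the choice of VL0 as the fallback are assumptions made for illustration.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def build_sl2vl(mapping, default_vl=0):
    """Expand a sparse SL-to-VL mapping into 16 comma-separated VLs, indexed by SL."""
    for sl, vl in mapping.items():
        # SLs are 0-15; data VLs are 0-14 (VL15 is reserved for management traffic).
        if sl not in range(16) or vl not in range(15):
            raise ValueError(f"invalid SL/VL pair: {sl}/{vl}")
    return ",".join(str(mapping.get(sl, default_vl)) for sl in range(16))


# The mapping used in this post: SL0 to VL0, SL1 to VL1, SL4 to VL2, SL6 to VL3.
print("qos_sl2vl", build_sl2vl({0: 0, 1: 1, 4: 2, 6: 3}))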
</code></pre></div></div> <p>Guidelines:</p> <ul> <li> <p>Keep the number of active VLs modest (4–8). Ensure all HCAs/links support the count you choose.</p> </li> <li> <p>Use distinct VLs for classes that must not interfere. Coalesce low-priority classes on the same VL if you’re VL-limited.</p> </li> </ul> <h3 id="5-apply--persist">4) Apply &amp; persist</h3> <p>Ensure your distro loads the same <code class="language-plaintext highlighter-rouge">opensm.conf</code> and <code class="language-plaintext highlighter-rouge">qos-policy.conf</code> on boot. On systems with multiple HCAs/SM instances, anchor each OpenSM to a GUID/port and reuse the same policy file.</p> <h3 id="verification--troubleshooting">Verification / Troubleshooting</h3> <ul> <li> <p><strong>Check SM picked up QoS:</strong> review <code class="language-plaintext highlighter-rouge">opensm.log</code> for policy parsing, SL2VL, and VLArb tables.</p> </li> <li> <p><strong>Resolve paths with SA:</strong></p> <ul> <li> <p>Get a PathRecord with an explicit <strong>Service ID</strong> to see the returned <strong>SL</strong> (and MTU/rate). Using <code class="language-plaintext highlighter-rouge">saquery</code> (part of infiniband-diags):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  saquery --sgid-to-dgid &lt;sgid&gt;:&lt;dgid&gt; --service-id 0x0000000000003010 --pkey 0x7fff --sl
</code></pre></div> </div> </li> </ul> </li> <li> <p><strong>Confirm on endpoints:</strong> log the CM <code class="language-plaintext highlighter-rouge">service_id</code> and the final QP <strong>SL</strong> your app sees during connection events.</p> </li> <li> <p><strong>Port/VL counters:</strong> use <code class="language-plaintext highlighter-rouge">perfquery</code>/<code class="language-plaintext highlighter-rouge">ibqueryerrors</code> to confirm traffic hits the expected VLs; congestion on the wrong VL hints at SL2VL mismatch.</p> </li> </ul> <h2 id="putting-it-together">Putting It Together</h2> <p>The path from application to QoS is:</p> <p><code class="language-plaintext highlighter-rouge">RDMA-CM port</code> → <code class="language-plaintext highlighter-rouge">Service ID</code> → <code class="language-plaintext highlighter-rouge">Subnet Manager policy</code> → <code class="language-plaintext highlighter-rouge">SL</code> → <code class="language-plaintext highlighter-rouge">VL</code></p> <p>The application simply binds to a port; RDMA-CM turns that into a Service ID; the SM maps Service IDs to SLs; the fabric maps SLs to VLs. 
QoS enforcement is transparent to the app.</p> <h2 id="practical-notes">Practical Notes</h2> <ul> <li> <p>With IPoIB: CM addresses look like normal IPs, but they resolve to GIDs under the hood.</p> </li> <li> <p>Without IPoIB: you can still use RDMA-CM with GID-based addresses; port numbers still become Service IDs.</p> </li> <li> <p>If no QoS policy rule matches, the default SL (often 0) is used.</p> </li> </ul> <h2 id="conclusion">Conclusion</h2> <p>RDMA-CM is the bridge between socket-like connection setup and InfiniBand’s fabric-level QoS. It hides GIDs and QP details from the application, while exposing a port number abstraction that the Subnet Manager can map to Service Levels. This is what allows administrators to classify traffic by port ranges and enforce fabric QoS without requiring applications to manipulate QP attributes directly.</p>]]></content><author><name></name></author><category term="infiniband"/><category term="infiniband"/><category term="qos"/><summary type="html"><![CDATA[Configuring QoS on IB via RDMA-CM]]></summary></entry><entry><title type="html">QoS in Infiniband</title><link href="https://ankushja.in/blog/2025/ibqos/" rel="alternate" type="text/html" title="QoS in Infiniband"/><published>2025-09-20T15:31:23+00:00</published><updated>2025-09-20T15:31:23+00:00</updated><id>https://ankushja.in/blog/2025/ibqos</id><content type="html" xml:base="https://ankushja.in/blog/2025/ibqos/"><![CDATA[<p>These are some sloppy notes on my attempts to figure out how to configure QoS in infiniband-based stacks. They are in rough shape. My conclusion so far seems to be that libfabric hard-codes a service level and does not allow it to be configured. 
I don’t know if it’s possible to call <code class="language-plaintext highlighter-rouge">ibv_modify_qp</code> or something to do so after verbs has initialized.</p> <p>Edit: post now ends on an optimistic note.</p> <h2 id="basic-ib-concepts">Basic IB Concepts</h2> <h3 id="simple-case-one-subnet">Simple Case: One Subnet</h3> <ul> <li>Subnet Manager (<code class="language-plaintext highlighter-rouge">SM</code>): manages the subnet (thx). Like DHCP + ARP + more things.</li> <li><code class="language-plaintext highlighter-rouge">lid</code>: like IP address, assigned by subnet manager, you can ping it, send to it, whatever.</li> </ul> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">ibstat</span>
<span class="no">CA</span> <span class="err">'</span><span class="n">qib0</span><span class="err">'</span>
        <span class="o">...</span>
        <span class="nc">Port</span> <span class="mi">1</span><span class="o">:</span>
                <span class="nl">State:</span> <span class="nc">Active</span>
                <span class="nc">Physical</span> <span class="nl">state:</span> <span class="nc">LinkUp</span>
                <span class="nc">Base</span> <span class="nl">lid:</span> <span class="mi">475</span>
                <span class="no">SM</span> <span class="nl">lid:</span> <span class="mi">10</span>
                <span class="o">...</span>
</code></pre></div></div> <p>SM is lid=10, and this node has an HCA (NIC) with lid=475.</p> <h3 id="complex-case-multiple-subnets">Complex Case: Multiple Subnets</h3> <p>Here routing comes into the picture. Won’t go into details, but simple/sloppy version:</p> <ul> <li><code class="language-plaintext highlighter-rouge">guid: u64</code>: like MAC address, unique per device.</li> <li><code class="language-plaintext highlighter-rouge">gid: u128</code>: globally unique ID, <code class="language-plaintext highlighter-rouge">sm_id: u64</code> + <code class="language-plaintext highlighter-rouge">guid: u64</code></li> </ul> <p>Switches maintain routes indexed by <code class="language-plaintext highlighter-rouge">sm_id</code>, hosts send packets to <code class="language-plaintext highlighter-rouge">gid</code>s, and switches look up the relevant destination and forward. Not relevant to us going forward.</p> <h3 id="miscellaneous">Miscellaneous</h3> <ul> <li>Service Level (<code class="language-plaintext highlighter-rouge">SL</code>): a 4-bit priority carried by a packet</li> <li>Virtual Lane (<code class="language-plaintext highlighter-rouge">VL</code>): hardware-level egress lanes over a link, which are what actually provide QoS isolation</li> <li><code class="language-plaintext highlighter-rouge">SL2VL</code>: a table mapping SLs to VLs. 
SLs confer QoS properties only if they are mapped to separate VLs, and as per specific configuration in <code class="language-plaintext highlighter-rouge">VLArb</code> (the VL arbitration tables)</li> <li>Subnet Management Packet (<code class="language-plaintext highlighter-rouge">SMP</code>): used by SM to configure and query fabric components (ports and switches)</li> <li>Management Datagram (<code class="language-plaintext highlighter-rouge">MAD</code>): management message format, UMAD is the linux interface to send MAD packets</li> <li>Subnet Administrator (<code class="language-plaintext highlighter-rouge">SA</code>): a management service that answers queries (part of the Subnet Manager?)</li> <li>OpenSM: an open-source software-based subnet manager (like <code class="language-plaintext highlighter-rouge">dnsmasq</code> for DHCP). Fabrics may have proprietary ones.</li> <li><code class="language-plaintext highlighter-rouge">VL15</code>: a dedicated VL for SMP traffic (<code class="language-plaintext highlighter-rouge">QP0</code>: a special queue pair dedicated for it?)</li> </ul> <h2 id="how-many-vls-do-i-have">How Many VLs Do I Have?</h2> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">smpquery</span> <span class="n">portinfo</span> <span class="o">&lt;</span><span class="n">lid</span><span class="o">&gt;</span> <span class="o">&lt;</span><span class="n">portnum</span><span class="o">&gt;</span>
<span class="nl">ibwarn:</span> <span class="o">[</span><span class="mi">1006476</span><span class="o">]</span> <span class="nl">mad_rpc_open_port:</span> <span class="n">can</span><span class="err">'</span><span class="n">t</span> <span class="n">open</span> <span class="no">UMAD</span> <span class="nf">port</span> <span class="o">((</span><span class="kc">null</span><span class="o">):</span><span class="mi">0</span><span class="o">)</span>
<span class="nl">smpquery:</span> <span class="nl">iberror:</span> <span class="nl">failed:</span> <span class="nc">Failed</span> <span class="n">to</span> <span class="n">open</span> <span class="err">'</span><span class="o">(</span><span class="kc">null</span><span class="o">)</span><span class="err">'</span> <span class="n">port</span> <span class="sc">'0'</span>

<span class="err">$</span> <span class="n">sudo</span> <span class="n">smpquery</span> <span class="n">portinfo</span> <span class="o">&lt;</span><span class="n">lid</span><span class="o">&gt;</span> <span class="o">&lt;</span><span class="n">portnum</span><span class="o">&gt;</span>
<span class="nl">CapMask:</span><span class="o">.........................</span><span class="mh">0x7610868</span>
                                <span class="nc">IsTrapSupported</span>
                                <span class="nc">IsAutomaticMigrationSupported</span>
                                <span class="nc">IsSLMappingSupported</span>
                                <span class="nc">IsSystemImageGUIDsupported</span>
                                <span class="nc">IsCommunicatonManagementSupported</span>
                                <span class="nc">IsDRNoticeSupported</span>
                                <span class="nc">IsCapabilityMaskNoticeSupported</span>
                                <span class="nc">IsLinkRoundTripLatencySupported</span>
                                <span class="nc">IsClientRegistrationSupported</span>
                                <span class="nc">IsOtherLocalChangesNoticeSupported</span>
<span class="nl">VLCap:</span><span class="o">...........................</span><span class="na">VL0</span><span class="o">-</span><span class="mi">1</span>
<span class="nl">VLHighLimit:</span><span class="o">.....................</span><span class="mi">0</span>
<span class="nl">VLArbHighCap:</span><span class="o">....................</span><span class="mi">16</span>
<span class="nl">VLArbLowCap:</span><span class="o">.....................</span><span class="mi">16</span>
<span class="nl">VLStallCount:</span><span class="o">....................</span><span class="mi">0</span>
<span class="nl">OperVLs:</span><span class="o">.........................</span><span class="na">VL0</span>
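</code></pre></div></div> <p>When checking more than a couple of ports, the two fields that matter here can be pulled out of that output with a short script. This is a sketch that assumes the line format shown above (field name, colon, a run of dots, then the value); <code class="language-plaintext highlighter-rouge">vl_fields</code> is a hypothetical helper name, and the sample text is copied from the output above.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re


def vl_fields(portinfo_text):
    """Extract the VLCap and OperVLs values from smpquery portinfo output."""
    fields = {}
    for line in portinfo_text.splitlines():
        # Match only the two fields of interest: name, colon, dots, value.
        match = re.match(r"(VLCap|OperVLs):\.+(\S+)", line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


sample = "VLCap:...........................VL0-1\nOperVLs:.........................VL0"
print(vl_fields(sample))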
</code></pre></div></div> <p>Takeaways:</p> <ol> <li>SLs/VLs are supported (<code class="language-plaintext highlighter-rouge">CapMask</code> has <code class="language-plaintext highlighter-rouge">IsSLMappingSupported</code>).</li> <li><code class="language-plaintext highlighter-rouge">VLCap</code> suggests that two virtual lanes are possible (<code class="language-plaintext highlighter-rouge">VL0</code> and <code class="language-plaintext highlighter-rouge">VL1</code>).</li> <li><code class="language-plaintext highlighter-rouge">OperVLs</code> suggests that only one VL (<code class="language-plaintext highlighter-rouge">VL0</code>) is operational; <code class="language-plaintext highlighter-rouge">VL1</code> is not configured.</li> </ol> <h2 id="what-is-the-configured-qos-state">What Is The Configured QoS State?</h2> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">smpquery</span> <span class="n">sl2vl</span> <span class="mi">475</span> <span class="mi">1</span>
<span class="err">#</span> <span class="no">SL2VL</span> <span class="nl">table:</span> <span class="nc">Lid</span> <span class="mi">475</span>
<span class="err">#</span>                 <span class="nl">SL:</span> <span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">1</span><span class="o">|</span> <span class="mi">2</span><span class="o">|</span> <span class="mi">3</span><span class="o">|</span> <span class="mi">4</span><span class="o">|</span> <span class="mi">5</span><span class="o">|</span> <span class="mi">6</span><span class="o">|</span> <span class="mi">7</span><span class="o">|</span> <span class="mi">8</span><span class="o">|</span> <span class="mi">9</span><span class="o">|</span><span class="mi">10</span><span class="o">|</span><span class="mi">11</span><span class="o">|</span><span class="mi">12</span><span class="o">|</span><span class="mi">13</span><span class="o">|</span><span class="mi">14</span><span class="o">|</span><span class="mi">15</span><span class="o">|</span>
<span class="nl">ports:</span> <span class="n">in</span>  <span class="mi">0</span><span class="o">,</span> <span class="n">out</span>  <span class="mi">0</span><span class="o">:</span> <span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span> <span class="mi">0</span><span class="o">|</span>

<span class="err">$</span> <span class="n">sudo</span> <span class="n">smpquery</span> <span class="n">vlarb</span> <span class="mi">475</span> <span class="mi">1</span>
<span class="err">#</span> <span class="nc">VLArbitration</span> <span class="nl">tables:</span> <span class="nc">Lid</span> <span class="mi">475</span> <span class="n">port</span> <span class="mi">1</span> <span class="nc">LowCap</span> <span class="mi">16</span> <span class="nc">HighCap</span> <span class="mi">16</span>
<span class="err">#</span> <span class="nc">Low</span> <span class="n">priority</span> <span class="no">VL</span> <span class="nc">Arbitration</span> <span class="nl">Table:</span>
<span class="no">VL</span>    <span class="o">:</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span>
<span class="nl">WEIGHT:</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span>
<span class="err">#</span> <span class="nc">High</span> <span class="n">priority</span> <span class="no">VL</span> <span class="nc">Arbitration</span> <span class="nl">Table:</span>
<span class="no">VL</span>    <span class="o">:</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span>
<span class="nl">WEIGHT:</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span><span class="mh">0x0</span> <span class="o">|</span>
</code></pre></div></div> <p>Takeaways:</p> <ol> <li><code class="language-plaintext highlighter-rouge">sl2vl</code> says that we have 16 SLs, but all are mapped to VL0, so no QoS.</li> <li><code class="language-plaintext highlighter-rouge">vlarb</code> is all <code class="language-plaintext highlighter-rouge">VL0</code>. There are two levels of prioritization. The first level is between the <code class="language-plaintext highlighter-rouge">vlarb_high</code> and <code class="language-plaintext highlighter-rouge">vlarb_low</code> tables. The second level is within a table, between its <code class="language-plaintext highlighter-rouge">(VL, weight)</code> entries. The relative weight of the high and low tables is in <code class="language-plaintext highlighter-rouge">VLHighLimit</code>.</li> </ol> <h2 id="configuring-qos-in-opensm">Configuring QoS in OpenSM</h2> <p>You edit <code class="language-plaintext highlighter-rouge">/etc/opensm/opensm.conf</code> with something like this:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"># enable QoS
</span><span class="n">qos</span> <span class="n">TRUE</span>

<span class="cp"># all prefixes are qos_
# specific sub-prefixes: qos_ca_, qos_rtr_, qos_sw0_, qos_swe_
# (for specific config for CAs, routers, switch port 0's, and switches)
</span>
<span class="n">qos_max_vls</span> <span class="mi">2</span>
<span class="cp"># send this many from high-priority first (255 ~ infinite priority)
</span><span class="n">qos_high_limit</span> <span class="mi">255</span>
<span class="cp"># intra-high priority
</span><span class="n">qos_vlarb_high</span> <span class="mi">1</span><span class="o">:</span><span class="mi">192</span><span class="p">,</span> <span class="mi">2</span><span class="o">:</span><span class="mi">128</span><span class="p">,</span> <span class="mi">3</span><span class="o">:</span><span class="mi">64</span>
<span class="n">qos_vlarb_low</span> <span class="mi">0</span><span class="o">:</span><span class="mi">64</span>
<span class="cp"># SLs [0, 4) are VL0, [4, 8) are VL1
</span><span class="n">qos_sl2vl</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span>
</code></pre></div></div> <p>The policy for relative weights between the high-priority table and the low-priority table is not clear to me. That is, what exactly does <code class="language-plaintext highlighter-rouge">qos_vlarb_high</code> do?</p> <p>Note: <a href="https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg04092.html">https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg04092.html</a> is a good description of the <code class="language-plaintext highlighter-rouge">qos_high_limit</code>. It seems to me that the high class preempts the low class up to the high limit (preemption happens between packets, not mid-packet).</p> <p>There’s also ULP-based QoS I think (ULP: Upper Layer Protocol). It can maybe target service IDs for MPI, Lustre etc.</p> <h2 id="service-ids">Service IDs</h2> <ul> <li>Lustre etc. can register a Service Record with SA, containing (Service ID, server LID, service name)</li> <li>RDMA-CM (Connection Manager) is involved somehow</li> <li>RDMA-CM divides the SID space into port spaces. <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP</code> is a 16-bit namespace.</li> </ul> <p>So SID can be <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP &lt;&lt; 16 | service_port</code>, or <code class="language-plaintext highlighter-rouge">RDMA_PS_IB &lt;&lt; 16 | qpn</code>.</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">sudo</span> <span class="n">saquery</span> <span class="o">-</span><span class="n">S</span>
<span class="o">&lt;</span><span class="n">nothing</span><span class="o">&gt;</span>
<span class="err">$</span> <span class="n">sudo</span> <span class="n">saquery</span>
<span class="o">&lt;</span><span class="n">bunch</span> <span class="n">of</span> <span class="n">node</span> <span class="n">records</span><span class="o">&gt;</span>
<span class="cp"># On Lustre root
</span><span class="err">$</span> <span class="n">cat</span> <span class="o">/</span><span class="n">sys</span><span class="o">/</span><span class="n">module</span><span class="o">/</span><span class="n">ko2iblnd</span><span class="o">/</span><span class="n">parameters</span><span class="o">/</span><span class="n">service</span>
<span class="mi">987</span>
</code></pre></div></div> <p>So it seems that Lustre’s SID is <code class="language-plaintext highlighter-rouge">RDMA_PS_TCP &lt;&lt; 16 | 987</code></p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">rdma</span> <span class="n">resource</span> <span class="n">show</span> <span class="n">cm_id</span> <span class="o">|</span> <span class="n">grep</span> <span class="n">LISTEN</span>
<span class="n">link</span> <span class="n">qib0</span><span class="o">/</span><span class="mi">1</span> <span class="n">cm</span><span class="o">-</span><span class="n">idn</span> <span class="mi">0</span> <span class="n">state</span> <span class="n">LISTEN</span> <span class="n">ps</span> <span class="n">TCP</span> <span class="n">comm</span> <span class="p">[</span><span class="n">ko2iblnd</span><span class="p">]</span> <span class="n">src</span><span class="o">-</span><span class="n">addr</span> <span class="mi">10</span><span class="p">.</span><span class="mi">94</span><span class="p">.</span><span class="n">xxx</span><span class="p">.</span><span class="n">yyy</span><span class="o">:</span><span class="mi">987</span> <span class="n">dst</span><span class="o">-</span><span class="n">addr</span> <span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="p">.</span><span class="mi">0</span><span class="o">:</span><span class="mi">0</span>
<span class="cp"># P=987, 0x106 is the `RDMA_PS_TCP`
</span><span class="err">$</span> <span class="n">P</span><span class="o">=</span><span class="mi">987</span><span class="p">;</span> <span class="n">printf</span> <span class="err">'</span><span class="mi">0</span><span class="n">x</span><span class="o">%</span><span class="mo">016</span><span class="n">x</span><span class="err">\</span><span class="n">n</span><span class="err">'</span> <span class="err">$</span><span class="p">((</span> <span class="p">(</span><span class="mh">0x106</span><span class="o">&lt;&lt;</span><span class="mi">16</span><span class="p">)</span> <span class="o">|</span> <span class="n">P</span> <span class="p">))</span>
<span class="mh">0x00000000010603db</span> <span class="err">#</span> <span class="n">Service</span> <span class="n">ID</span> <span class="k">for</span> <span class="n">Lustre</span>
</code></pre></div></div> <p>This also shows port 987 for Lustre. The following policy maps Lustre to SL1:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp"># /var/cache/opensm/qos-policy.conf
</span>
<span class="n">qos</span><span class="o">-</span><span class="n">ulps</span>
<span class="err"> </span> <span class="k">default</span> <span class="o">:</span> <span class="mi">0</span>
<span class="err"> </span> <span class="n">any</span><span class="p">,</span> <span class="n">service</span><span class="o">-</span><span class="n">id</span> <span class="mh">0x00000000010603DB</span> <span class="o">:</span> <span class="mi">1</span>
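  # Hypothetical, unverified sketch: assuming service-id accepts a range (the
  # way port-num does for sdp rules), ports 49152-53247 in the RDMA_PS_TCP
  # space (0x106) would be matched by something like:
  # any, service-id 0x000000000106C000-0x000000000106CFFF : 2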
<span class="n">end</span><span class="o">-</span><span class="n">qos</span><span class="o">-</span><span class="n">ulps</span>
</code></pre></div></div> <h3 id="qos-in-psm">QoS in PSM</h3> <ul> <li>Use env variable <code class="language-plaintext highlighter-rouge">IPATH_SL</code>.</li> </ul> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">psmi_getenv</span><span class="p">(</span><span class="s">"IPATH_SL"</span><span class="p">,</span> <span class="s">"IB outging ServiceLevel number (default 0)"</span><span class="p">,</span>
		<span class="n">PSMI_ENVVAR_LEVEL_USER</span><span class="p">,</span> <span class="n">PSMI_ENVVAR_TYPE_LONG</span><span class="p">,</span>
		<span class="p">(</span><span class="k">union</span> <span class="n">psmi_envvar_val</span><span class="p">)</span> <span class="n">PSMI_SL_DEFAULT</span><span class="p">,</span>
		<span class="o">&amp;</span><span class="n">env_sl</span><span class="p">))</span> <span class="p">{</span>
<span class="n">opts</span><span class="p">.</span><span class="n">outsl</span> <span class="o">=</span> <span class="n">env_sl</span><span class="p">.</span><span class="n">e_long</span><span class="p">;</span>
<span class="p">}</span>

<span class="c1">// This seems to be older code for PSM v1.07 to 1.10. Head is 1.16</span>
<span class="c1">// No need to set VL in head, will use </span>
<span class="cp">#if (PSM_VERNO &gt;= 0x0107) &amp;&amp; (PSM_VERNO &lt;= 0x010a)
</span><span class="p">{</span>
  <span class="k">union</span> <span class="n">psmi_envvar_val</span> <span class="n">env_vl</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">psmi_getenv</span><span class="p">(</span><span class="s">"IPATH_VL"</span><span class="p">,</span> <span class="s">"IB outging VirtualLane (default 0)"</span><span class="p">,</span>
		   <span class="n">PSMI_ENVVAR_LEVEL_USER</span><span class="p">,</span> <span class="n">PSMI_ENVVAR_TYPE_LONG</span><span class="p">,</span>
		   <span class="p">(</span><span class="k">union</span> <span class="n">psmi_envvar_val</span><span class="p">)</span><span class="mi">0</span><span class="p">,</span>
		   <span class="o">&amp;</span><span class="n">env_vl</span><span class="p">))</span> <span class="p">{</span>
<span class="n">opts</span><span class="p">.</span><span class="n">outvl</span> <span class="o">=</span> <span class="n">env_vl</span><span class="p">.</span><span class="n">e_long</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span>
<span class="cp">#endif
</span>
</code></pre></div></div> <h3 id="qos-in-libfabricmercury">QoS in libfabric/Mercury</h3> <p>Traffic classes are in <code class="language-plaintext highlighter-rouge">fi_tx_attr-&gt;tclass (u32)</code>: <a href="https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html">https://ofiwg.github.io/libfabric/main/man/fi_endpoint.3.html</a></p> <p>Mercury (since v2.4):</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hg_opts</span><span class="p">.</span><span class="n">na_init_info</span><span class="p">.</span><span class="n">traffic_class</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="p">)(</span><span class="n">desired_sl</span> <span class="o">&lt;&lt;</span> <span class="mi">5</span><span class="p">);</span> <span class="c1">// SL -&gt; tclass</span>
</code></pre></div></div> <p>Maybe something here: <a href="https://docs.nvidia.com/doca/sdk/rdma%2Bover%2Bconverged%2Bethernet/index.html">https://docs.nvidia.com/doca/sdk/rdma%2Bover%2Bconverged%2Bethernet/index.html</a></p> <p>Things I’ve established so far:</p> <ol> <li>I’m interested in <code class="language-plaintext highlighter-rouge">verbs; ofi_rxm</code></li> <li><code class="language-plaintext highlighter-rouge">ofi_rxm</code> just passes <code class="language-plaintext highlighter-rouge">tclass</code> to the core provider (which is verbs)</li> <li>I can’t find out what verbs does with it. It seems to use both <code class="language-plaintext highlighter-rouge">ibv</code> APIs and <code class="language-plaintext highlighter-rouge">rdma-cm</code> APIs. Looks like I may be looking for <code class="language-plaintext highlighter-rouge">rdma_set_option(RDMA_OPTION_ID_TOS)</code>. But nothing calls that.</li> </ol> <h3 id="update-20250923">Update: 20250923</h3> <p>QoS in libfabric verbs provider: The verbs provider implements QoS through InfiniBand’s native Service Level (SL) mechanism. However, unlike other libfabric providers (e.g., CXI), verbs has no traffic class (tclass) support. 
The SL is essentially fixed at endpoint creation time and cannot be modified at runtime without recreating the address handle, which would require modifying the libfabric source code.</p> <ul> <li>SL set in AH: <code class="language-plaintext highlighter-rouge">prov/verbs/src/verbs_dgram_av.c:65</code> - where <code class="language-plaintext highlighter-rouge">ah_attr.sl</code> gets the SL value during address handle creation</li> <li>SL source: <code class="language-plaintext highlighter-rouge">prov/verbs/src/verbs_ep.c:840</code> - where endpoint gets SL from subnet manager default <code class="language-plaintext highlighter-rouge">port_attr.sm_sl</code></li> <li>No tclass support: <code class="language-plaintext highlighter-rouge">prov/verbs/src/verbs_ep.c:358</code> - <code class="language-plaintext highlighter-rouge">vrb_ep_setopt()</code> function only handles CUDA options, no tclass</li> <li>Compare CXI: <code class="language-plaintext highlighter-rouge">prov/cxi/src/cxip_ep.c:1148</code> - shows how other providers handle <code class="language-plaintext highlighter-rouge">FI_OPT_CXI_SET_TCLASS</code></li> <li>Traffic class constants: <code class="language-plaintext highlighter-rouge">include/rdma/fabric.h:357</code> - the <code class="language-plaintext highlighter-rouge">FI_TC_*</code> enums that verbs doesn’t use</li> </ul> <h3 id="update-20250923-10-mins">Update: 20250923 +10 mins</h3> <p>So libfabric has betrayed us, but can we assign the flow a service level anyway, using other classification mechanisms, like that ULP stuff?</p> <p>Our flow is that we use Mercury to bind to a <code class="language-plaintext highlighter-rouge">verbs; ofi_rxm</code> endpoint, and exchange those addresses with friends. 
What does that address look like?</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Our</span> <span class="n">mercury</span> <span class="n">address</span><span class="p">:</span> <span class="n">ofi</span><span class="o">+</span><span class="n">verbs</span><span class="p">;</span><span class="n">ofi_rxm</span><span class="p">:</span><span class="o">//</span><span class="mf">10.94</span><span class="p">.</span><span class="mf">3.29</span><span class="p">:</span><span class="mi">56504</span>
</code></pre></div></div> <p>Spicy.</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">$</span> <span class="n">rdma</span> <span class="n">resource</span> <span class="n">show</span> <span class="n">cm_id</span> <span class="o">|</span> <span class="n">grep</span> <span class="no">LISTEN</span>
<span class="o">...</span>
<span class="n">link</span> <span class="n">qib0</span><span class="o">/</span><span class="mi">1</span> <span class="n">cm</span><span class="o">-</span><span class="n">idn</span> <span class="mi">17325</span> <span class="n">state</span> <span class="no">LISTEN</span> <span class="n">ps</span> <span class="no">TCP</span> <span class="n">pid</span> <span class="mi">1344261</span> <span class="n">comm</span> <span class="n">controller_main</span> <span class="n">src</span><span class="o">-</span><span class="n">addr</span> <span class="mf">10.94</span><span class="o">.</span><span class="mf">3.29</span><span class="o">:</span><span class="mi">56504</span> <span class="n">dst</span><span class="o">-</span><span class="n">addr</span> <span class="mf">0.0</span><span class="o">.</span><span class="mf">0.0</span><span class="o">:</span><span class="mi">0</span>
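# Sketch, assuming this listener's service ID is built like Lustre's above
# (RDMA_PS_TCP = 0x106, as the `ps TCP` column suggests), for port 56504:
$ P=56504; printf '0x%016x\n' $(( (0x106&lt;&lt;16) | P ))
0x000000000106dcb8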
</code></pre></div></div> <p>So it seems that this does go through RDMA-CM. So if we ask Mercury to bind on specific port ranges, I think we can still assign them a QoS using a port range in the ULP thing.</p> <h2 id="also-read">Also Read</h2> <ul> <li><a href="https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1178075141/Understanding+Basic+InfiniBand+QoS">https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1178075141/Understanding+Basic+InfiniBand+QoS</a></li> <li><a href="https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1177878529/Getting+Started+with+InfiniBand+QoS">https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1177878529/Getting+Started+with+InfiniBand+QoS</a></li> <li><a href="https://web-docs.gsi.de/~vpenso/notes/posts/hpc/network/infiniband/subnet-manager.html">https://web-docs.gsi.de/~vpenso/notes/posts/hpc/network/infiniband/subnet-manager.html</a></li> <li><a href="https://docs.nvidia.com/networking/display/mlnxofedv51258060/opensm">https://docs.nvidia.com/networking/display/mlnxofedv51258060/opensm</a></li> <li><a href="https://docs.nvidia.com/networking/display/mlnxofedv461000/qos+-+quality+of+service">https://docs.nvidia.com/networking/display/mlnxofedv461000/qos+-+quality+of+service</a></li> </ul> <p>Lustre discussions:</p> <ul> <li><a href="https://wiki.whamcloud.com/display/LNet/Lustre+QoS">https://wiki.whamcloud.com/display/LNet/Lustre+QoS</a> – see this!</li> <li><a href="https://groups.google.com/g/lustre-discuss-list/c/n6sdj-e5LNA">https://groups.google.com/g/lustre-discuss-list/c/n6sdj-e5LNA</a></li> <li><a href="https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg04092.html">https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg04092.html</a></li> <li><a 
href="https://admire-eurohpc.eu/wp-content/uploads/2023/12/Lustre-QoS-Barcelona-GA-Dec-2023.pptx.pdf">https://admire-eurohpc.eu/wp-content/uploads/2023/12/Lustre-QoS-Barcelona-GA-Dec-2023.pptx.pdf</a></li> </ul> <p>Intel Fabric Suite (confirms Lustre ServiceID)</p> <ul> <li><a href="https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_FabricSuite_Fabric_Manager_UG_H76468_v15_0.pdf">https://www.intel.com/content/dam/support/us/en/documents/network-and-i-o/fabric-products/Intel_OP_FabricSuite_Fabric_Manager_UG_H76468_v15_0.pdf</a></li> </ul>]]></content><author><name></name></author><category term="infiniband"/><category term="infiniband"/><summary type="html"><![CDATA[Notes on configuring QoS in Infiniband]]></summary></entry><entry><title type="html">Prediction and Adaptation in Decisions</title><link href="https://ankushja.in/blog/2025/prediction-adaptation/" rel="alternate" type="text/html" title="Prediction and Adaptation in Decisions"/><published>2025-07-26T16:59:58+00:00</published><updated>2025-07-26T16:59:58+00:00</updated><id>https://ankushja.in/blog/2025/prediction-adaptation</id><content type="html" xml:base="https://ankushja.in/blog/2025/prediction-adaptation/"><![CDATA[<p>Some possibly non-coherent thoughts on decision-making:</p> <p>Any autonomous system must, by definition, take decisions. It is rational to consider the utility of the decision space and pick a nice point. We draw on our experience and simulate the impact of different decisions in our head, and then we pick one. Is that it?</p> <p>No, right? Decisions are not irrevocable. You can adapt. If you accidentally oversped, you can apply brakes.</p> <p>Predictive decisions rely on implicit or explicit modeling. Models are, in the best case, an nth-order approximation of a complex system, with ideally a decent predictive value and a low residual. 
Intuition, experience, and wisdom are also just models.</p> <p>Models are by definition imperfect—a perfect model would also necessitate a perfect simulation of all reality. But they can be, and are, more or less imperfect. We all start off with more imperfect models, and (unless maladaptive) use the residuals from using their predictions to refine them. The decision loop, therefore, may be defined to have these components:</p> <ol> <li>Observe inputs, and compute <code class="language-plaintext highlighter-rouge">decision = f(input, intent)</code></li> <li>Observe the impact of the decision and compute a course correction if necessary</li> <li>Use the residual to update your hypothesis</li> </ol> <p>Or a more formal attempt:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">:</span> <span class="n">m_init</span>   <span class="c1"># Our initial model
</span><span class="n">intent</span><span class="p">:</span> <span class="n">i_init</span>  <span class="c1"># Some mysterious missing ingredient
</span>
<span class="k">while</span> <span class="n">Alive</span><span class="p">:</span>
  <span class="n">reality</span> <span class="o">=</span> <span class="nf">input</span><span class="p">()</span>
  <span class="n">decision</span> <span class="o">=</span> <span class="nf">f</span><span class="p">(</span><span class="n">reality</span><span class="p">,</span> <span class="n">intent</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span>
  <span class="n">impact</span> <span class="o">=</span> <span class="n">reality</span><span class="p">.</span><span class="nf">apply</span><span class="p">(</span><span class="n">decision</span><span class="p">)</span>
  <span class="n">error</span> <span class="o">=</span> <span class="n">intent</span> <span class="o">-</span> <span class="n">impact</span>
  <span class="n">model</span><span class="p">.</span><span class="nf">update</span><span class="p">(</span><span class="n">error</span><span class="p">)</span>
  <span class="n">intent</span><span class="p">.</span><span class="nf">update</span><span class="p">(</span><span class="n">reality</span><span class="p">,</span> <span class="n">impact</span><span class="p">)</span>
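
# A runnable toy instance of the loop above (my own illustrative assumptions:
# reality is a scalar system with an unknown gain, the model is an estimate of
# that gain, and intent stays fixed, so intent.update is omitted).
def run(steps=20, true_gain=2.0, intent=10.0):
    gain_estimate = 1.0                          # m_init, deliberately wrong
    for _ in range(steps):
        decision = intent / gain_estimate        # f(reality, intent, model)
        impact = true_gain * decision            # reality.apply(decision)
        error = intent - impact                  # residual
        gain_estimate -= 0.5 * error / decision  # model.update(error)
    return gain_estimate                         # approaches true_gain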
</code></pre></div></div> <h2 id="what-is-an-optimal-decision">What Is An Optimal Decision?</h2> <p>It is tempting to then say that the quality of the initial decision does not matter, because it can always be updated. That models the system as a stochastic process, and is true insofar as that model is a reasonable approximation. But it is not: human lifetimes are finite, bad decisions have a time cost and a material cost, there is no recovering from jumping off a cliff, and prior actions also affect other agents’ models (what we call reputation or trust).</p> <p>It may then be tempting to say that you want the best model before you take any decisions. But models are by definition updated by taking decisions. Scheduling of decisions is a useful escape hatch/degree of freedom: you may decide to read a book to have a better model before taking a decision with a bigger modeled impact. Scheduling is perhaps a meta-decision: choosing to <em>explore</em> (deferring the big decision) rather than <em>exploit</em>.</p> <p>Deferring is useful insofar as the marginal improvement in your model outweighs the opportunity cost of delaying a decision. Two cases where deferring is bad:</p> <ol> <li>There exists no decision that meaningfully enhances the utility of the deferred decision</li> <li>The opportunity cost of deferment is large</li> </ol> <p><strong>Connections With Standard Terminology</strong> I think open-loop and closed-loop control in control theory are not appropriate parallels. What I am describing is maybe an adaptive closed-loop system? But an autonomous, adaptive, completely programmable closed-loop system where decisions update both the model and the intent. Or just intelligent and conscious life?</p> <h2 id="why-is-this-useful">Why Is This Useful</h2> <p><strong>How to design systems</strong>. 
If a system requires some information that is, by definition, not available at bootstrap, adaptive mechanisms are necessary to solve the problem.</p> <p><strong>Autonomous intelligence, maybe?</strong> With current ML models, backpropagation is a progressive hypothesis refinement loop, but without an intent. And then model weights are frozen and they are tasked with prediction. But being static, they are not refined, unless we account for periodic batch refinement that is conducted with additional data gathered since the previous run.</p> <p>Are models sufficiently adaptive? Is it possible that even with static weights, they learn to simulate memories etc. within those weights? What are the limits of computations that an algorithm running on top of a static machine can express, versus those of an algorithm that can mutate the machine itself?</p>]]></content><author><name></name></author><category term="systems,"/><category term="ml"/><category term="reasoning"/><category term="ml"/><category term="systems"/><summary type="html"><![CDATA[How to model decision-making itself?]]></summary></entry><entry><title type="html">Notes on Deepseek 3FS Filesystem</title><link href="https://ankushja.in/blog/2025/notes-on-deepseek-3fs/" rel="alternate" type="text/html" title="Notes on Deepseek 3FS Filesystem"/><published>2025-03-02T14:19:21+00:00</published><updated>2025-03-02T14:19:21+00:00</updated><id>https://ankushja.in/blog/2025/notes-on-deepseek-3fs</id><content type="html" xml:base="https://ankushja.in/blog/2025/notes-on-deepseek-3fs/"><![CDATA[<p>This is just me making some quick notes on the 3FS Parallel Filesystem from Deepseek<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. 
Caveat: while I have opinions on all systems, filesystems are strictly not my day job. I have to go back and think about a bunch of things that don’t quite fit together yet, and this post will likely have a bunch of mistakes.</p> <h1 id="services-in-3fs">Services in 3FS</h1> <ul> <li>Cluster manager: highly available, manages membership, config etc (uses ZooKeeper?)</li> <li>Metadata service: uses FoundationDB</li> <li>Storage service (chunk store?)</li> <li>Client (two implementations: FUSE-based, and a more performant one.)</li> </ul> <h1 id="metadata-in-3fs">Metadata in 3FS</h1> <h3 id="the-ongoing-metadata-debate">The Ongoing Metadata Debate</h3> <p>Whether we need object stores or filesystem semantics has been a topic of conversation recently<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>. My own views so far:</p> <ul> <li>I buy the argument that data accesses during a training run do not need a namespace. Maybe the namespace can be <em>batch-updated</em> later on, if it gets in the way and does not provide the required performance.</li> <li>I feel that the usability argument still applies. At some point, you need to think about what all the data means, what to keep around, what to prune. There are a million reasons – legal, regulatory, financial, just basic IT maintenance – that require metadata. Long-term, it is easy to think about things as a group or a hierarchy.</li> </ul> <p>I think performant filesystem namespaces are doable. It may get needlessly hard if you tie yourself to a spec or a standard. That is because this is in flux, and we do not yet have unanimous agreement on what the standard should look like.</p> <h3 id="the-3fs-approach">The 3FS Approach</h3> <p>They use a key-value model for their metadata, built on top of FoundationDB. Each key represents an inode.</p> <p>FoundationDB, as an aside, is a very interesting database. 
It is like a distributed RocksDB with support for strict serializability in transactions. This is interesting to me: normally you think of KV stores as single-node storage backends for DBMSes to use, with transactions implemented at the SQL level. FDB implements transactions at the KV level, but is supposed to be a lower-level database you layer on top of. (Aside: I wonder what the end-to-end principle says about the correct layering.) It is like a persistent distributed software transactional memory that tries to be lock-free, with optimistic and multi-version concurrency control.</p> <ul> <li>File inodes are <code class="language-plaintext highlighter-rouge">INOD&lt;inode_id&gt;</code>, mapped to ownership, permissions, times, chunk information etc. for files. Interestingly, they encode <code class="language-plaintext highlighter-rouge">inode_id</code> in little-endian to spread inodes across FDB nodes. Directory entries begin with <code class="language-plaintext highlighter-rouge">DENT</code>, and directories keep bidirectional links with their parents to ensure no loops. This needs to be, and should be, super-fast.</li> <li><code class="language-plaintext highlighter-rouge">fstat</code>, <code class="language-plaintext highlighter-rouge">lookup</code>, <code class="language-plaintext highlighter-rouge">listdir</code> etc. invoke read-only txns. <code class="language-plaintext highlighter-rouge">create</code>, <code class="language-plaintext highlighter-rouge">link</code> etc. invoke read-write txns.</li> </ul> <h3 id="state-in-3fs">State in 3FS</h3> <p>3FS metadata stores are stateless. Any operation is a self-contained transaction.</p> <p>The metadata service does store file descriptors for files opened in write mode, but not in read mode. This helps with bootstrapping training jobs. This IMO is super important!! Any parallel job, when bootstrapping, needs to retrieve a large number of files in read-only mode. This should be super-fast, but is not so with Lustre. 
I vaguely recall someone mentioning that they had to hack a cache on top of Lustre to push <code class="language-plaintext highlighter-rouge">.so</code> libraries to MPI ranks, and thinking that this feels like a fundamental flaw in what is supposed to be a parallel filesystem.</p> <p>Question: how do you handle a write request for a file that is currently open in read-only mode? Ideally you need to track the readers and revoke their lease, but that requires maintaining state at the metadata layer? I think that’s done by tracking versions?</p> <h2 id="chunk-store">Chunk Store</h2> <p>This is relatively straightforward. It is per-node, and comes with a RocksDB instance for chunk metadata. Chunk metadata is cached in-memory. They have 11 chunk sizes from 64 KiB to 64 MiB. I don’t think this is a big deal, but since metadata shouldn’t be too much even with 64 KiB chunks, it is a little surprising that they went for this many chunk sizes. But eh. Updates are CoW: old chunks remain valid until the update completes.</p> <h2 id="chain-replication">Chain Replication</h2> <p>They use CRAQ (Chain Replication with Apportioned Queries) for replication. This is just a fancy way of saying that any replica can respond to a read request, with the catch being that you may get a stale version. A node may have a committed version of a block and a pending version. If so, it lets the requester know of the existence of both versions, and the decision on whether to tolerate staleness or try again is left to the requester. Note that this is different from the case where a chain replica does not even know of the existence of an update. I don’t know if stale reads are tolerated in that case (or maybe version numbers are used to read a consistent snapshot).</p> <h2 id="zero-copy-in-fuse">Zero-copy in FUSE!?</h2> <p>They use FUSE for clients, so as to not deal with debugging kernel panics. 
They define an interface called <code class="language-plaintext highlighter-rouge">USRBIO</code>, for zero-copy interaction between the userspace and the FUSE layer. <code class="language-plaintext highlighter-rouge">USRBIO</code> is inspired by <code class="language-plaintext highlighter-rouge">io-uring</code> (or the Verbs API for that matter). The FUSE process manages Verbs-registered memory and submission/completion rings, and does the dispatch etc. These zero-copy ops are only used for the data path, and metadata ops still go through regular FUSE APIs.</p> <h2 id="misc">Misc</h2> <ul> <li>Codebase is mostly C++ with some Rust.</li> <li>They use <code class="language-plaintext highlighter-rouge">P</code><sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> to verify their protocols. <code class="language-plaintext highlighter-rouge">P</code> is apparently a formal verification language for event-driven distributed services. Very cool!</li> <li>They seem to use <code class="language-plaintext highlighter-rouge">flatbuffers</code> for serdes and their own RDMA/RPC wrappers on top of the <code class="language-plaintext highlighter-rouge">verbs</code> interface. There’s a lot of code there and I just briefly skimmed through it.</li> <li>They heavily use <code class="language-plaintext highlighter-rouge">folly</code>, Facebook’s assorted library of C++ abstractions (including coroutines).</li> </ul> <h1 id="epilogue">Epilogue</h1> <p>Not sure what to make of this. Writing a parallel filesystem is a massive undertaking. While metadata bottlenecks is a well-known problem in parallel filesystems, offloading it entirely to a DB like FoundationDB is an interesting choice. I have been trying to revisit how Colossus/Tectonic manage their metadata, and it is Bigtable and ZippyDB respectively. Ceph is pretty much the only open-source filesystem to support multiple metadata servers (Lustre is unwieldy in multiple ways). 
3FS is optimized for RDMA fabrics over the data path, and makes subtle choices (like cheap metadata reads) to optimize for a large parallel job bootstrapping. I remember reading somewhere that it is designed for small random accesses, and I don’t really see why that is the case. I will end with the caveat that there are a lot of things to keep straight in distributed/parallel filesystem discussions, I don’t really work on these, and this post probably has a bunch of mistakes.</p> <p>(Oh their design notes do say that they enable random access to training samples, I don’t see how they optimize for random accesses, maybe they mean that their system will transfer partial chunks and not do aggressive prefetching etc?)</p> <h1 id="references">References</h1> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>https://github.com/deepseek-ai/3FS <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>https://blog.glennklockwood.com/2025/02/llm-training-without-parallel-file.html <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>https://github.com/p-org/P <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="systems"/><category term="filesystems"/><category term="storage"/><summary type="html"><![CDATA[Notes on the Fire-Flyer Filesystem by Deepseek]]></summary></entry><entry><title type="html">On The Universality of Query Languages</title><link href="https://ankushja.in/blog/2025/universality-query-languages/" rel="alternate" type="text/html" title="On The Universality of Query Languages"/><published>2025-02-06T15:08:30+00:00</published><updated>2025-02-06T15:08:30+00:00</updated><id>https://ankushja.in/blog/2025/universality-query-languages</id><content type="html" xml:base="https://ankushja.in/blog/2025/universality-query-languages/"><![CDATA[<p>This is an 
initial quick draft of this line of thought, written 16 hours before a paper deadline I have things to do about (that has nothing to do with query languages). It will be rushed and half-baked, and surreptitiously refined later.</p> <p>Goal for this post: let us try justifying SQL. Why it exists, why is it in a specific form etc.</p> <h2 id="sql-vs-relational-algebra">SQL vs Relational Algebra</h2> <p>I think SQL is made-up and non-essential. It is syntactic sugar over relational algebra. Other declarative interfaces may be more convenient abstractions — I personally prefer dataframes and chaining of expressions.</p> <p>If you think of the query plan as a dataflow tree leading up to a root, SQL is essentially a traversal over that tree, like in-order, pre-order etc. The order SQL follows, specifically, may be called a <em>Weird Arbitrary Traversal</em> (or <em>WAT</em>). That the query planning infra evolved to reflect dataflow is not surprising, data movement is expensive, be it across functions or machines, and the more data you discard closer to the source the better off you are.</p> <p>These pedantic distinctions are important, because they enable us to reason about</p> <h2 id="on-the-universality-of-relational-algebra">On the Universality of Relational Algebra?</h2> <p>Okay so why does relational algebra exist in the form it does?</p> <p>As a lower bound on this argument, a query interface could just be say… C. You write a blob, machine runs blob, you get output.</p> <p>We do declarative interfaces to make life easier given certain domain-specific constraints. 
In the relational world, it is:</p> <ol> <li>Leveraging the relational/tabular data model.</li> <li>Leveraging distributed cluster resources.</li> <li>Leveraging fine-grained scheduling, concurrency etc.</li> <li>Getting easy access to indexes etc.</li> <li>Enabling reduced data movement by exposing primitives that capture data flow</li> <li>Enabling query optimization by exposing primitives that make it easier to reason about ordering and statistics</li> <li>Enable reasoning about partitioning and shuffling using key constraints and joins</li> </ol> <h2 id="do-relational-operators-do-these-things">Do Relational Operators Do These Things?</h2> <p>To a large extent, yes. That is why relational algebra endures. SQL just pretends to be inseparable from that and most people do not ask too many questions.</p> <p>You always need escape hatches. That is why UDFs exist. But beyond a point, they hinder the query planning infra’s ability to reason about what the blob does.</p> <h2 id="who-deserves-to-be-in-the-club">Who Deserves To Be In The Club?</h2> <p>What operators should be in this club?</p> <ul> <li>Can we justify the existence of all existing operators?</li> <li>Are there new ones that should be there but are not?</li> <li>Under what models do we need radically different solutions?</li> </ul> <p>Relational algebra checks out as it is essentially set theory. Given a table, you can slice it horizontally or vertically. Indexing helps with horizontal slicing. Vertical slicing is harder in row-based stores but column-based stores have solutions. Selection and projection map nicely to dataflow, indexing etc. Division seems to be the only relational operator not directly used in query plans.</p> <h2 id="where-am-i-getting-at">Where Am I Getting At?</h2> <p>There exist data models (such as multidimensional array-based scientific data) that are not relational. Graphs are one example. 
Nothing here is really new.</p> <p>The question I am trying to answer is: what is the set of ideas that lead you to the perfect operators for these domains? Do they exist? Are they in conflict with relational algebra? Or can they be members of the extended family?</p> <p><em>Note to self: read referenced material</em> <sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup><sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup><sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup><sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p> <h2 id="references">References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>SciQL, a query language for science applications, Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, 2011, https://dl.acm.org/doi/10.1145/1966895.1966896 <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>MeshSQL: The query language for simulation mesh data, Information Sciences, 2004, https://www.sciencedirect.com/science/article/pii/S0020025503001981 <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Toward unstructured mesh algebra and query language, Proceedings of the 2014 SIGMOD PhD symposium, 2014, https://dl.acm.org/doi/10.1145/2602622.2602626o <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>TileDB ∙ Designed for Discovery, https://tiledb.com/ <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="philosoraptormode"/><category term="sql"/><summary type="html"><![CDATA[Why is SQL what it is? 
What motivates a different query language?]]></summary></entry><entry><title type="html">Are All Networks Just Tradeoffs?</title><link href="https://ankushja.in/blog/2024/network-tradeoffs/" rel="alternate" type="text/html" title="Are All Networks Just Tradeoffs?"/><published>2024-12-07T00:58:30+00:00</published><updated>2024-12-07T00:58:30+00:00</updated><id>https://ankushja.in/blog/2024/network-tradeoffs</id><content type="html" xml:base="https://ankushja.in/blog/2024/network-tradeoffs/"><![CDATA[<p><a href="https://ankushja.in/blog/2024/credits-flow-congestion/">Part 1 here</a>.</p> <p>This post will explore the following thesis:</p> <ol> <li>All networks and variants are tradeoffs.</li> <li>If there are tradeoffs, there is no permanent winning. You are always tuning to chase <em>the window</em>.</li> </ol> <p>Disclaimer: this could be the most obvious statement in the world. Or maybe deep within the symbolic mines of queuing and scheduling, someone has said something that disproves all of it. Or maybe that that statement is obvious but the arguments below are flawed. I have no idea — this blog does not come with any guarantees for rigor or scrutiny.</p> <h2 id="context-thinking-about-cbfc-vs-pfc">Context: Thinking About CBFC vs PFC</h2> <p>The thing in the back of my mind recently has been — why does everyone complain about PFC and RoCEv2, and why do they not complain about Infiniband and CBFC (Credit-based Flow Control). Why could RoCEv2 not adopt CBFC? What do you gain and what do you lose? Do you still need higher-level congestion control with CBFC?</p> <h2 id="cbfc-is-not-a-magic-pill">CBFC Is Not A Magic Pill</h2> <p>An upper bound to CBFC goodness is much easier to establish than its precise extent.</p> <p>CBFC is not a magic pill. It can still lead to credit loops, head-of-line blocking, buffer overflows, packet loss emanating from it, and so on. It is also amenable to tuning: routing, topologies, credit management schemes etc. 
to reduce the likelihood of bad things happening.</p> <p>CBFC provides absolutely no theoretical guarantee (I’m not talking about a bounds-type argument: under X flows and Y buffer sizes you prove a Z% upper bound on packet loss). Also, not everyone means the same thing when they talk about CBFC. The Infiniband spec on this is not very prescriptive: my understanding is that it describes packet formats and some basic mechanics, but still leaves a lot of room for vendors to do better.</p> <h2 id="guarantees-are-expensive">Guarantees Are Expensive</h2> <p>You could provide a lot of guarantees by implementing a network as an all-to-all network of separate links. That is essentially what a crossbar is. But we want to optimize for cost. So we do tiers and routing and non-uniform bandwidth and all that stuff. You lose out on worst-case performance but that’s the only way to get systems to scale.</p> <p>Turns out that this applies to everything, including tail latencies, jitter etc. The only way to guarantee that you will move a packet in a predictable, timely manner is to explicitly carve out room for that packet on the path before sending it out. If you care about maintaining high utilization, or about cost or scalability, this is bad.</p> <p>Given a guarantee, you can always make an argument that “we can relax this guarantee a little and get a lot more performance in the common case.” This is statistical multiplexing or oversubscription or optimistic concurrency control or whatever. These are incredibly important: these thoughtful relaxations are the reason that systems work and can be built for the prices that they are built for. But it is always nice to be explicit about what you are losing in return.</p> <h2 id="overloading-creates-more-risks">Overloading Creates More Risks</h2> <p>As the goal of this exercise was to understand CBFC, let us compare it with sender-driven mechanisms. 
Sender-driven congestion control overloads certain signals to infer network state — RTT and packet loss seem to be two of them. This works until something happens that changes the meaning of these numbers — bufferbloat being one example. An advantage, in principle, of CBFC is that it is explicitly communicating actual network state, so theoretically, it should be robust to some such issues. I do not know if that is actually realized in practice.</p> <p>This is in no way trashing sender-driven mechanisms. A good system is one that gets the job done — clean conscience and mathematical elegance are the domains of whiteboard hoggers. The end-to-end principle and this layered approach to TCP/IP has taken us a long long long way. Overloading those signals is the practical thing to do. (Also note that I’m conflating flow control with congestion control — CBFC is very much the former, but I guess it dictates and comes bundled with a different congestion control in IB with ECNs and injection throttling).</p> <h2 id="expectations-for-future-congestion-control">Expectations for Future Congestion Control</h2> <p>This post is a precursor to me trying to understand all the cool new things in datacenter fabrics — Omnipath, Slingshot, UltraEthernet, maybe Globally Scheduled Ethernet?</p> <p>If these arguments hold, none of these systems’ mechanisms will be perfect. There will always be cases where Bad Things$^{TM}$ happen. Part of the argument is that that may be okay, and an evaluation of their properties must be holistic.</p> <h2 id="some-degree-of-tuning-is-okay">Some Degree Of Tuning Is Okay?</h2> <ul> <li>I can not pick an outfit that works in all weathers. I am destined to keep tuning.</li> <li>Some tuning is worse than others. 
If I wore an outfit that had only one shoe, I would have to hop and keep swapping my shoe from one foot to the other.</li> <li>An outfit that covers the expected swings over one day is a sufficient outfit.</li> <li>A bag that covers the expected swings over one week is sufficient for a short trip.</li> <li>A wardrobe that covers the expected swings over the year in Pittsburgh is sufficient for most other cases.</li> </ul> <p>What are the takeaways for systems?</p> <ul> <li>We are destined to tune. We should plan for that.</li> <li>It would be nice if we understood systems in terms of their coverage of design space.</li> <li>Nothing in this discussion rules out the existence of systems that cover strictly more of the design space than others. Probably someone has proved and/or will prove that certain aspects are Pareto-optimal. But it’s still a small sheet over massive legs.</li> </ul> <h2 id="chasing-the-global-optima">Chasing The Global Optima</h2> <p>Network sharing and scheduling mechanisms are probably by far the most well-studied and deployed examples of a distributed system trying to construct/reach a global optimum, and of both mathematical and practical arguments about the challenges that come with it.</p> <p>All systems would benefit from better decisions. Better decisions are enabled by a better view of the system state. But constructing this system state may be prohibitively expensive. It may also be a moving target. It will definitely have some uncertainty bounds because physics. It may also be possible that explicitly constructing the system state is not necessary: we can apparently go quite far with individual actors acting on simple rules. 
(Something about game theory and cooperative games comes to mind but I’m already wayyyy past my depth here).</p>]]></content><author><name></name></author><category term="networks"/><category term="networks"/><summary type="html"><![CDATA[Or, are we destined to tune?]]></summary></entry><entry><title type="html">Credits, Flow, and Congestion Control in Infiniband</title><link href="https://ankushja.in/blog/2024/credits-flow-congestion/" rel="alternate" type="text/html" title="Credits, Flow, and Congestion Control in Infiniband"/><published>2024-12-04T19:00:36+00:00</published><updated>2024-12-04T19:00:36+00:00</updated><id>https://ankushja.in/blog/2024/credits-flow-congestion</id><content type="html" xml:base="https://ankushja.in/blog/2024/credits-flow-congestion/"><![CDATA[<p>Things keep happening in datacenter networks. There (used to be) Ethernet, there’s Infiniband, Omnipath<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> (and its predecessors and successors), Slingshot<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, UltraEthernet<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>, and now <em>Globally Scheduled Ethernet</em><sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>. I don’t get why they all need to exist concurrently and this is an initial attempt at unraveling this for myself.</p> <p>Step 1 in this process is understanding Infiniband. Hyperscalers seem to like Infiniband and RDMA, but they also like their IP addresses and routing tables. RoCE and ROCEv2 were attempts to bridge the gaps, but they turned out to have problems. 
At this point, I know the talking points everyone rehashes (PFC storms, HoL blocking, deadlocks…) but I don’t really understand any of it.</p> <p>From my understanding, all of the changes in RoCEv2 were necessitated because it is hard to get RDMA to work over lossy networks. (As an aside, it seems to me that there is no such thing as a truly lossless network, but you can build a pretty good illusion of one with two properties: 1. Reduce the likelihood of packet loss, and 2. Handle retransmissions transparently at some lower layer.) I haven’t heard anyone complain about how flow control works in Infiniband, so the key is maybe to understand it. This post is an attempt at that.</p> <h1 id="credit-based-flow-control-in-infiniband">Credit-based Flow Control in Infiniband</h1> <p>This is all based off slides here: <sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>.</p> <h2 id="virtual-lanes-and-service-levels">Virtual Lanes and Service Levels</h2> <p>Credits are maintained per-VL on each HCA. VLs/SLs are technically different but equivalent for now. They are priority classes; Infiniband supports up to 16. VL15 is reserved for subnet management traffic, while the others are available for regular data. The difference is that the application requests service levels, and the SL-VL mapping is a management decision handled presumably by the subnet manager.</p> <h2 id="basic-numbers">Basic Numbers</h2> <ul> <li>Each <code class="language-plaintext highlighter-rouge">Flow Control Block</code> is 64B. One 64B send requires one credit to be authorized.</li> <li>A VL will issue a maximum of 2048 credits, which translates to a 128KB receive buffer.</li> </ul> 
<h2 id="cbfc-actual">CBFC Actual</h2> <p>There are 3 main terms: <code class="language-plaintext highlighter-rouge">ABR</code>, <code class="language-plaintext highlighter-rouge">FCCL</code>, <code class="language-plaintext highlighter-rouge">FCTBS</code>.</p> <ul> <li><code class="language-plaintext highlighter-rouge">FCCL</code>: the credit limit (the max point up to which the sender has been authorized to send).</li> <li><code class="language-plaintext highlighter-rouge">ABR</code>: total blocks received at the receiver so far.</li> <li><code class="language-plaintext highlighter-rouge">FCTBS</code>: total blocks sent.</li> </ul> <p><code class="language-plaintext highlighter-rouge">FCTBS</code> - <code class="language-plaintext highlighter-rouge">ABR</code>: blocks in transit on the wire. <code class="language-plaintext highlighter-rouge">FCCL</code> - <code class="language-plaintext highlighter-rouge">FCTBS</code>: remaining credits for the sender. A send is permitted only if its size (in 64B blocks) fits within this limit.</p> <h1 id="flow-control-vs-congestion-control">Flow Control vs Congestion Control</h1> <p>Flow control makes Infiniband largely and inherently lossless. Rare events such as packet corruption may still require retransmissions. The fabric will do retransmissions for you if <code class="language-plaintext highlighter-rouge">Reliable Connected</code> was requested. CBFC kicks in regardless of whether you use RC or UC or UD. I haven’t looked into how retransmissions for <code class="language-plaintext highlighter-rouge">RC</code> are managed.</p> <p>Infiniband also has congestion control on top of flow control. Why both need to exist is not entirely clear to me yet. 
What I do know is that IB employs some variant of <code class="language-plaintext highlighter-rouge">ECN (Explicit Congestion Notification)</code> to help detect congestion (<code class="language-plaintext highlighter-rouge">FECN</code> and <code class="language-plaintext highlighter-rouge">BECN</code>). I don’t know what the endpoints do in response to ECNs.</p> <h1 id="questions-for-myself">Questions For Myself</h1> <ul> <li>Why is it hard to retrofit CBFC onto Ethernet?</li> <li>Why don’t all datacenter fabrics use CBFC?</li> <li>How does CBFC compare to loss-based congestion control?</li> </ul> <h1 id="references">References</h1> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:2"> <p>What If Omni-Path Morphs Into The Best Ultra Ethernet?, 2024, https://www.nextplatform.com/2024/06/26/what-if-omni-path-morphs-into-the-best-ultra-ethernet/ <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Cray’s Slingshot Interconnect Is At The Heart Of HPE’s HPC And AI Ambitions, 2022, https://www.nextplatform.com/2022/01/31/crays-slingshot-interconnect-is-at-the-heart-of-hpes-hpc-and-ai-ambitions/ <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:1"> <p>UltraEthernet: Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>Whitepaper on Globally Scheduled Ethernet, 2024, https://regmedia.co.uk/2024/11/26/china_mobile_gse_whitepaper.pdf <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>Infiniband Credit-Based Link-Layer Flow-Control, 2014, https://www.ieee802.org/1/files/public/docs2014/new-dcb-crupnicoff-ibcreditstutorial-0314.pdf <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> 
</div>]]></content><author><name></name></author><category term="systems"/><category term="networks"/><summary type="html"><![CDATA[Notes on CBFC, lossless fabrics, receiver-driven congestion control etc.]]></summary></entry><entry><title type="html">Notes on Web Pages and CMU Infra</title><link href="https://ankushja.in/blog/2024/cmu-web-pages.md/" rel="alternate" type="text/html" title="Notes on Web Pages and CMU Infra"/><published>2024-11-29T18:11:12+00:00</published><updated>2024-11-29T18:11:12+00:00</updated><id>https://ankushja.in/blog/2024/cmu-web-pages.md</id><content type="html" xml:base="https://ankushja.in/blog/2024/cmu-web-pages.md/"><![CDATA[<p>Infra for hosting web pages on CMU web servers is a little complicated and confusing, and I waste some time figuring it out every time I need to change something. These are notes to make my life easier in the future, and may be relevant for others. This is from the perspective of an ECE student, adapt to your needs.</p> <h2 id="afs-cells">AFS Cells</h2> <p>I get to have a home directory on two CMU cells – <code class="language-plaintext highlighter-rouge">andrew.cmu.edu</code> and <code class="language-plaintext highlighter-rouge">ece.cmu.edu</code>. I don’t know how to discover all the cells you have a home directory in.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ssh ankushj@unix.andrew.cmu.edu # give them your andrew password
$ pwd
/afs/andrew.cmu.edu/usr18/ankushj
$ cd /afs/ece.cmu.edu/usr/ankushj
$ ls
Permission denied
$ aklog ece.cmu.edu
$ ls
top_secret_infiniband_content.txt
$ tokens
&lt;all the cells I'm authenticated for&gt;

# random AFS commands
$ fs listaliases
$ fs listquota
</code></pre></div></div> <h2 id="hosting-on-ece">Hosting on ECE</h2> <p>This is straightforward. Your <code class="language-plaintext highlighter-rouge">public_html</code> is available on <code class="language-plaintext highlighter-rouge">users.ece.cmu.edu/~ankushj</code></p> <p>ECE also respects <code class="language-plaintext highlighter-rouge">.htaccess</code></p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls -a /path/to/ece/homedir/public_html
.htaccess index.html
</code></pre></div></div> <h2 id="hosting-on-andrew">Hosting on Andrew</h2> <p>This is more confusing. Target address is <code class="language-plaintext highlighter-rouge">https://www.andrew.cmu.edu/user/ankushj/</code>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls -a /path/to/andrew/homedir/www
.htaccess index.html
</code></pre></div></div> <ol> <li>Andrew does not use <code class="language-plaintext highlighter-rouge">.htaccess</code>.</li> <li><code class="language-plaintext highlighter-rouge">index.html</code> is not automatically available. You have to go to <code class="language-plaintext highlighter-rouge">https://www.andrew.cmu.edu/server/publish.html</code>, type your username, hit publish, and the data gets copied to some “real hosting destination”.</li> <li><code class="language-plaintext highlighter-rouge">.unpublish</code> is a thing<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</li> </ol> <h2 id="redirects">Redirects</h2> <p>On ECE, I can set up an HTTP-level redirect via <code class="language-plaintext highlighter-rouge">.htaccess</code>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RedirectMatch 301 ^/$ https://my.real.website
</code></pre></div></div> <p>I previously tried the following, but it would redirect to <code class="language-plaintext highlighter-rouge">https://my.real.website~ankushj</code>. I am pretty sure that ChatGPT is just giving me a stupid redirect command, but the version above works and the version below does not — that’ll have to do for now.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Redirect 301 / https://my.real.website
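# Likely explanation (an assumption, not verified against the ECE server config):
# Redirect matches a URL-path prefix and appends whatever follows the matched
# prefix to the target. With the prefix "/", a request for /~ankushj leaves
# "~ankushj" as the remainder, which is glued onto the target with no separator.
# RedirectMatch with the anchored pattern ^/$ has no remainder to append.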
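</code></pre></div></div> <p>On Andrew, which does not honor <code class="language-plaintext highlighter-rouge">.htaccess</code>, I had to fall back to an HTML-level redirect (a meta refresh).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;!DOCTYPE html&gt;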
&lt;html&gt;
  &lt;head&gt;
    &lt;meta http-equiv="refresh" content="0; url=https://my.real.website"&gt;
    &lt;title&gt;Redirecting...&lt;/title&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;p&gt;If you are not redirected, &lt;a href="https://my.real.website"&gt;click here&lt;/a&gt;.&lt;/p&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre></div></div> <h2 id="troubleshooting">Troubleshooting</h2> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Permissions
$ fs sa www system:anyuser rl
</code></pre></div></div> <h2 id="references">References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>https://www.cmu.edu/computing/services/comm-collab/websites/user-course-web/how-to/publish.html <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name></name></author><category term="web"/><category term="web"/><summary type="html"><![CDATA[Dealing with AFS, cells, permissions etc.]]></summary></entry><entry><title type="html">Inscrutable Coredump v. Unmoveable Grad Student</title><link href="https://ankushja.in/blog/2024/inscrutable-core-dump/" rel="alternate" type="text/html" title="Inscrutable Coredump v. Unmoveable Grad Student"/><published>2024-07-09T19:24:13+00:00</published><updated>2024-07-09T19:24:13+00:00</updated><id>https://ankushja.in/blog/2024/inscrutable-core-dump</id><content type="html" xml:base="https://ankushja.in/blog/2024/inscrutable-core-dump/"><![CDATA[<p>Situation: we have a core dump that is shy to reveal its inner workings. The goal is to extract some more information from this core dump, using fancier analyses. The core dump is 1.2GB in size, so I know that there is some insight in there, it is just buried.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) bt</span><span class="w">
</span>#0<span class="w">  </span><span class="mh">0x00007fbaa4653c30</span><span class="w"> </span>in<span class="w"> </span>??<span class="w"> </span><span class="p">()</span><span class="w">
</span>#1<span class="w">  </span><span class="mh">0x0000000000000000</span><span class="w"> </span>in<span class="w"> </span>??<span class="w"> </span><span class="p">()</span><span class="w">
</span></code></pre></div></div> <p>Since we have an intact <code class="language-plaintext highlighter-rouge">$pc</code> (which refers to <code class="language-plaintext highlighter-rouge">%rip</code>), we can figure out the instructions it was executing.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) disassemble $pc-20,$pc+20</span><span class="w">
</span>Dump<span class="w"> </span>of<span class="w"> </span>assembler<span class="w"> </span>code<span class="w"> </span>from<span class="w"> </span><span class="mh">0x7fbaa4653c1c</span><span class="w"> </span>to<span class="w"> </span><span class="mh">0x7fbaa4653c44</span>:<span class="w">
   </span><span class="mh">0x00007fbaa4653c1c</span>:<span class="w">  </span>add<span class="w">    </span><span class="nv">%al</span><span class="p">,(</span><span class="nv">%rax</span><span class="p">)</span><span class="w">
   </span><span class="mh">0x00007fbaa4653c1e</span>:<span class="w">  </span>jmp<span class="w">    </span><span class="mh">0x7fbaa4653a51</span><span class="w">
   </span><span class="mh">0x00007fbaa4653c23</span>:<span class="w">  </span>nopl<span class="w">   </span><span class="mh">0x0</span><span class="p">(</span><span class="nv">%rax</span><span class="p">,</span><span class="nv">%rax</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span><span class="w">
   </span><span class="mh">0x00007fbaa4653c28</span>:<span class="w">  </span>mov<span class="w">    </span><span class="mh">0x98</span><span class="p">(</span><span class="nv">%r12</span><span class="p">),</span><span class="nv">%rax</span><span class="w">
</span>=&gt;<span class="w"> </span><span class="mh">0x00007fbaa4653c30</span>:<span class="w">  </span>cmpb<span class="w">   </span>$0x48,(%rax)<span class="w">
   </span><span class="mh">0x00007fbaa4653c33</span>:<span class="w">  </span>jne<span class="w">    </span><span class="mh">0x7fbaa4653b90</span><span class="w">
   </span><span class="mh">0x00007fbaa4653c39</span>:<span class="w">  </span>movabs<span class="w"> </span>$0x50f0000000fc0c7,%rdx<span class="w">
   </span><span class="mh">0x00007fbaa4653c43</span>:<span class="w">  </span>cmp<span class="w">    </span><span class="nv">%rdx</span><span class="p">,</span><span class="mh">0x1</span><span class="p">(</span><span class="nv">%rax</span><span class="p">)</span><span class="w">
</span>End<span class="w"> </span>of<span class="w"> </span>assembler<span class="w"> </span>dump.<span class="w">
</span></code></pre></div></div> <p>Let us inspect the registers.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) info reg</span><span class="w">
</span>rax<span class="w">            </span><span class="mh">0x323338342034342e</span><span class="w">  </span><span class="mi">3617296722238387246</span><span class="w">
</span>rbx<span class="w">            </span><span class="mh">0x7fff678ebb50</span><span class="w">      </span><span class="mi">140734930795344</span><span class="w">
</span>rcx<span class="w">            </span><span class="mh">0x7fba82086120</span><span class="w">      </span><span class="mi">140439022231840</span><span class="w">
</span>rdx<span class="w">            </span><span class="mh">0x1</span><span class="w">                 </span><span class="mi">1</span><span class="w">
</span>rsi<span class="w">            </span><span class="mh">0x1</span><span class="w">                 </span><span class="mi">1</span><span class="w">
</span></code></pre></div></div> <p>Okay, so our <code class="language-plaintext highlighter-rouge">%rax</code> was clearly a gibberish address (its bytes are all printable ASCII, which hints that text was written over a pointer), so no wonder dereferencing it failed. Now the question is what source file/line was mapped to <code class="language-plaintext highlighter-rouge">$pc</code>. ChatGPT says that the following can work:</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) list *$pc</span><span class="w">
</span>&lt;no<span class="w"> </span>output&gt;<span class="w">
</span><span class="gp">(gdb) info symbol $pc</span><span class="w">
</span>No<span class="w"> </span>symbol<span class="w"> </span>matches<span class="w"> </span>$pc.<span class="w">
</span></code></pre></div></div> <p>ChatGPT also says that we can resolve addresses using these, but first we need to know which library is mapped into our memory, and at what offset. Noting these down for later.</p> <div class="language-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$ addr2line -e /path/to/your/executable 0xADDRESS</span><span class="w">
</span><span class="gp">$ objdump -d -S /path/to/your/executable</span><span class="w">
</span></code></pre></div></div> <p>Some more useful information, saving for later.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) info frame 0</span><span class="w">
</span>Stack<span class="w"> </span>frame<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebaf8</span>:<span class="w">
 </span>rip<span class="w"> </span>=<span class="w"> </span><span class="mh">0x7fbaa4653c30</span>;<span class="w"> </span>saved<span class="w"> </span>rip<span class="w"> </span>=<span class="w"> </span><span class="mh">0x0</span><span class="w">
 </span>called<span class="w"> </span>by<span class="w"> </span>frame<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebb00</span><span class="w">
 </span>Arglist<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebae8</span><span class="p">,</span><span class="w"> </span>args:<span class="w">
 </span>Locals<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebae8</span><span class="p">,</span><span class="w"> </span>Previous<span class="w"> </span>frame's<span class="w"> </span>sp<span class="w"> </span>is<span class="w"> </span><span class="mh">0x7fff678ebaf8</span><span class="w">
 </span>Saved<span class="w"> </span>registers:<span class="w">
  </span>rip<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebaf0</span><span class="w">
</span><span class="gp">(gdb) info frame 1</span><span class="w">
</span>Stack<span class="w"> </span>frame<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebb00</span>:<span class="w">
 </span>rip<span class="w"> </span>=<span class="w"> </span><span class="mh">0x0</span>;<span class="w"> </span>saved<span class="w"> </span>rip<span class="w"> </span>=<span class="w"> </span><span class="mh">0x0</span><span class="w">
 </span>caller<span class="w"> </span>of<span class="w"> </span>frame<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebaf8</span><span class="w">
 </span>Arglist<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebaf0</span><span class="p">,</span><span class="w"> </span>args:<span class="w">
 </span>Locals<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebaf0</span><span class="p">,</span><span class="w"> </span>Previous<span class="w"> </span>frame's<span class="w"> </span>sp<span class="w"> </span>is<span class="w"> </span><span class="mh">0x7fff678ebb00</span><span class="w">
 </span>Saved<span class="w"> </span>registers:<span class="w">
  </span>rip<span class="w"> </span>at<span class="w"> </span><span class="mh">0x7fff678ebaf8</span><span class="w">
</span><span class="gp">(gdb) info frame 2</span><span class="w">
</span>No<span class="w"> </span>frame<span class="w"> </span>at<span class="w"> </span>level<span class="w"> </span><span class="mi">2</span>.<span class="w">
</span></code></pre></div></div> <p>Let’s look at the shared libraries and memory mappings now.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) info shared</span><span class="w">
</span>No<span class="w"> </span>shared<span class="w"> </span>libraries<span class="w"> </span>loaded<span class="w"> </span>at<span class="w"> </span>this<span class="w"> </span>time.<span class="w">
</span><span class="gp">(gdb) info proc mappings</span><span class="w">
</span>Mapped<span class="w"> </span>address<span class="w"> </span>spaces:<span class="w">

          </span>Start<span class="w"> </span>Addr<span class="w">           </span>End<span class="w"> </span>Addr<span class="w">       </span>Size<span class="w">     </span>Offset<span class="w"> </span>objfile<span class="w">
      </span><span class="mh">0x5585a6034000</span><span class="w">     </span><span class="mh">0x5585a604b000</span><span class="w">    </span><span class="mh">0x17000</span><span class="w">        </span><span class="mh">0x0</span><span class="w"> </span>/some/bin<span class="w">
      </span><span class="mh">0x5585a604b000</span><span class="w">     </span><span class="mh">0x5585a64ef000</span><span class="w">   </span><span class="mh">0x4a4000</span><span class="w">    </span><span class="mh">0x17000</span><span class="w"> </span>/some/bin<span class="w">
      </span><span class="mh">0x5585a64ef000</span><span class="w">     </span><span class="mh">0x5585a6582000</span><span class="w">    </span><span class="mh">0x93000</span><span class="w">   </span><span class="mh">0x4bb000</span><span class="w"> </span>/some/bin<span class="w">
      </span><span class="mh">0x7fba3ad28000</span><span class="w">     </span><span class="mh">0x7fba3c000000</span><span class="w">  </span><span class="mh">0x12d8000</span><span class="w">        </span><span class="mh">0x0</span><span class="w"> </span>/dev/shm/...<span class="w">
      </span><span class="mh">0x7fba40d29000</span><span class="w">     </span><span class="mh">0x7fba40f41000</span><span class="w">   </span><span class="mh">0x218000</span><span class="w">        </span><span class="mh">0x0</span><span class="w"> </span>/dev/shm/...<span class="w">
</span></code></pre></div></div> <p>Alright, getting somewhere. I have no idea why <code class="language-plaintext highlighter-rouge">info shared</code> failed but <code class="language-plaintext highlighter-rouge">info proc mappings</code> did not. We want to find a mapping around the address <code class="language-plaintext highlighter-rouge">0x00007fbaa4653c30</code>.</p> <p>Found something.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mh">0x7fbaa4647000</span><span class="w">     </span><span class="mh">0x7fbaa4659000</span><span class="w">    </span><span class="mh">0x12000</span><span class="w">     </span><span class="mh">0x3000</span><span class="w"> </span>/usr/lib/x86_64-linux-gnu/libgcc_s.so.1<span class="w">
</span></code></pre></div></div> <p>The difference between the start address of this mapping and our <code class="language-plaintext highlighter-rouge">$pc</code> is <code class="language-plaintext highlighter-rouge">0xcc30</code>. Add the mapping’s file offset, <code class="language-plaintext highlighter-rouge">0x3000</code>, to get <code class="language-plaintext highlighter-rouge">0xfc30</code>, the position of the faulting instruction within the library file.</p> <div class="language-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$ addr2line -e /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 0xfc30</span><span class="w">
</span>??;0<span class="w">
</span></code></pre></div></div> <p>Okay well thx.</p> <div class="language-session highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$ objdump -d -S /usr/lib/x86_64-linux-gnu/libgcc_s.so.1</span><span class="w">
</span>...<span class="w">
</span>fc30:<span class="w">       </span>80<span class="w"> </span>38<span class="w"> </span>48<span class="w">                </span>cmpb<span class="w">   </span>$0x48,(%rax)<span class="w">
    </span>fc33:<span class="w">       </span>0f<span class="w"> </span>85<span class="w"> </span>57<span class="w"> </span>ff<span class="w"> </span>ff<span class="w"> </span>ff<span class="w">       </span>jne<span class="w">    </span>fb90<span class="w"> </span>&lt;_Unwind_GetTextRelBase@@GCC_3.0+0xe40&gt;<span class="w">
    </span>fc39:<span class="w">       </span>48<span class="w"> </span>ba<span class="w"> </span>c7<span class="w"> </span>c0<span class="w"> </span>0f<span class="w"> </span>00<span class="w"> </span>00<span class="w">    </span>movabs<span class="w"> </span>$0x50f0000000fc0c7,%rdx<span class="w">
    </span>...<span class="w">
</span></code></pre></div></div> <p>Okay this wasn’t super useful. This is just some GCC unwinding utility function after a segfault. At this point, I just decide to examine the entire stack.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) x/128xg 0x7fff678eba00</span><span class="w">
</span>...<span class="w">
</span><span class="mh">0x7fff678eba00</span>:<span class="w"> </span><span class="mh">0x00007fba82086040</span><span class="w">      </span><span class="mh">0x00007fbaa465000b</span><span class="w">
</span><span class="mh">0x7fff678eba10</span>:<span class="w"> </span><span class="mh">0x000000000000002e</span><span class="w">      </span><span class="mh">0x0000000000000000</span><span class="w">
</span><span class="mh">0x7fff678eba20</span>:<span class="w"> </span><span class="mh">0x0000000000000000</span><span class="w">      </span><span class="mh">0x0000000000000000</span><span class="w">
</span><span class="mh">0x7fff678eba30</span>:<span class="w"> </span><span class="mh">0x00007fff678eced8</span><span class="w">      </span><span class="mh">0xb741446eb7f0e800</span><span class="w">
</span><span class="mh">0x7fff678eba40</span>:<span class="w"> </span><span class="mh">0x00007fff678eceb0</span><span class="w">      </span><span class="mh">0x00007fff678ebb50</span><span class="w">
</span><span class="mh">0x7fff678eba50</span>:<span class="w"> </span><span class="mh">0x00007fff678ebbf8</span><span class="w">      </span><span class="mh">0x00007fff678ebb50</span><span class="w">
</span><span class="mh">0x7fff678eba60</span>:<span class="w"> </span><span class="mh">0x00007fff678ebe00</span><span class="w">      </span><span class="mh">0x323338342034342d</span><span class="w">
</span>...<span class="w">
</span></code></pre></div></div> <p>Some patterns start to emerge. All values starting with <code class="language-plaintext highlighter-rouge">0x7fff</code> are pointers to things on the stack. Things in the range of <code class="language-plaintext highlighter-rouge">0x7fbaa..</code> are probably related to instructions. We can also see the junk value starting with <code class="language-plaintext highlighter-rouge">0x3233</code> that was implicated in the segfault.</p> <h2 id="two-hours-later-">Two hours later …</h2> <p>My approach was to examine the stack visually, find pointers with prefixes that I knew to map to code I wrote, and try to resolve them to get an idea of where my program was when it crashed.</p> <p>This is doable, but it is not as straightforward as you might think. The <em>why</em> requires going into how ELF binaries/shared libraries are loaded into memory.</p> <ol> <li>There is a <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/maps</code> corresponding to <code class="language-plaintext highlighter-rouge">info proc mappings</code> that we saw earlier.</li> <li>Each ELF file is divided into segments, which are further divided into sections. Mapping happens at the granularity of a segment.</li> <li>The mapped segment will have a different offset than the on-disk segment. This may have something to do with alignment and/or ASLR requirements. But the segment sizes are also different for me, between what is reported by gdb, and what is shown by <code class="language-plaintext highlighter-rouge">readelf/objdump</code>.</li> </ol> <p>As a result, I was unable to map symbol addresses from the core dump to symbols in libraries effectively. There is theoretically no reason why gdb should not be able to do this automatically, and it does, for more benign cases. 
But it does not seem to load the shared libraries for me for this particular crash.</p> <h2 id="wait-">Wait …</h2> <p>Okay, I ran gdb with this specific sequence, and suddenly it chose to load shared libraries.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$<span class="w"> </span>gdb<span class="w">
</span><span class="gp">(gdb) set auto-solib-add off # do not auto-load solibs</span><span class="w">
</span><span class="gp">(gdb) set substitute-path /dev/shm /dev/null # something for shm maps</span><span class="w">
</span><span class="gp">(gdb) set solib-search-path /path/to/lib</span><span class="w">
</span><span class="gp">(gdb) file /path/to/my/binary</span><span class="w">
</span><span class="gp">(gdb) target core /path/to/core-file</span><span class="w">
</span><span class="gp">(gdb) info sharedlibrary</span><span class="w">
</span><span class="gp">(gdb) info sharedlibrary</span><span class="w">
</span>From<span class="w">                </span>To<span class="w">                  </span>Syms<span class="w"> </span>Read<span class="w">   </span>Shared<span class="w"> </span>Object<span class="w"> </span>Library<span class="w">
</span><span class="mh">0x00007fbaa4e98350</span><span class="w">  </span><span class="mh">0x00007fbaa4eaccd1</span><span class="w">  </span>No<span class="w">          </span>/lib/libx.so<span class="w"> 
</span><span class="mh">0x00007fbaa4e21a00</span><span class="w">  </span><span class="mh">0x00007fbaa4e72bc9</span><span class="w">  </span>No<span class="w">          </span>/lib/liby.so<span class="w">
</span>...<span class="w">
</span><span class="gp">(gdb) sharedlibrary /path/to/libmycode.so</span><span class="w">
</span>Reading<span class="w"> </span>symbols<span class="w"> </span>from<span class="w"> </span>...<span class="w">
</span></code></pre></div></div> <p>I have no idea which of the above did the trick. Consider it a magic sequence of commands for now.</p> <p>The game plan now is to go through the stack with <code class="language-plaintext highlighter-rouge">x/64xg $sp</code> and beyond to look for familiar addresses and try to resolve them via the symbol table. I tried a bunch of random addresses, and finally hit the jackpot.</p> <div class="language-gdb highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">(gdb) info symbol 0x00005585a6470c80</span><span class="w">
</span>Serialize[...]<span class="w"> </span>in<span class="w"> </span>section<span class="w"> </span>.text<span class="w"> </span>of<span class="w"> </span>/my/binary<span class="w">
</span></code></pre></div></div> <p>It was a buffer overflow in a serialization routine.</p> <h2 id="conclusions">Conclusions</h2> <p>The battle between you and a coy-acting core dump is a battle of wills. Do not blink.</p>]]></content><author><name></name></author><category term="systems"/><category term="gdb"/><summary type="html"><![CDATA[On the applications of fuzzy human pattern matching to extract secrets from a corrupted core dump in the age of trillion parameter AI]]></summary></entry></feed>