Rust Performance Measurement Initial Steps

Introduction

While working on wishpy, one of the things I realized was that Wireshark is primarily meant to be used as a tool. While Wireshark provides a programmatic API (C based), it is not very convenient to use. One of the Rust project ideas I had was to develop a packet dissection framework with a friendly API. The Go based gopacket comes quite close to what I had in mind. I had a rough idea of how to get it working in Rust and started implementing it in a crate called scalpel. It's still a work in progress and this blog post is not about it per se.

Since this code is targeted at working at a very low level in a packet analytics stack, the actual dissection should be really fast. My goal, to start with, was to be at least at par with gopacket's performance, if not better (since we are developing in a non-GC language), without any optimizations. For a couple of basic dissectors like IPv4 and IPv6 it was actually doing substantially better - nearly twice as fast as gopacket - which looked quite promising. But for one particular benchmark involving a DNS packet, the performance was more than twice as slow as gopacket's. The actual reason turned out to be something other than dissection performance alone, but identifying the root cause helped in better understanding at least some parts of Rust's ecosystem for running benchmarks and instrumenting running code. Next, we'll look at some of those tools, how they helped identify and reason about the performance issues, and what steps were taken to solve them.

Background

Before we get into the details of the actual changes and the eventual improvements, a quick look at the ecosystem. There is a cargo command called cargo bench that can be used to run benchmark programs. criterion is a micro-benchmarking crate that can be used with cargo bench. criterion's documentation is quite good, so we will not look into the details of how it is used. We need to add it as one of the dev-dependencies, add modules inside the benches directory of the crate, and then cargo bench takes over.
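
To make this concrete, here is a minimal sketch of such a bench module (say benches/dns.rs). The parse_dns function and the packet bytes below are placeholders, not scalpel's actual code; criterion also needs to be listed under [dev-dependencies] and the bench target needs harness = false in Cargo.toml -

    // benches/dns.rs - a minimal criterion benchmark sketch.
    use criterion::{black_box, criterion_group, criterion_main, Criterion};

    // Placeholder for the dissection function under test.
    fn parse_dns(_bytes: &[u8]) {}

    fn bench_parse_dns(c: &mut Criterion) {
        // A raw DNS packet captured beforehand would go here.
        let packet: &[u8] = &[0u8; 64];
        c.bench_function("Parse_DNS", |b| {
            // black_box keeps the compiler from optimizing the call away.
            b.iter(|| parse_dns(black_box(packet)))
        });
    }

    criterion_group!(benches, bench_parse_dns);
    criterion_main!(benches);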

While cargo bench is useful for running benchmarks, to be able to look underneath we need an additional instrumentation mechanism. Brendan Gregg's FlameGraphs are an excellent tool for this. They can be used to generate a nice SVG which shows the frequently used code paths along with their call stacks. This can, for example, show in which parts of the program the CPU is spending more time. There is a Rust port of FlameGraph which is integrated with the cargo toolchain; one can run a command called cargo flamegraph that generates the flame graph while running a particular benchmark.
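
Profiling a particular benchmark then looks something like the following (the bench target name here is illustrative) -

    # Install the cargo subcommand once, then profile a specific benchmark.
    cargo install flamegraph
    cargo flamegraph --bench dns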

So the plan was to generate a FlameGraph for the benchmark that was performing poorly, identify the issues, and see whether and how they could be addressed at all. Also, the idea was not to actually perform any 'hand optimizations', but simply to avoid parts that were poorly implemented (too many allocations, for example) and see if they could be improved, and by how much.

A Note about DNS Records

The code we are exploring is for the dissection of DNS packets. Before we actually look into the code, let's quickly understand a bit about how records are compressed in DNS packets, as this particular part of the code was problematic.

In a DNS packet, DNS records for hostnames are compressed by not repeating parts of a hostname when multiple records are present, and instead providing a pointer to the earlier occurrence. For instance, say there are four name server records for ns1.example.com, ns2.example.com, ns3.example.com and ns4.example.com. This is encoded in a packet as ns1.example.com in full, and for ns2, ns3 and ns4 only a pointer to the .example.com part is kept. Section 4.1.4 of RFC-1035 has more details about it.
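
As a concrete illustration, the bytes for the first two names could look like the following (a hypothetical layout; offsets here are relative to this slice, while real compression pointers are offsets from the start of the DNS message) -

    // Hypothetical wire layout of "ns1.example.com" followed by a
    // compressed "ns2.example.com".
    const NAMES: &[u8] = &[
        3, b'n', b's', b'1',                         // offset  0: label "ns1"
        7, b'e', b'x', b'a', b'm', b'p', b'l', b'e', // offset  4: label "example"
        3, b'c', b'o', b'm',                         // offset 12: label "com"
        0,                                           // offset 16: root label, name ends
        3, b'n', b's', b'2',                         // offset 17: label "ns2"
        0xC0, 4,                                     // offset 21: a length byte with the
                                                     // top two bits set is a pointer; its
                                                     // low 14 bits (4) point at "example"
    ];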

First Implementation and Issues

We are going to look at a specific function that obtains the DNS names from the (potentially compressed) records in a DNS packet. This function is dns_name_from_u8 and it looks like -

    fn dns_name_from_u8(
        bytes: &[u8],
        start: usize,
        remaining: usize,
    ) -> Result<(DNSName, usize), Error> {
        fn labels_from_offset(
            bytes: &[u8],
            offset: usize,
            mut remaining: usize,
            check_remaining: bool,
        ) -> Result<(Vec<Vec<u8>>, usize), Error> {
            ...
            ...
        }

        let (labels, consumed) = labels_from_offset(bytes, start, remaining, true)?;

        Ok((DNSName(labels.into_iter().flatten().collect()), consumed))
    }

The actual code for this implementation can be found here.

So what the code was essentially doing was recursively calling labels_from_offset, where each call contributed a Vec<u8> per label, and finally all the vectors were flattened into a single Vec<u8>. Records are then generated from this Vec<u8>, often by simply building a String out of it.
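
A minimal sketch of this pattern (hypothetical code; the real labels_from_offset also tracks the bytes consumed and returns a Result, and error handling and bounds checks are omitted here) -

    // Sketch of the original shape: every label is copied into its own
    // Vec<u8> and the caller flattens the resulting Vec<Vec<u8>>.
    fn labels_from_offset(bytes: &[u8], offset: usize) -> Vec<Vec<u8>> {
        let mut labels = Vec::new();
        let mut i = offset;
        loop {
            let len = bytes[i] as usize;
            if len == 0 {
                break; // root label: end of this name
            }
            if (len & 0xC0) == 0xC0 {
                // Compression pointer: recurse at the 14-bit target offset.
                let target = ((len & 0x3F) << 8) | bytes[i + 1] as usize;
                labels.extend(labels_from_offset(bytes, target));
                break;
            }
            // An ordinary label: a fresh Vec<u8> allocation every time.
            labels.push(bytes[i + 1..i + 1 + len].to_vec());
            i += 1 + len;
        }
        labels
    }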

This part looked like the culprit, and indeed the FlameGraph did indicate that a chunk of time was being spent here (open the SVG and compare the sizes of the rectangles for <alloc::vec::Vec<T>.. vs ...::dns_name_from_u8::labels_from_offset). See below -


open flamegraph in new tab

FlameGraph 1 - FlameGraph for Vec<Vec<u8>>

Benchmark Results

The cargo bench results for this particular code were -

Benchmark                     Mean Time per Iteration
Parse_DNS_AAAA                2.7950 us
Parse_DNS                     525.86 ns
Parse_DNS_Regression_Packet   575.52 ns

Note: In the table above, only the mean value (the central one of the three values that cargo bench reports) is shown. cargo bench provides many other details, but they are omitted here since we are mainly looking at one particular parameter, the run-time of the benchmark. Also, these results should be compared for relative performance, as the actual numbers will vary from machine to machine.

Iteration 1 - A False Start

To improve the performance, the recursive call in the labels_from_offset path was avoided. This looked like it improved the performance, but it actually resulted in a breaking test case and hence this approach was discarded. In hindsight it was 'obvious' that this made little sense, because the nature of the 'compression' is itself recursive (a pointed-to name can itself contain a pointer), so recursive calls cannot be avoided.

Iteration 2 - Flattening the Vector

The Vec<Vec<u8>> can be avoided by using a 'flattened' vector from the start, thus avoiding the flattening cost. This iteration was tried next. It improved the performance quite a bit - nearly twice for the worst performing case and by about 20-30 percent in the other cases. This was understandable, because in the other cases there were not as many compressions and hence that part of the code was not hit as often. Also, instead of allocating a Vec with no capacity and growing it, some capacity was reserved; nearly a fourth of the performance improvement was due to 'reserve'-ing the vector. A sketch of this approach is shown below.
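
A minimal sketch of the flattened variant (hypothetical and simplified, as before) -

    // Sketch of iteration 2: labels are appended to a single Vec<u8> with
    // some capacity reserved up front, so there is nothing to flatten.
    fn labels_from_offset(bytes: &[u8], offset: usize) -> Vec<u8> {
        let mut name = Vec::with_capacity(64); // the capacity is a knob to tune
        let mut i = offset;
        loop {
            let len = bytes[i] as usize;
            if len == 0 {
                break;
            }
            if (len & 0xC0) == 0xC0 {
                let target = ((len & 0x3F) << 8) | bytes[i + 1] as usize;
                // Still recursive - and still one allocation per call.
                name.extend_from_slice(&labels_from_offset(bytes, target));
                break;
            }
            name.extend_from_slice(&bytes[i + 1..i + 1 + len]);
            i += 1 + len;
        }
        name
    }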

The actual code for this implementation can be found here.

Benchmark Results

Summary results after running cargo bench with the above changes. As can be seen, the performance of the problematic case improved substantially.

Benchmark                     Mean Time per Iteration
Parse_DNS_AAAA                1.2503 us
Parse_DNS                     394.31 ns
Parse_DNS_Regression_Packet   462.30 ns


Also, as can be seen in the flamegraph below, the time spent in vector allocation is reduced considerably (this can be observed in the SVG by comparing the relative sizes of the rectangles for <alloc::vec::Vec<T>...> and ...::dns_name_from_u8::labels_from_offset).
open flamegraph in new tab

FlameGraph 2 - FlameGraph for Vec<u8>

Iteration 3 - Avoiding the Unused Vector Allocations

One more observation was that in every recursive call we were allocating a vector, which could go unused if that recursive call didn't end up extending the vector. This too could be avoided, by allocating and reserving the vector in the main calling function (dns_name_from_u8) and then passing a mutable reference to it when calling labels_from_offset. Thus for every record there is only one allocation. This improved the performance of the problematic case by another 20-30%, while the performance of the other two cases was more or less unchanged (understandably, because this performance degradation was a function of the number of calls to dns_name_from_u8). See the sketch below.
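
A sketch of this final shape (again hypothetical and simplified; the function names here are not the actual scalpel signatures) -

    // Sketch of iteration 3: the caller allocates (and reserves) a single
    // Vec<u8>; the recursive helper only appends to it via &mut.
    fn name_from_offset(bytes: &[u8], start: usize) -> Vec<u8> {
        let mut name = Vec::with_capacity(64); // one allocation per record
        labels_from_offset(bytes, start, &mut name);
        name
    }

    fn labels_from_offset(bytes: &[u8], offset: usize, name: &mut Vec<u8>) {
        let mut i = offset;
        loop {
            let len = bytes[i] as usize;
            if len == 0 {
                break;
            }
            if (len & 0xC0) == 0xC0 {
                let target = ((len & 0x3F) << 8) | bytes[i + 1] as usize;
                // The recursion allocates nothing of its own.
                labels_from_offset(bytes, target, name);
                break;
            }
            name.extend_from_slice(&bytes[i + 1..i + 1 + len]);
            i += 1 + len;
        }
    }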

The actual code for this implementation can be found here.

Benchmark Results

Benchmark                     Mean Time per Iteration
Parse_DNS_AAAA                919.64 ns
Parse_DNS                     384.97 ns
Parse_DNS_Regression_Packet   450.01 ns


As can be seen from the FlameGraph, the relative time spent in vector allocation (the size of the rectangle in the SVG) is even smaller in this case.
open flamegraph in new tab

FlameGraph 3 - FlameGraph for Vec<u8>

Key Takeaways

Key takeaways from these iterations can be summarized as follows -

  1. Avoid flattening vectors in a fast path as far as possible. Note: this may not always be possible, but if there is a way to avoid it, it's perhaps a good idea.
  2. Also, avoid all unnecessary allocations and consider reserving capacity where it makes sense. The sweet spot for the 'reserved' capacity can be determined by experimenting with different values.
  3. Tools like FlameGraph provide excellent visual feedback for our reasoning about performance; they help identify the areas of the code where the bulk of the time is spent, if any, so one can then focus on improving those areas.

Postscript

Even after these 'optimizations' (they should rather be called improvements than optimizations), the code was still not doing as well as the gopacket code, which didn't make a whole lot of sense, as most other benchmarks were running substantially better. It turned out that the gopacket benchmark was reusing an allocated slice and hence appeared to run faster. When that gopacket benchmark was changed to run like the benchmark above, the results were consistent, with the scalpel benchmark running substantially faster. The table below shows the comparison between the gopacket and scalpel implementations. While it might look like scalpel is doing twice as well as gopacket on these benchmarks, I would not jump to that conclusion and shout 'this is twice as fast', but it's surely promising enough (in fact more than promising enough) to be pursued further.

Benchmark                     gopacket*   gopacket+   scalpel
Parse_DNS_AAAA                4566.2 ns   644.3 ns    919.64 ns
Parse_DNS                     N/A         867.8 ns    384.97 ns
Parse_DNS_Regression_Packet   N/A         1094 ns     450.01 ns

* - gopacket benchmark modified for comparison with scalpel.
+ - The actual gopacket benchmark (reusing the allocated slice).

For the gopacket benchmarks, look at the dns_test.go and udp_test.go files.

Contributions Welcome

scalpel is looking for contributors to write dissectors for more protocols. Please check the scalpel project on Github.

  1. scalpel - Packet Dissection and Sculpting in Rust