Converting a hundred million integers to strings per second

Almost 7 years ago I wrote a blog post comparing performance of different methods of converting an integer into a string. A lot of things have changed since then so I decided to write a follow-up post and see how much progress has been made over the years.

First of all let’s look at the benchmark itself. It is based on Boost Karma int_performance benchmark ported to Google Benchmark to make the results more reproducible. It reports the time taken to convert ten million integers into strings. This is a somewhat weird metric so I changed it into a more intuitive conversion speed in integers per second.

One problem with the benchmark was that it used rand to generate input data in a non-portable way (thanks Travis Downs for reporting this issue). This was fixed by replacing rand with a proper pseudo-random number generator so now the input distribution is consistent across platforms.

Travis also pointed me to a recent bug on Intel CPUs which may affect performance when jump instructions cross cache lines. Experiments showed a big variation in mite_uops / dsb_uops after unrelated code changes (e.g. adding an unused function). To make the benchmark more reliable I implemented Intel CPU detection and the recommended mitigation, namely added -Wa,-mbranches-within-32B-boundaries to the compiler options.

Here is an example of a single benchmark run without mitigation:

$ perf stat -e uops_issued.any,idq.dsb_uops,idq.mite_uops,dsb2mite_switches.penalty_cycles \
    ./int-benchmark --benchmark_filter=format_int
...
---------------------------------------------------------------------
Benchmark           Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------
format_int    8420131 ns      8420392 ns           83 items_per_second=118.759M/s

 Performance counter stats for './int-benchmark --benchmark_filter=format_int':

     9,708,172,187      uops_issued.any                                             
     8,745,800,873      idq.dsb_uops                                                
     1,903,352,743      idq.mite_uops                                               
       470,737,031      dsb2mite_switches.penalty_cycles

and here is the same benchmark run with mitigation:

$ perf stat -e uops_issued.any,idq.dsb_uops,idq.mite_uops,dsb2mite_switches.penalty_cycles \
    ./int-benchmark --benchmark_filter=format_int
...
---------------------------------------------------------------------
Benchmark           Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------
format_int    7455764 ns      7456006 ns           93 items_per_second=134.12M/s

 Performance counter stats for './int-benchmark --benchmark_filter=format_int':

    11,585,171,382      uops_issued.any                                             
    13,069,550,403      idq.dsb_uops                                                
       335,873,122      idq.mite_uops                                               
        11,939,259      dsb2mite_switches.penalty_cycles

As you can see there is almost 13% performance difference and a dramatic drop in dsb2mite_switches.penalty_cycles (and mite_uops / dsb_uops) that suggests that the mitigation is working.

And finally I added digest computation for output strings to prevent compilers from optimizing away parts of the conversion and to validate the results. This is only a few percent slower than benchmark::DoNotOptimize even for the fastest case and makes the benchmark much more reliable and arguably more realistic since the conversion output is unlikely to be discarded in practice.

Here are the results on Intel Core i7-8569U CPU @ 2.80GHz running macOS and compiled with Apple clang version 11.0.3 (clang-1103.0.32.62) and libc++:

Speed ratio, as the name suggests, is the fastest method’s speed divided by the current method speed or how much slower this method is compared to the leader. [c] and [r] mean compile-time and runtime format string processing respectively.

As you can see {fmt}’s format_int still remains the fastest on this benchmark closely followed by format_to with compile-time processing of format strings. Both can now deliver over a 100 million conversions per second on a medium-spec laptop.

Alf P. Steinbach’s elegantly simple decimal_from which does formatting in reverse to avoid size computation and then calls std::reverse is in the third place (fourth after adding u2985907).

std::to_chars is a new addition to the benchmark. It is a low-level numeric formatting function introduced in C++17 to address some of the limitations of sprintf, iostreams and std::to_string. It shows a respectable result, only 30% slower than fmt::format_int.

Remarkably, fmt::to_string and fmt::format with compile-time format string processing are only ~7% slower than libc++’s to_chars even though they return std::string while to_chars operates on a stack-allocated buffer. This is partly due to small string optimization that avoids dynamic memory allocation.

std::to_string is no longer a performance disaster as it used to be, at least on integer input. It is still affected by the global locale giving unpredictable results and nondeterministic performance and should generally be avoided but its an interesting development nevertheless.

Another interesting result is that fmt::format with runtime format string processing is faster than std::to_string and is more than 4x faster than sprintf (previously it was less than 2x faster).

Now let’s repeat the exercise on Linux. Here are the results on Intel Core i9-9900K CPU @ 3.60GHz running Ubuntu 20.20 and compiled with GCC 9.3.0 and libstdc++:

Here the results are more balanced. fmt::format_int is still the leader but it is only ~21% faster than std::to_chars.

Interestingly, std::ostringstream is faster than sprintf on this platform (if you reuse the same stream object).

Runtime format string processing is not as good on gcc as on clang so it might be something worth looking into. In the meantime it’s possible to use compile-time format string processing in rare cases when formatting is a performance bottleneck.

So what is format_int and what makes it fast?

The idea is very simple - it has a small buffer within the object itself and formats backwards with a single pass handling pairs of digits at a time to minimize expensive integer divisions (which a compiler turns into cheaper but still expensive multiplications). The idea of processing pairs of digits comes from the great talk by Alexei Alexandrescu Three Optimization Tips For C++. format_int is very easy to use (https://godbolt.org/z/2Kp9iZ):

auto f = fmt::format_int(42);
// f.data() is the data, f.size() is the size

It’s more user-friendly and safer than other low-level alternatives because it manages memory automatically but you may need to copy the data. Note that neither snprintf nor to_chars give you the information on how much storage you need, so you often have to overallocate and then do an extra copy anyway. For example, here’s how to format into a std::string using std::to_chars:

std::array<char, std::numeric_limits<int>::digits10 + 2> buffer;
auto result = std::to_chars(buffer.data(),
                            buffer.data() + buffer.size(), number);
if (result.ec == std::errc()) {
  std::string result(buffer.data(), result.ptr); // Copy the data into string.
  // Use result.
} else {
  // Handle the error.
}

In any case, you don’t need to use these low-level APIs often - they are mostly for implementing higher level facilities and it’s easy to wrap std::to_chars in a more user-friendly API. For example numeric formatting in std::format is defined in terms of std::to_chars.

Summary

There have been big improvements to standard library implementations in terms of performance (e.g. std::to_string is much faster now) and the availability of new locale-independent formatting facilities (std::to_chars). The open-source {fmt} library continues to provide some of the fastest integer formatters that are often more performant than their standard counterparts.

Benchmarking remains to be a tricky business and extra care should be taken to ensure reproducibility of results, particularly in view of recent hardware bugs and mitigations and increasingly advanced compiler optimizations.

Update:

As pointed out by Иван Афанасьев, libc++’s implementation of to_string ignores the locale for integer formatting (but not for floating point).

Update 2:

Added u2985907, an integer to string conversion method from https://stackoverflow.com/a/19944488/471164 with fixes. This method by StackOverflow user user2985907, sometimes incorrectly attributed to jiaendu, shows good results. Unfortunately it uses a whopping 90k of data tables which is a bit excessive. For comparison, the whole {fmt} library compiled with LTO is ~57k after recent optimizations which includes implementations of integer and floating-point formatting algorithms.

Last modified on 2020-06-13