Faster double-to-string conversion

There comes a time in every software engineer’s life when they come up with a new binary-to-decimal floating-point conversion method. I guess my time has come. I just wrote one, mostly over a weekend: https://github.com/vitaut/zmij.

It incorporates lessons learned from implementing Dragon4, Grisu and Schubfach, along with a few new ideas of my own and from others. The main guiding principle is Alexandrescu’s “no work is less work than some work”, so a number of the improvements come from removing things from Schubfach (conditional branches, computations and even candidate numbers).

Performance

Here is how it performs on dtoa-benchmark:

  • ~68% faster than Dragonbox, the previous leader (at least among algorithms with correctness proofs)
  • ~2 times faster than my Schubfach implementation that closely follows the original paper
  • ~3.5 times faster than std::to_chars from libc++ (Ryu?)
  • ~6.8 times faster than Google’s double-conversion (Grisu3)
  • ~59 times (not percent!) faster than sprintf on macOS (Dragon4?)

Converting a single double takes about 10 to 20 ns on Apple M1.

What are the improvements?

Here is a list of improvements compared to Schubfach:

  • Selection from 1–3 candidates instead of 2–4
  • Fewer integer multiplications in the shorter case
  • Faster logarithm approximations
  • Faster division and modulo
  • Fewer conditional branches
  • More efficient significand and exponent output

Let’s take a look at some of them.

The first small improvement is having a single branch to quickly check for special cases: NaN, infinity, zero or subnormals. There are still additional checks within that path but the common case is more streamlined.

Another improvement is using faster fixed-point logarithm approximations. Schubfach does the following:

// log10_2_sig = round(log10(2) * 2**log10_2_exp)
constexpr int64_t log10_2_sig = 661'971'961'083;
constexpr int log10_2_exp = 41;

// Computes floor(log10(pow(2, e))) for |e| <= 5456721.
auto floor_log10_pow2(int e) noexcept -> int {
  return e * log10_2_sig >> log10_2_exp;
}

which uses 64-bit multiplication:

floor_log10_pow2(int):
        movsxd  rcx, edi
        movabs  rax, 661971961083
        imul    rax, rcx
        sar     rax, 41
        ret

However, for our range of inputs (exponents), we can use a 32-bit approximation:

// log10_2_sig = round(log10(2) * 2**log10_2_exp)
constexpr int log10_2_sig = 315'653;
constexpr int log10_2_exp = 20;

auto floor_log10_pow2(int e) noexcept -> int {
  return e * log10_2_sig >> log10_2_exp;
}

resulting in

floor_log10_pow2(int):
        imul    eax, edi, 315653
        sar     eax, 20
        ret

Dragonbox also uses 32-bit approximations with slightly different constants.

Similarly, we can replace some integer divisions with integer multiplications. Compilers already know how to do this, but we can do better when we know that the range of inputs is small:

struct divmod_result { uint32_t quot, rem; };

// Returns {value / 100, value % 100} correct for values of up to 4 digits.
inline auto divmod100(uint32_t value) noexcept -> divmod_result {
  assert(value < 10'000);
  constexpr int exp = 19;  // 19 is faster or equal to 12 even for 3 digits.
  constexpr int sig = (1 << exp) / 100 + 1;
  uint32_t div = (value * sig) >> exp;  // value / 100
  return {div, value - div * 100};
}

Another optimization and simplification is branchless handling of irregular rounding intervals. I wrote about rounding intervals in my earlier blog post, but for the purposes of this post it is sufficient to know that a rounding interval for a floating-point number is an interval that contains all real numbers that round back to that number. Normally the intervals are symmetric, except when there is a jump in the exponent (the irregular case).

Most algorithms handle irregular intervals via a completely separate path or at least some branching. This is not terrible, because irregular cases are rare for random floating-point numbers. However, it is possible to handle it cheaply and branchlessly, avoiding extra complexity, which is what I did.

A more interesting improvement comes from Cassio Neri’s talk “Fast Conversion From Floating Point Numbers”. In Schubfach, we look at four candidate numbers. The first two, of which at most one is in the rounding interval, correspond to the larger decimal exponent. The other two, of which at least one is in the rounding interval, correspond to the smaller exponent. Cassio’s insight is that in the first case we can directly construct a single candidate from the upper bound.

This improvement has a nice effect: it allows us to avoid scaling the value itself by a power of 10, because we only need the lower and upper bounds. This saves two 64-bit integer multiplications in the shorter case.

Unfortunately, this does not help in the longer case, but there are improvements to be made there as well. Classic Schubfach first checks whether only one candidate from the second set is in the rounding interval and returns early in that case. We can combine this check with the closedness check. This seems counterintuitive, because we do more work (sorry, Andrei), but it eliminates a poorly predicted conditional branch and also simplifies the code.

So we go from this:

uint64_t dec_sig_over = dec_sig_under + 1;
bool under_in = lower + bin_sig_lsb <= (dec_sig_under << 2);
bool over_in = (dec_sig_over << 2) + bin_sig_lsb <= upper;
if (under_in != over_in) {
  // Only one of dec_sig_under and dec_sig_over is in the rounding interval.
  return write(buffer, under_in ? dec_sig_under : dec_sig_over, dec_exp);
}

// Both dec_sig_under and dec_sig_over are in the interval - pick the closest.
int cmp = scaled_sig - ((dec_sig_under + dec_sig_over) << 1);
bool under_closer = cmp < 0 || (cmp == 0 && (dec_sig_under & 1) == 0);
return write(buffer, under_closer ? dec_sig_under : dec_sig_over, dec_exp);

to this:

// Pick the closest of dec_sig_under and dec_sig_over and check if it's in
// the rounding interval.
int64_t cmp = int64_t(scaled_sig - ((dec_sig_under + dec_sig_over) << 1));
bool under_closer = cmp < 0 || (cmp == 0 && (dec_sig_under & 1) == 0);
bool under_in = (dec_sig_under << 2) >= lower;
write(buffer, (under_closer & under_in) ? dec_sig_under : dec_sig_over,
      dec_exp);

There are also many improvements in significand and exponent output. The simplest one, which has been used for many years in {fmt} and which I learned from Alexandrescu’s talk “Three Optimization Tips for C++”, is using a lookup table to output pairs of decimal digits. This alone halves the number of integer multiplications and is particularly important here, because the significand is often 16–17 digits long.

Another trick is branchless removal of trailing zeros using another small lookup table, which I believe comes from the excellent Drachennest project by Alexander Bolz. There are ideas for improving this further and potentially getting rid of the lookup table entirely.

Is this a new algorithm?

Does it deserve to be called a new algorithm, or is it just an optimization of Schubfach?

I am not sure, but at the very least it deserves a separate GitHub project =).

Where this can be used

This method, or at least some elements of it, will be used in {fmt}, and it is also a good candidate for JSON serialization in Thrift and elsewhere. If you have other applications that could benefit from faster floating-point formatting, feel free to check it out now, or wait until it is integrated into {fmt}.

Thanks to my ISO C++ paper P2587 “to_string or not to_string”, std::to_string will also be able to use this or a similar method, making this standard API both performant and actually useful.

Current limitations

Despite the name, the implementation is not fully polished yet. In particular, it currently supports only exponential, also known as scientific, format, although adding fixed format should be straightforward.

“Fun” fact

My former colleague David Gay wrote an early dtoa implementation back at Bell Labs, and it was widely used for many years.


Last modified on 2025-12-13