double to string conversion in 150 lines of code

My previous blog post gained some attention on Hacker News (discussion), and as someone deeply interested in floating-point formatting, I couldn’t ignore the following comment:

It’s kind of mindblowing to see how much code floating point formatting needs.

If you want it to be fast. The baseline implementation isn’t terrible[1,2] even if it is still ultimately an implementation of arbitrary-precision arithmetic.

[1] https://research.swtch.com/ftoa

[2] https://go.dev/src/strconv/ftoa.go

The linked algorithm seemed clearly unsuitable for anything beyond teaching purposes, but I decided to give it a try anyway. One claim looked particularly suspicious: the post boasted 100,000 conversions per second — about three orders of magnitude slower than state-of-the-art methods, yet surprisingly good for an algorithm that works on individual decimal digits, especially considering these were 2011 numbers.

To verify these claims, the first thing I did was reproduce the results using a simple benchmark on the implementation from the gist:

func BenchmarkFtoa(b *testing.B) {
  for i := 0; i < b.N; i++ {
    ftoa(data[i % 1000], 17)
  }
}

I used data from dtoa-benchmark to ensure meaningful comparisons with other methods later.

Here are the benchmark results on my M1 MacBook Pro (best of three runs):

% go test --bench=.
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFtoa-8             6006            203975 ns/op
PASS

The full benchmark is available here.

The numbers are nowhere near the claimed 100,000 conversions per second. Converting a random floating-point number takes approximately 200 microseconds (5,000 conversions per second), which is four orders of magnitude slower than state-of-the-art algorithms like Dragonbox or even an optimized version of Grisu, which run in the tens of nanoseconds on the same machine. My initial estimate was off by one order of magnitude — classic programmer mistake!

However, the runtime of this algorithm depends on the magnitude of the exponent. If you restrict the input to numbers close to 1 (away from infinity and zero), the performance goes from terrible to merely bad. For instance, formatting π takes “only” 2.5 microseconds.

While disappointing, this outcome wasn’t entirely unexpected. But what about the implementation in the strconv package? Being part of the standard library, you would expect it to be optimized, right?

Unfortunately, the standard implementation isn’t directly accessible via the public API. However, with a small tweak to the strconv package (commenting out the calls to Ryu), we can force the algorithm to be used and test it with the following benchmark:

func BenchmarkStdFtoa(b *testing.B) {
  for i := 0; i < b.N; i++ {
    FormatFloat(data[i % 1000], 'e', 17, 64)
  }
}

% go test --bench=.
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkStdFtoa-8        327270              3657 ns/op
PASS

This is now close to the best case for the naive implementation, yet it remains several times slower than sprintf or even a basic implementation of Dragon4, and about 100 times slower than modern algorithms.

It’s unclear why anyone would use this algorithm in practice, even as a fallback, because the optimized version loses its only advantage: simplicity. For instance, the implementation of multi-precision decimal, which is its main component, is larger and more complex than its binary equivalent, bigint, used in the fallback Dragon4 implementation of the {fmt} library. Additionally, it requires several kilobytes of data to make performance a bit less terrible. While it might potentially outperform Dragon4 when producing “garbage” digits, optimizing for that scenario is questionable at best.

Regardless, I decided to test the final hypothesis and see if we could identify some low-hanging fruit to make this algorithm perform at least on par with Dragon4. As noted in a comment on Hacker News:

You can likely get some of the performance back by picking the low-hanging fruit, e.g. switching from dumb one-byte bigint limbs in [0,10) to somewhat less dumb 32-bit limbs in [0,1e9).

From this point forward, I’ll be using C++, which should provide additional performance benefits. The implementation turned out to be quite straightforward, only slightly more complex than the initial textbook version. Here’s an implementation of a multi-precision decimal number that supports two operations: shifting left and right by a specified number of bits:

// A fixed-point decimal number.
struct decimal {
  int num_bigits = 0;
  // Each bigit is a 9-digit decimal number.
  uint32_t bigits[100];
  static constexpr int bigit_bound = 1000000000;

  // Bigits are organized as follows:
  //   bigits[0] ... bigits[F - 1].bigits[F] ... bigits[N - 1],
  // where F is fraction_start.
  int fraction_start;

  void shift_left(int n) {
    int offset = *bigits >= (bigit_bound >> n) ? 1 : 0;
    uint32_t carry = 0;
    for (int i = num_bigits - 1; i >= 0; --i) {
      uint64_t bigit = bigits[i];
      bigit = (bigit << n) + carry;
      if (bigit >= bigit_bound) {
        carry = bigit / bigit_bound;
        bigit = bigit % bigit_bound;
      } else {
        carry = 0;
      }
      bigits[i + offset] = static_cast<uint32_t>(bigit);
    }
    if (offset != 0) {
      bigits[0] = carry;
      ++num_bigits;
    }
  }

  void shift_right(int n) {
    uint32_t mask = (1 << n) - 1;
    uint32_t borrow = 0;
    int offset = 0;
    if ((*bigits >> n) == 0 && *bigits != 0) {
      offset = 1;
      --num_bigits;
      --fraction_start;
      borrow = uint64_t(*bigits) * bigit_bound >> n;
    }
    for (int i = 0; i != num_bigits; ++i) {
      uint64_t bigit = bigits[i + offset];
      uint32_t new_borrow = (bigit & mask) * bigit_bound >> n;
      bigits[i] = borrow + (bigit >> n);
      borrow = new_borrow;
    }
    if (borrow != 0) bigits[num_bigits++] = borrow;
  }

  explicit decimal(double d) {
    int exp;
    int num_bits = std::numeric_limits<double>::digits;
    int64_t v = static_cast<int64_t>(frexp(d, &exp) * (1ull << num_bits));
    if (v < 0) v = -v;
    exp -= num_bits;

    if (exp >= 0) {
      if (v >= bigit_bound) {
        uint32_t upper = v / bigit_bound;
        if (upper != 0) bigits[num_bigits++] = upper;
      }
      bigits[num_bigits++] = v % bigit_bound;
      int i = 0;
      int bits_per_iteration = 29; // 2**29 fits in one bigit.
      for (; i <= exp - bits_per_iteration; i += bits_per_iteration)
        shift_left(bits_per_iteration);
      if (i != exp) shift_left(exp - i);
      fraction_start = num_bigits;
    } else {
      fraction_start = 1;
      if (v >= bigit_bound) {
        uint32_t upper = v / bigit_bound;
        if (upper != 0) {
          bigits[num_bigits++] = upper;
          ++fraction_start;
        }
      }
      bigits[num_bigits++] = v % bigit_bound;
      int i = 0;
      int bits_per_iteration = 9; // 10**9 can only be shifted left 9 bits.
      for (; i - bits_per_iteration >= exp; i -= bits_per_iteration)
        shift_right(bits_per_iteration);
      if (i != exp) shift_right(i - exp);
    }
  }
};

decimal handles all the heavy lifting, so in dtoa, our primary tasks are to read the digits, place them in the output buffer, and possibly perform the rounding:

void dtoa_puff(char* buf, double val, int precision) {
  decimal d(val);

  int bigit_index = *d.bigits > 0 ? 0 : 1;
  char* ptr = std::to_chars(buf, buf + precision, d.bigits[bigit_index++]).ptr;
  int count = ptr - buf;
  int exp = (d.fraction_start - bigit_index) * 9 + count - 1;
  for (; bigit_index < d.num_bigits && count <= precision; ++bigit_index) {
    char* block = buf + count;
    ptr = std::to_chars(block, block + 9, d.bigits[bigit_index]).ptr;
    int num_digits = ptr - block, num_zeros = 9 - num_digits;
    if (num_digits < 9) {
      memmove(block + num_zeros, block, num_digits);
      memcpy(block, "00000000", num_zeros);
    }
    count += 9;
  }
  auto has_nonzero = [=]() {
    for (int i = precision + 1; i < count; ++i) {
      if (buf[i] != '0') return true;
    }
    for (int i = bigit_index + 1; i < d.num_bigits; ++i) {
      if (d.bigits[i] != 0) return true;
    }
    return false;
  };
  if (count > precision) {
    char digit = buf[precision];
    if (digit > '5' ||
        (digit == '5' && ((buf[precision - 1] % 2) == 1 || has_nonzero()))) {
      int i = precision - 1;
      for (; i >= 0 && buf[i] == '9'; --i) buf[i] = '0';
      if (i >= 0) {
        ++buf[i];
      } else {
        buf[0] = '1';
        ++exp;
      }
    }
    count = precision;
  }
  bool negative = signbit(val);
  memmove(buf + 2 + (negative ? 1 : 0), buf + 1, count - 1);
  int offset = 1;
  if (negative) {
    buf[1] = buf[0];
    buf[0] = '-';
    ++offset;
  }
  buf[offset] = '.';
  for (count += offset; count <= precision; ++count) buf[count] = '0';
  buf[count++] = 'e';
  if (exp >= 0) buf[count++] = '+';
  *std::to_chars(buf + count, buf + count + 4, exp).ptr = '\0';
}

To differentiate this implementation from the numerous others, let’s call it Puff.

The full implementation is available here: https://www.godbolt.org/z/rYqnaW6bq.

In addition to the obvious improvement of using larger “digits”, Puff also shifts by multiple bits per iteration — up to 29 bits for left shifts and up to 9 bits for right shifts.

Now, let’s examine its performance using the following benchmark, which is essentially the same as the previous one but rewritten to utilize Google Benchmark:

void dtoa_puff(benchmark::State &s) {
  size_t i = 0;
  char buf[100];
  while (s.KeepRunning())
    dtoa_puff(buf, data[i++ % 1000], 17);
}
BENCHMARK(dtoa_puff);

One caveat worth mentioning is that to_chars is not available on my system, so I had to implement it using the {fmt} library.

Here are the results:

Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 4.82, 4.52, 4.66
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
dtoa              984 ns          984 ns       710992

(Ignore the 24 MHz part—this is a known Google Benchmark issue that doesn’t affect the results.)

Puff is approximately 3.5 times faster than its counterpart in the Go standard library and about 200 times faster than the original unoptimized version. It’s also simpler, with completely different optimizations applied, and it doesn’t rely on any data tables.

But how does it stack up against other methods? To find out, I added it to a fork of dtoa-benchmark, and here are the results:

As you can see, the optimized version of this method comes quite close to sprintf. It’s still about 30 times slower than Dragonbox (the version integrated into {fmt}), but let’s give it some slack — not every dragon is destined to soar; some are just here for a scenic glide.

But hey, for a few hours of work (mostly spent debugging), it’s not too shabby!

I hope you found this useful! Let me know in the comments if you’d be interested in a similar deep dive into Dragon4.


Last modified on 2024-09-27