My previous blog post gained some attention on Hacker News (discussion), and as someone deeply interested in floating-point formatting, I couldn’t ignore the following comment:

> It’s kind of mindblowing to see how much code floating point formatting needs.

> If you want it to be fast. The baseline implementation isn’t terrible [1, 2] even if it is still ultimately an implementation of arbitrary-precision arithmetic.

The linked algorithm seemed clearly unsuitable for anything beyond teaching purposes, but I decided to give it a try anyway. One particularly suspicious claim was the boasted 100,000 conversions per second: about three orders of magnitude slower than state-of-the-art methods, yet still surprisingly good considering that the algorithm works on individual decimal digits and these were 2011 numbers.

To verify these claims, the first thing I did was reproduce the results using a simple benchmark on the implementation from the gist:

```
func BenchmarkFtoa(b *testing.B) {
	for i := 0; i < b.N; i++ {
		ftoa(data[i%1000], 17)
	}
}
```

I used data from dtoa-benchmark to ensure meaningful comparisons with other methods later.

Here are the benchmark results on my M1 MacBook Pro (best of three runs):

```
% go test --bench=.
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkFtoa-8 6006 203975 ns/op
PASS
```

The full benchmark is available here.

The numbers are nowhere near the claimed 100,000 conversions per second. Converting a random floating-point number takes approximately 200 microseconds (5,000 conversions per second), which is four orders of magnitude slower than state-of-the-art algorithms like Dragonbox or even an optimized version of Grisu, which run in the tens of nanoseconds on the same machine. My initial estimate was off by one order of magnitude — classic programmer mistake!

However, the runtime of this algorithm depends on the magnitude of the exponent. If you restrict the input to numbers close to 1 (away from infinity and zero), the performance goes from terrible to merely bad. For instance, formatting π takes “only” 2.5 microseconds.

While disappointing, this outcome wasn’t entirely unexpected. But what about the implementation in the `strconv` package? Being part of the standard library, you would expect it to be optimized, right?

Unfortunately, the standard implementation isn’t directly accessible via the public API. However, with a small tweak to the `strconv` package (commenting out the calls to Ryu), we can force the algorithm to be used and test it with the following benchmark:

```
func BenchmarkStdFtoa(b *testing.B) {
	for i := 0; i < b.N; i++ {
		FormatFloat(data[i%1000], 'e', 17, 64)
	}
}
```

```
% go test --bench=.
goos: darwin
goarch: arm64
cpu: Apple M1 Pro
BenchmarkStdFtoa-8 327270 3657 ns/op
PASS
```

This is now close to the best case for the naive implementation, yet it remains several times slower than `sprintf` or even a basic implementation of Dragon4, and about 100 times slower than modern algorithms.

It’s unclear why anyone would use this algorithm in practice, even as a fallback, because the optimized version loses its only advantage: simplicity. For instance, the implementation of the multi-precision `decimal`, which is its main component, is larger and more complex than its binary equivalent, `bigint`, used in the fallback Dragon4 implementation of the {fmt} library. Additionally, it requires several kilobytes of data to make performance a bit less terrible. While it might potentially outperform Dragon4 when producing “garbage” digits, optimizing for that scenario is questionable at best.

Regardless, I decided to test the final hypothesis and see if we could identify some low-hanging fruit to make this algorithm perform at least on par with Dragon4. As noted in a comment on Hacker News:

> You can likely get some of the performance back by picking the low-hanging fruit, e.g. switching from dumb one-byte bigint limbs in [0,10) to somewhat less dumb 32-bit limbs in [0,1e9).

From this point forward, I’ll be using C++, which should provide additional performance benefits. The implementation turned out to be quite straightforward, only slightly more complex than the initial textbook version. Here’s an implementation of a multi-precision decimal number that supports two operations: shifting left and right by a specified number of bits:

```
// A fixed-point decimal number.
struct decimal {
  int num_bigits = 0;
  // Each bigit is a 9-digit decimal number.
  uint32_t bigits[100];
  static constexpr int bigit_bound = 1000000000;

  // Bigits are organized as follows:
  //   bigits[0] ... bigits[F - 1] . bigits[F] ... bigits[N - 1],
  // where F is fraction_start and N is num_bigits.
  int fraction_start;

  void shift_left(int n) {
    int offset = *bigits >= (bigit_bound >> n) ? 1 : 0;
    uint32_t carry = 0;
    for (int i = num_bigits - 1; i >= 0; --i) {
      uint64_t bigit = bigits[i];
      bigit = (bigit << n) + carry;
      if (bigit >= bigit_bound) {
        carry = bigit / bigit_bound;
        bigit = bigit % bigit_bound;
      } else {
        carry = 0;
      }
      bigits[i + offset] = static_cast<uint32_t>(bigit);
    }
    if (offset != 0) {
      bigits[0] = carry;
      ++num_bigits;
    }
  }

  void shift_right(int n) {
    uint32_t mask = (1 << n) - 1;
    uint32_t borrow = 0;
    int offset = 0;
    if ((*bigits >> n) == 0 && *bigits != 0) {
      offset = 1;
      --num_bigits;
      --fraction_start;
      borrow = uint64_t(*bigits) * bigit_bound >> n;
    }
    for (int i = 0; i != num_bigits; ++i) {
      uint64_t bigit = bigits[i + offset];
      uint32_t new_borrow = (bigit & mask) * bigit_bound >> n;
      bigits[i] = borrow + (bigit >> n);
      borrow = new_borrow;
    }
    if (borrow != 0) bigits[num_bigits++] = borrow;
  }

  explicit decimal(double d) {
    int exp;
    int num_bits = std::numeric_limits<double>::digits;
    int64_t v = static_cast<int64_t>(frexp(d, &exp) * (1ull << num_bits));
    if (v < 0) v = -v;
    exp -= num_bits;
    if (exp >= 0) {
      if (v >= bigit_bound) {
        uint32_t upper = v / bigit_bound;
        if (upper != 0) bigits[num_bigits++] = upper;
      }
      bigits[num_bigits++] = v % bigit_bound;
      int i = 0;
      int bits_per_iteration = 29;  // 2**29 fits in one bigit.
      for (; i <= exp - bits_per_iteration; i += bits_per_iteration)
        shift_left(bits_per_iteration);
      if (i != exp) shift_left(exp - i);
      fraction_start = num_bigits;
    } else {
      fraction_start = 1;
      if (v >= bigit_bound) {
        uint32_t upper = v / bigit_bound;
        if (upper != 0) {
          bigits[num_bigits++] = upper;
          ++fraction_start;
        }
      }
      bigits[num_bigits++] = v % bigit_bound;
      int i = 0;
      int bits_per_iteration = 9;  // 10**9 can only be shifted left 9 bits.
      for (; i - bits_per_iteration >= exp; i -= bits_per_iteration)
        shift_right(bits_per_iteration);
      if (i != exp) shift_right(i - exp);
    }
  }
};
```

`decimal` handles all the heavy lifting, so in `dtoa`, our primary tasks are to read the digits, place them in the output buffer, and possibly perform the rounding:

```
void dtoa_puff(char* buf, double val, int precision) {
  decimal d(val);
  int bigit_index = *d.bigits > 0 ? 0 : 1;
  char* ptr = std::to_chars(buf, buf + precision, d.bigits[bigit_index++]).ptr;
  int count = ptr - buf;
  int exp = (d.fraction_start - bigit_index) * 9 + count - 1;
  for (; bigit_index < d.num_bigits && count <= precision; ++bigit_index) {
    char* block = buf + count;
    ptr = std::to_chars(block, block + 9, d.bigits[bigit_index]).ptr;
    int num_digits = ptr - block, num_zeros = 9 - num_digits;
    if (num_digits < 9) {
      memmove(block + num_zeros, block, num_digits);
      memcpy(block, "00000000", num_zeros);
    }
    count += 9;
  }
  auto has_nonzero = [=]() {
    for (int i = precision + 1; i < count; ++i) {
      if (buf[i] != '0') return true;
    }
    for (int i = bigit_index + 1; i < d.num_bigits; ++i) {
      if (d.bigits[i] != 0) return true;
    }
    return false;
  };
  if (count > precision) {
    char digit = buf[precision];
    if (digit > '5' ||
        (digit == '5' &&
         ((buf[precision - 1] % 2) == 1 || has_nonzero()))) {
      int i = precision - 1;
      for (; i >= 0 && buf[i] == '9'; --i) buf[i] = '0';
      if (i >= 0) {
        ++buf[i];
      } else {
        buf[0] = '1';
        ++exp;
      }
    }
    count = precision;
  }
  bool negative = signbit(val);
  memmove(buf + 2 + (negative ? 1 : 0), buf + 1, count - 1);
  int offset = 1;
  if (negative) {
    buf[1] = buf[0];
    buf[0] = '-';
    ++offset;
  }
  buf[offset] = '.';
  for (count += offset; count <= precision; ++count) buf[count] = '0';
  buf[count++] = 'e';
  if (exp >= 0) buf[count++] = '+';
  *std::to_chars(buf + count, buf + count + 4, exp).ptr = '\0';
}
```

To differentiate this implementation from the numerous others, let’s call it Puff.

The full implementation is available here: https://www.godbolt.org/z/rYqnaW6bq.

In addition to the obvious improvement of using larger “digits”, Puff also shifts by multiple bits per iteration — up to 29 bits for left shifts and up to 9 bits for right shifts.

Now, let’s examine its performance using the following benchmark, which is essentially the same as the previous one but rewritten to utilize Google Benchmark:

```
void dtoa_puff(benchmark::State& s) {
  size_t i = 0;
  char buf[100];
  while (s.KeepRunning())
    dtoa_puff(buf, data[i++ % 1000], 17);
}
BENCHMARK(dtoa_puff);
```

One caveat worth mentioning is that `to_chars` is not available on my system, so I had to implement it using the {fmt} library.

Here are the results:

```
Run on (8 X 24 MHz CPU s)
CPU Caches:
L1 Data 64 KiB
L1 Instruction 128 KiB
L2 Unified 4096 KiB (x8)
Load Average: 4.82, 4.52, 4.66
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
dtoa_puff 984 ns 984 ns 710992
```

*(Ignore the 24 MHz part—this is a known Google Benchmark issue that doesn’t
affect the results.)*

Puff is approximately 3.5 times faster than its counterpart in the Go standard library, or 200 times faster than the original unoptimized version. It’s also simpler, with completely different optimizations applied, and it doesn’t rely on any data tables.

But how does it stack up against other methods? To find out, I added it to a fork of dtoa-benchmark, and here are the results:

As you can see, the optimized version of this method comes quite close to `sprintf`. It’s still about 30 times slower than Dragonbox (the version integrated into {fmt}), but let’s give it some slack: not every dragon is destined to soar; some are just here for a scenic glide.

But hey, for a few hours of work (mostly spent debugging), it’s not too shabby!

I hope you found this useful! Let me know in the comments if you’d be interested in a similar deep dive into Dragon4.

Last modified on 2024-09-27