Reducing {fmt} library size 4x using Bloaty McBloatface

When it comes to comparing different software solutions, speed is often the main if not the only factor considered. This is not the case in the {fmt} library which tries to find a good balance between speed, binary size and compile times by default and give you an option to get maximum performance in cases that matter.

A lot of effort has been put into making binary code at the usage site very compact, comparable to that of printf. For example (godbolt):

#include <fmt/core.h>

int main() {
  fmt::print("The answer is {}.\n", 42);
}

compiles to just

main:                                   # @main
        sub     rsp, 24
        mov     qword ptr [rsp], 42
        mov     rcx, rsp
        mov     edi, offset .L.str
        mov     esi, 18
        mov     edx, 1
        call    fmt::v6::vprint(fmt::v6::basic_string_view<char>, fmt::v6::format_args)
        xor     eax, eax
        add     rsp, 24
        ret
.L.str:
        .asciz  "The answer is {}.\n"

Compare this to iostreams (godbolt):

#include <iostream>

int main() {
  std::cout << "The answer is " << 42 << ".\n";
}
main:                                   # @main
        push    rax
        mov     edi, offset std::cout
        mov     esi, offset .L.str
        mov     edx, 14
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        mov     edi, offset std::cout
        mov     esi, 42
        call    std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
        mov     esi, offset .L.str.1
        mov     edx, 2
        mov     rdi, rax
        call    std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
        xor     eax, eax
        pop     rcx
        ret
_GLOBAL__sub_I_example.cpp:             # @_GLOBAL__sub_I_example.cpp
        push    rax
        mov     edi, offset std::__ioinit
        call    std::ios_base::Init::Init() [complete object constructor]
        mov     edi, offset std::ios_base::Init::~Init() [complete object destructor]
        mov     esi, offset std::__ioinit
        mov     edx, offset __dso_handle
        pop     rax
        jmp     __cxa_atexit            # TAILCALL
.L.str:
        .asciz  "The answer is "

.L.str.1:
        .asciz  ".\n"

(You can ignore the _GLOBAL__sub_I_example.cpp part.)

Not only {fmt}’s binary code is smaller, it has fewer function calls and is faster. And this is one of the best cases for iostreams specifically chosen to avoid blog post bloat. With ostringstream the compiler can generate half of a kilobyte of binary code even for trivial formatting (see e.g. Binary code comparison). Code bloat is a common problem with concatenation-based formatting methods compared to replacement-based ones like printf.

While per-use binary size in {fmt} is highly optimized, little effort has been put so far into making the library itself smaller. We try not to do anything obviously silly that would result in an explosion of template instantiations but little more than that.

At the same time, reducing library size has been a frequently requested feature from folks working on mobile and embedded systems. So I finally decided to look into this problem more systematically. In this post I use the wonderfully named and overall super nice tool called Bloaty McBloatface, a size profiler for binaries, to try making the {fmt} library smaller.

Let’s use the earlier example and compile it with {fmt} revision 5b02881.

test.cc:

#include <fmt/core.h>

int main() {
  fmt::print("The answer is {}.\n", 42);
}
% clang++ -std=c++17  -I include -O2 -DNDEBUG test.cc src/format.cc
% strip a.out
% ls -l a.out
...  367708 May 21 08:24 a.out

So we start with ~368k which seems excessive even considering that {fmt} implements a replacement for (s)printf, iostreams and double-conversion including full implementation of Grisu3 and Dragon4 floating point formatting algorithms. Can we do better?

Recompiling the test binary to restore symbols and running bloaty gives the following result:

% bloaty a.out -d symbols
     VM SIZE                                                                    FILE SIZE
 --------------                                                              --------------
  59.5%   242Ki fmt::v6::internal::basic_writer<>::write_padded<>()            242Ki  59.5%
  13.3%  54.1Ki [152 Others]                                                  54.1Ki  13.3%
   3.4%  13.7Ki fmt::v6::internal::format_decimal<>()                         13.7Ki   3.4%
   3.0%  12.3Ki fmt::v6::internal::basic_writer<>::int_writer<>::on_num()     12.3Ki   3.0%
   ...

As you can see write_padded is responsible for 59% of the binary code. So much for not doing anything silly 🤦😂. This function adds padding before and after the formatted argument so that it appears aligned to the left, center, or right per format specifications. For example, in

fmt::print("{:>10}", 42);

it will write 8 spaces and then call a function that writes “42”.

template <typename F> void write_padded(const format_specs& specs, F&& f) {
  unsigned width = to_unsigned(specs.width);
  size_t size = f.size();
  size_t num_code_points = width != 0 ? f.width() : size;
  if (width <= num_code_points) return f(reserve(size));
  size_t padding = width - num_code_points;
  size_t fill_size = specs.fill.size();
  auto&& it = reserve(size + padding * fill_size);
  if (specs.align == align::right) {
    it = fill(it, padding, specs.fill);
    f(it);
  } else if (specs.align == align::center) {
    std::size_t left_padding = padding / 2;
    it = fill(it, left_padding, specs.fill);
    f(it);
    it = fill(it, padding - left_padding, specs.fill);
  } else {
    f(it);
    it = fill(it, padding, specs.fill);
  }
}

Here f.size() returns the data size in code units, f.width() returns width in user-perceived characters and f(it) writes the data.

write_padded itself is relatively small but because it’s used in many places and thanks to overly eager inliner we get this code bloat. Fortunately it has turned out to be a recent issue which hasn’t affected most of the earlier library versions and also very easy to fix - just mark the fill function called from write_padded as noinline:

__attribute__((noinline)) OutputIt fill(
    OutputIt it, size_t n, const fill_t<Char>& fill) {
  ...
}

This trivial change reduced the binary size from ~368k to ~245k which is better but still not perfect. Let’s look at full symbols to see why we have so many instantiations:

% bloaty a.out -d fullsymbols
     VM SIZE                                                                  FILE SIZE
 --------------                                                            --------------
  71.8%   206Ki [358 Others]                                                206Ki  71.9%
   3.8%  10.8Ki [__LINKEDIT]                                               10.8Ki   3.7%
   1.8%  5.11Ki void fmt::...fallback_format<double>(double, fmt::v6::int  5.11Ki   1.8%
   1.5%  4.21Ki void fmt::...<fmt::v6::buffer_range<wchar_t> >::write_pad  4.21Ki   1.5%
   1.4%  4.09Ki void fmt::...<fmt::v6::buffer_range<wchar_t> >::write_pad  4.09Ki   1.4%
   1.4%  4.03Ki void fmt::...<fmt::v6::buffer_range<wchar_t> >::write_pad  4.03Ki   1.4%
   ...

Many of the instantiations are for wchar_t which we don’t even need. A simple way to get rid of those is by using link-time optimization (LTO):

clang++ test.cc -O2 -I include -std=c++17 src/format.cc -DNDEBUG -flto

This reduces the binary size to ~164k. To get the same reduction without LTO let’s remove the wchar_t vformat instantiation:

template FMT_API std::wstring internal::vformat<wchar_t>(
    wstring_view, basic_format_args<wformat_context>);

This way users don’t pay for what they don’t use.

While at it I also noticed and nuked a few deprecated functions such as

     VM SIZE                                                                    FILE SIZE
 --------------                                                              --------------
   ...
   0.4%  1.28Ki fmt::v6::internal::arg_map<>::init()                         1.28Ki   0.4%

and another interesting thing - we get type_info for std::allocator which is not even a polymorphic type:

   0.0%      65 typeinfo name for std::__1::allocator<char>                      65   0.0%
   0.0%      65 typeinfo name for std::__1::allocator<unsigned int>              65   0.0%
   0.0%      65 typeinfo name for std::__1::allocator<wchar_t>                   65   0.0%

The type info is generated because we use the empty base class optimization in memory_buffer, which is a polymorphic class, inheriting from std::allocator:

template <typename T, std::size_t SIZE = inline_buffer_size,
          typename Allocator = std::allocator<T>>
class basic_memory_buffer : private Allocator ...

The type info is tiny but let’s nuke it too replacing inheritance with composition to reduce symbol noise:

template <typename T, std::size_t SIZE = inline_buffer_size,
          typename Allocator = std::allocator<T>>
class basic_memory_buffer : ... {
  Allocator alloc_;
  ...
};

In C++20 we’ll be able to use [[no_unique_address]] to get the same optimization without the unwanted side-effects.

This brings library size to ~132k (mostly thanks to removing wchar_t instantiations), even better than with LTO. However, write_padded still takes way too much space:

     VM SIZE                                                                 FILE SIZE
 --------------                                                           --------------
  30.7%  48.0Ki fmt::v6::internal::basic_writer<>::write_padded<>()        48.0Ki  30.9%
  26.0%  40.5Ki [128 Others]                                               40.4Ki  26.0%
   6.5%  10.2Ki [__LINKEDIT]                                               9.34Ki   6.0%
   ...

Let’s improve it by doing all the padding computation first and reducing the number of calls to fill:

template <typename F> void write_padded(const format_specs& specs, F&& f) {
  unsigned width = to_unsigned(specs.width);
  size_t size = f.size();
  size_t num_code_points = width != 0 ? f.width() : size;
  size_t padding = width > num_code_points ? width - num_code_points : 0;
  size_t left_padding = 0;
  if (specs.align == align::right)
    left_padding = padding;
  else if (specs.align == align::center)
    left_padding = padding / 2;
  auto&& it = reserve(size + padding * specs.fill.size());
  it = fill(it, left_padding, specs.fill);
  f(it);
  it = fill(it, padding - left_padding, specs.fill);
}

This brings the size to 108k. Now let’s reduce branching by using a bit shift to compute left_padding. Also while at it let’s pass the size and width explicitly:

const char padding_shifts[] = {
// shift     align    meaning
// -----     -------  -------
      31, // default  same as left
      31, // left     shift left padding out of existence
       0, // right    left padding takes it all
       1, // center   left padding gets half
       0  // numeric  I don't even but let's say it's similar to right
};

template <typename F>
void write_padded(const format_specs& specs, size_t size, size_t width,
                  const F& f) {
  unsigned spec_width = to_unsigned(specs.width);
  size_t padding = spec_width > width ? spec_width - width : 0;
  size_t left_padding = padding >> data::padding_shifts[specs.align];
  auto&& it = reserve(size + padding * specs.fill.size());
  it = fill(it, left_padding, specs.fill);
  it = f(it);
  it = fill(it, padding - left_padding, specs.fill);
}

Now we are at ~104k and optimized write_padded so much (30x) that it is no longer the main contributor to the binary size (yay!):

     VM SIZE                                                                 FILE SIZE
 --------------                                                           --------------
  30.6%  37.9Ki [125 Others]                                               37.8Ki  30.6%
   8.0%  9.95Ki [__LINKEDIT]                                               9.39Ki   7.6%
   7.5%  9.24Ki fmt::v6::internal::basic_writer<>::write_int<>()           9.24Ki   7.5%
   6.4%  7.97Ki fmt::v6::internal::write_padded<>()                        7.97Ki   6.5%
   ...

Let’s look into write_int which is an internal function used for fancy integer formatting. It writes an integer in the format <left-padding><prefix><numeric-padding><digits><right-padding> per format specs:

template <typename F>
void write_int(int num_digits, string_view prefix, format_specs specs, F f) {
  std::size_t size = prefix.size() + to_unsigned(num_digits);
  char_type fill = specs.fill[0];
  std::size_t padding = 0;
  if (specs.align == align::numeric) {
    auto unsiged_width = to_unsigned(specs.width);
    if (unsiged_width > size) {
      padding = unsiged_width - size;
      size = unsiged_width;
    }
  } else if (specs.precision > num_digits) {
    size = prefix.size() + to_unsigned(specs.precision);
    padding = to_unsigned(specs.precision - num_digits);
    fill = static_cast<char_type>('0');
  }
  if (specs.align == align::none) specs.align = align::right;
  out_ = write_padded(out_, specs, size, [=](reserve_iterator it) {
    ...
  });
}

Here prefix is a base prefix such as “0x” and f is a callable that writes digits through the output iterator. We can optimize this by moving the part that doesn’t depend on F to a separate function only parameterized by the character type:

template <typename Char> struct write_int_params {
  std::size_t size;
  std::size_t padding;
  Char fill;
};

template <typename Char>
write_int_params<Char> make_write_int_params(int num_digits, string_view prefix,
                                             basic_format_specs<Char>& specs) {
  std::size_t size = prefix.size() + to_unsigned(num_digits);
  Char fill = specs.fill[0];
  std::size_t padding = 0;
  if (specs.align == align::numeric) {
    auto width = to_unsigned(specs.width);
    if (width > size) {
      padding = width - size;
      size = width;
    }
  } else if (specs.precision > num_digits) {
    size = prefix.size() + to_unsigned(specs.precision);
    padding = to_unsigned(specs.precision - num_digits);
    fill = static_cast<Char>('0');
  }
  if (specs.align == align::none) specs.align = align::right;
  return {size, padding, fill};
}

template <typename F>
void write_int(int num_digits, string_view prefix, format_specs specs, F f) {
  auto params = make_write_int_params(num_digits, prefix, specs);
  out_ = write_padded(out_, specs, params.size, [=](reserve_iterator it) {
    ...
  });
}

Bloaty also shows some suspicious instantiations of int_writer for signed integral types:

     VM SIZE                                                          FILE SIZE
 --------------                                                    --------------
   ...
   0.7%     863 fmt::...basic_writer<...>::int_writer<int...       863   0.7%
   0.7%     863 fmt::...basic_writer<...>::int_writer<long...      863   0.7%

Why are these suspicious? {fmt} tries to reduce the number of instantiations of internal templates by mapping integral types into a smaller subset which still covers all possible bit sizes. On most platforms it’s just 3 unsigned types: 32-bit, 64-bit and 128-bit. However, int_writer slipped through the cracks and was accidentally instantiated for signed types. This also explains much of the code bloat we’ve seen so far because those functions are used from int_writer.

template <typename Int> struct int_writer {
  using unsigned_type = uint32_or_64_or_128_t<Int>;

  int_writer(basic_writer<Range>& w, Int value,
             const basic_format_specs<char_type>& s)
      : abs_value(static_cast<unsigned_type>(value)), ...

The fix is to move type mapping (uint32_or_64_or_128_t<Int>) from the int_writer class to its constructor and add a static assert so that it doesn’t happen again:

 template <typename UInt> struct int_writer {
  template <typename Int>
  int_writer(basic_writer<Range>& w, Int value,
             const basic_format_specs<char_type>& s)
      : abs_value(static_cast<UInt>(value)), ... {
    static_assert(std::is_same<uint32_or_64_or_128_t<Int>, UInt>::value, "");
  }

This reduces the binary size to ~92k. We can go even further by compiling with -Os instead of -O2 which gives ~75k or with -Os -flto which gives ~57k.

Summary

With the help of Bloaty McBloatface and a few simple transformations we managed to reduce library size 4x, from 368k to just 92k without affecting performance. These improvements have been achieved by

  • reducing the number of template instantiations,
  • reducing inlining,
  • moving code that doesn’t have to be parameterized outside of templates,
  • reducing branching.

Numbers reported here are obtained on x86-64 macOS with clang++ and -O2. On embedded/mobile targets they will likely be smaller. It is possible to reduce library size even more by disabling floating-point support which could be useful on some embedded platforms.

All of these size optimizations and more are already available in {fmt}’s master and will ship in the next major release since they affect ABI.


Last modified on 2020-05-21