Reducing printf call overhead with variadic templates

A few days ago I learned about a cool new project called printpp which is a printf format string preprocessor. It rewrites format strings from brace-delimited format into printf format at compile time. Although the project mentions Python formatting, the only thing they seem to have in common is delimiters.

Great thing about this approach is that it doesn’t add any overhead to printf. For example, the code

int main() {
  pprintf("{} hello {s}! {}\n", 1, "world", 2);
}

where pprintf is a macro defined by pprintpp, compiles to

0000000000400450 <main>:
  400450:       48 83 ec 08             sub    $0x8,%rsp
  400454:       41 b8 02 00 00 00       mov    $0x2,%r8d
  40045a:       b9 04 06 40 00          mov    $0x400604,%ecx
  40045f:       ba 01 00 00 00          mov    $0x1,%edx
  400464:       be 10 06 40 00          mov    $0x400610,%esi
  400469:       bf 01 00 00 00          mov    $0x1,%edi
  40046e:       31 c0                   xor    %eax,%eax
  400470:       e8 bb ff ff ff          callq  400430 <__printf_chk@plt>
  400475:       31 c0                   xor    %eax,%eax
  400477:       48 83 c4 08             add    $0x8,%rsp
  40047b:       c3                      retq

This is very nice in terms of binary code size, but can we do better in terms of performance if we don’t limit ourselves to printf? Notice that the compiler generated a call to __printf_chk. Let’s have a look at this function:

/* Write formatted output to stdout from the format string FORMAT.  */
int
___printf_chk (int flag, const char *format, ...)
{
  va_list ap;
  int done;

  _IO_acquire_lock_clear_flags2 (stdout);
  if (flag > 0)
    stdout->_flags2 |= _IO_FLAGS2_FORTIFY;

  va_start (ap, format);
  done = vfprintf (stdout, format, ap);
  va_end (ap);

  if (flag > 0)
    stdout->_flags2 &= ~_IO_FLAGS2_FORTIFY;
  _IO_release_lock (stdout);

  return done;
}

Unsurprisingly this function does little but call vfprintf. Calling an overload that takes va_list is common for functions that use varargs. Let’s use Google Benchmark to measure the effect of this extra call and vararg processing:

int __attribute__((noinline)) test_vprintf(const char *f, std::va_list) {
  benchmark::DoNotOptimize(f);
  return 0;
}

int test_printf(const char *format, ...) {
  std::va_list args;
  va_start(args, format);
  int result = test_vprintf(format, args);
  va_end(args);
  return result;
}

void varargs(benchmark::State& state) {
  while (state.KeepRunning())
    test_printf("%d", 42);
}

BENCHMARK(varargs);

and let’s compare this to a method of passing formatting arguments via variadic templates implemented in the fmt library:

void __attribute__((noinline)) test_vprint(const char *f,
                                           fmt::format_args) {
  benchmark::DoNotOptimize(f);
}

typedef fmt::BasicFormatter<char> Formatter;

template <typename ... Args>
inline void test_print(const char *format, const Args & ... args) {
  test_vprint(format, fmt::make_format_args<Formatter>(args...));
}

void fmt_variadic(benchmark::State &state) {
  while (state.KeepRunning())
    test_print("{}", 42);
}

BENCHMARK(fmt_variadic);

Note that I’m using the experimental std branch of the fmt library for cleaner C++11 API, but the results should be the same for the stock version which is C++98 compatible.

The test_print function calls make_format_args to place formatting arguments in an fmt::format_args object that is passed to test_vprint. This looks very similar to the printf case but because test_print and fmt::make_format_args are inline, compiler optimizes them away and the effect is that the arguments are placed into fmt::format_args at the caller site. And since fmt::format_args is not parameterized on format argument types, there is no template code bloat.

Running this benchmark on my Early 2015 MacBook Pro gives the following result:

Run on (4 X 3100 MHz CPU s)
2016-11-05 17:32:39
Benchmark                 Time           CPU Iterations
-------------------------------------------------------
varargs                   5 ns          5 ns  133920031
fmt_variadic              2 ns          2 ns  355192920

The fmt_variadic method that uses variadic templates is 2.5 times faster but the absolute difference is just 3ns per call. Does it really matter? To answer this question, let’s compare it to the total formatting time:

void test_sprintf(benchmark::State &state) {
  char buffer[64];
  while (state.KeepRunning())
    std::sprintf(buffer, "%d", 42);
}

BENCHMARK(test_sprintf);

void test_format(benchmark::State &state) {
  while (state.KeepRunning())
    fmt::format("{}", 42);
}

BENCHMARK(test_format);

Benchmark                 Time           CPU Iterations
-------------------------------------------------------
varargs                   5 ns          5 ns  133920031
fmt_variadic              2 ns          2 ns  355192920
test_sprintf             73 ns         72 ns    9571341
test_format              47 ns         47 ns   14373805

As you can see formatting is so fast that even varargs overhead is nontrivial. Using varargs would slow down fmt::format by more than 6% and this is not even the most efficient API fmt provides, because it returns std::string.

The situation becomes even more interesting when you take into account positional arguments. The arguments in va_list can only be accessed sequentially and need to be copied by a printf-like function to an array for efficient access by index.

With variadic templates nothing prevents storing arguments in an array on the caller side and this is exactly what fmt does. The effect is easy to see on this benchmark:

void test_sprintf_pos(benchmark::State &state) {
  char buffer[64];
  while (state.KeepRunning())
    std::sprintf(buffer, "%1$d", 42);
}

BENCHMARK(test_sprintf_pos);

void test_format_pos(benchmark::State &state) {
  while (state.KeepRunning())
    fmt::format("{0}", 42);
}

BENCHMARK(test_format_pos);

Benchmark                 Time           CPU Iterations
-------------------------------------------------------
varargs                   5 ns          5 ns  133920031
fmt_variadic              2 ns          2 ns  355192920
test_sprintf             73 ns         72 ns    9571341
test_format              47 ns         47 ns   14373805
test_sprintf_pos         97 ns         97 ns    7242253
test_format_pos          49 ns         49 ns   13841652

Using positional arguments slows down sprintf by 33% and fmt::format by only 4%.

Of course you don’t get all this performance improvement completely for free. The formatting code

int main() {
  fmt::print("{} hello {}! {}\n", 1, "world", 2);
}

compiles to the following binary code:

$ g++ test.cc -O3 -Lfmt -lfmt -fno-stack-protector
$ objdump -d a.out
...
0000000000401210 <main>:
  401210:       48 83 ec 38             sub    $0x38,%rsp
  401214:       be a2 02 00 00          mov    $0x2a2,%esi # <- type info
  401219:       bf c4 3a 41 00          mov    $0x413ac4,%edi
                                          # <- "{} hello {}! {}\n"
  40121e:       48 89 e2                mov    %rsp,%rdx # <- arg array ptr
  401221:       c7 04 24 01 00 00 00    movl   $0x1,(%rsp) 3 <- 1 (1st arg)
  401228:       48 c7 44 24 10 d5 3a    movq   $0x413ad5,0x10(%rsp)
  40122f:       41 00                     # <- "world" (2nd arg)
  401231:       c7 44 24 20 02 00 00    movl   $0x2,0x20(%rsp)
  401238:       00                        # <- 2 (3rd arg)
  401239:       e8 32 0c 00 00          callq  401e70 <_ZN3fmt5printENS_15BasicCStringRefIcEENS_7ArgListE>
  40123e:       31 c0                   xor    %eax,%eax
  401240:       48 83 c4 38             add    $0x38,%rsp
  401244:       c3                      retq

Since the earlier printpp’s objdump output seem to be obtained with GCC on Linux I use the same platform here to get a meaningful comparison. I also passed -fno-stack-protector to gcc because there was no stack protection code in printpp’s case either.

Compared to printf, fmt::print requires 9 extra bytes to place arguments in the array on stack, 3 more bytes to pass the pointer to the argument array, and 3 fewer bytes due to no return value. So the total difference is just 9 bytes and the number of CPU instructions is the same.

The cool thing is that the type information is effectively passed for free (compared to printf) because it takes the same amount of code as passing the flag argument to __printf_chk.

So replacing varargs with variadic templates may give nontrivial performance improvement if you are willing to accept small increase in binary code. If you want to keep your binaries as small as possible, but still want to enjoy somewhat improved type safety compared to printf, check out pprintpp.

Not relying on printf also facilitates extensibility and, in particular, formatting of user-defined types, but that’s a topic for another blog post.

Benchmarks used in this post are available in the format-benchmark repository (see vararg-benchmark.cc).

Last modified on 2016-11-05