A few days ago I learned about a cool new project called
printpp
which is a printf
format string
preprocessor. It rewrites format strings from brace-delimited format into
printf
format at compile time. Although the project mentions Python
formatting, the only thing they seem to have in common is delimiters.
Great thing about this approach is that it doesn’t add any overhead to printf
.
For example, the code
int main() {
pprintf("{} hello {s}! {}\n", 1, "world", 2);
}
where pprintf
is a macro defined by pprintpp
, compiles to
0000000000400450 <main>:
400450: 48 83 ec 08 sub $0x8,%rsp
400454: 41 b8 02 00 00 00 mov $0x2,%r8d
40045a: b9 04 06 40 00 mov $0x400604,%ecx
40045f: ba 01 00 00 00 mov $0x1,%edx
400464: be 10 06 40 00 mov $0x400610,%esi
400469: bf 01 00 00 00 mov $0x1,%edi
40046e: 31 c0 xor %eax,%eax
400470: e8 bb ff ff ff callq 400430 <__printf_chk@plt>
400475: 31 c0 xor %eax,%eax
400477: 48 83 c4 08 add $0x8,%rsp
40047b: c3 retq
This is very nice in terms of binary code size, but can we do better in terms
of performance if we don’t limit ourselves to printf
? Notice that the compiler
generated a call to
__printf_chk
.
Let’s have a look at this function:
/* Write formatted output to stdout from the format string FORMAT. */
int
___printf_chk (int flag, const char *format, ...)
{
va_list ap;
int done;
_IO_acquire_lock_clear_flags2 (stdout);
if (flag > 0)
stdout->_flags2 |= _IO_FLAGS2_FORTIFY;
va_start (ap, format);
done = vfprintf (stdout, format, ap);
va_end (ap);
if (flag > 0)
stdout->_flags2 &= ~_IO_FLAGS2_FORTIFY;
_IO_release_lock (stdout);
return done;
}
Unsurprisingly this function does little but call vfprintf
. Calling an
overload that takes va_list
is common for functions that use varargs. Let’s
use Google Benchmark to measure the
effect of this extra call and vararg processing:
int __attribute__((noinline)) test_vprintf(const char *f, std::va_list) {
benchmark::DoNotOptimize(f);
return 0;
}
int test_printf(const char *format, ...) {
std::va_list args;
va_start(args, format);
int result = test_vprintf(format, args);
va_end(args);
return result;
}
void varargs(benchmark::State& state) {
while (state.KeepRunning())
test_printf("%d", 42);
}
BENCHMARK(varargs);
and let’s compare this to a method of passing formatting arguments via variadic templates implemented in the fmt library:
void __attribute__((noinline)) test_vprint(const char *f,
fmt::format_args) {
benchmark::DoNotOptimize(f);
}
typedef fmt::BasicFormatter<char> Formatter;
template <typename ... Args>
inline void test_print(const char *format, const Args & ... args) {
test_vprint(format, fmt::make_format_args<Formatter>(args...));
}
void fmt_variadic(benchmark::State &state) {
while (state.KeepRunning())
test_print("{}", 42);
}
BENCHMARK(fmt_variadic);
Note that I’m using the experimental std
branch of the fmt library for
cleaner C++11 API, but the results should be the same for the stock version
which is C++98 compatible.
The test_print
function calls make_format_args
to place formatting arguments in an fmt::format_args
object that is passed to
test_vprint
. This looks very similar to the printf
case but because
test_print
and fmt::make_format_args
are inline, compiler optimizes them away
and the effect is that the arguments are placed into fmt::format_args
at the caller site. And since fmt::format_args
is not parameterized on
format argument types, there is no template code bloat.
Running this benchmark on my Early 2015 MacBook Pro gives the following result:
Run on (4 X 3100 MHz CPU s)
2016-11-05 17:32:39
Benchmark Time CPU Iterations
-------------------------------------------------------
varargs 5 ns 5 ns 133920031
fmt_variadic 2 ns 2 ns 355192920
The fmt_variadic
method that uses variadic templates is 2.5 times faster but
the absolute difference is just 3ns per call. Does it really matter? To answer
this question, let’s compare it to the total formatting time:
void test_sprintf(benchmark::State &state) {
char buffer[64];
while (state.KeepRunning())
std::sprintf(buffer, "%d", 42);
}
BENCHMARK(test_sprintf);
void test_format(benchmark::State &state) {
while (state.KeepRunning())
fmt::format("{}", 42);
}
BENCHMARK(test_format);
Benchmark Time CPU Iterations
-------------------------------------------------------
varargs 5 ns 5 ns 133920031
fmt_variadic 2 ns 2 ns 355192920
test_sprintf 73 ns 72 ns 9571341
test_format 47 ns 47 ns 14373805
As you can see formatting is so fast that even varargs overhead is nontrivial.
Using varargs would slow down fmt::format
by more than 6% and this is not
even the most efficient API fmt provides, because it returns std::string
.
The situation becomes even more interesting when you take into account
positional arguments. The arguments in va_list
can only be accessed
sequentially and need to be copied by a printf
-like function to an array for
efficient access by index.
With variadic templates nothing prevents storing arguments in an array on the caller side and this is exactly what fmt does. The effect is easy to see on this benchmark:
void test_sprintf_pos(benchmark::State &state) {
char buffer[64];
while (state.KeepRunning())
std::sprintf(buffer, "%1$d", 42);
}
BENCHMARK(test_sprintf_pos);
void test_format_pos(benchmark::State &state) {
while (state.KeepRunning())
fmt::format("{0}", 42);
}
BENCHMARK(test_format_pos);
Benchmark Time CPU Iterations
-------------------------------------------------------
varargs 5 ns 5 ns 133920031
fmt_variadic 2 ns 2 ns 355192920
test_sprintf 73 ns 72 ns 9571341
test_format 47 ns 47 ns 14373805
test_sprintf_pos 97 ns 97 ns 7242253
test_format_pos 49 ns 49 ns 13841652
Using positional arguments slows down sprintf
by 33% and fmt::format
by
only 4%.
Of course you don’t get all this performance improvement completely for free. The formatting code
int main() {
fmt::print("{} hello {}! {}\n", 1, "world", 2);
}
compiles to the following binary code:
$ g++ test.cc -O3 -Lfmt -lfmt -fno-stack-protector
$ objdump -d a.out
...
0000000000401210 <main>:
401210: 48 83 ec 38 sub $0x38,%rsp
401214: be a2 02 00 00 mov $0x2a2,%esi # <- type info
401219: bf c4 3a 41 00 mov $0x413ac4,%edi
# <- "{} hello {}! {}\n"
40121e: 48 89 e2 mov %rsp,%rdx # <- arg array ptr
401221: c7 04 24 01 00 00 00 movl $0x1,(%rsp) 3 <- 1 (1st arg)
401228: 48 c7 44 24 10 d5 3a movq $0x413ad5,0x10(%rsp)
40122f: 41 00 # <- "world" (2nd arg)
401231: c7 44 24 20 02 00 00 movl $0x2,0x20(%rsp)
401238: 00 # <- 2 (3rd arg)
401239: e8 32 0c 00 00 callq 401e70 <_ZN3fmt5printENS_15BasicCStringRefIcEENS_7ArgListE>
40123e: 31 c0 xor %eax,%eax
401240: 48 83 c4 38 add $0x38,%rsp
401244: c3 retq
Since the earlier printpp
’s objdump
output seem to be obtained with GCC on
Linux I use the same platform here to get a meaningful comparison. I also
passed -fno-stack-protector
to gcc because there was no stack protection code
in printpp
’s case either.
Compared to printf
, fmt::print
requires 9 extra bytes to place arguments
in the array on stack, 3 more bytes to pass the pointer to the argument array,
and 3 fewer bytes due to no return value. So the total difference is just 9
bytes and the number of CPU instructions is the same.
The cool thing is that the type information is effectively passed for free
(compared to printf
) because it takes the same amount of code as passing the
flag
argument to __printf_chk
.
So replacing varargs with variadic templates may give nontrivial performance
improvement if you are willing to accept small increase in binary code.
If you want to keep your binaries as small as possible, but still want to enjoy
somewhat improved type safety compared to printf
, check out
pprintpp.
Not relying on printf
also facilitates extensibility and, in particular,
formatting of user-defined types, but that’s a topic for another blog post.
Benchmarks used in this post are available in the format-benchmark repository
(see vararg-benchmark.cc
).
Last modified on 2016-11-05