
When working in a large C++ codebase, compile times become a major worry. One reason for slow compile times is transitive header dependencies. Every header file should include what it needs. The preprocessor then collects all those files and builds a huge source file (the compilation unit) that is fed to the actual compiler. This creates lots of disk I/O and a ton of C++ code which must be parsed by the compiler frontend, for which code will be generated, and most of which will finally be eliminated again by the linker.

One way to reduce the includes of a header is the PIMPL idiom. Instead of putting the implementation details of a class in the header, you move them into a source file (compilation unit).

Let's look at a simple example class that wraps stdio.h file handling:

/** \file File.h */

#include <stdio.h>
#include <string>

class File {
public:
  File();
  ~File();

  bool open(std::string path, std::string openmode);
  void close();

private:
  std::string m_path;
  std::string m_openmode;
  FILE *m_handle;
};

The open(...) and close() methods would be implemented in the .cpp file something like this:

/** \file File.cpp */

#include <File.h>

File::File() :
  m_handle(NULL)
{}

bool File::open(std::string path, std::string openmode) {
  if(m_handle != NULL) {
    close();
  }

  FILE* f = fopen(path.c_str(), openmode.c_str());

  if(f != NULL) {
    m_handle = f;
    m_path = path;
    m_openmode = openmode;
    return true;
  }
  return false;
}

void File::close() {
  if(m_handle != NULL) {
    fclose(m_handle);
    m_handle = NULL;
  }
}

This pulls in the stdio.h header, and everything that includes your beautiful new File class header will now get its global namespace polluted with C symbols. In this specific case a simple forward declaration of the FILE struct could help, but forward declarations are not always possible or a good solution.

A common pattern to completely hide the implementation of a class is to use the PIMPL ("private implementation" or "pointer to implementation") idiom. When using the PIMPL idiom, you put a shallow wrapper class in the header file. It has all the methods of the real class, but only one member: a pointer to a forward-declared class. There are variations around the exact implementation technique. I like the following C++11 std::unique_ptr based version the most. For brevity, I will omit the required copy- and move-constructor and operator= signatures. Since a PIMPL class essentially becomes a container for a single object, you have to add these - see also the Rule of Three/Five. As a side effect, this also makes the class cheaply movable, giving you fast and exception-safe move semantics.

/** \file File.h */

#include <string>
#include <memory>

class File {
public:
  File();
  ~File();
  //copy- and move- ctors/operator= methods here

  bool open(std::string path, std::string openmode);

private:
  class FilePIMPL;
  std::unique_ptr<FilePIMPL> m_impl;
};

The complete implementation will be moved into a compilation unit.

/** \file File.cpp */

#include <File.h>
#include <stdio.h>

class File::FilePIMPL {
public:
  FilePIMPL() :
    m_handle(NULL)
  {}

  bool open(std::string path, std::string openmode) {
    //... implementation here
  }

private:
  std::string m_path;
  std::string m_openmode;
  FILE *m_handle;
};

File::File() :
  m_impl(new FilePIMPL())
{}

File::~File() {
}

bool File::open(std::string path, std::string openmode) {
  return m_impl->open(path, openmode);
}

This solved the problem of hiding the complete implementation but introduced a new one: an additional allocation (new FilePIMPL()) of the PIMPL class on the heap in the default constructor File::File(). Sometimes this is not a big deal, especially when dealing with plumbing code or business logic. But allocations are not free. They take precious clock cycles, may grab a lock on the heap and thus limit parallelization, or can fail and throw exceptions. When writing performance sensitive code, an additional allocation may be prohibitively expensive. But there is a solution in one of the more dangerous but magically enabling features of C++: placement new. Instead of constructing the PIMPL class on the heap with the regular new operator, we use the placement new operator and create the object in an embedded buffer inside the File class. There are two important details here: The buffer where our implementation class will be created must be

  1. big enough to hold the object, and
  2. properly aligned.

Since the compiler cannot see the implementation, it can know neither the size nor the proper alignment, so we must choose both manually.

/** \file File.h */

#include <string>
#include <cstddef>
#include <type_traits>

class File {
public:
  File();
  ~File();

  bool open(std::string path, std::string openmode);

private:
  static constexpr std::size_t c_impl_size = 128;
  static constexpr std::size_t c_impl_alignment = std::alignment_of<std::max_align_t>::value;
  typedef std::aligned_storage<c_impl_size, c_impl_alignment>::type aligned_storage_type;
  aligned_storage_type m_impl;
};

To make working with placement new a little more pleasant, I will add a few templated wrapper functions:

/** \file placement_new.h */

#include <cstddef>
#include <new>

///create an object of type T at a given address
template<typename T>
void placement_new(void *buffer, std::size_t buffer_size) {
  new(buffer) T();
}

///cast a given address to a pointer to type T
template<typename T>
T* placement_cast(void *buffer) {
  return reinterpret_cast<T*>(buffer);
}

///call the destructor of type T at a given address
template<typename T>
void placement_delete(void *buffer) {
  placement_cast<T>(buffer)->~T();
}

We will now use the placement_* functions to construct, access and destroy the implementation object inside the buffer:

/** \file File.cpp */

#include <File.h>
#include <stdio.h>
#include <placement_new.h>

class FilePIMPL {
  //full implementation in this class
};


File::File() {
  //construct a FilePIMPL instance inside m_impl
  placement_new<FilePIMPL>(&m_impl, sizeof(m_impl));
}

File::~File() {
  //destruct the FilePIMPL
  placement_delete<FilePIMPL>(&m_impl);
}

bool File::open(std::string path, std::string openmode) {
  return placement_cast<FilePIMPL>(&m_impl)->open(path, openmode);
}

These are the basic mechanics of what I call the "Static PIMPL" idiom or pattern. I have not seen it used or named like this anywhere else, so I declare myself the inventor. But I'm probably wrong and it's already been used in a lot of places, maybe even inside STL implementations. If it has another name, please drop me a line.

(Update: Turns out Herb Sutter actually posted something like this as GotW #028 and calls it "Fast PIMPL", and if you google for Fast PIMPL, you will find multiple other authors describing more or less exactly what I'm outlining here.)

As said above, the compiler cannot know the size and alignment requirements, so we must choose them manually. For the alignment, you can choose the platform's maximum alignment: std::alignment_of<std::max_align_t>::value. This is safe, but may cost a few bytes through over-alignment. For the size, you can measure the size of the implementation classes on the platforms you support and write some preprocessor logic to define a constant per platform. What also tends to work reasonably well is to measure the size on one platform, then divide it by sizeof(void*) and express the size constant as a multiple of sizeof(void*). When size and/or alignment are wrong, you are doomed and your code will fail in mysterious ways. Effects range from slight performance penalties up to broken atomics (and thus invisible race conditions). Messing with alignment and getting it wrong is also a common source of CPU exceptions when the compiler generates vectorized code. And if it doesn't happen today, it can happen any time in the future with a new compiler. So it is essential to guarantee size and alignment.
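For illustration, such a pointer-size-relative constant might look like this (the factor 16 is a made-up example value, you have to measure it for your implementation class):

//size expressed as a multiple of the pointer size,
//the factor 16 is just an illustrative value measured on one platform
static constexpr std::size_t c_impl_size = 16 * sizeof(void*);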

To guarantee that, you have to add a few asserts and static_asserts.

/** \file placement_new.h */

#include <cstddef>
#include <cassert>
#include <memory>
#include <type_traits>
#include <new>

template<typename T>
void placement_new(void *buffer, std::size_t buffer_size) {
  //check the size of the buffer (at runtime, in debug builds)
  assert(sizeof(T) <= buffer_size);
  //check the alignment of the buffer (at runtime, in debug builds)
  assert(std::align(std::alignment_of<T>::value, sizeof(T), buffer, buffer_size) == buffer);
  new(buffer) T();
}

/** \file File.cpp */
File::File() {

  static_assert(sizeof(m_impl) >= sizeof(FilePIMPL),
    "buffer not big enough to hold FilePIMPL");

  static_assert(
    std::alignment_of<aligned_storage_type>::value
    >=
    std::alignment_of<FilePIMPL>::value,
    "alignment requirements of FilePIMPL not fulfilled");

  //construct a FilePIMPL instance inside m_impl
  placement_new<FilePIMPL>(&m_impl, sizeof(m_impl));
}

This is not a technique to use lightly. Only use it when you really have performance requirements that trump maintainability concerns. If your implementation is big or changes often, a classic PIMPL might be more appropriate, as adjusting the buffer sizes will become a tedious activity. I used it successfully to implement Mutex classes that are used throughout the codebase and are implemented using each platform's native facilities (Win32 vs. pthread).
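To give a rough idea of what that can look like, here is a sketch of a pthread-backed Mutex in the same style (the buffer size is a made-up value and the names are illustrative, this is not my production code):

/** \file Mutex.h */

#include <cstddef>
#include <type_traits>

class Mutex {
public:
  Mutex();
  ~Mutex();

  void lock();
  void unlock();

private:
  //size is an illustrative guess, measure it on your target platforms
  static constexpr std::size_t c_impl_size = 8 * sizeof(void*);
  static constexpr std::size_t c_impl_alignment = std::alignment_of<std::max_align_t>::value;
  std::aligned_storage<c_impl_size, c_impl_alignment>::type m_impl;
};

/** \file Mutex.cpp */

#include <Mutex.h>
#include <placement_new.h>
#include <pthread.h>

class MutexImpl {
public:
  MutexImpl()  { pthread_mutex_init(&m_mutex, NULL); }
  ~MutexImpl() { pthread_mutex_destroy(&m_mutex); }
  void lock()   { pthread_mutex_lock(&m_mutex); }
  void unlock() { pthread_mutex_unlock(&m_mutex); }
private:
  pthread_mutex_t m_mutex;
};

Mutex::Mutex() {
  static_assert(sizeof(m_impl) >= sizeof(MutexImpl),
    "buffer not big enough to hold MutexImpl");
  placement_new<MutexImpl>(&m_impl, sizeof(m_impl));
}

Mutex::~Mutex() {
  placement_delete<MutexImpl>(&m_impl);
}

void Mutex::lock()   { placement_cast<MutexImpl>(&m_impl)->lock(); }
void Mutex::unlock() { placement_cast<MutexImpl>(&m_impl)->unlock(); }

The Win32 version would look the same from the outside, with a CRITICAL_SECTION inside MutexImpl instead of the pthread_mutex_t.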

Hash functions are an essential part of computer science. They map arbitrary-length inputs to fixed-length outputs. There are general purpose and cryptographic hash functions. Cryptographic hash functions are one-way functions that provide certain guarantees on the complexity of finding collisions. Non-cryptographic or general purpose hash functions do not provide such guarantees and are used for implementing hashtables. Their output should be evenly distributed and look random, so that few collisions occur.

One of those functions is called DJBX33A. It is used in the hash table implementation of Python (although as of 3.4 only for very short inputs) and many other products. It's a very simple algorithm:

hash_i+1 = hash_i * 33 + c_i
hash_0 = 5381

Where c_i is the current input byte and all operations are on 32bit signed integers. This is so simple, effective and fast it almost makes you cry. Even a trivial implementation with a modest compiler will probably produce decent performance.
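A minimal scalar implementation might look like this (my own sketch, not the code from the hashx4 repository; it uses unsigned 32bit arithmetic to avoid undefined behavior from signed overflow):

#include <stdint.h>
#include <stddef.h>

//minimal scalar DJBX33A sketch
uint32_t djbx33a(const uint8_t *data, size_t size) {
  uint32_t hash = 5381;
  for(size_t i = 0; i < size; ++i) {
    hash = hash * 33 + data[i];
  }
  return hash;
}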

It has only one flaw: It is hard to vectorize on MMX/SSE/Neon/... . This is due to the iterative nature where each round depends on the output of the round before. The auto-vectorization engines of modern C/C++ compilers can do many magical things but this is still out of reach.

So what can be done? There is a simple mechanism that can be used here: just run this algorithm four times and distribute the input bytes round-robin over the four hash states. This effectively splits the input byte stream into four separate streams that can be hashed in parallel with vector instructions. The output will be 128 bit, which can then be hashed down into a 64 or 32 bit output using regular DJBX33A.

I hereby call this algorithm X4DJBX33A.
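In scalar form, the idea looks roughly like this (again my own illustration, not the code from the hashx4 repository):

#include <stdint.h>
#include <stddef.h>

//scalar sketch of the four-lane idea: input byte i goes into lane i % 4
uint32_t x4djbx33a(const uint8_t *data, size_t size) {
  uint32_t lane[4] = {5381, 5381, 5381, 5381};

  size_t i = 0;
  for(; i + 4 <= size; i += 4) {
    lane[0] = lane[0] * 33 + data[i + 0];
    lane[1] = lane[1] * 33 + data[i + 1];
    lane[2] = lane[2] * 33 + data[i + 2];
    lane[3] = lane[3] * 33 + data[i + 3];
  }
  for(size_t l = 0; i < size; ++i, ++l) {
    lane[l] = lane[l] * 33 + data[i];
  }

  //hash the 128bit state (four 32bit lanes) down to 32bit with regular DJBX33A
  const uint8_t *state = reinterpret_cast<const uint8_t*>(lane);
  uint32_t hash = 5381;
  for(size_t k = 0; k < sizeof(lane); ++k) {
    hash = hash * 33 + state[k];
  }
  return hash;
}

The four independent lane updates in the first loop are exactly what a 128bit vector register can do in one go.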

I had some time during the last few days and implemented X4DJBX33A using Intel intrinsics in several instruction sets and variants. Here are some benchmark results using an Intel Core i7-4960HQ 2.6GHz CPU with gcc-4.8 on linux/amd64:

  • DJBX33A reference C: 1198 MiB/s
  • DJBX33A optimized C: 1614 MiB/s
  • X4DJBX33A reference C: 1039 MiB/s
  • X4DJBX33A optimized C: 3119 MiB/s
  • X4DJBX33A mmx vectorized: 3484 MiB/s
  • X4DJBX33A sse2 vectorized: 6196 MiB/s
  • X4DJBX33A ssse3 vectorized: 6658 MiB/s

Not bad, huh?

I published the code and the benchmarks on github: cleeus/hashx4. I have also started looking into other hash functions that are used in important hashtable implementations (namely siphash24, which is used in Python 3.4 and later).

Looking at the benchmark results, there are also some important lessons to be learned:

  • If it's not an embarrassingly parallel algorithm, make it one (and then vectorize).

The DJBX33A algorithm can be implemented very efficiently, but it is hard to vectorize. The modified version, which I call X4DJBX33A, seems to be well suited for running on 128bit or 64bit wide vector registers. When it is not vectorized, it is slower than a well-implemented vanilla DJBX33A algorithm.

  • The SSSE3 version is an experiment.

SSSE3 has a number of new integer intrinsics, among them the mighty _mm_shuffle_epi8, which can reorder all 16 bytes in a 128bit xmm register into an arbitrary permutation. Using this opcode leads to an alternative implementation that doesn't need the two _mm_unpack instructions. On some CPUs this seems to be faster, on most slower CPUs it is not.

  • Opcode scheduling is important, so use intrinsics.

Benchmarks with -O0 and -O1 builds have shown that even the MMX and SSE implementations get noticeably slower when not optimized. A look at the disassembled binaries of the -O2 and -O3 optimized builds shows that the compiler reorders instructions. It probably does this to enhance the instruction level parallelism and provide the CPU with useful instructions while it is still waiting for some RAM I/O to complete. Using intrinsics instead of raw assembler code allows the developer to leverage the wisdom of the compiler.

  • The C implementation performance varies widely with the compiler.

MSVC and GCC seem to produce very different code from the C implementations. This is not surprising as the research on auto-vectorization and codegen is still ongoing. The SSE2 version seems to be much more stable across compilers and platforms.

  • Know your instruction set.

I struggled quite a bit with the SSE2 instruction set and initially failed to produce a vectorized version that was faster than the scalar one. This was due to insufficient knowledge of the instruction set. In the end learning all the intrinsics and their performance characteristics is what enables a developer to find a good solution.

  • Alignment matters.

Besides the reference C implementations, I produced optimized (but still C-based) versions of DJBX33A and X4DJBX33A. A major optimization was to hash the input with a simple implementation until a 16 byte alignment boundary in the input memory block was reached. Then the compiler gets a hint that it may assume the pointer is aligned to a 16 byte boundary. After the hint, an outer loop iterates an inner loop that hashes 16 byte chunks, which preserves the alignment assumption. This assumption allows the compiler to use opcodes that rely on alignment and enables auto-vectorization (see the sketch after this list).

  • SSE2 is everywhere. Use it.

If you are on a 64bit X86 processor, you are guaranteed to have SSE2. On 32bit X86, every processor sold in the last 10 years has SSE2. From an economic point of view you can probably ignore non-SSE2 x86 CPUs or just provide one C implementation.
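Here is a scalar sketch of the alignment-hint pattern mentioned in the "Alignment matters" item above, using GCC's __builtin_assume_aligned (the actual loop structure in hashx4 differs):

#include <stdint.h>
#include <stddef.h>

//scalar sketch of the alignment-hint pattern
uint32_t djbx33a_aligned(const uint8_t *data, size_t size) {
  uint32_t hash = 5381;
  size_t i = 0;

  //hash byte by byte until a 16 byte alignment boundary is reached
  while(i < size && (reinterpret_cast<uintptr_t>(data + i) % 16) != 0) {
    hash = hash * 33 + data[i];
    ++i;
  }

  //hint to the compiler that data+i is 16 byte aligned (gcc/clang builtin)
  const uint8_t *aligned = static_cast<const uint8_t*>(
    __builtin_assume_aligned(data + i, 16));

  //outer loop walks over 16 byte chunks, inner loop hashes one chunk;
  //this keeps the alignment assumption intact for the auto-vectorizer
  const size_t chunks = (size - i) / 16;
  for(size_t c = 0; c < chunks; ++c) {
    for(size_t k = 0; k < 16; ++k) {
      hash = hash * 33 + aligned[c * 16 + k];
    }
  }
  i += chunks * 16;

  //hash the remaining tail bytes
  for(; i < size; ++i) {
    hash = hash * 33 + data[i];
  }
  return hash;
}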

Hey kids, come here - grampa is telling a story from the war... ;)

I'm currently hunting a bunch of race conditions in a rather large C/C++ code base. Some of them manifest quite visibly as heap memory corruptions (the reason I'm searching), others are subtle and only show up as one-in-a-thousand slightly off output values.

Now ten years ago, when you were facing such bugs, probably all you could do was to read all the code, verify the locking structure and look out for raciness. But things have changed substantially. We have valgrind.

Valgrind is a binary code instrumentation framework and a collection of tools on top of that. It will take your binary and instrument loads, stores and a bunch of important C library functions (memory management, thread synchronization, ...). It can then perform various checks on your running code. So it belongs in the category of dynamic analysis tools.

Luckily the buggy code also runs on linux where valgrind is available so I went on to try it. And boy did it find subtle bugs...

There are two different tools available for race conditions: helgrind and drd. I'm not yet 100% sure how each of them works, but as far as I understand, they track memory loads and stores together with synchronization primitives and build a runs-before graph of code segments. When code segments access shared memory in a conflicting way (the reader/writer problem) without a strong runs-before relationship, these are race conditions and will be reported.
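As a toy illustration (my own example, not from the code base in question), here is a program with an obvious data race and the commands to check it:

/** \file racy.cpp */

#include <thread>

int counter = 0; //shared and unsynchronized on purpose

int main() {
  //two threads increment the same variable without a lock: a data race
  std::thread t1([]{ for(int i = 0; i < 100000; ++i) { ++counter; } });
  std::thread t2([]{ for(int i = 0; i < 100000; ++i) { ++counter; } });
  t1.join();
  t2.join();
  return 0;
}

g++ -g -O1 -pthread -std=c++11 racy.cpp -o racy
valgrind --tool=helgrind ./racy
valgrind --tool=drd ./racy

Both tools should report the conflicting, unsynchronized accesses to counter.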

Sadly, I'm also finding real races in well-known open source libraries that are supposed to be thread-safe and work correctly ... it seems they do not. This tells me that the developers are not using all the tools available to check their code.

And this is my message: if you are writing an open source library or application that has even remotely something to do with threads, save yourself and your users a lot of trouble and use valgrind!

When your security measures go too far, your users will work around it.

physical security fail

(seen at a train station in Berlin)

My guess here is that the person who planned the locking system just forgot the break room of the train drivers, or did not update the locking system when the room was moved behind this door.

For various reasons I've been studying all kinds of authentication protocols lately. A good authentication protocol should, for my purposes, fulfill the following properties:

  1. It does not disclose the secret to an eavesdropper over the wire.
  2. It does not require the server to store the secret.
  3. It is reasonably simple to implement.

As always, it seems to be a case of "choose two".

A little surprise for me was CRAM-MD5. Of course MD5 should not be used anymore, but CRAM-MD5 can trivially be changed into a CRAM-SHA256 just by replacing the underlying hash function. So let's keep the weakness of the hash function out of the discussion.

The point is, CRAM-MD5 fulfills 1. and 3. but NOT 2. This is often not immediately obvious, and users and server administrators might not be aware of it. E.g. when you look at the native user database of a CRAM-MD5 enabled dovecot server, you will see something like this:

username:{CRAM-MD5}652f707b0195b2730073e116652e22f20125ec6413a957773b65ebe33d7b3ad0:1001:1001::/home/username

This looks like a hash of the password you use to authenticate to the dovecot server. But CRAM-MD5 is more or less just an alias for a classic challenge-response protocol based on HMAC-MD5. In a classic HMAC-based authentication protocol, the server sends the client a random nonce value and the client is supposed to respond with HMAC(secret, nonce). The server must then also calculate HMAC(secret, nonce) on its side and compare the client's response with the expected result. If they match, it knows that the client also knew the secret.
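A sketch of the server side (my own example, using OpenSSL's HMAC() with SHA-256 instead of MD5; a real implementation should also use a constant-time comparison):

#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <algorithm>
#include <string>
#include <vector>

//the server recomputes HMAC(secret, nonce) and compares it with the response
bool verify_response(const std::string &secret, const std::string &nonce,
                     const std::vector<unsigned char> &client_response) {
  unsigned char expected[EVP_MAX_MD_SIZE];
  unsigned int expected_len = 0;

  //this is the point: the server needs the secret to do this
  HMAC(EVP_sha256(),
       secret.data(), static_cast<int>(secret.size()),
       reinterpret_cast<const unsigned char*>(nonce.data()), nonce.size(),
       expected, &expected_len);

  return client_response.size() == expected_len &&
         std::equal(expected, expected + expected_len, client_response.begin());
}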

As you can already see, the server MUST know the secret (unless there is some magical trick somewhere). So I looked into the relevant RFC-2195 for the CRAM protocol and RFC-2104 for the HMAC details. And deep in there, in section "4. Implementation Note" of RFC-2104, you will find the answer.

Let's look at it in detail. First we need the definition of the HMAC function.

HMAC(secret, nonce) =
  H( secret ^ opad,
    H( secret ^ ipad, nonce )
  )

Where H(m1, m2, ...) is a cryptographic hash function applied to its concatenated arguments, opad and ipad are constant padding blocks (the secret is first padded to the block size of the hash function) and ^ is a bitwise XOR. Now most hash functions can also be implemented with an additional initialization vector (IV) argument which contains the hash function's last internal state. With such an IV, a previous hash calculation can be continued at the point where it was stopped. If you only store the IVs for H( secret ^ opad ) and H( secret ^ ipad ), you can still calculate the complete HMAC while not storing the secret in plain.

HMAC(secret, nonce) =
  H( IV_opad,
    H( IV_ipad, nonce )
  )

Turns out this is exactly what dovecot and probably any other sane CRAM-MD5 implementation does and has to do.

While this protects the original secret (your password) to some degree, the stored IVs are exactly as powerful as the secret: if an attacker manages to get the stored CRAM-MD5 IVs, he can use them to log in to any other account that supports CRAM-MD5 and uses the same password.

That means, unlike crypt or bcrypt hashed password databases, a database with CRAM-MD5 IVs must be kept secret from any non-authorized user.

So while CRAM-MD5 challenge-response authentication looks like it is more secure than plaintext authentication, that actually depends on the attacker model. You might be better off with just plaintext authentication over SSL (and a properly bcrypt-hashed password database).