A simple guide to atomics in C++
There’s often confusion around when something in computer science is called “atomic”. In most contexts, it just means some process happens in a single, indivisible step. Within C++, however, the definition of atomicity is far more specific. Using the std::atomic classes and types does not, by itself, ensure code is truly atomic: although atomic types are part of the C++ language, atomic operations must be supported by whatever hardware the program runs on. This guide serves as a simple introduction to exactly what’s atomic in C++.
The Types
In C++, the std::atomic<> template class can be used to wrap many other types in order to facilitate atomic operations on them. The template by no means guarantees that any operation will actually be atomic, though. If an atomic operation is not supported by the current CPU, the compiler falls back to a mutex-based implementation. Thankfully, atomic types provide the is_lock_free() member function and, since C++17, the is_always_lock_free constant to check whether the CPU supports lock-free atomic operations for them.
This also points out another important fact about atomics: only operations are atomic, never types or data. Nothing about int is different from std::atomic<int> in terms of the underlying data it represents. Only operations on data can ever be atomic. Additionally, the std::atomic<> types are designed so that only atomic operations are applied to the data they wrap; atomic and non-atomic operations should never be intermixed on the same data.
Loads and Stores
The most basic atomic operations are loads and stores. Although you might think of the phrase “storing a value in a variable”, that isn’t really what happens with an atomic store. First, remember that the purpose of atomics is to allow multiple threads to modify data in a thread-safe manner. For that to be possible, the data must exist in shared memory or cache. An atomic load therefore copies data from shared memory into a register or thread-specific memory, depending on the processor architecture, and an atomic store moves data into shared memory atomically.
To really understand the atomic nature of operations like a load or store, let’s first lay out what atomicity looks like.
Atomicity, a point in time
Take a variable a, which has the value 3 at time t1. If an atomic load were applied to a at t1, the load would retrieve the value 3. However, a load applied at the very next point in time, t2, might retrieve a different value, simply because another operation could have taken place at t1 instead.
Say that variable a is initialized with the value 1, so the representation of a before any operations are applied must be 1. If a store operation and then a load operation are applied to a, there is absolutely no guarantee that the load will retrieve the value written by the store, because the load and store happen at two different points in time. A rule of thumb with atomics: there can be any number of operations between any two points in time.
Now let’s put that together into one code example. Here, two threads each store and then load a value in the same atomic integer. A boolean flag coordinates the two threads so that both start at almost exactly the same time.
If you compile and run this program, you might get a result like the following:
t1 0 - 3
t1 1 - 3
t1 2 - 3
t1 3 - 3
t1 4 - 3
t1 5 - 3
t1 6 - 3
t1 7 - 3
t1 8 - 3
t1 9 - 3
t2 0 - 6
t2 1 - 6
t2 2 - 6
t2 3 - 6
t2 4 - 6
t2 5 - 6
t2 6 - 6
t2 7 - 6
t2 8 - 6
t2 9 - 6
This may seem odd, since it shows that each thread was able to load exactly what it stored. A possible order of operations for this program is that thread t1 loaded the start variable and completed all of its stores and loads, and only then did t2 load the start variable and complete its own. To test that, we can add some “spin” between the store and the load. This amounts to extra work a thread does outside the shared memory space, like for (int k = 0; k < 1000; ++k);. If we do that and run the program again, variations between what each thread loaded begin to appear.
Exchanges
An exchange, also referred to as a swap, is an atomic operation that replaces the value in an atomic variable and returns the value that was previously present. An exchange functions as a store and then a load in a single atomic operation. It does not, however, check the value of the atomic variable prior to the operation; in that sense, an exchange is unconditional. Exchanges do not provide an atomic solution to a load followed by a store.
The exchange operation is also cache coherent: after an exchange, any subsequent operation reflects the value the exchange wrote to the variable. To illustrate this, we can use an example with a single worker thread.
In the above, the worker thread is given the large task of exchanging a value 100,000 times. During that time, the main function changes the value stored in foobar to 14. Every call after the main function’s exchange now returns 14.
Fetch Operations
Fetch operations, such as fetch-and-add or fetch-and-subtract, apply some operation to an atomic variable and fetch the value stored before the operation was applied. Fetch operations work similarly to exchanges, in the sense that an atomic exchange just writes a value and “fetches” the previous one. There are several types of fetch operations, of which the following are supported in C++:
fetch_add
fetch_sub
fetch_and
fetch_or
fetch_xor
One limitation of the atomic fetch operations is that they only allow the equivalent of a postfix operation, such as x++;. Fetch operations only return the value prior to the operation, never the value after. Returning the value after the operation would require an additional atomic load, and the two operations together would no longer be atomic with respect to the value of the variable. The example below implements a counter based on the fetch_add and fetch_sub operations.
If compiled and run, it prints
0
1
2
1
The first fetch_sub returns 2 because, like fetch_add, it returns the value stored before the operation, and after two fetch_add calls the counter holds 2. The next fetch_sub returns 1, indicating the previous call did subtract 1 after fetching the old value.
You might notice that typical arithmetic operations like multiplication and division don’t have fetch equivalents. The reason relates back to the fundamental idea behind atomics: they need hardware support, and mainstream hardware does not support atomic multiplication or division. Best-effort approaches do exist that implement those in software. One example is to load an integer atomically, multiply it, then store it back atomically, such as
std::atomic<int> a(3);
int b = a.load() * 3;
a.store(b);
But there’s a problem with this. Between the second and third statements above, it is possible that another thread changes the value of a, so the goal of “multiplying a by 3” fails: b is the product of a previous value of a and 3. Simply storing b in a might then leave the data in an inconsistent state. Nothing in this scenario is dangerous in the sense of a memory leak or a segmentation fault. Yet in a multithreaded program, each thread needs a consistent view of the data to successfully complete its respective tasks. This requires the concept of “failure” in an atomic operation.
Compare Exchange
The compare exchange, also called compare-and-swap (CAS), is the most powerful atomic operation available in C++. It atomically compares the current value of an atomic variable against an expected value and, if they match, stores a desired value. Despite being an atomic operation, a compare exchange can certainly fail, if another thread changes the value of the variable between the time the compare exchange reads it and before it writes. A compare exchange is otherwise known as a read-modify-write (RMW) operation.
Processors implement compare exchange in different ways. On some strongly ordered processors, like x86, compare exchange is a single assembly instruction. This means a compare exchange only fails if another thread truly changed the value of the atomic variable before the operation completed. On weakly ordered processors, compare exchange is implemented with two assembly instructions, load-linked and store-conditional (LL/SC). Because it uses two instructions, LL/SC can fail spuriously, for example when a thread is context switched between the load and the store.
In C++, there are two compare exchange functions, compare_exchange_weak and compare_exchange_strong. The weak version may fail spuriously but is better suited to situations where you call the operation in a loop, which commonly arises when implementing lock-free data structures. As an example, let’s look at the simplest lock-free data structure, a stack.
In the above, the stack has two methods, push and pop. Each satisfies the main criterion of lock freedom: some thread always makes progress and completes its push or pop task. If many threads call compare_exchange_weak at once, only one of them will succeed. All other threads experience a failed compare_exchange_weak, meaning their expected value is reloaded with the value currently stored in the variable. The loop lets the compare exchange operation essentially “retry on the last known value” of the atomic variable.
A stack has only one point from which data can be added or removed. Thus, any operation depends on the value at that point. A push can only succeed, in the atomic sense, if a compare_exchange_weak call succeeds on the last known top node of the stack. The same goes for a pop.