How fast is Atomic compared to mutex?


Test Goal

  • How fast is Atomic compared to mutex?

  • When using a mutex, is the overhead of lock_guard significant?

  • When using an atomic, are there any meaningful benefits to memory_order_relaxed?

Setup

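// Note: mtx (a std::mutex) and critical_data (an int) are globals;
// both are defined in the full listing at the end of this post.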
void workerWithLock(int work_count, int work_size) {
  thread_local mt19937 gen(random_device{}());
  thread_local normal_distribution<float> nd(0, 10);

  for (int i = 0; i < work_count; i++) {
    int work = 0;
    for (int j = 0; j < work_size; j++)
      work += static_cast<int>(nd(gen));

    mtx.lock();
    critical_data += work;
    mtx.unlock();
  }
}

As shown above, the RNG lives in thread_local storage, and each worker takes two parameters: work_count and work_size. The four test conditions, whose critical sections are shown after the list, were:

  • mutex with manual lock() & unlock()

  • mutex managed by lock_guard

  • atomic fetch_add with the default memory_order_seq_cst

  • atomic fetch_add with memory_order_relaxed
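Concretely, the critical section of each variant looks like this (extracted from the full listing at the end of the post):

// 1. mutex, manual lock/unlock
mtx.lock();
critical_data += work;
mtx.unlock();

// 2. mutex managed by lock_guard (RAII)
{
  lock_guard<mutex> lock(mtx);
  critical_data += work;
}

// 3. atomic add, default (memory_order_seq_cst) ordering
atomic_data.fetch_add(work);

// 4. atomic add, relaxed ordering
atomic_data.fetch_add(work, memory_order_relaxed);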

// test code
void test(int work_count, int work_size) {
  cout << "work count : " << work_count
       << " / work size : " << work_size << endl;

  Stopwatch t1;
  thread thread1(workerWithLock, work_count, work_size);
  thread thread2(workerWithLock, work_count, work_size);
  thread1.join();
  thread2.join();
  cout << "    with mutex : " << t1.toc() << "ms" << endl;
  ...

For each test, I created two contending workers. work_size represents the amount of work done outside the critical section in each iteration, so the smaller work_size is, the more frequently contention occurs. work_count is scaled inversely to keep the total runtime roughly uniform across tests. Compilation was done with the O3 flag.
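For reference, a build line consistent with this setup (assuming g++ and C++17; only the O3 flag matters for the results, and -pthread is needed for std::thread on Linux):

g++ -O3 -std=c++17 -pthread main.cpp -o atomic_vs_mutex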

Results

When work_size is large

work count : 100 / work size : 100000
    with mutex : 105.623ms
    with lockguard : 105.629ms
    with atomic : 105.555ms
    with atomic Relaxed : 106.172ms

When work_size is small

work count : 2000000 / work size : 5
    with mutex : 262.214ms
    with lockguard : 260.848ms
    with atomic : 120.265ms
    with atomic Relaxed : 120.621ms

Full results

Lock free atomic is supported
More work count - more contention
==============================================
work count : 100 / work size : 100000
    with mutex : 105.623ms
    with lockguard : 105.629ms
    with atomic : 105.555ms
    with atomic Relaxed : 106.172ms
==============================================
work count : 1000 / work size : 10000
    with mutex : 106.651ms
    with lockguard : 105.044ms
    with atomic : 108.993ms
    with atomic Relaxed : 105.876ms
==============================================
work count : 10000 / work size : 1000
    with mutex : 105.706ms
    with lockguard : 106.616ms
    with atomic : 106.648ms
    with atomic Relaxed : 106.754ms
==============================================
work count : 100000 / work size : 100
    with mutex : 107.912ms
    with lockguard : 109.505ms
    with atomic : 107.884ms
    with atomic Relaxed : 106.551ms
==============================================
work count : 1000000 / work size : 10
    with mutex : 166.493ms
    with lockguard : 165.845ms
    with atomic : 117.055ms
    with atomic Relaxed : 117.342ms
==============================================
work count : 2000000 / work size : 5
    with mutex : 262.214ms
    with lockguard : 260.848ms
    with atomic : 120.265ms
    with atomic Relaxed : 120.621ms
==============================================
work count : 10000000 / work size : 1
    with mutex : 901.735ms
    with lockguard : 905.303ms
    with atomic : 197.539ms
    with atomic Relaxed : 209.97ms

Conclusions

As expected, the heavier the contention, the larger the atomic advantage. However, when each unit of work takes about 1 ms, the difference between mutex and atomic was not significant at all! It is also worth noting that the overhead of lock_guard's constructor and destructor was practically negligible in this experiment. Since managing a mutex by hand leaves room for mistakes, it is better to use the RAII pattern than to call lock() and unlock() directly.

Personally, I was surprised that memory_order_relaxed provided no advantage, and in the most extreme case, shown below, this tendency was clear.

work count : 10000000 / work size : 1
    with mutex : 901.735ms
    with lockguard : 905.303ms
    with atomic : 197.539ms
    with atomic Relaxed : 209.97ms

Under extreme contention, memory_order_relaxed somehow came out slower, and it fell further behind as contention intensified, which calls for further investigation. For now, though, it is reasonable to conclude that there is no need to reach for relaxed memory ordering as an optimization, even under this much contention.
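One plausible explanation, assuming an x86-64 target (the CPU is not stated in the setup above): fetch_add compiles to a lock-prefixed add regardless of the requested memory order, because the lock prefix is already a full barrier, so relaxed ordering has nothing left to relax. A minimal sketch:

#include <atomic>

std::atomic<int> counter{0};

// On x86-64 (GCC/Clang, -O3), both of these compile to the same
// `lock add` instruction; the ordering argument only changes codegen
// on weakly ordered architectures such as ARM.
void add_seq_cst() { counter.fetch_add(1); }  // default: memory_order_seq_cst
void add_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }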

Even when contention occurs roughly every 0.01 ms, the mutex was surprisingly fast; the critical section is not as expensive as I expected. For sensor data arriving at around 1000 Hz, which is common in my research area, there seems to be little to worry about. Results could well differ with more threads, but at least in a single producer-consumer scenario the cost looks insignificant.
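As a minimal sketch of that scenario (the types and names here are hypothetical, not part of the test code): a sensor thread publishes a sample under a mutex at ~1000 Hz, and a consumer takes the latest one.

#include <mutex>
#include <optional>

struct SensorSample { double value; };     // hypothetical payload type

std::mutex sample_mtx;
std::optional<SensorSample> latest_sample; // shared slot, guarded by sample_mtx

// Producer: called from the sensor thread at ~1000 Hz,
// i.e. the lock is contended at most once per millisecond.
void publish(const SensorSample& s) {
  std::lock_guard<std::mutex> lock(sample_mtx);
  latest_sample = s;
}

// Consumer: takes the most recent sample, if any.
std::optional<SensorSample> take() {
  std::lock_guard<std::mutex> lock(sample_mtx);
  auto s = latest_sample;
  latest_sample.reset();
  return s;
}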

  1. Don’t worry too much about critical sections.

  2. RAII is always the right choice. Actively utilize it and use lock_guard (see the sketch after this list).

  3. As for the memory_order_relaxed results, I'm not entirely sure what happened, but sticking with the default ordering seems perfectly acceptable.
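To make point 2 concrete, a minimal sketch (doSomething() is a hypothetical function, not part of the test code): if the code inside the critical section can throw or return early, the manual version leaves the mutex locked, while lock_guard releases it on every exit path.

void manualLocking() {
  mtx.lock();
  doSomething();  // if this throws, mtx is never unlocked -> later deadlock
  mtx.unlock();
}

void raiiLocking() {
  lock_guard<mutex> lock(mtx);
  doSomething();  // even if this throws, ~lock_guard() unlocks mtx
}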

Entire test code

#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>

using namespace std;
using namespace std::chrono_literals;

mutex mtx;
atomic<int> atomic_data{0};
int critical_data = 0;

class Stopwatch {
 public:
  Stopwatch() { tic(); }

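  // Note: on some standard libraries high_resolution_clock is an alias for
  // system_clock, which is not monotonic; steady_clock is the usual choice
  // for benchmarking.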
  void tic() { start = chrono::high_resolution_clock::now(); }

  // returns time elapsed, in ms as default.
  template <typename Duration = chrono::milliseconds>
  double toc() const {
    auto end = chrono::high_resolution_clock::now();
    chrono::duration<double, typename Duration::period> elapsed = end - start;
    return elapsed.count();
  }

 private:
  chrono::high_resolution_clock::time_point start;
};

void workerWithLock(int work_count, int work_size) {
  thread_local mt19937 gen(random_device{}());
  thread_local normal_distribution<float> nd(0, 10);

  for (int i = 0; i < work_count; i++) {
    int work = 0;
    for (int j = 0; j < work_size; j++) work += static_cast<int>(nd(gen));

    mtx.lock();
    critical_data += work;
    mtx.unlock();
  }
}

void workerWithLockGuard(int work_count, int work_size) {
  thread_local mt19937 gen(random_device{}());
  thread_local normal_distribution<float> nd(0, 10);

  for (int i = 0; i < work_count; i++) {
    int work = 0;
    for (int j = 0; j < work_size; j++) work += static_cast<int>(nd(gen));

    lock_guard<mutex> lock(mtx);
    critical_data += work;
  }
}

void workerWithAtomic(int work_count, int work_size) {
  thread_local mt19937 gen(random_device{}());
  thread_local normal_distribution<float> nd(0, 10);

  for (int i = 0; i < work_count; i++) {
    int work = 0;
    for (int j = 0; j < work_size; j++) work += static_cast<int>(nd(gen));

    atomic_data.fetch_add(work);
  }
}

void workerWithAtomicRelaxed(int work_count, int work_size) {
  thread_local mt19937 gen(random_device{}());
  thread_local normal_distribution<float> nd(0, 10);

  for (int i = 0; i < work_count; i++) {
    int work = 0;
    for (int j = 0; j < work_size; j++) work += static_cast<int>(nd(gen));

    atomic_data.fetch_add(work, memory_order_relaxed);
  }
}

// test code
void test(int work_count, int work_size) {
  cout << "work count : " << work_count << " / work size : " << work_size << endl;

  Stopwatch t1;
  thread thread1(workerWithLock, work_count, work_size);
  thread thread2(workerWithLock, work_count, work_size);
  thread1.join();
  thread2.join();
  cout << "    with mutex : " << t1.toc() << "ms" << endl;

  Stopwatch t2;
  thread thread3(workerWithLockGuard, work_count, work_size);
  thread thread4(workerWithLockGuard, work_count, work_size);
  thread3.join();
  thread4.join();
  cout << "    with lockguard : " << t2.toc() << "ms" << endl;

  Stopwatch t3;
  thread thread5(workerWithAtomic, work_count, work_size);
  thread thread6(workerWithAtomic, work_count, work_size);
  thread5.join();
  thread6.join();
  cout << "    with atomic : " << t3.toc() << "ms" << endl;

  Stopwatch t4;
  thread thread7(workerWithAtomicRelaxed, work_count, work_size);
  thread thread8(workerWithAtomicRelaxed, work_count, work_size);
  thread7.join();
  thread8.join();
  cout << "    with atomic Relaxed : " << t4.toc() << "ms" << endl;
}

int main() {
  if (atomic_data.is_lock_free()) cout << "Lock free atomic is supported" << endl;

  cout << "More work count - more race condition" << endl;

  cout << "==============================================" << endl;
  test(100, 100000);

  cout << "==============================================" << endl;
  test(1000, 10000);

  cout << "==============================================" << endl;
  test(10000, 1000);

  cout << "==============================================" << endl;
  test(100000, 100);

  cout << "==============================================" << endl;
  test(1000000, 10);

  cout << "==============================================" << endl;
  test(2000000, 5);

  cout << "==============================================" << endl;
  test(10000000, 1);

  return 0;
}