Every concurrent abstraction has a cost. The path from "thread-safe" to "fast" requires understanding where those costs hide — and they hide in places you wouldn't expect.
This article climbs the performance ladder of shared state in Rust: from hidden costs that silently kill performance, through the tradeoffs of atomics versus mutexes, up to the realization that the fastest synchronization is often no synchronization at all. Each rung reveals a new optimization — and the benchmarks show exactly how much it matters.
All benchmarks were run on an Intel Core i9-14900K (24 cores, 32 threads) with Rust 1.93.0.
False Sharing: The Hidden Contention
You're building a system with per-thread statistics. Each thread maintains its own counter — no shared state, no locks, no contention. You add more threads expecting linear scaling. Instead, performance collapses:
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
struct PerThreadCounters {
counters: [AtomicU64; 8],
}
fn main() {
let counters = Arc::new(PerThreadCounters {
counters: std::array::from_fn(|_| AtomicU64::new(0)),
});
let handles: Vec<_> = (0..8)
.map(|i| {
let c = Arc::clone(&counters);
thread::spawn(move || {
// Each thread increments ONLY its own counter
for _ in 0..10_000_000 {
c.counters[i].fetch_add(1, Ordering::Relaxed);
}
})
})
.collect();
for h in handles {
h.join().unwrap();
}
}
Eight threads, eight separate counters, zero data sharing. Yet this runs over 10x slower than it should. The threads aren't contending on data — they're contending on cache lines.
Modern CPUs don't access memory byte by byte. Each core has its own L1/L2 cache, and when one core writes to a memory location, other cores must invalidate their cached copy. This is the MESI protocol (Modified, Exclusive, Shared, Invalid).
Cache lines are typically 64 bytes. When you access any byte in a cache line, the CPU loads all 64 bytes. Our eight AtomicU64 counters total exactly 64 bytes — they all fit in a single cache line. When thread 0 writes to counters[0], it invalidates the cache line for all other cores, even though they're accessing different indices.
This is called false sharing — the threads aren't actually sharing data, but the hardware doesn't know that.
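You can verify the layout claim directly before measuring anything. This quick sanity check (the Packed and Padded names are illustrative, mirroring the types used in the benchmark below) uses std::mem::size_of to confirm that eight packed atomics occupy exactly one 64-byte cache line, while a 64-byte-aligned wrapper forces one line per counter:

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

// Packed: eight 8-byte atomics share one 64-byte cache line.
#[allow(dead_code)]
struct Packed {
    counters: [AtomicU64; 8],
}

// Padded: the alignment attribute forces each counter onto its own line.
#[allow(dead_code)]
#[repr(align(64))]
struct Padded {
    value: AtomicU64,
}

fn main() {
    // All eight packed counters fit in a single 64-byte cache line.
    assert_eq!(size_of::<Packed>(), 64);
    // Each padded counter is rounded up to a full 64-byte line...
    assert_eq!(size_of::<Padded>(), 64);
    assert_eq!(align_of::<Padded>(), 64);
    // ...so eight of them span eight separate lines.
    assert_eq!(size_of::<[Padded; 8]>(), 512);
    println!("layout checks passed");
}
```

The benchmark below measures what this layout difference costs.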
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;
// Counters packed together — will false share
struct PackedCounters {
counters: [AtomicU64; 8], // 64 bytes total, fits in one cache line
}
// Counters padded to separate cache lines
#[repr(align(64))]
struct PaddedCounter {
value: AtomicU64,
}
struct PaddedCounters {
counters: [PaddedCounter; 8], // 512 bytes, each in its own cache line
}
fn main() {
let num_threads = 8;
let iterations = 10_000_000;
// Packed (false sharing)
let packed = Arc::new(PackedCounters {
counters: std::array::from_fn(|_| AtomicU64::new(0)),
});
let start = Instant::now();
let handles: Vec<_> = (0..num_threads)
.map(|i| {
let p = Arc::clone(&packed);
thread::spawn(move || {
for _ in 0..iterations {
p.counters[i].fetch_add(1, Ordering::Relaxed);
}
})
})
.collect();
for h in handles {
h.join().unwrap();
}
let packed_time = start.elapsed();
// Padded (no false sharing)
let padded = Arc::new(PaddedCounters {
counters: std::array::from_fn(|_| PaddedCounter {
value: AtomicU64::new(0),
}),
});
let start = Instant::now();
let handles: Vec<_> = (0..num_threads)
.map(|i| {
let p = Arc::clone(&padded);
thread::spawn(move || {
for _ in 0..iterations {
p.counters[i].value.fetch_add(1, Ordering::Relaxed);
}
})
})
.collect();
for h in handles {
h.join().unwrap();
}
let padded_time = start.elapsed();
println!("Packed (false sharing): {:?}", packed_time);
println!("Padded (isolated): {:?}", padded_time);
println!("False sharing penalty: {:.1}x",
packed_time.as_nanos() as f64 / padded_time.as_nanos() as f64);
}
Each thread is incrementing its own counter. They never read or write each other's data. Yet the packed version is over 10x slower because all eight counters fit in one cache line, causing constant invalidation traffic.
The fix is simple but not obvious: add padding to push each counter into its own cache line. The #[repr(align(64))] attribute tells the compiler to align each PaddedCounter to a 64-byte boundary.
True Contention: Sharing the Same Atomic
False sharing is when threads access different data that happens to share a cache line. True contention is when threads access the same atomic variable. Both cause cache line bouncing, but true contention is unavoidable if you need a single shared counter.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;
fn main() {
let num_threads = std::thread::available_parallelism().map(usize::from).unwrap_or(8); // logical core count, no external crate needed
let iterations_per_thread = 1_000_000;
// Single-threaded baseline
let counter = AtomicU64::new(0);
let start = Instant::now();
for _ in 0..iterations_per_thread * num_threads {
counter.fetch_add(1, Ordering::Relaxed);
}
let single_threaded = start.elapsed();
// Multi-threaded (contended)
let counter = Arc::new(AtomicU64::new(0));
let start = Instant::now();
let handles: Vec<_> = (0..num_threads)
.map(|_| {
let c = Arc::clone(&counter);
thread::spawn(move || {
for _ in 0..iterations_per_thread {
c.fetch_add(1, Ordering::Relaxed);
}
})
})
.collect();
for h in handles {
h.join().unwrap();
}
let multi_threaded = start.elapsed();
println!("Threads: {}", num_threads);
println!("Single-threaded: {:?}", single_threaded);
println!("Multi-threaded: {:?}", multi_threaded);
}
With 32 threads hammering the same atomic, performance is 3x worse than single-threaded — despite using all those cores. The cache line bounces between cores faster than useful work gets done. This is why high-performance systems avoid shared counters entirely.
The CAS Retry Storm
Beyond simple counters, many lock-free algorithms use Compare-And-Swap (CAS) operations. CAS atomically compares a value to an expected value and, only if they match, replaces it with a new value. If they don't match, the operation fails and you must retry.
// Typical CAS loop pattern
loop {
let current = value.load(Ordering::Relaxed);
let new_value = compute_new(current);
match value.compare_exchange_weak(
current,
new_value,
Ordering::Relaxed,
Ordering::Relaxed,
) {
Ok(_) => break, // Success
Err(_) => continue, // Someone else modified it, retry
}
}
Under high contention, CAS operations can fail repeatedly. Each failure means wasted computation — you calculated new_value for nothing. Let's measure how bad this gets:
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;
fn main() {
let num_threads = std::thread::available_parallelism().map(usize::from).unwrap_or(8); // logical core count, no external crate needed
let iterations_per_thread = 100_000;
let counter = Arc::new(AtomicU64::new(0));
let total_retries = Arc::new(AtomicUsize::new(0));
let start = Instant::now();
let handles: Vec<_> = (0..num_threads)
.map(|_| {
let c = Arc::clone(&counter);
let r = Arc::clone(&total_retries);
thread::spawn(move || {
let mut local_retries = 0usize;
for _ in 0..iterations_per_thread {
loop {
let current = c.load(Ordering::Relaxed);
match c.compare_exchange_weak(
current,
current + 1,
Ordering::Relaxed,
Ordering::Relaxed,
) {
Ok(_) => break,
Err(_) => local_retries += 1,
}
}
}
r.fetch_add(local_retries, Ordering::Relaxed);
})
})
.collect();
for h in handles {
h.join().unwrap();
}
let elapsed = start.elapsed();
let total_ops = num_threads * iterations_per_thread;
let retries = total_retries.load(Ordering::Relaxed);
println!("Threads: {}", num_threads);
println!("Total operations: {}", total_ops);
println!("Total CAS retries: {}", retries);
println!("Retries per success: {:.2}", retries as f64 / total_ops as f64);
println!("Wasted work: {:.1}% of CAS attempts failed",
(retries as f64 / (total_ops + retries) as f64) * 100.0);
}
With 32 threads, nearly 40% of CAS attempts fail. That's significant wasted work — loading the value, computing the result, then throwing it away. For simple increments, fetch_add avoids this entirely. But complex lock-free data structures often have no choice but to use CAS loops.
Atomics vs Mutex: Speed vs Correctness
Intuition suggests that for complex multi-field updates, a mutex should beat a series of separate atomic operations. Let's test that. Consider updating a statistics struct with min, max, sum, and count:
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;
// Atomic version — multiple CAS loops
struct AtomicStats {
min: AtomicU64,
max: AtomicU64,
sum: AtomicU64,
count: AtomicU64,
}
impl AtomicStats {
fn new() -> Self {
Self {
min: AtomicU64::new(u64::MAX),
max: AtomicU64::new(0),
sum: AtomicU64::new(0),
count: AtomicU64::new(0),
}
}
fn record(&self, value: u64) {
// Update min
loop {
let current = self.min.load(Ordering::Relaxed);
if value >= current { break; }
if self.min.compare_exchange_weak(
current, value, Ordering::Relaxed, Ordering::Relaxed
).is_ok() { break; }
}
// Update max
loop {
let current = self.max.load(Ordering::Relaxed);
if value <= current { break; }
if self.max.compare_exchange_weak(
current, value, Ordering::Relaxed, Ordering::Relaxed
).is_ok() { break; }
}
self.sum.fetch_add(value, Ordering::Relaxed);
self.count.fetch_add(1, Ordering::Relaxed);
}
}
// Mutex version — simple and correct
struct MutexStats {
inner: Mutex<(u64, u64, u64, u64)>, // min, max, sum, count
}
impl MutexStats {
fn new() -> Self {
Self { inner: Mutex::new((u64::MAX, 0, 0, 0)) }
}
fn record(&self, value: u64) {
let mut guard = self.inner.lock().unwrap();
guard.0 = guard.0.min(value);
guard.1 = guard.1.max(value);
guard.2 += value;
guard.3 += 1;
}
}
Atomics won decisively. But why? The min/max CAS loops rarely retry because most values don't update the extremes. And fetch_add for sum and count is a single hardware instruction — no retry possible.
"But surely with high collision rates, mutex wins?" Let's test pure CAS increment where every operation collides:
// CAS loop version — retry on collision
let cas_counter = Arc::new(AtomicU64::new(0));
let handles: Vec<_> = (0..num_threads)
.map(|_| {
let c = Arc::clone(&cas_counter);
thread::spawn(move || {
for _ in 0..iterations_per_thread {
loop {
let current = c.load(Ordering::Relaxed);
if c.compare_exchange_weak(
current,
current + 1,
Ordering::Relaxed,
Ordering::Relaxed,
).is_ok() {
break;
}
}
}
})
})
.collect();
// Mutex version — acquire, increment, release
let mutex_counter = Arc::new(Mutex::new(0u64));
let handles: Vec<_> = (0..num_threads)
.map(|_| {
let c = Arc::clone(&mutex_counter);
thread::spawn(move || {
for _ in 0..iterations_per_thread {
let mut guard = c.lock().unwrap();
*guard += 1;
}
})
})
.collect();
CAS still wins 2-4x, even with 100% collision rate. On modern CPUs with efficient cache coherency protocols, atomic operations are remarkably fast. The "mutex beats atomics" intuition comes from older hardware or higher-level languages with heavier lock implementations.
So when should you use mutex? Correctness and simplicity. The atomic version updates min, max, sum, and count separately. Between updates, another thread might read inconsistent state — count not matching sum, or min temporarily exceeding max. Mutex guarantees all fields update atomically together. Sometimes correct and clear beats fast and subtle.
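The consistency point can be made concrete with a minimal sketch (the type and field names here are illustrative, not from the benchmark above). With separate atomics, a reader loading sum and then count can interleave with a writer and observe values from two different moments; under a mutex, both fields change and are read inside one critical section:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Separate atomics: each field is individually atomic, but the pair is not.
struct AtomicPair {
    sum: AtomicU64,
    count: AtomicU64,
}

impl AtomicPair {
    fn record(&self, v: u64) {
        self.sum.fetch_add(v, Ordering::Relaxed);
        // A reader running right here sees the new sum with the old count.
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

// Mutex: sum and count always change together; a snapshot is consistent.
struct LockedPair {
    inner: Mutex<(u64, u64)>, // (sum, count)
}

impl LockedPair {
    fn record(&self, v: u64) {
        let mut g = self.inner.lock().unwrap();
        g.0 += v;
        g.1 += 1;
    }

    fn snapshot(&self) -> (u64, u64) {
        // Both fields are read under the same lock acquisition.
        *self.inner.lock().unwrap()
    }
}

fn main() {
    let atomic = AtomicPair {
        sum: AtomicU64::new(0),
        count: AtomicU64::new(0),
    };
    atomic.record(10);
    assert_eq!(atomic.sum.load(Ordering::Relaxed), 10);

    let locked = LockedPair { inner: Mutex::new((0, 0)) };
    for v in [3u64, 5, 7] {
        locked.record(v);
    }
    // The snapshot is internally consistent: count always matches sum.
    assert_eq!(locked.snapshot(), (15, 3));
    println!("snapshot = {:?}", locked.snapshot());
}
```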
But if you do choose mutex, which implementation should you use?
Choosing a Mutex: std vs parking_lot
A common recommendation is to use parking_lot::Mutex instead of std::sync::Mutex for better performance. Let's test that with a simple increment loop:
// std::sync::Mutex
let mutex = Arc::new(std::sync::Mutex::new(0u64));
for _ in 0..1_000_000 {
let mut guard = mutex.lock().unwrap();
*guard += 1;
}
// parking_lot::Mutex
let mutex = Arc::new(parking_lot::Mutex::new(0u64));
for _ in 0..1_000_000 {
let mut guard = mutex.lock();
*guard += 1;
}
parking_lot wins at low thread counts but collapses at high contention — 35x slower at 32 threads. What's happening?
The difference is fairness overhead. Even though critical sections here are nanoseconds (well under the 0.5ms threshold), parking_lot's fairness timer still fires approximately every 0.5ms. With 32 threads hammering the lock, that means thousands of forced fair unlocks per second, each causing a context switch. These context switches accumulate and destroy throughput.
std::sync::Mutex (using futex on Linux) is always unfair: whichever thread happens to be running can "steal" the lock before sleeping threads wake up. No forced context switches, no fairness overhead. This is faster under contention but can starve threads (as we'll see below).
The critical section length also matters:
// Simulate work inside critical section
fn do_work(iterations: u32) -> u64 {
let mut sum = 0u64;
for i in 0..iterations {
sum = sum.wrapping_add(i as u64);
}
sum
}
// Lock, do work, unlock
for _ in 0..10_000 {
let mut guard = mutex.lock();
*guard = guard.wrapping_add(do_work(work_iterations));
}
With zero work inside the lock (just acquire/release), parking_lot's optimized fast path wins. But as soon as there's real work, std::sync::Mutex pulls ahead. The fair locking overhead outweighs any fast-path advantage.
Fairness under asymmetric load: What about scenarios where one thread aggressively holds the lock while others compete? parking_lot's fairness timer (0.5ms) should prevent monopolization — but only if critical sections are long enough to trigger it. Let's test with one "hog" thread holding the lock for 500µs vs five normal threads:
// Hog thread — holds lock for 500µs, immediately reacquires
thread::spawn(move || {
while running.load(Ordering::Relaxed) {
let mut guard = mutex.lock();
*guard += 1;
thread::sleep(Duration::from_micros(500)); // Hold for 500µs
drop(guard);
// No delay — immediately try to reacquire
}
});
// Normal threads — quick critical section, yield between acquisitions
thread::spawn(move || {
while running.load(Ordering::Relaxed) {
let mut guard = mutex.lock();
*guard += 1;
drop(guard);
thread::yield_now(); // Politely yield
}
});
The difference is stark. Here's why:
std::Mutex (barging): When the hog releases the lock, normal threads begin waking up from kernel sleep. But waking up takes time — the scheduler must context-switch to them. The hog, already running on CPU, immediately calls lock() again. Since std allows "barging" (any running thread can grab a free lock), the hog wins before sleeping threads finish waking. Normal threads wake up, find the lock taken, and go back to sleep. This cycle repeats indefinitely.
parking_lot (handoff): When the fairness timer fires (every ~0.5ms) or when critical sections exceed 1ms, parking_lot uses a handoff mechanism. Instead of simply releasing the lock, it keeps the lock bit set and directly transfers ownership to the next thread in queue. The hog cannot barge — the lock is never "free" for it to steal. This guarantees the next waiter gets the lock.
Important: With very short critical sections (nanoseconds), the fairness timer fires less often relative to the operation rate, so both mutexes behave more similarly. The fairness advantage manifests with longer hold times that approach or exceed the 0.5ms threshold.
Bursty workloads: What about workloads with bursts of activity followed by idle periods? This tests how mutexes handle transitions between contended and uncontended states:
for _burst in 0..20 {
// Spawn 8 threads, each doing 50K lock/unlock cycles
let handles: Vec<_> = (0..8)
.map(|_| {
let m = Arc::clone(&mutex);
thread::spawn(move || {
for _ in 0..50_000 {
let mut guard = m.lock();
*guard += 1;
}
})
})
.collect();
for h in handles { h.join().unwrap(); }
// Idle period — mutex becomes uncontended
thread::sleep(Duration::from_millis(10));
}
Here parking_lot wins. Why? At the start of each burst, contention builds gradually. parking_lot's adaptive spinning lets early threads acquire the lock without syscalls — they spin briefly and often succeed before needing to park. std::Mutex immediately calls into the kernel (futex_wait) when the lock is taken, paying syscall overhead even for short waits. Over 20 bursts, these saved syscalls add up.
When to use each:
- std::sync::Mutex — Short critical sections with high throughput requirements, when starvation is acceptable or impossible (e.g., all threads behave identically)
- parking_lot::Mutex — Longer critical sections, when fairness matters, bursty workloads, or when its smaller size (1 byte vs ~8 bytes) matters for fine-grained locking
Note: parking_lot's fairness timer (0.5ms) only helps with critical sections approaching that duration. For nanosecond-level operations, both mutexes behave similarly. Always benchmark your actual workload.
We've optimized our mutex choice. But we're still funneling all threads through a single lock. What if we could split the lock itself?
Sharding: Split the Bottleneck
A single lock protecting a large data structure is a serialization point. Sharding splits the data into independent partitions, each with its own lock. Threads accessing different partitions never contend.
DashMap applies this to concurrent HashMaps. Instead of one RwLock<HashMap>, it maintains multiple shards (default: CPU count × 4). Each key hashes to a shard, and only that shard's lock is acquired.
// RwLock<HashMap> - single lock for entire map
let map = Arc::new(RwLock::new(HashMap::new()));
// Read (80% of operations)
let _ = map.read().unwrap().get(&key);
// Write (20% of operations)
map.write().unwrap().insert(key, value);
// DashMap - internally sharded
let map = Arc::new(DashMap::new());
// Read - only locks one shard
let _ = map.get(&key);
// Write - only locks one shard
map.insert(key, value);
At 32 threads, DashMap is 44x faster. Why?
With RwLock<HashMap>, every operation — read or write — must acquire the single lock. Worse, RwLock typically prioritizes writers to prevent writer starvation: when a writer is waiting, new readers must wait too. This turns an 80% read workload into a serialized queue.
DashMap with 128 shards (32 CPUs × 4) means two threads only contend if they hash to the same shard — roughly a 1-in-128 chance. With 32 threads spread across 128 shards, most lock acquisitions are uncontended.
The 1-thread case shows the tradeoff: sharding adds overhead (hashing, shard lookup) that only pays off under contention. At 1 thread, there's no contention to avoid, so RwLock's simpler path wins.
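DashMap's internals are more sophisticated, but the core sharding idea can be sketched with std alone. This toy version (shard count and hashing scheme are illustrative choices, not DashMap's actual implementation) shows the key mechanism: hash the key to pick a shard, and take only that shard's lock:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// A toy sharded map: N independent locks instead of one global lock.
struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards)
                .map(|_| Mutex::new(HashMap::new()))
                .collect(),
        }
    }

    // Hash the key to pick a shard; only that shard's lock is touched.
    fn shard_for(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: K, value: V) {
        self.shard_for(&key).lock().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let map = ShardedMap::new(16);
    for i in 0..100u64 {
        map.insert(i, i * 2);
    }
    assert_eq!(map.get(&21), Some(42));
    assert_eq!(map.get(&1000), None);
    println!("sharded map ok");
}
```

Two threads contend only when their keys hash to the same shard, which is exactly the property the DashMap benchmark exploits.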
Sharding is powerful, but we're still locking. Can we avoid locks entirely?
Lock-Free: No Locks At All
Lock-free data structures use atomic operations (CAS loops) instead of mutexes. No thread ever blocks waiting for a lock — they retry until they succeed. This eliminates context switches and priority inversion.
crossbeam provides lock-free channels for message passing. std::sync::mpsc is single-consumer only; crossbeam's channels natively support multiple producers and multiple consumers, built on lock-free queues.
// std::sync::mpsc - single consumer only
let (tx, rx) = std::sync::mpsc::channel();
tx.send(value).unwrap(); // producer
let v = rx.recv().unwrap(); // consumer (only one allowed)
// crossbeam::channel - multiple producers AND consumers
let (tx, rx) = crossbeam::channel::unbounded();
tx.send(value).unwrap(); // producer (can clone tx)
let v = rx.recv().unwrap(); // consumer (can clone rx too!)
The MPMC case is 11.5x faster. Here's why:
Mutex<VecDeque> serializes all producers and consumers through one lock. When a producer holds the lock to push, all consumers block. When a consumer holds the lock to pop, all producers block.
crossbeam's lock-free queue lets producers and consumers operate simultaneously. Multiple producers can CAS their items onto the queue concurrently. Multiple consumers can CAS items off concurrently. The only "contention" is CAS retries, which are far cheaper than context switches.
Note that std::sync::mpsc doesn't even support multiple consumers — the "sc" means "single consumer." crossbeam handles MPMC natively.
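The Mutex<VecDeque> baseline referenced above can be sketched with std alone. This is a minimal blocking queue (not the exact benchmark code) showing the serialization point: every push and pop goes through the same lock, and a Condvar parks consumers when the queue is empty:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// One lock guards the whole queue: a push blocks every pop and vice versa.
struct BlockingQueue<T> {
    inner: Mutex<VecDeque<T>>,
    not_empty: Condvar,
}

impl<T> BlockingQueue<T> {
    fn new() -> Self {
        Self {
            inner: Mutex::new(VecDeque::new()),
            not_empty: Condvar::new(),
        }
    }

    fn push(&self, item: T) {
        self.inner.lock().unwrap().push_back(item);
        self.not_empty.notify_one();
    }

    fn pop(&self) -> T {
        let mut q = self.inner.lock().unwrap();
        // Sleep until a producer pushes something.
        while q.is_empty() {
            q = self.not_empty.wait(q).unwrap();
        }
        q.pop_front().unwrap()
    }
}

fn main() {
    let q = Arc::new(BlockingQueue::new());
    // Four producers, one consumer, all funneled through one lock.
    let producers: Vec<_> = (0..4u64)
        .map(|p| {
            let q = Arc::clone(&q);
            thread::spawn(move || {
                for i in 0..1000u64 {
                    q.push(p * 1000 + i);
                }
            })
        })
        .collect();
    let consumer = {
        let q = Arc::clone(&q);
        thread::spawn(move || (0..4000).map(|_| q.pop()).sum::<u64>())
    };
    for h in producers {
        h.join().unwrap();
    }
    let sum = consumer.join().unwrap();
    // The four producers push exactly the values 0..4000 between them.
    assert_eq!(sum, (0..4000u64).sum::<u64>());
    println!("total = {}", sum);
}
```

crossbeam's lock-free queue replaces that single lock with CAS operations, which is where the 11.5x gap comes from.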
Lock-free is fast, but it still requires coordination. What if threads didn't need to coordinate at all?
The Real Answer: Stop Sharing
The fastest synchronization is no synchronization. If each thread maintains its own local counter and only occasionally merges into a global total, you eliminate contention entirely.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;
fn main() {
let num_threads = std::thread::available_parallelism().map(usize::from).unwrap_or(8); // logical core count, no external crate needed
let iterations = 10_000_000;
// Shared atomic counter
let shared = Arc::new(AtomicU64::new(0));
let start = Instant::now();
let handles: Vec<_> = (0..num_threads)
.map(|_| {
let c = Arc::clone(&shared);
thread::spawn(move || {
for _ in 0..iterations {
c.fetch_add(1, Ordering::Relaxed);
}
})
})
.collect();
for h in handles { h.join().unwrap(); }
let shared_time = start.elapsed();
// Thread-local counters, merge at end
let final_count = Arc::new(AtomicU64::new(0));
let start = Instant::now();
let handles: Vec<_> = (0..num_threads)
.map(|_| {
let fc = Arc::clone(&final_count);
thread::spawn(move || {
let mut local: u64 = 0;
for _ in 0..iterations {
local += 1; // No synchronization!
}
fc.fetch_add(local, Ordering::Relaxed); // One atomic at the end
})
})
.collect();
for h in handles { h.join().unwrap(); }
let local_time = start.elapsed();
println!("Threads: {}", num_threads);
println!("Shared atomic: {:?}", shared_time);
println!("Thread-local: {:?}", local_time);
println!("Speedup: {:.0}x",
shared_time.as_nanos() as f64 / local_time.as_nanos() as f64);
}
The thread-local version does one atomic operation at the end instead of millions during execution. The speedup is dramatic.
Thread-Local in Practice
Rust provides thread_local! for declaring thread-local storage. Here's a production-ready metrics collector:
use parking_lot::Mutex;
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;
thread_local! {
static LOCAL: RefCell<HashMap<&'static str, u64>> =
RefCell::new(HashMap::new());
}
#[derive(Clone)]
pub struct Metrics {
global: Arc<Mutex<HashMap<&'static str, u64>>>,
}
impl Metrics {
pub fn new() -> Self {
Self {
global: Arc::new(Mutex::new(HashMap::new())),
}
}
#[inline]
pub fn increment(&self, key: &'static str) {
LOCAL.with(|local| {
*local.borrow_mut().entry(key).or_insert(0) += 1;
});
}
pub fn flush(&self) {
LOCAL.with(|local| {
let mut local = local.borrow_mut();
if local.is_empty() { return; }
let mut global = self.global.lock();
for (k, v) in local.drain() {
*global.entry(k).or_insert(0) += v;
}
});
}
pub fn snapshot(&self) -> HashMap<&'static str, u64> {
self.global.lock().clone()
}
}
The increment call touches only thread-local storage — no synchronization at all. The flush call, which should run periodically (e.g., every second, or when a thread is about to exit), merges the local counters into the global map.
This pattern is used by the metrics crate and many high-performance systems. The tradeoff is that reads aren't instant — you need to aggregate from all threads to get accurate values.
Decision Framework
When choosing a synchronization strategy:
- How often is it written vs read? Write-heavy, read-rarely → thread-local accumulation. Balanced → continue below.
- Do multiple fields need consistent updates? If readers must see all-or-nothing changes → Mutex<T> for correctness. If eventual consistency is fine → individual atomics are faster.
- How many threads contend? Few threads (2-4) → atomics are fine. Many threads → consider sharding or thread-local.
- Are atomics in the same cache line? Adjacent atomics in an array or struct → pad to cache line size (typically 64 bytes).
Key Takeaways
- False sharing is sneaky. Pad to cache line size (64 bytes on x86_64, may vary on other architectures).
- Atomics aren't free. Cache coherency has real costs under contention.
- CAS loops waste work but still win. Even with high collision rates, they beat mutex on modern CPUs.
- Use mutex for correctness. When multiple fields must update atomically, mutex is simpler and safer.
- Fairness requires time. parking_lot's 0.5ms timer prevents starvation with longer holds; with nanosecond operations, both mutexes behave similarly.
- Sharding beats single locks. DashMap's 44x speedup comes from splitting one lock into many.
- Lock-free beats locks. crossbeam's 11.5x MPMC speedup comes from eliminating context switches.
- The best sync is no sync. Thread-local accumulation delivers 1886x when you can defer sharing.
Further Reading
- Rust Atomics and Locks — Mara Bos's book on low-level concurrency
- Weak vs. Strong Memory Models — Jeff Preshing on memory ordering
- Inside Rust's std and parking_lot mutexes — detailed fairness and starvation analysis
- crossbeam — lock-free data structures and channels
- dashmap — sharded concurrent HashMap