
The Shared State Performance Ladder in Rust

Every concurrent abstraction has a cost. The path from "thread-safe" to "fast" requires understanding where those costs hide — and they hide in places you wouldn't expect.

This article climbs the performance ladder of shared state in Rust: from hidden costs that silently kill performance, through the tradeoffs of atomics versus mutexes, up to the realization that the fastest synchronization is often no synchronization at all. Each rung reveals a new optimization — and the benchmarks show exactly how much it matters.

All benchmarks were run on an Intel Core i9-14900K (24 cores, 32 threads) with Rust 1.93.0.

False Sharing: The Hidden Contention

You're building a system with per-thread statistics. Each thread maintains its own counter — no shared state, no locks, no contention. You add more threads expecting linear scaling. Instead, performance collapses:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct PerThreadCounters {
    counters: [AtomicU64; 8],
}

fn main() {
    let counters = Arc::new(PerThreadCounters {
        counters: std::array::from_fn(|_| AtomicU64::new(0)),
    });

    let handles: Vec<_> = (0..8)
        .map(|i| {
            let c = Arc::clone(&counters);
            thread::spawn(move || {
                // Each thread increments ONLY its own counter
                for _ in 0..10_000_000 {
                    c.counters[i].fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}

Eight threads, eight separate counters, zero data sharing. Yet this runs over 10x slower than it should. The threads aren't contending on data — they're contending on cache lines.

Modern CPUs don't access memory byte by byte. Each core has its own L1/L2 cache, and when one core writes to a memory location, other cores must invalidate their cached copy. This is the MESI protocol (Modified, Exclusive, Shared, Invalid).

Cache lines are typically 64 bytes. When you access any byte in a cache line, the CPU loads all 64 bytes. Our eight AtomicU64 counters total exactly 64 bytes — they all fit in a single cache line. When thread 0 writes to counters[0], it invalidates the cache line for all other cores, even though they're accessing different indices.

This is called false sharing — the threads aren't actually sharing data, but the hardware doesn't know that.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

// Counters packed together — will false share
struct PackedCounters {
    counters: [AtomicU64; 8],  // 64 bytes total, fits in one cache line
}

// Counters padded to separate cache lines
#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}

struct PaddedCounters {
    counters: [PaddedCounter; 8],  // 512 bytes, each in its own cache line
}

fn main() {
    let num_threads = 8;
    let iterations = 10_000_000;

    // Packed (false sharing)
    let packed = Arc::new(PackedCounters {
        counters: std::array::from_fn(|_| AtomicU64::new(0)),
    });
    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|i| {
            let p = Arc::clone(&packed);
            thread::spawn(move || {
                for _ in 0..iterations {
                    p.counters[i].fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let packed_time = start.elapsed();

    // Padded (no false sharing)
    let padded = Arc::new(PaddedCounters {
        counters: std::array::from_fn(|_| PaddedCounter {
            value: AtomicU64::new(0),
        }),
    });
    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|i| {
            let p = Arc::clone(&padded);
            thread::spawn(move || {
                for _ in 0..iterations {
                    p.counters[i].value.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let padded_time = start.elapsed();

    println!("Packed (false sharing): {:?}", packed_time);
    println!("Padded (isolated):      {:?}", padded_time);
    println!("False sharing penalty: {:.1}x",
        packed_time.as_nanos() as f64 / padded_time.as_nanos() as f64);
}
Packed (false sharing): 465.21ms
Padded (isolated):      39.85ms
False sharing penalty: 11.7x

Each thread is incrementing its own counter. They never read or write each other's data. Yet the packed version is over 10x slower because all eight counters fit in one cache line, causing constant invalidation traffic.

The fix is simple but not obvious: add padding to push each counter into its own cache line. The #[repr(align(64))] attribute tells the compiler to align each PaddedCounter to a 64-byte boundary.
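You can verify the layout change without running any benchmark, just by asserting on the type sizes (a minimal std-only sketch; the crossbeam-utils crate offers a CachePadded<T> wrapper that achieves the same effect without a hand-rolled attribute):

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

#[repr(align(64))]
struct PaddedCounter {
    value: AtomicU64,
}

fn main() {
    // AtomicU64 is 8 bytes, so eight of them pack into a single 64-byte line.
    assert_eq!(size_of::<[AtomicU64; 8]>(), 64);
    // Rust rounds a type's size up to its alignment, so the aligned
    // wrapper occupies a full cache line on its own...
    assert_eq!(align_of::<PaddedCounter>(), 64);
    assert_eq!(size_of::<PaddedCounter>(), 64);
    // ...and an array of eight spans eight distinct cache lines.
    assert_eq!(size_of::<[PaddedCounter; 8]>(), 512);
    println!("layout ok");
}
```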

True Contention: Sharing the Same Atomic

False sharing is when threads access different data that happens to share a cache line. True contention is when threads access the same atomic variable. Both cause cache line bouncing, but true contention is unavoidable if you need a single shared counter.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    let num_threads = num_cpus::get();
    let iterations_per_thread = 1_000_000;

    // Single-threaded baseline
    let counter = AtomicU64::new(0);
    let start = Instant::now();
    for _ in 0..iterations_per_thread * num_threads {
        counter.fetch_add(1, Ordering::Relaxed);
    }
    let single_threaded = start.elapsed();

    // Multi-threaded (contended)
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..iterations_per_thread {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let multi_threaded = start.elapsed();

    println!("Threads: {}", num_threads);
    println!("Single-threaded: {:?}", single_threaded);
    println!("Multi-threaded:  {:?}", multi_threaded);
    println!("Slowdown: {:.1}x",
        multi_threaded.as_nanos() as f64 / single_threaded.as_nanos() as f64);
}
Threads: 32
Single-threaded: 107.21ms
Multi-threaded:  330.42ms
Slowdown: 3.1x

With 32 threads hammering the same atomic, performance is 3x worse than single-threaded — despite using all those cores. The cache line bounces between cores faster than useful work gets done. This is why high-performance systems avoid shared counters entirely.

The CAS Retry Storm

Beyond simple counters, many lock-free algorithms use Compare-And-Swap (CAS) operations. CAS atomically compares a value to an expected value and, only if they match, replaces it with a new value. If they don't match, the operation fails and you must retry.

// Typical CAS loop pattern
loop {
    let current = value.load(Ordering::Relaxed);
    let new_value = compute_new(current);
    match value.compare_exchange_weak(
        current,
        new_value,
        Ordering::Relaxed,
        Ordering::Relaxed,
    ) {
        Ok(_) => break,      // Success
        Err(_) => continue,  // Someone else modified it, retry
    }
}

Under high contention, CAS operations can fail repeatedly. Each failure means wasted computation — you calculated new_value for nothing. Let's measure how bad this gets:

use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    let num_threads = num_cpus::get();
    let iterations_per_thread = 100_000;

    let counter = Arc::new(AtomicU64::new(0));
    let total_retries = Arc::new(AtomicUsize::new(0));

    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            let r = Arc::clone(&total_retries);
            thread::spawn(move || {
                let mut local_retries = 0usize;
                for _ in 0..iterations_per_thread {
                    loop {
                        let current = c.load(Ordering::Relaxed);
                        match c.compare_exchange_weak(
                            current,
                            current + 1,
                            Ordering::Relaxed,
                            Ordering::Relaxed,
                        ) {
                            Ok(_) => break,
                            Err(_) => local_retries += 1,
                        }
                    }
                }
                r.fetch_add(local_retries, Ordering::Relaxed);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    let elapsed = start.elapsed();

    let total_ops = num_threads * iterations_per_thread;
    let retries = total_retries.load(Ordering::Relaxed);

    println!("Threads: {}", num_threads);
    println!("Total operations: {}", total_ops);
    println!("Total CAS retries: {}", retries);
    println!("Retries per success: {:.2}", retries as f64 / total_ops as f64);
    println!("Wasted work: {:.1}% of CAS attempts failed",
        (retries as f64 / (total_ops + retries) as f64) * 100.0);
}
Threads: 32
Total operations: 3,200,000
Total CAS retries: 2,126,317
Retries per success: 0.66
Wasted work: 39.9% of CAS attempts failed

With 32 threads, nearly 40% of CAS attempts fail. That's significant wasted work — loading the value, computing the result, then throwing it away. For simple increments, fetch_add avoids this entirely. But complex lock-free data structures often have no choice but to use CAS loops.
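Incidentally, std packages this retry loop for you: AtomicU64::fetch_update takes a closure, attempts the compare-and-swap, and on failure reloads the fresh value and retries. A minimal single-threaded sketch:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let value = AtomicU64::new(40);
    // fetch_update runs the closure on the current value, then CASes the
    // result in; on contention it retries with the freshly loaded value.
    // Returning None from the closure aborts instead of retrying.
    let prev = value
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| Some(v + 2))
        .unwrap();
    assert_eq!(prev, 40);
    assert_eq!(value.load(Ordering::Relaxed), 42);
    println!("fetch_update ok");
}
```

This is the same machine code as the hand-written loop, so it wastes the same work under contention, but it is harder to get the reload-on-failure logic wrong.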

Atomics vs Mutex: Speed vs Correctness

A common intuition says that for complex multi-field updates, a single mutex lock beats a series of atomic operations. Let's test that. Consider updating a statistics struct with min, max, sum, and count:

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Instant;

// Atomic version — multiple CAS loops
struct AtomicStats {
    min: AtomicU64,
    max: AtomicU64,
    sum: AtomicU64,
    count: AtomicU64,
}

impl AtomicStats {
    fn new() -> Self {
        Self {
            min: AtomicU64::new(u64::MAX),
            max: AtomicU64::new(0),
            sum: AtomicU64::new(0),
            count: AtomicU64::new(0),
        }
    }

    fn record(&self, value: u64) {
        // Update min
        loop {
            let current = self.min.load(Ordering::Relaxed);
            if value >= current { break; }
            if self.min.compare_exchange_weak(
                current, value, Ordering::Relaxed, Ordering::Relaxed
            ).is_ok() { break; }
        }

        // Update max
        loop {
            let current = self.max.load(Ordering::Relaxed);
            if value <= current { break; }
            if self.max.compare_exchange_weak(
                current, value, Ordering::Relaxed, Ordering::Relaxed
            ).is_ok() { break; }
        }

        self.sum.fetch_add(value, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

// Mutex version — simple and correct
struct MutexStats {
    inner: Mutex<(u64, u64, u64, u64)>, // min, max, sum, count
}

impl MutexStats {
    fn new() -> Self {
        Self { inner: Mutex::new((u64::MAX, 0, 0, 0)) }
    }

    fn record(&self, value: u64) {
        let mut guard = self.inner.lock().unwrap();
        guard.0 = guard.0.min(value);
        guard.1 = guard.1.max(value);
        guard.2 += value;
        guard.3 += 1;
    }
}
Threads: 32
Atomic stats: 778.12ms
Mutex stats:  3.08s
Winner: Atomic (4.0x faster)

Atomics won decisively. But why? The min/max CAS loops rarely retry because most values don't update the extremes. And fetch_add for sum and count is a single hardware instruction — no retry possible.
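Worth noting: for min and max specifically, std already provides fetch_min and fetch_max on the atomic integer types, so the hand-rolled CAS loops in record above can be collapsed to single calls. A sketch of the same AtomicStats using them:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct AtomicStats {
    min: AtomicU64,
    max: AtomicU64,
    sum: AtomicU64,
    count: AtomicU64,
}

impl AtomicStats {
    fn record(&self, value: u64) {
        // fetch_min/fetch_max package the compare-and-swap loop for us
        // (and compile to a single RMW instruction where the target has one).
        self.min.fetch_min(value, Ordering::Relaxed);
        self.max.fetch_max(value, Ordering::Relaxed);
        self.sum.fetch_add(value, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let stats = AtomicStats {
        min: AtomicU64::new(u64::MAX),
        max: AtomicU64::new(0),
        sum: AtomicU64::new(0),
        count: AtomicU64::new(0),
    };
    for v in [5u64, 1, 9, 3] {
        stats.record(v);
    }
    assert_eq!(stats.min.load(Ordering::Relaxed), 1);
    assert_eq!(stats.max.load(Ordering::Relaxed), 9);
    assert_eq!(stats.sum.load(Ordering::Relaxed), 18);
    assert_eq!(stats.count.load(Ordering::Relaxed), 4);
    println!("stats ok");
}
```

The consistency caveat below still applies: the four fields update independently, so a concurrent reader can observe them mid-flight.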

"But surely with high collision rates, mutex wins?" Let's test pure CAS increment where every operation collides:

// CAS loop version — retry on collision
let cas_counter = Arc::new(AtomicU64::new(0));
let handles: Vec<_> = (0..num_threads)
    .map(|_| {
        let c = Arc::clone(&cas_counter);
        thread::spawn(move || {
            for _ in 0..iterations_per_thread {
                loop {
                    let current = c.load(Ordering::Relaxed);
                    if c.compare_exchange_weak(
                        current,
                        current + 1,
                        Ordering::Relaxed,
                        Ordering::Relaxed,
                    ).is_ok() {
                        break;
                    }
                }
            }
        })
    })
    .collect();

// Mutex version — acquire, increment, release
let mutex_counter = Arc::new(Mutex::new(0u64));
let handles: Vec<_> = (0..num_threads)
    .map(|_| {
        let c = Arc::clone(&mutex_counter);
        thread::spawn(move || {
            for _ in 0..iterations_per_thread {
                let mut guard = c.lock().unwrap();
                *guard += 1;
            }
        })
    })
    .collect();
Threads    CAS Loop    Mutex       Winner
----------------------------------------------
2          14.25ms     29.82ms     CAS 2.1x
4          35.65ms     104.52ms    CAS 2.9x
8          65.28ms     242.74ms    CAS 3.7x
16         139.84ms    585.25ms    CAS 4.2x
32         403.46ms    1.17s       CAS 2.9x

CAS still wins 2-4x, even with 100% collision rate. On modern CPUs with efficient cache coherency protocols, atomic operations are remarkably fast. The "mutex beats atomics" intuition comes from older hardware or higher-level languages with heavier lock implementations.

So when should you use mutex? Correctness and simplicity. The atomic version updates min, max, sum, and count separately. Between updates, another thread might read inconsistent state — count not matching sum, or min temporarily exceeding max. Mutex guarantees all fields update atomically together. Sometimes correct and clear beats fast and subtle.

But if you do choose mutex, which implementation should you use?

Choosing a Mutex: std vs parking_lot

A common recommendation is to use parking_lot::Mutex instead of std::sync::Mutex for better performance. Let's test that with a simple increment loop:

// std::sync::Mutex
let mutex = Arc::new(std::sync::Mutex::new(0u64));
for _ in 0..1_000_000 {
    let mut guard = mutex.lock().unwrap();
    *guard += 1;
}

// parking_lot::Mutex
let mutex = Arc::new(parking_lot::Mutex::new(0u64));
for _ in 0..1_000_000 {
    let mut guard = mutex.lock();
    *guard += 1;
}
std::sync::Mutex vs parking_lot::Mutex (1M iterations per thread)

Threads    std::Mutex    parking_lot    Winner
--------------------------------------------------
1          16.34ms       13.84ms        pl 1.2x
2          57.18ms       28.59ms        pl 2.0x
4          205.63ms      109.29ms       pl 1.9x
8          490.18ms      665.15ms       std 1.4x
16         1.18s         6.22s          std 5.3x
32         2.39s         85.44s         std 35.8x

parking_lot wins at low thread counts but collapses at high contention — 35x slower at 32 threads. What's happening?

The difference is fairness overhead. Even though critical sections here are nanoseconds (well under the 0.5ms threshold), parking_lot's fairness timer still fires approximately every 0.5ms. With 32 threads hammering the lock, that means thousands of forced fair unlocks per second, each causing a context switch. These context switches accumulate and destroy throughput.

std::sync::Mutex (using futex on Linux) is always unfair: whichever thread happens to be running can "steal" the lock before sleeping threads wake up. No forced context switches, no fairness overhead. This is faster under contention but can starve threads (as we'll see below).

The critical section length also matters:

// Simulate work inside critical section
fn do_work(iterations: u32) -> u64 {
    let mut sum = 0u64;
    for i in 0..iterations {
        sum = sum.wrapping_add(i as u64);
    }
    sum
}

// Lock, do work, unlock
for _ in 0..10_000 {
    let mut guard = mutex.lock();
    *guard = guard.wrapping_add(do_work(work_iterations));
}
Varying critical section length (8 threads, 10K iterations each)

Work inside lock    std::Mutex    parking_lot    Winner
----------------------------------------------------------
0 iterations        14.89ms       9.49ms         pl 1.6x
100 iterations      17.88ms       19.06ms        std 1.1x
1000 iterations     93.92ms       272.04ms       std 2.9x
5000 iterations     548.80ms      1.81s          std 3.3x

With zero work inside the lock (just acquire/release), parking_lot's optimized fast path wins. But as soon as there's real work, std::sync::Mutex pulls ahead. The fair locking overhead outweighs any fast-path advantage.

Fairness under asymmetric load: What about scenarios where one thread aggressively holds the lock while others compete? parking_lot's fairness timer (0.5ms) should prevent monopolization — but only if critical sections are long enough to trigger it. Let's test with one "hog" thread holding the lock for 500µs vs five normal threads:

// Hog thread — holds lock for 500µs, immediately reacquires
thread::spawn(move || {
    while running.load(Ordering::Relaxed) {
        let mut guard = mutex.lock();
        *guard += 1;
        thread::sleep(Duration::from_micros(500)); // Hold for 500µs
        drop(guard);
        // No delay — immediately try to reacquire
    }
});

// Normal threads — quick critical section, yield between acquisitions
thread::spawn(move || {
    while running.load(Ordering::Relaxed) {
        let mut guard = mutex.lock();
        *guard += 1;
        drop(guard);
        thread::yield_now(); // Politely yield
    }
});
Monopolization scenario: 1 hog (500µs hold) vs 5 normal threads (10 seconds)

std::sync::Mutex:
  Hog thread:     17,880 ops
  Normal threads: 243 ops total
  Hog ratio: 368x — near-total starvation

parking_lot::Mutex:
  Hog thread:     17,239 ops
  Normal threads: 248,459 ops total (~49,692 each)
  Hog ratio: 0.3x — fair distribution

The difference is stark. Here's why:

std::Mutex (barging): When the hog releases the lock, normal threads begin waking up from kernel sleep. But waking up takes time — the scheduler must context-switch to them. The hog, already running on CPU, immediately calls lock() again. Since std allows "barging" (any running thread can grab a free lock), the hog wins before sleeping threads finish waking. Normal threads wake up, find the lock taken, and go back to sleep. This cycle repeats indefinitely.

parking_lot (handoff): When the fairness timer fires (every ~0.5ms) or when critical sections exceed 1ms, parking_lot uses a handoff mechanism. Instead of simply releasing the lock, it keeps the lock bit set and directly transfers ownership to the next thread in queue. The hog cannot barge — the lock is never "free" for it to steal. This guarantees the next waiter gets the lock.

Important: With very short critical sections (nanoseconds), the fairness timer fires less often relative to the operation rate, so both mutexes behave more similarly. The fairness advantage manifests with longer hold times that approach or exceed the 0.5ms threshold.

Bursty workloads: What about workloads with bursts of activity followed by idle periods? This tests how mutexes handle transitions between contended and uncontended states:

for _burst in 0..20 {
    // Spawn 8 threads, each doing 50K lock/unlock cycles
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let m = Arc::clone(&mutex);
            thread::spawn(move || {
                for _ in 0..50_000 {
                    let mut guard = m.lock();
                    *guard += 1;
                }
            })
        })
        .collect();
    for h in handles { h.join().unwrap(); }

    // Idle period — mutex becomes uncontended
    thread::sleep(Duration::from_millis(10));
}
Bursty workload: 20 bursts of 50K ops, 10ms idle between (8 threads)

std::Mutex:  901.14ms (8.9M ops/sec)
parking_lot: 819.21ms (9.8M ops/sec)
Winner: parking_lot (1.1x faster)

Here parking_lot wins. Why? At the start of each burst, contention builds gradually. parking_lot's adaptive spinning lets early threads acquire the lock without syscalls — they spin briefly and often succeed before needing to park. std::Mutex immediately calls into the kernel (futex_wait) when the lock is taken, paying syscall overhead even for short waits. Over 20 bursts, these saved syscalls add up.

When to use each: prefer parking_lot::Mutex for low contention, bursty workloads, or when fairness matters and critical sections are long enough for its fairness timer to engage; prefer std::sync::Mutex under sustained high contention or when there is real work inside the critical section, where forced fair handoffs cost more than they save.

Note: parking_lot's fairness timer (0.5ms) only helps with critical sections approaching that duration. For nanosecond-level operations, both mutexes behave similarly. Always benchmark your actual workload.

We've optimized our mutex choice. But we're still funneling all threads through a single lock. What if we could split the lock itself?

Sharding: Split the Bottleneck

A single lock protecting a large data structure is a serialization point. Sharding splits the data into independent partitions, each with its own lock. Threads accessing different partitions never contend.

DashMap applies this to concurrent HashMaps. Instead of one RwLock<HashMap>, it maintains multiple shards (default: CPU count × 4). Each key hashes to a shard, and only that shard's lock is acquired.

// RwLock<HashMap> - single lock for entire map
let map = Arc::new(RwLock::new(HashMap::new()));
// Read (80% of operations)
let _ = map.read().unwrap().get(&key);
// Write (20% of operations)
map.write().unwrap().insert(key, value);

// DashMap - internally sharded
let map = Arc::new(DashMap::new());
// Read - only locks one shard
let _ = map.get(&key);
// Write - only locks one shard
map.insert(key, value);
DashMap vs RwLock<HashMap> (100K ops/thread, 80% reads, 20% writes)

Threads    RwLock<HashMap>    DashMap    Winner
--------------------------------------------------
1          2.95ms             3.56ms     RwLock 1.2x
2          15.67ms            3.89ms     DashMap 4.0x
4          25.56ms            5.58ms     DashMap 4.6x
8          67.21ms            6.17ms     DashMap 10.9x
16         186.91ms           6.65ms     DashMap 28.1x
32         440.25ms           10.00ms    DashMap 44.0x

At 32 threads, DashMap is 44x faster. Why?

With RwLock<HashMap>, every operation — read or write — must acquire the single lock. Worse, RwLock typically prioritizes writers to prevent writer starvation: when a writer is waiting, new readers must wait too. This turns an 80% read workload into a serialized queue.

DashMap with 128 shards (32 CPUs × 4) means two threads only contend if they hash to the same shard — roughly a 1-in-128 chance. With 32 threads spread across 128 shards, most lock acquisitions are uncontended.

The 1-thread case shows the tradeoff: sharding adds overhead (hashing, shard lookup) that only pays off under contention. At 1 thread, there's no contention to avoid, so RwLock's simpler path wins.
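The core of the trick is small enough to hand-roll. A minimal std-only sketch (the ShardedMap type here is hypothetical, not DashMap's implementation): hash the key, pick a shard, and lock only that shard, so keys landing in different shards never contend.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Hypothetical minimal sharded map: each shard is its own Mutex<HashMap>.
struct ShardedMap {
    shards: Vec<Mutex<HashMap<String, u64>>>,
}

impl ShardedMap {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key to pick a shard; only that shard's lock is ever taken.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, u64>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: &str, value: u64) {
        self.shard_for(key).lock().unwrap().insert(key.to_string(), value);
    }

    fn get(&self, key: &str) -> Option<u64> {
        self.shard_for(key).lock().unwrap().get(key).copied()
    }
}

fn main() {
    let map = ShardedMap::new(16);
    map.insert("a", 1);
    map.insert("b", 2);
    assert_eq!(map.get("a"), Some(1));
    assert_eq!(map.get("b"), Some(2));
    assert_eq!(map.get("c"), None);
    println!("sharded ok");
}
```

DashMap layers more on top (RwLock per shard, cached hashers, entry APIs), but the contention math is exactly this: two threads collide only when their keys hash to the same shard.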

Sharding is powerful, but we're still locking. Can we avoid locks entirely?

Lock-Free: No Locks At All

Lock-free data structures use atomic operations (CAS loops) instead of mutexes. No thread ever blocks waiting for a lock — they retry until they succeed. This eliminates context switches and priority inversion.

crossbeam provides lock-free channels for message passing. std::sync::mpsc is single-consumer only; crossbeam's channels natively support multiple producers and multiple consumers, built on lock-free queues.

// std::sync::mpsc - single consumer only
let (tx, rx) = std::sync::mpsc::channel();
tx.send(value).unwrap();      // producer
let v = rx.recv().unwrap();   // consumer (only one allowed)

// crossbeam::channel - multiple producers AND consumers
let (tx, rx) = crossbeam::channel::unbounded();
tx.send(value).unwrap();      // producer (can clone tx)
let v = rx.recv().unwrap();   // consumer (can clone rx too!)
crossbeam::channel vs std::sync::mpsc (1M messages)

Scenario                  std          crossbeam    Winner
-----------------------------------------------------------------
SPSC (1 prod, 1 cons)     30.07ms      25.10ms      crossbeam 1.2x
MPSC (4 prod, 1 cons)     63.00ms      17.51ms      crossbeam 3.6x
MPMC (4 prod, 4 cons)     238.07ms*    20.75ms      crossbeam 11.5x

* std doesn't support MPMC; compared against Mutex<VecDeque>

The MPMC case is 11.5x faster. Here's why:

Mutex<VecDeque> serializes all producers and consumers through one lock. When a producer holds the lock to push, all consumers block. When a consumer holds the lock to pop, all producers block.
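For reference, the Mutex<VecDeque> baseline from the table can be sketched in a few lines of std, using a Condvar so idle consumers block rather than spin (a minimal illustration of the single-lock design, not crossbeam's implementation):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Minimal MPMC queue: one lock serializes every push and every pop.
struct BlockingQueue {
    inner: Mutex<VecDeque<u64>>,
    not_empty: Condvar,
}

impl BlockingQueue {
    fn new() -> Self {
        Self { inner: Mutex::new(VecDeque::new()), not_empty: Condvar::new() }
    }

    fn push(&self, v: u64) {
        self.inner.lock().unwrap().push_back(v);
        self.not_empty.notify_one();
    }

    fn pop(&self) -> u64 {
        let mut q = self.inner.lock().unwrap();
        loop {
            if let Some(v) = q.pop_front() {
                return v;
            }
            // Wait releases the lock and reacquires it on wakeup.
            q = self.not_empty.wait(q).unwrap();
        }
    }
}

fn main() {
    let q = Arc::new(BlockingQueue::new());
    let producers: Vec<_> = (0..4)
        .map(|_| {
            let q = Arc::clone(&q);
            thread::spawn(move || {
                for i in 0..1000u64 {
                    q.push(i);
                }
            })
        })
        .collect();
    let consumers: Vec<_> = (0..4)
        .map(|_| {
            let q = Arc::clone(&q);
            thread::spawn(move || (0..1000).map(|_| q.pop()).sum::<u64>())
        })
        .collect();
    for p in producers { p.join().unwrap(); }
    let total: u64 = consumers.into_iter().map(|c| c.join().unwrap()).sum();
    // 4 producers each push 0..1000, so the grand total is 4 * 499,500.
    assert_eq!(total, 4 * 499_500);
    println!("mpmc ok");
}
```

Every push and pop funnels through `inner`, which is exactly the serialization the benchmark punishes.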

crossbeam's lock-free queue lets producers and consumers operate simultaneously. Multiple producers can CAS their items onto the queue concurrently. Multiple consumers can CAS items off concurrently. The only "contention" is CAS retries, which are far cheaper than context switches.

Note that std::sync::mpsc doesn't even support multiple consumers — the "sc" means "single consumer." crossbeam handles MPMC natively.

Lock-free is fast, but it still requires coordination. What if threads didn't need to coordinate at all?

The Real Answer: Stop Sharing

The fastest synchronization is no synchronization. If each thread maintains its own local counter and only occasionally merges into a global total, you eliminate contention entirely.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    let num_threads = num_cpus::get();
    let iterations = 10_000_000;

    // Shared atomic counter
    let shared = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let c = Arc::clone(&shared);
            thread::spawn(move || {
                for _ in 0..iterations {
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles { h.join().unwrap(); }
    let shared_time = start.elapsed();

    // Thread-local counters, merge at end
    let final_count = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let fc = Arc::clone(&final_count);
            thread::spawn(move || {
                let mut local: u64 = 0;
                for _ in 0..iterations {
                    local += 1;  // No synchronization!
                }
                fc.fetch_add(local, Ordering::Relaxed);  // One atomic at the end
            })
        })
        .collect();
    for h in handles { h.join().unwrap(); }
    let local_time = start.elapsed();

    println!("Threads: {}", num_threads);
    println!("Shared atomic:  {:?}", shared_time);
    println!("Thread-local:   {:?}", local_time);
    println!("Speedup: {:.0}x",
        shared_time.as_nanos() as f64 / local_time.as_nanos() as f64);
}
Threads: 32
Shared atomic: 2.73s
Thread-local:  1.45ms
Speedup: 1886x

The thread-local version does one atomic operation at the end instead of millions during execution. The speedup is dramatic.

Thread-Local in Practice

Rust provides thread_local! for declaring thread-local storage. Here's a production-ready metrics collector:

use parking_lot::Mutex;
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

thread_local! {
    static LOCAL: RefCell<HashMap<&'static str, u64>> =
        RefCell::new(HashMap::new());
}

#[derive(Clone)]
pub struct Metrics {
    global: Arc<Mutex<HashMap<&'static str, u64>>>,
}

impl Metrics {
    pub fn new() -> Self {
        Self {
            global: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    #[inline]
    pub fn increment(&self, key: &'static str) {
        LOCAL.with(|local| {
            *local.borrow_mut().entry(key).or_insert(0) += 1;
        });
    }

    pub fn flush(&self) {
        LOCAL.with(|local| {
            let mut local = local.borrow_mut();
            if local.is_empty() { return; }
            let mut global = self.global.lock();
            for (k, v) in local.drain() {
                *global.entry(k).or_insert(0) += v;
            }
        });
    }

    pub fn snapshot(&self) -> HashMap<&'static str, u64> {
        self.global.lock().clone()
    }
}

The increment call touches only thread-local storage — no synchronization at all. The flush call merges local counters into the global map, which should be called periodically (e.g., every second, or when the thread is about to exit).

This pattern is used by the metrics crate and many high-performance systems. The tradeoff is that reads aren't instant — you need to aggregate from all threads to get accurate values.

Decision Framework

When choosing a synchronization strategy:

  1. How often is it written vs read? Write-heavy, read-rarely → thread-local accumulation. Balanced → continue below.
  2. Do multiple fields need consistent updates? If readers must see all-or-nothing changes → Mutex<T> for correctness. If eventual consistency is fine → individual atomics are faster.
  3. How many threads contend? Few threads (2-4) → atomics are fine. Many threads → consider sharding or thread-local.
  4. Are atomics in the same cache line? Adjacent atomics in an array or struct → pad to cache line size (typically 64 bytes).

Key Takeaways

  1. False sharing is sneaky. Pad to cache line size (64 bytes on x86_64, may vary on other architectures).
  2. Atomics aren't free. Cache coherency has real costs under contention.
  3. CAS loops waste work but still win. Even with high collision rates, they beat mutex on modern CPUs.
  4. Use mutex for correctness. When multiple fields must update atomically, mutex is simpler and safer.
  5. Fairness requires time. parking_lot's 0.5ms timer prevents starvation with longer holds; with nanosecond operations, both mutexes behave similarly.
  6. Sharding beats single locks. DashMap's 44x speedup comes from splitting one lock into many.
  7. Lock-free beats locks. crossbeam's 11.5x MPMC speedup comes from eliminating context switches.
  8. The best sync is no sync. Thread-local accumulation delivers 1886x when you can defer sharing.
* * *

Further Reading