In the previous article, thread-local accumulation delivered a 1886x speedup over shared atomic counters. The conclusion was clear: the best synchronization is no synchronization.
Thread-per-core async takes that principle and applies it to the entire runtime. Instead of optimizing how threads share state, it eliminates the need to share at all — by design. Each core runs its own event loop, owns its own data, and never migrates tasks to other cores. The "stop sharing" philosophy becomes architecture.
All benchmarks were run on an Intel Core i9-14900K (24 cores, 32 threads) with Rust 1.93.0 on Linux.
CPU frequency scaling was set to the performance governor.
Each benchmark was run multiple times; representative results are
shown. The benchmark source code is available alongside this article.
The Cost of Send
Tokio's spawn requires the future to be Send.
This single constraint cascades through your entire program: Arc
instead of Rc, Mutex instead of RefCell,
atomic operations instead of plain reads and writes.
These aren't just API differences. They have measurable cost — even when running single-threaded, even when tasks never actually migrate between threads:
use std::cell::{Cell, RefCell};
use std::collections::HashMap;
use std::hint::black_box;
use std::rc::Rc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Instant;
fn main() {
    let iterations = 50_000_000u64;

    // Arc clone+drop vs Rc clone+drop
    let arc = Arc::new(42u64);
    let start = Instant::now();
    for _ in 0..iterations {
        let c = Arc::clone(&arc);
        black_box(&c);
    }
    let arc_time = start.elapsed();

    let rc = Rc::new(42u64);
    let start = Instant::now();
    for _ in 0..iterations {
        let c = Rc::clone(&rc);
        black_box(&c);
    }
    let rc_time = start.elapsed();

    // AtomicU64 fetch_add vs Cell get+set
    let atomic = AtomicU64::new(0);
    let start = Instant::now();
    for _ in 0..iterations {
        atomic.fetch_add(black_box(1), Ordering::Relaxed);
    }
    black_box(atomic.load(Ordering::Relaxed));
    let atomic_time = start.elapsed();

    let cell = Cell::new(0u64);
    let start = Instant::now();
    for _ in 0..iterations {
        cell.set(black_box(cell.get()) + 1);
    }
    black_box(cell.get());
    let cell_time = start.elapsed();

    // Mutex<HashMap> vs RefCell<HashMap>
    let map_iterations = 5_000_000u64;
    let mutex_map: Mutex<HashMap<u64, u64>> = Mutex::new(HashMap::new());
    let start = Instant::now();
    for i in 0..map_iterations {
        let key = i % 1000;
        mutex_map.lock().unwrap().insert(key, i);
        black_box(mutex_map.lock().unwrap().get(&key));
    }
    let mutex_time = start.elapsed();

    let refcell_map: RefCell<HashMap<u64, u64>> = RefCell::new(HashMap::new());
    let start = Instant::now();
    for i in 0..map_iterations {
        let key = i % 1000;
        refcell_map.borrow_mut().insert(key, i);
        black_box(refcell_map.borrow().get(&key).copied());
    }
    let refcell_time = start.elapsed();

    println!("Arc: {:?}, Rc: {:?}", arc_time, rc_time);
    println!("Atomic: {:?}, Cell: {:?}", atomic_time, cell_time);
    println!("Mutex<HM>: {:?}, RefCell<HM>: {:?}", mutex_time, refcell_time);
}
Arc is nearly 15x slower than Rc for clone-and-drop — every clone and
drop requires an atomic increment and decrement with memory ordering
guarantees. AtomicU64 is 30x slower than Cell
for increment — the LOCK XADD instruction that backs
fetch_add must lock the cache line and synchronize across
cores, even when no other core is looking.
Mutex<HashMap> shows a more modest 1.7x overhead because
the HashMap operations themselves (hashing, lookup, allocation) dominate
the lock acquire/release cost.
This is all single-threaded — no contention, no other threads, just the raw cost of thread-safe primitives.
Send is a tax you pay whether or not tasks actually migrate
between threads. In Tokio, every spawned task pays it. In a thread-per-core
runtime, none of them do.
Work Stealing vs Thread-Per-Core
The cost of Send is clear. But what architectural choice
imposes it in the first place? The answer is the scheduler.
Tokio (work-stealing): M tasks across N worker threads.
Any thread can steal tasks from any other thread's queue. Tasks may migrate
between threads at any .await point. This requires all spawned
futures to be Send — the scheduler must be able to move them.
Thread-per-core: One thread per CPU core. Tasks are pinned
to their thread and never migrate. No other thread will ever touch your
data. Send is not required.
The two most widely used Rust runtimes implementing the thread-per-core model are Monoio (ByteDance) and Glommio (Datadog). Both run one event loop per core with no task migration. Throughout this article we use Monoio as the thread-per-core representative and Tokio as the work-stealing baseline — they make the architectural difference concrete, and the benchmark results generalize to the model, not the specific runtime.
The API difference reveals the architectural difference:
// Tokio: Arc + atomics because tasks can migrate
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

let counter = Arc::new(AtomicU64::new(0));
let c = Arc::clone(&counter);
tokio::spawn(async move {
    c.fetch_add(1, Ordering::Relaxed);
});

// Monoio: Rc + Cell because tasks are pinned
use std::rc::Rc;
use std::cell::Cell;

let counter = Rc::new(Cell::new(0u64));
let c = Rc::clone(&counter);
monoio::spawn(async move {
    c.set(c.get() + 1);
});
Send is a scheduler requirement, not a problem-domain
requirement. If your echo server handles a connection from accept to close
on a single thread, nothing about the problem requires thread safety.
The work-stealing scheduler imposes it.
io_uring: The Kernel Side
Thread-per-core eliminates synchronization in userspace. But I/O still goes through the kernel — and the traditional kernel interface, epoll, doesn't respect core boundaries. io_uring does.
epoll tells you a file descriptor is ready. You then issue the syscall yourself (read, write, accept). The readiness notification is decoupled from the operation — any thread can act on it. This encourages sharing file descriptors across threads.
io_uring works differently. You submit operations to a submission ring (SQ). The kernel executes them asynchronously and posts results to a completion ring (CQ). Each thread gets its own independent ring pair — no cross-thread I/O coordination needed.
This maps directly to thread-per-core: each core has its own io_uring instance, its own submission queue, its own completion queue. I/O operations never cross thread boundaries.
Monoio's buffer API reflects this ownership model. Because io_uring operations are asynchronous — the kernel reads into your buffer while your code continues — the buffer must remain valid and unmodified until the operation completes. Monoio enforces this by taking ownership of the buffer:
use monoio::io::{AsyncReadRent, AsyncWriteRentExt};
// Buffer ownership transfers to the read operation.
// The kernel writes directly into this buffer.
// You get the buffer back when the operation completes.
let buf = vec![0u8; 4096];
let (result, buf) = stream.read(buf).await;
let n = result?;
// Same for writes: buffer moves in, comes back on completion.
let (result, _) = stream.write_all(buf[..n].to_vec()).await;
result?;
This is fundamentally different from Tokio's AsyncRead, where
you pass a &mut [u8] reference. With epoll, the buffer just
needs to be valid for the duration of the synchronous read
syscall — a borrow suffices. With io_uring, the kernel holds onto the
buffer asynchronously, so borrowing isn't safe — you could write to the
buffer while the kernel is still reading into it. Monoio solves this
by moving the buffer into the operation, making it impossible
to access until the kernel returns it.
Echo Server: Tokio vs Monoio
The echo server is the canonical network benchmark: accept a connection, read bytes, write them back. Simple enough to isolate runtime overhead.
The Tokio version is straightforward:
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let listener = TcpListener::bind("0.0.0.0:8080").await?;
    loop {
        let (mut stream, _) = listener.accept().await?;
        tokio::spawn(async move {
            let mut buf = [0u8; 4096];
            loop {
                let n = match stream.read(&mut buf).await {
                    Ok(0) => return,
                    Ok(n) => n,
                    Err(_) => return,
                };
                if stream.write_all(&buf[..n]).await.is_err() {
                    return;
                }
            }
        });
    }
}
The Monoio version starts one thread per core, each with its own event
loop and TCP listener (via SO_REUSEPORT). Connections are
distributed across cores by the kernel:
use monoio::io::{AsyncReadRent, AsyncWriteRentExt};
use monoio::net::TcpListener;
use socket2::{Domain, Protocol, Socket, Type};
fn main() {
    let cores = num_cpus::get();
    let addr: std::net::SocketAddr = "0.0.0.0:8081".parse().unwrap();
    let threads: Vec<_> = (0..cores)
        .map(|core_id| {
            std::thread::spawn(move || {
                // Pin thread to core
                core_affinity::set_for_current(core_affinity::CoreId { id: core_id });
                let mut rt = monoio::RuntimeBuilder::<monoio::FusionDriver>::new()
                    .build()
                    .unwrap();
                rt.block_on(async {
                    // Each thread creates its own listener with SO_REUSEPORT
                    let socket = Socket::new(
                        Domain::IPV4, Type::STREAM, Some(Protocol::TCP),
                    ).unwrap();
                    socket.set_reuse_address(true).unwrap();
                    socket.set_reuse_port(true).unwrap();
                    socket.bind(&addr.into()).unwrap();
                    socket.listen(1024).unwrap();
                    socket.set_nonblocking(true).unwrap();
                    let std_listener: std::net::TcpListener = socket.into();
                    let listener = TcpListener::from_std(std_listener).unwrap();
                    loop {
                        let (mut stream, _) = listener.accept().await.unwrap();
                        monoio::spawn(async move {
                            let mut buf = vec![0u8; 4096];
                            loop {
                                let (result, b): (std::io::Result<usize>, Vec<u8>) =
                                    stream.read(buf).await;
                                buf = b;
                                let n = match result {
                                    Ok(0) | Err(_) => return,
                                    Ok(n) => n,
                                };
                                let data = buf[..n].to_vec();
                                let (result, _): (std::io::Result<usize>, Vec<u8>) =
                                    stream.write_all(data).await;
                                if result.is_err() {
                                    return;
                                }
                            }
                        });
                    }
                });
            })
        })
        .collect();
    for t in threads {
        t.join().unwrap();
    }
}
Load testing with 100 concurrent connections sending 64-byte payloads, across four configurations:

Configuration                                 Throughput
1. Tokio, single shared listener              284K msg/s
2. Tokio + SO_REUSEPORT (listener per core)   288K msg/s
3. Monoio (epoll) + SO_REUSEPORT              1.65M msg/s
4. Monoio (io_uring) + SO_REUSEPORT           1.63M msg/s

The results tell a clear story about where the performance comes from.
SO_REUSEPORT doesn't matter here. Rows 1 and 2: adding per-core listeners to Tokio made almost no difference (284K vs 288K). At 100 connections, the single listener is not the bottleneck.
io_uring doesn't matter here either. Rows 3 and 4: Monoio with epoll and Monoio with io_uring are virtually identical (1.65M vs 1.63M). For a simple echo server on loopback with 100 connections, there isn't enough syscall overhead or batching opportunity for io_uring to improve on epoll.
The gap is primarily the runtime architecture. Rows 2 and 3 isolate the scheduler: both use epoll, both use SO_REUSEPORT, yet Monoio is 5.7x faster. The dominant factor is work-stealing (Tokio) vs thread-per-core (Monoio) — scheduler coordination, cross-thread waker notifications, and the overhead of a multi-threaded event loop versus isolated single-threaded loops. Other implementation differences contribute too: Monoio pins threads to cores (better cache affinity), uses an ownership-based buffer API, and has different internal data structures. We can't attribute exact fractions to each factor, but the architectural difference — shared scheduler vs independent loops — is where the bulk of the gap comes from.
Note: The load-test client runs on the same machine as the server, so client and server compete for CPU cores. This depresses absolute throughput for both, but the relative comparison remains valid — both servers face the same constraint. With a dedicated client machine and real network I/O, absolute numbers would be higher and the ratio may differ as network latency becomes a larger fraction of per-request time. io_uring's advantages — batched submissions, multishot accept, registered buffers — would also become more visible with thousands of connections and real network latency.
Why It Scales: Cache Locality
To understand why the architectural difference produces such a large gap, look at the cache level — which, as the previous article showed with false sharing, dominates performance at scale.
Consider what happens when Tokio's work stealer migrates a task from
core A to core B. The task's state — its TcpStream internal
buffer, the connection's read/write cursors, any per-connection application
data — was hot in core A's L1 cache. When the task resumes on core B,
every access to that state is an L1 miss. The data must travel from
core A's cache (or main memory) through the cache coherency protocol.
On the i9-14900K, an L1 hit takes ~1ns; fetching from another core's
L2 takes ~10-20ns. For data that's been evicted entirely, main memory
access costs ~60-80ns.
An echo server touches the same state on every iteration: read into
buffer, write from buffer. In Monoio, this state stays in L1 across
every .await — the buffer, the socket file descriptor,
the io_uring submission queue entry. In Tokio, any of these could be
cold after a migration.
This is the same principle that made false sharing 11.7x slower in the previous article: the MESI protocol's cache-line invalidation across cores is expensive. False sharing causes unnecessary invalidation of adjacent data. Work stealing causes unnecessary invalidation of all task data. Thread-per-core avoids both by keeping everything on one core.
The effect compounds with core count. With 4 cores, a stolen task has 3 possible destinations; cache pressure is moderate. With 24 cores, the working set is spread across more L1/L2 caches, and each migration has a higher chance of hitting cold data. This is why the echo server gap widens at higher core counts — at 24 cores, the cache locality advantage dominates.
Running perf stat on both echo servers under load confirms this directly. The raw counts favor Monoio in absolute terms, but since Monoio processes 5.75x more messages, the fair comparison is per message (using P-core counters).
Per message, Monoio performs 54% fewer memory accesses — the work-stealing scheduler's data structures (steal queues, shared state, atomic coordination) simply don't exist. It also takes 37% fewer L1 cache misses per message — task state stays hot in the core's L1 because nothing migrates it away. LLC misses are near zero for both: the working set fits entirely in the shared last-level cache.
But cache locality only helps when the workload cooperates. What happens when it doesn't?
When Work Stealing Wins
Thread-per-core has a fundamental weakness: it pushes load balancing to the application. If the workload is uneven — some requests take 50x longer than others — pinned tasks create hot spots.
Consider 1000 tasks: 90% light (fast), 10% heavy (50x more CPU work). With round-robin assignment across cores, some cores receive more heavy tasks than others by chance. The unlucky cores take much longer to finish:
use std::hint::black_box;
use std::time::Instant;

fn compute_work(iterations: u64) -> u64 {
    let mut sum = 0u64;
    for i in 0..iterations {
        sum = sum.wrapping_add(black_box(i));
    }
    sum
}

// Tokio: spawn all tasks, work-stealer balances automatically
async fn tokio_version(tasks: &[u64]) {
    let mut handles = Vec::new();
    for &work in tasks {
        handles.push(tokio::spawn(async move {
            black_box(compute_work(work));
        }));
    }
    for h in handles {
        h.await.unwrap();
    }
}

// Pinned: round-robin assign to threads, no rebalancing
fn pinned_version(tasks: &[u64], num_cores: usize) {
    let mut per_core: Vec<Vec<u64>> = vec![Vec::new(); num_cores];
    for (i, &work) in tasks.iter().enumerate() {
        per_core[i % num_cores].push(work);
    }
    let handles: Vec<_> = per_core
        .into_iter()
        .enumerate()
        .map(|(core_id, work_items)| {
            std::thread::spawn(move || {
                core_affinity::set_for_current(core_affinity::CoreId { id: core_id });
                for work in work_items {
                    black_box(compute_work(work));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
The total completion time tells the story: Tokio finishes 1.55x faster. Some cores in the pinned configuration received more heavy tasks by chance, leaving them busy while luckier cores sat idle. Work stealing automatically redistributes those heavy tasks, keeping all cores utilized.
The per-task latency tells a subtler story: pinned tasks actually have better tail latency (p99: 644µs vs 790µs). Without migration overhead — no cache invalidation, no scheduler coordination — individual tasks complete faster. The cost of imbalance shows up in aggregate throughput (idle cores), not in per-request latency.
This is the real tradeoff. Thread-per-core excels when requests are uniform — a web server where every request does roughly the same work, a key-value store where gets and puts take similar time. With uneven workloads, you trade lower per-request overhead for reduced core utilization. Work stealing pays a per-task tax to keep all cores busy.
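The imbalance can be reproduced with pure arithmetic, no runtime involved. A std-only sketch — the 90/10 split and cost ratio mirror the benchmark above, and greedy least-loaded assignment stands in for a work stealer:

```rust
// Makespan = the busiest core's total work; it bounds completion time.
fn round_robin_makespan(tasks: &[u64], cores: usize) -> u64 {
    let mut load = vec![0u64; cores];
    for (i, &cost) in tasks.iter().enumerate() {
        load[i % cores] += cost; // pinned: no rebalancing
    }
    *load.iter().max().unwrap()
}

// Approximates work stealing: each task lands on the least-loaded core.
fn balanced_makespan(tasks: &[u64], cores: usize) -> u64 {
    let mut load = vec![0u64; cores];
    for &cost in tasks {
        let min = (0..cores).min_by_key(|&c| load[c]).unwrap();
        load[min] += cost;
    }
    *load.iter().max().unwrap()
}

fn main() {
    // 1000 tasks: every 10th is heavy (cost 50), the rest cost 1.
    let tasks: Vec<u64> = (0..1000).map(|i| if i % 10 == 0 { 50 } else { 1 }).collect();

    let pinned = round_robin_makespan(&tasks, 8);
    let balanced = balanced_makespan(&tasks, 8);
    // With 8 cores and a period-10 pattern, every heavy task lands on an
    // even-numbered core, so the pinned makespan is far worse.
    println!("pinned: {pinned}, balanced: {balanced}");
    assert!(pinned > balanced);
}
```

The round-robin makespan here is 1350 (even cores each collect 25 heavy tasks plus 100 light ones), while the balanced assignment stays near the 737.5 per-core average — idle cores, not slow tasks, are what cost you throughput.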
The Rust Ecosystem
Three thread-per-core runtimes have emerged in Rust, each with different design priorities:
Monoio (ByteDance) is the most pragmatic choice. Its FusionDriver supports both io_uring and epoll, falling back automatically on older kernels. This means your code runs everywhere Linux runs, using io_uring when available and epoll when not. Production-proven at ByteDance's scale.
Glommio (Datadog) was the pioneer. It requires io_uring (no epoll fallback) and adds cooperative scheduling with task quotas — if a task runs too long without yielding, the runtime can preempt it. This prevents a single CPU-bound task from starving I/O on the same core. Glommio also provides built-in primitives for cross-thread communication.
Compio takes a different approach: cross-platform completion-based I/O. It uses io_uring on Linux, IOCP on Windows, and kqueue on macOS/BSD. If you need thread-per-core semantics on non-Linux platforms, Compio is currently the only option.
The thread-per-core model isn't new to Rust. Seastar, the C++ framework behind ScyllaDB and Redpanda, has used this architecture for years. ScyllaDB — a Cassandra-compatible database built on Seastar — consistently demonstrates significantly lower p99 latency compared to Cassandra on equivalent hardware, largely because its shared-nothing architecture eliminates cross-core coordination entirely.
Cross-Thread Communication
Thread-per-core doesn't mean zero communication. Threads sometimes need to coordinate — a user's session state might be on core 3, but the request arrives on core 7. The difference from work-stealing is that cross-thread communication is explicit and intentional rather than implicit and constant.
The standard approach is data partitioning by affinity. Hash the user ID (or connection ID, or key) to a core. All requests for that user go to the same core. The data stays local, and most operations require no cross-core communication at all.
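The routing function behind that partitioning is a few lines. A minimal std-only sketch, with `DefaultHasher` standing in for whatever hash the application prefers:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stable key → core mapping: every request for the same user lands on
// the same core, so that core's data never needs synchronization.
fn core_for(user_id: u64, num_cores: usize) -> usize {
    let mut h = DefaultHasher::new();
    user_id.hash(&mut h);
    (h.finish() % num_cores as u64) as usize
}

fn main() {
    let cores = 8;
    // The same user always routes to the same core...
    assert_eq!(core_for(1234, cores), core_for(1234, cores));
    // ...and different users spread across the cores.
    let hit: std::collections::HashSet<usize> =
        (0..100).map(|id| core_for(id, cores)).collect();
    println!("100 users spread over {} of {} cores", hit.len(), cores);
}
```

One design note: the modulo means the mapping changes if the core count changes, so the number of cores should be fixed at startup — which a thread-per-core runtime does anyway.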
When cross-core communication is unavoidable, use explicit message passing. Each core owns a receiver; other cores send requests to it:
use std::sync::mpsc;

// A cross-core request; here it carries just the key it targets.
struct Request {
    key: u64,
}

// Each core gets a channel for receiving cross-core requests
struct CoreHandle {
    tx: mpsc::Sender<Request>,
}

// Route a request to the core that owns the data
fn route(key: u64, cores: &[CoreHandle]) {
    let core_id = (key % cores.len() as u64) as usize;
    cores[core_id].tx.send(Request { key }).unwrap();
}

// On each core's event loop: drain cross-core requests periodically.
// handle_request is the application's core-local handler.
fn poll_cross_core(rx: &mpsc::Receiver<Request>) {
    while let Ok(req) = rx.try_recv() {
        handle_request(req); // Data is local — no synchronization needed
    }
}
The important thing is that these messages are the exception, not the rule. In a well-partitioned system, 95%+ of operations are core-local. The channel is a safety valve, not the primary path.
This is exactly how ScyllaDB/Seastar works: data is sharded by partition key across cores. Reads and writes for a partition always go to the owning core. Cross-core requests (like range scans spanning partitions) use explicit message passing between shards.
Decision Framework
Thread-per-core delivers dramatic throughput gains but demands a specific workload shape. Five questions to determine whether it's right for yours:
- Is your workload I/O-bound with uniform request sizes? Web servers, proxies, key-value stores where each request does roughly the same work → thread-per-core excels.
- Do you have CPU-intensive tasks with variable duration? Image processing, machine learning inference, compilation — workloads where some tasks take 100x longer than others → work-stealing handles this better.
- Are you on Linux with kernel 5.11+? io_uring's full feature set (including fixed files, multishot accept) requires recent kernels. Monoio's epoll fallback works on older kernels but loses the io_uring advantage. Glommio requires io_uring.
- Can you partition state by core? If your data naturally partitions (by user, by key, by connection), thread-per-core maps cleanly. If everything must be globally accessible, you'll spend more time on message passing than you save.
- Is core utilization or per-request overhead your priority? Work-stealing keeps all cores busy by redistributing tasks — higher aggregate throughput with uneven workloads. Thread-per-core has lower per-request overhead (no migration, no scheduler coordination) but wastes cores when load is uneven. For uniform I/O workloads, both achieve high utilization and thread-per-core wins on raw throughput.
Key Takeaways
- Send has measurable cost. Arc is 15x slower than Rc, atomics are 30x slower than Cell — even single-threaded, with zero contention.
- Thread-per-core eliminates synchronization at the architecture level. Not by optimizing locks, but by removing the need for them entirely.
- io_uring completes the picture. Per-thread submission and completion rings give each core independent I/O, matching the thread-per-core ownership model.
- Cache locality is the dominant factor at scale. Pinned tasks keep data hot in L1/L2. Work stealing invalidates cache on every migration.
- Load imbalance is the real tradeoff. Thread-per-core pushes load balancing to the application. With uneven workloads, total throughput drops as some cores sit idle.
- Partition data, not just compute. Hash user/key/connection to a core. Make cross-core messages the exception.
- Use the right tool. Tokio for general-purpose async and varied workloads. Monoio or Glommio for I/O-bound, partitionable workloads where throughput at scale matters.
Further Reading
- Monoio — ByteDance's thread-per-core runtime with FusionDriver (io_uring + epoll)
- Glommio — Datadog's thread-per-core runtime with cooperative scheduling
- Introducing Glommio — Glauber Costa on the design of Glommio
- Seastar: Shared-Nothing Design — the C++ framework behind ScyllaDB
- Compio — cross-platform completion-based async (io_uring/IOCP/kqueue)
- Efficient I/O with io_uring — Jens Axboe's io_uring reference