
Heap Fragmentation in Rust

You deploy a long-lived Rust service. RSS climbs over hours in stair-steps — flat, then a sudden jump to a new plateau that never recedes. You run Valgrind. No leaks. Every allocation has a matching free. The problem is heap fragmentation.

The common fix is to swap the allocator — replace glibc's default with jemalloc, and RSS stabilizes. But why does the default allocator fragment under this workload? And why does jemalloc handle it better? The answer requires understanding three layers: the allocator's internal bookkeeping, the async runtime's allocation patterns, and the kernel's virtual memory decisions.

In the previous article, thread-per-core eliminated synchronization overhead by keeping tasks pinned to cores. That same principle extends to memory: when your allocator's thread caches align with your runtime's thread model, fragmentation drops and memory returns to the OS predictably. Even your memory allocator benefits from not sharing.

All benchmarks were run on an Intel Core i9-14900K (24 cores, 32 threads) with Rust 1.93.0 on Linux.

CPU frequency scaling was set to performance governor. Each benchmark was run multiple times; representative results are shown. All code examples are self-contained and runnable.

What Fragmentation Actually Is

Heap fragmentation comes in two forms. External fragmentation: free memory exists but is scattered in small non-contiguous gaps. The allocator has 100 KB free, but the largest contiguous block is 2 KB — a 4 KB allocation forces a new page from the OS. Internal fragmentation: the allocator rounds your 49-byte request to a 64-byte size class. 15 bytes wasted per allocation.

A fragmented heap looks like this:

Memory layout after interleaved alloc/free:

|USED|free|USED|free|USED|free|USED|free|USED|free|USED|
 4KB  64B  4KB  64B  4KB  64B  4KB  64B  4KB  64B  4KB

Total free: 320 bytes scattered across 5 gaps.
Largest contiguous free block: 64 bytes.
A 256-byte allocation requires a new page from the OS.

The key metric is the fragmentation ratio: (resident - allocated) / allocated. The OS has given you resident bytes of physical pages, but only allocated bytes are actually in use. A ratio of 0.0 means every byte the OS gave you is live data. A ratio of 2.0 means the OS has given you 3x more physical pages than you need — two thirds are wasted in holes the allocator cannot coalesce.

Demonstrating this is straightforward: allocate 10,000 objects of alternating sizes, free half of them, and check whether RSS drops:

fn read_rss_kb() -> usize {
    let statm = std::fs::read_to_string("/proc/self/statm").unwrap();
    let pages: usize = statm.split_whitespace().nth(1).unwrap().parse().unwrap();
    pages * 4 // each page is 4KB on x86_64
}

fn main() {
    let mut allocations: Vec<Option<Vec<u8>>> = Vec::with_capacity(10_000);

    // Allocate alternating small (64B) and large (4KB) blocks
    for i in 0..10_000 {
        let size = if i % 2 == 0 { 64 } else { 4096 };
        allocations.push(Some(vec![0u8; size]));
    }
    println!("After alloc: {} KB", read_rss_kb());

    // Free all the small blocks — creates 5,000 tiny holes
    for i in (0..10_000).step_by(2) {
        allocations[i] = None;
    }
    println!("After freeing small blocks: {} KB", read_rss_kb());

    // Free everything
    drop(allocations);
    println!("After freeing all: {} KB", read_rss_kb());
}
After 10K allocations (5K×64B + 5K×4KB): RSS = 25,600 KB (+23,280 KB)
After freeing 5K small blocks:           RSS = 25,600 KB (barely changed)
After freeing ALL blocks:                RSS = 25,600 KB (still elevated)

RSS barely moves after freeing half the allocations — and stays elevated even after freeing everything. The allocator retains the pages.

This is single-threaded behavior. glibc's ptmalloc2 serves the main thread from the brk heap, which can only release memory from the top — if the highest-address allocation persists, nothing below it returns to the OS. With multiple threads, each thread gets an mmap'd arena that can be fully unmapped when all its allocations are freed. We will see this difference in the allocator shootout below.

How Allocators Work

Every modern allocator shares a three-layer architecture:

  1. Thread caches (fast path, no locks). Per-thread freelists of pre-sized objects. Most malloc and free calls complete here without synchronization.
  2. Central arenas (slow path, some locking). When thread caches overflow or empty, they transfer objects to/from a shared arena. This requires some coordination.
  3. OS interface (slowest path, syscalls). mmap to get pages, madvise to return them. The most expensive operation by far.

Where allocators differ is how they manage each layer:

Allocator          Thread Cache                         Arenas                Return to OS
------------------------------------------------------------------------------------------
ptmalloc2 (glibc)  Per-arena tcache, 64 entries         Up to 8 per core      Rarely
jemalloc           Per-thread tcache, auto-sized        4 per core (default)  Decay-based (10s default)
mimalloc           Free-list sharding per 64 KiB page   Per-thread heap       Eager page reset
snmalloc           Per-thread, message-passing          Per-thread            Batched returns

The critical difference for fragmentation is how each allocator handles cross-thread freeing — what happens when Thread A allocates memory that Thread B later frees. ptmalloc2 returns the freed block to the original arena (Thread A's), but does not eagerly coalesce it with adjacent free blocks across arena boundaries. The freed memory accumulates in arenas that may not be actively allocating, creating dead zones that the allocator cannot reclaim.

jemalloc mitigates this with its decay mechanism. Dirty pages (freed but not yet returned to the OS) are purged on a configurable schedule — by default, 10 seconds of inactivity. Background threads handle the purging so application threads don't pay the madvise cost. mimalloc takes a more aggressive approach: empty pages are immediately marked as unused, reducing RSS at the cost of more frequent OS interaction. snmalloc uses a fundamentally different design: freed objects are sent back to the allocating thread via batched message passing, keeping each thread's arena coherent.

Async Rust Allocation Patterns

Not all allocation patterns fragment equally. Async Rust services create a particularly fragmentation-prone combination:

Varied-size allocations. Each tokio::spawn allocates a Task<T> on the heap. The size depends on the future's state machine — as the first article in this series showed, different async functions produce different enum sizes. An HTTP server spawning handlers for different endpoints creates a mix of future sizes on every request.

Varied lifetimes. Connections arrive and depart independently. Object A is allocated, then B, then A is freed, leaving a hole that B's allocation straddles. Long-lived connections (WebSocket, SSE) pin pages in place while short-lived ones (health checks, simple GETs) create and destroy allocations around them.

Temporary allocation bursts. A single serde_json::from_str call creates many temporary String and Vec allocations during parsing — different sizes, all short-lived. The heap sees a burst of varied allocations followed by a wave of frees, leaving fragmented gaps.

We can simulate this pattern and watch fragmentation build:

// Cargo.toml dependencies:
//   tikv-jemallocator = "0.6"
//   tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }

use std::hint::black_box;

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn print_jemalloc_stats(label: &str) {
    tikv_jemalloc_ctl::epoch::advance().unwrap();
    let allocated = tikv_jemalloc_ctl::stats::allocated::read().unwrap();
    let resident = tikv_jemalloc_ctl::stats::resident::read().unwrap();
    let frag = if allocated > 0 {
        (resident as f64 - allocated as f64) / allocated as f64
    } else { 0.0 };
    println!("[{:<25}] allocated={:>10.1} MB  resident={:>10.1} MB  frag={:.3}",
        label, allocated as f64 / (1024.0 * 1024.0),
        resident as f64 / (1024.0 * 1024.0), frag);
}

fn main() {
    let sizes: [usize; 8] = [128, 256, 512, 1024, 2048, 4096, 8192, 16384];
    let mut live: Vec<Option<(Vec<u8>, u32)>> = Vec::new();
    let mut rng: u64 = 98765;
    let mut total_allocs: u64 = 0;
    let mut total_frees: u64 = 0;

    print_jemalloc_stats("baseline");

    for iteration in 0..500_000u64 {
        rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);

        // New "connections": 2-5 per iteration, varied sizes and lifetimes
        let new = 2 + ((rng >> 48) % 4) as u32;
        for _ in 0..new {
            rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
            let size = sizes[(rng >> 32) as usize % sizes.len()];
            let lifetime = if (rng >> 40) % 10 < 7 { 1 + ((rng >> 16) % 20) as u32 }
                           else { 100 + ((rng >> 8) % 1900) as u32 };
            let mut buf = vec![0u8; size];
            black_box(&mut buf[0]);
            live.push(Some((buf, lifetime)));
            total_allocs += 1;
        }

        // Age and expire
        for slot in live.iter_mut() {
            if let Some((_, remaining)) = slot {
                if *remaining == 0 { *slot = None; total_frees += 1; }
                else { *remaining -= 1; }
            }
        }

        if iteration % 10_000 == 0 && iteration > 0 {
            live.retain(|s| s.is_some()); // compact periodically
        }
        if iteration % 50_000 == 0 {
            let live_count = live.iter().filter(|s| s.is_some()).count();
            print!("iter={:>6}  live={:>6}  ", iteration, live_count);
            print_jemalloc_stats(&format!("iter {}", iteration));
        }
    }

    drop(live);
    print_jemalloc_stats("after drop all");
    println!("Waiting 12s for decay...");
    std::thread::sleep(std::time::Duration::from_secs(12));
    print_jemalloc_stats("after decay");
    println!("Total allocations: {}, frees: {}", total_allocs, total_frees);
}
Async server allocation pattern (500K iterations, jemalloc)

Iteration          Live objects   Allocated   Resident   Frag ratio
----------------------------------------------------------------------
0                             7      0.1 MB     6.9 MB      119.651
50,000                    4,875      5.1 MB    12.5 MB        1.454
100,000                   5,096      5.3 MB    12.5 MB        1.358
200,000                   5,205      5.3 MB    12.5 MB        1.358
450,000                   5,165      5.2 MB    12.5 MB        1.412
After drop all:                      0.4 MB    12.5 MB       30.301
After 12s decay:                     0.4 MB     6.7 MB       16.197

Under load, jemalloc keeps the fragmentation ratio around 1.4 — the resident memory is about 2.4x the allocated data, a reasonable overhead. The problem appears after the workload stops: allocated drops to 0.4 MB but resident stays at 12.5 MB (ratio 30.3). After jemalloc's decay timer fires, resident drops to 6.7 MB — but that is still 17x the allocated data. These are pages jemalloc cannot return because they contain a mix of metadata and partially-filled size-class runs from the varied allocation pattern. In a real server, this residual grows with each traffic burst.

Measuring Fragmentation

Before comparing allocators, you need to measure fragmentation directly — not just watch RSS in htop. jemalloc exposes detailed statistics through its control API:

use tikv_jemalloc_ctl as jemalloc_ctl;

fn print_jemalloc_stats(label: &str) {
    // Must advance epoch before reading — stats are cached
    jemalloc_ctl::epoch::advance().unwrap();

    let allocated = jemalloc_ctl::stats::allocated::read().unwrap();
    let active = jemalloc_ctl::stats::active::read().unwrap();
    let resident = jemalloc_ctl::stats::resident::read().unwrap();

    let frag = (resident as f64 - allocated as f64) / allocated as f64;
    println!("[{}] allocated={:.1} MB  resident={:.1} MB  frag={:.3}",
        label,
        allocated as f64 / (1024.0 * 1024.0),
        resident as f64 / (1024.0 * 1024.0),
        frag,
    );
}

Three metrics matter:

  1. allocated — bytes your application has requested and not yet freed: the live data.
  2. active — bytes in pages that hold at least one live allocation. Always a multiple of the page size, so always at least allocated.
  3. resident — physical pages the OS has actually mapped for the allocator, including metadata and dirty pages awaiting decay.

The gap between resident and allocated is your fragmentation overhead: pages the OS gave you that contain no live data. One subtlety: at startup, jemalloc maps several megabytes for its own metadata (arena structures, extent tables, base allocations) before your first malloc. With only 0.1 MB allocated, a 6.9 MB resident gives a fragmentation ratio over 100 — this is metadata overhead, not heap fragmentation. The ratio becomes meaningful once your application has allocated enough data to dominate jemalloc's fixed overhead.

For allocator-agnostic measurement (including when running with glibc), read /proc/self/statm directly to get RSS in pages. For allocation-site profiling — finding which code path causes the most fragmentation — tools like DHAT (via Valgrind) and heaptrack (via LD_PRELOAD) identify hot allocation sites without code changes.

The Allocator Shootout

Same workload, three allocators. Each thread runs a tight loop: allocate a random-sized buffer (128 B to 8 KB), hold up to 5,000 live objects per thread, free randomly when full. This simulates a server handling connections of varied sizes with overlapping lifetimes.

// Cargo.toml dependencies:
//   num_cpus = "1.16"
//   tikv-jemallocator = "0.6"     (with feature "jemalloc")
//   mimalloc = { version = "0.1", default-features = false }  (with feature "use-mimalloc")
// Run three times:
//   cargo run --release --features jemalloc
//   cargo run --release --features use-mimalloc
//   cargo run --release                         (glibc default)

use std::hint::black_box;

#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

#[cfg(feature = "use-mimalloc")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

fn read_rss_kb() -> usize {
    let s = std::fs::read_to_string("/proc/self/statm").unwrap();
    s.split_whitespace().nth(1).unwrap().parse::<usize>().unwrap() * 4
}

fn main() {
    let num_threads = num_cpus::get();
    let max_live = 5_000;
    let sizes = [128, 256, 512, 1024, 2048, 4096, 8192];

    let barrier = std::sync::Arc::new(std::sync::Barrier::new(num_threads));
    // All threads wait here at peak — main thread measures RSS
    let peak_barrier = std::sync::Arc::new(
        std::sync::Barrier::new(num_threads + 1),
    );

    let handles: Vec<_> = (0..num_threads)
        .map(|tid| {
            let barrier = barrier.clone();
            let peak_barrier = peak_barrier.clone();
            std::thread::spawn(move || {
                barrier.wait();
                let mut live: Vec<Option<Vec<u8>>> =
                    (0..max_live).map(|_| None).collect();
                let mut rng: u64 = 31337 + tid as u64 * 7919;

                for _ in 0..200_000usize {
                    rng = rng.wrapping_mul(6364136223846793005)
                             .wrapping_add(1);
                    let size = sizes[(rng >> 32) as usize % 7];
                    let mut v = vec![0u8; size];
                    black_box(&mut v[0]);

                    rng = rng.wrapping_mul(6364136223846793005)
                             .wrapping_add(1);
                    live[(rng >> 32) as usize % max_live] = Some(v);
                }

                peak_barrier.wait(); // hold live data
                drop(live);
            })
        })
        .collect();

    peak_barrier.wait();
    let under_load = read_rss_kb();
    for h in handles { h.join().unwrap(); }
    let after_drop = read_rss_kb();

    println!("RSS under load: {:.1} MB", under_load as f64 / 1024.0);
    println!("RSS after drop: {:.1} MB", after_drop as f64 / 1024.0);
}
Allocator shootout (200K allocs/thread, 32 threads, mixed sizes 128B-8KB)

Allocator   Peak RSS   Alloc/sec   RSS after drop   Returned
---------------------------------------------------------------------
glibc       391.3 MB   16.7M/sec         6.5 MB        99%
jemalloc    544.8 MB   19.5M/sec       123.9 MB        78%
mimalloc    451.6 MB   18.0M/sec       447.6 MB         1%

Each allocator makes a different tradeoff. glibc returns 99% of memory when arenas drain completely — it munmaps the per-thread mmap'd arena when all allocations are freed. This is glibc's best case: every thread's arena drains completely at once. In a real server with mixed-lifetime objects, some allocations in each arena persist indefinitely, and glibc cannot unmap a partially-occupied arena.

jemalloc is 17% faster than glibc but retains 124 MB post-drop. jemalloc purges dirty pages via MADV_FREE, which marks pages as lazily reclaimable — the kernel can reclaim them under memory pressure, but without pressure RSS stays elevated. The 78% "returned" reflects pages that jemalloc has MADV_DONTNEED'd (via its muzzy decay), which immediately reduces RSS.

mimalloc shows only 1% returned. A caveat: this does not necessarily mean mimalloc is holding all the memory in active use. mimalloc may have internally marked pages as unused via MADV_FREE, but RSS does not drop until the kernel reclaims those pages under memory pressure. What we can say is that mimalloc does not eagerly munmap or MADV_DONTNEED its segments — it keeps the virtual address space mapped for fast reuse if new allocations arrive.

The throughput column matters: fragmentation reduction is useless if the allocator is slower. jemalloc delivers the highest throughput (19.5M allocs/sec) while actively returning memory to the OS via its decay mechanism.

Cross-Thread Freeing: The Work-Stealing Tax

This is the connection to the previous article's architectural argument. In a work-stealing runtime like Tokio, tasks migrate between threads at .await points. Memory allocated on Thread A may be freed on Thread B. This cross-thread freeing is the single most damaging pattern for allocator fragmentation.

The mechanism: Thread A allocates a 4 KB buffer from its arena. The task yields at an .await and is stolen by Thread B. Thread B completes the work and frees the buffer. But the freed memory belongs to Thread A's arena. Thread A's arena now has a 4 KB hole that Thread A may never reuse (it might be allocating different sizes). Thread B's arena is unaffected — it never had those pages. Over time, every arena accumulates holes from buffers freed by other threads.

We can measure this directly. Mode 1: each thread allocates and frees its own buffers (simulating thread-per-core). Mode 2: each thread sends its allocations to a neighbor thread for freeing (simulating work-stealing):

// Cargo.toml dependencies:
//   tikv-jemallocator = "0.6"
//   tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
//   crossbeam = "0.8"
//   num_cpus = "1.16"
//   core_affinity = "0.8"

use std::hint::black_box;
use std::time::Instant;

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn read_rss_kb() -> usize {
    let s = std::fs::read_to_string("/proc/self/statm").unwrap();
    s.split_whitespace().nth(1).unwrap().parse::<usize>().unwrap() * 4
}

fn print_jemalloc_stats(label: &str) {
    tikv_jemalloc_ctl::epoch::advance().unwrap();
    let allocated = tikv_jemalloc_ctl::stats::allocated::read().unwrap();
    let resident = tikv_jemalloc_ctl::stats::resident::read().unwrap();
    let frag = if allocated > 0 {
        (resident as f64 - allocated as f64) / allocated as f64
    } else { 0.0 };
    println!("  [{:<35}] allocated={:>8.1} MB  resident={:>8.1} MB  frag={:.3}",
        label, allocated as f64 / (1024.0 * 1024.0),
        resident as f64 / (1024.0 * 1024.0), frag);
}

/// Mode 1: Each thread allocates and frees its own buffers (thread-per-core).
fn same_thread_workload(num_threads: usize, duration_secs: u64) -> (usize, usize) {
    let sizes = [128, 256, 512, 1024, 2048, 4096, 8192];
    let max_live = 2_000;
    let barrier = std::sync::Arc::new(std::sync::Barrier::new(num_threads));
    let peak_rss = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
    let handles: Vec<_> = (0..num_threads).map(|tid| {
        let barrier = barrier.clone();
        let peak_rss = peak_rss.clone();
        std::thread::spawn(move || {
            core_affinity::set_for_current(core_affinity::CoreId { id: tid });
            barrier.wait();
            let deadline = Instant::now()
                + std::time::Duration::from_secs(duration_secs);
            let mut live: Vec<Option<Vec<u8>>> =
                (0..max_live).map(|_| None).collect();
            let mut slot = 0usize;
            let mut rng: u64 = 9999 + tid as u64 * 6151;
            let mut ops = 0u64;
            while Instant::now() < deadline {
                rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
                let size = sizes[(rng >> 32) as usize % sizes.len()];
                let mut v = vec![0u8; size];
                black_box(&mut v[0]);
                live[slot] = Some(v); // old buffer freed here (same thread)
                slot = (slot + 1) % max_live;
                ops += 1;
                if tid == 0 && ops % 50_000 == 0 {
                    peak_rss.fetch_max(read_rss_kb(),
                        std::sync::atomic::Ordering::Relaxed);
                }
            }
            ops
        })
    }).collect();
    let total_ops: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    let final_rss = read_rss_kb();
    let peak = peak_rss.load(std::sync::atomic::Ordering::Relaxed).max(final_rss);
    println!("  Total ops: {} ({:.1}M/sec)", total_ops,
        total_ops as f64 / duration_secs as f64 / 1_000_000.0);
    (peak, final_rss)
}

/// Mode 2: Thread i allocates, sends evicted buffer to thread (i+1) % N.
fn cross_thread_workload(num_threads: usize, duration_secs: u64) -> (usize, usize) {
    let sizes = [128, 256, 512, 1024, 2048, 4096, 8192];
    let max_live = 2_000;
    let peak_rss = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));

    let mut senders = Vec::new();
    let mut receivers = Vec::new();
    for _ in 0..num_threads {
        let (tx, rx) = crossbeam::channel::bounded::<Vec<u8>>(256);
        senders.push(tx);
        receivers.push(rx);
    }
    let barrier = std::sync::Arc::new(
        std::sync::Barrier::new(num_threads * 2));

    let alloc_handles: Vec<_> = (0..num_threads).map(|tid| {
        let barrier = barrier.clone();
        let sender = senders[(tid + 1) % num_threads].clone();
        let peak_rss = peak_rss.clone();
        std::thread::spawn(move || {
            barrier.wait();
            let deadline = Instant::now()
                + std::time::Duration::from_secs(duration_secs);
            let mut live: Vec<Option<Vec<u8>>> =
                (0..max_live).map(|_| None).collect();
            let mut slot = 0usize;
            let mut rng: u64 = 7777 + tid as u64 * 3571;
            let mut ops = 0u64;
            while Instant::now() < deadline {
                rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
                let size = sizes[(rng >> 32) as usize % sizes.len()];
                let mut v = vec![0u8; size];
                black_box(&mut v[0]);
                if let Some(old) = live[slot].take() {
                    let _ = sender.send(old); // cross-thread free
                }
                live[slot] = Some(v);
                slot = (slot + 1) % max_live;
                ops += 1;
                if tid == 0 && ops % 50_000 == 0 {
                    peak_rss.fetch_max(read_rss_kb(),
                        std::sync::atomic::Ordering::Relaxed);
                }
            }
            drop(live);
            ops
        })
    }).collect();

    let free_handles: Vec<_> = (0..num_threads).map(|tid| {
        let barrier = barrier.clone();
        let receiver = receivers[tid].clone();
        std::thread::spawn(move || {
            barrier.wait();
            while let Ok(buf) = receiver.recv() { drop(buf); }
        })
    }).collect();

    let total_ops: u64 = alloc_handles.into_iter()
        .map(|h| h.join().unwrap()).sum();
    drop(senders);
    for h in free_handles { h.join().unwrap(); }
    let final_rss = read_rss_kb();
    let peak = peak_rss.load(std::sync::atomic::Ordering::Relaxed).max(final_rss);
    println!("  Total ops: {} ({:.1}M/sec)", total_ops,
        total_ops as f64 / duration_secs as f64 / 1_000_000.0);
    (peak, final_rss)
}

fn main() {
    let threads = num_cpus::get().min(16);
    let duration = 15;

    println!("--- Same-thread free (thread-per-core) ---");
    print_jemalloc_stats("before TPC");
    let (tpc_peak, tpc_final) = same_thread_workload(threads, duration);
    print_jemalloc_stats("after TPC");
    println!("  Peak RSS: {:.1} MB, Final: {:.1} MB",
        tpc_peak as f64 / 1024.0, tpc_final as f64 / 1024.0);

    println!("\n  Waiting 15s for decay...\n");
    std::thread::sleep(std::time::Duration::from_secs(15));

    println!("--- Cross-thread free (work-stealing) ---");
    print_jemalloc_stats("before work-stealing");
    let (ws_peak, ws_final) = cross_thread_workload(threads, duration);
    print_jemalloc_stats("after work-stealing");
    println!("  Peak RSS: {:.1} MB, Final: {:.1} MB",
        ws_peak as f64 / 1024.0, ws_final as f64 / 1024.0);

    println!("\n=== Summary ===");
    println!("  Thread-per-core: peak {:>6} KB, final {:>6} KB", tpc_peak, tpc_final);
    println!("  Work-stealing:   peak {:>6} KB, final {:>6} KB", ws_peak, ws_final);
    println!("  Overhead: {:.2}x peak, {:.2}x final",
        ws_peak as f64 / tpc_peak as f64, ws_final as f64 / tpc_final as f64);
}
Thread-per-core vs work-stealing allocator impact (15s, 16 threads, jemalloc)

Configuration                 Peak RSS   Final RSS
------------------------------------------------------
Thread-per-core + jemalloc    164.6 MB     61.1 MB
Work-stealing + jemalloc      228.1 MB    120.7 MB

Work-stealing overhead: 1.39x peak, 1.97x final

The same total volume of allocations and frees, but cross-thread freeing produces significantly higher fragmentation. The freed memory is in the wrong arena — it was allocated by Thread A but sits in Thread A's freelist even though Thread B freed it. Thread A may have moved on to different allocation sizes, leaving those freed blocks stranded.

In a Tokio work-stealing pool with N workers, a task has an (N-1)/N probability of being on a different thread at free time than at allocation time. With 16 threads, 94% of frees are cross-thread. With 24 threads, 96%.

This is snmalloc's core insight: instead of freeing into the allocating thread's arena (which that thread may not revisit), snmalloc uses message passing to return freed objects to the allocating thread's queue. The allocating thread can then coalesce and reuse them efficiently. For producer/consumer workloads — and work-stealing is exactly this — snmalloc eliminates the cross-thread fragmentation penalty.

jemalloc Tuning

jemalloc is the most widely deployed alternative allocator in the Rust ecosystem (used by TiKV, Cloudflare, Firefox). Understanding its tuning knobs lets you optimize for your specific workload.

Setup is three lines:

// Cargo.toml: tikv-jemallocator = "0.6"
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

// Optional: compile-time configuration via prefixed symbol
#[allow(non_upper_case_globals)]
#[export_name = "_rjem_malloc_conf"]
pub static malloc_conf: &[u8] =
    b"background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:10000\0";

The key parameter is dirty_decay_ms: how long dirty pages (freed but not returned to OS) wait before being purged via madvise. Lower values return memory faster but increase syscall overhead:

jemalloc dirty_decay_ms impact (8 threads, 500K allocs/thread, 180 MB live)

dirty_decay_ms      Workload time   Post-drop resident     After decay
------------------------------------------------------------------
0 (immediate)            212 ms                14.7 MB        14.8 MB
1,000                    183 ms     196.8 MB → 83.3 MB        83.2 MB
10,000 (default)         181 ms    197.1 MB → 129.4 MB       106.7 MB

You can reproduce these results by saving the async server pattern example from above as a standalone binary and running it with different decay settings:
_RJEM_MALLOC_CONF="dirty_decay_ms:0" cargo run --release
_RJEM_MALLOC_CONF="dirty_decay_ms:1000" cargo run --release
_RJEM_MALLOC_CONF="dirty_decay_ms:10000" cargo run --release
Compare the "after decay" resident values across runs.

The workload time shows the cost: dirty_decay_ms=0 is 17% slower because jemalloc calls madvise on every deallocation. But the post-drop difference is striking: 14.7 MB vs 83.2 MB vs 106.7 MB of irreducible resident memory. With immediate purging, pages return to the OS during the workload — when new allocations arrive, jemalloc requests fresh pages that can be packed efficiently. With delayed purging, dirty pages accumulate during the burst, and the allocator reuses them in-place. These reused pages become partially occupied with mixed size classes, creating fragmentation that persists even after all application data is freed. The timing of purging affects the final fragmentation state.

Other important parameters: muzzy_decay_ms controls the second decay stage (dirty pages are first MADV_FREE'd, then MADV_DONTNEED'd after the muzzy decay elapses), and background_thread:true moves the madvise calls off application threads onto dedicated purging threads.

You can set these at runtime via the _RJEM_MALLOC_CONF environment variable (tikv-jemallocator uses prefixed symbols, so the standard MALLOC_CONF has no effect) or programmatically via tikv_jemalloc_ctl:

use tikv_jemalloc_ctl::raw;

fn set_decay(dirty_ms: isize, muzzy_ms: isize) {
    unsafe { raw::write(b"arenas.dirty_decay_ms\0", dirty_ms) }.unwrap();
    unsafe { raw::write(b"arenas.muzzy_decay_ms\0", muzzy_ms) }.unwrap();
}

Below the Allocator: Kernel Memory

The allocator manages memory above the OS. But the kernel's decisions about physical pages affect fragmentation independently of what the allocator does.

mmap vs brk. Two ways for allocators to get memory. brk extends the data segment linearly — simple, but can only release memory from the top. If the highest allocation persists, nothing below it can be returned. mmap creates independent mappings that can be returned individually, but each mapping adds a VMA (Virtual Memory Area) to the kernel's per-process mapping tree, and munmap is expensive.

MADV_DONTNEED vs MADV_FREE. When an allocator wants to return pages without unmapping the address range, it uses madvise. Two strategies:

  1. MADV_DONTNEED: the kernel drops the pages immediately. RSS falls at once, but the next access to the range page-faults and receives a fresh zero-filled page.
  2. MADV_FREE: the kernel marks the pages lazily reclaimable. RSS stays elevated until memory pressure forces reclamation, but if the process writes to the range before that happens, the pages are reused without a fault.

The difference matters for allocate-free-reuse cycles. With MADV_FREE, reusing recently-freed pages is nearly free — no page fault, no zero-fill. With MADV_DONTNEED, every reuse pays the full fault cost:

// Cargo.toml dependencies:
//   libc = "0.2"

use std::time::Instant;

unsafe fn bench_madvise(strategy: libc::c_int, size: usize,
                        iters: usize) -> std::time::Duration {
    let ptr = unsafe {
        libc::mmap(std::ptr::null_mut(), size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_PRIVATE, -1, 0)
    };
    let start = Instant::now();
    for _ in 0..iters {
        unsafe { std::ptr::write_bytes(ptr as *mut u8, 0xAA, size) };
        unsafe { libc::madvise(ptr, size, strategy) };
        unsafe { std::ptr::write_bytes(ptr as *mut u8, 0xBB, size) };
    }
    let elapsed = start.elapsed();
    unsafe { libc::munmap(ptr, size) };
    elapsed
}

fn main() {
    let size = 4096 * 256; // 1 MB
    let iterations = 10_000;

    let dontneed = unsafe { bench_madvise(libc::MADV_DONTNEED, size, iterations) };
    let free = unsafe { bench_madvise(libc::MADV_FREE, size, iterations) };

    println!("MADV_DONTNEED: {:?} ({:.1} µs/iter)",
        dontneed, dontneed.as_micros() as f64 / iterations as f64);
    println!("MADV_FREE:     {:?} ({:.1} µs/iter)",
        free, free.as_micros() as f64 / iterations as f64);
    println!("DONTNEED/FREE: {:.2}x",
        dontneed.as_nanos() as f64 / free.as_nanos() as f64);
}
MADV_DONTNEED vs MADV_FREE (1 MB region, 10K write-advise-rewrite cycles)

Strategy          Time       Per iteration
-------------------------------------------------
MADV_DONTNEED     388.0 ms        38.8 µs
MADV_FREE         123.1 ms        12.3 µs

DONTNEED/FREE: 3.15x

Transparent Huge Pages (THP). Linux can collapse 512 contiguous 4 KB pages into a single 2 MB huge page, reducing TLB pressure. But THP requires contiguous physical memory — exactly what fragmentation destroys. The khugepaged kernel thread scans for collapsible regions, causing unpredictable latency spikes. For long-running services with fragmented heaps, THP churns without benefit and adds tail latency.

Recommendation: disable THP for latency-sensitive services (echo madvise > /sys/kernel/mm/transparent_hugepage/enabled), or use explicit MADV_HUGEPAGE hints only on known-large contiguous allocations.

Thread-Per-Core: The Allocator Wins Too

The previous article showed thread-per-core delivering 5.75x higher echo server throughput, primarily from eliminating scheduler coordination and improving cache locality. But there is a third advantage: allocator behavior.

In thread-per-core, each thread's allocator state is perfectly aligned with its workload. Allocations and frees happen on the same thread. The thread cache never accumulates foreign freed objects. Adjacent freed blocks belong to the same arena, so the allocator can coalesce them. Pages return to the OS faster because the allocator's view of "free" and the application's view of "done" are synchronized.

The benchmark above already demonstrates this: the same allocator (jemalloc), same buffer sizes, same total allocation volume — but thread-per-core produces 1.39x lower peak RSS and 1.97x lower final RSS. The 120.7 MB retained by work-stealing versus 61.1 MB by thread-per-core is entirely due to cross-thread freeing scattering dirty pages across arenas that no thread revisits.

This completes the thread-per-core advantage from the previous article: it is not just about cache locality and scheduler overhead. The memory allocator itself performs better when tasks stay on one thread. Three independent systems — the scheduler, the CPU cache, and the allocator — all reward the same architectural choice.

Switching Allocators in Rust

Rust's #[global_allocator] attribute makes switching allocators trivial — three lines of code, no other changes. Every Box, Vec, String, and HashMap automatically uses the new allocator.

jemalloc (best general-purpose choice for servers):

// Cargo.toml: tikv-jemallocator = "0.6"
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

mimalloc (fast allocation, compact metadata):

// Cargo.toml: mimalloc = { version = "0.1", default-features = false }
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

snmalloc (best for cross-thread workloads):

// Cargo.toml: snmalloc-rs = "0.3"
#[global_allocator]
static GLOBAL: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;

Measure before and after with /proc/self/statm or jemalloc's stats API. If your service's RSS-to-allocated ratio exceeds 1.5, switching allocators is the lowest-effort fix.

Key Takeaways

  1. Fragmentation is not a leak. Every allocation has a matching free. The allocator retains pages because freed blocks are scattered across arenas in holes too small or too dispersed to coalesce.
  2. Allocators have three layers. Thread caches (fast, no sync), central arenas (slow, some sync), OS interface (slowest, syscalls). Most malloc/free calls never leave the thread cache.
  3. Cross-thread freeing causes fragmentation. Work-stealing moves tasks between threads. Memory allocated on Thread A, freed on Thread B, accumulates as unreclaimable holes in A's arena.
  4. Thread-per-core aligns allocator caches with the workload. Allocations and frees on the same thread mean perfect thread cache utilization and efficient coalescing. Three systems — scheduler, CPU cache, allocator — all reward the same design.
  5. jemalloc's decay mechanism is the key differentiator. It returns memory to the OS on a tunable schedule instead of only when arenas fully drain (glibc) or through expensive immediate unmapping.
  6. The kernel participates. MADV_FREE avoids refault costs that MADV_DONTNEED incurs. THP competes with fragmented heaps for contiguous physical memory. Disable THP for latency-sensitive services.
  7. Switching allocators is three lines of code. tikv-jemallocator or mimalloc as #[global_allocator]. No other code changes needed.
  8. Architecture shapes allocator behavior. Cross-thread freeing with jemalloc retains 1.97x the memory of same-thread freeing with the same allocator and the same workload. Choosing where frees happen matters alongside choosing which allocator runs them.
* * *

Further Reading