You deploy a long-lived Rust service. RSS climbs over hours in stair-steps — flat, then a sudden jump to a new plateau that never comes back. You run Valgrind. No leaks. Every allocation has a matching free. The problem is heap fragmentation.
The common fix is to swap the allocator — replace glibc's default with jemalloc, and RSS stabilizes. But why does the default allocator fragment under this workload? And why does jemalloc handle it better? The answer requires understanding three layers: the allocator's internal bookkeeping, the async runtime's allocation patterns, and the kernel's virtual memory decisions.
In the previous article, thread-per-core eliminated synchronization overhead by keeping tasks pinned to cores. That same principle extends to memory: when your allocator's thread caches align with your runtime's thread model, fragmentation drops and memory returns to the OS predictably. Even your memory allocator benefits from not sharing.
All benchmarks were run on an Intel Core i9-14900K (24 cores, 32 threads) with Rust 1.93.0 on Linux. CPU frequency scaling was set to the performance governor. Each benchmark was run multiple times; representative results are shown. All code examples are self-contained and runnable.
What Fragmentation Actually Is
Heap fragmentation comes in two forms. External fragmentation: free memory exists but is scattered in small non-contiguous gaps. The allocator has 100 KB free, but the largest contiguous block is 2 KB — a 4 KB allocation forces a new page from the OS. Internal fragmentation: the allocator rounds your 49-byte request to a 64-byte size class. 15 bytes wasted per allocation.
A fragmented heap looks like this:
Memory layout after interleaved alloc/free:
|USED|free|USED|free|USED|free|USED|free|USED|free|USED|
4KB 64B 4KB 64B 4KB 64B 4KB 64B 4KB 64B 4KB
Total free: 320 bytes scattered across 5 gaps.
Largest contiguous free block: 64 bytes.
A 256-byte allocation requires a new page from the OS.
The key metric is the fragmentation ratio:
(resident - allocated) / allocated. The OS has given
you resident bytes of physical pages, but only
allocated bytes are actually in use. A ratio of 0.0
means every byte the OS gave you is live data. A ratio of 2.0 means
the OS has given you 3x more physical pages than you need — two
thirds are wasted in holes the allocator cannot coalesce.
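A quick sanity check on that arithmetic, as a self-contained helper:

```rust
// Fragmentation ratio: overhead relative to live data.
fn fragmentation_ratio(resident_bytes: usize, allocated_bytes: usize) -> f64 {
    (resident_bytes as f64 - allocated_bytes as f64) / allocated_bytes as f64
}

fn main() {
    // Ratio 0.0: every resident byte is live application data.
    assert_eq!(fragmentation_ratio(100, 100), 0.0);
    // Ratio 2.0: the OS has handed out 3x the live data;
    // two thirds of resident memory is overhead.
    assert_eq!(fragmentation_ratio(300, 100), 2.0);
    println!("ratios check out");
}
```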
Demonstrating this is straightforward: allocate 10,000 objects of alternating sizes, free half of them, and check whether RSS drops:
fn read_rss_kb() -> usize {
let statm = std::fs::read_to_string("/proc/self/statm").unwrap();
let pages: usize = statm.split_whitespace().nth(1).unwrap().parse().unwrap();
pages * 4 // each page is 4KB on x86_64
}
fn main() {
let mut allocations: Vec<Option<Vec<u8>>> = Vec::with_capacity(10_000);
// Allocate alternating small (64B) and large (4KB) blocks
for i in 0..10_000 {
let size = if i % 2 == 0 { 64 } else { 4096 };
        // vec![0u8; n] uses alloc_zeroed, which can hand back untouched
        // (not-yet-resident) pages — fill with 1s so RSS reflects the data.
        allocations.push(Some(vec![1u8; size]));
}
println!("After alloc: {} KB", read_rss_kb());
// Free all the small blocks — creates 5,000 tiny holes
for i in (0..10_000).step_by(2) {
allocations[i] = None;
}
println!("After freeing small blocks: {} KB", read_rss_kb());
// Free everything
drop(allocations);
println!("After freeing all: {} KB", read_rss_kb());
}
RSS barely moves after freeing half the allocations — and stays elevated even after freeing everything. The allocator retains the pages.
This is single-threaded behavior. glibc's ptmalloc2
serves the main thread from the brk heap, which can
only release memory from the top — if the highest-address allocation
persists, nothing below it returns to the OS. With multiple threads,
each thread gets an mmap'd arena that can
be fully unmapped when all its allocations are freed. We will see
this difference in the allocator shootout below.
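The top-of-heap restriction is easy to provoke. A minimal single-threaded sketch (behavior assumes glibc on Linux; other allocators differ): allocate a run of small blocks, free everything except the last-allocated one, and watch RSS. The survivor sits at the top of the brk heap and pins everything beneath it.

```rust
fn read_rss_kb() -> usize {
    let s = std::fs::read_to_string("/proc/self/statm").unwrap();
    s.split_whitespace().nth(1).unwrap().parse::<usize>().unwrap() * 4
}

fn main() {
    // 50,000 x 1 KB: small allocations that glibc serves from the brk heap.
    let mut blocks: Vec<Vec<u8>> = (0..50_000).map(|_| vec![1u8; 1024]).collect();
    println!("after alloc: {} KB", read_rss_kb());

    // Keep only the LAST block — the one nearest the top of the heap.
    let survivor = blocks.pop();
    blocks.clear();
    println!("after freeing all but the top block: {} KB", read_rss_kb());

    drop(survivor);
    println!("after freeing the top block too: {} KB", read_rss_kb());
}
```

Under glibc, the middle figure typically stays near the peak even though nearly all the data is free; once the top block is dropped, the freed run coalesces with the top chunk and the heap can finally be trimmed.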
How Allocators Work
Every modern allocator shares a three-layer architecture:
- Thread caches (fast path, no locks). Per-thread freelists of pre-sized objects. Most malloc and free calls complete here without synchronization.
- Central arenas (slow path, some locking). When thread caches overflow or empty, they transfer objects to/from a shared arena. This requires some coordination.
- OS interface (slowest path, syscalls). mmap to get pages, madvise to return them. The most expensive operation by far.
Where allocators differ is how they manage each layer:
| Allocator | Thread Cache | Arenas | Return to OS |
|---|---|---|---|
| ptmalloc2 (glibc) | Per-arena tcache, 64 entries | Up to 8 per core | Rarely |
| jemalloc | Per-thread tcache, auto-sized | 4 per core (default) | Decay-based (10s default) |
| mimalloc | Free-list sharding per 64 KiB page | Per-thread heap | Eager page reset |
| snmalloc | Per-thread, message-passing | Per-thread | Batched returns |
The critical difference for fragmentation is how each allocator handles cross-thread freeing — what happens when Thread A allocates memory that Thread B later frees. ptmalloc2 returns the freed block to the original arena (Thread A's), but does not eagerly coalesce it with adjacent free blocks across arena boundaries. The freed memory accumulates in arenas that may not be actively allocating, creating dead zones that the allocator cannot reclaim.
jemalloc mitigates this with its decay mechanism.
Dirty pages (freed but not yet returned to the OS) are purged on a
configurable schedule — by default, 10 seconds of inactivity.
Background threads handle the purging so application threads don't
pay the madvise cost. mimalloc takes a more aggressive
approach: empty pages are immediately marked as unused, reducing RSS
at the cost of more frequent OS interaction. snmalloc uses a
fundamentally different design: freed objects are sent back to the
allocating thread via batched message passing, keeping each thread's
arena coherent.
Async Rust Allocation Patterns
Not all allocation patterns fragment equally. Async Rust services create a particularly fragmentation-prone combination:
Varied-size allocations. Each
tokio::spawn allocates a Task<T>
on the heap. The size depends on the future's state machine — as
the first article
in this series showed, different async functions produce different
enum sizes. An HTTP server spawning handlers for different endpoints
creates a mix of future sizes on every request.
Varied lifetimes. Connections arrive and depart independently. Object A is allocated, then B, then A is freed, leaving a hole that B's allocation straddles. Long-lived connections (WebSocket, SSE) pin pages in place while short-lived ones (health checks, simple GETs) create and destroy allocations around them.
Temporary allocation bursts. A single
serde_json::from_str call creates many temporary
String and Vec allocations during
parsing — different sizes, all short-lived. The heap sees a burst
of varied allocations followed by a wave of frees, leaving
fragmented gaps.
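To watch such a burst directly, a counting wrapper around the system allocator works — a hypothetical diagnostic, not part of the benchmarks below. It tallies how many allocations a parse-like burst of short-lived, varied-size temporaries performs:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical counting wrapper: forwards to the system allocator
// while tallying allocation calls.
struct Counting;
static ALLOCS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

// Simulate a parse-like burst: many short-lived, varied-size temporaries,
// all freed together at the end.
fn burst() -> usize {
    let before = ALLOCS.load(Ordering::Relaxed);
    let fields: Vec<String> = (0..100)
        .map(|i| format!("field_{}: value number {}", i, i * 37))
        .collect();
    drop(fields);
    ALLOCS.load(Ordering::Relaxed) - before
}

fn main() {
    println!("burst performed {} allocations", burst());
}
```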
We can simulate this pattern and watch fragmentation build:
// Cargo.toml dependencies:
// tikv-jemallocator = "0.6"
// tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
use std::hint::black_box;
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
fn print_jemalloc_stats(label: &str) {
tikv_jemalloc_ctl::epoch::advance().unwrap();
let allocated = tikv_jemalloc_ctl::stats::allocated::read().unwrap();
let resident = tikv_jemalloc_ctl::stats::resident::read().unwrap();
let frag = if allocated > 0 {
(resident as f64 - allocated as f64) / allocated as f64
} else { 0.0 };
println!("[{:<25}] allocated={:>10.1} MB resident={:>10.1} MB frag={:.3}",
label, allocated as f64 / (1024.0 * 1024.0),
resident as f64 / (1024.0 * 1024.0), frag);
}
fn main() {
let sizes: [usize; 8] = [128, 256, 512, 1024, 2048, 4096, 8192, 16384];
let mut live: Vec<Option<(Vec<u8>, u32)>> = Vec::new();
let mut rng: u64 = 98765;
let mut total_allocs: u64 = 0;
let mut total_frees: u64 = 0;
print_jemalloc_stats("baseline");
for iteration in 0..500_000u64 {
rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
// New "connections": 2-5 per iteration, varied sizes and lifetimes
let new = 2 + ((rng >> 48) % 4) as u32;
for _ in 0..new {
rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
let size = sizes[(rng >> 32) as usize % sizes.len()];
let lifetime = if (rng >> 40) % 10 < 7 { 1 + ((rng >> 16) % 20) as u32 }
else { 100 + ((rng >> 8) % 1900) as u32 };
let mut buf = vec![0u8; size];
black_box(&mut buf[0]);
live.push(Some((buf, lifetime)));
total_allocs += 1;
}
// Age and expire
for slot in live.iter_mut() {
if let Some((_, remaining)) = slot {
if *remaining == 0 { *slot = None; total_frees += 1; }
else { *remaining -= 1; }
}
}
if iteration % 10_000 == 0 && iteration > 0 {
live.retain(|s| s.is_some()); // compact periodically
}
if iteration % 50_000 == 0 {
let live_count = live.iter().filter(|s| s.is_some()).count();
print!("iter={:>6} live={:>6} ", iteration, live_count);
print_jemalloc_stats(&format!("iter {}", iteration));
}
}
drop(live);
print_jemalloc_stats("after drop all");
println!("Waiting 12s for decay...");
std::thread::sleep(std::time::Duration::from_secs(12));
print_jemalloc_stats("after decay");
println!("Total allocations: {}, frees: {}", total_allocs, total_frees);
}
Under load, jemalloc keeps the fragmentation ratio around 1.4 — the resident memory is about 2.4x the allocated data, a reasonable overhead. The problem appears after the workload stops: allocated drops to 0.4 MB but resident stays at 12.5 MB (ratio 30.3). After jemalloc's decay timer fires, resident drops to 6.7 MB — but that is still 17x the allocated data. These are pages jemalloc cannot return because they contain a mix of metadata and partially-filled size-class runs from the varied allocation pattern. In a real server, this residual grows with each traffic burst.
Measuring Fragmentation
Before comparing allocators, you need to measure fragmentation
directly — not just watch RSS in htop. jemalloc
exposes detailed statistics through its control API:
use tikv_jemalloc_ctl as jemalloc_ctl;
fn print_jemalloc_stats(label: &str) {
// Must advance epoch before reading — stats are cached
jemalloc_ctl::epoch::advance().unwrap();
let allocated = jemalloc_ctl::stats::allocated::read().unwrap();
let active = jemalloc_ctl::stats::active::read().unwrap();
let resident = jemalloc_ctl::stats::resident::read().unwrap();
    let frag = (resident as f64 - allocated as f64) / allocated as f64;
    println!("[{}] allocated={:.1} MB active={:.1} MB resident={:.1} MB frag={:.3}",
        label,
        allocated as f64 / (1024.0 * 1024.0),
        active as f64 / (1024.0 * 1024.0),
        resident as f64 / (1024.0 * 1024.0),
        frag,
    );
}
Three metrics matter:
- stats.allocated: bytes your application currently holds. The "useful" memory.
- stats.active: bytes in jemalloc's active extents. Includes internal metadata and page-aligned rounding.
- stats.resident: physical pages the OS has given to the process. The number htop shows as RSS.
The gap between resident and allocated
is your fragmentation overhead: pages the OS gave you that contain
no live data. One subtlety: at startup, jemalloc maps several
megabytes for its own metadata (arena structures, extent tables,
base allocations) before your first malloc. With
only 0.1 MB allocated, a 6.9 MB resident gives a fragmentation
ratio near 70 — this is metadata overhead, not heap fragmentation.
The ratio becomes meaningful once your application has allocated
enough data to dominate jemalloc's fixed overhead.
For allocator-agnostic measurement (including when running with
glibc), read /proc/self/statm directly to get RSS
in pages. For allocation-site profiling — finding which
code path causes the most fragmentation — tools like
DHAT (via Valgrind) and
heaptrack (via LD_PRELOAD) identify
hot allocation sites without code changes.
The Allocator Shootout
Same workload, three allocators. Each thread runs a tight loop: allocate a random-sized buffer (128 B to 8 KB), hold up to 5,000 live objects per thread, free randomly when full. This simulates a server handling connections of varied sizes with overlapping lifetimes.
// Cargo.toml dependencies:
// num_cpus = "1.16"
// tikv-jemallocator = { version = "0.6", optional = true }
// mimalloc = { version = "0.1", default-features = false, optional = true }
// [features]: jemalloc = ["dep:tikv-jemallocator"], use-mimalloc = ["dep:mimalloc"]
// Run three times:
// cargo run --release --features jemalloc
// cargo run --release --features use-mimalloc
// cargo run --release (glibc default)
use std::hint::black_box;
#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
#[cfg(feature = "use-mimalloc")]
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
fn read_rss_kb() -> usize {
let s = std::fs::read_to_string("/proc/self/statm").unwrap();
s.split_whitespace().nth(1).unwrap().parse::<usize>().unwrap() * 4
}
fn main() {
let num_threads = num_cpus::get();
let max_live = 5_000;
let sizes = [128, 256, 512, 1024, 2048, 4096, 8192];
let barrier = std::sync::Arc::new(std::sync::Barrier::new(num_threads));
// All threads wait here at peak — main thread measures RSS
let peak_barrier = std::sync::Arc::new(
std::sync::Barrier::new(num_threads + 1),
);
let handles: Vec<_> = (0..num_threads)
.map(|tid| {
let barrier = barrier.clone();
let peak_barrier = peak_barrier.clone();
std::thread::spawn(move || {
barrier.wait();
let mut live: Vec<Option<Vec<u8>>> =
(0..max_live).map(|_| None).collect();
let mut rng: u64 = 31337 + tid as u64 * 7919;
for _ in 0..200_000usize {
rng = rng.wrapping_mul(6364136223846793005)
.wrapping_add(1);
let size = sizes[(rng >> 32) as usize % 7];
let mut v = vec![0u8; size];
black_box(&mut v[0]);
rng = rng.wrapping_mul(6364136223846793005)
.wrapping_add(1);
live[(rng >> 32) as usize % max_live] = Some(v);
}
                peak_barrier.wait(); // rendezvous: every thread holds its live data
                peak_barrier.wait(); // wait for main to finish measuring RSS
                drop(live);
            })
        })
        .collect();
    peak_barrier.wait(); // all threads are at peak
    let under_load = read_rss_kb(); // measure while buffers are still held
    peak_barrier.wait(); // release the threads to drop their buffers
for h in handles { h.join().unwrap(); }
let after_drop = read_rss_kb();
println!("RSS under load: {:.1} MB", under_load as f64 / 1024.0);
println!("RSS after drop: {:.1} MB", after_drop as f64 / 1024.0);
}
Each allocator makes a different tradeoff. glibc returns 99% of
memory when arenas drain completely — it munmaps the
per-thread mmap'd arena when all allocations are
freed. This is glibc's best case: every thread's arena
drains completely at once. In a real server with mixed-lifetime
objects, some allocations in each arena persist indefinitely, and
glibc cannot unmap a partially-occupied arena.
jemalloc is 17% faster than glibc but retains 124 MB post-drop.
jemalloc purges dirty pages via MADV_FREE, which
marks pages as lazily reclaimable — the kernel can reclaim them
under memory pressure, but without pressure RSS stays elevated.
The 78% "returned" reflects pages that jemalloc has
MADV_DONTNEED'd (via its muzzy decay), which
immediately reduces RSS.
mimalloc shows only 1% returned. A caveat: this does not
necessarily mean mimalloc is holding all the memory in active use.
mimalloc may have internally marked pages as unused via
MADV_FREE, but RSS does not drop until the kernel
reclaims those pages under memory pressure. What we can say is
that mimalloc does not eagerly munmap or
MADV_DONTNEED its segments — it keeps the virtual
address space mapped for fast reuse if new allocations arrive.
Throughput matters too: fragmentation reduction is useless if the allocator is slower. In these runs, jemalloc delivers the highest throughput (19.5M allocs/sec) while actively returning memory to the OS via its decay mechanism.
Cross-Thread Freeing: The Work-Stealing Tax
This is the connection to the
previous article's
architectural argument. In a work-stealing runtime like Tokio,
tasks migrate between threads at .await points.
Memory allocated on Thread A may be freed on Thread B. This
cross-thread freeing is the single most damaging pattern for
allocator fragmentation.
The mechanism: Thread A allocates a 4 KB buffer from its arena.
The task yields at an .await and is stolen by Thread B.
Thread B completes the work and frees the buffer. But the freed
memory belongs to Thread A's arena. Thread A's arena now has a
4 KB hole that Thread A may never reuse (it might be allocating
different sizes). Thread B's arena is unaffected — it never had
those pages. Over time, every arena accumulates holes from buffers
freed by other threads.
We can measure this directly. Mode 1: each thread allocates and frees its own buffers (simulating thread-per-core). Mode 2: each thread sends its allocations to a neighbor thread for freeing (simulating work-stealing):
// Cargo.toml dependencies:
// tikv-jemallocator = "0.6"
// tikv-jemalloc-ctl = { version = "0.6", features = ["stats"] }
// crossbeam = "0.8"
// num_cpus = "1.16"
// core_affinity = "0.8"
use std::hint::black_box;
use std::time::Instant;
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
fn read_rss_kb() -> usize {
let s = std::fs::read_to_string("/proc/self/statm").unwrap();
s.split_whitespace().nth(1).unwrap().parse::<usize>().unwrap() * 4
}
fn print_jemalloc_stats(label: &str) {
tikv_jemalloc_ctl::epoch::advance().unwrap();
let allocated = tikv_jemalloc_ctl::stats::allocated::read().unwrap();
let resident = tikv_jemalloc_ctl::stats::resident::read().unwrap();
let frag = if allocated > 0 {
(resident as f64 - allocated as f64) / allocated as f64
} else { 0.0 };
println!(" [{:<35}] allocated={:>8.1} MB resident={:>8.1} MB frag={:.3}",
label, allocated as f64 / (1024.0 * 1024.0),
resident as f64 / (1024.0 * 1024.0), frag);
}
/// Mode 1: Each thread allocates and frees its own buffers (thread-per-core).
fn same_thread_workload(num_threads: usize, duration_secs: u64) -> (usize, usize) {
let sizes = [128, 256, 512, 1024, 2048, 4096, 8192];
let max_live = 2_000;
let barrier = std::sync::Arc::new(std::sync::Barrier::new(num_threads));
let peak_rss = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let handles: Vec<_> = (0..num_threads).map(|tid| {
let barrier = barrier.clone();
let peak_rss = peak_rss.clone();
std::thread::spawn(move || {
core_affinity::set_for_current(core_affinity::CoreId { id: tid });
barrier.wait();
let deadline = Instant::now()
+ std::time::Duration::from_secs(duration_secs);
let mut live: Vec<Option<Vec<u8>>> =
(0..max_live).map(|_| None).collect();
let mut slot = 0usize;
let mut rng: u64 = 9999 + tid as u64 * 6151;
let mut ops = 0u64;
while Instant::now() < deadline {
rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
let size = sizes[(rng >> 32) as usize % sizes.len()];
let mut v = vec![0u8; size];
black_box(&mut v[0]);
live[slot] = Some(v); // old buffer freed here (same thread)
slot = (slot + 1) % max_live;
ops += 1;
if tid == 0 && ops % 50_000 == 0 {
peak_rss.fetch_max(read_rss_kb(),
std::sync::atomic::Ordering::Relaxed);
}
}
ops
})
}).collect();
let total_ops: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
let final_rss = read_rss_kb();
let peak = peak_rss.load(std::sync::atomic::Ordering::Relaxed).max(final_rss);
println!(" Total ops: {} ({:.1}M/sec)", total_ops,
total_ops as f64 / duration_secs as f64 / 1_000_000.0);
(peak, final_rss)
}
/// Mode 2: Thread i allocates, sends evicted buffer to thread (i+1) % N.
fn cross_thread_workload(num_threads: usize, duration_secs: u64) -> (usize, usize) {
let sizes = [128, 256, 512, 1024, 2048, 4096, 8192];
let max_live = 2_000;
let peak_rss = std::sync::Arc::new(std::sync::atomic::AtomicUsize::new(0));
let mut senders = Vec::new();
let mut receivers = Vec::new();
for _ in 0..num_threads {
let (tx, rx) = crossbeam::channel::bounded::<Vec<u8>>(256);
senders.push(tx);
receivers.push(rx);
}
let barrier = std::sync::Arc::new(
std::sync::Barrier::new(num_threads * 2));
let alloc_handles: Vec<_> = (0..num_threads).map(|tid| {
let barrier = barrier.clone();
let sender = senders[(tid + 1) % num_threads].clone();
let peak_rss = peak_rss.clone();
std::thread::spawn(move || {
barrier.wait();
let deadline = Instant::now()
+ std::time::Duration::from_secs(duration_secs);
let mut live: Vec<Option<Vec<u8>>> =
(0..max_live).map(|_| None).collect();
let mut slot = 0usize;
let mut rng: u64 = 7777 + tid as u64 * 3571;
let mut ops = 0u64;
while Instant::now() < deadline {
rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
let size = sizes[(rng >> 32) as usize % sizes.len()];
let mut v = vec![0u8; size];
black_box(&mut v[0]);
if let Some(old) = live[slot].take() {
let _ = sender.send(old); // cross-thread free
}
live[slot] = Some(v);
slot = (slot + 1) % max_live;
ops += 1;
if tid == 0 && ops % 50_000 == 0 {
peak_rss.fetch_max(read_rss_kb(),
std::sync::atomic::Ordering::Relaxed);
}
}
drop(live);
ops
})
}).collect();
let free_handles: Vec<_> = (0..num_threads).map(|tid| {
let barrier = barrier.clone();
let receiver = receivers[tid].clone();
std::thread::spawn(move || {
barrier.wait();
while let Ok(buf) = receiver.recv() { drop(buf); }
})
}).collect();
let total_ops: u64 = alloc_handles.into_iter()
.map(|h| h.join().unwrap()).sum();
drop(senders);
for h in free_handles { h.join().unwrap(); }
let final_rss = read_rss_kb();
let peak = peak_rss.load(std::sync::atomic::Ordering::Relaxed).max(final_rss);
println!(" Total ops: {} ({:.1}M/sec)", total_ops,
total_ops as f64 / duration_secs as f64 / 1_000_000.0);
(peak, final_rss)
}
fn main() {
let threads = num_cpus::get().min(16);
let duration = 15;
println!("--- Same-thread free (thread-per-core) ---");
print_jemalloc_stats("before TPC");
let (tpc_peak, tpc_final) = same_thread_workload(threads, duration);
print_jemalloc_stats("after TPC");
println!(" Peak RSS: {:.1} MB, Final: {:.1} MB",
tpc_peak as f64 / 1024.0, tpc_final as f64 / 1024.0);
println!("\n Waiting 15s for decay...\n");
std::thread::sleep(std::time::Duration::from_secs(15));
println!("--- Cross-thread free (work-stealing) ---");
print_jemalloc_stats("before work-stealing");
let (ws_peak, ws_final) = cross_thread_workload(threads, duration);
print_jemalloc_stats("after work-stealing");
println!(" Peak RSS: {:.1} MB, Final: {:.1} MB",
ws_peak as f64 / 1024.0, ws_final as f64 / 1024.0);
println!("\n=== Summary ===");
println!(" Thread-per-core: peak {:>6} KB, final {:>6} KB", tpc_peak, tpc_final);
println!(" Work-stealing: peak {:>6} KB, final {:>6} KB", ws_peak, ws_final);
println!(" Overhead: {:.2}x peak, {:.2}x final",
ws_peak as f64 / tpc_peak as f64, ws_final as f64 / tpc_final as f64);
}
The same total volume of allocations and frees, but cross-thread freeing produces significantly higher fragmentation. The freed memory is in the wrong arena — it was allocated by Thread A but sits in Thread A's freelist even though Thread B freed it. Thread A may have moved on to different allocation sizes, leaving those freed blocks stranded.
In a Tokio work-stealing pool with N workers, a task whose placement is effectively random has an (N-1)/N probability of being on a different thread at free time than at allocation time. With 16 threads, roughly 94% of frees are cross-thread; with 24 threads, roughly 96%.
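That figure assumes the freeing thread is uniform-random and independent of the allocating one — a simplification of the real scheduler, but easy to sanity-check with the same LCG style the benchmarks use:

```rust
// Monte Carlo check of the (N-1)/N cross-thread-free probability,
// assuming uniform random thread placement.
fn cross_thread_pct(n: usize, trials: u32) -> f64 {
    let mut rng: u64 = 42;
    let mut cross = 0u32;
    for _ in 0..trials {
        rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
        let alloc_thread = (rng >> 33) as usize % n;
        rng = rng.wrapping_mul(6364136223846793005).wrapping_add(1);
        let free_thread = (rng >> 33) as usize % n;
        if alloc_thread != free_thread { cross += 1; }
    }
    100.0 * cross as f64 / trials as f64
}

fn main() {
    for &n in &[16usize, 24] {
        // Expected: (n - 1) / n — about 93.8% for 16 threads, 95.8% for 24.
        println!("N={:>2}: {:.1}% of frees are cross-thread",
                 n, cross_thread_pct(n, 1_000_000));
    }
}
```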
This is snmalloc's core insight: rather than having the freeing thread push blocks into a shared arena (where they sit unnoticed until something scans for them), snmalloc batches freed objects and sends them back to the allocating thread via message passing. The allocating thread drains its queue on its own allocation path and reuses the blocks immediately. For producer/consumer workloads — and work-stealing is exactly this — snmalloc eliminates the cross-thread fragmentation penalty.
jemalloc Tuning
jemalloc is the most widely deployed alternative allocator in the Rust ecosystem (used by TiKV, Cloudflare, Firefox). Understanding its tuning knobs lets you optimize for your specific workload.
Setup is three lines:
// Cargo.toml: tikv-jemallocator = "0.6"
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
// Optional: compile-time configuration via prefixed symbol
#[allow(non_upper_case_globals)]
#[export_name = "_rjem_malloc_conf"]
pub static malloc_conf: &[u8] =
b"background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:10000\0";
The key parameter is dirty_decay_ms:
how long dirty pages (freed but not returned to OS) wait before
being purged via madvise. Lower values return memory
faster but increase syscall overhead:
You can reproduce these results by saving the async server pattern
example from above as a standalone binary and running it with
different decay settings:
_RJEM_MALLOC_CONF="dirty_decay_ms:0" cargo run --release
_RJEM_MALLOC_CONF="dirty_decay_ms:1000" cargo run --release
_RJEM_MALLOC_CONF="dirty_decay_ms:10000" cargo run --release
Compare the "after decay" resident values across runs.
The workload time shows the cost: dirty_decay_ms=0
is 17% slower because jemalloc calls madvise on
every deallocation. But the post-drop difference is striking:
14.7 MB vs 83.2 MB vs 106.7 MB of irreducible resident memory.
With immediate purging, pages return to the OS during the
workload — when new allocations arrive, jemalloc requests fresh
pages that can be packed efficiently. With delayed purging,
dirty pages accumulate during the burst, and the allocator reuses
them in-place. These reused pages become partially occupied with
mixed size classes, creating fragmentation that persists even
after all application data is freed. The timing of purging
affects the final fragmentation state.
Other important parameters:
- muzzy_decay_ms: second-tier decay. After dirty pages are purged with MADV_FREE, they become "muzzy" — the kernel may reclaim them under pressure, but they can be reused without a page fault if still present. Default 10 seconds.
- background_thread: when true, jemalloc spawns its own threads for decay purging. Without this, purging piggybacks on malloc/free calls, adding latency jitter to your application.
- narenas: number of arenas. Default is 4x CPU count. Fewer arenas mean less fragmentation (fewer places for memory to hide) but more contention on the arena locks.
You can set these at runtime via the
_RJEM_MALLOC_CONF environment variable
(tikv-jemallocator uses prefixed symbols, so the
standard MALLOC_CONF has no effect) or
programmatically via tikv_jemalloc_ctl:
use tikv_jemalloc_ctl::raw;

fn set_decay(dirty_ms: isize, muzzy_ms: isize) {
    // The "arenas.*" keys set the default used when new arenas are created;
    // already-existing arenas are adjusted via "arena.<i>.dirty_decay_ms".
    unsafe { raw::write(b"arenas.dirty_decay_ms\0", dirty_ms) }.unwrap();
    unsafe { raw::write(b"arenas.muzzy_decay_ms\0", muzzy_ms) }.unwrap();
}
Below the Allocator: Kernel Memory
The allocator manages memory above the OS. But the kernel's decisions about physical pages affect fragmentation independently of what the allocator does.
mmap vs brk. Two ways for allocators to get memory.
brk extends the data segment linearly — simple, but can
only release memory from the top. If the highest allocation
persists, nothing below it can be returned. mmap creates
independent mappings that can be returned individually, but each
mapping adds a VMA (Virtual Memory Area) entry to the kernel's
page table, and munmap is expensive.
MADV_DONTNEED vs MADV_FREE. When an allocator
wants to return pages without unmapping the address range, it uses
madvise. Two strategies:
- MADV_DONTNEED: the kernel immediately unmaps the physical pages. The next access faults, and the kernel supplies a fresh zero-filled page. ptmalloc2 (glibc) uses this.
- MADV_FREE (Linux 4.5+): the kernel marks pages as lazily reclaimable. With no memory pressure, the pages retain their content and can be reused without a fault; under pressure, the kernel reclaims them. jemalloc uses this.
The difference matters for allocate-free-reuse cycles. With
MADV_FREE, reusing recently-freed pages is nearly
free — no page fault, no zero-fill. With MADV_DONTNEED,
every reuse pays the full fault cost:
// Cargo.toml dependencies:
// libc = "0.2"
use std::time::Instant;
unsafe fn bench_madvise(strategy: libc::c_int, size: usize,
iters: usize) -> std::time::Duration {
    let ptr = unsafe {
        libc::mmap(std::ptr::null_mut(), size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_ANONYMOUS | libc::MAP_PRIVATE, -1, 0)
    };
    assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
let start = Instant::now();
for _ in 0..iters {
unsafe { std::ptr::write_bytes(ptr as *mut u8, 0xAA, size) };
unsafe { libc::madvise(ptr, size, strategy) };
unsafe { std::ptr::write_bytes(ptr as *mut u8, 0xBB, size) };
}
let elapsed = start.elapsed();
unsafe { libc::munmap(ptr, size) };
elapsed
}
fn main() {
let size = 4096 * 256; // 1 MB
let iterations = 10_000;
let dontneed = unsafe { bench_madvise(libc::MADV_DONTNEED, size, iterations) };
let free = unsafe { bench_madvise(libc::MADV_FREE, size, iterations) };
println!("MADV_DONTNEED: {:?} ({:.1} µs/iter)",
dontneed, dontneed.as_micros() as f64 / iterations as f64);
println!("MADV_FREE: {:?} ({:.1} µs/iter)",
free, free.as_micros() as f64 / iterations as f64);
println!("DONTNEED/FREE: {:.2}x",
dontneed.as_nanos() as f64 / free.as_nanos() as f64);
}
Transparent Huge Pages (THP). Linux can collapse
512 contiguous 4 KB pages into a single 2 MB huge page, reducing TLB
pressure. But THP requires contiguous physical memory — exactly what
fragmentation destroys. The khugepaged kernel thread
scans for collapsible regions, causing unpredictable latency spikes.
For long-running services with fragmented heaps, THP churns without
benefit and adds tail latency.
Recommendation: disable THP for latency-sensitive services
(echo madvise > /sys/kernel/mm/transparent_hugepage/enabled),
or use explicit MADV_HUGEPAGE hints only on known-large
contiguous allocations.
Thread-Per-Core: The Allocator Wins Too
The previous article showed thread-per-core delivering 5.75x higher echo server throughput, primarily from eliminating scheduler coordination and improving cache locality. But there is a third advantage: allocator behavior.
In thread-per-core, each thread's allocator state is perfectly aligned with its workload. Allocations and frees happen on the same thread. The thread cache never accumulates foreign freed objects. Adjacent freed blocks belong to the same arena, so the allocator can coalesce them. Pages return to the OS faster because the allocator's view of "free" and the application's view of "done" are synchronized.
The benchmark above already demonstrates this: the same allocator (jemalloc), same buffer sizes, same total allocation volume — but thread-per-core produces 1.39x lower peak RSS and 1.97x lower final RSS. The 120.7 MB retained by work-stealing versus 61.1 MB by thread-per-core is entirely due to cross-thread freeing scattering dirty pages across arenas that no thread revisits.
This completes the thread-per-core advantage from the previous article: it is not just about cache locality and scheduler overhead. The memory allocator itself performs better when tasks stay on one thread. Three independent systems — the scheduler, the CPU cache, and the allocator — all reward the same architectural choice.
Switching Allocators in Rust
Rust's #[global_allocator] attribute makes switching
allocators trivial — three lines of code, no other changes. Every
Box, Vec, String, and
HashMap automatically uses the new allocator.
jemalloc (best general-purpose choice for servers):
// Cargo.toml: tikv-jemallocator = "0.6"
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
mimalloc (fast allocation, compact metadata):
// Cargo.toml: mimalloc = { version = "0.1", default-features = false }
#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;
snmalloc (best for cross-thread workloads):
// Cargo.toml: snmalloc-rs = "0.3"
#[global_allocator]
static GLOBAL: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
Measure before and after with /proc/self/statm or
jemalloc's stats API. If your service's RSS-to-allocated ratio
exceeds 1.5, switching allocators is the lowest-effort fix.
Key Takeaways
- Fragmentation is not a leak. Every allocation has a matching free. The allocator retains pages because freed blocks are scattered across arenas in holes too small or too dispersed to coalesce.
- Allocators have three layers. Thread caches (fast, no sync), central arenas (slow, some sync), OS interface (slowest, syscalls). Most malloc/free calls never leave the thread cache.
- Cross-thread freeing causes fragmentation. Work-stealing moves tasks between threads. Memory allocated on Thread A, freed on Thread B, accumulates as unreclaimable holes in A's arena.
- Thread-per-core aligns allocator caches with the workload. Allocations and frees on the same thread mean perfect thread cache utilization and efficient coalescing. Three systems — scheduler, CPU cache, allocator — all reward the same design.
- jemalloc's decay mechanism is the key differentiator. It returns memory to the OS on a tunable schedule instead of only when arenas fully drain (glibc) or through expensive immediate unmapping.
- The kernel participates. MADV_FREE avoids refault costs that MADV_DONTNEED incurs. THP competes with fragmented heaps for contiguous physical memory. Disable THP for latency-sensitive services.
- Switching allocators is three lines of code. tikv-jemallocator or mimalloc as #[global_allocator]. No other code changes needed.
- Architecture shapes allocator behavior. Cross-thread freeing with jemalloc retains 1.97x the memory of same-thread freeing with the same allocator and the same workload. Choosing where frees happen matters alongside choosing which allocator runs them.
Further Reading
- tikv-jemallocator — jemalloc bindings for Rust (used by TiKV, Cloudflare)
- mimalloc — Microsoft Research's compact allocator with eager page reset
- snmalloc-rs — message-passing allocator for cross-thread workloads
- jemalloc(3) — man page with all configuration options
- jemalloc tuning guide — official tuning recommendations
- Cloudflare: switching to TCMalloc — 3x memory reduction from glibc in production
- Transparent Huge Pages — Linux kernel documentation
- madvise(2) — MADV_DONTNEED vs MADV_FREE semantics