When you write an async fn in Rust, the compiler transforms it
into a state machine. This isn't just an implementation detail — understanding
how this transformation works explains why Rust's async is zero-cost, why
futures have the sizes they do, and why early implementations had a bug that
caused 400 KB state machines.
Async Functions Are Enums
Consider a simple async function that makes a few network calls:
async fn fetch_user_data(id: u64) -> UserData {
    let token = get_auth_token().await;
    let profile = fetch_profile(id, &token).await;
    let settings = fetch_settings(id, &token).await;
    UserData { profile, settings }
}
The compiler transforms this into something conceptually like:
enum FetchUserDataFuture {
    GettingToken {
        id: u64,
        fut: GetAuthTokenFuture,
    },
    FetchingProfile {
        id: u64,
        token: Token,
        fut: FetchProfileFuture,
    },
    FetchingSettings {
        token: Token, // still borrowed by the settings future
        profile: Profile,
        fut: FetchSettingsFuture,
    },
    Complete,
}
Each .await point becomes a variant in the enum. The future
stores whatever state it needs to resume execution: local variables that
survive across await points, plus the sub-future being awaited.
The Size of an Enum
Here's the key insight: an enum only needs enough space to hold its
largest variant. When our future transitions from GettingToken
to FetchingProfile, the bytes that held GetAuthTokenFuture
get reused for FetchProfileFuture. The compiler knows these
sub-futures are never alive at the same time.
This means the size of a future grows as the maximum of its sub-futures, not the sum. If you await three futures of 100, 200, and 150 bytes, your future is roughly 200 bytes plus overhead — not 450.
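You can see the same rule with a plain enum. A minimal sketch — the payload sizes are arbitrary stand-ins for three sub-futures:

#[allow(dead_code)]
enum ThreeStates {
    A([u8; 100]),
    B([u8; 200]),
    C([u8; 150]),
}

fn main() {
    // Typically 201 on 64-bit targets: the 200-byte max variant plus a
    // one-byte tag -- not the 450-byte sum of all variants.
    println!("{}", std::mem::size_of::<ThreeStates>());
}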
Multi-Variant Layout Optimization
The term "multi-variant layout" refers to how the compiler assigns memory locations to variables across the future's state machine. Rather than giving each variable its own dedicated bytes, the compiler analyzes which variables are alive at each await point. Variables that are never alive at the same time can share the same memory location — even if they have different types or sizes.
Consider a loop with an await inside:
async fn process(xs: Vec<i32>) -> i32 {
    let mut sum = 0;
    for x in xs.iter() {
        sum += expensive_computation(*x).await;
    }
    sum
}
Each iteration of the loop has its own x value and its own
temporary state from expensive_computation. But iteration 1's
temporaries are completely dead before iteration 2 begins. The compiler
recognizes this: the memory used for iteration 1's state can be reused
for iteration 2, then for iteration 3, and so on.
Without this optimization, a loop with 1000 iterations might need space for 1000 copies of the temporary state. With it, the future only needs space for one iteration's state — the rest share that same memory.
This is what makes the layout "multi-variant": the same bytes in memory serve different purposes depending on which state the future is in. The compiler builds a mapping between variables and memory locations that changes as the state machine progresses.
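Conceptually, the compiler needs only one suspend variant for the entire loop, reused on every iteration. A rough sketch of the shape — ExpensiveComputationFuture and the field names are invented for illustration:

// Conceptual shape only; real generator layouts also track borrows.
enum ProcessFuture {
    Unresumed { xs: Vec<i32> },
    // A single variant serves every loop iteration: iteration N's
    // temporaries overwrite iteration N-1's in the same bytes.
    Suspended {
        xs: Vec<i32>,
        index: usize, // stand-in for the iterator's progress
        sum: i32,
        fut: ExpensiveComputationFuture,
    },
    Returned,
}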
Measuring Real Futures
Let's verify this with actual measurements. All numbers below were taken on Rust 1.93.0 (x86_64, release mode).
To measure a future's size, we use a helper function that exploits monomorphization:
fn size_of_future<F: Future>(_: &F) -> usize {
    std::mem::size_of::<F>()
}

// Usage:
let fut = some_async_fn();
println!("{} bytes", size_of_future(&fut));
Our examples use a yield_now() helper that creates a single
await point. This forces the compiler to generate a multi-state future,
which is what we want to measure:
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

async fn yield_now() {
    struct YieldNow(bool);

    impl Future for YieldNow {
        type Output = ();

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.0 {
                Poll::Ready(())
            } else {
                self.0 = true;
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    YieldNow(false).await
}
The implementation is simple: YieldNow is a manual future that
returns Poll::Pending on the first poll (after scheduling a
wake-up), then Poll::Ready(()) on the second. This creates
exactly one suspension point — the async function must save its state,
yield control, and resume later.
Without an await point, an async function compiles to a trivial future
with no state to preserve. By adding yield_now().await, we
force variables declared before it to be stored in the future's state
machine, letting us measure their impact on size.
One more detail: we wrap values in std::hint::black_box().
This is an optimization barrier that prevents the compiler from seeing
through the value and optimizing it away. Without it, the compiler might
notice that our array is just zeros and never actually read, and eliminate
it entirely — giving us misleading size measurements.
use std::hint::black_box;

async fn with_array() -> [u8; 64] {
    let arr: [u8; 64] = black_box([0; 64]); // compiler can't optimize this away
    yield_now().await;
    arr
}
First, basic futures holding different types:
async fn empty() {} // 1 byte

async fn with_u8() -> u8 {
    let x: u8 = black_box(1);
    yield_now().await;
    x
} // 4 bytes

async fn with_u64() -> u64 {
    let x: u64 = black_box(1);
    yield_now().await;
    x
} // 16 bytes

async fn with_u128() -> u128 {
    let x: u128 = black_box(1);
    yield_now().await;
    x
} // 32 bytes
The pattern: variable size plus a few bytes of overhead (discriminant, sub-future state, padding for alignment). An empty future still needs 1 byte for the discriminant.
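Putting the helper to work, a small driver reproduces the numbers (the comments show the measurements from this section; exact values can shift across compiler versions):

fn main() {
    println!("empty:     {} bytes", size_of_future(&empty()));     // 1
    println!("with_u8:   {} bytes", size_of_future(&with_u8()));   // 4
    println!("with_u64:  {} bytes", size_of_future(&with_u64()));  // 16
    println!("with_u128: {} bytes", size_of_future(&with_u128())); // 32
}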
With arrays:
async fn with_array_64() -> [u8; 64] {
    let arr: [u8; 64] = black_box([0; 64]);
    yield_now().await;
    arr
} // 67 bytes

async fn with_array_1024() -> [u8; 1024] {
    let arr: [u8; 1024] = black_box([0; 1024]);
    yield_now().await;
    arr
} // 1027 bytes
The overhead is consistent: about 3 bytes beyond the array size.
Memory Reuse in Practice
Here's the proof that the compiler reuses memory across states:
async fn buffers_sequential() -> u8 {
    let r1 = {
        let buf1: [u8; 256] = black_box([1; 256]);
        yield_now().await;
        buf1[0]
    };
    let r2 = {
        let buf2: [u8; 256] = black_box([2; 256]);
        yield_now().await;
        buf2[0]
    };
    r1.wrapping_add(r2)
} // 260 bytes

async fn buffers_overlapping() -> u8 {
    let buf1: [u8; 256] = black_box([1; 256]);
    let buf2: [u8; 256] = black_box([2; 256]);
    yield_now().await;
    buf1[0].wrapping_add(buf2[0])
} // 515 bytes
The sequential version uses 260 bytes — roughly one buffer
plus overhead. The overlapping version uses 515 bytes — both
buffers must exist simultaneously. The compiler recognizes that buf1
and buf2 in the sequential case are never alive at the same time
and reuses the memory.
Conceptually, the compiler generates:
enum BuffersSequentialFuture {
    Awaiting1 { buf1: [u8; 256] },
    Awaiting2 { buf2: [u8; 256], r1: u8 },
    Done,
}
An enum only needs space for its largest variant, so the 256-byte buffers share the same memory region.
The Exponential Growth Bug
Early Rust async implementations (around 2018-2019) had a critical bug: future sizes grew exponentially with nesting depth. If function A awaited function B, which awaited function C, which awaited function D — each level doubled the size instead of taking the maximum.
Consider what should happen with nested awaits. When outer awaits inner, the outer future's state machine has variants like:
enum OuterFuture {
    BeforeAwait { local_data: [u8; 100] },
    AwaitingInner { inner: InnerFuture, local_data: [u8; 100] },
    Done,
}
The AwaitingInner variant contains the entire
InnerFuture. If InnerFuture is 200 bytes
and local_data is 100 bytes, the variant is 300 bytes.
That's correct — they're alive simultaneously.
The bug was that the compiler wasn't recognizing when nested sub-futures could share memory with other variants. The layout algorithm treated each await point independently, allocating fresh space for each sub-future even when they couldn't be alive at the same time. This caused memory usage to compound at each nesting level.
The Fuchsia project at Google discovered this when some async functions produced state machines over 400 KB. The fix required substantial compiler work — the layout code assumed every multi-variant type was a simple enum with one discriminant and no overlap. Generators needed something more sophisticated: layouts where the same bytes could belong to different variants depending on the current state.
Today, nesting is linear:
async fn level_0() -> u8 {
    let x: u8 = black_box(1);
    yield_now().await;
    x
} // 4 bytes

async fn level_1() -> u8 { level_0().await } // 5 bytes (+1)
async fn level_2() -> u8 { level_1().await } // 6 bytes (+1)
async fn level_3() -> u8 { level_2().await } // 7 bytes (+1)
async fn level_4() -> u8 { level_3().await } // 8 bytes (+1)
async fn level_5() -> u8 { level_4().await } // 9 bytes (+1)
Each nesting level adds 1 byte — just enough for the discriminant of the wrapper future. Linear growth, not exponential.
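A small driver using the size_of_future helper confirms the growth rate (expected values per the comments above; they may vary by compiler version):

fn main() {
    println!("level_0: {} bytes", size_of_future(&level_0())); // 4
    println!("level_1: {} bytes", size_of_future(&level_1())); // 5
    println!("level_5: {} bytes", size_of_future(&level_5())); // 9
}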
Cache Lines: The 128-Byte Target
Modern CPUs don't fetch memory byte by byte — they load entire cache lines, typically 64 bytes at a time. When you access a single byte, the CPU loads all 64 bytes surrounding it into L1 cache. This is why data locality matters so much for performance.
For async runtimes, this has direct implications. When the executor polls a future, it needs to access the future's state. If that state fits in one cache line (64 bytes), a single memory fetch gets everything. If it spans two cache lines (65-128 bytes), you need two fetches. Beyond that, performance degrades further.
| Data Size | Future Size | Cache Lines |
|---|---|---|
| 56 bytes | 59 bytes | 1 |
| 120 bytes | 123 bytes | 2 |
| 180 bytes | 183 bytes | 3 |
The 128-byte target isn't arbitrary. Tokio's scheduler stores tasks in a linked list where each node contains the future plus metadata (waker, state flags, links). The scheduler is carefully designed so that the "hot" data — the fields accessed on every poll — fits in the first cache line. When your future is small, the entire task often fits in two cache lines, making scheduling decisions fast.
Larger futures don't just use more memory; they cause more cache misses during execution, which can dominate runtime in I/O-heavy applications where futures are polled frequently. Futures between 128 bytes and a few kilobytes work fine — they just won't be as cache-efficient as smaller ones.
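The cache-line count in the table is just a ceiling division by the line size. A helper to estimate it for any future — the 64-byte constant is an assumption that holds for current x86_64 parts, not a universal guarantee:

fn cache_lines_for<F: std::future::Future>(_: &F) -> usize {
    const CACHE_LINE: usize = 64; // typical x86_64 line size
    std::mem::size_of::<F>().div_ceil(CACHE_LINE)
}

// e.g. cache_lines_for(&with_array_1024()) == 17 (1027 bytes / 64, rounded up)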
Boxing: The Escape Hatch
When a future is too large, Box::pin() moves it to the heap:
async fn large_future() -> [u8; 4096] {
    let data: [u8; 4096] = black_box([0; 4096]);
    yield_now().await;
    data
}

// Inline: 4099 bytes on the stack
// Boxed:  16 bytes on the stack for a type-erased Pin<Box<dyn Future>>
//         (a concrete Pin<Box<F>> is a thin 8-byte pointer)
The tradeoff: you pay for a heap allocation, but the stack footprint drops from 4 KB to 16 bytes (one fat pointer on 64-bit systems). This matters when futures are nested — a chain of 10 inline 4KB futures would consume 40KB of stack, but boxed they use only 160 bytes.
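A quick measurement sketch of both forms, using the type-erased box for the 16-byte figure:

use std::future::Future;
use std::pin::Pin;

fn main() {
    let inline = large_future();
    println!("inline: {} bytes", std::mem::size_of_val(&inline)); // 4099

    let boxed: Pin<Box<dyn Future<Output = [u8; 4096]>>> = Box::pin(large_future());
    println!("boxed:  {} bytes", std::mem::size_of_val(&boxed)); // 16 (fat pointer)
}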
Boxing is also required for recursive async functions. Since the future's size must be known at compile time, and a recursive type has infinite size, you must box the recursive call:
async fn recursive(n: u32) -> u32 {
    if n == 0 {
        0
    } else {
        // Box::pin breaks the infinite size cycle
        Box::pin(recursive(n - 1)).await + 1
    }
}
Tokio handles this automatically for spawned tasks. When you pass a
large future to tokio::spawn, it detects this and boxes
it for you — no manual intervention needed. You can also
opt in explicitly:
// Explicit boxing for trait objects
fn boxed_future() -> Pin<Box<dyn Future<Output = i32> + Send>> {
    Box::pin(async { 42 })
}
The futures crate provides a convenient type alias if you
add it to your dependencies (futures = "0.3"):
use futures::future::BoxFuture;

fn another_boxed() -> BoxFuture<'static, i32> {
    Box::pin(async { 42 })
}
The general guidance: keep hot-path futures small and inline for cache efficiency. Box futures that are large, recursive, or need to be trait objects.
The u128 Alignment Fix (LLVM 18 / Rust 1.78)
Alignment determines where a value can be placed in memory. A type with 8-byte alignment must start at an address divisible by 8. Proper alignment matters because CPUs access aligned data faster — misaligned access may require multiple memory operations or even trap on some architectures.
The x86_64 System V ABI specifies 16-byte alignment for i128/u128,
but before Rust 1.78, these types were incorrectly aligned to 8 bytes in practice.1
This was a long-standing LLVM bug — the codegen ignored the ABI requirement.
The problem: a u128 could be placed at an offset that causes it
to straddle a cache line boundary.
#[repr(C)]
struct Data {
    _offset: [u8; 56],
    value: u128,
}

// Before LLVM 18 fix (align = 8, size = 72)
//
// 0                                56      64      72
// ├────────────────────────────────┼───────┼───────┤
// │            _offset             │     value     │
// └────────────────────────────────┴───────┴───────┘
//                                          ↑
//                                 cache line boundary
//                                 (value crosses it!)
//
// After LLVM 18 fix (align = 16, size = 80)
//
// 0                                56      64              80
// ├────────────────────────────────┼───────┼───────────────┤
// │            _offset             │ (pad) │     value     │
// └────────────────────────────────┴───────┴───────────────┘
//                                          ↑
//                                 cache line boundary
//                                 (value starts here)
When a u128 straddles cache lines, reading it requires
fetching two cache lines and combining the bytes. LLVM 18 (shipped with
Rust 1.78) finally corrected the alignment to match the x86_64 ABI.
Benchmarks showed up to 12% performance improvement for code using
128-bit integers.
// Verify the fix
assert_eq!(std::mem::size_of::<u128>(), 16);
assert_eq!(std::mem::align_of::<u128>(), 16); // LLVM bug caused 8 before 1.78
For futures, this matters when your state contains u128,
i128, or any type with 16-byte alignment. The future's
overall alignment becomes at least 16 bytes, and the compiler adds
padding to maintain alignment — potentially increasing the future's size.
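A companion to size_of_future makes the propagation visible (the helper name here is invented):

fn align_of_future<F: std::future::Future>(_: &F) -> usize {
    std::mem::align_of::<F>()
}

// with_u128 from earlier holds a u128 across an await point, so:
//   align_of_future(&with_u128()) == 16
//   size_of_future(&with_u128())  == 32  (19 bytes of state, padded to alignment)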
Why Pin Exists
The state machine transformation is what makes Rust async "zero-cost" in the systems programming sense. There's no implicit heap allocation, no hidden runtime overhead, no garbage collector. A future is just an enum that knows how to resume itself.
But there's a complication. Consider this async function:
async fn self_referential() {
    let data = [0u8; 1024];
    let reference = &data[0]; // points into data
    yield_now().await;
    println!("{}", reference); // reference must still be valid
}
After the await point, both data and reference
must be stored in the future's state. But reference points
to data — it's a self-referential struct. If you move this
future to a different memory address, reference becomes a
dangling pointer.
This is why Pin exists. A Pin<&mut F> is a
pointer to a future that promises not to move it. The Future
trait requires Pin<&mut Self> in its poll method:
pub trait Future {
    type Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}
Once you pin a future and start polling it, you've committed to not
moving it. The compiler generates self-referential state machines
knowing that Pin will protect them. This is also why
Box::pin() is common — putting the future on the heap
and pinning it there ensures it won't move even if the Box
itself moves around.
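Both forms of pinning are easy to compare side by side; a minimal sketch using std::pin::pin! (stable since Rust 1.68):

use std::pin::pin;

fn main() {
    // Stack pinning: no allocation; the future cannot outlive this scope.
    let stack_pinned = pin!(self_referential()); // Pin<&mut impl Future>

    // Heap pinning: one allocation; the pointer can move freely while
    // the future's bytes stay put on the heap.
    let heap_pinned = Box::pin(self_referential()); // Pin<Box<impl Future>>

    let _ = (stack_pinned, heap_pinned);
}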
Manual Enum Comparison
To verify the compiler's behavior, we can compare against hand-written enums.
The type_layout crate visualizes struct layouts (it doesn't support
enums directly, but we can examine each variant's data):
use type_layout::TypeLayout;

// Visualize each variant's data as a struct
#[derive(TypeLayout)]
#[repr(C)]
struct Awaiting1Data {
    buffer: [u8; 64],
    state: u8,
}

#[derive(TypeLayout)]
#[repr(C)]
struct Awaiting2Data {
    result: [u8; 128],
    state: u8,
}

// The actual enum
enum SimulatedFuture {
    Start,
    Awaiting1 { buffer: [u8; 64], state: u8 },
    Awaiting2 { result: [u8; 128], state: u8 },
    Done,
}

fn main() {
    println!("{}", Awaiting1Data::type_layout());
    println!("{}", Awaiting2Data::type_layout());
    println!("Enum size: {} bytes", std::mem::size_of::<SimulatedFuture>());
}
Output:
Awaiting1Data (size 65, alignment 1)
| Offset | Name | Size |
| ------ | ------ | ---- |
| 0 | buffer | 64 |
| 64 | state | 1 |
Awaiting2Data (size 129, alignment 1)
| Offset | Name | Size |
| ------ | ------ | ---- |
| 0 | result | 128 |
| 128 | state | 1 |
Enum size: 130 bytes
The enum is 130 bytes: the largest variant (129 bytes for Awaiting2)
plus 1 byte for the discriminant. The Start and Done
variants use no additional space — they share the same memory region as the
larger variants. This is exactly how compiler-generated futures work.
Inspecting the Real State Machine
You can see the actual compiler-generated state machine using nightly Rust's
-Z print-type-sizes flag:
cargo +nightly rustc -- -Z print-type-sizes
This reveals the true variant structure. Here's what our buffer examples produce:
// buffers_sequential: 260 bytes
type: `{async fn body of buffers_sequential()}`: 260 bytes, alignment: 1 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
    variant `Suspend0`: 259 bytes
        padding: 1 bytes
        local `.__awaitee`: 2 bytes, type: {async fn body of yield_now()}
        local `.buf1`: 256 bytes
    variant `Suspend1`: 259 bytes
        local `.r1`: 1 bytes
        local `.__awaitee`: 2 bytes, type: {async fn body of yield_now()}
        local `.buf2`: 256 bytes
    variant `Returned`: 0 bytes
    variant `Panicked`: 0 bytes

// buffers_overlapping: 515 bytes
type: `{async fn body of buffers_overlapping()}`: 515 bytes, alignment: 1 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
    variant `Suspend0`: 514 bytes
        local `.buf1`: 256 bytes
        local `.buf2`: 256 bytes
        local `.__awaitee`: 2 bytes, type: {async fn body of yield_now()}
    variant `Returned`: 0 bytes
    variant `Panicked`: 0 bytes
This confirms the memory reuse we discussed. In buffers_sequential,
the compiler generates two suspend variants: Suspend0 holds
buf1, while Suspend1 holds buf2 plus the
result r1 from the first block. Since these are separate variants,
the 256-byte buffers share the same memory slot.
In buffers_overlapping, both buffers appear in the same
Suspend0 variant because they're both alive at the await point.
No sharing is possible, hence the doubled size.
The actual variant names are Unresumed (before first poll),
Suspend0/Suspend1/etc. (one per await point),
Returned (completed), and Panicked (unwinding).
The .__awaitee field holds the sub-future being awaited.
Practical Guidelines
- Drop large data before awaiting. If you're done with a buffer, let it go out of scope before the next .await (see the sketch after this list).
- Use sequential scopes. Structure your code so large variables don't overlap in lifetime. The compiler will reuse the memory.
- Target ≤128 bytes. Futures under two cache lines are optimal. If you can't get there, at least stay under Tokio's auto-boxing threshold.
- Box explicitly when needed. For very large futures, Box::pin() keeps the stack bounded at a small cost.
- Measure, don't guess. Use std::mem::size_of_val(&fut) or the size_of_future() helper shown earlier to check actual sizes. The compiler is smart, but sometimes surprising.
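A sketch of the first guideline — checksum and send are hypothetical stand-ins, stubbed so the example compiles:

// Hypothetical helpers, stubbed for illustration.
fn checksum(buf: &[u8]) -> u64 {
    buf.iter().map(|&b| b as u64).sum()
}
async fn send(n: u64) { let _ = n; }

async fn handle() {
    let sum = {
        let buf = [1u8; 4096]; // large scratch buffer
        checksum(&buf)         // small derived value
    };                         // buf is dead here, before the await,
                               // so it never needs a state-machine slot
    send(sum).await
}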
Notes
1 The x86_64 System V ABI specifies 16-byte alignment for __int128.
This alignment enables efficient use of SSE instructions (movaps, movdqa)
for 128-bit loads/stores, ensures 128-bit atomic operations work correctly (cmpxchg16b
requires 16-byte alignment), and guarantees a 16-byte value never crosses a 64-byte cache line
boundary.
Further Reading
- Asynchronous Programming in Rust — the official async book
- How Rust optimizes async/await — Tyler Mandry's deep dive into the compiler work
- Rust 1.78: Performance Impact of the 128-bit Memory Alignment Fix — the u128 alignment story
- Making the Tokio scheduler 10x faster — cache line optimization in practice