The Memory Layout of Rust Futures

When you write an async fn in Rust, the compiler transforms it into a state machine. This isn't just an implementation detail — understanding how this transformation works explains why Rust's async is zero-cost, why futures have the sizes they do, and why early implementations had a bug that caused 400 KB state machines.

Async Functions Are Enums

Consider a simple async function that makes a few network calls:

async fn fetch_user_data(id: u64) -> UserData {
    let token = get_auth_token().await;
    let profile = fetch_profile(id, &token).await;
    let settings = fetch_settings(id, &token).await;
    UserData { profile, settings }
}

The compiler transforms this into something conceptually like:

enum FetchUserDataFuture {
    GettingToken {
        id: u64,
        fut: GetAuthTokenFuture,
    },
    FetchingProfile {
        id: u64,
        token: Token,
        fut: FetchProfileFuture,
    },
    FetchingSettings {
        profile: Profile,
        fut: FetchSettingsFuture,
    },
    Complete,
}

Each .await point becomes a variant in the enum. The future stores whatever state it needs to resume execution: local variables that survive across await points, plus the sub-future being awaited.

The Size of an Enum

Here's the key insight: an enum only needs enough space to hold its largest variant. When our future transitions from GettingToken to FetchingProfile, the bytes that held GetAuthTokenFuture get reused for FetchProfileFuture. The compiler knows these sub-futures are never alive at the same time.

This means the size of a future grows as the maximum of its sub-futures, not the sum. If you await three futures of 100, 200, and 150 bytes, your future is roughly 200 bytes plus overhead — not 450.
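You can see the same rule with a plain enum. A sketch (Rust's default enum layout is unspecified, but current compilers lay this one out as the largest variant plus a one-byte tag):

enum Demo {
    A([u8; 100]),
    B([u8; 200]),
    C([u8; 150]),
}

// [u8; N] has alignment 1 and no niche for the tag to hide in, so the
// enum is its largest variant (200 bytes) plus 1 byte of discriminant.
assert_eq!(std::mem::size_of::<Demo>(), 201);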

Multi-Variant Layout Optimization

The term "multi-variant layout" refers to how the compiler assigns memory locations to variables across the future's state machine. Rather than giving each variable its own dedicated bytes, the compiler analyzes which variables are alive at each await point. Variables that are never alive at the same time can share the same memory location — even if they have different types or sizes.

Consider a loop with an await inside:

async fn process(xs: Vec<i32>) -> i32 {
    let mut sum = 0;
    for x in xs.iter() {
        sum += expensive_computation(*x).await;
    }
    sum
}

Each iteration of the loop has its own x value and its own temporary state from expensive_computation. But iteration 1's temporaries are completely dead before iteration 2 begins. The compiler recognizes this: the memory used for iteration 1's state can be reused for iteration 2, then for iteration 3, and so on.

Without this optimization, a loop with 1000 iterations might need space for 1000 copies of the temporary state. With it, the future only needs space for one iteration's state — the rest share that same memory.

This is what makes the layout "multi-variant": the same bytes in memory serve different purposes depending on which state the future is in. The compiler builds a mapping between variables and memory locations that changes as the state machine progresses.
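As a rough check, with a trivial stub standing in for expensive_computation (which the snippet above leaves undefined), the future's size is the same no matter how many elements you feed it:

async fn expensive_computation(x: i32) -> i32 { x * 2 }

// The state machine stores the Vec, the iterator, the accumulator, and one
// iteration's temporaries; nothing scales with the element count.
let short = process(vec![0; 10]);
let long = process(vec![0; 10_000]);
assert_eq!(
    std::mem::size_of_val(&short),   // same type, same size,
    std::mem::size_of_val(&long),    // regardless of input length
);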

Measuring Real Futures

Let's verify this with actual measurements. All numbers below were taken on Rust 1.93.0 (x86_64, release mode).

To measure a future's size, we use a helper function that exploits monomorphization:

fn size_of_future<F: Future>(_: &F) -> usize {
    std::mem::size_of::<F>()
}

// Usage:
let fut = some_async_fn();
println!("{} bytes", size_of_future(&fut));

Our examples use a yield_now() helper that creates a single await point. This forces the compiler to generate a multi-state future, which is what we want to measure:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

async fn yield_now() {
    struct YieldNow(bool);

    impl Future for YieldNow {
        type Output = ();

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.0 {
                Poll::Ready(())
            } else {
                self.0 = true;
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }

    YieldNow(false).await
}

The implementation is simple: YieldNow is a manual future that returns Poll::Pending on the first poll (after scheduling a wake-up), then Poll::Ready(()) on the second. This creates exactly one suspension point — the async function must save its state, yield control, and resume later.

Without an await point, an async function compiles to a trivial future with no state to preserve. By adding yield_now().await, we force variables declared before it to be stored in the future's state machine, letting us measure their impact on size.
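For contrast, here's the no-await case (a sketch; the expectation follows from the reasoning above):

async fn no_await() -> u64 {
    let x: u64 = 1;
    x
}

// `x` lives and dies within a single poll, so it never needs a slot in the
// state machine. Expect size_of_val to report 1 byte: just the discriminant.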

One more detail: we wrap values in std::hint::black_box(). This is an optimization barrier that prevents the compiler from seeing through the value and optimizing it away. Without it, the compiler might notice that our array is just zeros and never actually read, and eliminate it entirely — giving us misleading size measurements.

use std::hint::black_box;

async fn with_array() -> [u8; 64] {
    let arr: [u8; 64] = black_box([0; 64]);  // compiler can't optimize this away
    yield_now().await;
    arr
}

First, basic futures holding different types:

async fn empty() {}                        //  1 byte

async fn with_u8() -> u8 {
    let x: u8 = black_box(1);
    yield_now().await;
    x
}                                          //  4 bytes

async fn with_u64() -> u64 {
    let x: u64 = black_box(1);
    yield_now().await;
    x
}                                          // 16 bytes

async fn with_u128() -> u128 {
    let x: u128 = black_box(1);
    yield_now().await;
    x
}                                          // 32 bytes

The pattern: variable size plus a few bytes of overhead (discriminant, sub-future state, padding for alignment). An empty future still needs 1 byte for the discriminant.
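A rough accounting for with_u64, for example (a sketch; the exact layout is the compiler's to choose):

//   8 bytes  the u64 local
// + 2 bytes  the yield_now sub-future (confirmed by -Z print-type-sizes below)
// + 1 byte   discriminant
// = 11 bytes, rounded up to 16 by the u64's 8-byte alignment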

With arrays:

async fn with_array_64() -> [u8; 64] {
    let arr: [u8; 64] = black_box([0; 64]);
    yield_now().await;
    arr
}                                          //   67 bytes

async fn with_array_1024() -> [u8; 1024] {
    let arr: [u8; 1024] = black_box([0; 1024]);
    yield_now().await;
    arr
}                                          // 1027 bytes

The overhead is consistent: about 3 bytes beyond the array size.

Memory Reuse in Practice

Here's the proof that the compiler reuses memory across states:

async fn buffers_sequential() -> u8 {
    let r1 = {
        let buf1: [u8; 256] = black_box([1; 256]);
        yield_now().await;
        buf1[0]
    };
    let r2 = {
        let buf2: [u8; 256] = black_box([2; 256]);
        yield_now().await;
        buf2[0]
    };
    r1.wrapping_add(r2)
}                                          // 260 bytes

async fn buffers_overlapping() -> u8 {
    let buf1: [u8; 256] = black_box([1; 256]);
    let buf2: [u8; 256] = black_box([2; 256]);
    yield_now().await;
    buf1[0].wrapping_add(buf2[0])
}                                          // 515 bytes

The sequential version uses 260 bytes — roughly one buffer plus overhead. The overlapping version uses 515 bytes — both buffers must exist simultaneously. The compiler recognizes that buf1 and buf2 in the sequential case are never alive at the same time and reuses the memory.

Conceptually, the compiler generates:

enum BuffersSequentialFuture {
    Awaiting1 { buf1: [u8; 256] },
    Awaiting2 { buf2: [u8; 256], r1: u8 },
    Done,
}

An enum only needs space for its largest variant, so the 256-byte buffers share the same memory region.
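On current compilers, the hand-written enum measures what this reasoning predicts (a sketch; enum layout is unspecified, but these variants offer no niche for the tag):

// Largest variant is Awaiting2 (256 + 1 = 257 bytes) plus 1 byte of
// discriminant. The real future adds 2 more bytes for the yield_now awaitee.
assert_eq!(std::mem::size_of::<BuffersSequentialFuture>(), 258);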

The Exponential Growth Bug

Early Rust async implementations (around 2018-2019) had a critical bug: future sizes grew exponentially with nesting depth. If function A awaited function B, which awaited function C, which awaited function D — each level doubled the size instead of taking the maximum.

Consider what should happen with nested awaits. When outer awaits inner, the outer future's state machine has variants like:

enum OuterFuture {
    BeforeAwait { local_data: [u8; 100] },
    AwaitingInner { inner: InnerFuture, local_data: [u8; 100] },
    Done,
}

The AwaitingInner variant contains the entire InnerFuture. If InnerFuture is 200 bytes and local_data is 100 bytes, the variant is 300 bytes. That's correct — they're alive simultaneously.

The bug was that the compiler wasn't recognizing when nested sub-futures could share memory with other variants. The layout algorithm treated each await point independently, allocating fresh space for each sub-future even when they couldn't be alive at the same time. This caused memory usage to compound at each nesting level.

The Fuchsia project at Google discovered this when some async functions produced state machines over 400 KB. The fix required substantial compiler work — the layout code assumed every multi-variant type was a simple enum with one discriminant and no overlap. Generators needed something more sophisticated: layouts where the same bytes could belong to different variants depending on the current state.

Today, nesting is linear:

async fn level_0() -> u8 {
    let x: u8 = black_box(1);
    yield_now().await;
    x
}                                          //  4 bytes

async fn level_1() -> u8 { level_0().await } //  5 bytes (+1)
async fn level_2() -> u8 { level_1().await } //  6 bytes (+1)
async fn level_3() -> u8 { level_2().await } //  7 bytes (+1)
async fn level_4() -> u8 { level_3().await } //  8 bytes (+1)
async fn level_5() -> u8 { level_4().await } //  9 bytes (+1)

Each nesting level adds 1 byte — just enough for the discriminant of the wrapper future. Linear growth, not exponential.

Cache Lines: The 128-Byte Target

Modern CPUs don't fetch memory byte by byte — they load entire cache lines, typically 64 bytes at a time. When you access a single byte, the CPU loads all 64 bytes surrounding it into L1 cache. This is why data locality matters so much for performance.

For async runtimes, this has direct implications. When the executor polls a future, it needs to access the future's state. If that state fits in one cache line (64 bytes), a single memory fetch gets everything. If it spans two cache lines (65-128 bytes), you need two fetches. Beyond that, performance degrades further.

| Data Size | Future Size | Cache Lines |
| --------- | ----------- | ----------- |
| 56 bytes  | 59 bytes    | 1           |
| 120 bytes | 123 bytes   | 2           |
| 180 bytes | 183 bytes   | 3           |
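A small helper makes this easy to check for any future (a sketch; cache_lines is a hypothetical name, and 64 bytes is the typical x86_64 line size):

fn cache_lines<F>(fut: &F) -> usize {
    // Round the state's size up to whole 64-byte cache lines.
    (std::mem::size_of_val(fut) + 63) / 64
}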

The 128-byte target isn't arbitrary. Tokio's scheduler stores tasks in a linked list where each node contains the future plus metadata (waker, state flags, links). The scheduler is carefully designed so that the "hot" data — the fields accessed on every poll — fits in the first cache line. When your future is small, the entire task often fits in two cache lines, making scheduling decisions fast.

Larger futures don't just use more memory; they cause more cache misses during execution, which can dominate runtime in I/O-heavy applications where futures are polled frequently. Futures between 128 bytes and a few kilobytes work fine — they just won't be as cache-efficient as smaller ones.

Boxing: The Escape Hatch

When a future is too large, Box::pin() moves it to the heap:

async fn large_future() -> [u8; 4096] {
    let data: [u8; 4096] = black_box([0; 4096]);
    yield_now().await;
    data
}

// Inline: 4099 bytes on stack
// Boxed:     8 bytes on stack (a thin Pin<Box<F>> pointer to the heap;
//            type-erased Pin<Box<dyn Future>> is a 16-byte fat pointer)

The tradeoff: you pay for a heap allocation, but the stack footprint drops from 4 KB to a single pointer: 8 bytes for a concrete Pin<Box<F>>, or 16 bytes for the type-erased Pin<Box<dyn Future>> form on 64-bit systems. This matters when futures are nested: a chain of 10 inline 4 KB futures would consume roughly 40 KB of stack, but boxed and type-erased they use only 160 bytes of pointers.
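Awaiting a boxed future needs no extra ceremony, since Pin<Box<F>> itself implements Future. A minimal sketch:

async fn caller() -> u8 {
    // Only the pointer is stored in caller's state machine;
    // the 4 KB of buffer state lives on the heap.
    let data = Box::pin(large_future()).await;
    data[0]
}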

Boxing is also required for recursive async functions. Since the future's size must be known at compile time, and a recursive type has infinite size, you must box the recursive call:

async fn recursive(n: u32) -> u32 {
    if n == 0 {
        0
    } else {
        // Box::pin breaks the infinite size cycle
        Box::pin(recursive(n - 1)).await + 1
    }
}

Recent versions of Tokio handle this automatically for spawned tasks: when you pass a sufficiently large future to tokio::spawn, it boxes the future for you, with no manual intervention needed. You can also opt in explicitly:

// Explicit boxing for trait objects
fn boxed_future() -> Pin<Box<dyn Future<Output = i32> + Send>> {
    Box::pin(async { 42 })
}

The futures crate provides a convenient type alias if you add it to your dependencies (futures = "0.3"):

use futures::future::BoxFuture;

fn another_boxed() -> BoxFuture<'static, i32> {
    Box::pin(async { 42 })
}

The general guidance: keep hot-path futures small and inline for cache efficiency. Box futures that are large, recursive, or need to be trait objects.

The u128 Alignment Fix (LLVM 18 / Rust 1.78)

Alignment determines where a value can be placed in memory. A type with 8-byte alignment must start at an address divisible by 8. Proper alignment matters because CPUs access aligned data faster — misaligned access may require multiple memory operations or even trap on some architectures.

The x86_64 System V ABI specifies 16-byte alignment for i128/u128, but before Rust 1.78 these types were in practice aligned to only 8 bytes [1]. This was a long-standing LLVM bug: the codegen simply ignored the ABI requirement. The consequence: a u128 could land at an offset where it straddles a cache line boundary.

#[repr(C)]
struct Data {
    _offset: [u8; 56],
    value: u128,
}

// Before LLVM 18 fix (align = 8, size = 72)
//
//  0                                56      64      72
//  ├────────────────────────────────┼───────┼───────┤
//  │            _offset             │    value      │
//  └────────────────────────────────┴───────┴───────┘
//                                       ↑
//                              cache line boundary
//                              (value crosses it!)
//
// After LLVM 18 fix (align = 16, size = 80)
//
//  0                                56      64              80
//  ├────────────────────────────────┼───────┼───────────────┤
//  │            _offset             │ (pad) │     value     │
//  └────────────────────────────────┴───────┴───────────────┘
//                                           ↑
//                                  cache line boundary
//                                  (value starts here)

When a u128 straddles cache lines, reading it requires fetching two cache lines and combining the bytes. LLVM 18 (shipped with Rust 1.78) finally corrected the alignment to match the x86_64 ABI. Benchmarks showed up to 12% performance improvement for code using 128-bit integers.

// Verify the fix
assert_eq!(std::mem::size_of::<u128>(), 16);
assert_eq!(std::mem::align_of::<u128>(), 16);  // LLVM bug caused 8 before 1.78

For futures, this matters when your state contains u128, i128, or any type with 16-byte alignment. The future's overall alignment becomes at least 16 bytes, and the compiler adds padding to maintain alignment — potentially increasing the future's size.
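A sketch checking this against the with_u128 future measured earlier:

let fut = with_u128();
// The u128 held across the await drives the state machine's alignment up to
// 16, and its size rounds up to a multiple of that: 19 bytes of payload
// becomes 32.
assert_eq!(std::mem::align_of_val(&fut), 16);
assert_eq!(std::mem::size_of_val(&fut), 32);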

Why Pin Exists

The state machine transformation is what makes Rust async "zero-cost" in the systems programming sense. There's no implicit heap allocation, no hidden runtime overhead, no garbage collector. A future is just an enum that knows how to resume itself.

But there's a complication. Consider this async function:

async fn self_referential() {
    let data = [0u8; 1024];
    let reference = &data[0];  // points into data
    yield_now().await;
    println!("{}", reference); // reference must still be valid
}

After the await point, both data and reference must be stored in the future's state. But reference points to data — it's a self-referential struct. If you move this future to a different memory address, reference becomes a dangling pointer.

This is why Pin exists. A Pin<&mut F> is a pointer to a future that promises not to move it. The Future trait requires Pin<&mut Self> in its poll method:

pub trait Future {
    type Output;
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

Once you pin a future and start polling it, you've committed to not moving it. The compiler generates self-referential state machines knowing that Pin will protect them. This is also why Box::pin() is common — putting the future on the heap and pinning it there ensures it won't move even if the Box itself moves around.
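You don't always need the heap: the std::pin::pin! macro (stable since Rust 1.68) pins a future in the current stack frame. A minimal sketch:

let fut = std::pin::pin!(self_referential());
// `fut` is Pin<&mut impl Future>: the state machine stays put in this stack
// frame for as long as the pin lives, so self-references remain valid.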

Manual Enum Comparison

To verify the compiler's behavior, we can compare against hand-written enums. The type_layout crate visualizes struct layouts (it doesn't support enums directly, but we can examine each variant's data):

use type_layout::TypeLayout;

// Visualize each variant's data as a struct
#[derive(TypeLayout)]
#[repr(C)]
struct Awaiting1Data {
    buffer: [u8; 64],
    state: u8,
}

#[derive(TypeLayout)]
#[repr(C)]
struct Awaiting2Data {
    result: [u8; 128],
    state: u8,
}

// The actual enum
enum SimulatedFuture {
    Start,
    Awaiting1 { buffer: [u8; 64], state: u8 },
    Awaiting2 { result: [u8; 128], state: u8 },
    Done,
}

fn main() {
    println!("{}", Awaiting1Data::type_layout());
    println!("{}", Awaiting2Data::type_layout());
    println!("Enum size: {} bytes", std::mem::size_of::<SimulatedFuture>());
}

Output:

Awaiting1Data (size 65, alignment 1)
| Offset | Name   | Size |
| ------ | ------ | ---- |
| 0      | buffer | 64   |
| 64     | state  | 1    |

Awaiting2Data (size 129, alignment 1)
| Offset | Name   | Size |
| ------ | ------ | ---- |
| 0      | result | 128  |
| 128    | state  | 1    |

Enum size: 130 bytes

The enum is 130 bytes: the largest variant (129 bytes for Awaiting2) plus 1 byte for the discriminant. The Start and Done variants use no additional space — they share the same memory region as the larger variants. This is exactly how compiler-generated futures work.

Inspecting the Real State Machine

You can see the actual compiler-generated state machine using nightly Rust's -Z print-type-sizes flag:

cargo +nightly rustc -- -Z print-type-sizes

This reveals the true variant structure. Here's what our buffer examples produce:

// buffers_sequential: 260 bytes
type: `{async fn body of buffers_sequential()}`: 260 bytes, alignment: 1 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
    variant `Suspend0`: 259 bytes
        padding: 1 bytes
        local `.__awaitee`: 2 bytes, type: {async fn body of yield_now()}
        local `.buf1`: 256 bytes
    variant `Suspend1`: 259 bytes
        local `.r1`: 1 bytes
        local `.__awaitee`: 2 bytes, type: {async fn body of yield_now()}
        local `.buf2`: 256 bytes
    variant `Returned`: 0 bytes
    variant `Panicked`: 0 bytes

// buffers_overlapping: 515 bytes
type: `{async fn body of buffers_overlapping()}`: 515 bytes, alignment: 1 bytes
    discriminant: 1 bytes
    variant `Unresumed`: 0 bytes
    variant `Suspend0`: 514 bytes
        local `.buf1`: 256 bytes
        local `.buf2`: 256 bytes
        local `.__awaitee`: 2 bytes, type: {async fn body of yield_now()}
    variant `Returned`: 0 bytes
    variant `Panicked`: 0 bytes

This confirms the memory reuse we discussed. In buffers_sequential, the compiler generates two suspend variants: Suspend0 holds buf1, while Suspend1 holds buf2 plus the result r1 from the first block. Since these are separate variants, the 256-byte buffers share the same memory slot.

In buffers_overlapping, both buffers appear in the same Suspend0 variant because they're both alive at the await point. No sharing is possible, hence the doubled size.

The actual variant names are Unresumed (before first poll), Suspend0/Suspend1/etc. (one per await point), Returned (completed), and Panicked (unwinding). The .__awaitee field holds the sub-future being awaited.

Practical Guidelines

  1. Drop large data before awaiting. If you're done with a buffer, let it go out of scope before the next .await.
  2. Use sequential scopes. Structure your code so large variables don't overlap in lifetime. The compiler will reuse the memory.
  3. Target ≤128 bytes. Futures under two cache lines are optimal. If you can't get there, at least stay under Tokio's auto-boxing threshold.
  4. Box explicitly when needed. For very large futures, Box::pin() keeps the stack bounded at a small cost.
  5. Measure, don't guess. Use std::mem::size_of_val(&fut) or the size_of_future() helper shown earlier to check actual sizes. The compiler is smart, but sometimes surprising.
* * *

Notes

[1] The x86_64 System V ABI specifies 16-byte alignment for __int128. This alignment enables efficient use of SSE instructions (movaps, movdqa) for 128-bit loads/stores, ensures 128-bit atomic operations work correctly (cmpxchg16b requires 16-byte alignment), and guarantees a 16-byte value never crosses a 64-byte cache line boundary.
