Zero-Cost Abstraction in Rust: Understanding What the Compiler Actually Does

The Problem

Many of us know Rust as "as fast as C++ but safe". One of the most important mechanisms behind this claim is zero-cost abstraction. But what does this actually mean? Are Iterator chains, closures, Option and Result monadic operations really zero-cost after compilation? Or is this just a marketing slogan?

In this article, we'll look behind the scenes of the Rust compiler. By examining the assembly output on Godbolt (Compiler Explorer), we'll see with concrete examples how abstractions are optimized away.

What Is Zero-Cost Abstraction?

In Bjarne Stroustrup's formulation:

You don't pay for what you don't use
What you do use, you can't write better by hand

Rust applies this principle to its core abstractions. Generics, iterators, and closures are resolved at compile time through monomorphization and aggressive inlining, while runtime costs such as dynamic dispatch are opt-in (dyn Trait) rather than the default.

Monomorphization: The Magic Behind Generics

In Rust, generic functions and structs are compiled through monomorphization: the compiler generates separate code for each concrete type they are used with. This is similar to C++ templates, but the generic body is type-checked once against its trait bounds rather than at each instantiation.

fn max<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

fn main() {
    let x = max(3i32, 5i32);
    let y = max(3.14f64, 2.71f64);
}

When this code is compiled, two separate functions are generated, max::<i32> and max::<f64>. A simplified view of the assembly output (labels shortened; exact instructions vary with compiler version and target):

; Optimized assembly for max::<i32>
example::max_i32:
    cmp     edi, esi
    cmovl   edi, esi
    mov     eax, edi
    ret

; Optimized assembly for max::<f64>
example::max_f64:
    maxsd   xmm0, xmm1
    ret

As you can see, each instantiation uses type-specific CPU instructions: an integer compare plus conditional move (cmovl) for i32, and a scalar SSE instruction (maxsd) for f64. No vtable, dynamic dispatch, or boxing anywhere.
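To make the idea concrete, here is a minimal sketch of what monomorphization amounts to. The names max_generic, max_i32, and max_f64 are hypothetical, chosen for illustration; the compiler generates its own mangled symbols, but the generated code behaves as if the two specialized functions had been written by hand.

```rust
/// The generic version from the article, renamed to avoid clashing with std.
pub fn max_generic<T: PartialOrd>(a: T, b: T) -> T {
    if a > b { a } else { b }
}

/// Hand-written equivalent of the i32 instantiation (hypothetical name).
pub fn max_i32(a: i32, b: i32) -> i32 {
    if a > b { a } else { b }
}

/// Hand-written equivalent of the f64 instantiation (hypothetical name).
pub fn max_f64(a: f64, b: f64) -> f64 {
    if a > b { a } else { b }
}
```

Calling max_generic(3, 5) and max_i32(3, 5) should compile to the same machine code in an optimized build; pasting both into Godbolt is an easy way to check this for your compiler version.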

Iterators: Chained Operations Are Zero-Cost

One of the most misunderstood topics is iterator chains. Does .map().filter().collect() create intermediate lists like in Python? No.

pub fn sum_of_squares_even(v: &[i32]) -> i32 {
    v.iter()
        .filter(|&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

The optimized assembly for this code:

example::sum_of_squares_even:
    test    rsi, rsi
    je      .LBB0_1
    ; Main loop - single pass!
    xor     eax, eax
.LBB0_3:
    mov     ecx, dword ptr [rdi]
    mov     edx, ecx
    and     edx, 1
    imul    ecx, ecx
    neg     edx
    sbb     edx, edx
    not     edx
    and     ecx, edx
    add     eax, ecx
    add     rdi, 4
    dec     rsi
    jne     .LBB0_3
    ret
.LBB0_1:
    xor     eax, eax
    ret

The compiler merged the slice iterator, both adaptors (filter, map), and the final sum into a single loop. No intermediate storage, no function calls. That's zero-cost abstraction.

How is this possible? Iterator adaptors are generic structures, each wrapping the previous iterator:

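// Approximate type of the iterator chain (what the compiler sees).
// Closure types are anonymous, so this cannot be written out in source code.
Map<
    Filter<
        std::slice::Iter<'_, i32>,
        {closure#0}     // implements FnMut(&&i32) -> bool
    >,
    {closure#1}         // implements FnMut(&i32) -> i32
>

Because every adaptor is generic over the iterator it wraps and over its closure, rustc monomorphizes the whole chain into concrete code, and LLVM then inlines the nested next() calls. Result: a single flat loop with no per-adaptor overhead.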
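For comparison, here is a sketch of the loop you might write by hand, next to the iterator version. The function names are hypothetical. Both compute the same result, and in an optimized build they should produce very similar machine code.

```rust
/// Iterator-chain version, as in the article.
pub fn sum_of_squares_even_iter(v: &[i32]) -> i32 {
    v.iter()
        .filter(|&x| x % 2 == 0)
        .map(|&x| x * x)
        .sum()
}

/// Hand-written loop: one pass, one accumulator, no intermediate storage.
pub fn sum_of_squares_even_loop(v: &[i32]) -> i32 {
    let mut total = 0;
    for &x in v {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}
```

For the slice [1, 2, 3, 4, 5, 6] both functions return 56 (4 + 16 + 36).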

Option and Result: Null Pointer Optimization

Rust's Option<T> and Result<T, E> types carry a discriminant (tag), right? Not always.

pub fn option_size() {
    // These two types are the SAME size!
    assert_eq!(std::mem::size_of::<Option<&i32>>(), std::mem::size_of::<&i32>());
    assert_eq!(std::mem::size_of::<Option<Box<i32>>>(), std::mem::size_of::<Box<i32>>());
    assert_eq!(std::mem::size_of::<Option<std::num::NonZeroI32>>(), std::mem::size_of::<i32>());
}

This is possible thanks to null pointer optimization (NPO). Types like &T, Box<T>, and NonNull<T> can never be null, so the None value is represented by the null pointer. This way Option<&T> is exactly the same size as *const T. NonZeroI32 works the same way: the forbidden value 0 stands in for None.

// In this code, no discriminant field is allocated for Option<&T>
pub fn find_even(v: &[i32]) -> Option<&i32> {
    v.iter().find(|&x| x % 2 == 0)
}

In assembly:

example::find_even:
    test    rsi, rsi
    je      .LBB2_4
    ; rdi: slice pointer (also the current element pointer), rsi: length
    ; rdx: one-past-the-end pointer
    lea     rdx, [rdi + 4*rsi]
.LBB2_2:
    mov     eax, dword ptr [rdi]
    test    al, 1
    je      .LBB2_5        ; found even number
    add     rdi, 4
    cmp     rdi, rdx
    jne     .LBB2_2
.LBB2_4:
    xor     eax, eax       ; None -> return null pointer
    ret
.LBB2_5:
    mov     rax, rdi       ; Some(&value) -> return pointer
    ret

The Option<&i32> return value fits in a single register: null if None, pointer if Some(ptr). No tag whatsoever.
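The flip side is worth seeing too: types that have no forbidden bit pattern do pay for the tag. The sketch below uses two hypothetical helpers, option_is_free and niche_report, to compare a few types. Raw pointers and plain integers may hold any bit pattern, so Option has to add a discriminant for them.

```rust
use std::mem::size_of;
use std::num::NonZeroU32;
use std::ptr::NonNull;

/// Returns true when wrapping `T` in `Option` does not change its size.
pub fn option_is_free<T>() -> bool {
    size_of::<Option<T>>() == size_of::<T>()
}

/// Hypothetical helper: collects the result for a few representative types.
pub fn niche_report() -> Vec<(&'static str, bool)> {
    vec![
        ("&u8", option_is_free::<&u8>()),               // never null
        ("Box<u8>", option_is_free::<Box<u8>>()),       // never null
        ("NonNull<u8>", option_is_free::<NonNull<u8>>()), // never null
        ("NonZeroU32", option_is_free::<NonZeroU32>()), // never zero
        ("*const u8", option_is_free::<*const u8>()),   // may be null: needs a tag
        ("u32", option_is_free::<u32>()),               // every pattern valid: needs a tag
    ]
}
```

The first four entries come out true and the last two false, which is why NonNull<T> and the NonZero integer types exist in the first place.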

Closures: No Virtual Calls

Each closure expression in Rust has its own unique anonymous type. These types implement the Fn, FnMut, or FnOnce traits. When a closure is passed to a generic function, monomorphization resolves each call statically.

pub fn apply_twice<F: Fn(i32) -> i32>(f: F, x: i32) -> i32 {
    f(f(x))
}

pub fn use_closure() -> i32 {
    let multiplier = 10;
    apply_twice(|x| x * multiplier, 5)
}

Assembly:

example::use_closure:
    mov     eax, 500    ; 5 * 10 * 10 = 500, completely constant folded!
    ret

The compiler inlined the closure, propagated the constant value of multiplier into it, and performed the entire calculation at compile time. Nothing remains at runtime.

What happens if the closure captures state?

pub fn counter() -> impl FnMut() -> i32 {
    let mut count = 0;
    move || {
        count += 1;
        count
    }
}

Here the closure captures the count variable. The compiler creates something like:

// Anonymous struct created by the compiler (approximately; this is pseudocode,
// the Fn* traits cannot be implemented by hand on stable Rust)
struct CounterClosure {
    count: i32,
}

impl FnMut<()> for CounterClosure {
    fn call_mut(&mut self, _: ()) -> i32 {
        self.count += 1;
        self.count
    }
}

This struct is 4 bytes (a single i32). No heap allocation, no vtable pointer.
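You can check the size claim yourself with std::mem::size_of_val. This is a small sketch: closure_sizes is a hypothetical helper, and closure layout is not formally guaranteed by the language, so treat the numbers as what current rustc produces rather than a promise.

```rust
use std::mem::size_of_val;

/// The stateful closure from the article: captures one i32 by value.
pub fn counter() -> impl FnMut() -> i32 {
    let mut count = 0;
    move || {
        count += 1;
        count
    }
}

/// Hypothetical helper: returns (size of a non-capturing closure,
/// size of the counter closure) in bytes.
pub fn closure_sizes() -> (usize, usize) {
    // Captures nothing, so the closure type has no fields at all.
    let stateless = |x: i32| x + 1;
    // Captures a single i32 by move.
    let stateful = counter();
    (size_of_val(&stateless), size_of_val(&stateful))
}
```

On current compilers this reports 0 bytes for the non-capturing closure and 4 bytes for the counter. Calling the counter twice yields 1 and then 2, with the state living inline in the closure value itself.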

`dyn Trait` vs `impl Trait`: Making Informed Choices

Zero-cost abstraction isn't always possible. When you use dynamic dispatch (dyn Trait), you pay a cost:

use std::fmt::Display;

// Static dispatch - monomorphized, can be inlined
fn process_static(x: &impl Display) {
    println!("{}", x);
}

// Dynamic dispatch - call via vtable, usually not inlined
fn process_dynamic(x: &dyn Display) {
    println!("{}", x);
}

Differences between them:

Property	`impl Trait` / `<T: Trait>`	`dyn Trait`
Dispatch	Static (compile-time)	Dynamic (runtime)
Inlinable	Yes	Only if the optimizer can devirtualize the call
Binary size	Grows (separate code for each type)	Small (single code path)
Call cost	Direct call, often inlined away	Indirect call through the vtable
Allocation	None	None for `&dyn Trait`; `Box<dyn Trait>` heap allocates
Type info	Known at compile-time	Erased at runtime (type erasure)

Practical advice: Use static dispatch in hot paths. Save dyn Trait for heterogeneous collections or when you want to reduce binary size.
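Here is a small sketch of the heterogeneous-collection case. The helper names join_static and join_dynamic are hypothetical. The static version only accepts a slice of one concrete type; the dynamic version accepts mixed types at the price of boxing and vtable calls.

```rust
use std::fmt::Display;

/// Static dispatch: every element has the same concrete type `T`.
pub fn join_static<T: Display>(items: &[T]) -> String {
    items
        .iter()
        .map(|x| x.to_string())
        .collect::<Vec<_>>()
        .join(", ")
}

/// Dynamic dispatch: elements may have different concrete types,
/// each formatted through its vtable.
pub fn join_dynamic(items: &[Box<dyn Display>]) -> String {
    items
        .iter()
        .map(|x| x.to_string())
        .collect::<Vec<_>>()
        .join(", ")
}
```

A Vec<Box<dyn Display>> holding 1, "two", and 3.5 can go through join_dynamic, while join_static would reject it at compile time because the element types differ.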

Memory Layout of Enums

Rust enums are smarter than C unions. The compiler is free to choose the layout, and in practice it picks a compact one that fits all variants:

use std::mem::size_of;

enum Small {
    A(i32),
    B(i32),
    C,
}

enum Mixed {
    A(i64),
    B(i32, i32),
    C(bool),
    D,
}

// Sizes as observed with current rustc on 64-bit targets
fn check_sizes() {
    // Small: 8 bytes (4 bytes of data + a discriminant padded to 4 bytes)
    assert_eq!(size_of::<Small>(), 8);
    // Mixed: 16 bytes (8 bytes of data + a discriminant padded to i64 alignment)
    assert_eq!(size_of::<Mixed>(), 16);
}

The Rust compiler also does niche optimization. If an enum has unused bit patterns, it places the discriminant there:

enum NicheOptimized {
    A(bool),      // bool: 0 or 1, the other 254 bit patterns are unused
    B,
    C,
    D,
    E,
}

fn check_niche() {
    // Only 1 byte! (bool's 254 unused patterns encode the other variants)
    assert_eq!(size_of::<NicheOptimized>(), 1);
}

This explains why Rust's Option<bool> and Option<Ordering> types don't use extra space.
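A quick way to see this for yourself is to print a few sizes. The sketch below uses a hypothetical helper, niche_sizes. These layouts are what current rustc produces; the language does not formally guarantee all of them.

```rust
use std::cmp::Ordering;
use std::mem::size_of;

/// Hypothetical helper: sizes in bytes of
/// (Option<bool>, Option<Ordering>, Option<char>, Option<u8>).
pub fn niche_sizes() -> (usize, usize, usize, usize) {
    (
        size_of::<Option<bool>>(),     // bool has spare bit patterns
        size_of::<Option<Ordering>>(), // Ordering only uses -1, 0, 1
        size_of::<Option<char>>(),     // not every u32 is a valid char
        size_of::<Option<u8>>(),       // every u8 pattern is valid: tag needed
    )
}
```

The first three match the size of the wrapped type (1, 1, and 4 bytes), while Option<u8> grows to 2 bytes because u8 has no spare bit pattern to hold None.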

Real-World Example: Writing a Parser

To see the practical benefit of zero-cost abstraction, let's write a simple CSV parser:

pub struct CsvRow<'a> {
    raw: &'a str,
}

impl<'a> CsvRow<'a> {
    pub fn fields(&self) -> impl Iterator<Item = &'a str> {
        self.raw.split(',').map(|s| s.trim())
    }

    pub fn get(&self, index: usize) -> Option<&'a str> {
        self.fields().nth(index)
    }
}

pub fn parse_csv(input: &str) -> impl Iterator<Item = CsvRow<'_>> {
    input.lines()
        .filter(|l| !l.is_empty() && !l.starts_with('#'))
        .map(|line| CsvRow { raw: line })
}

// Usage
pub fn sum_column(input: &str, col: usize) -> Option<i64> {
    let mut values = parse_csv(input)
        .filter_map(|row| row.get(col))
        .filter_map(|val| val.parse::<i64>().ok())
        .peekable();
    values.peek()?;  // Return None instead of Some(0) when nothing parsed
    Some(values.sum())
}

This code:

Zero allocation: No String or Vec created, everything is borrowed
Zero copying: &str slices reference the original input
Single pass: The filter->filter_map->sum chain is optimized into a single loop
No panicking error paths: missing columns and unparsable values are skipped via Option (integer overflow in sum() can still panic in debug builds)

A Python script with the same functionality would pay the cost of split(), strip(), list creation, and garbage collection for each line.
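The "zero copying" claim can be checked directly by comparing addresses. In this sketch, trimmed_fields mirrors the fields() method above, and is_subslice_of is a hypothetical helper that tests whether a field points into the original input buffer.

```rust
/// Splits a line on commas and trims each field, without allocating.
pub fn trimmed_fields<'a>(line: &'a str) -> impl Iterator<Item = &'a str> {
    line.split(',').map(|s| s.trim())
}

/// Hypothetical helper: true when `field` lies inside the memory of `input`.
pub fn is_subslice_of(field: &str, input: &str) -> bool {
    let start = input.as_ptr() as usize;
    let end = start + input.len();
    let field_start = field.as_ptr() as usize;
    field_start >= start && field_start + field.len() <= end
}
```

For an input such as " a , b,c ", every yielded field ("a", "b", "c") passes is_subslice_of against the input, confirming that the fields are views into the original string rather than copies.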

Understanding the Compiler: cargo-asm and Godbolt

To see assembly output of your own code:

# View function-by-function assembly with cargo-asm
cargo install cargo-asm
cargo asm --lib crate_name::function_name

# Online inspection with Godbolt (Compiler Explorer)
# https://rust.godbolt.org/
# Use -C opt-level=3 -C target-cpu=native for maximum optimization

Compiler optimization levels:

# Cargo.toml
[profile.release]
opt-level = 3      # 0-3, s and z (for size)
lto = true         # Link-time optimization
codegen-units = 1  # Disable parallel compilation, more aggressive optimization
panic = "abort"    # No stack unwind on panic

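lto = true is particularly important. By default each crate is compiled and optimized separately, and mainly generic and #[inline] functions can be inlined across crate boundaries. With LTO, LLVM optimizes all crates together, so any function becomes a candidate for cross-crate inlining. codegen-units = 1 disables parallel codegen within a crate, which slows compilation but lets the optimizer see the whole crate at once.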

Limits and Reality

Zero-cost abstraction does not always hold. Some cases to watch for:

1. Excessive monomorphization (code bloat)

use serde::{de::DeserializeOwned, Serialize};

// This code copies the same logic for hundreds of different type combinations
fn process<T: Serialize, U: DeserializeOwned>(input: T) -> U {
    // ...
}

Solution: Abstract the internal logic with dyn Trait or enum, make only the input/output layer generic.
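A minimal sketch of that pattern, using std::fmt::Display instead of serde so it stays self-contained. The names report and report_inner are hypothetical. Only the thin generic shim is duplicated per type; the real work is compiled once behind a dyn Trait reference.

```rust
use std::fmt::Display;

/// Thin generic entry point: this is the only code duplicated per type `T`.
pub fn report<T: Display>(value: T) -> String {
    report_inner(&value)
}

/// Non-generic core: compiled once, called through the Display vtable.
fn report_inner(value: &dyn Display) -> String {
    let text = value.to_string();
    format!("[{} chars] {}", text.chars().count(), text)
}
```

The trade-off is one dynamic call per invocation in exchange for a single copy of the core logic, which is usually a good deal when that logic is large and not on a hot path.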

2. Async functions and state machine size

// Each .await is a suspension point; values that live across it
// are stored inside the state machine
async fn complex_flow() {
    let a = step1().await;    // State 0
    let b = step2(a).await;   // State 1
    let c = step3(b).await;   // State 2
    // State machine size: roughly max(sizeof(state_i)) + discriminant
}

Breaking large async functions into smaller pieces, and avoiding large values that stay alive across an .await, can reduce state machine size.
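The effect is easy to measure without an async runtime, because a future's size is known as soon as it is created. In this sketch the two async functions and future_sizes are hypothetical names; the futures are never polled, only measured.

```rust
use std::future::ready;
use std::mem::size_of_val;

/// The 1 KiB buffer is used after the await, so it must be stored
/// inside the future's state machine.
pub async fn holds_buffer_across_await() -> u8 {
    let buf = [1u8; 1024];
    ready(()).await;
    buf[0]
}

/// The buffer goes out of scope before the await, so only a single
/// byte has to survive the suspension point.
pub async fn drops_buffer_before_await() -> u8 {
    let first = {
        let buf = [1u8; 1024];
        buf[0]
    };
    ready(()).await;
    first
}

/// Hypothetical helper: returns the size in bytes of both futures.
pub fn future_sizes() -> (usize, usize) {
    let held = holds_buffer_across_await();
    let dropped = drops_buffer_before_await();
    (size_of_val(&held), size_of_val(&dropped))
}
```

The first future is at least 1024 bytes, while the second should be only a handful of bytes. Exact numbers depend on the compiler version.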

3. Box<dyn Future> with type erasure

Trait objects prevent monomorphization: a boxed future is heap-allocated, and every poll goes through the vtable, so it usually cannot be inlined into the caller.

Summary

Rust's zero-cost abstraction claim is not a marketing slogan, but a direct result of compiler architecture. Thanks to monomorphization, null pointer optimization, niche optimization, and aggressive inlining:

Iterator chains become a single flat loop
Generics produce specially optimized code for each concrete type
Closures are inlined and their captured constants are propagated
Option<&T> uses no extra memory
Enums are stored in the smallest possible layout

These aren't "free": you pay with compile time and binary size. But at runtime, the result is typically on par with hand-optimized C code.

Practical advice:

Use impl Trait instead of dyn Trait in hot paths
Avoid unnecessary collect() calls — iterators are lazy, don't break the flow
Set lto = true and codegen-units = 1 in release builds
Check assembly output of critical functions with cargo-asm
Break up large async functions, reduce state machine size

Final word: When you hear the word "abstraction" in Rust, don't think of vtables and heap allocation like in Java. Rust's abstractions are zero-cost — and you can prove it by looking at the compiler output.

Tags: rust, zero-cost-abstraction, monomorphization, compiler-optimization, llvm, assembly, iterator, generics, english
Date: 2026-05-26