~/dhy.tr: personal notes & technical writing

Kernel Observability with eBPF: A Deep Dive into Linux's Superpower

The Problem

Few things are as frustrating on a Linux server as the question "why is it slow?" Traditional tools such as top, htop, and iostat show you the current state of your system, but rarely tell you why. Is a process stuck on disk I/O, suffering network retransmissions, or being CPU-throttled? To find out, you run strace, and the process slows down by 30-80% or more. You run perf: it samples well, but offers only limited programmable filtering inside the kernel. You try writing a kernel module, and a single bug can panic the entire system.

The Linux kernel community began to address this problem in 2014, when eBPF (Extended Berkeley Packet Filter) was merged into the mainline kernel. Today, this technology reportedly enables Netflix to perform over 1 million flow exports per second, Meta to run an L4 load balancer handling 10 million connections/second, and Cloudflare to provide microsecond-level DDoS protection. In this article, we'll examine what eBPF is, how it's used in practice, and why it has become a cornerstone of modern Linux system administration.

What Is eBPF?

eBPF is a virtual machine inside the Linux kernel that runs sandboxed programs safely and efficiently. You can load code into the kernel from user space without modifying kernel source code or loading kernel modules.

It's based on five fundamental principles:

  1. No kernel source changes: kernel behavior is extended dynamically
  2. Safe sandbox: the Verifier checks every program before loading and rejects any that could crash or hang the kernel
  3. JIT (Just-In-Time) compilation: eBPF bytecode is converted to native machine code instead of being interpreted
  4. Event-driven architecture: eBPF programs run only when a triggering event occurs, so they consume no CPU while idle
  5. Data sharing via Maps: data flows between kernel space and user space through structures like hash maps, arrays, and ring buffers

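This architecture addresses three major limitations of traditional kernel modules: safety (the Verifier rejects programs that could crash the kernel), portability (with CO-RE, one compiled binary can run across many kernel versions, provided the kernel exposes BTF type information), and maintenance burden (programs are loaded and unloaded dynamically, with no reboot required).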

The Heart of eBPF: Verifier, JIT, and Maps

Before eBPF programs are loaded, they go through strict security checks by the verifier inside the kernel. This check guarantees:

┌─────────────────────────────────────────────────────────┐
│                   eBPF VERIFIER                         │
├─────────────────────────────────────────────────────────┤
│  1. Program terminates: loops must be provably bounded  │
│  2. All registers are written before being read         │
│  3. All memory accesses are within bounds               │
│  4. Verified instructions < 1 million (kernel 5.2+)     │
│  5. Pointer arithmetic only within stack/context        │
│  6. Helper functions called with correct arguments      │
│  7. Hardening against speculative execution attacks     │
│  8. No NULL pointer dereference                         │
│  9. No out-of-bounds access                             │
└─────────────────────────────────────────────────────────┘
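
To make check 2 concrete, here is a toy model in Python. It is not how the kernel verifier is implemented, and the Insn format is invented for illustration (it is not eBPF bytecode). It only walks a straight-line program and reports the first register that is read before anything wrote to it. The one real-world detail it borrows is that R1 (context pointer) and R10 (frame pointer) are already valid at program entry.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Insn(NamedTuple):
    """Toy instruction (hypothetical format): reads `srcs`, then writes `dst`."""
    dst: int
    srcs: Tuple[int, ...]


def first_uninitialized_read(prog: Sequence[Insn]) -> Optional[Tuple[int, int]]:
    """Return (instruction index, register) for the first read of a register
    that was never written, or None if every read is safe."""
    # At entry, R1 holds the context pointer and R10 the frame pointer.
    written = {1, 10}
    for idx, insn in enumerate(prog):
        for reg in insn.srcs:
            if reg not in written:
                return idx, reg
        written.add(insn.dst)
    return None


# R0 = const; R2 = f(R1); R0 = f(R0, R2)  -> every read is initialized
ok_prog = [Insn(0, ()), Insn(2, (1,)), Insn(0, (0, 2))]
# Second instruction reads R3, which nothing has written yet
bad_prog = [Insn(0, ()), Insn(0, (0, 3))]

print(first_uninitialized_read(ok_prog))
print(first_uninitialized_read(bad_prog))
```

The real verifier does this per execution path and tracks value ranges and pointer types as well, which is why it can also enforce the bounds and NULL checks in the list.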

After passing the Verifier, the program is compiled to native machine code by the JIT compiler on architectures such as x86_64, ARM64, RISC-V, and s390x. This lets eBPF programs run roughly 2-4x faster than interpreted bytecode, depending on the workload.

Maps are the data bridge between eBPF programs and user space. Recent kernels offer more than 30 map types, so you can model most data structures you are likely to need:

# How many map types does your system have?
bpftool feature probe | grep map_type | wc -l

Basic map types: BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_RINGBUF (a lower-overhead alternative to the per-CPU perf buffer, kernel 5.8+), BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_PERCPU_HASH, BPF_MAP_TYPE_BLOOM_FILTER.
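
If the difference between a plain hash map and an LRU hash map is unclear, the following Python model may help. It is only a sketch of the semantics (a fixed max_entries, with the least recently used entry evicted when the map is full); the class name LruHashModel is made up, and the kernel's actual eviction is approximate rather than strictly ordered like this.

```python
from collections import OrderedDict


class LruHashModel:
    """User-space model of LRU hash map semantics (hypothetical helper)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def update(self, key, value):
        if key in self._data:
            # Existing key: refresh its position
            self._data.move_to_end(key)
        elif len(self._data) >= self.max_entries:
            # Map is full: evict the least recently used entry
            self._data.popitem(last=False)
        self._data[key] = value

    def lookup(self, key):
        # Like a map lookup in eBPF, a miss yields "NULL" (None here)
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]


m = LruHashModel(max_entries=2)
m.update(1, "a")
m.update(2, "b")
m.lookup(1)        # touch key 1, so key 2 is now the oldest
m.update(3, "c")   # evicts key 2
print(m.lookup(1), m.lookup(2), m.lookup(3))
```

A plain BPF_MAP_TYPE_HASH would instead reject the third insert once max_entries is reached, which is why LRU maps are popular for per-connection or per-PID state.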

Practical eBPF Tools: bpftrace and BCC

You don't have to write C kernel programs and compile with clang to work with eBPF. The ecosystem offers two powerful tools that make this much more accessible.

bpftrace — "AWK for the Kernel"

bpftrace was created by Alastair Robertson, and Brendan Gregg (formerly of Netflix Performance Engineering, later Intel) contributed many of its tools and much of its documentation. It lets you trace the kernel with AWK-like one-liners, so you can write powerful queries with very little programming.

# 1. Count all syscalls by process name
#    Prints sorted histogram when you press CTRL-C
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# 2. Which files are being opened? (PID, UID, process, filename)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%-6d %-4d %-16s %s\n", pid, uid, comm, str(args.filename));
}'

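# 3. Disk I/O latency histogram (microseconds)
#    Check the upper buckets to spot operations slower than 10 ms.
#    Uses block tracepoints, which are more stable across kernel versions
#    than kprobes on blk_account_io_* (those symbols changed in newer kernels).
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args.dev, args.sector] = nsecs; }
    tracepoint:block:block_rq_complete / @start[args.dev, args.sector] / {
        @iolat_us = hist((nsecs - @start[args.dev, args.sector]) / 1000);
        delete(@start[args.dev, args.sector]);
    }'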

# 4. TCP retransmit: an indicator of network problems
#    sk_state is a C macro, so the real field path is spelled out here.
bpftrace -e 'kprobe:tcp_retransmit_skb {
    printf("[%s] retransmit, state: %d\n", comm, ((struct sock *)arg0)->__sk_common.skc_state);
}'

# 5. Which processes are executing new programs?
#    pid/comm identify the caller of execve() (comm is its name before the exec).
bpftrace -e 'tracepoint:syscalls:sys_enter_execve {
    printf("PID:%-6d %-16s -> EXEC: %s\n", pid, comm, str(args.filename));
}'

# 6. Failed syscalls (with error codes)
bpftrace -e 'tracepoint:raw_syscalls:sys_exit / args.ret < 0 / {
    @errors[comm, args.id, -args.ret] = count();
}'
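
The hist() call in one-liner 3 groups values into power-of-two buckets. As a rough illustration of that bucketing (not bpftrace's actual implementation), here is a small Python sketch. The sample latencies are made up, and values below 1 are simply folded into a [0, 1) bucket.

```python
from collections import Counter


def log2_bucket(value: int):
    """Return the [low, high) power-of-two bucket that contains value."""
    if value < 1:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (low, low * 2)


def histogram(samples):
    """Count how many samples fall into each power-of-two bucket."""
    return Counter(log2_bucket(s) for s in samples)


# Made-up I/O latencies in microseconds
latencies_us = [3, 5, 7, 120, 900, 1500, 12000]

for (low, high), n in sorted(histogram(latencies_us).items()):
    print(f"[{low}, {high})".ljust(16), "@" * n)
```

With this scheme a 12 ms operation lands in the [8192, 16384) bucket, so anything in that bucket or above is a candidate for the "slower than 10 ms" question.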

BCC (BPF Compiler Collection) — 100+ Ready-Made Tools

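BCC lets you write eBPF programs with a Python or Lua front end and ships over 100 ready-made tools. On Debian and Ubuntu, the bpfcc-tools package puts them on your PATH with a -bpfcc suffix; source builds and some other distributions place them under /usr/share/bcc/tools/: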

# First, install it
apt install bpfcc-tools

# Most commonly used:
execsnoop-bpfcc       # Monitor new processes in real-time
opensnoop-bpfcc       # All file open operations
tcptop-bpfcc          # Top TCP connections by bandwidth
tcplife-bpfcc         # TCP connection lifetime monitoring
biolatency-bpfcc      # Block I/O latency histogram
biosnoop-bpfcc        # Every disk I/O operation detail
cachestat-bpfcc       # Page cache hit/miss ratio
cpuunclaimed-bpfcc    # CPUs left idle while tasks wait to run
gethostlatency-bpfcc  # DNS resolution latency
runqlat-bpfcc         # CPU run queue wait time

Practical scenario: "Which starts faster, Python or Node.js?"

# Terminal 1: run execsnoop
execsnoop-bpfcc -T

# Terminal 2: take measurements
time python3 -c "print('hello')"
time node -e "console.log('hello')"

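The -T flag adds a wall-clock time column (HH:MM:SS) to each exec event; use -t instead if you want elapsed seconds with sub-second resolution. execsnoop shows which processes each command actually spawns and when, while the startup duration itself comes from the shell's time output.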
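
A single run of time is noisy, so it can help to repeat the measurement. The sketch below is one way to do that from Python; best_wall_time is a hypothetical helper name, and it measures the whole fork/exec/run/exit cycle of the child, not just interpreter startup.

```python
import subprocess
import sys
import time


def best_wall_time(argv, runs=5):
    """Run argv several times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # sys.executable is the current Python interpreter; pass a node command
    # line the same way if node is installed on your machine.
    elapsed = best_wall_time([sys.executable, "-c", "print('hello')"])
    print(f"python: {elapsed * 1000:.1f} ms")
```

Taking the minimum rather than the mean filters out runs that were disturbed by cold caches or other activity on the machine.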

Writing Your Own eBPF Program: BCC with Python

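BCC is a practical way to do kernel-level eBPF programming from Python. Here's a short example that finds the processes issuing the most writes. Note that vfs_write() covers every write through the VFS layer (files, pipes, sockets, terminals), not only writes that reach a disk: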

#!/usr/bin/env python3
import time

from bcc import BPF

# eBPF C code (runs inside the kernel)
bpf_text = """
#include <uapi/linux/ptrace.h>

// Map: process ID (tgid) -> number of vfs_write() calls
BPF_HASH(writes_by_pid, u32, u64);

int trace_write(struct pt_regs *ctx) {
    // Upper 32 bits: process ID (tgid); lower 32 bits: thread ID
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    writes_by_pid.increment(pid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="vfs_write", fn_name="trace_write")

print("VFS Write Top 10 - CTRL+C to stop\n")
try:
    while True:
        time.sleep(2)
        print("\033[2J\033[H", end="")  # Clear screen
        print("PID       COMM             WRITES")
        counts = b["writes_by_pid"]
        # Counts are cumulative since the program was attached
        sorted_counts = sorted(
            [(pid.value, counts[pid].value) for pid in counts],
            key=lambda x: x[1], reverse=True
        )[:10]
        for pid, count in sorted_counts:
            try:
                with open(f"/proc/{pid}/comm") as f:
                    comm = f.read().strip()
            except OSError:
                comm = "?"  # Process already exited
            print(f"{pid:<9} {comm:<16} {count}")
except KeyboardInterrupt:
    print("\nDone.")

The moment you run this code, a kprobe starts firing inside the kernel, and every vfs_write() call increments a counter keyed by process ID. Nothing is attached to any process with ptrace, and the handler runs in kernel context on the CPU that made the call, so no context switch to a tracer process is needed.
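
The `>> 32` in trace_write() is easy to gloss over. bpf_get_current_pid_tgid() returns a single 64-bit value with the process ID (tgid) in the upper half and the thread ID in the lower half. A small Python illustration of the same bit arithmetic, using a made-up sample value:

```python
def split_pid_tgid(pid_tgid: int):
    """Split the 64-bit value into (process ID, thread ID)."""
    tgid = pid_tgid >> 32          # what user space calls the PID
    tid = pid_tgid & 0xFFFFFFFF    # kernel-level thread ID
    return tgid, tid


# Made-up example: thread 4250 belonging to process 4242
sample = (4242 << 32) | 4250
print(split_pid_tgid(sample))
```

Keying the map by the upper half groups all threads of a process under one counter; keying by the lower half would give per-thread counts instead.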

TCP Connection Analysis with eBPF

eBPF is very effective for diagnosing network problems. Traditional tcpdump captures packets but does not expose the kernel's per-socket TCP state machine. Here's a bpftrace script that tracks the TCP lifecycle:

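bpftrace -e '
BEGIN { printf("Tracing TCP connections... Ctrl-C to stop\n"); }

// New connection. Key by socket pointer (arg0), not tid: the handshake
// usually completes in softirq context, where tid and comm are unrelated.
kprobe:tcp_connect {
    @conn_start[arg0] = nsecs;
    @conn_comm[arg0] = comm;
    printf("[CONNECT] %-16s -> %s\n", comm, kstack(3));
}

// Connection successful
kprobe:tcp_finish_connect / @conn_start[arg0] / {
    $duration_us = (nsecs - @conn_start[arg0]) / 1000;
    @connect_latency = hist($duration_us);
    printf("[ESTABLISHED] %s, latency: %d us\n", @conn_comm[arg0], $duration_us);
    delete(@conn_start[arg0]);
    delete(@conn_comm[arg0]);
}

// Retransmit: sign of trouble! (comm may be a kernel thread or an
// unrelated task here, since retransmits often fire from timers)
kprobe:tcp_retransmit_skb {
    printf("[RETRANSMIT ⚠] %s\n", kstack(3));
    @retransmits[comm] = count();
}

END {
    clear(@conn_start);
    clear(@conn_comm);
}
'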

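If you see frequent [RETRANSMIT ⚠] lines in the output, that usually points to packet loss or congestion somewhere on the network path. Traditional tools expose only system-wide retransmit counters; tying each retransmit to a kernel stack at the moment it happens is much harder without eBPF.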
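
To sanity-check what the script reports, you can compare it against the system-wide TCP counters in /proc/net/snmp, which consist of a "Tcp:" header line followed by a "Tcp:" value line. The parser below is a minimal sketch; the function names are made up for this example.

```python
from pathlib import Path


def parse_tcp_snmp(text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp into a dict of counters."""
    tcp_lines = [line.split() for line in text.splitlines() if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("expected a 'Tcp:' header line and a 'Tcp:' value line")
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(counters):
    """Fraction of sent segments that were retransmissions (since boot)."""
    out = counters["OutSegs"]
    return counters["RetransSegs"] / out if out else 0.0


if __name__ == "__main__":
    snmp = Path("/proc/net/snmp")
    if snmp.exists():
        counters = parse_tcp_snmp(snmp.read_text())
        print(f"RetransSegs={counters['RetransSegs']} "
              f"ratio={retransmit_ratio(counters):.4%}")
```

These counters are cumulative since boot, so sample them twice and subtract if you want the rate during a specific test window.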

Real-World: eBPF at Scale

eBPF isn't just a debugging tool — it's a platform used in production by the world's largest infrastructures:

Netflix: reportedly uses eBPF for over 1 million flow exports per second. Many BCC tools such as tcplife, tcptop, and biolatency were written by Brendan Gregg during his time on Netflix's performance engineering team, and dozens of them now ship in Linux distribution packages.

Meta (Facebook): built the Katran L4 load balancer on eBPF/XDP. It reportedly handles on the order of 10 million connections/second and uses much less CPU than the IPVS-based system it replaced. Meta also customizes the kernel scheduler with eBPF programs via sched_ext.

Cloudflare: performs DDoS protection with microsecond-level packet filtering via eBPF/XDP, reportedly saving around 10% CPU compared to traditional iptables-based filtering. Cloudflare has published detailed write-ups of its XDP-based DDoS mitigation on its blog.

Other major users: Google (GKE networking), Shopify (service mesh), Datadog (agent monitoring), Sysdig (container security), AWS (Lambda and VPC networking), Microsoft (eBPF for Windows).

eBPF vs Traditional Tools: Comparison

Knowing which tool to use when is important:

┌───────────────┬──────────────────┬────────────────────────────┐
│ Tool          │ Overhead         │ When to Use                │
├───────────────┼──────────────────┼────────────────────────────┤
│ strace        │ 30-80%           │ Single process, quick test │
│ perf          │ 2-5%             │ Sampling, flame graphs     │
│ ftrace        │ 0-2%             │ Kernel function tracing    │
│ SystemTap     │ 5-15%            │ Complex kernel scripts     │
│ eBPF/bpftrace │ 0.1-3%           │ Production-safe tracing    │
└───────────────┴──────────────────┴────────────────────────────┘

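eBPF's low overhead comes from running the handler inside the kernel, in the context of the event itself. strace is built on ptrace: it stops the traced process at every syscall entry and exit and switches to the strace process each time. An eBPF program, by contrast, adds only a short in-kernel function call per event. Treat the percentages above as rough, workload-dependent figures.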
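
A back-of-the-envelope way to reason about these percentages is: overhead is roughly the event rate multiplied by the cost per event. The sketch below applies that formula. The per-event costs are assumptions chosen only to illustrate the arithmetic, not measurements; real costs depend on the probe type, the program, and the hardware.

```python
def estimated_cpu_overhead(events_per_sec, cost_ns_per_event):
    """Fraction of one CPU consumed by tracing: rate * cost per event."""
    return events_per_sec * cost_ns_per_event / 1e9


# Assumed numbers for illustration only
syscalls_per_sec = 100_000
in_kernel_probe_ns = 100      # assumed cost of a short in-kernel handler
ptrace_stop_ns = 5_000        # assumed cost of stopping and resuming the tracee

print(f"in-kernel handler: {estimated_cpu_overhead(syscalls_per_sec, in_kernel_probe_ns):.1%}")
print(f"ptrace-style stop: {estimated_cpu_overhead(syscalls_per_sec, ptrace_stop_ns):.1%}")
```

The useful takeaway is the shape of the formula: overhead scales with event frequency, so a probe on a hot path (every packet, every scheduler switch) deserves more caution than one on a rare event.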

Security: Is eBPF Safe?

eBPF is much safer than kernel modules thanks to its sandbox model — but not risk-free. Some known CVEs:

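  • CVE-2020-8835: out-of-bounds read/write caused by incorrect 32-bit bounds tracking in the Verifier (affected 5.5-era kernels, fixed in later stable releases)
  • CVE-2021-3490: incorrect bounds tracking for 32-bit bitwise ALU operations (AND, OR, XOR)
  • CVE-2022-23222: local privilege escalation through pointer arithmetic on *_OR_NULL pointer types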

These were bugs in the Verifier itself, not in eBPF programs. Still, a few settings are worth reviewing before you use eBPF in production:

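# Disable bpf() for unprivileged users (most restrictive).
# Once set to 1, the value cannot be changed again until reboot.
sysctl -w kernel.unprivileged_bpf_disabled=1

# Alternative on newer kernels: also disables unprivileged bpf(),
# but an administrator can change the value later at runtime.
# Pick either 1 or 2, not both.
sysctl -w kernel.unprivileged_bpf_disabled=2

# Enable JIT hardening for all users
sysctl -w net.core.bpf_jit_harden=2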
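
To audit a machine without changing anything, you can read the same settings from /proc/sys. The helper names below are made up for this sketch; the meanings of values 0, 1, and 2 for kernel.unprivileged_bpf_disabled follow the kernel's sysctl documentation.

```python
from pathlib import Path

UNPRIV_BPF_MEANINGS = {
    0: "unprivileged bpf() is allowed",
    1: "unprivileged bpf() is disabled and locked until reboot",
    2: "unprivileged bpf() is disabled, but an admin can change it",
}


def describe_unprivileged_bpf(value):
    """Translate the sysctl value into a human-readable description."""
    return UNPRIV_BPF_MEANINGS.get(value, f"unknown value {value}")


def read_sysctl_int(name):
    """Read an integer sysctl via /proc/sys, or None if missing/unreadable."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    value = read_sysctl_int("kernel.unprivileged_bpf_disabled")
    if value is None:
        print("kernel.unprivileged_bpf_disabled: not available on this system")
    else:
        print(f"kernel.unprivileged_bpf_disabled={value}: "
              f"{describe_unprivileged_bpf(value)}")
```

Reading net.core.bpf_jit_harden the same way typically requires root, in which case read_sysctl_int returns None instead of raising.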

Conclusion

eBPF is arguably the most transformative addition to the Linux kernel in the last 10 years. It offers an observability platform that is safe to use in production, without writing kernel modules and with very little performance cost.

With the tools covered in this article, you can start today:

  1. bpftrace: start understanding your system's behavior in 5 minutes; on Debian/Ubuntu, apt install bpftrace is all you need
  2. BCC tools: use ready-made tools like execsnoop, biolatency, tcptop
  3. Your own BCC Python program: collect metrics specific to your workload

As a next step, you can explore Kubernetes networking with Cilium (eBPF-based CNI) or automatic instrumentation with Pixie (eBPF-based APM). The eBPF ecosystem is growing rapidly — now is the perfect time to start learning.


Tags: linux, ebpf, bpftrace, bcc, observability, kernel, networking, performance
Date: 2026-05-29