Linux cgroups v2: Process-Based Resource Limiting — Taming CPU, Memory, and I/O

The Problem

You're running dozens of services on a single server. NGINX, PostgreSQL, Redis, several Node.js applications, cron jobs... They all draw from the same CPU and memory pool. One day, a Node.js application has a memory leak and eats up 28 GB of your 32 GB RAM, the OOM killer hits PostgreSQL, and your site goes down. Or a log rotation script saturates all CPU cores and API response times hit the ceiling.

The classic solution: put everything in Docker containers. But what if you don't want to use Docker? Or what if you need fine-grained control even within a container? That's where cgroups v2 comes in.

What Are cgroups?

Control Groups (cgroups) are a Linux kernel mechanism that allows you to organize processes into hierarchical groups and assign resource limits to each group. cgroups v1 came in 2008, and v2 was introduced in kernel 4.5 (2016). Today, it's the default on most modern distributions (Ubuntu 22.04+, Debian 12+, RHEL 9+).

The biggest difference between v1 and v2 is that in v1, each controller (cpu, memory, blkio...) creates a separate hierarchy, while v2 has a single unified hierarchy. This means you can set CPU, memory, and I/O limits simultaneously on a single process group.

# v1: scattered hierarchy (each controller separate)
/sys/fs/cgroup/cpu/app/
/sys/fs/cgroup/memory/app/
/sys/fs/cgroup/blkio/app/

# v2: unified hierarchy
/sys/fs/cgroup/app/
├── cpu.max
├── memory.max
├── memory.high
├── io.max
└── ...

Check if your system uses v2:

mount | grep cgroup2
# cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

If the file /sys/fs/cgroup/cgroup.controllers exists, you're using v2.
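
The same check can be scripted. The sketch below is a minimal illustration, not a complete detector: the function name cgroup_version is hypothetical, and it assumes that a v1 (or hybrid) layout can be recognized by per-controller directories such as memory/ or cpu/ under the mount point.

```shell
# cgroup_version [ROOT]: print "v2" if ROOT looks like a unified (v2) mount,
# "v1" if it holds per-controller directories, "none" otherwise.
# ROOT defaults to /sys/fs/cgroup.
cgroup_version() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo v2
    elif [ -d "$root/memory" ] || [ -d "$root/cpu" ]; then
        echo v1
    else
        echo none
    fi
}

# Example: cgroup_version
```

Taking the mount point as a parameter keeps the function usable on systems that mount cgroup2 somewhere other than /sys/fs/cgroup.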

Solution: Without systemd, Using the Direct cgroups v2 API

systemd wraps cgroups with its own slice/scope mechanism — but in this article, we'll use the cgroups v2 filesystem API directly. This way, you'll understand how the system works and be able to write your own solution for custom scenarios where systemd falls short.

1. Manually Creating a cgroup and Assigning a Process

First, let's create a cgroup and put a process inside it:

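# Root privileges required (or you need to be in a delegated subtree)
# A child only gets the controllers listed in its parent's cgroup.subtree_control,
# so enable them at the root first (systemd usually enables some of them already)
echo "+cpu +memory +io +pids" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

sudo mkdir /sys/fs/cgroup/app

# Check which controllers are available in the new cgroup
cat /sys/fs/cgroup/app/cgroup.controllers
# cpu io memory pids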

# Add a process to this group
echo $$ | sudo tee /sys/fs/cgroup/app/cgroup.procs

That's it. Your current shell is now under the app cgroup. Now let's set the limits.
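
To confirm that the move worked, you can read /proc/<pid>/cgroup, where the cgroup v2 entry is the line with hierarchy ID 0 and an empty controller list (0::/path). A minimal sketch, with the helper name v2_cgroup_path being hypothetical:

```shell
# v2_cgroup_path FILE: print the cgroup v2 path recorded in a
# /proc/<pid>/cgroup file (the line that starts with "0::").
v2_cgroup_path() {
    sed -n 's/^0:://p' "$1"
}

# Example: v2_cgroup_path /proc/$$/cgroup
```

For a shell that was moved into the app cgroup, the example call should print /app.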

2. CPU Limiting

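In cgroups v2, CPU bandwidth is capped through the cpu.max file. Format: $MAX $PERIOD, both in microseconds. The group may use at most $MAX of CPU time in each $PERIOD, summed over all cores, so a $MAX equal to $PERIOD is one full core and a larger $MAX spans several cores. Writing max as $MAX removes the cap.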

# 50% of one CPU core
# $MAX=50000, $PERIOD=100000 → max 50ms of CPU time in every 100ms period
echo "50000 100000" | sudo tee /sys/fs/cgroup/app/cpu.max

# CPU weight (proportional share: only matters when cgroups compete for CPU)
# Range is 1 to 10000, default is 100; a lower value means a smaller share
echo 50 | sudo tee /sys/fs/cgroup/app/cpu.weight
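
Because the quota is CPU time per period, it is easy to be off by a factor of ten when thinking in cores. The helper below is a small sketch (the name cpu_max_for_cores is hypothetical) that turns a core count into a cpu.max line, assuming the default 100000 microsecond period unless another one is given.

```shell
# cpu_max_for_cores CORES [PERIOD_US]: print a "QUOTA PERIOD" line for cpu.max.
# CORES may be fractional (0.5) or greater than one (1.5, 3).
cpu_max_for_cores() {
    awk -v cores="$1" -v period="${2:-100000}" \
        'BEGIN { printf "%.0f %d\n", cores * period, period }'
}

# Example: cpu_max_for_cores 1.5 | sudo tee /sys/fs/cgroup/app/cpu.max
```

With this, half a core comes out as "50000 100000" and three cores as "300000 100000".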

Real-world scenario: give a background log processing service at most 10% of one CPU core:

sudo mkdir /sys/fs/cgroup/log-processor
echo "10000 100000" | sudo tee /sys/fs/cgroup/log-processor/cpu.max
echo 10 | sudo tee /sys/fs/cgroup/log-processor/cpu.weight

Read CPU usage statistics:

# Total CPU usage (microseconds)
cat /sys/fs/cgroup/app/cpu.stat
# usage_usec 45238192
# user_usec 30123456
# system_usec 15114736
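
usage_usec is a cumulative counter, so a utilization figure needs two samples and the time between them. A minimal sketch of that arithmetic follows; the function names usage_usec and cpu_percent are hypothetical.

```shell
# usage_usec FILE: print the usage_usec counter from a cpu.stat file.
usage_usec() {
    awk '$1 == "usage_usec" { print $2 }' "$1"
}

# cpu_percent BEFORE_US AFTER_US INTERVAL_S: average CPU use over the interval
# as a percentage of one core (values above 100 mean more than one core).
cpu_percent() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (b - a) / (s * 1000000) * 100 }'
}

# Example:
#   before=$(usage_usec /sys/fs/cgroup/app/cpu.stat); sleep 5
#   after=$(usage_usec /sys/fs/cgroup/app/cpu.stat)
#   cpu_percent "$before" "$after" 5
```

A group that consumed 500000 microseconds of CPU time during one second of wall time reports 50.0.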

3. Memory Limiting

Memory management is one of cgroups v2's strongest areas. It has a three-tier limit system:

# Hard limit: usage cannot exceed it; if reclaim cannot keep usage below it, the OOM killer runs inside the cgroup
echo 512M | sudo tee /sys/fs/cgroup/app/memory.max

# Throttle limit: above it the kernel throttles the processes and reclaims aggressively, but never invokes the OOM killer
echo 384M | sudo tee /sys/fs/cgroup/app/memory.high

# Hard protection: memory within this amount is never reclaimed from the cgroup
# (memory.low is the best-effort variant)
echo 128M | sudo tee /sys/fs/cgroup/app/memory.min

Practical strategy: set memory.high about 25% below memory.max. Once usage crosses the high threshold, the kernel throttles the cgroup's processes and reclaims memory aggressively, but they keep running. Only if usage still reaches memory.max and nothing more can be reclaimed does the OOM killer step in.
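
The 25% rule is simple enough to compute rather than work out by hand. The sketch below assumes sizes are given either in plain bytes or with a K, M, or G suffix (powers of 1024, as the cgroup files interpret them); the names to_bytes and memory_high_for are hypothetical.

```shell
# to_bytes SIZE: convert 512M, 8G, 64K, or a plain byte count to bytes.
to_bytes() {
    case $1 in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# memory_high_for MAX: print 75% of MAX in bytes, a candidate memory.high value.
memory_high_for() {
    echo $(( $(to_bytes "$1") * 3 / 4 ))
}

# Example: memory_high_for 512M | sudo tee /sys/fs/cgroup/app/memory.high
```

For a 512M hard limit this yields 402653184 bytes, which is the 384M used above.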

# Memory statistics
cat /sys/fs/cgroup/app/memory.current   # Current usage
cat /sys/fs/cgroup/app/memory.stat      # Detailed statistics
cat /sys/fs/cgroup/app/memory.events    # OOM, high, max event counters

Memory pressure monitoring — one of the coolest features of cgroups v2:

cat /sys/fs/cgroup/app/memory.pressure
# some avg10=12.34 avg60=8.91 avg300=5.67 total=4523819
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

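some: the share of time in which at least one task in the cgroup was stalled waiting for memory. full: the share of time in which all non-idle tasks were stalled at once. avg10, avg60, and avg300 are percentages over the last 10, 60, and 300 seconds; total is the cumulative stall time in microseconds. You can export these metrics to Prometheus and set up alerting.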
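
Before wiring up a full metrics pipeline, a threshold check on avg10 can already serve as a crude alert. This is a sketch under the assumption that the file follows the two-line layout shown above; psi_avg10 and psi_exceeds are hypothetical helper names.

```shell
# psi_avg10 FILE KIND: print the avg10 value for KIND ("some" or "full")
# from a pressure file such as memory.pressure.
psi_avg10() {
    awk -v kind="$2" '$1 == kind {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }' "$1"
}

# psi_exceeds FILE KIND THRESHOLD: succeed (exit 0) when avg10 is above THRESHOLD.
psi_exceeds() {
    awk -v v="$(psi_avg10 "$1" "$2")" -v t="$3" 'BEGIN { exit !(v > t) }'
}

# Example:
#   psi_exceeds /sys/fs/cgroup/app/memory.pressure some 10 && echo "memory pressure high"
```

The threshold of 10 in the example is arbitrary; a sensible value depends on how latency-sensitive the workload is.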

4. I/O Limiting

Disk I/O is usually forgotten but is a critical bottleneck. Prevent one process from saturating everyone else's disk:

# Format: MAJOR:MINOR rbps=X wbps=X riops=X wiops=X
# rbps = read bytes per second, wbps = write bytes per second
# riops = read IOPS, wiops = write IOPS

# First, find device numbers (-d lists the whole disk only; io.max does not accept partitions)
lsblk -d -o NAME,MAJ:MIN /dev/sda
# NAME MAJ:MIN
# sda   8:0

# Max 50 MiB/s read, 20 MiB/s write, 1000 read IOPS, 500 write IOPS
echo "8:0 rbps=52428800 wbps=20971520 riops=1000 wiops=500" | \
  sudo tee /sys/fs/cgroup/app/io.max

# I/O statistics
cat /sys/fs/cgroup/app/io.stat
# 8:0 rbytes=4294967296 wbytes=1073741824 rios=1024 wios=256
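
The byte values in io.max are easier to get right when they are computed. A small sketch follows, with io_max_line as a hypothetical helper that takes the throughput limits in MiB/s:

```shell
# io_max_line DEV READ_MIB WRITE_MIB RIOPS WIOPS: print a line for io.max,
# converting the MiB/s throughput limits to bytes per second.
io_max_line() {
    mib=$((1024 * 1024))
    echo "$1 rbps=$(( $2 * mib )) wbps=$(( $3 * mib )) riops=$4 wiops=$5"
}

# Example: io_max_line 8:0 50 20 1000 500 | sudo tee /sys/fs/cgroup/app/io.max
```

The example call reproduces the limits set by hand above.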

5. PID Limit (Fork Bomb Protection)

# At most 100 tasks (processes and threads) can exist in this cgroup
echo 100 | sudo tee /sys/fs/cgroup/app/pids.max

# Once the limit is reached, fork() and clone() fail with EAGAIN and no new task can be created
cat /sys/fs/cgroup/app/pids.current  # Current task count
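
To see how close a group is to its limit, compare pids.current with pids.max. A minimal sketch, assuming the cgroup directory is passed as an argument; pids_headroom is a hypothetical name:

```shell
# pids_headroom DIR: print how many more tasks the cgroup at DIR may create,
# or "unlimited" when pids.max is set to "max".
pids_headroom() {
    max=$(cat "$1/pids.max")
    cur=$(cat "$1/pids.current")
    if [ "$max" = "max" ]; then
        echo unlimited
    else
        echo $(( max - cur ))
    fi
}

# Example: pids_headroom /sys/fs/cgroup/app
```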

Real-World Scenario: NGINX + PostgreSQL Server

Let's say you're running both NGINX and PostgreSQL on a single server. PostgreSQL's memory must be guaranteed, and NGINX should be capped at a fixed share of CPU.

# PostgreSQL cgroup
sudo mkdir /sys/fs/cgroup/postgresql
echo "8G" | sudo tee /sys/fs/cgroup/postgresql/memory.max
echo "6G" | sudo tee /sys/fs/cgroup/postgresql/memory.high
echo "4G" | sudo tee /sys/fs/cgroup/postgresql/memory.min
echo "300000 100000" | sudo tee /sys/fs/cgroup/postgresql/cpu.max  # 3 cores worth
# pgrep -x matches the process name exactly; -f would also match this sh -c command line
sudo sh -c 'for pid in $(pgrep -x postgres); do echo $pid > /sys/fs/cgroup/postgresql/cgroup.procs; done'

# NGINX cgroup
sudo mkdir /sys/fs/cgroup/nginx
echo "2G" | sudo tee /sys/fs/cgroup/nginx/memory.max
echo "150000 100000" | sudo tee /sys/fs/cgroup/nginx/cpu.max  # 1.5 cores
# pgrep -x matches the process name exactly; -f would also match this sh -c command line
sudo sh -c 'for pid in $(pgrep -x nginx); do echo $pid > /sys/fs/cgroup/nginx/cgroup.procs; done'

Let's write a startup script to make this persistent:

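#!/bin/bash
# /usr/local/sbin/setup-cgroups.sh
# Runs at server boot

# Make the controllers available to cgroups created directly under the root
echo "+cpu +memory +io +pids" > /sys/fs/cgroup/cgroup.subtree_control

setup_cgroup() {
    local name=$1 mem_max=$2 mem_high=$3 mem_min=$4 cpu_max=$5

    # Remove a stale empty cgroup, then recreate it
    # (rmdir fails harmlessly if processes are still inside)
    [ -d "/sys/fs/cgroup/$name" ] && rmdir "/sys/fs/cgroup/$name" 2>/dev/null
    mkdir -p "/sys/fs/cgroup/$name"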

    # Memory limits
    echo "$mem_max"  > "/sys/fs/cgroup/$name/memory.max"
    echo "$mem_high" > "/sys/fs/cgroup/$name/memory.high"
    echo "$mem_min"  > "/sys/fs/cgroup/$name/memory.min"

    # CPU limits
    echo "$cpu_max 100000" > "/sys/fs/cgroup/$name/cpu.max"
}

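# Container-like isolation
# Last argument is the CPU quota in microseconds per 100ms period (100000 = one full core)
setup_cgroup "postgresql" "8G" "6G" "4G" "300000"       # 3 cores
setup_cgroup "nginx"      "2G" "1500M" "512M" "150000"  # 1.5 cores
setup_cgroup "redis"      "1G" "768M" "256M" "5000"     # 5% of one core; use 50000 for half a core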

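Then let's run it at every boot with systemd. The unit only recreates the cgroups and their limits; processes still have to be written to cgroup.procs once the services are up: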

# /etc/systemd/system/cgroups-setup.service
[Unit]
Description=Setup custom cgroups for service isolation
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/setup-cgroups.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable --now cgroups-setup.service

Using cgroups v2 Programmatically (Rust Example)

Shell redirection is fine for scripts, but in production code it is cleaner to wrap the same filesystem interface in a small typed API. Here's a process spawner that uses cgroups v2 in Rust:

use std::fs;
use std::os::unix::process::CommandExt; // brings pre_exec into scope
use std::process::{Child, Command};

struct CgroupV2 {
    name: String,
    path: String,
}

impl CgroupV2 {
    fn new(name: &str) -> std::io::Result<Self> {
        // A child cgroup only gets the controllers enabled in its parent's
        // cgroup.subtree_control, so enable them at the root (not in the child,
        // which must stay a leaf to hold processes)
        fs::write("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +memory")?;

        let path = format!("/sys/fs/cgroup/{}", name);
        fs::create_dir_all(&path)?;

        Ok(Self {
            name: name.to_string(),
            path,
        })
    }

    fn set_memory_limit(&self, limit: &str) -> std::io::Result<()> {
        fs::write(format!("{}/memory.max", self.path), limit)
    }

    fn set_cpu_limit(&self, max_us: u64, period_us: u64) -> std::io::Result<()> {
        fs::write(
            format!("{}/cpu.max", self.path),
            format!("{} {}", max_us, period_us),
        )
    }

    // Move an already running process into the cgroup
    fn add_process(&self, pid: u32) -> std::io::Result<()> {
        fs::write(format!("{}/cgroup.procs", self.path), pid.to_string())
    }

    fn spawn_limited(&self, cmd: &str, args: &[&str]) -> std::io::Result<Child> {
        // Build the path before forking: the pre_exec closure must be 'static,
        // so it cannot borrow self
        let procs = format!("{}/cgroup.procs", self.path);

        let mut command = Command::new(cmd);
        command.args(args);
        unsafe {
            command.pre_exec(move || {
                // After fork, before exec: add ourselves to the cgroup.
                // Returning the error aborts the spawn instead of running unlimited.
                fs::write(&procs, std::process::id().to_string())
            });
        }
        command.spawn()
    }
}

fn main() -> std::io::Result<()> {
    let cg = CgroupV2::new("rust-app")?;
    cg.set_memory_limit("256M")?;
    cg.set_cpu_limit(50000, 100000)?;  // max 50ms CPU per 100ms

    println!("cgroup '{}' created, limits set", cg.name);

    // Start a test process under this cgroup
    let mut child = cg.spawn_limited("stress-ng", &["--cpu", "4", "--timeout", "30s"])?;
    println!("Process PID={} started under cgroup", child.id());

    let status = child.wait()?;
    println!("Process exit code: {:?}", status.code());

    Ok(())
}

cgroups v2 vs Docker: When to Use Which?

You know that Docker uses cgroups under the hood. So when should you prefer direct cgroups?

Direct cgroups v2:
  ✅ Zero overhead (no container runtime, overlayfs, network bridge)
  ✅ Full kernel control (eBPF programs, custom I/O scheduler)
  ✅ Works on embedded systems without systemd integration
  ✅ No disk image, layer, or registry headaches
  ❌ Manual management, you need to write your own scripts
  ❌ Network isolation requires extra configuration

Docker/Podman:
  ✅ Ready-made image ecosystem
  ✅ Network, volume, secret management built-in
  ✅ CI/CD pipeline integration
  ✅ Multi-host orchestration (Kubernetes)
  ❌ Overhead (containerd, runc, overlayfs, bridge network)
  ❌ Image size and layer complexity

Practical rule: If you only want resource limiting, use cgroups directly. If you want filesystem isolation, network isolation, image-based deployment, use containers.

Monitoring and Debugging

cgroups v2 provides a set of monitoring outputs. You can collect these with Prometheus node_exporter or custom scripts:

#!/bin/bash
# /usr/local/bin/cgroup-monitor.sh: reports status of all custom cgroups

echo "CGROUP             CPU_TIME_S  MEM_USAGE    MEM_LIMIT    OOM_COUNT"
echo "─────────────────  ──────────  ───────────  ───────────  ─────────"

for cg in /sys/fs/cgroup/{postgresql,nginx,redis,app}; do
    [ -d "$cg" ] || continue
    name=$(basename "$cg")

    # CPU: cumulative CPU time consumed, converted from microseconds to seconds
    cpu_usec=$(awk '$1 == "usage_usec" {print $2}' "$cg/cpu.stat" 2>/dev/null)
    cpu_s=$(( ${cpu_usec:-0} / 1000000 ))

    # Memory
    mem_cur=$(cat "$cg/memory.current" 2>/dev/null)
    mem_max=$(cat "$cg/memory.max" 2>/dev/null)
    oom_cnt=$(awk '$1 == "oom" {print $2}' "$cg/memory.events" 2>/dev/null)

    # Convert to human-readable format (memory.max reads "max" when no limit is set)
    mem_cur_h=$(numfmt --to=iec "$mem_cur" 2>/dev/null)
    if [ "$mem_max" = "max" ]; then
        mem_max_h="unlimited"
    else
        mem_max_h=$(numfmt --to=iec "$mem_max" 2>/dev/null)
    fi

    printf "%-18s %-11s %-12s %-12s %-9s\n" \
        "$name" "$cpu_s" "$mem_cur_h" "$mem_max_h" "$oom_cnt"
done

Polling the counters in memory.events shows which service has been hitting its memory limit, and comparing successive readings tells you roughly when:

# Monitor OOM events for PostgreSQL
watch -n 5 'cat /sys/fs/cgroup/postgresql/memory.events'
# low 0
# high 374
# max 15        ← usage hit the hard limit 15 times
# oom 3         ← 3 times an allocation was about to fail at the limit
# oom_kill 3    ← 3 processes killed by the OOM killer
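
Since these are cumulative counters, an alert needs the difference between two readings. The sketch below shows one way to extract a counter and compute that difference; memory_event and new_oom_kills are hypothetical names, and storing the previous reading is left to the caller.

```shell
# memory_event FILE KEY: print one counter (for example oom_kill) from memory.events.
memory_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# new_oom_kills FILE PREVIOUS: print how many OOM kills happened since the
# PREVIOUS reading of the oom_kill counter.
new_oom_kills() {
    echo $(( $(memory_event "$1" oom_kill) - $2 ))
}

# Example: new_oom_kills /sys/fs/cgroup/postgresql/memory.events "$last_seen"
```

Matching on the exact key avoids confusing oom with oom_kill.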

Conclusion

cgroups v2 is one of the most underrated tools in Linux system administration. Being able to do process-based CPU, memory, and I/O limiting without using containers offers a significant advantage, especially on bare-metal servers, embedded systems, and high-performance computing environments.

Summary of what we covered:

Single unified hierarchy: with v2, all controllers are in the same tree
Filesystem API: as simple as echo 512M > memory.max
Three-tier memory management: graceful degradation with min, high, max
Memory pressure metrics: proactive monitoring with PSI (Pressure Stall Information)
I/O limiting: read/write speed and IOPS limiting
Programmatic usage: easy integration with Rust, Go, Python

As a next step, consider combining cgroups v2 with eBPF programs to write a custom I/O scheduler, or developing a cgroup-aware OOM handler. The unified hierarchy already covers CPU affinity through the cpuset controller, and network priority is handled by matching traffic on the cgroup with eBPF or netfilter rules rather than by a dedicated v2 controller.

Tags: linux, cgroups, kernel, system-administration, resource-management, performance, memory, cpu, io
Date: 2026-05-28