What the Hell is 'Load Average' Anyway? A Guide to Actually Understanding System Load

Last Tuesday at 3:47 AM, my phone goes off. Alerts. Again.

"Load average critical: 18.45 on prod-web-03"

I grab my laptop, squinting at the screen, SSH in, run top.

CPU usage: 23%.

...what?

Look, if you've done DevOps for more than a month, you've been here. High load, but CPU is just chilling. Or worse - you see load average of 8 on a 16-core box and think "meh, we're at 50%, we're fine" while users are blowing up Slack about slow response times.

I've been doing this for 4 years now and I STILL sometimes stare at load average like it's personally insulting me. So let me tell you what I wish someone had told me back when I was first on-call and had absolutely no idea what I was looking at.

What is Load Average? (No, It's Not CPU Usage)

When you run uptime you get this:

$ uptime
 03:47:12 up 45 days,  4:23,  2 users,  load average: 18.45, 12.73, 8.92

Those three numbers are load average. And if you're like me when I started, you probably thought "okay so that's like... CPU percentage?"

Nope. Wrong. I spent my first 6 months being confused about this.

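Load average is how many processes are either running OR waiting to run. On Linux it also counts processes stuck in uninterruptible sleep, which usually means they're waiting on disk I/O. That "waiting" part? That's the bit that screwed me up for ages.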

Think of it like the line at a coffee shop:

Running processes = people actively getting their order made
Waiting processes = everyone standing in line, waiting their turn
Load average = you count BOTH groups

The line can be out the door (high load) even if there's only one barista slowly making drinks (low CPU). Or the barista can be working super fast (high CPU) but there's only 2 people in line (low load).

This is not an academic distinction. This has bitten me in production. Multiple times.
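If you want to see the two groups on a real box, here's a rough sketch. It tallies process states the way ps reports them: R is running or runnable, D is uninterruptible sleep. The helper name count_load_states is made up for this example, and it only approximates what the kernel counts (the kernel counts threads, this counts processes).

```shell
#!/bin/bash
# Sketch: tally the process states that feed Linux load average.
# count_load_states is a made-up helper name, not a standard tool.
# Input: one process state per line (e.g. from `ps -eo state=`).
count_load_states() {
    awk '
        $1 ~ /^R/ { running++ }   # running or runnable (on CPU or in the run queue)
        $1 ~ /^D/ { blocked++ }   # uninterruptible sleep, usually disk I/O
        END { printf "runnable=%d uninterruptible=%d\n", running, blocked }
    '
}

# Live usage on a Linux box:
#   ps -eo state= | count_load_states
```

If the uninterruptible count is the big one, the line is out the door because of something other than the CPU.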

The Thing Nobody Tells You About Load Average

Here's what really messed with my head when I was starting out:

I had a database server showing:

Load Average: 24.8
CPU Usage: 18%

My manager (who was actually pretty cool but also stressed as hell) messages me at like 2 PM: "Why is DB load so high?"

Me, confidently: "Load is high but CPU is low, so we're fine."

Him: "Users are reporting slow queries."

Me: "..."

Turns out the server was doing a massive table scan. Disk I/O was HAMMERED. The CPU was mostly sitting around waiting for the disk to respond. Processes were piling up in what's called "uninterruptible sleep" waiting for disk operations.

When I finally ran iostat (which I should have done immediately but didn't because I'm sometimes an idiot):

$ iostat -x 1
avg-cpu:  %user   %system  %iowait   %idle
          12.3    5.8      78.2      3.7

That 78% in iowait? That was the problem.

Here's what I learned the hard way: High load + low CPU = you're waiting on something else. Usually:

Disk I/O (most common in my experience)
Network I/O
Memory/swap thrashing
Lock contention (processes waiting on each other)
Sometimes just weird network stuff like DNS timeouts

The Three Numbers (And Why They Matter More Than I Thought)

load average: 2.45, 1.73, 0.92
              └1min └5min └15min

Okay so these aren't three different things - they're the same thing averaged over different windows.

When I first started, I only looked at the first number. Big mistake.

The pattern tells you what's happening:

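# Oh shit, something just broke
load average: 8.2, 3.1, 1.4

# Oh no, it's been getting WORSE for a while
load average: 6.8, 4.3, 2.1

# It's recovering
load average: 2.1, 4.3, 6.8

# This has been broken for a while
load average: 12.4, 12.1, 11.8

# We're good
load average: 1.2, 1.4, 1.3

I learned to read these patterns after getting woken up at 4 AM one too many times. The climbing pattern, where the 1-minute number is above the 5-minute and the 5-minute is above the 15-minute (6.8, 4.3, 2.1), is the scariest because it means whatever's wrong is still getting worse. Flip the order (2.1, 4.3, 6.8) and the box is actually recovering.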

How Many CPUs Do I Have Again?

Real talk: I STILL sometimes have to count.

$ nproc
8

Okay so 8 cores. Now here's the rule I actually use (not the textbook stuff):

On web/API servers (8 cores, like this one):

Load under 6 → I go back to sleep
Load 6-12 → I'm watching it
Load over 12 → I'm up and investigating

On batch processing boxes:

Load under 16 → totally normal
Load over 24 → maybe check what's running
Load over 40 → okay what the hell

Why the difference? Batch jobs are SUPPOSED to max out resources. Web servers need headroom for traffic spikes.

I've learned this from experience, not from any blog post. Your mileage may vary.

When I Actually Worry About Load

Honestly? Context matters way more than the number.

I've seen load 15 on an 8-core box doing overnight ETL processing - totally fine, job finishes on time, nobody cares.

I've also seen load 5 on the same box during business hours causing API timeouts and angry Slack messages from the product team.

Here's my actual mental checklist (4 years of getting this wrong has taught me):

Is anything actually broken? (check error rates, response times)
What time is it? (batch job time vs peak traffic time)
Is load going up or down? (trend matters)
Are users complaining? (this one's important)

If load is high but everything works fine, I document it and move on. If load is "normal" but things are broken, I investigate anyway.

Real Scenarios I've Actually Dealt With

Let me tell you about some of my actual 3 AM debugging sessions.

The Database That Wasn't Really a Database Problem

This was maybe 6 months into my first DevOps job. I get paged at 3 AM.

$ uptime
load average: 28.4, 24.1, 18.9

I'm still half asleep, trying to remember what server this even is. Run top:

%Cpu(s): 15.2 us, 4.3 sy, 0.0 ni, 2.1 id, 78.4 wa

That wa is I/O wait. It's at 78%.

Now, past me would have panicked. But I'd learned by this point to check disk I/O:

$ iostat -x 1
Device  r/s    w/s    %util
sda     1247   89     100.00

Disk is absolutely maxed. Great. What's hammering it?

Checked MySQL slow query log (something I should have set up better in the first place, but that's a different story). Someone had run a report query without a WHERE clause. Full table scan on a 50GB table.

The fix: Kill the query, add an index, tell the analyst to please god test queries in staging first.

Load dropped from 28 to like 2 in a couple minutes. Went back to bed. Got shit from my manager the next day because apparently that index "should have existed already." Whatever, production was up.

Lesson: High load + low CPU + high iowait = check your damn disks.

The Memory Leak I Didn't Know Was Happening

This one took me THREE DAYS to figure out. I'm still kind of annoyed about it.

We had a Node.js app that just kept getting slower. Not dramatically, just... gradually worse. Week 1, load was around 2. Week 3, load was hovering at 7. Week 4, we hit 15 and things started timing out.

CPU usage? Still around 25% the whole time. What the hell.

I spent way too long looking at the application code, at database queries, at everything EXCEPT the actual problem. Finally, someone more experienced than me (shoutout to Priya from my old team) told me to check memory:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62G          58G        312M        1.2G        3.8G        2.1G
Swap:          8.0G        6.2G        1.8G

We were SWAPPING. The app had a memory leak (turned out to be an unclosed database connection pool, classic) and had grown to 58GB. System was thrashing swap constantly.

$ vmstat 1
procs -----------memory---------- ---swap-- -----io----
 r  b   swpd   free   buff  cache   si   so    bi    bo
 8 47 8388608  184320  12    81920  1247 1182  8493 4721

That si and so (swap in and out) plus 47 processes in blocked state = we're spending all our time managing memory instead of doing actual work.

The fix: Restart the app (temporary), track down the leak (took another week), add proper memory limits, set up better monitoring.

Lesson: If you see high load + low CPU + swap activity, you have a memory problem. Also, I should have checked memory on day 1. Live and learn.
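The "swap activity plus blocked processes" check can be sketched as a small filter over one line of vmstat output. The helper name swap_verdict and both thresholds are assumptions for illustration, not standard values, so tune them for your own boxes.

```shell
#!/bin/bash
# Sketch: flag swap thrashing from one line of `vmstat` data.
# swap_verdict and the default thresholds are illustrative assumptions.
# Column order assumed: r b swpd free buff cache si so ... (procps vmstat).
swap_verdict() {
    awk -v max_swap="${1:-100}" -v max_blocked="${2:-10}" '
        {
            blocked = $2; swap_in = $7; swap_out = $8
            if (swap_in + swap_out > max_swap && blocked > max_blocked)
                print "thrashing"      # heavy swap traffic and a pile of blocked tasks
            else if (swap_in + swap_out > 0)
                print "swapping"       # some swap traffic, not yet a pile-up
            else
                print "ok"
        }
    '
}

# Live usage (take the second sample, the first is an average since boot):
#   vmstat 1 2 | tail -1 | swap_verdict
```

Fed the vmstat line from this incident, it prints "thrashing".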

The Weird Network Thing That Made No Sense

This one was weird. And I'm still not 100% sure I understand what happened.

$ uptime  
load average: 6.8, 6.2, 5.9

$ top
%Cpu(s): 12.3 us, 45.2 sy, 0.0 ni, 42.5 id, 0.0 wa

So user CPU is low, but system CPU is high. That usually means the kernel is doing something expensive.

Checked network (because honestly I was running out of ideas):

$ sar -n DEV 1
IFACE    rxpck/s   txpck/s
eth0     125789    124532

Holy shit. 125,000 packets per second. Our microservices were basically screaming at each other constantly. Small packets, tons of overhead.

Turned out someone had deployed a change that made services poll each other every 100ms instead of using proper event-driven updates. Every. Service. Was. Polling. Every. Other. Service.

The fix: Revert the deployment, have a very awkward conversation in retro about testing before production.

Lesson: High system CPU can mean network saturation. Also, microservices are great until they're not.
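If sar isn't installed, the same packet rate can be worked out from the counters in /proc/net/dev. This is a minimal sketch: iface_packets and packet_rate are hypothetical helper names, and eth0 is just the interface from the example above.

```shell
#!/bin/bash
# Sketch: pull packet counters for one interface out of /proc/net/dev text.
# iface_packets is a made-up helper name. Prints "<rx_packets> <tx_packets>".
iface_packets() {
    awk -v iface="$1" '
        {
            sub(/:/, " ")                 # "eth0:123" and "eth0: 123" both split cleanly
            if ($1 == iface) print $3, $11
        }
    '
}

# Packets per second from two counter samples taken interval seconds apart.
packet_rate() {
    local before=$1 after=$2 interval=$3
    echo $(( (after - before) / interval ))
}

# Live usage on Linux:
#   read -r rx1 _ < <(iface_packets eth0 < /proc/net/dev); sleep 1
#   read -r rx2 _ < <(iface_packets eth0 < /proc/net/dev)
#   echo "rx pps: $(packet_rate "$rx1" "$rx2" 1)"
```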

What Your Grafana Dashboard Is Actually Showing

Okay so most of us have Grafana or Datadog or whatever showing pretty graphs. Here's the thing: looking at just load average is USELESS.

I learned this after staring at a graph for 30 minutes trying to figure out why load spiked at 2 PM every day. Turns out it was scheduled reports hitting the database. Would have been obvious if I'd been looking at disk I/O at the same time.

Here's what I actually look at now:

Load average (obviously)
CPU usage breakdown (user, system, iowait, idle)
Memory available (not used - available)
Disk I/O (operations per second, not just throughput)
Network I/O (packets AND bandwidth)

When load spikes, you can immediately see what's correlated. Is CPU spiking? Is iowait spiking? Is memory dropping?

Quick Grafana tip that saved my ass:

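# Load per core - if this is over 1, you're oversubscribed
# (count without (cpu, mode) keeps the instance label so the division matches per host)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})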

This normalizes load by CPU count. Makes it way easier to set alerts across different instance types.

The Monitoring Script I Actually Use

Alright, enough theory. Here's a script I actually run on servers when load is high and I need to figure out what's going on.

This isn't perfect. It's what I've cobbled together over 4 years of debugging production issues at 3 AM when I'm too tired to remember all the commands.

#!/bin/bash

# Quick and dirty load analysis script
# Made by someone who's tired of typing the same commands at 3 AM

RED='\033[0;31m'
YELLOW='\033[1;33m'
GREEN='\033[0;32m'
BLUE='\033[0;34m'
RESET='\033[0m'

echo -e "${GREEN}=== Load Average Quick Check ===${RESET}\n"

# Basic info
CPU_COUNT=$(nproc)
read -r LOAD1 LOAD5 LOAD15 _ < /proc/loadavg

echo "CPU Cores: $CPU_COUNT"
echo "Load: $LOAD1 (1m) / $LOAD5 (5m) / $LOAD15 (15m)"

# Calculate load per core
LOAD_PER_CORE=$(echo "scale=2; $LOAD1 / $CPU_COUNT" | bc)
echo "Load per core: $LOAD_PER_CORE"

# Quick status
if (( $(echo "$LOAD_PER_CORE > 2.0" | bc -l) )); then
    echo -e "${RED}Status: CRITICAL - way over capacity${RESET}"
elif (( $(echo "$LOAD_PER_CORE > 1.5" | bc -l) )); then
    echo -e "${YELLOW}Status: WARNING - high load${RESET}"
elif (( $(echo "$LOAD_PER_CORE > 1.0" | bc -l) )); then
    echo -e "${YELLOW}Status: Near capacity${RESET}"
else
    echo -e "${GREEN}Status: Normal${RESET}"
fi

# Trend
echo -e "\n${BLUE}=== Trend ===${RESET}"
if (( $(echo "$LOAD1 > $LOAD5" | bc -l) )) && (( $(echo "$LOAD5 > $LOAD15" | bc -l) )); then
    echo -e "${RED}⬆ Getting WORSE${RESET}"
elif (( $(echo "$LOAD1 < $LOAD5" | bc -l) )) && (( $(echo "$LOAD5 < $LOAD15" | bc -l) )); then
    echo -e "${GREEN}⬇ Recovering${RESET}"
else
    echo -e "${YELLOW}→ Stable${RESET}"
fi

# What's the bottleneck?
echo -e "\n${BLUE}=== Bottleneck Check ===${RESET}"

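# CPU: take a one-second sample with vmstat (columns 15 and 16 are id and wa).
# Parsing top by position breaks when idle hits 100.0 and the fields run together.
read -r CPU_IDLE IOWAIT <<< "$(vmstat 1 2 | tail -1 | awk '{print $15, $16}')"
CPU_USAGE=$((100 - CPU_IDLE - IOWAIT))   # busy time only, iowait reported separately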

echo "CPU Usage: ${CPU_USAGE}%"
echo "I/O Wait: ${IOWAIT}%"

# Memory
MEM_AVAILABLE=$(free -m | awk 'NR==2{printf "%s", $7}')
MEM_TOTAL=$(free -m | awk 'NR==2{printf "%s", $2}')
# Multiply before dividing so bc's scale=2 doesn't truncate the ratio first
MEM_PERCENT=$(echo "scale=2; ($MEM_TOTAL - $MEM_AVAILABLE) * 100 / $MEM_TOTAL" | bc)

echo "Memory Usage: ${MEM_PERCENT}%"

# Swap (falls back to 0 if free prints no Swap line)
SWAP_USED=$(free -m | awk 'NR==3{printf "%s", $3}')
SWAP_USED=${SWAP_USED:-0}
if [ "$SWAP_USED" -gt 100 ]; then
    echo -e "${RED}Swap in use: ${SWAP_USED}MB - MEMORY PROBLEM${RESET}"
fi

echo ""

# My actual logic for what to check
if (( $(echo "$IOWAIT > 30" | bc -l) )); then
    echo -e "${RED}🔍 DISK I/O problem - check iostat${RESET}"
    echo "Commands to run:"
    echo "  iostat -x 1"
    echo "  iotop -o"
elif (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
    echo -e "${RED}🔍 CPU problem - check top${RESET}"
    echo "Commands to run:"
    echo "  top -o %CPU"
    echo "  ps aux --sort=-%cpu | head -20"
elif (( $(echo "$MEM_PERCENT > 90" | bc -l) )) || [ "$SWAP_USED" -gt 100 ]; then
    echo -e "${RED}🔍 MEMORY problem${RESET}"
    echo "Commands to run:"
    echo "  free -h"
    echo "  ps aux --sort=-%mem | head -20"
    echo "  vmstat 1"
else
    echo -e "${YELLOW}No obvious bottleneck - check network or locks${RESET}"
    echo "Try: sar -n DEV 1"
fi

# Top CPU hogs
echo -e "\n${BLUE}=== Top 5 CPU Users ===${RESET}"
ps aux --sort=-%cpu | head -6 | tail -5 | awk '{printf "%-10s %-8s %s\n", $2, $3"%", $11}'

# Processes in uninterruptible sleep (the sneaky ones)
# STAT can be D, Ds, D+, D< and so on, so match on the first letter
UNINTERRUPTIBLE=$(ps aux | awk '$8 ~ /^D/' | wc -l)
if [ "$UNINTERRUPTIBLE" -gt 5 ]; then
    echo -e "\n${RED}⚠ $UNINTERRUPTIBLE processes stuck waiting for I/O${RESET}"
    echo "These contribute to load but don't show in CPU usage"
fi

echo -e "\n${GREEN}=== Done ===${RESET}"

Save as check_load.sh, chmod +x it, run it when you get paged.

Is it perfect? No. Does it cover every edge case? Absolutely not. But it tells me what to look at next when I'm half-asleep and don't want to think.

My Actual Troubleshooting Process

When I get a high load alert, here's what I actually do (not what I "should" do according to textbooks):

Step 1: Quick Look (30 seconds)

uptime && top -bn1 | head -20

I'm looking for:

Is load going up or down? (the three numbers)
Is CPU high?
Is iowait high?

Step 2: Figure Out What's Actually Wrong (2-5 minutes)

If CPU is high:

# Who's using CPU?
ps aux --sort=-%cpu | head -10

Nine times out of ten it's one process being an idiot.

If iowait is high:

# What's hitting the disk?
iostat -x 1

Usually a database query or log writes or something dumb.

If neither CPU nor iowait is high:

# Check memory
free -h
vmstat 1

# Check for blocked processes (STAT starts with D: D, Ds, D+, ...)
ps aux | awk '$8 ~ /^D/' | wc -l

This is where it gets weird. Usually network stuff or locking.

Step 3: Fix It or Escalate

Honestly? Most of the time I:

Restart the problem service (if it's not critical)
Kill the bad query (if it's database)
Call someone more senior (if I have no idea)

You know what I've learned? Sometimes "turn it off and on again" actually works. Not elegant, but effective at 3 AM.

Things That Will Drive You Insane

Okay real talk, here are the things about load average that still annoy me:

Docker Containers See Host Load

This is SO DUMB. Your container sees the entire host's load average even though it only has 2 cores allocated.

# Inside container with 2 core limit
$ uptime
load average: 45.2, 42.1, 38.9

But the container can't even USE that many cores! The number describes the whole host, so from inside the container it tells you almost nothing.

I've wasted hours explaining this to teammates. "No, our app isn't broken, that's the host's load from all containers."

Better metrics for containers: Use cgroup metrics. Or just track actual CPU usage and ignore load entirely in containers.
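As a sketch of what "use cgroup metrics" can look like: on a cgroup v2 host, cpu.stat reports how many scheduling periods the container went through and in how many of them it was throttled. The helper name throttle_percent is made up, and this assumes a CPU limit is actually set on the container.

```shell
#!/bin/bash
# Sketch: how often was this container CPU-throttled?
# throttle_percent is a made-up helper; it reads cgroup v2 cpu.stat text on stdin.
# Assumes a CPU limit is set, otherwise nr_periods stays at 0.
throttle_percent() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "n/a"
        }
    '
}

# Live usage inside a container on a cgroup v2 host:
#   throttle_percent < /sys/fs/cgroup/cpu.stat
```

A high throttle percentage says the container itself is starved for CPU, which the host's load average can't tell you.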

Cloud Instance Weirdness

AWS T2/T3 instances are "burstable." They have CPU credits. Load average doesn't account for this AT ALL.

You can have low load but still be throttled because you ran out of credits. Or high load but everything's fine because you have credits.

I learned this the expensive way when our T2 instances kept getting slow and I couldn't figure out why the load average wasn't that high. Turns out we were burning through CPU credits in the first hour of the day, then getting throttled.

Fix: Use normal instances for anything important. T2/T3 are fine for dev stuff but they'll mess with your head in production.

The "It Was Fine Yesterday" Problem

Load average is a moving average. If something broke 10 minutes ago, it takes a while to fully reflect in the 15-minute average.

I've had situations where something went wrong, we fixed it, but the alerts kept firing because the 15-minute average was still elevated and takes its sweet time coming back down.

Just... be aware of this. Recent changes take time to fully show up.
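To get a feel for the lag, here's a rough simulation. Each of the three numbers is an exponentially damped average that Linux refreshes about every 5 seconds, so the longer windows react slowly in both directions. The helper name simulate_load is made up, and the fixed 5-second step is a simplification.

```shell
#!/bin/bash
# Sketch of the exponentially damped moving average behind load average.
# simulate_load is a made-up helper. Arguments: starting load, the steady
# number of runnable tasks, minutes elapsed, and the window (1, 5 or 15).
# Assumes the kernel's roughly 5-second update interval.
simulate_load() {
    awk -v load="$1" -v active="$2" -v minutes="$3" -v window="$4" 'BEGIN {
        decay = exp(-5 / (60 * window))          # weight kept from the old value
        for (t = 0; t < minutes * 60; t += 5)
            load = load * decay + active * (1 - decay)
        printf "%.2f\n", load
    }'
}

# Ten runnable tasks appear on an idle box; where is each average after 5 minutes?
for window in 1 5 15; do
    echo "${window}m average: $(simulate_load 0 10 5 "$window")"
done
```

After five minutes the 1-minute number is already close to 10, while the 15-minute number is still under 3. The same lag applies on the way back down after a fix.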

Setting Up Alerts That Don't Suck

Early in my career, I set up an alert like "if load > 8, page me."

I got paged approximately 47 times in one week.

Here's what I do now:

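# Don't just alert on load
# Alert on load + actual impact
# (the http_* names are placeholders, swap in your own latency and error-rate metrics;
#  "on (instance)" assumes both sides carry a matching instance label)

alert: HighLoadWithImpact
expr: |
  (node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1.5
  and on (instance)
  (
    http_request_duration_seconds_p95 > 2.0
    or
    http_requests_error_rate > 0.05
  )
for: 5m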

Translation: Only page me if load is high AND users are actually affected.

Also, I have different thresholds for different times:

Business hours: Alert at load 1.2x cores
Night/weekend batch jobs: Alert at load 3x cores

Saves my sleep schedule.
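Those two multipliers can be turned into a small helper that picks the threshold for the current time. This is only a sketch: load_threshold is a made-up name, and treating weekdays 08:00-18:00 as "business hours" is an assumption you'd adjust.

```shell
#!/bin/bash
# Sketch: pick a load threshold from core count and time of day.
# load_threshold and the 08-18 weekday window are assumptions;
# the 1.2x and 3x multipliers are the ones listed above.
load_threshold() {
    local cores=$1 hour=$2 weekday=$3   # hour 0-23, weekday 1-7 (1 = Monday)
    local multiplier=3
    if [ "$weekday" -le 5 ] && [ "$hour" -ge 8 ] && [ "$hour" -lt 18 ]; then
        multiplier=1.2
    fi
    awk -v c="$cores" -v m="$multiplier" 'BEGIN { printf "%.1f\n", c * m }'
}

# Live usage (GNU date; %-H drops the leading zero):
#   load_threshold "$(nproc)" "$(date +%-H)" "$(date +%u)"
```

On an 8-core box that works out to 9.6 during business hours and 24.0 overnight or on weekends.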

War Story: The DNS Timeout Incident

This was the weirdest one I've dealt with.

Our API servers were showing load of 15-20 on 8-core machines. CPU was like 30%. iowait was fine. Memory was fine. What the hell?

Checked for blocked processes:

$ ps aux | awk '$8 ~ /^D/' | wc -l
847

EIGHT HUNDRED AND FORTY SEVEN processes in uninterruptible sleep (state "D"). That's... a lot.

I had no idea what was happening. Called my senior engineer (it was like 4 PM, thankfully not the middle of the night). He told me to strace one of the processes:

$ strace -p $(pgrep -f api | head -1)
...endless DNS lookups...

Our DNS server had gone down. Every API call was trying to resolve hostnames and timing out after 30 seconds. Processes were piling up waiting for DNS.

The fix: Switch to secondary DNS, set up local DNS cache (dnsmasq), add proper DNS monitoring.

Lesson: High load + low CPU + tons of D-state processes = probably waiting on external service. Check network stuff, DNS, remote APIs, whatever.

I still think about this one sometimes. It was so weird and took forever to figure out.

Stuff I Still Don't Fully Understand

Look, I'm 4 years in and there's still stuff that confuses me:

Why exactly does Linux include I/O wait in load average but not network wait? Still don't really get it.
The exact formula for the exponential decay in the moving average - I know it exists, don't fully understand the math
How load average behaves with cgroups and namespaces in containers - it's weird and inconsistent

Point is, you don't need to understand everything perfectly to be effective. You need to understand enough to debug production.

Tools I Actually Use

Everyone says "learn these 50 performance tools!" but here's what I actually reach for:

Every single time:

top or htop (I prefer htop but top is always installed)
iostat -x 1
free -h

When I'm confused:

vmstat 1
sar -n DEV 1 (network)
ps aux with various sort flags

When I'm desperate:

strace (attaching to running processes)
lsof (see what files processes have open)
netstat or ss (network connections)

Never:

Any fancy GUI tool in production
Anything that requires installing stuff while things are on fire

Keep it simple. Use what's already there.

What I Wish I'd Known Earlier

If I could go back and tell myself something 4 years ago:

Load average is not CPU usage - Seems obvious now, wasn't then
High load + low CPU = you're waiting on something - Usually disk
The trend matters more than the number - Increasing load is scarier than high but stable load
Context matters - Load 5 during batch processing is fine, during peak traffic is not
Check memory/swap earlier - I always forget this one
Don't panic - Most high load situations aren't actually emergencies
Document your fixes - You will see the same issue again

Also: it's okay to not know. It's okay to call for help. It's okay to say "I don't know what's wrong yet" in Slack. Four years in and I still do all of these regularly.

Resources That Actually Helped Me

Books I've actually read (not just skimmed):

"Systems Performance" by Brendan Gregg - Chapter 6 is all about CPU and load
- Warning: This book is DENSE. I read it over like 6 months.

Blogs:

Brendan Gregg's blog (brendangregg.com) - Lots of deep technical stuff
Julia Evans's blog (jvns.ca) - Explains things in a way that makes sense

Man pages that are actually useful:

man proc - search for "loadavg"
man top
man iostat

Tools to actually learn:

htop (better than top, has colors, easier to read)
glances (shows everything at once, good for quick overview)
dstat (I don't use this much but some people swear by it)

Your Turn

So what's your load average horror story?

I bet you've got one. Maybe you've been paged at 2 AM for a load spike that turned out to be scheduled backups. Maybe you've spent hours debugging only to realize you were looking at the wrong server (I've done this. Twice.).

Drop it in the comments. We all learn from each other's mistakes, and honestly it makes me feel better knowing I'm not the only one who's screwed this up.

Also, if you've found better ways to deal with this stuff, share them. I'm always trying to get better at this.

What's next? Thinking about writing either:

"Why I/O Wait Isn't Always What You Think It Is"
"Memory Pressure and the OOM Killer: A Love Story"

Let me know which sounds more useful. Or suggest something else you want to see.

And if this helped you debug something at 3 AM, that's what it's for. Good luck out there.
