As a DevOps engineer working with large-scale systems, I've noticed that file descriptors are often overlooked until they become a problem. In this guide, we'll explore what file descriptors are, why they matter, and how to monitor them effectively using a practical monitoring script.
What are File Descriptors?
Before diving into monitoring, let's understand what we're dealing with. In Linux, everything is treated as a file, and file descriptors are simply numeric handles that the operating system uses to keep track of open files. These "files" include:
Regular files and directories
Network sockets
Pipes
Device files
When your application opens a file or creates a network connection, the operating system assigns it a file descriptor. Each process has a limit on how many file descriptors it can use, and there's also a system-wide limit.
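You can see these handles directly through the /proc filesystem. A minimal illustration (FDs 0, 1, and 2 are stdin, stdout, and stderr; the rest depend on what the process has open):
# List the file descriptors held by the current shell
ls -l /proc/$$/fd
# Open a file on FD 3, confirm it appears, then close it
exec 3< /etc/hostname
ls -l /proc/$$/fd
exec 3<&-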
Why Monitor File Descriptors?
Common scenarios where file descriptor monitoring becomes critical:
Web servers handling multiple concurrent connections
Database systems managing many open files
Applications that leak file descriptors by failing to close file handles
Microservices architectures with numerous network connections
Understanding File Descriptor Limits
Linux systems have two types of limits:
Soft limit: The limit currently enforced for a process; the process can raise it itself, up to the hard limit
Hard limit: The ceiling for the soft limit; raising it requires root privileges
You can check your current limits using:
ulimit -Sn # Soft limit
ulimit -Hn # Hard limit
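Within a session, a non-root user can raise the soft limit up to (but not beyond) the hard limit:
# Raise the soft limit for the current shell (must not exceed ulimit -Hn)
ulimit -Sn 4096
ulimit -Sn # Verify the new value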
When to Consider Increasing Limits
High Connection Volume
Web servers handling thousands of concurrent connections
Each connection typically requires one file descriptor
Example: A Node.js application serving 10,000 concurrent users
Database Operations
Database systems managing many open files
Multiple table files open simultaneously
Example: MongoDB using memory-mapped files
Application Warnings
Log messages about "Too many open files"
Connection failures under high load
Process crashes with file descriptor errors
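These symptoms are easy to confirm by searching your logs for the error; for example (the unit name and log path below are placeholders for your own):
# Search a systemd service's journal for FD exhaustion errors
journalctl -u your-service --since "1 hour ago" | grep -i "too many open files"
# Or search an application log file directly
grep -i "too many open files" /var/log/your-app.log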
Microservices Architecture
Multiple services communicating via network sockets
Each service maintaining multiple connections
Example: A system with 20 microservices, each connecting to 5 others
Benefits of Increasing Limits
Higher Concurrency
Handle more simultaneous connections
Better support for WebSocket applications
Improved scalability for high-traffic services
Reduced Error Rates
Fewer "Too many open files" errors
More stable application performance
Better user experience
Operational Flexibility
Room for temporary spikes in usage
Easier debugging (more room for diagnostic tools)
Better support for development activities
Potential Drawbacks and Risks
Memory Impact
Memory Usage ≈ Number of FDs × FD Structure Size
Each file descriptor consumes kernel memory
Large numbers can impact system memory availability
Example: 100,000 FDs might use ~20MB of kernel memory
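As a rough sanity check of that example (the ~200 bytes per FD here is an assumption; the real per-FD cost varies by kernel version):
# Back-of-the-envelope kernel memory estimate
FD_COUNT=100000
BYTES_PER_FD=200 # Approximate size of the kernel's per-file bookkeeping
echo "$((FD_COUNT * BYTES_PER_FD / 1024 / 1024)) MB" # ~19 MB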
Security Considerations
Higher limits can amplify security vulnerabilities
DoS attacks might be more impactful
Resource exhaustion risks
System Stability
Limits set too high might mask underlying issues
Harder to detect resource leaks
Potential impact on other system resources
Finding the Right Balance
Calculate Actual Needs
# Formula for web servers
Minimum FDs = (Peak Concurrent Users × 1.5) + System Overhead
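In shell form, with illustrative numbers (replace the peak and overhead values with your own measurements):
# Estimate the minimum FD requirement for a web server
PEAK_USERS=10000
SYSTEM_OVERHEAD=500 # FDs for log files, shared libraries, listening sockets, etc.
MIN_FDS=$((PEAK_USERS * 3 / 2 + SYSTEM_OVERHEAD)) # x1.5 in integer math
echo "Minimum FDs: $MIN_FDS" # 15500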
Monitor Usage Patterns
# Track daily peaks
while true; do
    date >> fd_usage.log
    lsof | wc -l >> fd_usage.log
    sleep 3600
done
Implement Graduated Increases
# Start with modest increases (×1.5 soft, ×2 hard), in integer math
soft_limit=$((current_peak * 3 / 2))
hard_limit=$((soft_limit * 2))
Best Practices for Limit Management
Dynamic Adjustment Strategy
# Example monitoring threshold
USAGE_THRESHOLD=70 # Percentage
CURRENT_USAGE=$(lsof | wc -l)
SOFT_LIMIT=$(ulimit -Sn)
if [ $((CURRENT_USAGE * 100 / SOFT_LIMIT)) -gt "$USAGE_THRESHOLD" ]; then
    # Alert for review
    echo "Consider limit increase"
fi
Regular Review Process
Monitor weekly usage patterns
Review application error logs
Check system resource usage
Documentation Requirements
# Example documentation template
- Current limits: soft=<value>, hard=<value>
- Last change date: <date>
- Reason for change: <reason>
- Impact observed: <impact>
Emergency Response Plan
# Quick temporary increase (current session only)
ulimit -n <new_limit>
Practical Implementation Guide
Gradual Increase Approach
# Step 1: Increase by 50%
NEW_LIMIT=$((CURRENT_LIMIT * 3 / 2))
# Step 2: Monitor for 1 week
# Step 3: Evaluate impact
# Step 4: Adjust if needed
System-specific Considerations
Web Servers:
max_clients × avg_files_per_client
Databases:
max_connections × tables_per_connection
Microservices:
services × connections_per_service × safety_factor
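Put together as a quick sizing sketch (every input below is an illustrative assumption, not a recommendation):
# Rough FD sizing estimates per system type
web_fds=$((5000 * 2)) # max_clients x avg_files_per_client
db_fds=$((200 * 10)) # max_connections x tables_per_connection
micro_fds=$((20 * 5 * 2)) # services x connections_per_service x safety_factor
echo "Web: $web_fds, DB: $db_fds, Microservices: $micro_fds"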
Monitoring Implementation
# Add to monitoring script
track_fd_usage() {
    current=$(lsof | wc -l)
    limit=$(ulimit -Sn)
    usage=$((current * 100 / limit))
    echo "Current Usage: $usage%"
    echo "Absolute Count: $current"
    echo "Limit: $limit"
}
The Monitoring Script
Let's break down our monitoring script into digestible sections. This script provides comprehensive file descriptor monitoring with smart OS detection and useful recommendations.
OS Detection and Package Management
First, the script determines which Linux distribution it's running on:
detect_os() {
if [ -f /etc/os-release ]; then
. /etc/os-release
OS=$ID
OS_VERSION=$VERSION_ID
elif [ -f /etc/redhat-release ]; then
OS=$(awk '{print tolower($1)}' /etc/redhat-release)
OS_VERSION=$(awk '{print $3}' /etc/redhat-release)
fi
}
Analyzing File Descriptor Usage
The script checks current usage and calculates important metrics:
analyze_fd_usage() {
read used_fds free_fds max_fds < /proc/sys/fs/file-nr
soft_limit=$(ulimit -Sn)
hard_limit=$(ulimit -Hn)
# Calculate usage percentages
system_usage=$(echo "scale=2; ($used_fds / $max_fds) * 100" | bc)
user_soft_usage=$(echo "scale=2; ($used_fds / $soft_limit) * 100" | bc)
}
Smart Recommendations
When usage is high, the script provides customized recommendations:
if (( $(echo "$user_soft_usage > 70" | bc -l) )); then
recommended_soft_limit=$(( (used_fds * 2 + 1000) / 1000 * 1000 ))
recommended_hard_limit=$(( recommended_soft_limit * 2 ))
echo "Recommended settings for /etc/security/limits.conf:"
echo "* soft nofile ${recommended_soft_limit}"
echo "* hard nofile ${recommended_hard_limit}"
fi
Deploying the Script
Installation and Setup
Save the script as fd_monitor.sh
Make it executable:
chmod +x fd_monitor.sh
Run it with sudo:
sudo ./fd_monitor.sh
Regular Monitoring
Set up a cron job for daily monitoring:
0 0 * * * /path/to/fd_monitor.sh >> /var/log/fd_monitoring.log 2>&1
Integration with Monitoring Systems
The script's output can be parsed by monitoring systems like Nagios or Zabbix:
# Example Nagios check
usage=$(./fd_monitor.sh | grep "System Usage" | awk '{print $NF}' | tr -d '%')
usage=${usage%.*} # Truncate decimals; [ -gt ] requires an integer
if [ "$usage" -gt 80 ]; then
    echo "CRITICAL - FD usage at ${usage}%"
    exit 2
fi
Troubleshooting Common Issues
High File Descriptor Usage
If you see high usage, check these common culprits:
Leaked File Descriptors
lsof -p <pid> | wc -l # Count open files for a process
Network Connections
ss -s # Socket statistics
Process Analysis
for pid in /proc/[0-9]*; do
    # ls (one entry per line when piped) avoids the "total" header that ls -l adds;
    # tr makes the NUL-separated cmdline readable
    echo "$(ls "$pid/fd" 2>/dev/null | wc -l) $(cat "$pid/cmdline" 2>/dev/null | tr '\0' ' ')"
done | sort -rn | head
System-wide Issues
For system-wide problems:
Check system limits:
cat /proc/sys/fs/file-max
Monitor system-wide usage:
cat /proc/sys/fs/file-nr
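The three fields in file-nr are allocated handles, allocated-but-unused handles, and the system-wide maximum, so actual usage is the first minus the second:
# Interpret /proc/sys/fs/file-nr
read allocated unused max < /proc/sys/fs/file-nr
echo "In use: $((allocated - unused)) of $max"
# Or watch it live
watch -n 5 cat /proc/sys/fs/file-nr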
Advanced Topics
Container Considerations
When running in containerized environments:
Check both container and host limits (see the Docker example below)
Consider namespace limitations
Monitor Docker socket usage
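For Docker specifically, per-container limits can be set at run time (the image and container names below are placeholders):
# Set the nofile limit for a single container (soft:hard)
docker run --ulimit nofile=65535:65535 your-image
# Check the limit a running container actually has
docker exec your-container sh -c 'ulimit -n'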
Performance Tuning
For high-performance systems:
Adjust based on available memory
Consider workload patterns
Plan for growth
Next Steps
Start with basic monitoring
Understand your baseline usage
Set up alerting
Plan for scaling
The Complete Automated Script
This script helps automate the setup process:
Detects Your OS: It identifies your operating system (like Ubuntu or CentOS).
Installs Dependencies: It automatically installs any necessary dependencies.
Adapts Checks: It adjusts its checks based on your specific environment.
#!/bin/bash
# Color codes for enhanced readability
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
BLUE='\033[0;34m'
RESET='\033[0m'
# Detect Operating System
detect_os() {
if [ -f /etc/os-release ]; then
. /etc/os-release
OS=$ID
OS_VERSION=$VERSION_ID
elif [ -f /etc/redhat-release ]; then
OS=$(awk '{print tolower($1)}' /etc/redhat-release)
OS_VERSION=$(awk '{print $3}' /etc/redhat-release)
else
echo -e "${RED}Unsupported operating system.${RESET}"
exit 1
fi
echo -e "${GREEN}Detected OS: ${OS} ${OS_VERSION}${RESET}"
}
# Install required packages
install_dependencies() {
# Array of required packages
REQUIRED_PACKAGES=("procps" "bc" "net-tools" "iproute2")
# Detect package manager and install packages
case "$OS" in
ubuntu|debian)
# Install missing packages without updating package lists
for pkg in "${REQUIRED_PACKAGES[@]}"; do
if ! dpkg -s "$pkg" &> /dev/null; then
echo -e "${YELLOW}Installing ${pkg}...${RESET}"
apt-get install -y --no-upgrade "$pkg"
fi
done
# Additional Ubuntu/Debian specific packages
if ! dpkg -s "sysstat" &> /dev/null; then
apt-get install -y --no-upgrade sysstat
fi
;;
centos|rhel|fedora)
# Install missing packages without updating
for pkg in "${REQUIRED_PACKAGES[@]}"; do
if ! rpm -q "$pkg" &> /dev/null; then
echo -e "${YELLOW}Installing ${pkg}...${RESET}"
yum install -y "$pkg"
fi
done
# Additional CentOS/RHEL specific packages
if ! rpm -q "sysstat" &> /dev/null; then
yum install -y sysstat
fi
;;
*)
echo -e "${RED}Unsupported operating system for package installation.${RESET}"
exit 1
;;
esac
}
# Function to print section header
print_section_header() {
echo -e "\n${BLUE}===== $1 =====${RESET}"
}
# Function to get current user and system-wide ulimits
get_ulimit_info() {
print_section_header "Current User and System Ulimit Settings"
echo -e "${YELLOW}Current User Ulimit Settings:${RESET}"
ulimit -a | grep "open files"
echo -e "\n${YELLOW}System-wide Limits:${RESET}"
# Different paths for different OSes
case "$OS" in
ubuntu|debian)
grep -E "^\*.*nofile" /etc/security/limits.conf || echo "No system-wide nofile limits found in limits.conf"
;;
centos|rhel|fedora)
grep -E "^\*.*nofile" /etc/security/limits.d/*.conf /etc/security/limits.conf || echo "No system-wide nofile limits found"
;;
*)
grep -E "^\*.*nofile" /etc/security/limits.conf || echo "No system-wide nofile limits found"
;;
esac
}
# Function to analyze file descriptor usage
analyze_fd_usage() {
print_section_header "File Descriptor Usage Analysis"
read used_fds free_fds max_fds < /proc/sys/fs/file-nr
soft_limit=$(ulimit -Sn)
hard_limit=$(ulimit -Hn)
system_usage=$(echo "scale=2; ($used_fds / $max_fds) * 100" | bc)
user_soft_usage=$(echo "scale=2; ($used_fds / $soft_limit) * 100" | bc)
echo -e "${YELLOW}System-wide File Descriptor Analysis:${RESET}"
echo -e "Total File Descriptors in Use: ${used_fds}"
echo -e "Free File Descriptors: ${free_fds}"
echo -e "Maximum Allowed: ${max_fds}"
echo -e "System Usage: ${system_usage}%"
echo -e "\n${YELLOW}User Limit Analysis:${RESET}"
echo -e "Soft Limit: ${soft_limit}"
echo -e "Hard Limit: ${hard_limit}"
echo -e "Current Usage Relative to Soft Limit: ${user_soft_usage}%"
echo -e "\n${YELLOW}Recommendations:${RESET}"
recommended_soft_limit=$(( (used_fds * 2 + 1000) / 1000 * 1000 ))
recommended_hard_limit=$(( recommended_soft_limit * 2 ))
if (( $(echo "$user_soft_usage > 70" | bc -l) )); then
echo -e "${RED}WARNING: High file descriptor usage detected!${RESET}"
echo "Recommended actions:"
# OS-specific configuration recommendations
case "$OS" in
ubuntu|debian)
echo "1. Temporary increase:"
echo " ulimit -Sn ${recommended_soft_limit}"
echo " ulimit -Hn ${recommended_hard_limit}"
echo ""
echo "2. Permanent increase in /etc/security/limits.conf:"
echo " * soft nofile ${recommended_soft_limit}"
echo " * hard nofile ${recommended_hard_limit}"
echo " root soft nofile ${recommended_soft_limit}"
echo " root hard nofile ${recommended_hard_limit}"
;;
centos|rhel|fedora)
echo "1. Temporary increase:"
echo " ulimit -Sn ${recommended_soft_limit}"
echo " ulimit -Hn ${recommended_hard_limit}"
echo ""
echo "2. Permanent increase in /etc/security/limits.d/20-nproc.conf:"
echo " * soft nofile ${recommended_soft_limit}"
echo " * hard nofile ${recommended_hard_limit}"
echo " root soft nofile ${recommended_soft_limit}"
echo " root hard nofile ${recommended_hard_limit}"
;;
*)
echo "1. Temporary increase:"
echo " ulimit -Sn ${recommended_soft_limit}"
echo " ulimit -Hn ${recommended_hard_limit}"
;;
esac
echo ""
echo "3. System tuning in /etc/sysctl.conf:"
echo " fs.file-max = ${recommended_hard_limit}"
fi
print_section_header "Top Processes by File Descriptor Usage"
echo -e "${YELLOW}PID | FD Count | Process Name | Command${RESET}"
echo "-----------------------------------------"
sudo find /proc -maxdepth 1 -regex '/proc/[0-9]+' -printf "%f\n" | \
while read pid; do
if [ -d "/proc/$pid/fd" ]; then
fd_count=$(ls -1 "/proc/$pid/fd" 2>/dev/null | wc -l)
if [ "$fd_count" -gt 50 ]; then
cmd=$(ps -p "$pid" -o comm= 2>/dev/null)
cmdline=$(ps -p "$pid" -o cmd= 2>/dev/null | cut -c1-50)
[ ! -z "$cmd" ] && printf "%-8s | %-8s | %-15s | %s\n" "$pid" "$fd_count" "$cmd" "$cmdline"
fi
fi
done | sort -t'|' -k2 -nr | head -10
}
# Function to analyze socket usage
analyze_socket_usage() {
print_section_header "Socket Connection Analysis"
case "$OS" in
centos|rhel|fedora)
# Use ss for newer versions
echo -e "${YELLOW}Socket Statistics:${RESET}"
ss -s
;;
ubuntu|debian)
# Use ss for newer versions
echo -e "${YELLOW}Socket Statistics:${RESET}"
ss -s
;;
*)
echo -e "${YELLOW}Socket Statistics:${RESET}"
netstat -s
;;
esac
echo -e "\n${YELLOW}Top Processes by Socket Usage:${RESET}"
printf "%-25s %-10s %-15s %-15s\n" "PROCESS NAME" "PID" "TCP SOCKETS" "TOTAL FDs"
echo "--------------------------------------------------------------------------------"
ss -tanp | awk '
$1 == "ESTAB" {
split($NF, pid_info, ",")
pid = pid_info[2]
gsub(/pid=/, "", pid)
tcp_count[pid]++
}
END {
for (pid in tcp_count) {
fd_count = "ls -l /proc/" pid "/fd 2>/dev/null | wc -l"
fd_count | getline fd_total
close(fd_count)
cmd = "ps -p " pid " -o comm= 2>/dev/null"
cmd | getline pname
close(cmd)
printf "%-25s %-10s %-15s %-15s\n", pname, pid, tcp_count[pid], fd_total
}
}
' | sort -k3,3nr | head -10
}
# Main script execution
main() {
# Check if script is run with sudo
if [[ $EUID -ne 0 ]]; then
echo -e "${RED}This script must be run with sudo.${RESET}"
exit 1
fi
# Detect OS
detect_os
# Install dependencies
install_dependencies
# Check for required commands (the same set applies on all supported OSes)
REQUIRED_COMMANDS=("ss" "bc" "awk" "sort" "find" "ps")
for cmd in "${REQUIRED_COMMANDS[@]}"; do
if ! command -v "$cmd" &> /dev/null; then
echo -e "${RED}Required command '$cmd' not found. Please install it.${RESET}"
exit 1
fi
done
echo -e "${GREEN}Comprehensive File Descriptor and Socket Analysis Report${RESET}"
echo -e "${BLUE}=================================================${RESET}"
get_ulimit_info
analyze_fd_usage
analyze_socket_usage
}
main
Implementing File Descriptor Changes: Service Restart Guide
Required Restarts Based on Configuration Level
System-wide Changes (/etc/sysctl.conf or /etc/security/limits.conf)
# After modifying /etc/sysctl.conf
sudo sysctl -p # Reload sysctl settings without reboot
# After modifying limits.conf
sudo systemctl daemon-reload # Reload systemd configuration
Full System Restart Required When:
Making changes to PAM configuration
Modifying kernel parameters that don't support hot-reload
Changing core system security limits
Service-Level Changes (systemd service files)
# After modifying service limits in systemd
sudo systemctl daemon-reload
sudo systemctl restart your-service
Only the modified service needs to restart
Other services remain unaffected
Example systemd service modification:
[Service]
LimitNOFILE=65535
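A safe way to apply this without editing the packaged unit file is a drop-in override (assuming a unit named your-service):
# Create a drop-in override and add the [Service] snippet above
sudo systemctl edit your-service
# Apply it
sudo systemctl daemon-reload
sudo systemctl restart your-service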
Application-Level Changes
# For applications using ulimit in their startup scripts
sudo service application-name restart
# OR
sudo systemctl restart application-name
Restart Requirements by Service Type
Web Servers
Nginx:
sudo nginx -t # Test configuration
sudo systemctl restart nginx
Apache:
sudo apachectl configtest
sudo systemctl restart apache2
Database Servers
MySQL/MariaDB:
sudo systemctl restart mysql
# Verify new limits
mysql -e "SHOW VARIABLES LIKE 'open_files_limit';"
PostgreSQL:
sudo systemctl restart postgresql
# Verify limits (ulimit is a shell builtin, so run it inside a shell)
sudo -u postgres bash -c 'ulimit -n'
MongoDB:
sudo systemctl restart mongod
# Verify in MongoDB logs
tail -f /var/log/mongodb/mongod.log | grep "maxOpenFiles"
Application Servers
Node.js (with PM2):
pm2 reload all # Zero-downtime reload
# OR
pm2 restart all # Full restart
Java Applications:
sudo systemctl restart your-java-service
# Verify the limit the JVM process actually received
grep "open files" /proc/<pid>/limits
Verifying Changes After Restart
System-wide Verification
# Check system limits
cat /proc/sys/fs/file-max
# Check process limits
ps aux | grep process_name
cat /proc/<pid>/limits | grep "open files"
Service-specific Verification
# Generic verification script
check_service_limits() {
    service_name=$1
    pid=$(pgrep -f "$service_name" | head -1) # First match if several
    if [ ! -z "$pid" ]; then
        echo "Service: $service_name (PID: $pid)"
        grep "open files" /proc/$pid/limits
    else
        echo "Service not found"
    fi
}
Minimizing Downtime During Restarts
Load Balancer Strategy
# Example with nginx upstream
upstream backend {
    server backend1.example.com;
    server backend2.example.com;
}

# Rolling restart
for server in backend1 backend2; do
    ssh $server "sudo systemctl restart application"
    sleep 30 # Allow time for health checks
done
Zero-Downtime Techniques
# Using systemd socket activation
[Socket]
ListenStream=80

[Service]
ExecStart=/usr/sbin/your-service
LimitNOFILE=65535
Graceful Restart Commands
Nginx:
nginx -s reload
Apache:
apachectl graceful
Gunicorn:
kill -HUP $pid
Post-Restart Monitoring
Immediate Checks
# Monitor for restart-related issues
watch_service_startup() {
    service=$1
    timeout=300 # 5 minutes
    echo "Monitoring $service startup..."
    start_time=$(date +%s)
    while true; do
        if systemctl is-active "$service" >/dev/null; then
            echo "$service is up"
            check_service_limits "$service"
            break
        fi
        current_time=$(date +%s)
        if [ $((current_time - start_time)) -gt $timeout ]; then
            echo "Timeout waiting for $service"
            break
        fi
        sleep 5
    done
}
Performance Validation
# Quick load test script (requires Apache Bench)
quick_load_test() {
    service_url=$1
    connections=100
    ab -n 1000 -c $connections $service_url
}
Common Restart Issues and Solutions
Service Fails to Start
Check system logs:
journalctl -xe
Verify file permissions
Ensure parent directories have execute permissions
Limits Not Applied
Verify user session limits:
ulimit -n
Check service configuration
Validate systemd unit files
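Two quick checks for the last two items (assuming a unit named your-service):
# Show the FD limit systemd has configured for the unit
systemctl show your-service -p LimitNOFILE
# Check the unit file for syntax problems
systemd-analyze verify /etc/systemd/system/your-service.service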
Performance Degradation
# Monitor performance metrics
while true; do
    date
    ps aux | grep service_name
    netstat -anp | grep service_name
    sleep 10
done
Further Reading
Want to dive deeper? Here are some resources that saved my bacon:
Linux System Programming (O'Reilly) - Chapter on File I/O
The Linux Programming Interface - Chapter 5
man limits.conf (yes, really - it's surprisingly helpful)
Share Your War Stories
We've all got our production horror stories. What's yours? Have you battled the file descriptor beast and lived to tell the tale? Drop a comment below - I'd love to hear how you've solved similar issues in your environment.
Remember: in DevOps, we learn best from each other's mistakes. Well, that and from breaking production - but let's try to do less of that second one.
Update: The monitoring script mentioned in this article is now available on my GitHub. Feel free to fork it, improve it, or use it as a starting point for your own monitoring solution.