Production War Stories: The NGINX Log Rotation That Caused a P1
I was on-call when the alerts started firing. Traffic graphs showed intermittent drops – tiny blips, but consistent. The kind that makes you think “is this real or is the monitoring flaky?” It was real. And it took us down a rabbit hole involving truncate, file descriptors, zombie logs, and a lesson about production parity that I won’t forget.
This happened at a company where I was managing a fleet of NGINX reverse proxies sitting in front of a large microservices estate. We’d just rolled out an AMI upgrade – routine maintenance, nothing fancy. What followed was anything but routine.
The Setup
Our NGINX instances sat behind an internal load balancer, proxying traffic to upstream services via Consul-backed service discovery. Logs were shipped to ELK via filebeat, reading from /var/log/nginx/access.json.
The log rotation process was handled by a custom systemd timer that ran hourly. The original implementation looked like this:
#!/bin/bash
jsonlog="/var/log/nginx/access.json"
if [ -e "$jsonlog" ]; then
    size=$(stat -c%s "$jsonlog")
    echo "INFO: Truncating $jsonlog from $size bytes"
    truncate -s 0 "$jsonlog"
    echo "Done. Size is now $(stat -c%s "$jsonlog") bytes."
fi
Simple, right? Truncate the file to zero bytes, let NGINX keep writing, filebeat keeps tailing. Job done.
Except it wasn’t.
The Incident: Traffic Drops
After the AMI rollout, monitoring showed brief but regular traffic drops – requests timing out, 502s spiking, then recovering. The pattern was suspiciously regular: roughly hourly.
Correlating with systemctl list-timers:
$ systemctl list-timers --all | grep nginx
Thu 2024-01-25 12:00:00 GMT rotate_nginx_access.timer
The log rotation timer. Every time it fired, traffic dropped for 2–3 seconds.
Root Cause 1: truncate Blocks NGINX
Here’s the thing about truncate on a file that’s being actively written to: it’s not atomic. When you truncate a file that NGINX has open, the kernel has to update the file’s metadata and potentially flush buffers. During this window, NGINX can stall.
From the NGINX error logs:
2024/01/25 12:00:01 [error] 825392#825392: *5598 no live upstreams while
connecting to upstream, client: 10.249.160.82, server: api.example.local,
request: "POST /api/v1/resource HTTP/1.1"
“No live upstreams” – NGINX wasn’t dead, but it wasn’t processing requests either. The worker processes were blocked waiting for the truncate to complete.
After engaging F5 support (we had enterprise NGINX Plus), they confirmed: don’t truncate active log files. The recommended approach is:
- Rename the log file (an atomic operation)
- Signal NGINX to reopen its logs (kill -USR1)
- Delete the old file
#!/bin/bash
jsonlog="/var/log/nginx/access.json"
pidfile="/var/run/nginx.pid"
if [ -e "$jsonlog" ]; then
    mv "$jsonlog" "${jsonlog}.old"
    kill -USR1 "$(cat "$pidfile")"
    sleep 1  # Give NGINX time to reopen
    rm -f "${jsonlog}.old"
fi
The USR1 signal tells the NGINX master process to reopen all log files. Workers finish writing to the old file descriptor, then switch to the new file. No blocking, no dropped requests.
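The rename-and-reopen dance is easy to reproduce on any Linux box. A minimal sketch using throwaway temp paths (not our production layout):

```shell
#!/bin/bash
set -euo pipefail
dir=$(mktemp -d)

# A "writer" holds an append-mode fd, as an NGINX worker does
exec 3>>"$dir/access.json"
echo "written before rotation" >&3

# Rename is atomic and leaves the inode untouched, so the writer's
# fd now points at the .old file
mv "$dir/access.json" "$dir/access.json.old"
echo "written after rename" >&3

# "Reopening" (what USR1 triggers in NGINX) creates a brand-new file
echo "written after reopen" > "$dir/access.json"
```

Everything written on fd 3 lands in access.json.old; only the post-reopen write appears in the fresh access.json.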
The Fix… That Created a New Problem
We updated the AMI with the new rotation script, tested in lower environments, and rolled out to production. Traffic drops stopped. Victory.
I went for lunch.
Two hours later, my phone buzzed. PagerDuty. Disk utilisation critical on the NGINX fleet. I abandoned my sandwich and legged it back to my desk.
Disk alerts firing across every NGINX instance.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 50G 47G 3.0G 94% /
The log directory:
$ ls -lh /var/log/nginx/
total 42G
-rw-r--r-- 1 nginx nginx 1.2G Jan 28 14:00 access.json
-rw-r--r-- 1 nginx nginx 8.5G Jan 28 12:00 access.json.old
-rw-r--r-- 1 nginx nginx 8.4G Jan 28 11:00 access.json.old
-rw-r--r-- 1 nginx nginx 8.3G Jan 28 10:00 access.json.old
...
Wait – multiple .old files? The script should be deleting them. What’s going on?
Root Cause 2: Zombie File Descriptors
Here’s where it gets interesting. The rm command was executing, but the disk space wasn’t being freed. Classic Linux file descriptor behaviour:
$ lsof +L1 /var/log/nginx/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
filebeat 1234 root 5r REG 259,1 8589934592 0 1234 /var/log/nginx/access.json.old (deleted)
Filebeat still had an open file handle to the “deleted” file. In Linux, a file isn’t truly removed until all file descriptors pointing to it are closed. The inode persists, the disk space stays allocated, but ls shows nothing because the directory entry is gone.
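You can watch this happen with nothing more than a shell and a temp file (a sketch, not real log paths):

```shell
#!/bin/bash
set -euo pipefail
f=$(mktemp)
echo "still here" > "$f"

exec 3<"$f"       # a reader (think filebeat) keeps the file open
rm "$f"           # unlink only removes the directory entry

# The directory entry is gone, but the inode and its data survive
# for as long as fd 3 stays open; disk space is freed only on close
read -r line <&3
exec 3<&-
```

Until that final close, lsof would show the file as (deleted) and df would still count its blocks.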
The rotation script was:
- Renaming access.json to access.json.old
- Signalling NGINX (which reopened its logs, creating a new access.json)
- Deleting access.json.old
But filebeat was still tailing the old file. The rm unlinked the directory entry, but filebeat held the file descriptor open. The file became a “zombie” – invisible but consuming disk space.
Over time, these zombies accumulated. Each rotation cycle created a new one. Disk filled up.
The Real Fix
The solution required coordinating the rotation with filebeat’s behaviour. Options:
Option 1: Restart filebeat after rotation
mv "$jsonlog" "${jsonlog}.old"
kill -USR1 "$(cat $pidfile)"
sleep 1
systemctl restart filebeat
rm -f "${jsonlog}.old"
Heavy-handed, but guaranteed to release the file descriptor.
Option 2: Use logrotate with copytruncate
/var/log/nginx/*.json {
    hourly
    missingok
    rotate 0
    copytruncate
    notifempty
}
copytruncate copies the log content to a new file, then truncates the original in place. Filebeat keeps its file descriptor on the same inode – no zombies, and no reopen signal needed. But it has a documented race (any lines written between the copy and the truncate are lost), and, more importantly, we established earlier that truncate causes NGINX stalls. This brings us back to square one.
Option 3: The proper fix – signal filebeat to close handles
Filebeat supports input-level close behaviours. In filebeat.yml:
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/access.json
    close_renamed: true
    close_removed: true
    close_inactive: 5m
close_renamed: true tells filebeat to close the file handle when the file is renamed. Combined with our rotation script:
#!/bin/bash
jsonlog="/var/log/nginx/access.json"
pidfile="/var/run/nginx.pid"
if [ -e "$jsonlog" ]; then
    echo "INFO: Rotating $jsonlog"
    # Rename the log file
    mv "$jsonlog" "${jsonlog}.old"
    # Signal NGINX to reopen logs
    kill -USR1 "$(cat "$pidfile")"
    # Wait for filebeat to detect the rename and close its handle
    sleep 5
    # Verify no process is holding the file before deletion
    if ! fuser "${jsonlog}.old" >/dev/null 2>&1; then
        rm -f "${jsonlog}.old"
        echo "INFO: Rotation complete"
    else
        echo "WARN: File still in use, deferring deletion"
    fi
fi
The fuser check is a safety net – if something still has the file open, we leave it for the next rotation cycle rather than creating a zombie.
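fuser is essentially walking /proc. A rough shell equivalent for boxes where it isn't installed (the function name is my own):

```shell
#!/bin/bash
# Print the PIDs holding an open fd on a path -- roughly what fuser does
pids_holding() {
    local target link
    target=$(readlink -f "$1")
    for link in /proc/[0-9]*/fd/*; do
        # readlink fails for other users' processes or closed fds; skip those
        if [ "$(readlink -f "$link" 2>/dev/null)" = "$target" ]; then
            echo "$link" | cut -d/ -f3
        fi
    done | sort -un
}
```

pids_holding /var/log/nginx/access.json.old prints one PID per holder, or nothing when the file is safe to delete.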
The Final Implementation
The production-ready version included:
- Randomised timer to prevent all instances rotating simultaneously:
# /etc/systemd/system/rotate_nginx_access.timer
[Unit]
Description=Rotate Nginx Access Logs

[Timer]
OnCalendar=hourly
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target
- Defensive rotation script with logging:
#!/bin/bash
set -euo pipefail

jsonlog="/var/log/nginx/access.json"
pidfile="/var/run/nginx.pid"

log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $1"; }

if [[ ! -e "$jsonlog" ]]; then
    log "WARN: $jsonlog not found, skipping"
    exit 0
fi

size=$(stat -c%s "$jsonlog")
log "INFO: Rotating $jsonlog ($size bytes)"

mv "$jsonlog" "${jsonlog}.old"
kill -USR1 "$(cat "$pidfile")"

# Wait for filebeat to release its handle
sleep 5

if fuser "${jsonlog}.old" >/dev/null 2>&1; then
    log "WARN: ${jsonlog}.old still in use by:"
    fuser -v "${jsonlog}.old" 2>&1 | while read -r line; do log "  $line"; done
    log "WARN: Deferring deletion to next cycle"
else
    rm -f "${jsonlog}.old"
    log "INFO: Rotation complete, old file removed"
fi
- Monitoring for zombie files:
# CloudWatch custom metric via cron
zombie_size=$(lsof +L1 2>/dev/null | grep nginx | awk '{sum+=$7} END {print sum+0}')
aws cloudwatch put-metric-data \
--namespace "Custom/Nginx" \
--metric-name "ZombieLogBytes" \
--value "$zombie_size" \
--unit Bytes
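For completeness, the timer above needs a matching service unit. A minimal sketch (the ExecStart path is an assumption, not our exact layout):

```ini
# /etc/systemd/system/rotate_nginx_access.service
[Unit]
Description=Rotate Nginx Access Logs

[Service]
Type=oneshot
# Illustrative path to the rotation script
ExecStart=/usr/local/bin/rotate_nginx_access.sh
```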
Lessons Learned
1. truncate on active files is dangerous
It seems safe – same inode, same file descriptor. But the kernel-level operations can block the writing process. For high-throughput services, this translates to dropped requests.
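truncate has a second reproducible quirk worth knowing: a writer that opened the file without O_APPEND keeps its private offset, so its next write re-extends the file with a hole. (NGINX opens access logs in append mode, so this wasn't our stall, but it shows how much behaviour hides behind one innocent-looking command.) A sketch with a temp file:

```shell
#!/bin/bash
set -euo pipefail
f=$(mktemp)

exec 3>"$f"                   # opened WITHOUT O_APPEND: fd keeps a private offset
printf 'x%.0s' {1..100} >&3   # offset is now 100

truncate -s 0 "$f"            # "rotation" empties the file

printf 'yyyyyyyyyy' >&3       # the write still lands at offset 100...
stat -c%s "$f"                # ...so the file is now 110 bytes, mostly a hole
```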
2. Deleted files aren’t gone until all handles close
The (deleted) marker in lsof output is your clue. If disk space isn’t being freed after deletion, something’s holding a handle. lsof +L1 shows all files with link count < 1 (i.e., deleted but still open).
3. Lower environment testing has limits
We tested the rotation script in staging. It worked perfectly. The difference? Staging had 1% of production traffic. The log files were small enough that filebeat processed them before the next rotation. In production, 8GB log files meant filebeat was always behind, always holding handles.
4. Simulate production load, not just production config
For this class of issue, you need:
- Realistic log volume (use log replay or traffic shadowing)
- Multiple rotation cycles (run for days, not hours)
- Disk utilisation monitoring from day one
5. The fix for one problem can create another
We fixed traffic drops by switching from truncate to mv + USR1. This created zombie files. The “obvious” solution (truncate) had a non-obvious failure mode. The “correct” solution (mv) had a different non-obvious failure mode. Defence in depth – fuser checks, monitoring, alerts – catches the failures you didn’t anticipate.
Debugging Commands Reference
For the next time you’re staring at this at 2am:
# Check what's holding a file open
fuser -v /var/log/nginx/access.json
# Find deleted files still consuming space
lsof +L1 | grep nginx
# Check disk usage including deleted files
du -sh /var/log/nginx/ # What you see
df -h /var/log/nginx/ # What the kernel sees (includes zombies)
# Watch log file growth
watch -n 1 'ls -lh /var/log/nginx/access.json'
# Verify NGINX received the USR1 signal
journalctl -u nginx --since "5 minutes ago" | grep reopen
# List systemd timers
systemctl list-timers --all
# Check rotation service logs
journalctl -u rotate_nginx_access.service --since today
Wrapping Up
This incident taught me that log rotation – something that feels like solved infrastructure – can absolutely take down production if you don’t understand the interactions between your components. NGINX, filebeat, the kernel’s file descriptor semantics, systemd timers – they all have opinions about how files should be handled, and those opinions don’t always align.
The “right” answer depends on your stack. But the debugging approach is universal: understand what each component expects, trace the file descriptors, and never trust that “deleted” means “gone”.
Got your own war stories about log rotation, file descriptors, or on-call incidents? I’d love to hear them – find me on LinkedIn.