Production War Stories: The NGINX Log Rotation That Caused a P1
I was on-call when the alerts started firing. Traffic graphs showed intermittent drops – tiny blips, but consistent. The kind that makes you think “is this real or is the monitoring flaky?” It was real. And it took us down a rabbit hole involving truncate, file descriptors, zombie logs, and a lesson about production parity that I won’t forget.
This happened at a company where I was managing a fleet of NGINX reverse proxies sitting in front of a large microservices estate. We’d just rolled out an AMI upgrade – routine maintenance, nothing fancy. What followed was anything but routine.
The Setup
Our NGINX instances sat behind an internal load balancer, proxying traffic to upstream services via Consul-backed service discovery. Logs were shipped to ELK via filebeat, reading from /var/log/nginx/access.json.
The log rotation process was handled by a custom systemd timer that ran hourly. The original implementation looked like this:
#!/bin/bash
jsonlog="/var/log/nginx/access.json"
if [ -e "$jsonlog" ]; then
    size=$(stat -c%s "$jsonlog")
    echo "INFO: Truncating $jsonlog from $size bytes"
    truncate -s 0 "$jsonlog"
    echo "Done. Size is now $(stat -c%s "$jsonlog") bytes."
fi
Simple, right? Truncate the file to zero bytes, let NGINX keep writing, filebeat keeps tailing. Job done.
Except it wasn’t.
The Incident: Traffic Drops
After the AMI rollout, monitoring showed brief but regular traffic drops – requests timing out, 502s spiking, then recovering. The pattern was suspiciously regular: roughly hourly.
Correlating with systemctl list-timers:
$ systemctl list-timers --all | grep nginx
Thu 2024-01-25 12:00:00 GMT rotate_nginx_access.timer
The log rotation timer. Every time it fired, traffic dropped for 2–3 seconds.
Root Cause 1: truncate Blocks NGINX
Here’s the thing about truncate on a file that’s being actively written to: it’s not atomic. When you truncate a file that NGINX has open, the kernel has to update the file’s metadata and potentially flush buffers. During this window, NGINX can stall.
From the NGINX error logs:
2024/01/25 12:00:01 [error] 825392#825392: *5598 no live upstreams while
connecting to upstream, client: 10.249.160.82, server: api.example.local,
request: "POST /api/v1/resource HTTP/1.1"
“No live upstreams” – NGINX wasn’t dead, but it wasn’t processing requests either. The worker processes were blocked waiting for the truncate to complete.
After engaging F5 support (we had enterprise NGINX Plus), they confirmed: don’t truncate active log files. The recommended approach is:
- Rename the log file (an atomic operation)
- Signal NGINX to reopen its logs (kill -USR1)
- Delete the old file
#!/bin/bash
jsonlog="/var/log/nginx/access.json"
pidfile="/var/run/nginx.pid"
if [ -e "$jsonlog" ]; then
    mv "$jsonlog" "${jsonlog}.old"
    kill -USR1 "$(cat "$pidfile")"
    sleep 1  # Give NGINX time to reopen
    rm -f "${jsonlog}.old"
fi
The USR1 signal tells the NGINX master process to reopen all log files. Workers finish writing to the old file descriptor, then switch to the new file. No blocking, no dropped requests.
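The rename-and-reopen dance is easy to reproduce on any Linux box. A minimal sketch using throwaway temp paths (not our production layout):

```shell
#!/bin/bash
set -euo pipefail
dir=$(mktemp -d)

# A "writer" holds an append-mode fd, as an NGINX worker does
exec 3>>"$dir/access.json"
echo "written before rotation" >&3

# Rename is atomic and leaves the inode untouched, so the writer's
# fd now points at the .old file
mv "$dir/access.json" "$dir/access.json.old"
echo "written after rename" >&3

# "Reopening" (what USR1 triggers in NGINX) creates a brand-new file
echo "written after reopen" > "$dir/access.json"
```

Everything written on fd 3 lands in access.json.old; only the post-reopen write appears in the fresh access.json.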
The Fix… That Created a New Problem
We updated the AMI with the new rotation script, tested in lower environments, and rolled out to production. Traffic drops stopped. Victory.
I went for lunch.
Two hours later, my phone buzzed. PagerDuty. Disk utilisation critical on the NGINX fleet. I abandoned my sandwich and legged it back to my desk.
Disk alerts firing across every NGINX instance.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 50G 47G 3.0G 94% /
The log directory:
$ ls -lh /var/log/nginx/
total 42G
-rw-r--r-- 1 nginx nginx 1.2G Jan 28 14:00 access.json
-rw-r--r-- 1 nginx nginx 8.5G Jan 28 12:00 access.json.old
-rw-r--r-- 1 nginx nginx 8.4G Jan 28 11:00 access.json.old
-rw-r--r-- 1 nginx nginx 8.3G Jan 28 10:00 access.json.old
...
Wait – multiple .old files? The script should be deleting them. What’s going on?
Root Cause 2: Zombie File Descriptors
Here’s where it gets interesting. The rm command was executing, but the disk space wasn’t being freed. Classic Linux file descriptor behaviour:
$ lsof +L1 /var/log/nginx/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
filebeat 1234 root 5r REG 259,1 8589934592 0 1234 /var/log/nginx/access.json.old (deleted)
Filebeat still had an open file handle to the “deleted” file. In Linux, a file isn’t truly removed until all file descriptors pointing to it are closed. The inode persists, the disk space stays allocated, but ls shows nothing because the directory entry is gone.
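You can watch this happen with nothing more than a shell and a temp file (a sketch, not real log paths):

```shell
#!/bin/bash
set -euo pipefail
f=$(mktemp)
echo "still here" > "$f"

exec 3<"$f"       # a reader (think filebeat) keeps the file open
rm "$f"           # unlink only removes the directory entry

# The directory entry is gone, but the inode and its data survive
# for as long as fd 3 stays open; disk space is freed only on close
read -r line <&3
exec 3<&-
```

Until that final close, lsof would show the file as (deleted) and df would still count its blocks.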
The rotation script was:
- Renaming access.json to access.json.old
- Signalling NGINX (which reopened its logs, creating a new access.json)
- Deleting access.json.old
But filebeat was still tailing the old file. The rm unlinked the directory entry, but filebeat held the file descriptor open. The file became a “zombie” – invisible but consuming disk space.
Over time, these zombies accumulated. Each rotation cycle created a new one. Disk filled up.
The Real Fix
The solution required coordinating the rotation with filebeat’s behaviour. Options:
Option 1: Restart filebeat after rotation
mv "$jsonlog" "${jsonlog}.old"
kill -USR1 "$(cat $pidfile)"
sleep 1
systemctl restart filebeat
rm -f "${jsonlog}.old"
Heavy-handed, but guaranteed to release the file descriptor.
Option 2: Use logrotate with copytruncate
/var/log/nginx/*.json {
    hourly
    missingok
    rotate 0
    copytruncate
    notifempty
}
copytruncate copies the log content to a new file, then truncates the original in place. Filebeat keeps its file descriptor on the same inode – no zombies, and no reopen signal needed. But it has a documented race (any lines written between the copy and the truncate are lost), and, more importantly, we established earlier that truncate causes NGINX stalls. This brings us back to square one.
Option 3: The proper fix – signal filebeat to close handles
Filebeat supports input-level close behaviours. In filebeat.yml:
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/access.json
    close_renamed: true
    close_removed: true
    close_inactive: 5m
close_renamed: true tells filebeat to close the file handle when the file is renamed. Combined with our rotation script:
#!/bin/bash
jsonlog="/var/log/nginx/access.json"
pidfile="/var/run/nginx.pid"
if [ -e "$jsonlog" ]; then
    echo "INFO: Rotating $jsonlog"
    # Rename the log file
    mv "$jsonlog" "${jsonlog}.old"
    # Signal NGINX to reopen logs
    kill -USR1 "$(cat "$pidfile")"
    # Wait for filebeat to detect the rename and close its handle
    sleep 5
    # Verify no process is holding the file before deletion
    if ! fuser "${jsonlog}.old" >/dev/null 2>&1; then
        rm -f "${jsonlog}.old"
        echo "INFO: Rotation complete"
    else
        echo "WARN: File still in use, deferring deletion"
    fi
fi
The fuser check is a safety net – if something still has the file open, we leave it for the next rotation cycle rather than creating a zombie.
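fuser is essentially walking /proc. A rough shell equivalent for boxes where it isn't installed (the function name is my own):

```shell
#!/bin/bash
# Print the PIDs holding an open fd on a path -- roughly what fuser does
pids_holding() {
    local target link
    target=$(readlink -f "$1")
    for link in /proc/[0-9]*/fd/*; do
        # readlink fails for other users' processes or closed fds; skip those
        if [ "$(readlink -f "$link" 2>/dev/null)" = "$target" ]; then
            echo "$link" | cut -d/ -f3
        fi
    done | sort -un
}
```

pids_holding /var/log/nginx/access.json.old prints one PID per holder, or nothing when the file is safe to delete.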
The Final Implementation
The production-ready version included:
- Randomised timer to prevent all instances rotating simultaneously:
# /etc/systemd/system/rotate_nginx_access.timer
[Unit]
Description=Rotate Nginx Access Logs

[Timer]
OnCalendar=hourly
RandomizedDelaySec=300
Persistent=true

[Install]
WantedBy=timers.target
- Defensive rotation script with logging:
#!/bin/bash
set -euo pipefail

jsonlog="/var/log/nginx/access.json"
pidfile="/var/run/nginx.pid"

log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $1"; }

if [[ ! -e "$jsonlog" ]]; then
    log "WARN: $jsonlog not found, skipping"
    exit 0
fi

size=$(stat -c%s "$jsonlog")
log "INFO: Rotating $jsonlog ($size bytes)"

mv "$jsonlog" "${jsonlog}.old"
kill -USR1 "$(cat "$pidfile")"

# Wait for filebeat to release its handle
sleep 5

if fuser "${jsonlog}.old" >/dev/null 2>&1; then
    log "WARN: ${jsonlog}.old still in use by:"
    fuser -v "${jsonlog}.old" 2>&1 | while read -r line; do log "  $line"; done
    log "WARN: Deferring deletion to next cycle"
else
    rm -f "${jsonlog}.old"
    log "INFO: Rotation complete, old file removed"
fi
- Monitoring for zombie files:
# CloudWatch custom metric via cron
zombie_size=$(lsof +L1 2>/dev/null | grep nginx | awk '{sum+=$7} END {print sum+0}')
aws cloudwatch put-metric-data \
--namespace "Custom/Nginx" \
--metric-name "ZombieLogBytes" \
--value "$zombie_size" \
--unit Bytes
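For completeness, the timer above needs a matching service unit. A minimal sketch (the ExecStart path is an assumption, not our exact layout):

```ini
# /etc/systemd/system/rotate_nginx_access.service
[Unit]
Description=Rotate Nginx Access Logs

[Service]
Type=oneshot
# Illustrative path to the rotation script
ExecStart=/usr/local/bin/rotate_nginx_access.sh
```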
Lessons Learned
1. truncate on active files is dangerous
It seems safe – same inode, same file descriptor. But the kernel-level operations can block the writing process. For high-throughput services, this translates to dropped requests.
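truncate has a second reproducible quirk worth knowing: a writer that opened the file without O_APPEND keeps its private offset, so its next write re-extends the file with a hole. (NGINX opens access logs in append mode, so this wasn't our stall, but it shows how much behaviour hides behind one innocent-looking command.) A sketch with a temp file:

```shell
#!/bin/bash
set -euo pipefail
f=$(mktemp)

exec 3>"$f"                   # opened WITHOUT O_APPEND: fd keeps a private offset
printf 'x%.0s' {1..100} >&3   # offset is now 100

truncate -s 0 "$f"            # "rotation" empties the file

printf 'yyyyyyyyyy' >&3       # the write still lands at offset 100...
stat -c%s "$f"                # ...so the file is now 110 bytes, mostly a hole
```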
2. Deleted files aren’t gone until all handles close
The (deleted) marker in lsof output is your clue. If disk space isn’t being freed after deletion, something’s holding a handle. lsof +L1 shows all files with link count < 1 (i.e., deleted but still open).
3. Lower environment testing has limits
We tested the rotation script in staging. It worked perfectly. The difference? Staging had 1% of production traffic. The log files were small enough that filebeat processed them before the next rotation. In production, 8GB log files meant filebeat was always behind, always holding handles.
4. Simulate production load, not just production config
For this class of issue, you need:
- Realistic log volume (use log replay or traffic shadowing)
- Multiple rotation cycles (run for days, not hours)
- Disk utilisation monitoring from day one
5. The fix for one problem can create another
We fixed traffic drops by switching from truncate to mv + USR1. This created zombie files. The “obvious” solution (truncate) had a non-obvious failure mode. The “correct” solution (mv) had a different non-obvious failure mode. Defence in depth – fuser checks, monitoring, alerts – catches the failures you didn’t anticipate.
Debugging Commands Reference
For the next time you’re staring at this at 2am:
# Check what's holding a file open
fuser -v /var/log/nginx/access.json
# Find deleted files still consuming space
lsof +L1 | grep nginx
# Check disk usage including deleted files
du -sh /var/log/nginx/ # What you see
df -h /var/log/nginx/ # What the kernel sees (includes zombies)
# Watch log file growth
watch -n 1 'ls -lh /var/log/nginx/access.json'
# Verify NGINX received the USR1 signal
journalctl -u nginx --since "5 minutes ago" | grep reopen
# List systemd timers
systemctl list-timers --all
# Check rotation service logs
journalctl -u rotate_nginx_access.service --since today
Wrapping Up
This incident taught me that log rotation – something that feels like solved infrastructure – can absolutely take down production if you don’t understand the interactions between your components. NGINX, filebeat, the kernel’s file descriptor semantics, systemd timers – they all have opinions about how files should be handled, and those opinions don’t always align.
The “right” answer depends on your stack. But the debugging approach is universal: understand what each component expects, trace the file descriptors, and never trust that “deleted” means “gone”.
Got your own war stories about log rotation, file descriptors, or on-call incidents? I’d love to hear them – find me on LinkedIn.