Debugging JVM Thread Exhaustion on EC2: A Contractor War Story
I got called in as a contractor to help a client whose Java application kept dying under load. The staging environment would work fine with a handful of users, but the moment they ran load tests simulating real traffic, the JVM would crash with cryptic errors about threads and memory.
The symptoms were classic resource exhaustion, but there were several root causes, and finding them required digging through JVM settings, Linux system limits, and EC2 instance sizing. This post walks through the debugging process and the fixes that got them to production.
The Symptoms
The application was a REST API running on EC2, serving requests like:
curl -vvk https://api.example.com/rest/getVersionDetails/web
Under light load: fine. Under load testing (simulating ~500 concurrent users): crashes within minutes.
The errors in the logs varied:
java.lang.OutOfMemoryError: unable to create native thread
java.lang.OutOfMemoryError: Java heap space
Cannot allocate memory (errno=12)
The application would sometimes hang completely; other times it would crash, get restarted by systemd, and promptly crash again.
Initial Assessment
First, I SSH’d into the staging server during a load test to see what was happening in real time.
Check System Resources
# Memory usage
free -h
total used free shared buff/cache available
Mem: 983Mi 812Mi 62Mi 0.0Ki 108Mi 74Mi
Swap: 0B 0B 0B
# CPU and load
uptime
14:23:45 up 2 days, 3:42, 1 user, load average: 4.82, 3.21, 1.89
# Process count
ps aux | wc -l
847
The server was a t2.micro with 1GB RAM. It was completely maxed out – 812MB used, only 74MB available, and no swap configured. The load average was nearly 5x the single vCPU.
Check Thread Count
# Threads for the Java process
ps -eo pid,nlwp,cmd | grep java
12847 523 /usr/bin/java -jar /opt/app/api.jar
# System-wide thread count
cat /proc/sys/kernel/threads-max
7732
# Threads per process limit
ulimit -u
unlimited
The Java process had 523 threads running. That’s a lot for a t2.micro.
Check systemd TasksMax
This was a key finding:
systemctl show --property DefaultTasksMax
DefaultTasksMax=18446744073709551615
That absurdly large number meant the system default was essentially unlimited – but the per-service limit might be different:
systemctl show myapp.service --property TasksMax
TasksMax=512
There it was. The systemd service had a TasksMax of 512, and the Java process was trying to push past it. systemd doesn’t kill threads that exceed TasksMax – it refuses to let the service create new ones, which surfaces inside the JVM as java.lang.OutOfMemoryError: unable to create native thread.
The Problems (There Were Several)
Problem 1: TasksMax Limit
systemd’s TasksMax setting limits how many tasks (threads) a service can spawn. The default varies by distribution, but many set it to 512. A busy Java application can easily exceed this.
Problem 2: Undersized Instance
A t2.micro has:
- 1 vCPU (burstable)
- 1GB RAM
- No swap by default
Running a JVM that spawns hundreds of threads on this is asking for trouble. The JVM alone needs memory for:
- Heap (application objects)
- Metaspace (class metadata)
- Thread stacks (1MB default per thread × 500 threads = 500MB of reserved stack space)
- Native memory (JIT compiler, GC, etc.)
- The OS itself
On a 1GB instance, there simply wasn’t enough memory.
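The arithmetic is worth spelling out. A minimal sketch of the budget – the heap, metaspace, and native figures below are plausible assumptions for a default-configured JVM on a 1GB host, not values measured during the incident:

```java
// Back-of-the-envelope memory budget for the failing t2.micro setup.
// Heap, metaspace, and native figures are illustrative assumptions,
// not measurements from the incident.
public class MemoryBudget {
    static long totalMb() {
        long stacksMb = 500 * 1;   // 500 threads x 1MB default stack (-Xss)
        long heapMb = 256;         // JVM default max heap: ~1/4 of physical RAM
        long metaspaceMb = 100;    // assumption: typical mid-size app
        long nativeMb = 150;       // assumption: JIT, GC, code cache
        return stacksMb + heapMb + metaspaceMb + nativeMb;
    }

    public static void main(String[] args) {
        System.out.println("Approximate JVM footprint: " + totalMb() + " MB");
        // ~1006 MB against 1024 MB of physical RAM, before the OS takes its cut
    }
}
```

Even with generous rounding down, the JVM alone was consuming essentially all physical memory before the OS got a byte.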
Problem 3: No JVM Tuning
The application was running with default JVM settings:
ps aux | grep java
# Showed no -Xmx, -Xms, or -Xss flags
The JVM was auto-sizing based on available memory, but its choices weren’t appropriate for this workload.
Problem 4: No Swap Space
When physical RAM runs out, Linux normally uses swap. But EC2 instances don’t have swap by default, so the OOM killer would just terminate processes.
Problem 5: Thread Leak
Looking at thread dumps over time, the thread count kept growing:
# Take thread dumps 30 seconds apart
jstack 12847 > /tmp/threads1.txt
sleep 30
jstack 12847 > /tmp/threads2.txt
# Compare thread counts
grep "java.lang.Thread.State" /tmp/threads1.txt | wc -l
487
grep "java.lang.Thread.State" /tmp/threads2.txt | wc -l
512
Threads were being created but not cleaned up – a classic thread leak, likely from connection pools or async handlers not being properly closed.
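The same growth check can be done in-process with ThreadMXBean, which is handy when you can add a health endpoint but can’t easily shell in during a test. A sketch – the one-second sample interval here is shortened from the 30 seconds used above, purely for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// In-process equivalent of comparing two jstack dumps: sample the live
// thread count twice and flag growth. Sketch only; interval shortened
// for illustration.
public class ThreadGrowthCheck {
    static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static void main(String[] args) throws InterruptedException {
        int before = liveThreads();
        Thread.sleep(1_000);        // use ~30_000 for a meaningful sample
        int after = liveThreads();
        if (after > before) {
            System.out.println("Thread count grew: " + before + " -> " + after);
        }
    }
}
```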
The Fixes
Fix 1: Increase TasksMax
Edit the systemd service file:
sudo systemctl edit myapp.service
Add:
[Service]
TasksMax=4096
Then reload:
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
Verify:
systemctl show myapp.service --property TasksMax
TasksMax=4096
This was the immediate fix that stopped the crashes, but it only masked the underlying problems.
Fix 2: Right-Size the EC2 Instance
I recommended upgrading from t2.micro to at least t3.medium (2 vCPU, 4GB RAM) for staging, and t3.large (2 vCPU, 8GB RAM) for production.
The memory calculation:
| Component | Memory |
|---|---|
| JVM Heap | 2GB |
| Metaspace | 256MB |
| Thread stacks (500 threads × 512KB) | 250MB |
| Native/JIT/GC | ~500MB |
| OS + buffer cache | ~1GB |
| Total | ~4GB minimum |
A t2.micro was never going to work. We moved to t3.medium for staging and t3.large for production.
Fix 3: Tune JVM Settings
I added explicit JVM flags to the startup script:
#!/bin/bash
# /opt/app/start.sh
JAVA_OPTS="-server"
JAVA_OPTS="$JAVA_OPTS -Xms1g -Xmx2g" # Heap: 1GB initial, 2GB max
JAVA_OPTS="$JAVA_OPTS -Xss512k" # Thread stack: 512KB (down from 1MB default)
JAVA_OPTS="$JAVA_OPTS -XX:MaxMetaspaceSize=256m"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC" # G1 garbage collector
JAVA_OPTS="$JAVA_OPTS -XX:MaxGCPauseMillis=200"
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JAVA_OPTS="$JAVA_OPTS -XX:HeapDumpPath=/var/log/app/heapdump.hprof"
exec java $JAVA_OPTS -jar /opt/app/api.jar
Key settings explained:
| Flag | Purpose |
|---|---|
-Xms1g -Xmx2g | Set initial and max heap explicitly instead of letting the JVM guess from host memory. A substantial -Xms avoids repeated heap resizing under load; some teams set -Xms equal to -Xmx to eliminate resizing entirely. |
-Xss512k | Reduce thread stack size from 1MB to 512KB. Saves memory with many threads. |
-XX:MaxMetaspaceSize=256m | Cap metaspace to prevent unbounded growth. |
-XX:+UseG1GC | G1 is better for larger heaps and lower pause times. |
-XX:+HeapDumpOnOutOfMemoryError | Automatically dump heap on OOM for post-mortem analysis. |
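After changing startup flags it’s worth confirming inside the process that they actually took effect (a wrong systemd unit or a stale start script silently ignores them). A minimal check using the standard management beans:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Confirm the startup script's flags took effect: print the effective
// heap ceiling and the JVM arguments in use. Output depends on how the
// JVM was launched.
public class JvmSettingsCheck {
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
        List<String> jvmArgs =
            ManagementFactory.getRuntimeMXBean().getInputArguments();
        jvmArgs.forEach(System.out::println);  // should list -Xmx2g, -Xss512k, etc.
    }
}
```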
Fix 4: Add Swap Space
Even with proper sizing, swap provides a safety net:
# Create 2GB swap file
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Reduce swappiness (prefer RAM, use swap only when necessary)
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Verify:
free -h
total used free shared buff/cache available
Mem: 3.8Gi 1.2Gi 1.9Gi 0.0Ki 712Mi 2.4Gi
Swap: 2.0Gi 0B 2.0Gi
Fix 5: Fix the Thread Leak
This required code changes from the development team. The issues were:
- HTTP connection pool not configured with max connections:
// Before: unbounded pool
CloseableHttpClient client = HttpClients.createDefault();
// After: bounded pool
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);
cm.setDefaultMaxPerRoute(20);
CloseableHttpClient client = HttpClients.custom()
.setConnectionManager(cm)
.build();
- Async handlers not completing:
// Before: CompletableFuture without timeout
CompletableFuture.supplyAsync(() -> fetchData());
// After: with timeout
CompletableFuture.supplyAsync(() -> fetchData())
.orTimeout(30, TimeUnit.SECONDS)
.exceptionally(ex -> {
logger.error("Async operation timed out", ex);
return fallbackValue;
});
- ExecutorService not bounded:
// Before: cached thread pool (unbounded)
ExecutorService executor = Executors.newCachedThreadPool();
// After: fixed thread pool
ExecutorService executor = Executors.newFixedThreadPool(50);
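If you want more control than newFixedThreadPool gives, a ThreadPoolExecutor with a bounded queue and a CallerRunsPolicy adds backpressure: under overload, callers slow down instead of queueing work without limit. A sketch – the pool and queue sizes here are illustrative, not the client’s actual values:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Alternative to newFixedThreadPool: bounded queue plus CallerRunsPolicy
// for backpressure. Sizes are illustrative assumptions.
public class BoundedPool {
    static ThreadPoolExecutor create() {
        AtomicInteger n = new AtomicInteger();
        return new ThreadPoolExecutor(
                10, 50,                       // core and max threads
                60, TimeUnit.SECONDS,         // idle threads above core die off
                new ArrayBlockingQueue<>(200), // bounded work queue
                r -> new Thread(r, "api-worker-" + n.incrementAndGet()),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = create();
        Future<Integer> result = pool.submit(() -> 21 * 2);
        System.out.println(result.get());  // 42
        pool.shutdown();
    }
}
```

Naming the threads ("api-worker-N") also makes the jstack dumps from the diagnosis phase far easier to read.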
Fix 6: Add Monitoring
I set up CloudWatch alarms to catch these issues before they caused outages:
# Install CloudWatch agent
sudo yum install -y amazon-cloudwatch-agent
# Configure to collect memory and process metrics
cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF'
{
"metrics": {
"namespace": "MyApp",
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent", "mem_available"]
},
"processes": {
"measurement": ["running", "blocked", "zombies"]
}
}
}
}
EOF
sudo amazon-cloudwatch-agent-ctl -a start
And a custom metric for JVM thread count:
#!/bin/bash
# /opt/scripts/jvm-metrics.sh
# Run via cron every minute
PID=$(pgrep -f "api.jar")
if [ -n "$PID" ]; then
THREAD_COUNT=$(cat /proc/$PID/status | grep Threads | awk '{print $2}')
aws cloudwatch put-metric-data \
--namespace "MyApp" \
--metric-name "JVMThreadCount" \
--value "$THREAD_COUNT" \
--unit Count
fi
CloudWatch alarm:
aws cloudwatch put-metric-alarm \
--alarm-name "jvm-thread-count-high" \
--metric-name "JVMThreadCount" \
--namespace "MyApp" \
--statistic "Average" \
--period 300 \
--threshold 400 \
--comparison-operator "GreaterThanThreshold" \
--evaluation-periods 2 \
--alarm-actions "arn:aws:sns:eu-west-1:123456789012:alerts"
Verification
After all fixes were applied, I ran the load test again:
# Before fixes
# 500 concurrent users → crash within 5 minutes
# After fixes
# 500 concurrent users → stable for 2 hours
# Memory: 2.1GB / 4GB
# Threads: stable at ~180 (down from 500+)
# Response time: P99 < 200ms
The thread leak fix was the biggest improvement – thread count dropped from 500+ to ~180 and stayed stable.
Debugging Commands Reference
For the next time you’re debugging JVM issues on Linux:
# System memory
free -h
cat /proc/meminfo
# Process memory
ps aux --sort=-%mem | head
pmap -x <pid>
# Thread count for a process
cat /proc/<pid>/status | grep Threads
ps -eo pid,nlwp,cmd | grep java
# System thread limits
cat /proc/sys/kernel/threads-max
ulimit -u
# systemd TasksMax
systemctl show --property DefaultTasksMax
systemctl show <service> --property TasksMax
# JVM thread dump
jstack <pid> > threads.txt
# JVM heap dump
jmap -dump:format=b,file=heap.hprof <pid>
# JVM flags in use
jcmd <pid> VM.flags
# Watch thread count over time
watch -n 1 "cat /proc/<pid>/status | grep Threads"
# Check for OOM killer activity
dmesg | grep -i "killed process"
journalctl -k | grep -i "out of memory"
Lessons Learned
1. t2.micro Is Not for Production JVMs
A JVM with any meaningful workload needs at least 2GB RAM available, preferably 4GB+. t2.micro is for testing and tiny workloads only.
2. Always Set Explicit JVM Heap Sizes
Don’t rely on JVM auto-tuning. Set -Xms and -Xmx explicitly based on your instance size and workload.
3. Reduce Thread Stack Size
The default 1MB per thread is often excessive. -Xss512k or even -Xss256k works for most applications and saves significant memory with many threads.
4. Check systemd TasksMax
This catches many people off guard. A default of 512 tasks is easily exceeded by JVM applications.
5. Always Have Swap
Even if you’ve sized everything correctly, swap provides a buffer against unexpected memory spikes. It’s better to slow down than to crash.
6. Monitor Thread Count
Thread leaks are common in async Java applications. Monitor thread count as a first-class metric alongside CPU and memory.
7. Bound Your Thread Pools
Never use Executors.newCachedThreadPool() in production. Always use bounded pools with explicit maximums.
Summary
The client’s JVM crashes were caused by a combination of:
- systemd TasksMax limit (512 threads)
- Undersized EC2 instance (t2.micro with 1GB RAM)
- No JVM tuning (defaults for heap and thread stack)
- No swap space
- Thread leak in application code
The fixes:
- Increased TasksMax to 4096
- Upgraded to t3.medium (4GB RAM)
- Tuned JVM with explicit heap sizes and reduced thread stack
- Added 2GB swap
- Fixed thread leak in connection pools and async handlers
- Added monitoring for threads and memory
Total time to diagnose and fix: about 2 days. The application has been stable in production for months since.
Debugging JVM performance issues or have questions about EC2 sizing for Java? Find me on LinkedIn.