Debugging JVM Thread Exhaustion on EC2: A Contractor War Story
I got called in as a contractor to help a client whose Java application kept dying under load. The staging environment would work fine with a handful of users, but the moment they ran load tests simulating real traffic, the JVM would crash with cryptic errors about threads and memory.
The symptoms were classic resource exhaustion, but there were several root causes, and finding them required digging through JVM settings, Linux system limits, and EC2 instance sizing. This post walks through the debugging process and the fixes that got them to production.
The Symptoms
The application was a REST API running on EC2, serving requests like:
curl -vvk https://api.example.com/rest/getVersionDetails/web
Under light load: fine. Under load testing (simulating ~500 concurrent users): crashes within minutes.
The errors in the logs varied:
java.lang.OutOfMemoryError: unable to create native thread
java.lang.OutOfMemoryError: Java heap space
Cannot allocate memory (errno=12)
The application would sometimes hang completely; other times it would crash, get restarted by systemd, and promptly crash again.
Initial Assessment
First, I SSH’d into the staging server during a load test to see what was happening in real time.
Check System Resources
# Memory usage
free -h
total used free shared buff/cache available
Mem: 983Mi 812Mi 62Mi 0.0Ki 108Mi 74Mi
Swap: 0B 0B 0B
# CPU and load
uptime
14:23:45 up 2 days, 3:42, 1 user, load average: 4.82, 3.21, 1.89
# Process count
ps aux | wc -l
847
The server was a t2.micro with 1GB RAM. It was completely maxed out – 812MB used, only 74MB available, and no swap configured. The load average was nearly 5x the single vCPU.
Check Thread Count
# Threads for the Java process
ps -eo pid,nlwp,cmd | grep java
12847 523 /usr/bin/java -jar /opt/app/api.jar
# System-wide thread count
cat /proc/sys/kernel/threads-max
7732
# Threads per process limit
ulimit -u
unlimited
The Java process had 523 threads running. That’s a lot for a t2.micro.
Check systemd TasksMax
This was a key finding:
systemctl show --property DefaultTasksMax
DefaultTasksMax=18446744073709551615
That absurdly large number meant the system default was essentially unlimited – but the per-service limit might be different:
systemctl show myapp.service --property TasksMax
TasksMax=512
There it was. The systemd service had a TasksMax of 512, and the Java process was trying to push past it. systemd doesn’t kill threads that exceed TasksMax – it refuses to let the service create new ones, which surfaces inside the JVM as java.lang.OutOfMemoryError: unable to create native thread.
The Problems (There Were Several)
Problem 1: TasksMax Limit
systemd’s TasksMax setting limits how many tasks (threads) a service can spawn. The default varies by distribution, but many set it to 512. A busy Java application can easily exceed this.
Problem 2: Undersized Instance
A t2.micro has:
- 1 vCPU (burstable)
- 1GB RAM
- No swap by default
Running a JVM that spawns hundreds of threads on this is asking for trouble. The JVM alone needs memory for:
- Heap (application objects)
- Metaspace (class metadata)
- Thread stacks (1MB default per thread × 500 threads = 500MB of reserved stack space)
- Native memory (JIT compiler, GC, etc.)
- The OS itself
On a 1GB instance, there simply wasn’t enough memory.
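The arithmetic is worth spelling out. A minimal sketch of the budget – the heap, metaspace, and native figures below are plausible assumptions for a default-configured JVM on a 1GB host, not values measured during the incident:

```java
// Back-of-the-envelope memory budget for the failing t2.micro setup.
// Heap, metaspace, and native figures are illustrative assumptions,
// not measurements from the incident.
public class MemoryBudget {
    static long totalMb() {
        long stacksMb = 500 * 1;   // 500 threads x 1MB default stack (-Xss)
        long heapMb = 256;         // JVM default max heap: ~1/4 of physical RAM
        long metaspaceMb = 100;    // assumption: typical mid-size app
        long nativeMb = 150;       // assumption: JIT, GC, code cache
        return stacksMb + heapMb + metaspaceMb + nativeMb;
    }

    public static void main(String[] args) {
        System.out.println("Approximate JVM footprint: " + totalMb() + " MB");
        // ~1006 MB against 1024 MB of physical RAM, before the OS takes its cut
    }
}
```

Even with generous rounding down, the JVM alone was consuming essentially all physical memory before the OS got a byte.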
Problem 3: No JVM Tuning
The application was running with default JVM settings:
ps aux | grep java
# Showed no -Xmx, -Xms, or -Xss flags
The JVM was auto-sizing based on available memory, but its choices weren’t appropriate for this workload.
Problem 4: No Swap Space
When physical RAM runs out, Linux normally uses swap. But EC2 instances don’t have swap by default, so the OOM killer would just terminate processes.
Problem 5: Thread Leak
Looking at thread dumps over time, the thread count kept growing:
# Take thread dumps 30 seconds apart
jstack 12847 > /tmp/threads1.txt
sleep 30
jstack 12847 > /tmp/threads2.txt
# Compare thread counts
grep "java.lang.Thread.State" /tmp/threads1.txt | wc -l
487
grep "java.lang.Thread.State" /tmp/threads2.txt | wc -l
512
Threads were being created but not cleaned up – a classic thread leak, likely from connection pools or async handlers not being properly closed.
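The same growth check can be done in-process with ThreadMXBean, which is handy when you can add a health endpoint but can’t easily shell in during a test. A sketch – the one-second sample interval here is shortened from the 30 seconds used above, purely for illustration:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// In-process equivalent of comparing two jstack dumps: sample the live
// thread count twice and flag growth. Sketch only; interval shortened
// for illustration.
public class ThreadGrowthCheck {
    static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static void main(String[] args) throws InterruptedException {
        int before = liveThreads();
        Thread.sleep(1_000);        // use ~30_000 for a meaningful sample
        int after = liveThreads();
        if (after > before) {
            System.out.println("Thread count grew: " + before + " -> " + after);
        }
    }
}
```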
The Fixes
Fix 1: Increase TasksMax
Edit the systemd service file:
sudo systemctl edit myapp.service
Add:
[Service]
TasksMax=4096
Then reload:
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
Verify:
systemctl show myapp.service --property TasksMax
TasksMax=4096
This was the immediate fix that stopped the crashes, but it only masked the underlying problems.
Fix 2: Right-Size the EC2 Instance
I recommended upgrading from t2.micro to at least t3.medium (2 vCPU, 4GB RAM) for staging, and t3.large (2 vCPU, 8GB RAM) for production.
The memory calculation:
| Component | Memory |
|---|---|
| JVM Heap | 2GB |
| Metaspace | 256MB |
| Thread stacks (500 threads × 512KB) | 250MB |
| Native/JIT/GC | ~500MB |
| OS + buffer cache | ~1GB |
| Total | ~4GB minimum |
A t2.micro was never going to work. We moved to t3.medium for staging and t3.large for production.
Fix 3: Tune JVM Settings
I added explicit JVM flags to the startup script:
#!/bin/bash
# /opt/app/start.sh
JAVA_OPTS="-server"
JAVA_OPTS="$JAVA_OPTS -Xms1g -Xmx2g" # Heap: 1GB initial, 2GB max
JAVA_OPTS="$JAVA_OPTS -Xss512k" # Thread stack: 512KB (down from 1MB default)
JAVA_OPTS="$JAVA_OPTS -XX:MaxMetaspaceSize=256m"
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC" # G1 garbage collector
JAVA_OPTS="$JAVA_OPTS -XX:MaxGCPauseMillis=200"
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JAVA_OPTS="$JAVA_OPTS -XX:HeapDumpPath=/var/log/app/heapdump.hprof"
exec java $JAVA_OPTS -jar /opt/app/api.jar
Key settings explained:
| Flag | Purpose |
|---|---|
-Xms1g -Xmx2g | Set initial and max heap explicitly instead of letting the JVM guess from host memory. A substantial -Xms avoids repeated heap resizing under load; some teams set -Xms equal to -Xmx to eliminate resizing entirely. |
-Xss512k | Reduce thread stack size from 1MB to 512KB. Saves memory with many threads. |
-XX:MaxMetaspaceSize=256m | Cap metaspace to prevent unbounded growth. |
-XX:+UseG1GC | G1 is better for larger heaps and lower pause times. |
-XX:+HeapDumpOnOutOfMemoryError | Automatically dump heap on OOM for post-mortem analysis. |
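After changing startup flags it’s worth confirming inside the process that they actually took effect (a wrong systemd unit or a stale start script silently ignores them). A minimal check using the standard management beans:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Confirm the startup script's flags took effect: print the effective
// heap ceiling and the JVM arguments in use. Output depends on how the
// JVM was launched.
public class JvmSettingsCheck {
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMb() + " MB");
        List<String> jvmArgs =
            ManagementFactory.getRuntimeMXBean().getInputArguments();
        jvmArgs.forEach(System.out::println);  // should list -Xmx2g, -Xss512k, etc.
    }
}
```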
Fix 4: Add Swap Space
Even with proper sizing, swap provides a safety net:
# Create 2GB swap file
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# Reduce swappiness (prefer RAM, use swap only when necessary)
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Verify:
free -h
total used free shared buff/cache available
Mem: 3.8Gi 1.2Gi 1.9Gi 0.0Ki 712Mi 2.4Gi
Swap: 2.0Gi 0B 2.0Gi
Fix 5: Fix the Thread Leak
This required code changes from the development team. The issues were:
- HTTP connection pool not configured with max connections:
// Before: unbounded pool
CloseableHttpClient client = HttpClients.createDefault();
// After: bounded pool
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);
cm.setDefaultMaxPerRoute(20);
CloseableHttpClient client = HttpClients.custom()
.setConnectionManager(cm)
.build();
- Async handlers not completing:
// Before: CompletableFuture without timeout
CompletableFuture.supplyAsync(() -> fetchData());
// After: with timeout
CompletableFuture.supplyAsync(() -> fetchData())
.orTimeout(30, TimeUnit.SECONDS)
.exceptionally(ex -> {
logger.error("Async operation timed out", ex);
return fallbackValue;
});
- ExecutorService not bounded:
// Before: cached thread pool (unbounded)
ExecutorService executor = Executors.newCachedThreadPool();
// After: fixed thread pool
ExecutorService executor = Executors.newFixedThreadPool(50);
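If you want more control than newFixedThreadPool gives, a ThreadPoolExecutor with a bounded queue and a CallerRunsPolicy adds backpressure: under overload, callers slow down instead of queueing work without limit. A sketch – the pool and queue sizes here are illustrative, not the client’s actual values:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Alternative to newFixedThreadPool: bounded queue plus CallerRunsPolicy
// for backpressure. Sizes are illustrative assumptions.
public class BoundedPool {
    static ThreadPoolExecutor create() {
        AtomicInteger n = new AtomicInteger();
        return new ThreadPoolExecutor(
                10, 50,                       // core and max threads
                60, TimeUnit.SECONDS,         // idle threads above core die off
                new ArrayBlockingQueue<>(200), // bounded work queue
                r -> new Thread(r, "api-worker-" + n.incrementAndGet()),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = create();
        Future<Integer> result = pool.submit(() -> 21 * 2);
        System.out.println(result.get());  // 42
        pool.shutdown();
    }
}
```

Naming the threads ("api-worker-N") also makes the jstack dumps from the diagnosis phase far easier to read.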
Fix 6: Add Monitoring
I set up CloudWatch alarms to catch these issues before they caused outages:
# Install CloudWatch agent
sudo yum install -y amazon-cloudwatch-agent
# Configure to collect memory and process metrics
cat > /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json << 'EOF'
{
"metrics": {
"namespace": "MyApp",
"metrics_collected": {
"mem": {
"measurement": ["mem_used_percent", "mem_available"]
},
"processes": {
"measurement": ["running", "blocked", "zombies"]
}
}
}
}
EOF
sudo amazon-cloudwatch-agent-ctl -a start
And a custom metric for JVM thread count:
#!/bin/bash
# /opt/scripts/jvm-metrics.sh
# Run via cron every minute
PID=$(pgrep -f "api.jar")
if [ -n "$PID" ]; then
THREAD_COUNT=$(cat /proc/$PID/status | grep Threads | awk '{print $2}')
aws cloudwatch put-metric-data \
--namespace "MyApp" \
--metric-name "JVMThreadCount" \
--value "$THREAD_COUNT" \
--unit Count
fi
CloudWatch alarm:
aws cloudwatch put-metric-alarm \
--alarm-name "jvm-thread-count-high" \
--metric-name "JVMThreadCount" \
--namespace "MyApp" \
--statistic "Average" \
--period 300 \
--threshold 400 \
--comparison-operator "GreaterThanThreshold" \
--evaluation-periods 2 \
--alarm-actions "arn:aws:sns:eu-west-1:123456789012:alerts"
Verification
After all fixes were applied, I ran the load test again:
# Before fixes
# 500 concurrent users → crash within 5 minutes
# After fixes
# 500 concurrent users → stable for 2 hours
# Memory: 2.1GB / 4GB
# Threads: stable at ~180 (down from 500+)
# Response time: P99 < 200ms
The thread leak fix was the biggest improvement – thread count dropped from 500+ to ~180 and stayed stable.
Debugging Commands Reference
For the next time you’re debugging JVM issues on Linux:
# System memory
free -h
cat /proc/meminfo
# Process memory
ps aux --sort=-%mem | head
pmap -x <pid>
# Thread count for a process
cat /proc/<pid>/status | grep Threads
ps -eo pid,nlwp,cmd | grep java
# System thread limits
cat /proc/sys/kernel/threads-max
ulimit -u
# systemd TasksMax
systemctl show --property DefaultTasksMax
systemctl show <service> --property TasksMax
# JVM thread dump
jstack <pid> > threads.txt
# JVM heap dump
jmap -dump:format=b,file=heap.hprof <pid>
# JVM flags in use
jcmd <pid> VM.flags
# Watch thread count over time
watch -n 1 "cat /proc/<pid>/status | grep Threads"
# Check for OOM killer activity
dmesg | grep -i "killed process"
journalctl -k | grep -i "out of memory"
Lessons Learned
1. t2.micro Is Not for Production JVMs
A JVM with any meaningful workload needs at least 2GB RAM available, preferably 4GB+. t2.micro is for testing and tiny workloads only.
2. Always Set Explicit JVM Heap Sizes
Don’t rely on JVM auto-tuning. Set -Xms and -Xmx explicitly based on your instance size and workload.
3. Reduce Thread Stack Size
The default 1MB per thread is often excessive. -Xss512k or even -Xss256k works for most applications and saves significant memory with many threads.
4. Check systemd TasksMax
This catches many people off guard. A default of 512 tasks is easily exceeded by JVM applications.
5. Always Have Swap
Even if you’ve sized everything correctly, swap provides a buffer against unexpected memory spikes. It’s better to slow down than to crash.
6. Monitor Thread Count
Thread leaks are common in async Java applications. Monitor thread count as a first-class metric alongside CPU and memory.
7. Bound Your Thread Pools
Never use Executors.newCachedThreadPool() in production. Always use bounded pools with explicit maximums.
Summary
The client’s JVM crashes were caused by a combination of:
- systemd TasksMax limit (512 threads)
- Undersized EC2 instance (t2.micro with 1GB RAM)
- No JVM tuning (defaults for heap and thread stack)
- No swap space
- Thread leak in application code
The fixes:
- Increased TasksMax to 4096
- Upgraded to t3.medium (4GB RAM)
- Tuned JVM with explicit heap sizes and reduced thread stack
- Added 2GB swap
- Fixed thread leak in connection pools and async handlers
- Added monitoring for threads and memory
Total time to diagnose and fix: about 2 days. The application has been stable in production for months since.
Debugging JVM performance issues or have questions about EC2 sizing for Java? Find me on LinkedIn.