
Common DevOps Interview Questions Candidates Fail

Career, DevOps


I’ve interviewed hundreds of DevOps and Platform Engineering candidates. Most can tell me what Kubernetes is. Most can explain CI/CD pipelines. Most have Terraform on their CV.

But ask them why something works the way it does, or what happens when things go wrong, and you quickly separate the senior engineers from those who followed tutorials.

These are the questions that trip people up - not because they’re trick questions, but because they require actual understanding rather than memorisation.

TL;DR

  • Candidates fail questions that probe why, not what
  • Troubleshooting scenarios reveal real experience
  • Understanding trade-offs matters more than knowing the “right” answer
  • Interviewers want to see how you think, not what you’ve memorised
  • The best answers acknowledge complexity and trade-offs

1. “A pod is stuck in Pending. Walk me through how you’d debug it.”

Why candidates fail: They jump straight to kubectl describe pod without explaining their mental model.

What interviewers want: A systematic approach that shows you understand the Kubernetes scheduling process.

Good answer:

# First, check what the scheduler is telling us
kubectl describe pod <pod-name> -n <namespace>

Look at the Events section. Common causes:

  1. Insufficient resources - No node has enough CPU/memory

    kubectl describe nodes | grep -A 5 "Allocated resources"
  2. Node selectors/affinity not matching - Pod requires a label no node has

    kubectl get nodes --show-labels
  3. Taints and tolerations - Nodes are tainted and pod doesn’t tolerate them

    kubectl describe nodes | grep Taints
  4. PVC not bound - Pod needs a volume that doesn’t exist or can’t provision

    kubectl get pvc -n <namespace>
  5. ResourceQuota exceeded - Namespace has hit its limits

    kubectl describe resourcequota -n <namespace>

Red flag answer: “I’d Google it” or immediately suggesting to delete and recreate the pod.
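
The checklist above can also be scripted against `kubectl get pods -o json` output. A minimal Python sketch of the idea (the sample JSON below is fabricated for illustration, and `pending_reasons` is a hypothetical helper, not a kubectl feature):

```python
import json

# Sample data shaped like `kubectl get pods -A -o json` output (fabricated).
kubectl_json = json.dumps({
    "items": [
        {"metadata": {"namespace": "web", "name": "api-7f9c"},
         "status": {"phase": "Pending",
                    "conditions": [{"type": "PodScheduled", "status": "False",
                                    "message": "0/3 nodes are available: 3 Insufficient cpu."}]}},
        {"metadata": {"namespace": "web", "name": "worker-1"},
         "status": {"phase": "Running", "conditions": []}},
    ]
})

def pending_reasons(raw: str) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, scheduler message) for every Pending pod."""
    out = []
    for item in json.loads(raw)["items"]:
        if item["status"].get("phase") != "Pending":
            continue
        # The scheduler records its reason on the PodScheduled condition.
        msg = next((c.get("message", "")
                    for c in item["status"].get("conditions", [])
                    if c.get("type") == "PodScheduled" and c.get("status") == "False"), "")
        out.append((item["metadata"]["namespace"], item["metadata"]["name"], msg))
    return out

for ns, pod, reason in pending_reasons(kubectl_json):
    print(f"{ns}/{pod}: {reason}")
```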


2. “Explain what happens when you type kubectl apply -f deployment.yaml”

Why candidates fail: They describe the user-facing behaviour, not the internal flow.

What interviewers want: Understanding of the Kubernetes control loop and API server architecture.

Good answer:

  1. kubectl parses the YAML and sends a POST/PATCH request to the API server
  2. API server authenticates (certs/tokens), authorises (RBAC), runs admission controllers (mutating then validating)
  3. API server persists the object to etcd
  4. Deployment controller (in controller-manager) notices the new Deployment
  5. Controller creates/updates a ReplicaSet to match the desired state
  6. ReplicaSet controller notices and creates Pod objects
  7. Scheduler sees Pods with no nodeName, scores nodes, assigns the best fit
  8. kubelet on the assigned node sees the Pod, pulls images via container runtime
  9. Container runtime (containerd/CRI-O) creates containers
  10. kubelet reports status back to API server

Bonus points: Mentioning that this is eventually consistent, that each controller only cares about its own resources, and that the whole system is built on watch/reconciliation loops.
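
The controllers in steps 4–6 all follow the same shape: compare desired state to observed state and emit only the actions needed to converge. A toy Python sketch of that reconciliation pattern (illustrative names, not real Kubernetes API calls):

```python
# Toy reconciliation loop: the heart of every Kubernetes controller.
# 'desired_replicas' and 'observed_pods' stand in for what a real
# controller reads from the API server via watches.

def reconcile(desired_replicas: int, observed_pods: list[str]) -> list[str]:
    """Return the actions needed to converge observed state to desired state."""
    actions = []
    if len(observed_pods) < desired_replicas:
        for i in range(desired_replicas - len(observed_pods)):
            actions.append(f"create pod-{len(observed_pods) + i}")
    elif len(observed_pods) > desired_replicas:
        for pod in observed_pods[desired_replicas:]:
            actions.append(f"delete {pod}")
    return actions  # empty list = already converged

print(reconcile(3, ["pod-0"]))           # scale up
print(reconcile(1, ["pod-0", "pod-1"]))  # scale down
print(reconcile(2, ["pod-0", "pod-1"]))  # no-op
```

The key property is idempotence: running the loop again after convergence produces no actions, which is why the system tolerates controllers crashing and restarting at any point.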


3. “Your Terraform plan shows a resource will be destroyed and recreated. How do you prevent downtime?”

Why candidates fail: They say “use lifecycle { prevent_destroy = true }” without understanding when that’s appropriate or what alternatives exist.

What interviewers want: Understanding of Terraform’s lifecycle, state management, and infrastructure change strategies.

Good answer:

First, understand why it’s being recreated. Check which attribute change is forcing replacement:

terraform plan -out=plan.tfplan
terraform show -json plan.tfplan | jq '.resource_changes[] | select(.change.actions | contains(["delete"]))'

Common causes and solutions:

1. Changing an immutable attribute (e.g., AMI, instance type on some resources)

  • Use create_before_destroy lifecycle:
    lifecycle {
      create_before_destroy = true
    }
  • For stateful resources: take a snapshot, create new, migrate, destroy old

2. Resource moved in code (renamed or moved to module)

  • Use terraform state mv to update state without destroying:
    terraform state mv aws_instance.old aws_instance.new

3. Provider upgrade changed resource ID

  • Pin provider versions
  • Use terraform state rm + terraform import to re-adopt

4. Unavoidable replacement (e.g., changing RDS engine)

  • Blue-green deployment: create new, migrate data, switch traffic, destroy old
  • For databases: replicate first, promote replica, update Terraform to point to new

Red flag: Suggesting to manually change infrastructure outside Terraform, or not understanding that prevent_destroy just makes Terraform error - it doesn’t solve the underlying issue.


4. “How does a container differ from a VM at the kernel level?”

Why candidates fail: They recite “containers share the kernel” without understanding what that means.

What interviewers want: Understanding of namespaces, cgroups, and the security implications.

Good answer:

VMs have their own kernel - the hypervisor virtualises hardware, and each VM runs a complete OS.

Containers share the host kernel. Isolation comes from Linux kernel features:

Namespaces - Isolate what processes can see:

  • pid - Process IDs (container sees itself as PID 1)
  • net - Network stack (own interfaces, routing tables)
  • mnt - Filesystem mounts
  • uts - Hostname
  • ipc - Inter-process communication
  • user - UID/GID mapping (rootless containers)
  • cgroup - Cgroup visibility

Cgroups - Limit what processes can use:

  • CPU shares/quota
  • Memory limits
  • I/O bandwidth
  • PIDs (fork bomb protection)

Security implications:

  • Kernel exploits affect all containers (not isolated like VMs)
  • Container escapes are possible if misconfigured
  • Privileged containers have near-host-level access
  • seccomp, AppArmor, SELinux add additional syscall filtering

Why this matters in production:

  • Multi-tenant clusters need strong isolation (consider gVisor, Kata Containers)
  • Resource limits aren’t just nice-to-have - they prevent noisy neighbour problems
  • Running as root in a container is still dangerous
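
You can observe namespaces directly on any Linux host: each entry under /proc/<pid>/ns is a symlink whose target inode identifies the namespace, and two processes share a namespace exactly when those inodes match. A small sketch (Linux-only; prints a fallback message elsewhere):

```python
import os

# Each entry under /proc/self/ns names a namespace this process belongs to.
# The inode in the symlink target (e.g. 'pid:[4026531836]') identifies the
# namespace: two processes are in the same namespace iff the inodes match.
ns_dir = "/proc/self/ns"
if os.path.isdir(ns_dir):  # Linux only
    for ns in sorted(os.listdir(ns_dir)):
        print(ns, "->", os.readlink(os.path.join(ns_dir, ns)))
else:
    print("not on Linux: /proc/self/ns unavailable")
```

Comparing this output inside and outside a container makes the isolation concrete: the container's pid, net, and mnt inodes differ from the host's, while the kernel underneath is the same.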

5. “Your service is returning 504s intermittently. How do you troubleshoot?”

Why candidates fail: They don’t have a systematic approach and jump to random guesses.

What interviewers want: Methodical debugging that narrows down the problem space.

Good answer:

504 Gateway Timeout means an upstream didn’t respond in time. Work backwards from the user:

User → CDN/WAF → Load Balancer → Ingress → Service → Pod → Dependencies

Step 1: Identify the scope

# Is it all requests or specific endpoints?
# Is it all pods or specific ones?
# Does it correlate with traffic patterns?

Step 2: Check the load balancer/ingress logs

# AWS ALB (assuming access logs are shipped to CloudWatch Logs)
aws logs filter-log-events --log-group-name /aws/alb/... \
  --filter-pattern "504"

# Nginx ingress
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep 504

Which upstream timed out? That tells you where to look next.

Step 3: Check pod health and resources

kubectl top pods -n <namespace>
kubectl describe pod <pod> | grep -A 10 "Conditions"

# Are pods being OOMKilled? (shows in the last container state, not always in events)
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Step 4: Check the application itself

# Slow queries? Thread pool exhaustion? Connection pool exhaustion?
kubectl exec -it <pod> -- curl -w "@curl-timing.txt" localhost:8080/health

# Check application metrics
# - Request duration percentiles
# - Active connections
# - Thread pool usage

Step 5: Check downstream dependencies

  • Database connections maxed?
  • External API timing out?
  • DNS resolution slow?

Step 6: Check timeout configurations

  • Ingress timeout vs service timeout vs app timeout
  • Mismatched timeouts cause 504s (ingress gives up before app responds)
# Common fix: align timeouts
# Ingress > Service > Application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
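
That alignment rule - each outer layer should wait at least as long as the layer inside it - is easy to sanity-check. A small sketch with hypothetical timeout values:

```python
# Timeout chain from the outermost proxy inward (seconds; values illustrative).
def misaligned(chain):
    """Return (outer, inner) pairs where the outer layer gives up first."""
    return [(outer, inner)
            for (outer, t_out), (inner, t_in) in zip(chain, chain[1:])
            if t_out < t_in]

good = [("ingress", 300), ("service", 300), ("application", 120)]
bad = [("ingress", 60), ("service", 90), ("application", 120)]

print(misaligned(good))  # [] -> no layer times out before its upstream
print(misaligned(bad))   # every outer layer gives up before the app answers
```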

6. “What’s the difference between blue-green and canary deployments? When would you choose each?”

Why candidates fail: They can define both but can’t articulate the trade-offs.

What interviewers want: Understanding of risk, rollback speed, cost, and when each is appropriate.

Good answer:

Blue-Green:

  • Run two identical environments
  • Switch all traffic at once (DNS, load balancer, etc.)
  • Instant rollback (switch back)
  • Requires 2x infrastructure during deployment
  • All-or-nothing: everyone gets the new version simultaneously

Canary:

  • Gradually shift traffic (1% → 5% → 25% → 100%)
  • Observe metrics at each stage
  • Slower rollout, but issues affect fewer users
  • Can catch problems that only appear at scale
  • More complex routing infrastructure needed

When to choose blue-green:

  • Database migrations (need atomic cutover)
  • Breaking API changes
  • Small user base where canary percentages don’t make sense
  • When you need instant, complete rollback
  • Compliance requires testing exact production config

When to choose canary:

  • Large user base (1% is still thousands of users for validation)
  • Changes where impact may not be immediately visible
  • Performance changes you need to measure
  • When you can’t afford 2x infrastructure cost
  • Features you want to A/B test

Bonus: Mention progressive delivery tools (Argo Rollouts, Flagger) and that in practice many teams use both - canary for application code, blue-green for infrastructure.
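
One common way to implement the gradual shift is deterministic hashing, so a given user sees a stable version as the percentage ramps. A sketch of the idea (the hashing scheme here is illustrative, not taken from any particular tool):

```python
import hashlib

def serves_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically bucket a user into 0-99. Stable across requests,
    and monotonic: users in the 5% bucket stay included at 25%."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < canary_percent

users = [f"user-{i}" for i in range(10_000)]
share = sum(serves_canary(u, 5) for u in users) / len(users)
print(f"~{share:.1%} of users on the canary at 5%")

# Ramping 5% -> 25% only adds users; nobody flaps between versions.
assert all(serves_canary(u, 25) for u in users if serves_canary(u, 5))
```

Sticky assignment matters: random per-request routing would bounce users between versions, which breaks sessions and muddies the metrics you're observing at each stage.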


7. “Explain the CAP theorem and how it applies to a system you’ve worked with.”

Why candidates fail: They recite “Consistency, Availability, Partition tolerance - pick two” without understanding what it actually means.

What interviewers want: Practical understanding of distributed systems trade-offs.

Good answer:

CAP theorem: In a distributed system experiencing a network partition, you must choose between consistency (all nodes see the same data) and availability (every request gets a response).

The “pick two” framing is misleading - partitions will happen, so you’re really choosing between CP and AP during failures.

Real example - Cassandra (AP): I’ve run Cassandra clusters. During a network partition:

  • Writes continue to both sides of the partition
  • When partition heals, conflicts are resolved (last-write-wins by timestamp)
  • You might read stale data during the partition
  • Chose this for user activity tracking where availability mattered more than perfect consistency

Real example - etcd/Consul (CP): Kubernetes uses etcd, which is CP:

  • During partition, minority side stops accepting writes (no quorum)
  • Guarantees you never read stale data
  • If etcd loses quorum, cluster is effectively read-only
  • Critical for systems where inconsistency causes real problems (scheduling, leader election)

The nuance: Modern systems let you tune this per-operation:

  • Cassandra: QUORUM reads/writes give you consistency, ONE gives you availability
  • DynamoDB: Eventually consistent reads vs strongly consistent reads
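
The Cassandra tuning point reduces to quorum arithmetic: with N replicas, a read of R nodes and a write of W nodes are guaranteed to overlap in at least one replica (and therefore stay consistent) whenever R + W > N. A quick sketch:

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """Read and write quorums must overlap in at least one replica."""
    return r + w > n

N = 3  # replication factor
print(is_strongly_consistent(N, r=2, w=2))  # QUORUM/QUORUM: consistent
print(is_strongly_consistent(N, r=1, w=1))  # ONE/ONE: fast, may read stale data
print(is_strongly_consistent(N, r=1, w=3))  # ONE reads are safe only if writes hit ALL
```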

8. “Your CI pipeline takes 45 minutes. How do you make it faster?”

Why candidates fail: They suggest parallelisation without analysing where time is actually spent.

What interviewers want: Systematic optimisation approach and understanding of CI/CD architecture.

Good answer:

Step 1: Measure first

# Break down where time goes
# - Checkout/setup: 2 min
# - Install dependencies: 10 min
# - Build: 15 min
# - Tests: 18 min
# - Deploy: 5 min

Step 2: Attack the biggest offenders

Dependency installation (10 min):

# Cache dependencies
- uses: actions/cache@v3
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}

Or better - use a pre-built Docker image with dependencies:

FROM node:20
COPY package*.json ./
RUN npm ci
# Use this as your CI base image

Build time (15 min):

  • Enable build caching (Docker layer cache, Gradle build cache, etc.)
  • Incremental builds where possible
  • Consider remote build caching (Gradle Enterprise, Bazel remote cache)

Tests (18 min):

# Parallelise test suites
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - run: npm test -- --shard=${{ matrix.shard }}/4
  • Run slow tests (integration, e2e) only on main branch
  • Use test impact analysis to run only affected tests

Step 3: Architectural changes

  • Monorepo? Only build/test changed packages
  • Split into smaller services with independent pipelines
  • Move to trunk-based development (smaller, faster PRs)

Step 4: Infrastructure

  • Self-hosted runners with fast storage
  • Larger runner instances for parallelisation
  • Local artifact caching (Artifactory, Nexus)

The 45→15 min pipeline I actually fixed:

  • Cached Docker layers: -8 min
  • Parallel test shards (4x): -12 min
  • Pre-built base image: -5 min
  • Removed unnecessary steps: -5 min
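
Those savings are worth sanity-checking against the baseline; trivially, with the numbers above:

```python
# The four fixes above, as minutes shaved off the 45-minute baseline.
baseline = 45
savings = {"cached Docker layers": 8, "parallel test shards": 12,
           "pre-built base image": 5, "removed unnecessary steps": 5}
final = baseline - sum(savings.values())
print(f"{baseline} min -> {final} min")  # 45 min -> 15 min
```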

9. “What happens when a Kubernetes node goes down?”

Why candidates fail: They say “pods get rescheduled” without explaining the timeline or conditions.

What interviewers want: Understanding of node health detection, pod eviction, and the timing involved.

Good answer:

Timeline:

  1. 0s: Node stops responding (crash, network partition, etc.); its kubelet stops posting heartbeats (default nodeStatusUpdateFrequency: 10s)

  2. ~40s: Node controller has seen no heartbeat for node-monitor-grace-period (default 40s) and marks the node NotReady

  3. ~40s: Taint applied: node.kubernetes.io/not-ready:NoExecute (taint-based eviction replaced the older pod-eviction-timeout mechanism)

  4. ~5m 40s: Pods without a matching toleration are evicted once the default tolerationSeconds for not-ready (300s) expires

  5. After eviction: Controllers (Deployment, StatefulSet) create replacement pods on healthy nodes

Key points:

  • Total time before rescheduling: roughly 6 minutes by default (~40s detection + 300s toleration)
  • StatefulSets wait longer (need confirmation node is truly dead before reassigning PVs)
  • DaemonSets don’t reschedule (they run on every node by design)
  • Pods with tolerations for not-ready might wait longer

How to speed this up:

# Tighter health checks via kube-controller-manager flags (careful - causes flapping)
--node-monitor-period=2s
--node-monitor-grace-period=20s

# Pod-level: tolerate not-ready for less time
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30

For critical workloads:

  • Pod Disruption Budgets ensure minimum availability
  • Pod anti-affinity spreads across nodes
  • Multiple replicas are essential (single replica = guaranteed downtime)
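
The default knobs above compose into the total delay; a quick sketch of the arithmetic (defaults can vary between Kubernetes versions, so treat the numbers as illustrative):

```python
# Default control-plane knobs (seconds) governing dead-node handling.
node_monitor_grace_period = 40    # node marked NotReady after this much silence
default_toleration_seconds = 300  # not-ready toleration added to every pod

detection = node_monitor_grace_period
eviction = detection + default_toleration_seconds
print(f"node NotReady after ~{detection}s")
print(f"pods evicted after ~{eviction}s (~{eviction / 60:.1f} min)")
print(f"with tolerationSeconds: 30 -> evicted after ~{detection + 30}s")
```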

10. “Describe a production incident you caused and what you learned.”

Why candidates fail: They either can’t think of one (suspicious) or describe it without showing learning.

What interviewers want: Humility, growth mindset, and evidence you create systems to prevent recurrence.

Good answer framework:

  1. What happened: Be specific, own it
  2. Impact: Quantify if possible
  3. How you fixed it: Immediate response
  4. Root cause: What actually caused it
  5. What changed: Systemic improvements

Example:

“I was migrating a database connection string to use Vault secrets. I tested in staging, but staging used a different Vault path structure. In production, the app couldn’t retrieve the secret and failed to start.

Impact: 15 minutes of downtime for a payment service.

Immediate fix: Rolled back the deployment, added the secret manually.

Root cause: Staging/prod Vault paths weren’t consistent, and our CI didn’t validate Vault paths existed before deploying.

What changed:

  • Added pre-deployment check that validates all required secrets exist
  • Documented and enforced consistent Vault path structure
  • Added runbook for secret-related rollbacks
  • Staging Vault now mirrors prod structure

I was embarrassed, but the systemic fixes meant nobody could make the same mistake again.”

Red flags: Blaming others, not having an example, not describing systemic improvements.


Parting Thoughts

The best DevOps engineers I’ve hired weren’t the ones with perfect answers. They were the ones who:

  • Said “I don’t know, but here’s how I’d find out”
  • Asked clarifying questions instead of assuming
  • Acknowledged trade-offs and edge cases
  • Showed genuine curiosity about how things work
  • Had clearly learned from their mistakes

Interviews are a conversation, not a test. If you don’t know something, say so and work through it together. That’s exactly what we’d do on the job.

