
Common DevOps Interview Questions Candidates Fail

Career, DevOps


I’ve interviewed hundreds of DevOps and Platform Engineering candidates. Most can tell me what Kubernetes is. Most can explain CI/CD pipelines. Most have Terraform on their CV.

But ask them why something works the way it does, or what happens when things go wrong, and you quickly separate the senior engineers from those who followed tutorials.

These are the questions that trip people up - not because they’re trick questions, but because they require actual understanding rather than memorisation.

TL;DR

  • Candidates fail questions that probe why, not what
  • Troubleshooting scenarios reveal real experience
  • Understanding trade-offs matters more than knowing the “right” answer
  • Interviewers want to see how you think, not what you’ve memorised
  • The best answers acknowledge complexity and trade-offs

1. “A pod is stuck in Pending. Walk me through how you’d debug it.”

Why candidates fail: They jump straight to kubectl describe pod without explaining their mental model.

What interviewers want: A systematic approach that shows you understand the Kubernetes scheduling process.

Good answer:

# First, check what the scheduler is telling us
kubectl describe pod <pod-name> -n <namespace>

Look at the Events section. Common causes:

  1. Insufficient resources - No node has enough CPU/memory

    kubectl describe nodes | grep -A 5 "Allocated resources"
  2. Node selectors/affinity not matching - Pod requires a label no node has

    kubectl get nodes --show-labels
  3. Taints and tolerations - Nodes are tainted and pod doesn’t tolerate them

    kubectl describe nodes | grep Taints
  4. PVC not bound - Pod needs a volume that doesn’t exist or can’t provision

    kubectl get pvc -n <namespace>
  5. ResourceQuota exceeded - Namespace has hit its limits

    kubectl describe resourcequota -n <namespace>

Red flag answer: “I’d Google it” or immediately suggesting to delete and recreate the pod.
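
The checklist above can also be scripted against `kubectl get pods -o json` output. A minimal Python sketch of the idea (the sample JSON below is fabricated for illustration, and `pending_reasons` is a hypothetical helper, not a kubectl feature):

```python
import json

# Sample data shaped like `kubectl get pods -A -o json` output (fabricated).
kubectl_json = json.dumps({
    "items": [
        {"metadata": {"namespace": "web", "name": "api-7f9c"},
         "status": {"phase": "Pending",
                    "conditions": [{"type": "PodScheduled", "status": "False",
                                    "message": "0/3 nodes are available: 3 Insufficient cpu."}]}},
        {"metadata": {"namespace": "web", "name": "worker-1"},
         "status": {"phase": "Running", "conditions": []}},
    ]
})

def pending_reasons(raw: str) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, scheduler message) for every Pending pod."""
    out = []
    for item in json.loads(raw)["items"]:
        if item["status"].get("phase") != "Pending":
            continue
        # The scheduler records its reason on the PodScheduled condition.
        msg = next((c.get("message", "")
                    for c in item["status"].get("conditions", [])
                    if c.get("type") == "PodScheduled" and c.get("status") == "False"), "")
        out.append((item["metadata"]["namespace"], item["metadata"]["name"], msg))
    return out

for ns, pod, reason in pending_reasons(kubectl_json):
    print(f"{ns}/{pod}: {reason}")
```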


2. “Explain what happens when you type kubectl apply -f deployment.yaml”

Why candidates fail: They describe the user-facing behaviour, not the internal flow.

What interviewers want: Understanding of the Kubernetes control loop and API server architecture.

Good answer:

  1. kubectl parses the YAML and sends a POST/PATCH request to the API server
  2. API server authenticates (certs/tokens), authorises (RBAC), runs admission controllers (mutating then validating)
  3. API server persists the object to etcd
  4. Deployment controller (in controller-manager) notices the new Deployment
  5. Controller creates/updates a ReplicaSet to match the desired state
  6. ReplicaSet controller notices and creates Pod objects
  7. Scheduler sees Pods with no nodeName, scores nodes, assigns the best fit
  8. kubelet on the assigned node sees the Pod, pulls images via container runtime
  9. Container runtime (containerd/CRI-O) creates containers
  10. kubelet reports status back to API server

Bonus points: Mentioning that this is eventually consistent, that each controller only cares about its own resources, and that the whole system is built on watch/reconciliation loops.
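
The controllers in steps 4–6 all follow the same shape: compare desired state to observed state and emit only the actions needed to converge. A toy Python sketch of that reconciliation pattern (illustrative names, not real Kubernetes API calls):

```python
# Toy reconciliation loop: the heart of every Kubernetes controller.
# 'desired_replicas' and 'observed_pods' stand in for what a real
# controller reads from the API server via watches.

def reconcile(desired_replicas: int, observed_pods: list[str]) -> list[str]:
    """Return the actions needed to converge observed state to desired state."""
    actions = []
    if len(observed_pods) < desired_replicas:
        for i in range(desired_replicas - len(observed_pods)):
            actions.append(f"create pod-{len(observed_pods) + i}")
    elif len(observed_pods) > desired_replicas:
        for pod in observed_pods[desired_replicas:]:
            actions.append(f"delete {pod}")
    return actions  # empty list = already converged

print(reconcile(3, ["pod-0"]))           # scale up
print(reconcile(1, ["pod-0", "pod-1"]))  # scale down
print(reconcile(2, ["pod-0", "pod-1"]))  # no-op
```

The key property is idempotence: running the loop again after convergence produces no actions, which is why the system tolerates controllers crashing and restarting at any point.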


3. “Your Terraform plan shows a resource will be destroyed and recreated. How do you prevent downtime?”

Why candidates fail: They say “use lifecycle { prevent_destroy = true }” without understanding when that’s appropriate or what alternatives exist.

What interviewers want: Understanding of Terraform’s lifecycle, state management, and infrastructure change strategies.

Good answer:

First, understand why it’s being recreated. Check which attribute change is forcing replacement:

terraform plan -out=plan.tfplan
terraform show -json plan.tfplan | jq '.resource_changes[] | select(.change.actions | contains(["delete"]))'

Common causes and solutions:

1. Changing an immutable attribute (e.g., AMI, instance type on some resources)

  • Use create_before_destroy lifecycle:
    lifecycle {
      create_before_destroy = true
    }
  • For stateful resources: take a snapshot, create new, migrate, destroy old

2. Resource moved in code (renamed or moved to module)

  • Use terraform state mv to update state without destroying:
    terraform state mv aws_instance.old aws_instance.new

3. Provider upgrade changed resource ID

  • Pin provider versions
  • Use terraform state rm + terraform import to re-adopt

4. Unavoidable replacement (e.g., changing RDS engine)

  • Blue-green deployment: create new, migrate data, switch traffic, destroy old
  • For databases: replicate first, promote replica, update Terraform to point to new

Red flag: Suggesting to manually change infrastructure outside Terraform, or not understanding that prevent_destroy just makes Terraform error - it doesn’t solve the underlying issue.


4. “How does a container differ from a VM at the kernel level?”

Why candidates fail: They recite “containers share the kernel” without understanding what that means.

What interviewers want: Understanding of namespaces, cgroups, and the security implications.

Good answer:

VMs have their own kernel - the hypervisor virtualises hardware, and each VM runs a complete OS.

Containers share the host kernel. Isolation comes from Linux kernel features:

Namespaces - Isolate what processes can see:

  • pid - Process IDs (container sees itself as PID 1)
  • net - Network stack (own interfaces, routing tables)
  • mnt - Filesystem mounts
  • uts - Hostname
  • ipc - Inter-process communication
  • user - UID/GID mapping (rootless containers)
  • cgroup - Cgroup visibility

Cgroups - Limit what processes can use:

  • CPU shares/quota
  • Memory limits
  • I/O bandwidth
  • PIDs (fork bomb protection)

Security implications:

  • Kernel exploits affect all containers (not isolated like VMs)
  • Container escapes are possible if misconfigured
  • Privileged containers have near-host-level access
  • seccomp, AppArmor, SELinux add additional syscall filtering

Why this matters in production:

  • Multi-tenant clusters need strong isolation (consider gVisor, Kata Containers)
  • Resource limits aren’t just nice-to-have - they prevent noisy neighbour problems
  • Running as root in a container is still dangerous
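
You can observe namespaces directly on any Linux host: each entry under /proc/<pid>/ns is a symlink whose target inode identifies the namespace, and two processes share a namespace exactly when those inodes match. A small sketch (Linux-only; prints a fallback message elsewhere):

```python
import os

# Each entry under /proc/self/ns names a namespace this process belongs to.
# The inode in the symlink target (e.g. 'pid:[4026531836]') identifies the
# namespace: two processes are in the same namespace iff the inodes match.
ns_dir = "/proc/self/ns"
if os.path.isdir(ns_dir):  # Linux only
    for ns in sorted(os.listdir(ns_dir)):
        print(ns, "->", os.readlink(os.path.join(ns_dir, ns)))
else:
    print("not on Linux: /proc/self/ns unavailable")
```

Comparing this output inside and outside a container makes the isolation concrete: the container's pid, net, and mnt inodes differ from the host's, while the kernel underneath is the same.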

5. “Your service is returning 504s intermittently. How do you troubleshoot?”

Why candidates fail: They don’t have a systematic approach and jump to random guesses.

What interviewers want: Methodical debugging that narrows down the problem space.

Good answer:

504 Gateway Timeout means an upstream didn’t respond in time. Work backwards from the user:

User → CDN/WAF → Load Balancer → Ingress → Service → Pod → Dependencies

Step 1: Identify the scope

# Is it all requests or specific endpoints?
# Is it all pods or specific ones?
# Does it correlate with traffic patterns?

Step 2: Check the load balancer/ingress logs

# AWS ALB (assuming access logs are shipped to CloudWatch Logs)
aws logs filter-log-events --log-group-name /aws/alb/... \
  --filter-pattern "504"

# Nginx ingress
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller | grep 504

Which upstream timed out? That tells you where to look next.

Step 3: Check pod health and resources

kubectl top pods -n <namespace>
kubectl describe pod <pod> | grep -A 10 "Conditions"

# Are pods being OOMKilled? (shows in the last container state, not always in events)
kubectl get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Step 4: Check the application itself

# Slow queries? Thread pool exhaustion? Connection pool exhaustion?
kubectl exec -it <pod> -- curl -w "@curl-timing.txt" localhost:8080/health

# Check application metrics
# - Request duration percentiles
# - Active connections
# - Thread pool usage

Step 5: Check downstream dependencies

  • Database connections maxed?
  • External API timing out?
  • DNS resolution slow?

Step 6: Check timeout configurations

  • Ingress timeout vs service timeout vs app timeout
  • Mismatched timeouts cause 504s (ingress gives up before app responds)
# Common fix: align timeouts
# Ingress > Service > Application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
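
That alignment rule - each outer layer should wait at least as long as the layer inside it - is easy to sanity-check. A small sketch with hypothetical timeout values:

```python
# Timeout chain from the outermost proxy inward (seconds; values illustrative).
def misaligned(chain):
    """Return (outer, inner) pairs where the outer layer gives up first."""
    return [(outer, inner)
            for (outer, t_out), (inner, t_in) in zip(chain, chain[1:])
            if t_out < t_in]

good = [("ingress", 300), ("service", 300), ("application", 120)]
bad = [("ingress", 60), ("service", 90), ("application", 120)]

print(misaligned(good))  # [] -> no layer times out before its upstream
print(misaligned(bad))   # every outer layer gives up before the app answers
```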

6. “What’s the difference between blue-green and canary deployments? When would you choose each?”

Why candidates fail: They can define both but can’t articulate the trade-offs.

What interviewers want: Understanding of risk, rollback speed, cost, and when each is appropriate.

Good answer:

Blue-Green:

  • Run two identical environments
  • Switch all traffic at once (DNS, load balancer, etc.)
  • Instant rollback (switch back)
  • Requires 2x infrastructure during deployment
  • All-or-nothing: everyone gets the new version simultaneously

Canary:

  • Gradually shift traffic (1% → 5% → 25% → 100%)
  • Observe metrics at each stage
  • Slower rollout, but issues affect fewer users
  • Can catch problems that only appear at scale
  • More complex routing infrastructure needed

When to choose blue-green:

  • Database migrations (need atomic cutover)
  • Breaking API changes
  • Small user base where canary percentages don’t make sense
  • When you need instant, complete rollback
  • Compliance requires testing exact production config

When to choose canary:

  • Large user base (1% is still thousands of users for validation)
  • Changes where impact may not be immediately visible
  • Performance changes you need to measure
  • When you can’t afford 2x infrastructure cost
  • Features you want to A/B test

Bonus: Mention progressive delivery tools (Argo Rollouts, Flagger) and that in practice many teams use both - canary for application code, blue-green for infrastructure.
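
One common way to implement the gradual shift is deterministic hashing, so a given user sees a stable version as the percentage ramps. A sketch of the idea (the hashing scheme here is illustrative, not taken from any particular tool):

```python
import hashlib

def serves_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically bucket a user into 0-99. Stable across requests,
    and monotonic: users in the 5% bucket stay included at 25%."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < canary_percent

users = [f"user-{i}" for i in range(10_000)]
share = sum(serves_canary(u, 5) for u in users) / len(users)
print(f"~{share:.1%} of users on the canary at 5%")

# Ramping 5% -> 25% only adds users; nobody flaps between versions.
assert all(serves_canary(u, 25) for u in users if serves_canary(u, 5))
```

Sticky assignment matters: random per-request routing would bounce users between versions, which breaks sessions and muddies the metrics you're observing at each stage.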


7. “Explain the CAP theorem and how it applies to a system you’ve worked with.”

Why candidates fail: They recite “Consistency, Availability, Partition tolerance - pick two” without understanding what it actually means.

What interviewers want: Practical understanding of distributed systems trade-offs.

Good answer:

CAP theorem: In a distributed system experiencing a network partition, you must choose between consistency (all nodes see the same data) and availability (every request gets a response).

The “pick two” framing is misleading - partitions will happen, so you’re really choosing between CP and AP during failures.

Real example - Cassandra (AP): I’ve run Cassandra clusters. During a network partition:

  • Writes continue to both sides of the partition
  • When partition heals, conflicts are resolved (last-write-wins by timestamp)
  • You might read stale data during the partition
  • Chose this for user activity tracking where availability mattered more than perfect consistency

Real example - etcd/Consul (CP): Kubernetes uses etcd, which is CP:

  • During partition, minority side stops accepting writes (no quorum)
  • Guarantees you never read stale data
  • If etcd loses quorum, cluster is effectively read-only
  • Critical for systems where inconsistency causes real problems (scheduling, leader election)

The nuance: Modern systems let you tune this per-operation:

  • Cassandra: QUORUM reads/writes give you consistency, ONE gives you availability
  • DynamoDB: Eventually consistent reads vs strongly consistent reads
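
The Cassandra tuning point reduces to quorum arithmetic: with N replicas, a read of R nodes and a write of W nodes are guaranteed to overlap in at least one replica (and therefore stay consistent) whenever R + W > N. A quick sketch:

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """Read and write quorums must overlap in at least one replica."""
    return r + w > n

N = 3  # replication factor
print(is_strongly_consistent(N, r=2, w=2))  # QUORUM/QUORUM: consistent
print(is_strongly_consistent(N, r=1, w=1))  # ONE/ONE: fast, may read stale data
print(is_strongly_consistent(N, r=1, w=3))  # ONE reads are safe only if writes hit ALL
```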

8. “Your CI pipeline takes 45 minutes. How do you make it faster?”

Why candidates fail: They suggest parallelisation without analysing where time is actually spent.

What interviewers want: Systematic optimisation approach and understanding of CI/CD architecture.

Good answer:

Step 1: Measure first

# Break down where time goes
# - Checkout/setup: 2 min
# - Install dependencies: 10 min
# - Build: 15 min
# - Tests: 18 min
# - Deploy: 5 min

Step 2: Attack the biggest offenders

Dependency installation (10 min):

# Cache dependencies
- uses: actions/cache@v3
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}

Or better - use a pre-built Docker image with dependencies:

FROM node:20
COPY package*.json ./
RUN npm ci
# Use this as your CI base image

Build time (15 min):

  • Enable build caching (Docker layer cache, Gradle build cache, etc.)
  • Incremental builds where possible
  • Consider remote build caching (Gradle Enterprise, Bazel remote cache)

Tests (18 min):

# Parallelise test suites
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - run: npm test -- --shard=${{ matrix.shard }}/4
  • Run slow tests (integration, e2e) only on main branch
  • Use test impact analysis to run only affected tests

Step 3: Architectural changes

  • Monorepo? Only build/test changed packages
  • Split into smaller services with independent pipelines
  • Move to trunk-based development (smaller, faster PRs)

Step 4: Infrastructure

  • Self-hosted runners with fast storage
  • Larger runner instances for parallelisation
  • Local artifact caching (Artifactory, Nexus)

The 45→15 min pipeline I actually fixed:

  • Cached Docker layers: -8 min
  • Parallel test shards (4x): -12 min
  • Pre-built base image: -5 min
  • Removed unnecessary steps: -5 min
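
Those savings are worth sanity-checking against the baseline; trivially, with the numbers above:

```python
# The four fixes above, as minutes shaved off the 45-minute baseline.
baseline = 45
savings = {"cached Docker layers": 8, "parallel test shards": 12,
           "pre-built base image": 5, "removed unnecessary steps": 5}
final = baseline - sum(savings.values())
print(f"{baseline} min -> {final} min")  # 45 min -> 15 min
```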

9. “What happens when a Kubernetes node goes down?”

Why candidates fail: They say “pods get rescheduled” without explaining the timeline or conditions.

What interviewers want: Understanding of node health detection, pod eviction, and the timing involved.

Good answer:

Timeline:

  1. 0s: Node stops responding (crash, network partition, etc.); its kubelet stops posting heartbeats (default nodeStatusUpdateFrequency: 10s)

  2. ~40s: Node controller has seen no heartbeat for node-monitor-grace-period (default 40s) and marks the node NotReady

  3. ~40s: Taint applied: node.kubernetes.io/not-ready:NoExecute (taint-based eviction replaced the older pod-eviction-timeout mechanism)

  4. ~5m 40s: Pods without a matching toleration are evicted once the default tolerationSeconds for not-ready (300s) expires

  5. After eviction: Controllers (Deployment, StatefulSet) create replacement pods on healthy nodes

Key points:

  • Total time before rescheduling: roughly 6 minutes by default (~40s detection + 300s toleration)
  • StatefulSets wait longer (need confirmation node is truly dead before reassigning PVs)
  • DaemonSets don’t reschedule (they run on every node by design)
  • Pods with tolerations for not-ready might wait longer

How to speed this up:

# Tighter health checks via kube-controller-manager flags (careful - causes flapping)
--node-monitor-period=2s
--node-monitor-grace-period=20s

# Pod-level: tolerate not-ready for less time
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 30

For critical workloads:

  • Pod Disruption Budgets ensure minimum availability
  • Pod anti-affinity spreads across nodes
  • Multiple replicas are essential (single replica = guaranteed downtime)
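
The default knobs above compose into the total delay; a quick sketch of the arithmetic (defaults can vary between Kubernetes versions, so treat the numbers as illustrative):

```python
# Default control-plane knobs (seconds) governing dead-node handling.
node_monitor_grace_period = 40    # node marked NotReady after this much silence
default_toleration_seconds = 300  # not-ready toleration added to every pod

detection = node_monitor_grace_period
eviction = detection + default_toleration_seconds
print(f"node NotReady after ~{detection}s")
print(f"pods evicted after ~{eviction}s (~{eviction / 60:.1f} min)")
print(f"with tolerationSeconds: 30 -> evicted after ~{detection + 30}s")
```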

10. “Describe a production incident you caused and what you learned.”

Why candidates fail: They either can’t think of one (suspicious) or describe it without showing learning.

What interviewers want: Humility, growth mindset, and evidence you create systems to prevent recurrence.

Good answer framework:

  1. What happened: Be specific, own it
  2. Impact: Quantify if possible
  3. How you fixed it: Immediate response
  4. Root cause: What actually caused it
  5. What changed: Systemic improvements

Example:

“I was migrating a database connection string to use Vault secrets. I tested in staging, but staging used a different Vault path structure. In production, the app couldn’t retrieve the secret and failed to start.

Impact: 15 minutes of downtime for a payment service.

Immediate fix: Rolled back the deployment, added the secret manually.

Root cause: Staging/prod Vault paths weren’t consistent, and our CI didn’t validate Vault paths existed before deploying.

What changed:

  • Added pre-deployment check that validates all required secrets exist
  • Documented and enforced consistent Vault path structure
  • Added runbook for secret-related rollbacks
  • Staging Vault now mirrors prod structure

I was embarrassed, but the systemic fixes meant nobody could make the same mistake again.”

Red flags: Blaming others, not having an example, not describing systemic improvements.


Parting Thoughts

The best DevOps engineers I’ve hired weren’t the ones with perfect answers. They were the ones who:

  • Said “I don’t know, but here’s how I’d find out”
  • Asked clarifying questions instead of assuming
  • Acknowledged trade-offs and edge cases
  • Showed genuine curiosity about how things work
  • Had clearly learned from their mistakes

Interviews are a conversation, not a test. If you don’t know something, say so and work through it together. That’s exactly what we’d do on the job.

