Last year I got pulled into an ML project as “the Kubernetes guy.” The data science team had trained a fraud detection model. It worked great in their notebooks. Now they needed it in production.
“Should be easy,” they said. “Just deploy it.”
Six weeks later, we had a working system. But those six weeks taught me that ML deployment is a completely different beast. The model was the easy part. Everything around it - the data pipelines, the serving infrastructure, the monitoring - that’s where the real work lives.
If you’re a DevOps engineer being asked to support ML workloads, this is what I wish someone had told me before I started.
Why ML Systems Are Different
Traditional applications are predictable. You deploy code, it behaves the same way every time. Same input, same output. If something breaks, you check the logs, find the error, fix it.
ML systems don’t work like that.
The model is just a mathematical function that learned patterns from data. But data changes. Customer behavior shifts. New fraud patterns emerge. A model that worked brilliantly last month might be making terrible predictions today - and it won’t throw a single error.
This is the fundamental difference: ML systems can “work” while being completely wrong.
Your monitoring won’t catch it unless you’ve specifically built for it. The API returns 200 OK. Latency looks fine. But the model is predicting 0.5 for everything because the input data distribution changed.
The other major difference is dependencies. Traditional apps depend on code and maybe a database. ML systems depend on:
- Training data (the original dataset)
- Feature pipelines (transformations applied to raw data)
- Model artifacts (the serialized model file)
- Inference data (live data coming in)
- External APIs (if you’re enriching features)
Change any of these, and behavior changes. Often in unpredictable ways.
The ML Pipeline - What You’re Actually Operating
Before diving into tools, you need to understand what you’re operating. Here’s the lifecycle:
Data Collection → Feature Engineering → Training → Validation → Deployment → Monitoring
       ↑                                                                        │
       └────────────────────────────────────────────────────────────────────────┘
                                  (Retraining Loop)
Data Collection is where most of the cost lives. Data lakes, streaming pipelines, storage. This is familiar territory for DevOps - just bigger datasets than you’re used to.
Feature Engineering transforms raw data into model inputs. If the raw data is “user clicked product X at time T,” the features might be “number of clicks in last hour” and “average time between clicks.” This often runs on Spark or similar batch processing systems.
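The two example features above are simple enough to sketch in plain Python. The helper functions and their names are mine, not from any feature store; real pipelines would compute these at scale in Spark:

```python
from datetime import datetime, timedelta

def clicks_last_hour(click_times: list[datetime], now: datetime) -> int:
    """Count clicks within the trailing one-hour window."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in click_times if t > cutoff)

def avg_seconds_between_clicks(click_times: list[datetime]) -> float:
    """Average gap between consecutive clicks, in seconds."""
    ordered = sorted(click_times)
    if len(ordered) < 2:
        return 0.0
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)
```

The point is that these transformations are code, with their own versioning and bugs: change the window from one hour to thirty minutes and the model's inputs silently shift.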
Training is the expensive part. GPU clusters, hours or days of compute, massive memory requirements. But it’s also bursty - you train occasionally, not continuously.
Validation is where teams cut corners and pay for it later. Does the model meet quality thresholds? Does it perform equally across different user segments? Is it faster than the model it’s replacing?
Deployment is model serving - getting predictions with low latency at scale.
Monitoring closes the loop. Detect when the model degrades, trigger retraining.
Training Infrastructure
Training jobs need GPUs. Lots of them. Here’s how to set it up on Kubernetes.
First, you need the NVIDIA device plugin. It exposes GPUs as a schedulable resource.
We’re going to create a DaemonSet that runs on all GPU nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
Now training jobs can request GPUs:
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: my-training-image:v1
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ml-secrets
                  key: wandb-key
      nodeSelector:
        gpu-type: a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      restartPolicy: Never
My take on GPU node pools: Create dedicated node pools for GPU workloads with taints. This prevents regular workloads from accidentally scheduling there and blocking expensive GPU capacity. The tolerations in the training job spec allow it to run on tainted nodes.
Spot instances for training are a no-brainer. Training jobs can checkpoint progress and resume after interruption. You’ll save 60-90% on GPU costs. The key is implementing checkpointing properly - save model state every N steps to S3 or GCS, and have your training script resume from the latest checkpoint on startup.
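A minimal sketch of that checkpoint-and-resume pattern, framework-agnostic and using a local directory as a stand-in for S3/GCS (the paths and helper names are mine):

```python
import os
import pickle
from typing import Optional

CKPT_DIR = "/tmp/checkpoints"  # stand-in for an s3:// or gs:// prefix in production

def save_checkpoint(state: dict, step: int) -> str:
    """Persist training state every N steps; cheap insurance against preemption."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    path = os.path.join(CKPT_DIR, f"step-{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    return path

def load_latest_checkpoint() -> Optional[dict]:
    """On startup, resume from the newest checkpoint if one exists."""
    if not os.path.isdir(CKPT_DIR):
        return None
    files = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pkl"))
    if not files:
        return None
    with open(os.path.join(CKPT_DIR, files[-1]), "rb") as f:
        return pickle.load(f)
```

The training script calls `load_latest_checkpoint()` first and starts from step zero only when it returns `None`; that's what makes a spot interruption a delay rather than a restart.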
Model Serving - The Production Bit
Training happens occasionally. Serving happens constantly. This is where latency and reliability matter.
You have a few options for serving. Let me walk you through what I’ve seen work.
Option 1: BYO Flask/FastAPI
The simple approach. Wrap your model in a REST API:
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

@app.post("/predict")
async def predict(features: dict):
    prediction = model.predict([list(features.values())])
    return {"prediction": float(prediction[0])}
This works for simple cases. But you’re reinventing the wheel on:
- Batching (grouping multiple requests for GPU efficiency)
- Model versioning
- Canary deployments
- Auto-scaling
- Health checks
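To see why even the first item is non-trivial, here's a sketch of a micro-batcher: requests queue up until the batch fills or a deadline passes, then one batched predict call serves them all. The class name and defaults are mine; serving frameworks like KServe implement this (and the rest of the list) for you:

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests, run one batched predict call.
    A sketch of what serving frameworks do internally."""

    def __init__(self, batch_predict, max_size=8, max_wait=0.005):
        self.batch_predict = batch_predict  # fn: list of inputs -> list of outputs
        self.max_size = max_size
        self.max_wait = max_wait            # seconds to wait for the batch to fill
        self.pending = []                   # (input, Future) pairs

    async def predict(self, features):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((features, fut))
        if len(self.pending) >= self.max_size:
            self._flush()                   # batch is full: run it now
        elif len(self.pending) == 1:
            loop.call_later(self.max_wait, self._flush)  # deadline for a partial batch
        return await fut

    def _flush(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        predictions = self.batch_predict([x for x, _ in batch])
        for (_, fut), pred in zip(batch, predictions):
            if not fut.done():
                fut.set_result(pred)
```

Roughly thirty lines for a naive version, and it still ignores error handling, GPU memory limits, and per-request timeouts. That's the argument for not building this yourself.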
Option 2: KServe (My Recommendation)
KServe (formerly KFServing) handles all of that out of the box. It’s become the standard for model serving on Kubernetes.
Let’s deploy a scikit-learn model:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 70
    scaleMetric: concurrency
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v2
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 1
          memory: 2Gi
KServe handles:
- Downloading the model from S3
- Creating the serving container
- Auto-scaling based on request concurrency
- Canary deployments (deploy v3 to 10% of traffic)
- A/B testing
- Standardized prediction protocol
For canary deployments, which you’ll want when replacing models:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    canaryTrafficPercent: 10
    minReplicas: 1
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v3
This sends 10% of traffic to v3, keeping 90% on the previous version. Gradually increase if metrics look good.
Experiment Tracking - MLflow
Here’s a lesson I learned the hard way: data scientists will train hundreds of model variants. Without tracking, nobody knows which one is in production or why it was chosen.
MLflow is the standard tool. Let’s set it up on Kubernetes.
First, we need a PostgreSQL database for metadata and S3 for artifacts. Then deploy the tracking server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.10.0
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port=5000
            - --backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow
            - --default-artifact-root=s3://mlflow-artifacts/
          ports:
            - containerPort: 5000
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-key
Data scientists integrate with a few lines of code:
import mlflow

mlflow.set_tracking_uri("http://mlflow-tracking:5000")
mlflow.set_experiment("fraud-detection")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 5)
    # Training happens here...
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.sklearn.log_model(model, "model")
Now every experiment is tracked: what parameters were used, what metrics were achieved, and the model artifact itself. When something goes wrong in production, you can trace back to exactly what was deployed.
Monitoring ML Systems - The Hard Part
Standard application monitoring (latency, error rate, throughput) still applies. But it misses the ML-specific failures.
What to Monitor
Prediction distribution. If your fraud model normally predicts between 0.1 and 0.9, and suddenly everything is 0.5, something’s wrong. Track the mean, standard deviation, and percentiles of predictions.
Feature drift. Input data changing from the training distribution. If the model was trained on users aged 18-65 and suddenly you’re getting users aged 70+, predictions might be unreliable.
Concept drift. The relationship between features and labels changing. Fraud patterns evolve. What indicated fraud last year might be normal behavior now.
Data quality. Missing values, null features, unexpected types. Garbage in, garbage out.
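A minimal input-validation sketch for the data-quality point, assuming a hypothetical feature schema for the fraud model (the feature names and types are illustrative):

```python
# Hypothetical schema: feature name -> expected Python type
EXPECTED_FEATURES = {
    "amount": float,
    "num_clicks_last_hour": int,
    "account_age_days": int,
}

def validate_features(features: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means clean input."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in features:
            problems.append(f"missing: {name}")
        elif features[name] is None:
            problems.append(f"null: {name}")
        elif not isinstance(features[name], expected_type):
            problems.append(f"bad type: {name} ({type(features[name]).__name__})")
    return problems
```

Run this before the model sees the request, and emit the problem count as a metric rather than silently imputing defaults.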
Implementing Drift Detection
Here’s a simple approach using Prometheus. First, instrument your serving code:
from prometheus_client import Histogram, Counter

prediction_histogram = Histogram(
    'model_prediction_value',
    'Distribution of model predictions',
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

feature_missing_counter = Counter(
    'feature_missing_total',
    'Count of missing features',
    ['feature_name']
)

@app.post("/predict")
async def predict(features: dict):
    # Check for missing features
    for expected in ['feature_a', 'feature_b', 'feature_c']:
        if expected not in features:
            feature_missing_counter.labels(feature_name=expected).inc()
    prediction = model.predict([list(features.values())])
    prediction_histogram.observe(prediction[0])
    return {"prediction": float(prediction[0])}
Then alert on drift:
groups:
  - name: ml-monitoring
    rules:
      - alert: PredictionDistributionShift
        expr: |
          abs(
            increase(model_prediction_value_sum[1h]) / increase(model_prediction_value_count[1h])
            -
            increase(model_prediction_value_sum[1h] offset 7d) / increase(model_prediction_value_count[1h] offset 7d)
          ) > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Model prediction distribution has shifted significantly"
      - alert: HighMissingFeatureRate
        expr: rate(feature_missing_total[5m]) > 0.01
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High rate of missing features in model input"
For more sophisticated drift detection, look at Evidently AI or WhyLabs. They provide statistical tests (Kolmogorov-Smirnov, Population Stability Index) and dashboards specifically designed for ML monitoring.
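PSI is simple enough to compute by hand, which makes it a good first drift check before adopting a full tool. A sketch in plain Python (the bin count and the 0.1/0.25 rule-of-thumb thresholds are conventional, not from any one library):

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 drifted."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)  # clamp outliers
            counts[idx] += 1
        # floor at a tiny fraction so the log term is defined for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Feed it the feature values the model was trained on versus a window of live values; the same function works on prediction scores.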
The Retraining Pipeline
Models degrade. You need automated retraining. Here’s how I set it up with Argo Workflows:
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: fraud-model-retrain
spec:
  schedule: "0 2 * * 0"  # Weekly, Sunday 2am
  workflowSpec:
    entrypoint: retrain-pipeline
    templates:
      - name: retrain-pipeline
        dag:
          tasks:
            - name: extract-data
              template: extract-training-data
            - name: train
              dependencies: [extract-data]
              template: train-model
            - name: validate
              dependencies: [train]
              template: validate-model
            - name: deploy
              dependencies: [validate]
              template: deploy-if-better
              when: "{{tasks.validate.outputs.parameters.passed}} == true"
      - name: extract-training-data
        container:
          image: data-pipeline:v1
          command: [python, extract.py]
          args: ["--output", "/data/training.parquet"]
      - name: train-model
        container:
          image: training:v1
          resources:
            limits:
              nvidia.com/gpu: 2
          command: [python, train.py]
          args: ["--data", "/data/training.parquet"]
      - name: validate-model
        container:
          image: validation:v1
          command: [python, validate.py]
        outputs:
          parameters:
            - name: passed
              valueFrom:
                path: /tmp/validation_passed
      - name: deploy-if-better
        container:
          image: deployer:v1
          command: [python, deploy.py]
The key insight: the validation step gates deployment. Never auto-deploy a model that hasn’t been validated against quality thresholds. Compare accuracy, latency, and fairness metrics against the current production model.
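A sketch of what the gate inside validate.py might check; the metric names and thresholds are illustrative, and the real script would write the resulting boolean to /tmp/validation_passed for the workflow to read:

```python
def passes_gates(candidate: dict, production: dict,
                 min_accuracy: float = 0.90,
                 max_latency_ms: float = 50.0) -> bool:
    """Gate deployment: the candidate must clear absolute thresholds
    AND not regress against the current production model."""
    if candidate["accuracy"] < min_accuracy:
        return False                       # fails the absolute quality bar
    if candidate["p99_latency_ms"] > max_latency_ms:
        return False                       # too slow to serve
    if candidate["accuracy"] < production["accuracy"] - 0.005:
        return False                       # meaningfully worse than what's live
    return True
```

Fairness checks fit the same pattern: compute the metric per user segment and fail the gate if any segment falls below threshold.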
Cost Management
ML infrastructure is expensive. Here’s how to keep it under control:
Spot instances for training. I mentioned this, but it bears repeating. Checkpointing + spot = 60-90% savings.
Right-size GPU instances. A100s are overkill for most inference. T4s often work fine at a fraction of the cost. Profile your model’s actual memory and compute requirements.
Scale to zero. KServe can scale to zero replicas when there’s no traffic. You only pay when the model is being used.
Monitor GPU utilization. I’ve seen teams running GPUs at 10% utilization because they’re processing one request at a time. Enable request batching to improve throughput.
Lifecycle policies for model artifacts. Old model versions accumulate in S3. Set up lifecycle rules to archive or delete after 90 days.
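Such a rule can be set up once with boto3. A sketch, where the bucket name and prefix are illustrative:

```python
def model_lifecycle_rule(prefix: str = "models/", expire_days: int = 90) -> dict:
    """Build an S3 lifecycle rule that expires old model artifacts under a prefix."""
    return {
        "ID": f"expire-{prefix.rstrip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": expire_days},
    }

# Applying it (requires AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="mlflow-artifacts",
#     LifecycleConfiguration={"Rules": [model_lifecycle_rule()]},
# )
```

If you need old versions for audit or rollback, swap `Expiration` for a transition to a cheaper storage class instead of deleting outright.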
Getting Started
If you’re new to MLOps, don’t try to adopt everything at once. Here’s the order I’d recommend:
1. Containerize models. Get them out of notebooks and into Docker images with pinned dependencies. This alone solves half the "works on my machine" problems.

2. Set up MLflow. Experiment tracking is low effort, high value. You'll thank yourself when someone asks "what's in production?"

3. Deploy with KServe. Don't build your own serving infrastructure. KServe handles the hard parts.

4. Add Prometheus metrics. Start tracking prediction distributions from day one. You need baseline data before you can detect drift.

5. Automate retraining. Once you have monitoring, add scheduled retraining with validation gates.
ML systems are harder to operate than traditional applications. But the patterns are learnable, the tools are maturing, and frankly, this is where infrastructure is heading. Every company is becoming an ML company, whether they realise it or not.
The DevOps engineers who understand this stack will be in high demand. Start learning it now.