Last year I got pulled into an ML project as “the Kubernetes guy.” The data science team had trained a fraud detection model. It worked great in their notebooks. Now they needed it in production.
“Should be easy,” they said. “Just deploy it.”
Six weeks later, we had a working system. But those six weeks taught me that ML deployment is a completely different beast. The model was the easy part. Everything around it - the data pipelines, the serving infrastructure, the monitoring - that’s where the real work lives.
If you’re a DevOps engineer being asked to support ML workloads, this is what I wish someone had told me before I started.
Why ML Systems Are Different
Traditional applications are predictable. You deploy code, it behaves the same way every time. Same input, same output. If something breaks, you check the logs, find the error, fix it.
ML systems don’t work like that.
The model is just a mathematical function that learned patterns from data. But data changes. Customer behavior shifts. New fraud patterns emerge. A model that worked brilliantly last month might be making terrible predictions today - and it won’t throw a single error.
This is the fundamental difference: ML systems can “work” while being completely wrong.
Your monitoring won’t catch it unless you’ve specifically built for it. The API returns 200 OK. Latency looks fine. But the model is predicting 0.5 for everything because the input data distribution changed.
The other major difference is dependencies. Traditional apps depend on code and maybe a database. ML systems depend on:
- Training data (the original dataset)
- Feature pipelines (transformations applied to raw data)
- Model artifacts (the serialized model file)
- Inference data (live data coming in)
- External APIs (if you’re enriching features)
Change any of these, and behavior changes. Often in unpredictable ways.
The ML Pipeline - What You’re Actually Operating
Before diving into tools, you need to understand what you’re operating. Here’s the lifecycle:
Data Collection → Feature Engineering → Training → Validation → Deployment → Monitoring
       ↑                                                                        │
       └────────────────────────────────────────────────────────────────────────┘
                                  (Retraining Loop)
Data Collection is where most of the cost lives. Data lakes, streaming pipelines, storage. This is familiar territory for DevOps - just bigger datasets than you’re used to.
Feature Engineering transforms raw data into model inputs. If the raw data is “user clicked product X at time T,” the features might be “number of clicks in last hour” and “average time between clicks.” This often runs on Spark or similar batch processing systems.
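The two example features above are simple enough to sketch in plain Python. The helper functions and their names are mine, not from any feature store; real pipelines would compute these at scale in Spark:

```python
from datetime import datetime, timedelta

def clicks_last_hour(click_times: list[datetime], now: datetime) -> int:
    """Count clicks within the trailing one-hour window."""
    cutoff = now - timedelta(hours=1)
    return sum(1 for t in click_times if t > cutoff)

def avg_seconds_between_clicks(click_times: list[datetime]) -> float:
    """Average gap between consecutive clicks, in seconds."""
    ordered = sorted(click_times)
    if len(ordered) < 2:
        return 0.0
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)
```

The point is that these transformations are code, with their own versioning and bugs: change the window from one hour to thirty minutes and the model's inputs silently shift.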
Training is the expensive part. GPU clusters, hours or days of compute, massive memory requirements. But it’s also bursty - you train occasionally, not continuously.
Validation is where teams cut corners and pay for it later. Does the model meet quality thresholds? Does it perform equally across different user segments? Is it faster than the model it’s replacing?
Deployment is model serving - getting predictions with low latency at scale.
Monitoring closes the loop. Detect when the model degrades, trigger retraining.
Training Infrastructure
Training jobs need GPUs. Lots of them. Here’s how to set it up on Kubernetes.
First, you need the NVIDIA device plugin. It exposes GPUs as a schedulable resource.
We’re going to create a DaemonSet that runs on all GPU nodes:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
Now training jobs can request GPUs:
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: my-training-image:v1
          resources:
            limits:
              nvidia.com/gpu: 4
          env:
            - name: WANDB_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ml-secrets
                  key: wandb-key
      nodeSelector:
        gpu-type: a100
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      restartPolicy: Never
My take on GPU node pools: Create dedicated node pools for GPU workloads with taints. This prevents regular workloads from accidentally scheduling there and blocking expensive GPU capacity. The tolerations in the training job spec allow it to run on tainted nodes.
Spot instances for training are a no-brainer. Training jobs can checkpoint progress and resume after interruption. You’ll save 60-90% on GPU costs. The key is implementing checkpointing properly - save model state every N steps to S3 or GCS, and have your training script resume from the latest checkpoint on startup.
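A minimal sketch of that checkpoint-and-resume pattern, framework-agnostic and using a local directory as a stand-in for S3/GCS (the paths and helper names are mine):

```python
import os
import pickle
from typing import Optional

CKPT_DIR = "/tmp/checkpoints"  # stand-in for an s3:// or gs:// prefix in production

def save_checkpoint(state: dict, step: int) -> str:
    """Persist training state every N steps; cheap insurance against preemption."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    path = os.path.join(CKPT_DIR, f"step-{step:08d}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    return path

def load_latest_checkpoint() -> Optional[dict]:
    """On startup, resume from the newest checkpoint if one exists."""
    if not os.path.isdir(CKPT_DIR):
        return None
    files = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".pkl"))
    if not files:
        return None
    with open(os.path.join(CKPT_DIR, files[-1]), "rb") as f:
        return pickle.load(f)
```

The training script calls `load_latest_checkpoint()` first and starts from step zero only when it returns `None`; that's what makes a spot interruption a delay rather than a restart.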
Model Serving - The Production Bit
Training happens occasionally. Serving happens constantly. This is where latency and reliability matter.
You have a few options for serving. Let me walk you through what I’ve seen work.
Option 1: BYO Flask/FastAPI
The simple approach. Wrap your model in a REST API:
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load("model.pkl")

@app.post("/predict")
async def predict(features: dict):
    prediction = model.predict([list(features.values())])
    return {"prediction": float(prediction[0])}
This works for simple cases. But you’re reinventing the wheel on:
- Batching (grouping multiple requests for GPU efficiency)
- Model versioning
- Canary deployments
- Auto-scaling
- Health checks
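To see why even the first item is non-trivial, here's a sketch of a micro-batcher: requests queue up until the batch fills or a deadline passes, then one batched predict call serves them all. The class name and defaults are mine; serving frameworks like KServe implement this (and the rest of the list) for you:

```python
import asyncio

class MicroBatcher:
    """Collect concurrent requests, run one batched predict call.
    A sketch of what serving frameworks do internally."""

    def __init__(self, batch_predict, max_size=8, max_wait=0.005):
        self.batch_predict = batch_predict  # fn: list of inputs -> list of outputs
        self.max_size = max_size
        self.max_wait = max_wait            # seconds to wait for the batch to fill
        self.pending = []                   # (input, Future) pairs

    async def predict(self, features):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self.pending.append((features, fut))
        if len(self.pending) >= self.max_size:
            self._flush()                   # batch is full: run it now
        elif len(self.pending) == 1:
            loop.call_later(self.max_wait, self._flush)  # deadline for a partial batch
        return await fut

    def _flush(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        predictions = self.batch_predict([x for x, _ in batch])
        for (_, fut), pred in zip(batch, predictions):
            if not fut.done():
                fut.set_result(pred)
```

Roughly thirty lines for a naive version, and it still ignores error handling, GPU memory limits, and per-request timeouts. That's the argument for not building this yourself.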
Option 2: KServe (My Recommendation)
KServe (formerly KFServing) handles all of that out of the box. It’s become the standard for model serving on Kubernetes.
Let’s deploy a scikit-learn model:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 70
    scaleMetric: concurrency
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v2
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 1
          memory: 2Gi
KServe handles:
- Downloading the model from S3
- Creating the serving container
- Auto-scaling based on request concurrency
- Canary deployments (deploy v3 to 10% of traffic)
- A/B testing
- Standardized prediction protocol
For canary deployments, which you’ll want when replacing models:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    canaryTrafficPercent: 10
    minReplicas: 1
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v3
This sends 10% of traffic to v3, keeping 90% on the previous version. Gradually increase if metrics look good.
Experiment Tracking - MLflow
Here’s a lesson I learned the hard way: data scientists will train hundreds of model variants. Without tracking, nobody knows which one is in production or why it was chosen.
MLflow is the standard tool. Let’s set it up on Kubernetes.
First, we need a PostgreSQL database for metadata and S3 for artifacts. Then deploy the tracking server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.10.0
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port=5000
            - --backend-store-uri=postgresql://mlflow:password@postgres:5432/mlflow
            - --default-artifact-root=s3://mlflow-artifacts/
          ports:
            - containerPort: 5000
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-key
Data scientists integrate with a few lines of code:
import mlflow

mlflow.set_tracking_uri("http://mlflow-tracking:5000")
mlflow.set_experiment("fraud-detection")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 5)
    # Training happens here...
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)
    mlflow.sklearn.log_model(model, "model")
Now every experiment is tracked: what parameters were used, what metrics were achieved, and the model artifact itself. When something goes wrong in production, you can trace back to exactly what was deployed.
Monitoring ML Systems - The Hard Part
Standard application monitoring (latency, error rate, throughput) still applies. But it misses the ML-specific failures.
What to Monitor
Prediction distribution. If your fraud model normally predicts between 0.1 and 0.9, and suddenly everything is 0.5, something’s wrong. Track the mean, standard deviation, and percentiles of predictions.
Feature drift. Input data changing from the training distribution. If the model was trained on users aged 18-65 and suddenly you’re getting users aged 70+, predictions might be unreliable.
Concept drift. The relationship between features and labels changing. Fraud patterns evolve. What indicated fraud last year might be normal behavior now.
Data quality. Missing values, null features, unexpected types. Garbage in, garbage out.
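A minimal input-validation sketch for the data-quality point, assuming a hypothetical feature schema for the fraud model (the feature names and types are illustrative):

```python
# Hypothetical schema: feature name -> expected Python type
EXPECTED_FEATURES = {
    "amount": float,
    "num_clicks_last_hour": int,
    "account_age_days": int,
}

def validate_features(features: dict) -> list[str]:
    """Return a list of data-quality problems; an empty list means clean input."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in features:
            problems.append(f"missing: {name}")
        elif features[name] is None:
            problems.append(f"null: {name}")
        elif not isinstance(features[name], expected_type):
            problems.append(f"bad type: {name} ({type(features[name]).__name__})")
    return problems
```

Run this before the model sees the request, and emit the problem count as a metric rather than silently imputing defaults.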
Implementing Drift Detection
Here’s a simple approach using Prometheus. First, instrument your serving code:
from prometheus_client import Histogram, Counter

prediction_histogram = Histogram(
    'model_prediction_value',
    'Distribution of model predictions',
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

feature_missing_counter = Counter(
    'feature_missing_total',
    'Count of missing features',
    ['feature_name']
)

@app.post("/predict")
async def predict(features: dict):
    # Check for missing features
    for expected in ['feature_a', 'feature_b', 'feature_c']:
        if expected not in features:
            feature_missing_counter.labels(feature_name=expected).inc()
    prediction = model.predict([list(features.values())])
    prediction_histogram.observe(prediction[0])
    return {"prediction": float(prediction[0])}
Then alert on drift:
groups:
  - name: ml-monitoring
    rules:
      - alert: PredictionDistributionShift
        expr: |
          abs(
            increase(model_prediction_value_sum[1h]) / increase(model_prediction_value_count[1h])
            -
            increase(model_prediction_value_sum[1h] offset 7d) / increase(model_prediction_value_count[1h] offset 7d)
          ) > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Model prediction distribution has shifted significantly"
      - alert: HighMissingFeatureRate
        expr: rate(feature_missing_total[5m]) > 0.01
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High rate of missing features in model input"
For more sophisticated drift detection, look at Evidently AI or WhyLabs. They provide statistical tests (Kolmogorov-Smirnov, Population Stability Index) and dashboards specifically designed for ML monitoring.
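PSI is simple enough to compute by hand, which makes it a good first drift check before adopting a full tool. A sketch in plain Python (the bin count and the 0.1/0.25 rule-of-thumb thresholds are conventional, not from any one library):

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a live sample.
    Rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 drifted."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)  # clamp outliers
            counts[idx] += 1
        # floor at a tiny fraction so the log term is defined for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Feed it the feature values the model was trained on versus a window of live values; the same function works on prediction scores.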
The Retraining Pipeline
Models degrade. You need automated retraining. Here’s how I set it up with Argo Workflows:
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: fraud-model-retrain
spec:
  schedule: "0 2 * * 0"  # Weekly, Sunday 2am
  workflowSpec:
    entrypoint: retrain-pipeline
    templates:
      - name: retrain-pipeline
        dag:
          tasks:
            - name: extract-data
              template: extract-training-data
            - name: train
              dependencies: [extract-data]
              template: train-model
            - name: validate
              dependencies: [train]
              template: validate-model
            - name: deploy
              dependencies: [validate]
              template: deploy-if-better
              when: "{{tasks.validate.outputs.parameters.passed}} == true"
      - name: extract-training-data
        container:
          image: data-pipeline:v1
          command: [python, extract.py]
          args: ["--output", "/data/training.parquet"]
      - name: train-model
        container:
          image: training:v1
          resources:
            limits:
              nvidia.com/gpu: 2
          command: [python, train.py]
          args: ["--data", "/data/training.parquet"]
      - name: validate-model
        container:
          image: validation:v1
          command: [python, validate.py]
        outputs:
          parameters:
            - name: passed
              valueFrom:
                path: /tmp/validation_passed
      - name: deploy-if-better
        container:
          image: deployer:v1
          command: [python, deploy.py]
The key insight: the validation step gates deployment. Never auto-deploy a model that hasn’t been validated against quality thresholds. Compare accuracy, latency, and fairness metrics against the current production model.
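A sketch of what the gate inside validate.py might check; the metric names and thresholds are illustrative, and the real script would write the resulting boolean to /tmp/validation_passed for the workflow to read:

```python
def passes_gates(candidate: dict, production: dict,
                 min_accuracy: float = 0.90,
                 max_latency_ms: float = 50.0) -> bool:
    """Gate deployment: the candidate must clear absolute thresholds
    AND not regress against the current production model."""
    if candidate["accuracy"] < min_accuracy:
        return False                       # fails the absolute quality bar
    if candidate["p99_latency_ms"] > max_latency_ms:
        return False                       # too slow to serve
    if candidate["accuracy"] < production["accuracy"] - 0.005:
        return False                       # meaningfully worse than what's live
    return True
```

Fairness checks fit the same pattern: compute the metric per user segment and fail the gate if any segment falls below threshold.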
Cost Management
ML infrastructure is expensive. Here’s how to keep it under control:
Spot instances for training. I mentioned this, but it bears repeating. Checkpointing + spot = 60-90% savings.
Right-size GPU instances. A100s are overkill for most inference. T4s often work fine at a fraction of the cost. Profile your model’s actual memory and compute requirements.
Scale to zero. KServe can scale to zero replicas when there’s no traffic. You only pay when the model is being used.
Monitor GPU utilization. I’ve seen teams running GPUs at 10% utilization because they’re processing one request at a time. Enable request batching to improve throughput.
Lifecycle policies for model artifacts. Old model versions accumulate in S3. Set up lifecycle rules to archive or delete after 90 days.
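Such a rule can be set up once with boto3. A sketch, where the bucket name and prefix are illustrative:

```python
def model_lifecycle_rule(prefix: str = "models/", expire_days: int = 90) -> dict:
    """Build an S3 lifecycle rule that expires old model artifacts under a prefix."""
    return {
        "ID": f"expire-{prefix.rstrip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": expire_days},
    }

# Applying it (requires AWS credentials; bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="mlflow-artifacts",
#     LifecycleConfiguration={"Rules": [model_lifecycle_rule()]},
# )
```

If you need old versions for audit or rollback, swap `Expiration` for a transition to a cheaper storage class instead of deleting outright.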
Getting Started
If you’re new to MLOps, don’t try to adopt everything at once. Here’s the order I’d recommend:
1. Containerize models. Get them out of notebooks and into Docker images with pinned dependencies. This alone solves half the "works on my machine" problems.

2. Set up MLflow. Experiment tracking is low effort, high value. You'll thank yourself when someone asks "what's in production?"

3. Deploy with KServe. Don't build your own serving infrastructure. KServe handles the hard parts.

4. Add Prometheus metrics. Start tracking prediction distributions from day one. You need baseline data before you can detect drift.

5. Automate retraining. Once you have monitoring, add scheduled retraining with validation gates.
ML systems are harder to operate than traditional applications. But the patterns are learnable, the tools are maturing, and frankly, this is where infrastructure is heading. Every company is becoming an ML company, whether they realise it or not.
The DevOps engineers who understand this stack will be in high demand. Start learning it now.