
SRE for Small Teams

SRE, DevOps

Site Reliability Engineering was invented at Google. The original SRE book describes teams of dozens of engineers, massive scale, and sophisticated tooling.

Most of us don’t work at Google.

But SRE principles apply at any scale. You don’t need a dedicated SRE team to practice SRE. You need the right mindset and a pragmatic approach to reliability.

Here’s how to do SRE when you’re a small team without Google’s resources.

What SRE Actually Means

SRE is often misunderstood as “ops but fancier” or “DevOps with a different name.” It’s neither.

SRE is a set of principles:

  • Reliability is a feature that requires engineering effort
  • Toil (repetitive manual work) should be automated away
  • Service level objectives define acceptable reliability
  • Error budgets balance reliability with velocity
  • Incidents are learning opportunities, not blame games

These principles work whether you have 500 SREs or zero. The implementation changes with scale; the thinking doesn’t.

Start with SLOs

Service Level Objectives are the foundation of SRE. They answer: “How reliable does this need to be?”

Most teams skip this step. They monitor everything, alert on everything, and drown in noise. Without SLOs, you can’t distinguish between problems that matter and problems that don’t.

For a small team, start simple:

Availability SLO: “The API returns successful responses 99.9% of the time.”

Latency SLO: “95% of requests complete in under 200ms.”

That’s it. Two SLOs. You can add more later, but two is enough to start.

Calculate your error budget: 99.9% availability means 0.1% allowed downtime. That’s about 43 minutes per month. If you’re under budget, you can take risks. If you’re over, focus on reliability.
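The budget arithmetic is worth making explicit. A quick sketch (using the 99.9% / 30-day numbers from above):

```python
# Error budget math for an availability SLO.
# Numbers match the example above: 99.9% over a 30-day window.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(0.999)
print(f"99.9% over 30 days allows {budget:.1f} minutes of downtime")
# 43.2 minutes -- the "about 43 minutes per month" above
```

The same function answers “what would 99.99% cost us?”: `error_budget_minutes(0.9999)` is about 4.3 minutes a month, which is why the stricter target is so much more expensive to hit.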

Monitoring on a Budget

Enterprise monitoring stacks cost a fortune. You don’t need them.

Prometheus + Grafana. Free, open source, battle-tested. Prometheus scrapes metrics, Grafana visualises them. This handles 90% of monitoring needs.

Loki or CloudWatch Logs. Centralised logging. Loki is free and integrates with Grafana. CloudWatch Logs is cheap if you’re on AWS.

Uptime monitoring. Use a free tier of Pingdom, UptimeRobot, or similar. External monitoring catches issues your internal monitoring misses.

PagerDuty or Opsgenie. Worth paying for. On-call alerting needs to be reliable. Free tiers exist for small teams.

Total cost for a small team: $0-100/month. That’s less than one engineer-hour.
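If you want to see how little an external probe involves, here is a minimal sketch using only the standard library; the URL and timeout are placeholders, and a hosted service like UptimeRobot adds the part that matters (probing from outside your network, from multiple regions):

```python
# Minimal external uptime probe -- a stand-in for a hosted checker.
# Run it from cron on a box OUTSIDE your own infrastructure.
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout all count as down
        return False
```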

Alerting That Doesn’t Suck

Most alerting is terrible. Pages for non-issues. Silence during actual outages. Alert fatigue that makes everyone ignore everything.

SRE-style alerting follows rules:

Alert on SLO violations, not metrics. Don’t alert when CPU hits 80%. Alert when your error rate threatens your SLO.

Alerts should be actionable. If there’s nothing to do at 3am, it shouldn’t page. Make it a ticket instead.

Every alert needs a runbook. If you’re paged, you should know what to do. Link the runbook in the alert.

Reduce alerts ruthlessly. Start with few alerts. Add only when an incident would have benefited from earlier detection.

A good on-call shift has zero to two pages. If you’re getting more, your alerting is broken.
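The “alert on SLO violations, not metrics” rule is usually implemented as a burn-rate check: page only when errors are consuming the budget much faster than the SLO allows. A sketch of the decision logic (the multipliers are illustrative; 14.4 is the fast-burn threshold suggested in Google’s SRE Workbook, and a real setup would live in Prometheus alerting rules):

```python
# Sketch of SLO-based alerting: act on budget burn rate, not raw metrics.
SLO = 0.999                # availability target
BUDGET_RATE = 1 - SLO      # allowed error rate: 0.1%
PAGE_BURN = 14.4           # fast burn: wake someone up
TICKET_BURN = 1.0          # slow burn: file a ticket instead

def decide(errors: int, requests: int) -> str:
    """Return 'page', 'ticket', or 'ok' for one measurement window."""
    if requests == 0:
        return "ok"
    burn = (errors / requests) / BUDGET_RATE
    if burn >= PAGE_BURN:
        return "page"
    if burn >= TICKET_BURN:
        return "ticket"
    return "ok"
```

Note how this encodes the “actionable” rule too: a slow burn produces a ticket for working hours, and only a fast burn pages.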

On-Call Without Burning Out

On-call at small companies often means “the CTO’s phone rings.” This doesn’t scale and leads to burnout.

Even with a small team, structure on-call properly:

Rotate weekly. One person is primary for a week. Clear handoffs.

Compensate fairly. On-call is work. Pay extra, give time off in lieu, or both.

Protect off-hours. If someone gets paged at 3am, they shouldn’t be expected to work a full day. Let them recover.

Two-tier escalation. Primary handles first response. Secondary is backup if primary doesn’t respond. This prevents single points of failure.

Runbooks for everything. The on-call engineer shouldn’t need to be the expert on every system. Good documentation makes anyone effective.
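The weekly primary/secondary rotation above doesn’t need scheduling software; a few lines compute it deterministically. A sketch (names and rotation start date are placeholders):

```python
# Weekly primary/secondary on-call rotation for a small team.
from datetime import date

EPOCH = date(2024, 1, 1)  # arbitrary rotation start (a Monday)

def on_call(engineers: list[str], day: date) -> tuple[str, str]:
    """Return (primary, secondary) for a given day, rotating weekly."""
    week = (day - EPOCH).days // 7
    primary = engineers[week % len(engineers)]
    secondary = engineers[(week + 1) % len(engineers)]
    return primary, secondary
```

Because next week’s secondary becomes the following week’s primary, each person gets a warm handoff before taking first response.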

Incident Management Light

Full incident management processes involve incident commanders, scribes, and war rooms. Overkill for a small team.

Lightweight incident management:

Acknowledge quickly. When something breaks, someone owns it immediately. No diffusion of responsibility.

Communicate early. Post in a shared channel. “Investigating elevated error rates on the API.” Stakeholders know you’re on it.

Fix first, investigate later. Get the system working. Root cause analysis happens after recovery.

Brief post-mortem. What happened? Why? What will prevent recurrence? One page, not ten.

Track action items. Post-mortems without follow-through are useless. Assign owners and deadlines.

This whole process can happen in a Slack channel with a shared doc. No special tooling required.

Reducing Toil

Toil is repetitive manual work that could be automated. SRE teams aim to spend less than 50% of time on toil.

Common toil for small teams:

  • Manual deployments
  • Restarting crashed services
  • Provisioning environments
  • Rotating credentials
  • Scaling capacity

Pick the biggest time sink and automate it. Then the next one. Then the next.

Automation doesn’t need to be perfect. A shell script you trigger by hand still beats typing the commands from memory. Iterate toward full automation.
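“Restarting crashed services” from the list above is a good first target. A sketch of the half-automated step: a health check plus an idempotent restart, run from cron until you trust it. The service name and health endpoint are hypothetical:

```python
# Sketch: automate one piece of toil -- restarting a crashed service.
# SERVICE and HEALTH_URL are placeholders for your own setup.
import subprocess
import urllib.request
import urllib.error

SERVICE = "myapp"                             # hypothetical systemd unit
HEALTH_URL = "http://localhost:8080/healthz"  # hypothetical endpoint

def healthy(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """True if the service's health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def restart_if_down() -> None:
    if not healthy():
        # `systemctl restart` is safe to repeat; print so cron mails a log
        subprocess.run(["systemctl", "restart", SERVICE], check=True)
        print(f"restarted {SERVICE}")
```

Wire `restart_if_down()` into a cron entry first; promote it to a proper supervisor policy once it has proven itself.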

Capacity Planning

At Google, capacity planning involves complex models and dedicated teams. For small teams, it’s simpler.

Know your limits. Load test occasionally. Find out where things break.

Monitor utilisation. Track CPU, memory, database connections, whatever constrains you. Set up alerts before you hit limits.

Plan for spikes. If your normal traffic is X, can you handle 3X? 10X? Know the answer.

Scale before you need to. Scaling when you’re already overloaded is stressful. Automate scaling or stay ahead manually.
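The “can you handle 3X?” question reduces to one number once you have load-test results. A sketch with illustrative figures:

```python
# Capacity headroom: measured ceiling (from a load test) vs current peak.
# The request rates below are illustrative assumptions.

def headroom(max_rps_from_load_test: float, peak_rps: float) -> float:
    """How many multiples of current peak traffic the system can absorb."""
    return max_rps_from_load_test / peak_rps

h = headroom(max_rps_from_load_test=1200, peak_rps=150)
print(f"headroom: {h:.1f}x")  # 8.0x -- 3X is fine, 10X is not
```

Track this ratio over time: peak traffic grows while the ceiling stays fixed, so a healthy headroom quietly erodes until you re-test or scale.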

What to Skip

Not everything from the Google SRE book makes sense for small teams.

Skip complex error budget policies. At Google, teams negotiate error budgets with product managers. For you, just track whether you’re meeting SLOs and adjust accordingly.

Skip separate SRE teams. Embed reliability into your engineering culture. Everyone does SRE work as part of building software.

Skip custom tooling. Google built Borgmon and Monarch because nothing else existed. You have Prometheus. Use it.

Skip perfection. 99.99% availability is expensive to achieve. 99.9% is probably fine. Don’t over-engineer reliability.

Building the Culture

SRE is as much culture as technology.

Blameless post-mortems. When things break, focus on systems, not people. “The deployment process allowed this” not “Bob broke production.”

Reliability as a feature. Include reliability work in sprint planning. It’s not separate from product work.

Celebrate improvements. When you automate away toil or improve reliability, recognise it.

Share on-call pain. Everyone should do on-call, including leadership. It creates empathy and motivation to improve.

Getting Started

If you’re starting from zero, here’s a 30-day plan:

Week 1: Define two SLOs. Set up basic availability monitoring.

Week 2: Set up alerting on SLO violations. Create runbooks for common issues.

Week 3: Establish on-call rotation. Even if it’s just two people.

Week 4: Run a mock incident. Practice your response process.

Ongoing: After each incident, do a brief post-mortem. Automate one piece of toil per month.

You don’t need to transform overnight. Small improvements compound. A year from now, you’ll be doing SRE properly without having hired a single SRE.

The Goal

The goal isn’t to replicate Google. It’s to be reliable enough for your users while maintaining development velocity.

SRE gives you a framework for making reliability decisions. How much downtime is acceptable? When do we prioritise new features versus stability? How do we respond when things break?

Answer those questions thoughtfully, and you’re doing SRE. No massive team required.
