Monitoring · Jan 2025 · 6 min read

Setting up SLI/SLO tracking in Grafana that your team will actually check

Most teams set up dashboards, feel good about it… and then never open them again. I’ve seen this happen multiple times.

The problem is not the tools; Grafana is powerful. The problem is how we design what we track.

In this article, I’ll share a simple approach I use to build SLI/SLO dashboards that teams actually rely on during real incidents.

What are SLI and SLO (quickly)

Without overcomplicating it:

SLI (Service Level Indicator) → What you measure (e.g. success rate, latency)
SLO (Service Level Objective) → Your target (e.g. 99.9% uptime)

Example: if you commit to your API succeeding 99.9% of the time, that target is your SLO. The success rate you actually measure is the SLI.

Most teams track everything.

CPU usage, memory, disk, network, random graphs… but nothing answers:

“Is the system actually working for users right now?”

That’s the only question your dashboard should answer first.

I focus on just 3 things:

That’s enough to understand system health in seconds.

Let’s say you run a backend API.

SLI: Successful requests / total requests  
SLO: 99.9% success rate over 30 days
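To make the arithmetic concrete, here is a minimal Python sketch of that SLI/SLO check. It assumes you already have request counts aggregated over the 30-day window from whatever data source backs your Grafana panels. The names `SloReport`, `success_rate_sli`, and `evaluate_slo` are hypothetical, and the request counts are made-up illustration values.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float        # measured success ratio, between 0.0 and 1.0
    target: float     # SLO target, e.g. 0.999 for 99.9%
    meets_slo: bool   # True when the measured SLI is at or above the target


def success_rate_sli(successful: int, total: int) -> float:
    """SLI: successful requests / total requests."""
    if total == 0:
        # Assumption: no traffic counts as healthy. Some teams prefer "no data".
        return 1.0
    return successful / total


def evaluate_slo(successful: int, total: int, target: float = 0.999) -> SloReport:
    """Compare the measured SLI against the SLO target."""
    sli = success_rate_sli(successful, total)
    return SloReport(sli=sli, target=target, meets_slo=sli >= target)


# Hypothetical 30-day totals for a backend API.
report = evaluate_slo(successful=2_997_500, total=3_000_000)
print(f"SLI: {report.sli:.4%}  target: {report.target:.1%}  meets SLO: {report.meets_slo}")
```

The point of the sketch is the separation: the SLI is a number you measure, the SLO is the number you compare it against.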

In Grafana, I create:

That’s it. No clutter.

A dashboard is only useful if someone can understand it in 10 seconds.

Here’s what I always do:

If someone has to “figure out” your dashboard, it’s already failed.

Dashboards are passive. Alerts are what actually save you.

I usually set alerts like:

If success rate < 99% for 5 minutes → alert

Keep alerts simple. Too many alerts = people ignore them.
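The "for 5 minutes" part of that rule matters: it keeps a single bad minute from paging anyone. A rough Python sketch of the logic, assuming you have one success-rate sample per minute, could look like this. The function name `should_alert` and the sample values are hypothetical, and in practice you would configure this as an alert rule in your alerting tool rather than hand-roll it.

```python
from typing import Sequence


def should_alert(
    per_minute_success_rates: Sequence[float],
    threshold: float = 0.99,
    sustained_minutes: int = 5,
) -> bool:
    """Return True if the success rate stayed below the threshold
    for each of the most recent `sustained_minutes` samples."""
    if len(per_minute_success_rates) < sustained_minutes:
        return False  # not enough data yet to say the dip is sustained
    recent = per_minute_success_rates[-sustained_minutes:]
    return all(rate < threshold for rate in recent)


# One brief dip followed by recovery: the condition is not sustained.
brief_dip = should_alert([0.999, 0.97, 0.999, 0.999, 0.999, 0.999])

# Five consecutive minutes below 99%: the condition is sustained.
sustained_dip = should_alert([0.999, 0.98, 0.985, 0.97, 0.98, 0.975])

print(brief_dip, sustained_dip)
```

Requiring every sample in the window to breach the threshold is one simple design choice; it trades a few minutes of detection delay for far fewer noisy alerts.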

After simplifying dashboards:

That’s the goal — not more data, but better decisions.

Monitoring is not about collecting metrics. It’s about understanding system health quickly.

Start small. Track what matters. Your future self (during an incident) will thank you.

Written by Adarsh Singh — DevOps Engineer