Monitoring · Jan 2025 · 6 min read

Setting up SLI/SLO tracking in Grafana that your team will actually check

Most teams set up dashboards, feel good about it… and then never open them again. I’ve seen this happen multiple times.

The problem is not the tools; Grafana is powerful. The problem is how we design what we track.

In this article, I’ll share a simple approach I use to build SLI/SLO dashboards that teams actually rely on during real incidents.

What are SLI and SLO (quickly)

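Without overcomplicating it:

SLI (Service Level Indicator): a measurement of how your service is actually behaving, such as the percentage of requests that succeed.
SLO (Service Level Objective): the target you set for that SLI over a time window.

Example: If you aim for 99.9% of your API requests to succeed, that 99.9% target is your SLO. The success rate you actually measure is your SLI.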

The mistake most teams make

They track everything.

CPU usage, memory, disk, network, random graphs… but nothing answers:

“Is the system actually working for users right now?”

That’s the only question your dashboard should answer first.

What I track instead (simple rule)

I focus on just 3 things:

That’s enough to understand system health in seconds.

Example: API SLO setup

Let’s say you run a backend API.

SLI: Successful requests / total requests  
SLO: 99.9% success rate over 30 days
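To make the arithmetic concrete, here is a minimal Python sketch that computes the SLI from request counts, checks it against the 99.9% target, and reports how much error budget is left. A 99.9% target over 30 days means 0.1% of requests may fail; expressed as time, that is about 43 minutes (30 × 24 × 60 × 0.001 = 43.2) of full downtime. The `WindowCounts` type, the function names, and the sample numbers are hypothetical; in practice the counts would come from your metrics backend.

```python
from dataclasses import dataclass

SLO_TARGET = 0.999  # 99.9% success rate over a 30-day window


@dataclass
class WindowCounts:
    """Hypothetical request counts for one 30-day window."""
    successful: int
    total: int


def sli(counts: WindowCounts) -> float:
    """SLI = successful requests / total requests (1.0 if there was no traffic)."""
    if counts.total == 0:
        return 1.0
    return counts.successful / counts.total


def error_budget_remaining(counts: WindowCounts, target: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    allowed_failures = (1 - target) * counts.total
    actual_failures = counts.total - counts.successful
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1 - actual_failures / allowed_failures


# Example: 250 failed requests out of one million in the window.
counts = WindowCounts(successful=999_750, total=1_000_000)
print(f"SLI: {sli(counts):.4%}")
print("SLO met:", sli(counts) >= SLO_TARGET)
print(f"Error budget left: {error_budget_remaining(counts):.0%}")
```

Tracking the remaining budget, not just "met / not met", tells you how much room you have left before the SLO is actually at risk.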

In Grafana, I create:

That’s it. No clutter.
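Every panel still needs a query behind it. The sketch below assumes (this is not from the setup described above) that metrics live in Prometheus, exposed as a counter named `http_requests_total` with a `code` label, and that any non-5xx response counts as a success. Those names are hypothetical, so adjust them to whatever your services actually export. It builds the SLI as a PromQL expression and reads the result back through Prometheus's HTTP API (`/api/v1/query`).

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Hypothetical metric and label names: adjust to your own instrumentation.
METRIC = "http_requests_total"
SUCCESS_MATCHER = 'code!~"5.."'  # assumption: anything that is not a 5xx is a success


def success_rate_query(window: str = "5m") -> str:
    """PromQL for the SLI: successful requests / total requests over `window`."""
    return (
        f"sum(rate({METRIC}{{{SUCCESS_MATCHER}}}[{window}]))"
        f" / sum(rate({METRIC}[{window}]))"
    )


def parse_instant_value(body: dict) -> Optional[float]:
    """Extract the single sample from an instant-query (vector) response."""
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error')}")
    result = body["data"]["result"]
    if not result:
        return None  # no matching series, e.g. the metric is not exported yet
    _timestamp, value = result[0]["value"]  # value is [unix_time, "string number"]
    return float(value)


def query_success_rate(prometheus_url: str, window: str = "5m") -> Optional[float]:
    """Run the SLI query against a reachable Prometheus server."""
    params = urllib.parse.urlencode({"query": success_rate_query(window)})
    with urllib.request.urlopen(f"{prometheus_url}/api/v1/query?{params}") as resp:
        return parse_instant_value(json.load(resp))


print(success_rate_query())
```

The string returned by `success_rate_query()` is plain PromQL, so the same expression can go straight into a Grafana panel backed by a Prometheus data source. The Python wrapper is only useful if you also want to check the number from a script.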

Make dashboards usable (very important)

A dashboard is only useful if someone can understand it in 10 seconds.

Here’s what I always do:

If someone has to “figure out” your dashboard, it’s already failed.

Alerting (don’t skip this)

Dashboards are passive. Alerts are what actually save you.

I usually set alerts like:

If success rate < 99% for 5 minutes → alert
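The "for 5 minutes" part matters: a single bad sample should not page anyone, so the condition has to hold continuously before the alert fires. Here is a minimal Python sketch of that logic, using the 99% threshold and 5-minute duration from the rule above. The `AlertState` type, `evaluate` function, and sample data are hypothetical; Prometheus's `for:` clause and Grafana's alert pending period behave along these lines.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD = 0.99       # alert when the success rate drops below 99%...
FOR_SECONDS = 5 * 60   # ...and stays below it for 5 minutes


@dataclass
class AlertState:
    breach_started: Optional[float] = None  # when the rate first dropped below THRESHOLD
    firing: bool = False


def evaluate(state: AlertState, now: float, success_rate: float) -> AlertState:
    """Update the alert from one sample: pending first, firing after FOR_SECONDS."""
    if success_rate >= THRESHOLD:
        # Recovered: reset both the pending timer and the firing flag.
        return AlertState()
    started = state.breach_started if state.breach_started is not None else now
    return AlertState(breach_started=started, firing=now - started >= FOR_SECONDS)


# (seconds, success rate) samples: the dip starts at t=60 and recovers at t=420.
samples = [(0, 0.995), (60, 0.985), (120, 0.98), (240, 0.97), (360, 0.97), (420, 0.999)]
state = AlertState()
for t, rate in samples:
    state = evaluate(state, t, rate)
    status = "FIRING" if state.firing else ("pending" if state.breach_started is not None else "ok")
    print(t, rate, status)
```

Short dips stay in the pending state and never notify anyone, which is exactly what keeps the alert count low enough that people still trust it.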

Keep alerts simple. Too many alerts = people ignore them.

What changed after doing this

After simplifying dashboards:

That’s the goal — not more data, but better decisions.

Final thoughts

Monitoring is not about collecting metrics. It’s about understanding system health quickly.

Start small. Track what matters. Your future self (during an incident) will thank you.


Written by Adarsh Singh — DevOps Engineer