Observability and SLO design for a Series B HR SaaS platform

Pager fatigue and 4-hour MTTR gave way to OpenTelemetry traces, Grafana SLOs, and error budgets on the hiring API.

By Simplileap · Published June 11, 2025 · 10 min read

A Series B HR tech SaaS with 1,200 paying accounts, a Node.js API on EKS, and a React frontend relied on Datadog logs alone for observability. On-call engineers received 40+ pages per week, many of them duplicates caused by retry storms. Mean time to resolve production incidents averaged four hours because traces did not cross the API gateway into the worker queues.

Business risk: the hiring workflow API (offer letter generation and the e-sign webhook) had no explicit SLO, so leadership learned about reliability issues from customer churn calls rather than from dashboards.

Simplileap engagement: map the critical user journeys (apply → interview → offer → sign); instrument the Next.js BFF, Node services, and BullMQ workers with OpenTelemetry; export to Grafana Cloud, with Tempo traces and Loki logs correlated by trace_id.
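Correlating by trace_id only works if the trace context survives the hop from the HTTP request into the queue job. The sketch below illustrates that idea under stated assumptions: it uses the W3C Trace Context `traceparent` format (version 00 only), carries it inside the job payload, and deliberately uses neither the OpenTelemetry SDK nor BullMQ. `JobEnvelope`, `wrapJob`, and `startWorkerSpan` are hypothetical names, not part of any library or of the client's code; in a real service the OpenTelemetry propagation API would normally do this work.

```typescript
// Parsed form of a W3C Trace Context `traceparent` value (version 00 only).
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of the trace-flags byte
}

// Hypothetical wrapper around a queue job payload; BullMQ does not define this shape.
interface JobEnvelope<T> {
  traceparent: string;
  payload: T;
}

// Context of a span started in the worker, linked to the span that enqueued the job.
interface WorkerSpan extends TraceContext {
  parentSpanId: string;
}

const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO_RE = /^0+$/;

function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT_RE.exec(header);
  if (m === null) return null;
  // All-zero trace IDs and span IDs are invalid in the W3C format.
  if (ALL_ZERO_RE.test(m[1]) || ALL_ZERO_RE.test(m[2])) return null;
  return { traceId: m[1], spanId: m[2], sampled: (parseInt(m[3], 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  // Only the sampled flag is preserved; other flag bits are dropped in this sketch.
  return "00-" + ctx.traceId + "-" + ctx.spanId + "-" + (ctx.sampled ? "01" : "00");
}

function newSpanId(): string {
  // Math.random is enough for a sketch; it is not a cryptographic source.
  let id = "";
  for (let i = 0; i < 16; i++) {
    id += Math.floor(Math.random() * 16).toString(16);
  }
  return ALL_ZERO_RE.test(id) ? "0000000000000001" : id;
}

// Producer side: attach the current span's context to the job before enqueueing.
function wrapJob<T>(ctx: TraceContext, payload: T): JobEnvelope<T> {
  return { traceparent: formatTraceparent(ctx), payload: payload };
}

// Worker side: continue the same trace with a new span whose parent is the producer span.
// Returns null when the context is missing or malformed; the caller would start a new trace.
function startWorkerSpan<T>(job: JobEnvelope<T>): WorkerSpan | null {
  const parent = parseTraceparent(job.traceparent);
  if (parent === null) return null;
  return {
    traceId: parent.traceId,
    spanId: newSpanId(),
    parentSpanId: parent.spanId,
    sampled: parent.sampled,
  };
}
```

For a request whose context is `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`, the worker span keeps the trace ID `4bf92f3577b34da6a3ce929d0e0e4736` and records `00f067aa0ba902b7` as its parent, which is what lets one trace span both the API call and the queued work.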

Problems faced: legacy middleware stripped custom headers, so trace propagation required a gateway config change; verbose logging put PII into span attributes, so a scrubbing layer was added in the collector; the initial SLO (99.99%) was too aggressive and caused constant budget-burn alerts, so it was recalibrated to 99.5% with a monthly review.
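The SLO recalibration is easier to follow with the budget arithmetic written out. The sketch below assumes a time-based availability SLO over a 30-day window; the case study states neither the SLI definition nor the window, so both are assumptions, and the function names are hypothetical.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Minutes of full unavailability that an availability SLO tolerates over a window.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  if (!(sloPercent > 0 && sloPercent < 100)) {
    throw new RangeError("sloPercent must be strictly between 0 and 100");
  }
  if (!(windowDays > 0)) {
    throw new RangeError("windowDays must be positive");
  }
  const allowedFailureRatio = (100 - sloPercent) / 100;
  return allowedFailureRatio * windowDays * MINUTES_PER_DAY;
}

// Fraction of the window's budget used by one incident, treating the whole
// incident as a full outage (an assumption; partial outages burn less).
function budgetConsumed(incidentMinutes: number, sloPercent: number, windowDays: number): number {
  if (!(incidentMinutes >= 0)) {
    throw new RangeError("incidentMinutes must be zero or positive");
  }
  return incidentMinutes / errorBudgetMinutes(sloPercent, windowDays);
}
```

Under these assumptions, `errorBudgetMinutes(99.99, 30)` is about 4.32 minutes and `errorBudgetMinutes(99.5, 30)` is about 216 minutes. A single 22-minute full outage would use roughly five times the 99.99% budget but only about 10% of the 99.5% budget, which is consistent with the constant alerting seen at the original target.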

Deliverables: golden-signal dashboards per service; SLO burn-rate alerts routed to Slack; runbooks linked from alert annotations; a weekly error-budget review in the engineering standup.
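A burn-rate alert compares the observed error ratio with the ratio the SLO allows, so a burn rate of 1 means the budget would last exactly one SLO window. The sketch below shows a two-window check of the kind often used for paging. It assumes a request-based SLI and a threshold of 14.4 evaluated over a long and a short window (14.4 sustained for one hour consumes 2% of a 30-day budget). The case study does not state the client's windows or thresholds, so these values and all names here are illustrative only.

```typescript
// Request counts observed in one evaluation window.
interface WindowCounts {
  total: number;
  errors: number;
}

// Ratio of the observed error rate to the error rate the SLO allows.
function burnRate(w: WindowCounts, sloPercent: number): number {
  if (!(sloPercent > 0 && sloPercent < 100)) {
    throw new RangeError("sloPercent must be strictly between 0 and 100");
  }
  if (w.total <= 0) return 0; // no traffic means no budget was burned
  const allowedErrorRatio = (100 - sloPercent) / 100;
  return w.errors / w.total / allowedErrorRatio;
}

// Alert only when both windows exceed the threshold: the long window shows the
// burn is significant, the short window shows it is still happening.
function shouldAlert(
  longWindow: WindowCounts,
  shortWindow: WindowCounts,
  sloPercent: number,
  threshold: number = 14.4
): boolean {
  return (
    burnRate(longWindow, sloPercent) >= threshold &&
    burnRate(shortWindow, sloPercent) >= threshold
  );
}
```

With a 99.5% SLO, 800 errors in 10,000 requests over the long window gives a burn rate of about 16, so the check fires if the short window is similarly elevated. If the short window has already recovered, the check stays quiet, which is one way to avoid alerting on incidents that have ended.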

Outcome: pages per week fell from 41 to 9; MTTR dropped from 4 hours to 22 minutes on P1 incidents in Q3, post-rollout; customer-reported Sev-1 outages were down 60% in the same quarter. The client is anonymized as a growth-stage HR platform; a reference call is available under NDA.

