// Case study

Observability and SLO design for a Series B HR SaaS platform

Pager fatigue and 4-hour MTTR gave way to OpenTelemetry traces, Grafana SLOs, and error budgets on the hiring API.

B2B SaaS · Managed services & observability design · 12 weeks · Bangalore delivery

The client is a Series B HR tech SaaS with 1,200 paying accounts, running Node.js on EKS with a React frontend. On-call fatigue and a four-hour MTTR stemmed from logs-only monitoring with no cross-service trace propagation.

41 → 9 · Pages per week · On-call alert volume

4h → 22 min · MTTR (P1 incidents) · Q3 post-rollout

−60% · Customer Sev-1 outages · Same quarter as rollout

99.5% · SLO target · Hiring API with monthly error-budget review

Delivered by Deepak Pathak · Published June 11, 2025 · 10 min read

Client context

The client, a Series B HR tech SaaS with 1,200 paying accounts, a Node.js API on EKS, and a React frontend, relied on Datadog logs alone. On-call engineers received 40+ pages per week, many of them duplicates caused by retry storms. Mean time to resolve (MTTR) for production incidents averaged four hours because traces did not cross the API gateway into the worker queues.

The challenge

Business risk: the hiring workflow API (offer letter generation + e-sign webhook) had no explicit SLO; leadership discovered reliability issues from customer churn calls, not dashboards.

Problems faced: (1) legacy middleware stripped custom headers, so trace propagation required a gateway configuration change; (2) verbose logging leaked PII into span attributes, so a scrubbing layer was added in the collector; (3) the initial SLO of 99.99% was too aggressive and caused constant budget-burn alerts, so it was recalibrated to 99.5% with a monthly review.
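To show why the 99.99% target burned budget constantly, the sketch below works out the error budget each target allows. It assumes a time-based availability SLO over a 30-day window; the case study does not state the window or whether the SLO is request-based, and `errorBudgetMinutes` is a hypothetical helper name.

```typescript
// Minutes of allowed unavailability in a window for a given SLO target.
// Assumption: time-based SLO, 30-day window by default.
function errorBudgetMinutes(sloTarget: number, windowDays: number = 30): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be between 0 and 1 (exclusive)");
  }
  const windowMinutes = windowDays * 24 * 60; // 43,200 minutes in 30 days
  return (1 - sloTarget) * windowMinutes;
}

const strictBudget = errorBudgetMinutes(0.9999);      // roughly 4.32 minutes
const recalibratedBudget = errorBudgetMinutes(0.995); // roughly 216 minutes

console.log(`99.99% budget: ${strictBudget.toFixed(2)} min`);
console.log(`99.5%  budget: ${recalibratedBudget.toFixed(2)} min`);
```

Under these assumptions the original target left about four minutes of budget per month, which a single incident at the pre-engagement MTTR would exhaust many times over, while 99.5% leaves about 3.6 hours.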

Our approach

Simplileap engagement: map critical user journeys (apply → interview → offer → sign); instrument OpenTelemetry across Next.js BFF, Node services, and BullMQ workers; export to Grafana Cloud with Tempo traces and Loki logs correlated by trace_id.
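Carrying a trace across a queue boundary means putting the trace context into the job payload and reading it back in the worker. The sketch below illustrates the idea with the W3C `traceparent` format and plain objects; it is not the engagement's code. In a real setup the OpenTelemetry propagation API would normally do this work. The `JobData` shape, the `offerId` field, and both helper functions are hypothetical.

```typescript
// Hypothetical job payload: queue job data is plain JSON, so a string
// field can carry the trace context from the API to the worker.
interface JobData {
  offerId: string;
  traceparent?: string;
}

interface SpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context header: version-traceid-parentid-flags (version 00).
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string | undefined): SpanContext | null {
  if (!header) return null;
  const m = TRACEPARENT_RE.exec(header);
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Producer side (API): attach the current span's context to the job.
const apiSpan: SpanContext = {
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
};
const job: JobData = { offerId: "offer-123", traceparent: formatTraceparent(apiSpan) };

// Consumer side (worker): recover the parent context and log with trace_id
// so log lines can be joined to the trace.
const parent = parseTraceparent(job.traceparent);
console.log(JSON.stringify({ msg: "generating offer letter", trace_id: parent?.traceId }));
```

The worker's spans would then be created as children of `parent`, so the API request and the queued job appear in one trace instead of two disconnected ones.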

Deliverables: golden signals dashboards per service; SLO burn-rate alerts to Slack; runbooks linked from alert annotations; weekly error-budget review in eng standup.
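A burn-rate alert fires when errors consume the budget faster than the SLO allows. The sketch below shows the calculation, assuming a request-based SLO and the commonly used two-window pattern (page only when both a long and a short window exceed the threshold). The threshold of 14.4 and the window counts are illustrative assumptions, not values from the engagement; `burnRate` and `shouldAlert` are hypothetical names.

```typescript
interface WindowCounts {
  total: number;  // requests observed in the window
  errors: number; // requests that violated the SLO in the window
}

// Burn rate 1.0 means the budget is being spent exactly at the rate that
// would use it up at the end of the SLO period.
function burnRate(w: WindowCounts, sloTarget: number): number {
  if (w.total === 0) return 0; // no traffic, nothing burned
  const errorRate = w.errors / w.total;
  return errorRate / (1 - sloTarget);
}

// Requiring both windows to exceed the threshold avoids alerting on a
// brief spike (short window only) or on an issue that has already
// recovered (long window only).
function shouldAlert(
  longWindow: WindowCounts,
  shortWindow: WindowCounts,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindow, sloTarget) > threshold &&
    burnRate(shortWindow, sloTarget) > threshold
  );
}

const SLO = 0.995;      // hiring API target from the case study
const THRESHOLD = 14.4; // assumed: about 2% of a 30-day budget spent in one hour

// Illustrative counts: 8% errors over the last hour, 9% over the last 5 minutes.
const lastHour: WindowCounts = { total: 10_000, errors: 800 };
const lastFiveMin: WindowCounts = { total: 1_000, errors: 90 };

console.log(shouldAlert(lastHour, lastFiveMin, SLO, THRESHOLD));
```

With a 99.5% target, an 8% error rate burns budget roughly 16 times faster than sustainable, so this example would alert; a short-window error rate of 0.2% would not.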

Results & impact

Outcome: pages per week fell from 41 to 9; MTTR on P1 incidents dropped from 4 hours to 22 minutes in Q3 post-rollout; customer-reported Sev-1 outages were down 60% in the same quarter. The client is anonymized as a growth-stage HR platform; a reference call is available under NDA.


// Verified entity

Simplileap Digital LLP

CIN: AAU-8582

Startup India: DIPP83124

Founded: November 2020

Office: Residency Rd, Bengaluru, India

// Recognition

Featured in QuickNode Feature Fridays
