Senior Site Reliability Engineer

Miami · Hybrid · Apr 10, 2026 · Posted 1 month ago

Description

About Iru

Iru is the AI-powered security & IT platform used by the world’s fastest-growing companies to secure their users, apps, and devices. Built for the AI era, Iru unifies identity & access, endpoint security & management, and compliance automation—collapsing the stack and giving IT & security time and control back.

Iru is backed by some of the smartest investors in tech—General Catalyst, Tiger Global, Felicis, Greycroft, and First Round Capital. In July 2024, Iru raised $100 million from General Catalyst, valuing the company at $850 million. Customers include Notion, Cursor, Lovable, Replit, and Mercor, and Iru partners with industry leaders such as ServiceNow and AWS. Iru was named to Forbes’ America’s Best Startup Employers 2025 list for employee engagement and satisfaction.

What You Will Do

Lead and refine the incident lifecycle: detection, triage, communication, mitigation, resolution, and post-incident review.

Define and maintain severity models, escalation paths, on-call expectations, and runbooks/playbooks—keeping them current and usable under pressure.

Facilitate blameless postmortems; turn findings into tracked remediations and shared learning that reduces repeat incidents.

Improve coordination during major incidents: roles, tooling, customer/stakeholder updates, and handoffs.

Partner with security, support, and product on incident communications and regulatory or contractual obligations where applicable.

Observability Standardization & SLI/SLO Evangelism

Establish and maintain organization-wide standards for metrics, logs, and traces in Datadog (including naming conventions, cardinality, retention, and sampling) so teams can instrument consistently and confidently.

Define and drive adoption of SLOs, SLIs, and error budgets across engineering teams. Meet teams where they are: bootstrap SLI/SLO programs for teams starting from scratch and improve rigor for teams that already have them, with the long-term goal of teams owning their own observability.

Build and maintain reusable Datadog dashboard templates, monitor templates, and alerting patterns that teams can adopt and adapt—reducing the activation energy for doing observability well.

Champion golden signals and RED/USE-style alerting philosophies; align alerts with user-impacting symptoms, not just low-level infrastructure noise.

Partner with the Infrastructure team on observability stack decisions, multi-tenancy, cost controls, and data lifecycle.

Continuously reduce alert noise through threshold tuning, ownership assignment, and on-call load management.

Reliability Culture

Mentor engineers on operational excellence, safe deployment practices, and production readiness; help engineering teams grow their own reliability instincts.

Contribute to capacity planning, chaos/game-day exercises, and reliability reviews for critical changes.

Serve as a connective layer between the SRE and Infrastructure teams—aligning on tooling, standards, and shared goals.

Requirements

Experience

5+ years in SRE, production engineering, or equivalent, including on-call responsibility for customer-facing systems.

Incidents

Proven experience running or significantly improving incident response (process, tooling, or both) in a distributed systems environment.

Observability

Deep, hands-on experience with Datadog—building dashboards, monitors, and instrumentation standards across multiple teams or services. Experience with metrics, logging, and tracing at scale.

SLI/SLO Programs

Demonstrated experience defining SLOs/SLIs and error budget policies in production; comfortable working with teams to codify the metrics that underpin their reliability posture.

Systems

Strong understanding of Linux, networking, distributed systems failure modes, and cloud or hybrid infrastructure (Kubernetes, load balancers, databases, queues).

Automation

Proficiency in at least one of Go, Python, or similar for tooling and automation; comfort with IaC concepts (Terraform or equivalent).

Communication

Clear written and verbal communication; ability to facilitate discussions during high-pressure incidents and deliberate postmortems alike.

Collaboration

Track record of influencing without direct authority and driving adoption across engineering teams.

Nice to Have

Experience with OpenTelemetry or similar vendor-neutral instrumentation strategies.

Familiarity with PagerDuty, Incident.io, Opsgenie, or similar; Statuspage or equivalent for external communications.

Experience in a hyper-growth startup environment.

Experience in regulated or high-compliance environments.

Contributions to internal developer platforms or shared reliability tooling.

What Success Looks Like

Fewer repeated incidents and clearer, actionable postmortem outcomes that teams act on.

Engineering teams across the org have well-defined SLIs/SLOs they own and actively use to drive reliability decisions.

A shared Datadog observability layer with consistent signals, templated dashboards, and actionable alerts tied to user impact.

Engineers know how to instrument, where to look, and how to respond—with sustainable, well-supported on-call.

Location Context