Availability Engineer

Bangalore·Posted today
python, kubernetes, docker, terraform
<div class="content-intro"><p style="line-height: 1.4;"><span style="color: rgb(0, 0, 0); font-family: arial, helvetica, sans-serif; font-size: 12pt;"><strong>Who we are</strong></span></p> <p style="line-height: 1.4;"><span style="font-family: arial, helvetica, sans-serif; font-size: 12pt;">DigiCert is a global leader in intelligent trust. We protect the digital world by ensuring the security, privacy, and authenticity of every interaction. Our AI-powered DigiCert ONE platform unifies PKI, DNS, and certificate lifecycle management to secure infrastructure, software, devices, messages, AI content, and agents. Learn why more than 100,000 organizations, including 90% of the Fortune 500, choose DigiCert to stop today’s threats and prepare for a quantum-safe future at&nbsp;<a href="http://www.digicert.com/">www.digicert.com</a>.</span></p></div><p style="line-height: 1.4;"><span style="font-family: arial, helvetica, sans-serif; font-size: 12pt;"><strong>Job summary</strong></span></p> <p style="line-height: 1.4;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">We are seeking a highly skilled Observability &amp; Incident Response Site Reliability Engineer (SRE) to own incident management practices across all production systems. In this role, you will be the subject matter expert for monitoring, alerting, tracing, and logging, and you will lead incident response efforts. 
You will work at the intersection of product engineering, platform, and security teams to ensure our systems are observable, resilient, and compliant with SLA/SLO commitments.</span></p> <p style="line-height: 1.4;">&nbsp;</p> <p style="line-height: 1.4;"><span style="font-family: arial, helvetica, sans-serif; font-size: 12pt;"><strong>What you will do</strong></span></p> <ul> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Excellent knowledge of Kubernetes clusters and container workloads for production reliability.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Administer and optimize CI/CD pipelines (Harness, GitHub Actions, etc.) to support safe, fast, and frequent deployments and to automate repeated manual tasks.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Act as the primary Incident Manager for high-priority production incidents, coordinating swift resolution across engineering, infrastructure, and business teams.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Own and continuously improve incident response runbooks, escalation matrices, and on-call schedules.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Drive root cause analysis for all major incidents, ensuring action item tracking and long-term resolution.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Reduce Mean Time to Detect (MTTD) and Mean Time to 
Recover (MTTR) through proactive alerting and automated remediation.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Establish and enforce SLA/SLO/SLI frameworks across all production services.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Build automated runbooks and self-healing mechanisms to reduce manual intervention during incidents.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Implement synthetic monitoring to proactively detect customer-facing issues.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Use Splunk queries to investigate incidents, build dashboards, and drive observability across production systems.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Exceptional communication skills, including the ability to lead high-pressure incident bridges calmly and clearly.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Detail-oriented, with a strong sense of ownership and accountability.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Ability to manage multiple concurrent incidents and priorities without losing composure.</span></li> </ul> <p style="line-height: 1.4;">&nbsp;</p> <p style="line-height: 1.4;"><span style="font-family: arial, helvetica, sans-serif; font-size: 12pt;"><strong>What you 
will have</strong></span></p> <ul> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">4+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Hands-on experience leading incident response for high-severity production incidents.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Strong background in Linux systems administration and distributed systems troubleshooting.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Experience defining and managing SLOs, SLIs, and Error Budgets in production.</span></li> </ul> <p style="line-height: 1.4;">&nbsp;</p> <p style="line-height: 1.4;"><span style="font-family: arial, helvetica, sans-serif; font-size: 12pt;"><strong>Nice to have</strong></span></p> <ul> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Monitoring &amp; alerting: New Relic, Nagios, or equivalent.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Log management: Splunk.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Incident management: PagerDuty, OpsGenie, VictorOps, or equivalent.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Container 
orchestration: Kubernetes, Helm, and Docker, with deep observability integration experience.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Scripting &amp; automation: Python, Bash, or similar, for building tooling and automation.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Infrastructure as Code: Terraform or Salt.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">CI/CD pipelines: GitHub Actions, Harness.</span></li> </ul> <p style="line-height: 1.4;">&nbsp;</p> <p style="line-height: 1.4;"><span style="font-family: arial, helvetica, sans-serif; font-size: 12pt;"><strong>Benefits</strong></span></p> <ul> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Generous time-off policies.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Top-shelf benefits.</span></li> <li style="font-size: 12pt; font-family: arial, helvetica, sans-serif;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">Education, wellness, and lifestyle support.</span></li> </ul> <p style="line-height: 1.4;">&nbsp;</p> <p style="line-height: 1.4;"><span style="font-size: 12pt; font-family: arial, helvetica, sans-serif;">To protect candidate information and maintain a secure hiring process, all applications must be submitted through our careers portal. 
Resumes or CVs sent directly via email will not be reviewed or considered.</span></p>