Role Description
We are looking for a Site Reliability Engineering (SRE) Manager to lead our Cloud Infrastructure Engineering team in Chennai R&D. This team ensures the continuous availability of the technologies and systems that power athenahealth’s services.
- Manage thousands of servers and petabytes of storage, and process thousands of web requests per second.
- Create a seamless operating system for the medical office, abstracting administrative complexities so doctors can focus on patient care.
Key Responsibilities
- Team Leadership & Development:
  - Lead, mentor, and develop a team of SREs, fostering a culture of collaboration, accountability, and continuous learning.
  - Build a high-performing team focused on operational excellence, reliability, and scalability.
  - Partner with Engineering, Product, and Project Management teams to align priorities and drive cross-functional collaboration.
- Service Reliability & Performance:
  - Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical systems.
  - Monitor and enhance the reliability, availability, and performance of all production services and infrastructure.
  - Drive improvements in incident management, root cause analysis, and postmortem processes.
  - Implement proactive monitoring, alerting, and incident response strategies.
- System Automation & Scalability:
  - Lead automation efforts to eliminate manual tasks, improve system reliability, and streamline operations.
  - Implement best practices for system design, capacity planning, and cost optimization.
  - Work closely with engineering teams to build scalable, resilient, and efficient systems.
- Collaboration & Cross-functional Engagement:
  - Advocate for reliability best practices across engineering and product teams.
  - Ensure reliability is embedded in the development lifecycle by reviewing code, design, and deployment strategies.
  - Align with other engineering managers on long-term goals, technical debt, and infrastructure investments.
- Process & Efficiency Improvement:
  - Continuously improve incident management, deployment pipelines, and system observability.
  - Champion automation, monitoring, alerting, and reporting tools.
  - Use data-driven insights to measure and optimize operational performance.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 10+ years of experience in building, scaling, and supporting highly available systems and services.
- 2-3 years of experience in managing and mentoring technical teams, with expertise in containerization (Docker, Kubernetes - On-prem & Cloud).
- Strong background in Platform Engineering, TechOps, FinOps, and DevSecOps in a hybrid cloud environment.
- Expertise in Infrastructure-as-Code (Terraform, Crossplane, Puppet, Ansible) and API integration.
- Proficiency in at least one scripting or programming language (Python, Go, Ruby, etc.).
- Hands-on experience with Linux systems, VMware, cloud platforms (AWS), and observability tools (Prometheus, Grafana, ELK, CloudWatch, Splunk).
- Strong understanding of site reliability principles, telemetry, and monitoring best practices.
- Experience with large-scale distributed systems and cloud-native architectures.
- Familiarity with configuration management tools (Ansible, Chef, Puppet).
- Solid grasp of security best practices and compliance standards.
Benefits
- Health and financial benefits.
- Perks specific to each location, including commuter support, employee assistance programs, tuition assistance, employee resource groups, and collaborative workspaces.
- Events throughout the year, including book clubs, external speakers, and hackathons.
- Company culture based on learning, support of an engaged team, and an inclusive environment.
- Flexibility to encourage a better work-life balance.