Assistant Vice President - Site Reliability Engineering (SRE)

  • Mangaluru
  • Max Life Insurance Company Limited
Position: Assistant Vice President – Site Reliability Engineering (SRE)

Job Summary
Responsible for system performance and uptime and for IT Digital operations: maintaining and enhancing the operational efficiency of systems, with a focus on deployment automation and system optimization, to ensure consistent performance and reliability. The candidate must have robust, hands-on technical problem-solving skills and a strong desire to implement scalable and sustainable technological solutions.

Key Responsibilities, relationships & measure of success

  • Anchor and provide strategic direction on technologies and solutions in Digital operations.
  • Lead infrastructure and application builds and technical maintenance along with the core engineering and delivery teams.
  • Act as custodian of SRE SLOs, SLIs & Error Budgets.
  • Infrastructure scalability & optimization: assist in designing and implementing scalable, highly available system architectures that handle increasing loads and user demands without compromising performance.
  • Create and optimize CI/CD pipelines to automate testing and deployment, reducing the time from development to production and ensuring consistent quality control.
  • Design, monitor, and respond to system alerts; monitor system performance, identify bottlenecks, and execute optimizations and permanent fixes.
  • Manage incident response protocols, including on-call rotations.
  • Conduct post-incident reviews to prevent recurrence and refine the system reliability framework.
  • Provide primary operational support and engineering for multiple large-scale distributed software applications.
  • Collaborate with development operations staff to create, monitor, and troubleshoot the system infrastructure.
  • Increase system resilience and serve larger customer volumes with expert-level coding, bulletproof release, and change management skills.
  • Improve automation and increase the system’s self-healing capability.
  • Collect operating system data and report performance metrics to stakeholders.
  • Manage cloud and database system maintenance, debugging production issues as they arise.
  • Ensure the effective and seamless integration of security policies and practices into DevOps workflows to reduce overall risk and deliver products and services on time.
  • Implement end-to-end (E2E) automated VAPT for every new or existing application.
  • Reduce planned deployment downtime by 50% by ensuring a robust CI/CD setup.
  • Bring MTTR (mean time to recovery) to less than 1 hour for any SEV1 issue.
  • Bring MTTD (mean time to detect) to less than 5 minutes with the help of automated tools and methods.

Key competencies/skills required and experience
This is a tech manager role: the candidate must be capable of performing independent POCs and working with cross-functional departments, in addition to having the technical skills below.

  • Bachelor’s degree (B.E. / B.Tech. preferable).
  • 12+ years of overall experience in DevSecOps and SRE, including experience leading a team of 80+.
  • Ability to communicate at all levels.
  • Proven experience in managing large-scale distributed systems and understanding of the principles of scalability and reliability.
  • Ownership of DevOps DORA metrics.
  • SRE toil reduction through automation.
  • Experience with security tooling such as SAST, DAST, and container security.
  • Understanding of Node.js, React.js & Java.
  • Experience with Infrastructure as Code tools such as Terraform and CloudFormation.
  • Experience with container technologies such as Docker, Kubernetes, and OpenShift.
  • Must have knowledge of DevSecOps tools such as Git, Maven, Selenium, Jenkins, Ansible, and security tools.
  • Knowledge of at least one monitoring tool: Geneos, Nagios, Prometheus, Dynatrace, AppDynamics.
  • Scripting knowledge: UNIX Shell (Python, Groovy, and YAML are good to have).
  • Experience with and understanding of at least one cloud provider, such as AWS or Azure.
  • On-demand infrastructure provisioning: environment spin-offs, environment cloning, EKS, IaC.
  • Hands-on working knowledge of configuring SLAs, SLOs, SLIs, and infrastructure and business rules/logic in AppDynamics, AWS CloudWatch, Pingdom, Datadog, etc. (APM preferred).
  • Understanding of network protocols, load balancing, and firewall management for secure and efficient network operations.