Site Reliability Engineer (SRE)

  • Anand
  • Stealth

Key Responsibilities:
- Maintain and enhance the reliability, availability, and performance of large-scale distributed systems.
- Automate deployment, monitoring, and management of production systems.
- Implement and manage CI/CD pipelines for software delivery.
- Collaborate with software engineers to design, build, and manage scalable and resilient infrastructure.
- Troubleshoot complex system issues, identify root causes and implement long-term solutions.
- Monitor system performance and optimize configurations for better performance and cost efficiency.
- Implement security best practices and ensure compliance with industry standards.

Required Skills:
- Proficiency in cloud platforms (AWS, Google Cloud, or Azure) and containerization technologies like Docker and Kubernetes.
- Strong scripting and automation skills using Python, Bash, or similar languages.
- Experience with infrastructure as code (IaC) tools such as Terraform or Ansible.
- Deep understanding of monitoring and logging tools (Prometheus, Grafana, ELK Stack).
- Knowledge of database management (SQL/NoSQL) and networking fundamentals.
- Experience with CI/CD tools like Jenkins, GitLab CI, or CircleCI.
- Strong problem-solving skills and experience in troubleshooting large-scale systems.

Education:
- A degree in Computer Science, Engineering, or a related field from a recognized institution.
- Ideally, 5-10 years of experience in a similar role at a product company.