You'll Make a Difference By:

- SRE L2 Support Role: Focus on maintaining and improving the reliability, availability, and performance of AWS-based infrastructure and applications.
- Incident Management: Handle and resolve L2 incidents related to AWS services (EC2, RDS, S3, Lambda, EKS, etc.), perform root cause analysis, and communicate with customers during outages or SLA breaches.
- Monitoring & Optimization: Proactively monitor infrastructure and application health in AWS, set up and fine-tune AWS monitoring and observability tools (e.g., CloudWatch, CloudTrail), and create alarms, dashboards, and reports.
- Troubleshooting AWS Services: Resolve issues related to EC2 instances, Auto Scaling groups, Load Balancers (ELB/ALB/NLB), Amazon ECS, EKS, and container workloads.
- Log Management: Manage and analyze logs using AWS CloudWatch Logs, CloudTrail, and third-party solutions such as ELK Stack, Datadog, and Splunk.
- Disaster Recovery & Backups: Monitor AWS Backup jobs, ensure regular backups for critical infrastructure, validate DR plans, and participate in recovery testing exercises.
- Automation & Scripting: Contribute to automation of repetitive tasks using scripts and support incident recovery processes.
- Documentation & Knowledge Sharing: Create and maintain operational runbooks, SOPs, and knowledge base articles for common AWS issues.
- Collaboration: Work effectively across teams, shift ownership as required, and communicate with stakeholders during incidents.

You'd Describe Yourself As:

- An experienced professional with 6 to 9 years of relevant experience in SRE, DevOps, or Cloud Infrastructure Support, with strong hands-on expertise in AWS services.
- Proficient in monitoring tools such as Prometheus and Datadog, and familiar with cloud platforms (AWS, Azure, GCP).
- Knowledgeable in Linux/Unix operating systems, with basic scripting skills (e.g., Python, GitLab actions).
- Familiar with container orchestration (Kubernetes, Docker, Helm charts), CI/CD pipelines, and GitOps workflows (e.g., ArgoCD for automated deployments).
- Strong analytical skills to resolve production incidents and a basic understanding of networking concepts (DNS, Load Balancers, Firewalls).
- Experienced with alerting systems (e.g., PagerDuty) and incident tracking tools (e.g., JIRA, ServiceNow), and able to handle high-pressure environments.
- A proactive problem-solver with a strong sense of urgency and excellent organizational skills to prioritize tasks effectively.
- Able to work as a teammate, collaborating across teams and owning tasks as needed.

Preferred Certifications:

- AWS Certified SysOps Administrator Associate
- AWS Certified Solutions Architect Associate
- AWS Certified DevOps Engineer Professional

Skills: DevOps, Monitoring Tools, Cloud Infrastructure, SRE, Automation, AWS

Experience: 6 to 9 years

Site Reliability Engineer (SRE) – L2 Support

Job Description
