Job Description

You will collaborate with Cloud Engineering & Operations to develop a unified self-service observability platform that enables teams to instrument, monitor, and troubleshoot applications across on-prem, cloud, and hybrid environments. Your work will involve integrating telemetry pipelines, event management workflows, and automation frameworks. You'll standardize observability practices by building reusable templates, automation scripts, and onboarding accelerators, embedding observability into CI/CD pipelines, and driving OpenTelemetry adoption. These efforts will enhance developer experience, reduce operational overhead, and improve system reliability across Experian’s global technology ecosystem.

Primary Responsibilities:
- Design and implement monitoring and observability solutions using Dynatrace and Splunk (must-have), along with Datadog and OpenTelemetry, to build scalable, automated, and developer-friendly platforms.
- Develop reusable patterns, templates, and automation scripts to drive consistency across observability practices and reduce manual effort in telemetry onboarding.
- Build and maintain dashboards that deliver actionable insights into system performance, reliability, and user experience.
- Integrate observability into CI/CD workflows using Jenkins and related tooling to enable continuous feedback and faster incident detection.
- Automate infrastructure provisioning and deployment using Terraform and Ansible to support observability at scale.
- Implement and manage OpenTelemetry pipelines for standardized collection of traces, metrics, and logs, supporting vendor-agnostic ingestion strategies.
- Collaborate with Business Units (BUs), developers, and platform engineers to embed observability into the software delivery lifecycle and improve developer experience.
- Define and implement SLIs/SLOs and error budgets with Business Units to support reliability engineering and improve service health visibility.
- Enhance operational excellence by enabling proactive monitoring, reducing customer pain points, and streamlining incident workflows.
- Amplify AIOps outcomes by integrating observability data into intelligent automation and decision-making across technology and business teams.
- AWS & Cloud Operations:
  - Manage and operate systems hosted on AWS (EC2, EKS/ECS, RDS, S3, Lambda, CloudWatch, IAM, VPC)
  - Support cloud deployments and infrastructure changes following best practices
  - Assist with backup, disaster recovery, and resiliency planning
- Incident Management:
  - Participate in production incident response, troubleshooting, and service restoration
  - Perform root cause analysis (RCA) and contribute to post-incident reviews
  - Help implement preventive actions to avoid incident recurrence

Secondary Skills:
- Reliability & Operations:
  - Support high availability, scalability, and performance of production systems
  - Implement and maintain SLIs, SLOs, and SLAs for services
  - Identify and reduce operational toil through automation and process improvement
  - Support design and implementation of fault-tolerant and resilient systems
- Collaboration:
  - Work closely with application and engineering teams to embed reliability into system design
  - Act as a strong team player, sharing knowledge and supporting team goals
  - Communicate effectively with technical and non-technical stakeholders

Senior Observability Engineer
