Senior MLOps Engineer
Unico Connect Private Limited
Mumbai, Maharashtra
Job Description
Senior MLOps Engineer
LLM Operations, Observability & Eval Infrastructure
Mumbai (On-site) | Full-time | 5-7 years

About the Role:
Unico Connect is an AI-first technology partner that builds custom mobile, web, and AI products for clients across multiple geographies. We are hiring a Senior MLOps Engineer for a dedicated client engagement focused on building an AI-powered application builder platform. The platform consumes LLMs at scale through provider APIs.

This role owns the operational discipline around production LLM consumption - increasingly called LLMOps - covering observability, evaluation infrastructure, model lifecycle, cost operations, prompt deployment, and agent run reliability. The mandatory requirement is hands-on production experience operating LLM-backed systems, with a strong DevOps or SRE foundation.

This is not a model training or ML science role. The work is making the system around the AI engineer's designs observable, controlled, reliable, and economically accountable. You will pair daily with the Senior AI Engineer, who designs prompts, evals, and agent behaviour - you operationalise those systems for production. A typical week includes a tracing audit on a degraded agent run, an eval pipeline build for a new model release, a cost attribution review, and a staged prompt rollout.

Responsibilities:

Observability and Tracing
Build and own end-to-end tracing for agent runs: every prompt, response, tool call, token count, latency, and cost, linked to user session and project. Stand up and operate LLM observability tooling (Langfuse, LangSmith, Braintrust, or Arize Phoenix). Make debugging a single bad agent run among thousands a routine workflow through searchable traces, failure taxonomies, and dashboards segmented by task type.

Evaluation Infrastructure as a Production System
Operationalise the eval suite designed by the Senior AI Engineer: automated execution in CI on every prompt or model change, with results stored and trended over time.
Implement regression gates that block quality-degrading changes from shipping. Build production sampling to continuously score a sample of real agent runs and catch quality drift that offline evals miss.

Model Lifecycle Management
Pin model versions, never "latest". Own the upgrade process: run the eval suite against new model releases and manage eval-gated migrations. Maintain fallback chains across providers for graceful degradation or queueing during outages. Track provider deprecation schedules and plan migrations ahead of forced cutoffs.

Cost Operations
Implement per-user and per-task cost attribution - token spend is the platform's largest variable cost and requires the same rigour as cloud cost management. Set up budget alerts and anomaly detection so a single user or bug cannot burn significant spend overnight. Monitor prompt cache hit rates and quantify savings. Manage capacity planning around provider rate limits, including quota negotiation and throughput tiering.

Prompt and Configuration Deployment
Treat prompts as production artifacts: version control for prompts and agent configurations, staged rollout infrastructure (deploy a prompt change to a percentage of traffic before full rollout), A/B testing infrastructure, instant rollback, and audit history covering which prompt version served which user and when.

Reliability Engineering for Agent Runs
Agent runs are long, stateful, and failure-prone. Own retry and resume semantics so a run that fails mid-way does not restart from scratch. Implement timeouts and circuit breakers on provider calls, dead-letter handling for failed runs, and queue and concurrency management for agent workloads.

SLO Ownership and Incident Response
Define and track SLOs for agent run latency and completion rates. Lead incident response when SLOs are breached. Write postmortems. Surface reliability risks proactively before they reach users.

Safety and Compliance Operations
Run the moderation pipeline (prompt and output classification) in production. Monitor for abuse patterns and own incident response when the agent misbehaves at scale. Maintain audit logs and implement data retention and residency policies for prompts and generated code as enterprise requirements emerge.

AI-Assisted Engineering Discipline
Use Claude, Cursor, and similar tools day to day for infrastructure code, scripts, and pipelines. Set the team standard for safe use, review, and validation of AI-generated infrastructure before it ships.

Requirements:

Hands-on production ownership of LLM-backed systems in operation (mandatory)
Must have personally shipped and operated at least one LLM-powered system in production, with operational responsibility including oncall, incident response, and reliability ownership. Alternatively: strong DevOps or SRE background with demonstrated hands-on familiarity with LLMOps tooling (Langfuse, LangSmith, Braintrust, Arize, or equivalent). POCs and lab work do not qualify.

5+ years of overall engineering experience
With at least 2 years in DevOps, SRE, platform engineering, or LLM operations roles. This is not an ML science role. A DevOps or SRE background with a substantive pivot into LLMOps is a strong qualification.

Observability and Tracing Depth
Production experience with LLM observability tooling - Langfuse, LangSmith, Braintrust, or Arize Phoenix. Comfortable instrumenting with OpenTelemetry, Prometheus, and Grafana. Able to build and search trace pipelines, define failure taxonomies, and surface quality signals from production traffic.

CI/CD and Quality Gate Experience
Strong with GitHub Actions or GitLab CI. Experience building automated quality gates: eval-gated pipelines, regression enforcement, or coverage gates that block degrading changes from shipping.

Cost Management and Attribution for Usage-Based Services
Experience owning cost attribution for cloud API spend or equivalent.
Comfortable with budget alerts, anomaly detection, and per-user or per-task cost breakdowns.

Reliability Engineering for Long-Running, Stateful Workloads
Experience with queues, retry patterns, idempotency, and failure recovery on asynchronous or multi-step workloads. Comfortable defining SLOs and being accountable for them on production systems.

Multi-Provider API Management
Familiarity with LLM provider rate limits, version pinning, fallback chains, and quota management across OpenAI, Anthropic, Google, or equivalent.

Infrastructure as Code and Deployment Automation
Hands-on with Terraform or Pulumi and Docker. AWS working knowledge (EC2, S3, IAM, EKS or ECS). Strong with CI/CD for deploying services and configuration changes safely.

Nice to Have
- Experience with prompt A/B testing or staged rollout infrastructure
- Workflow orchestration (BullMQ, Temporal, Celery)
- Content moderation pipeline experience
- Data residency and compliance requirements for AI systems
- Kubernetes (EKS) in production
- AWS certifications

Skills: MLOps, LangSmith, OpenTelemetry, LLMOps, Amazon Web Services (AWS), DevOps, Docker, Terraform, CI/CD, Langfuse and SRE