MLOps Engineer
InData Labs • Poland
💰 Salary
Range not disclosed
📝 Main Description / Introduction
Experience Level: Senior (5+ years)
About the Role
We are looking for a Senior MLOps Engineer to design, build, and operate the machine learning infrastructure that powers our AI-driven products. In this role, you will own the end-to-end ML lifecycle — from model registration and versioning through production deployment, monitoring, and governance — ensuring our models are reliable, compliant, and performant at scale.
You will work at the intersection of machine learning, platform engineering, and cloud infrastructure, partnering closely with data scientists, software engineers, and compliance teams. Your work will directly impact how quickly and safely we ship models to production, establish evaluation frameworks for LLM-based systems, and build the developer tooling that accelerates the entire ML organization.
Key Responsibilities
ML Platform & Model Lifecycle
- Architect and maintain the model lifecycle management platform, including model registration, versioning, and deployment pipelines
- Integrate MLflow with Databricks Unity Catalog for end-to-end model governance, lineage tracking, and access control across environments
- Configure and manage AI Gateway and Model Serving endpoints to provide scalable, governed access to both internal and third-party models
- Design and implement LLM evaluation pipelines, including judge/grader architectures, for systematic assessment of model quality, safety, and regression detection
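The judge/grader architecture mentioned above can be sketched in a few lines. This is a minimal illustration, not the company's actual pipeline: the names (`EvalCase`, `run_eval`, `regression_gate`) and the threshold values are hypothetical, and the `generate` and `judge` callables stand in for the model under test and an LLM-as-judge scorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str
    reference: str  # expected or gold answer used by the judge


@dataclass
class Grade:
    score: float   # judge score in [0.0, 1.0]
    passed: bool


def run_eval(
    cases: List[EvalCase],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    threshold: float = 0.7,
) -> List[Grade]:
    """Run each case through the model under test, then score with a judge."""
    grades = []
    for case in cases:
        answer = generate(case.prompt)
        score = judge(case.prompt, case.reference, answer)
        grades.append(Grade(score=score, passed=score >= threshold))
    return grades


def regression_gate(grades: List[Grade], min_pass_rate: float = 0.9) -> bool:
    """Quality gate for CI/CD: fail the pipeline if the pass rate regresses."""
    pass_rate = sum(g.passed for g in grades) / len(grades)
    return pass_rate >= min_pass_rate
```

In practice the `judge` callable would wrap an LLM grading prompt, and `regression_gate` would run against a fixed benchmark suite on every model release to catch regressions before deployment.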
Infrastructure & Cloud Engineering
- Deploy, scale, and operate ML workloads on AWS EKS / Kubernetes clusters, including GPU scheduling, autoscaling policies, health monitoring, and incident troubleshooting
- Author and maintain Terraform modules for all infrastructure provisioning, ensuring reproducible environments and disciplined state management across dev, staging, and production
- Design and implement CI/CD pipelines (GitHub Actions preferred) for model training, testing, packaging, and deployment with automated quality gates
- Build production-grade Python services, internal SDKs, and shared libraries that enable data scientists and ML engineers to ship models with minimal friction
- Design multi-consumer APIs with strong contracts — semantic versioning, backward compatibility, clear deprecation policies — serving diverse teams and use cases
- Drive observability across the ML stack: logging, metrics, alerting, and dashboarding for model performance, drift detection, and infrastructure health
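As one concrete example of the drift detection called out above, the Population Stability Index (PSI) is a common metric for comparing a live feature distribution against its training baseline. This is a sketch under that assumption; the posting does not specify which drift metric the team uses, and the `psi` helper and its alert thresholds are illustrative.

```python
import math
from typing import List


def psi(expected: List[float], actual: List[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample.

    Rule of thumb often used for alerting: PSI < 0.1 is stable,
    0.1-0.25 warrants investigation, > 0.25 signals significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def fractions(values: List[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # smooth empty buckets so the log term is always defined
        eps = 1e-6
        total = len(values) + bins * eps
        return [(c + eps) / total for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A monitoring job would compute this per feature (or on model output scores) on a schedule and emit it as a metric, with alerting rules attached at the thresholds above.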
Required Qualifications
- 5+ years of professional experience in MLOps, ML Engineering, Platform Engineering, or a closely related infrastructure role
- MLflow: Hands-on experience with model registration, deployment workflows, and Unity Catalog integration
- Python Development: Proven ability to build production-grade services, internal SDKs/libraries, and comprehensive testing frameworks
- AWS EKS / Kubernetes: Strong experience with container orchestration — deployment strategies, horizontal/vertical scaling, monitoring, and production troubleshooting
- Terraform: Solid infrastructure-as-code skills, including module authoring, remote state management, and multi-environment provisioning
- CI/CD: Demonstrated experience designing and implementing deployment pipelines; GitHub Actions strongly preferred
- Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field (or equivalent practical experience)
Strongly Preferred Qualifications
- LLM Evaluation Systems: Experience building or operating evaluation frameworks for large language models, including judge/grader architectures, automated scoring pipelines, and quality benchmarking
- Regulated Industry Experience: Familiarity with compliance-driven development environments (financial services, healthcare, government, or similar) and the ability to translate regulatory requirements into engineering controls
- API Design for Multi-Consumer Platforms: Track record of designing developer-facing APIs and SDKs with semantic versioning, backward compatibility guarantees, and clear deprecation strategies
- LLM Provider APIs: Working experience with Gemini and OpenAI model APIs, including prompt engineering, token management, rate limiting, and fallback strategies
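The fallback strategy in the last bullet can be illustrated with a small dispatcher. This is a hedged sketch, not a reference to any real provider SDK: `RateLimitError` stands in for an HTTP 429 from a provider client, and the `providers` list would hold thin wrappers around, e.g., the OpenAI and Gemini clients in priority order.

```python
import time
from typing import Callable, List, Tuple


class RateLimitError(Exception):
    """Raised by a provider call when it is throttled (stand-in for HTTP 429)."""


def call_with_fallback(
    providers: List[Tuple[str, Callable[[str], str]]],
    prompt: str,
    retries: int = 2,
    backoff: float = 0.5,
) -> Tuple[str, str]:
    """Try providers in priority order; retry on rate limits, then fall back.

    Each provider is a (name, callable) pair; the callable takes the prompt
    and returns a completion string or raises RateLimitError.
    Returns (provider_name, completion).
    """
    errors = []
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, call(prompt)
            except RateLimitError as exc:
                errors.append((name, attempt, str(exc)))
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all providers exhausted: {errors!r}")
```

Real implementations would also cap total latency, distinguish retryable errors (429, transient 5xx) from permanent ones, and track token budgets per provider; those concerns are omitted here for brevity.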
Core Technology Stack
- ML Platform: MLflow, Databricks (AI Gateway, Model Serving, Unity Catalog)
- Languages: Python (primary), Bash, JavaScript/TypeScript
- Orchestration: AWS EKS, Kubernetes, Helm, Docker
- Infrastructure: Terraform, AWS (S3, IAM, ECR, CloudWatch, SageMaker)
- CI/CD: GitHub Actions, ArgoCD/Flux
- LLM APIs: OpenAI API, Google Gemini API
- Observability: Prometheus, Grafana, Datadog, or equivalent