MLOps Engineer
InData Labs • Poland
💰 Salary
Range not disclosed
📝 Main Description / Introduction
Experience Level: Senior (5+ years)
About the Role
We are looking for a Senior MLOps Engineer to design, build, and operate the machine learning infrastructure that powers our AI-driven products. In this role, you will own the end-to-end ML lifecycle — from model registration and versioning through production deployment, monitoring, and governance — ensuring our models are reliable, compliant, and performant at scale.
You will work at the intersection of machine learning, platform engineering, and cloud infrastructure, partnering closely with data scientists, software engineers, and compliance teams. Your work will directly impact how quickly and safely we ship models to production, establish evaluation frameworks for LLM-based systems, and build the developer tooling that accelerates the entire ML organization.
Key Responsibilities
ML Platform & Model Lifecycle
- Architect and maintain the model lifecycle management platform, including model registration, versioning, and deployment pipelines
- Integrate MLflow with Databricks Unity Catalog for end-to-end model governance, lineage tracking, and access control across environments
- Configure and manage AI Gateway and Model Serving endpoints to provide scalable, governed access to both internal and third-party models
- Design and implement LLM evaluation pipelines, including judge/grader architectures, for systematic assessment of model quality, safety, and regression detection
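The judge/grader architecture mentioned above can be sketched in a few lines. This is a minimal illustration, not the company's actual pipeline: the names (`EvalCase`, `run_eval`, `regression_gate`) and the threshold values are hypothetical, and the `generate` and `judge` callables stand in for the model under test and an LLM-as-judge scorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    prompt: str
    reference: str  # expected or gold answer used by the judge


@dataclass
class Grade:
    score: float   # judge score in [0.0, 1.0]
    passed: bool


def run_eval(
    cases: List[EvalCase],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    threshold: float = 0.7,
) -> List[Grade]:
    """Run each case through the model under test, then score with a judge."""
    grades = []
    for case in cases:
        answer = generate(case.prompt)
        score = judge(case.prompt, case.reference, answer)
        grades.append(Grade(score=score, passed=score >= threshold))
    return grades


def regression_gate(grades: List[Grade], min_pass_rate: float = 0.9) -> bool:
    """Quality gate for CI/CD: fail the pipeline if the pass rate regresses."""
    pass_rate = sum(g.passed for g in grades) / len(grades)
    return pass_rate >= min_pass_rate
```

In practice the `judge` callable would wrap an LLM grading prompt, and `regression_gate` would run against a fixed benchmark suite on every model release to catch regressions before deployment.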
Infrastructure & Cloud Engineering
- Deploy, scale, and operate ML workloads on AWS EKS / Kubernetes clusters, including GPU scheduling, autoscaling policies, health monitoring, and incident troubleshooting
- Author and maintain Terraform modules for all infrastructure provisioning, ensuring reproducible environments and disciplined state management across dev, staging, and production
- Design and implement CI/CD pipelines (GitHub Actions preferred) for model training, testing, packaging, and deployment with automated quality gates
- Build production-grade Python services, internal SDKs, and shared libraries that enable data scientists and ML engineers to ship models with minimal friction
- Design multi-consumer APIs with strong contracts — semantic versioning, backward compatibility, clear deprecation policies — serving diverse teams and use cases
- Drive observability across the ML stack: logging, metrics, alerting, and dashboarding for model performance, drift detection, and infrastructure health
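As one concrete example of the drift detection called out above, the Population Stability Index (PSI) is a common metric for comparing a live feature distribution against its training baseline. This is a sketch under that assumption; the posting does not specify which drift metric the team uses, and the `psi` helper and its alert thresholds are illustrative.

```python
import math
from typing import List


def psi(expected: List[float], actual: List[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample.

    Rule of thumb often used for alerting: PSI < 0.1 is stable,
    0.1-0.25 warrants investigation, > 0.25 signals significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def fractions(values: List[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # smooth empty buckets so the log term is always defined
        eps = 1e-6
        total = len(values) + bins * eps
        return [(c + eps) / total for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A monitoring job would compute this per feature (or on model output scores) on a schedule and emit it as a metric, with alerting rules attached at the thresholds above.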
Required Qualifications
- 5+ years of professional experience in MLOps, ML Engineering, Platform Engineering, or a closely related infrastructure role
- MLflow: Hands-on experience with model registration, deployment workflows, and Unity Catalog integration
- Python Development: Proven ability to build production-grade services, internal SDKs/libraries, and comprehensive testing frameworks
- AWS EKS / Kubernetes: Strong experience with container orchestration — deployment strategies, horizontal/vertical scaling, monitoring, and production troubleshooting
- Terraform: Solid infrastructure-as-code skills, including module authoring, remote state management, and multi-environment provisioning
- CI/CD: Demonstrated experience designing and implementing deployment pipelines; GitHub Actions strongly preferred
- Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field (or equivalent practical experience)
Strongly Preferred Qualifications
- LLM Evaluation Systems: Experience building or operating evaluation frameworks for large language models, including judge/grader architectures, automated scoring pipelines, and quality benchmarking
- Regulated Industry Experience: Familiarity with compliance-driven development environments (financial services, healthcare, government, or similar) and the ability to translate regulatory requirements into engineering controls
- API Design for Multi-Consumer Platforms: Track record of designing developer-facing APIs and SDKs with semantic versioning, backward compatibility guarantees, and clear deprecation strategies
- LLM Provider APIs: Working experience with Gemini and OpenAI model APIs, including prompt engineering, token management, rate limiting, and fallback strategies
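The fallback strategy in the last bullet can be illustrated with a small dispatcher. This is a hedged sketch, not a reference to any real provider SDK: `RateLimitError` stands in for an HTTP 429 from a provider client, and the `providers` list would hold thin wrappers around, e.g., the OpenAI and Gemini clients in priority order.

```python
import time
from typing import Callable, List, Tuple


class RateLimitError(Exception):
    """Raised by a provider call when it is throttled (stand-in for HTTP 429)."""


def call_with_fallback(
    providers: List[Tuple[str, Callable[[str], str]]],
    prompt: str,
    retries: int = 2,
    backoff: float = 0.5,
) -> Tuple[str, str]:
    """Try providers in priority order; retry on rate limits, then fall back.

    Each provider is a (name, callable) pair; the callable takes the prompt
    and returns a completion string or raises RateLimitError.
    Returns (provider_name, completion).
    """
    errors = []
    for name, call in providers:
        for attempt in range(retries + 1):
            try:
                return name, call(prompt)
            except RateLimitError as exc:
                errors.append((name, attempt, str(exc)))
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all providers exhausted: {errors!r}")
```

Real implementations would also cap total latency, distinguish retryable errors (429, transient 5xx) from permanent ones, and track token budgets per provider; those concerns are omitted here for brevity.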
Core Technology Stack
- ML Platform: MLflow, Databricks (AI Gateway, Model Serving, Unity Catalog)
- Languages: Python (primary), Bash, JavaScript/TypeScript
- Orchestration: AWS EKS, Kubernetes, Helm, Docker
- Infrastructure: Terraform, AWS (S3, IAM, ECR, CloudWatch, SageMaker)
- CI/CD: GitHub Actions, ArgoCD/Flux
- LLM APIs: OpenAI API, Google Gemini API
- Observability: Prometheus, Grafana, Datadog, or equivalent