MLOps Engineer

Job Description

As an MLOps Engineer at CloudlyIO, you will build and own the infrastructure and tooling that take machine learning from experimentation to reliable, observable, production-grade systems. You are the connective tissue between our ML Engineers, who build models, and the cloud infrastructure that runs them at scale. You will define how models are versioned, tracked, deployed, monitored, and retrained across our four AI solution areas: CloudlyNet, CloudlyCare, CloudlyPulse, and CloudlyMELT.

This is a role for someone who understands both the ML development lifecycle and production engineering deeply enough to design systems that serve both well. You bring order, repeatability, and operational rigor to a discipline that too often relies on tribal knowledge and manual processes.

Job Requirements

ML Platform & Tooling
  • Design, build, and maintain the MLOps platform that supports model development, training, evaluation, and deployment across all four CloudlyIO solution areas
  • Implement and manage experiment tracking, model versioning, and model registry systems using tools such as MLflow, Weights & Biases, or similar
  • Build and maintain feature stores and data pipeline infrastructure supporting model training and inference workloads
  • Develop reusable MLOps frameworks, libraries, and templates that accelerate model development across product teams
Model Deployment & Serving
  • Build and manage model serving infrastructure for real-time and batch inference workloads
  • Design scalable, low-latency inference pipelines on AWS and other cloud platforms, including GPU-accelerated serving where required
  • Implement model packaging, containerization, and deployment automation integrated with CI/CD pipelines
  • Manage GPU compute allocation and optimize training and inference resource utilization
Monitoring & Reliability
  • Implement production monitoring for model performance, data drift, and system health using Prometheus, Grafana, CloudWatch, and related tooling
  • Build automated alerting and retraining triggers based on performance degradation and statistical drift detection
  • Define and track reliability SLOs for ML systems and drive continuous improvement against them
  • Lead root cause analysis for production ML incidents and implement systemic fixes
Standards & Collaboration
  • Work closely with ML Engineers to understand model requirements and translate them into robust, maintainable production systems
  • Collaborate with DevOps and Cloud teams to integrate ML workflows into broader infrastructure and delivery pipelines
  • Define and document MLOps standards, best practices, and operational runbooks across the organization
  • Evaluate and recommend new tools and approaches as the ML platform and business needs evolve

YOU MAY BE A GOOD FIT IF YOU HAVE

  • 3 to 5 years of experience in MLOps, platform engineering, or a combination of ML and DevOps roles in production environments
  • Hands-on experience building and maintaining ML pipelines, model registries, and experiment tracking systems
  • Strong Python proficiency and comfort working with ML frameworks such as PyTorch, TensorFlow, or scikit-learn at the infrastructure level
  • Experience with model serving frameworks such as TorchServe, Triton Inference Server, BentoML, or equivalent
  • Solid AWS experience including SageMaker, EKS, EC2, S3, and Lambda
  • Experience containerizing and deploying ML workloads using Docker and Kubernetes
  • Working knowledge of data pipeline orchestration tools such as Airflow or Prefect
  • Strong observability instincts: you build ML systems you can see into clearly and act on quickly

PREFERRED QUALIFICATIONS
  • Experience with GPU infrastructure management including scheduling, utilization monitoring, and cost optimization
  • Familiarity with distributed training frameworks such as Horovod, Ray, or PyTorch Distributed
  • Experience with feature store technologies such as Feast or Tecton
  • Familiarity with data version control tools such as DVC
  • AWS certification such as Solutions Architect or Machine Learning Specialty
  • Experience in regulated industries such as healthcare or telecommunications where model auditability and traceability are requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field

COMPENSATION & BENEFITS
  • Salary: Competitive and negotiable based on experience
  • Two annual festive bonuses, each equivalent to half a month's salary
  • Two-day weekends, 10 days of casual leave, 10 days of sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
  • Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
  • Health insurance
  • Direct collaboration with US clients and teams, working on real enterprise AI infrastructure from day one