CLOUDLYIO

MLOps Engineer

Job Description

As an MLOps Engineer at CloudlyIO, you will build and own the infrastructure and tooling that takes machine learning from experimentation to reliable, observable, production-grade systems. You are the connective tissue between our ML Engineers who build models and the cloud infrastructure that runs them at scale. You will define how models are versioned, tracked, deployed, monitored, and retrained across our four AI solution areas: CloudlyNet, CloudlyCare, CloudlyPulse, and CloudlyMELT.

This is a role for someone who understands both the ML development lifecycle and production engineering deeply enough to design systems that serve both well. You bring order, repeatability, and operational rigor to a discipline that too often relies on tribal knowledge and manual processes.

Job Requirements

ML Platform & Tooling

Design, build, and maintain the MLOps platform that supports model development, training, evaluation, and deployment across all four CloudlyIO solution areas
Implement and manage experiment tracking, model versioning, and model registry systems using tools such as MLflow, Weights & Biases, or similar
Build and maintain feature stores and data pipeline infrastructure supporting model training and inference workloads
Develop reusable MLOps frameworks, libraries, and templates that accelerate model development across product teams

Model Deployment & Serving

Build and manage model serving infrastructure for real-time and batch inference workloads
Design scalable, low-latency inference pipelines on AWS and other cloud platforms, including GPU-accelerated serving where required
Implement model packaging, containerization, and deployment automation integrated with CI/CD pipelines
Manage GPU compute allocation and optimize training and inference resource utilization

Monitoring & Reliability

Implement production monitoring for model performance, data drift, and system health using Prometheus, Grafana, CloudWatch, and related tooling
Build automated alerting and retraining triggers based on performance degradation and statistical drift detection
Define and track reliability SLOs for ML systems and drive continuous improvement against them
Lead root cause analysis for production ML incidents and implement systemic fixes

Standards & Collaboration

Work closely with ML Engineers to understand model requirements and translate them into robust, maintainable production systems
Collaborate with DevOps and Cloud teams to integrate ML workflows into broader infrastructure and delivery pipelines
Define and document MLOps standards, best practices, and operational runbooks across the organization
Evaluate and recommend new tools and approaches as the ML platform and business needs evolve

YOU MAY BE A GOOD FIT IF YOU HAVE

3 to 5 years of experience in MLOps, platform engineering, or a combination of ML and DevOps roles in production environments
Hands-on experience building and maintaining ML pipelines, model registries, and experiment tracking systems
Strong Python proficiency and comfort working with ML frameworks such as PyTorch, TensorFlow, or scikit-learn at the infrastructure level
Experience with model serving frameworks such as TorchServe, Triton Inference Server, BentoML, or equivalent
Solid AWS experience including SageMaker, EKS, EC2, S3, and Lambda
Experience containerizing and deploying ML workloads using Docker and Kubernetes
Working knowledge of data pipeline orchestration tools such as Airflow or Prefect
Strong observability instincts: you build ML systems you can see into clearly and act on quickly

PREFERRED QUALIFICATIONS

Experience with GPU infrastructure management including scheduling, utilization monitoring, and cost optimization
Familiarity with distributed training frameworks such as Horovod, Ray, or PyTorch Distributed
Experience with feature store technologies such as Feast or Tecton
Familiarity with data version control tools such as DVC
AWS certification such as Solutions Architect or Machine Learning Specialty
Experience in regulated industries such as healthcare or telecommunications where model auditability and traceability are requirements
Bachelor's or Master's degree in Computer Science, Engineering, or a related field

COMPENSATION & BENEFITS

Salary: Competitive and negotiable based on experience
Two annual festive bonuses, each equivalent to half a month's salary
Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
Health insurance
Direct collaboration with US clients and teams, working on real enterprise AI infrastructure from day one

Posted By

CloudlyIO, Inc.

Job Locations

Remote, Bangladesh

Job Category

Infrastructure & Platform
