ML Engineer, CloudlyMELT
Job Description
As ML Engineer for CloudlyMELT, you are building the intelligence that makes our observability platform genuinely predictive rather than merely reactive. You will develop the ML models that detect GPU failures before they happen, identify straggler GPUs in distributed training jobs, correlate anomalies across the network, GPU, and application layers, and power the root cause analysis capabilities that set CloudlyMELT apart from every legacy monitoring tool on the market.
Your models operate on high-volume, high-velocity telemetry data from Kubernetes-native AI infrastructure. Speed, precision, and the ability to explain findings in plain English to engineering teams are not nice-to-haves here. They are the product.
ABOUT CLOUDLYMELT
CloudlyMELT is an AI-native observability platform that correlates the network, GPU, and application layers in a single unified view, reducing mean time to resolution from hours to seconds. It addresses GPU underutilization averaging 15 to 25% in Kubernetes clusters, idle H100s burning over $500,000 per year, straggler bottlenecks slowing distributed training by up to 50%, and GPU-related incidents where the root cause is unknown in 60% of cases. CloudlyMELT delivers ML-powered predictive failure detection, LLM-driven root cause analysis in plain English, cross-layer cost attribution, multi-tenant fairness controls, and straggler detection, all built on OpenTelemetry, Prometheus, and DCGM.
JOB REQUIREMENTS
- Build and maintain ML models for GPU failure prediction, delivering 48-hour advance warning of hardware failures with sufficient precision to prevent SLA penalties
- Develop straggler detection models that identify underperforming GPUs in distributed training jobs and surface actionable remediation recommendations
- Build cross-layer anomaly correlation models that connect network issues, GPU behavior, and application-level symptoms into unified root cause hypotheses
- Design and maintain cost attribution models that accurately map GPU usage to teams, jobs, and workloads in multi-tenant Kubernetes environments
- Develop multi-tenant fairness scoring systems based on DRF and Jain fairness metrics for GPU resource allocation analysis
- Build the LLM-powered root cause analysis layer that translates raw model outputs and KPI deltas into plain-English explanations and recommended operator actions
- Work with the CloudlyMELT platform team to integrate ML outputs into the observability interface including dashboards, error budget tracking, and service dependency maps
- Build and maintain training data pipelines from OpenTelemetry, Prometheus, and DCGM telemetry streams
- Monitor model performance in production and maintain retraining pipelines appropriate for evolving GPU infrastructure and customer environments
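For candidates unfamiliar with the fairness metrics named above: Jain's fairness index has a standard closed form, (Σx)² / (n · Σx²), which scores a set of per-tenant allocations between 1/n (one tenant gets everything) and 1.0 (perfectly equal). A minimal sketch of the idea (the function name `jain_fairness` is illustrative, not part of the CloudlyMELT codebase):

```python
def jain_fairness(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all tenants receive equal GPU allocations,
    and approaches 1/n as a single tenant dominates the cluster.
    """
    n = len(allocations)
    if n == 0 or all(a == 0 for a in allocations):
        return 0.0  # no allocations to score
    total = sum(allocations)
    return total * total / (n * sum(a * a for a in allocations))
```

For example, four tenants with equal GPU shares score 1.0, while one tenant holding all four GPUs scores 0.25 (= 1/4). Dominant Resource Fairness (DRF) extends this style of analysis to multiple resource types by comparing each tenant's dominant resource share.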
YOU MAY BE A GOOD FIT IF YOU HAVE
- 2 to 4 years of hands-on experience building and deploying ML models in production environments
- Strong proficiency in Python and ML frameworks such as PyTorch, scikit-learn, or TensorFlow
- Experience with anomaly detection, time series forecasting, or predictive failure modeling on infrastructure telemetry
- Familiarity with Kubernetes, GPU infrastructure, or distributed computing environments
- Experience building LLM-integrated pipelines or working with language models for structured output generation
- Comfort working with high-volume streaming telemetry data and the tooling that produces it such as OpenTelemetry, Prometheus, or DCGM
- Strong instinct for model explainability: your outputs need to be actionable by engineers under pressure, not just statistically correct
PREFERRED QUALIFICATIONS
- Experience with GPU performance profiling, CUDA metrics, or NVIDIA DCGM
- Familiarity with distributed training frameworks such as Horovod, Ray, or PyTorch Distributed and the failure modes they exhibit
- Experience with Kubernetes-native monitoring and the observability data it produces
- Knowledge of FinOps or cloud cost optimization approaches applied to GPU compute
- Experience with root cause analysis systems or causal inference methods
- Bachelor's or Master's degree in Computer Science, Machine Learning, Data Science, or a related field
COMPENSATION & BENEFITS
- Salary: Competitive base, negotiable based on experience
- Performance-based commission structure: your earnings scale directly with your results
- Two annual festive bonuses, each equivalent to half a month's salary
- Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
- Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
- Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one