ML Engineer, CloudlyMELT
Job Description
As ML Engineer for CloudlyMELT, you are building the intelligence that makes our observability platform genuinely predictive rather than merely reactive. You will develop the ML models that detect GPU failures before they happen, identify straggler GPUs in distributed training jobs, correlate anomalies across the network, GPU, and application layers, and power the root cause analysis capabilities that set CloudlyMELT apart from every legacy monitoring tool on the market.
Your models operate on high-volume, high-velocity telemetry data from Kubernetes-native AI infrastructure. Speed, precision, and the ability to explain findings in plain English to engineering teams are not nice-to-haves here. They are the product.
ABOUT CLOUDLYMELT
CloudlyMELT is an AI-native observability platform that correlates the network, GPU, and application layers in a single unified view, reducing mean time to resolution from hours to seconds. It addresses GPU underutilization averaging 15 to 25% in Kubernetes clusters, idle H100s burning over $500,000 per year, straggler bottlenecks slowing distributed training by up to 50%, and GPU-related incidents where the root cause is unknown in 60% of cases. CloudlyMELT delivers ML-powered predictive failure detection, LLM-driven root cause analysis in plain English, cross-layer cost attribution, multi-tenant fairness controls, and straggler detection, all built on OpenTelemetry, Prometheus, and DCGM.
JOB REQUIREMENTS
- Build and maintain ML models for GPU failure prediction, delivering 48-hour advance warning of hardware failures with sufficient precision to prevent SLA penalties
- Develop straggler detection models that identify underperforming GPUs in distributed training jobs and surface actionable remediation recommendations
- Build cross-layer anomaly correlation models that connect network issues, GPU behavior, and application-level symptoms into unified root cause hypotheses
- Design and maintain cost attribution models that accurately map GPU usage to teams, jobs, and workloads in multi-tenant Kubernetes environments
- Develop multi-tenant fairness scoring systems based on DRF and Jain fairness metrics for GPU resource allocation analysis
- Build the LLM-powered root cause analysis layer that translates raw model outputs and KPI deltas into plain-English explanations and recommended operator actions
- Work with the CloudlyMELT platform team to integrate ML outputs into the observability interface including dashboards, error budget tracking, and service dependency maps
- Build and maintain training data pipelines from OpenTelemetry, Prometheus, and DCGM telemetry streams
- Monitor model performance in production and maintain retraining pipelines appropriate for evolving GPU infrastructure and customer environments
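For candidates unfamiliar with the fairness metrics named above: Jain's fairness index has a standard closed form, (Σx)² / (n · Σx²), which scores a set of per-tenant allocations between 1/n (one tenant gets everything) and 1.0 (perfectly equal). A minimal sketch of the idea (the function name `jain_fairness` is illustrative, not part of the CloudlyMELT codebase):

```python
def jain_fairness(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when all tenants receive equal GPU allocations,
    and approaches 1/n as a single tenant dominates the cluster.
    """
    n = len(allocations)
    if n == 0 or all(a == 0 for a in allocations):
        return 0.0  # no allocations to score
    total = sum(allocations)
    return total * total / (n * sum(a * a for a in allocations))
```

For example, four tenants with equal GPU shares score 1.0, while one tenant holding all four GPUs scores 0.25 (= 1/4). Dominant Resource Fairness (DRF) extends this style of analysis to multiple resource types by comparing each tenant's dominant resource share.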
YOU MAY BE A GOOD FIT IF YOU HAVE
- 2 to 4 years of hands-on experience building and deploying ML models in production environments
- Strong proficiency in Python and ML frameworks such as PyTorch, scikit-learn, or TensorFlow
- Experience with anomaly detection, time series forecasting, or predictive failure modeling on infrastructure telemetry
- Familiarity with Kubernetes, GPU infrastructure, or distributed computing environments
- Experience building LLM-integrated pipelines or working with language models for structured output generation
- Comfort working with high-volume streaming telemetry data and the tooling that produces it such as OpenTelemetry, Prometheus, or DCGM
- Strong instinct for model explainability: your outputs need to be actionable by engineers under pressure, not just statistically correct
PREFERRED QUALIFICATIONS
- Experience with GPU performance profiling, CUDA metrics, or NVIDIA DCGM
- Familiarity with distributed training frameworks such as Horovod, Ray, or PyTorch Distributed and the failure modes they exhibit
- Experience with Kubernetes-native monitoring and the observability data it produces
- Knowledge of FinOps or cloud cost optimization approaches applied to GPU compute
- Experience with root cause analysis systems or causal inference methods
- Bachelor's or Master's degree in Computer Science, Machine Learning, Data Science, or a related field
COMPENSATION & BENEFITS
- Salary: Competitive base, negotiable based on experience
- Performance-based commission structure: your earnings scale directly with your results
- Two annual festive bonuses, each equivalent to half a month's salary
- Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
- Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
- Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one