Solution Engineer, CloudlyMELT

JOB DESCRIPTION

As Solution Engineer for CloudlyMELT, you will work with AI engineering teams, MLOps leads, and infrastructure architects at organizations running GPU-intensive workloads to deploy and configure CloudlyMELT across their Kubernetes environments. Your customers are deeply technical: they have strong opinions about observability tooling, they know what Prometheus exporters look like, and they will immediately respect or dismiss you based on whether you actually understand their infrastructure. Your job is to get CloudlyMELT integrated, configured, and delivering meaningful insight fast. Every day a customer spends without proper GPU observability wastes money at $30 to $50 per H100 hour.
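
To make that figure concrete, here is a back-of-the-envelope sketch. The hourly rate range comes from this posting; the cluster size and utilization number are purely illustrative, not CloudlyMELT data:

```python
def idle_gpu_cost_per_day(num_gpus: int, utilization: float, rate_per_hour: float) -> float:
    """Dollars wasted per day by the idle fraction of a GPU fleet.

    `utilization` is the average busy fraction (0.0 to 1.0);
    `rate_per_hour` is the per-GPU hourly cost in dollars.
    """
    idle_fraction = 1.0 - utilization
    return num_gpus * idle_fraction * rate_per_hour * 24

# A hypothetical 64-GPU H100 cluster at 60% utilization and $40/hour
# leaks roughly $24,576 per day:
cost = idle_gpu_cost_per_day(64, 0.60, 40.0)
```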

ABOUT CLOUDLYMELT
CloudlyMELT is an AI-native observability platform that correlates network, GPU, and application layers in a single unified view, reducing mean time to resolution (MTTR) from hours to seconds. It solves GPU underutilization, straggler bottlenecks in distributed training, silent thermal throttling, and the blind-spot problem where infrastructure teams cannot correlate network issues with GPU failures. Built on OpenTelemetry, Prometheus, and DCGM with no vendor lock-in, CloudlyMELT delivers ML-powered predictive failure detection, LLM-driven root cause analysis, cost attribution, and multi-tenant fairness controls for organizations running serious GPU workloads on Kubernetes.

RESPONSIBILITIES

  • Lead technical discovery with AI engineering, MLOps, and infrastructure teams to understand their Kubernetes environment, GPU cluster configuration, existing observability stack, and operational pain points
  • Design CloudlyMELT integration architectures that connect to customer telemetry infrastructure via OpenTelemetry, Prometheus, and DCGM without requiring lock-in to proprietary data formats or collectors
  • Own deployment engagements end to end: configure cross-layer correlation between network, GPU, and application telemetry, set up cost attribution models, define multi-tenant fairness parameters, and validate that the system is surfacing meaningful, actionable insight
  • Work alongside the sales team in pre-sales technical conversations and product demonstrations, showing CloudlyMELT in the context of the customer's actual infrastructure challenges
  • Configure straggler detection thresholds, failure prediction sensitivity, and LLM root cause analysis parameters appropriate for each customer's workload types
  • Help customers build custom dashboards, error budget tracking configurations, and service dependency maps for their specific GPU infrastructure
  • Build and maintain integration guides, deployment documentation, and configuration playbooks for CloudlyMELT customer engagements
  • Relay customer integration feedback, observability requirements, and feature requests to the CloudlyMELT product and engineering teams
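
As a flavor of the threshold-tuning work above: CloudlyMELT's straggler detection is described as ML-powered, so the following is only a minimal threshold heuristic, with hypothetical names, illustrating the kind of sensitivity knob this role would configure per customer workload:

```python
from statistics import median

def find_stragglers(step_times: dict[str, float], threshold: float = 1.25) -> list[str]:
    """Flag ranks whose per-step time exceeds `threshold` x the median.

    `step_times` maps a rank/pod identifier to its last training-step
    duration in seconds; `threshold` is the configurable sensitivity
    knob (1.25 means "25% slower than the median is a straggler").
    """
    if not step_times:
        return []
    baseline = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > threshold * baseline)

# Rank "gpu-7" is ~40% slower than its peers, so it is flagged:
find_stragglers({"gpu-0": 1.02, "gpu-1": 0.98, "gpu-7": 1.40})
# → ["gpu-7"]
```

Lowering `threshold` makes detection more aggressive for latency-sensitive workloads; raising it suppresses noise on clusters with naturally uneven step times.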

YOU MAY BE A GOOD FIT IF YOU HAVE

  • 2 to 4 years of experience in solution engineering, DevOps, MLOps, platform engineering, or observability systems integration
  • Strong working knowledge of Kubernetes including cluster architecture, pod scheduling, resource management, and common failure patterns
  • Hands-on familiarity with observability tooling including Prometheus, Grafana, OpenTelemetry, and DCGM
  • Understanding of GPU compute infrastructure including NVIDIA H100/A100 hardware, utilization metrics, thermal behavior, and distributed training workloads
  • Experience integrating monitoring or observability platforms into complex enterprise infrastructure environments
  • Proficiency in Python, YAML, and REST APIs for configuration, integration scripting, and validation
  • Strong technical communication skills: you can explain cross-layer correlation findings to an ML engineer and GPU cost attribution to a FinOps lead with equal clarity
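
To illustrate where the Prometheus and DCGM familiarity above overlaps, here is a minimal sketch of composing a PromQL query over the DCGM exporter's GPU-utilization gauge. The namespace filter and grouping label are illustrative assumptions, not CloudlyMELT's actual queries:

```python
def gpu_util_query(namespace: str, group_by: str = "pod") -> str:
    """PromQL for average GPU utilization, aggregated per pod.

    DCGM_FI_DEV_GPU_UTIL is the NVIDIA DCGM exporter's utilization
    metric; the namespace filter and `group_by` label are examples
    of per-customer configuration.
    """
    return f'avg by ({group_by}) (DCGM_FI_DEV_GPU_UTIL{{namespace="{namespace}"}})'

gpu_util_query("ml-training")
# → 'avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace="ml-training"})'
```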

PREFERRED QUALIFICATIONS
  • Experience with distributed training frameworks such as PyTorch Distributed, Horovod, or Ray
  • Familiarity with FinOps practices and GPU cost optimization strategies
  • Knowledge of NVIDIA DCGM metrics and GPU profiling tools
  • Experience with Helm chart deployment and Kubernetes operator patterns
  • Familiarity with competing observability platforms such as Datadog, New Relic, or Dynatrace and their limitations for GPU-specific use cases
  • Bachelor's degree in Computer Science, Engineering, or a related field

COMPENSATION & BENEFITS
  • Salary: Competitive base, negotiable based on experience
  • Performance-based commission structure: your earnings scale directly with your results
  • Two annual festive bonuses, each equivalent to half a month's salary
  • Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
  • Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
  • Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one