Solution Engineer, CloudlyMELT

JOB DESCRIPTION

As Solution Engineer for CloudlyMELT, you will work with AI engineering teams, MLOps leads, and infrastructure architects at organizations running GPU-intensive workloads to deploy and configure CloudlyMELT across their Kubernetes environments. Your customers are deeply technical: they have strong opinions about observability tooling, they know what Prometheus exporters look like, and they will immediately respect or dismiss you based on whether you actually understand their infrastructure. Your job is to get CloudlyMELT integrated, configured, and delivering meaningful insight fast. Every day a customer spends without proper GPU observability wastes money at $30 to $50 per H100 hour.
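
To make that figure concrete, here is a back-of-the-envelope sketch. The hourly rate range comes from this posting; the cluster size and utilization number are purely illustrative, not CloudlyMELT data:

```python
def idle_gpu_cost_per_day(num_gpus: int, utilization: float, rate_per_hour: float) -> float:
    """Dollars wasted per day by the idle fraction of a GPU fleet.

    `utilization` is the average busy fraction (0.0 to 1.0);
    `rate_per_hour` is the per-GPU hourly cost in dollars.
    """
    idle_fraction = 1.0 - utilization
    return num_gpus * idle_fraction * rate_per_hour * 24

# A hypothetical 64-GPU H100 cluster at 60% utilization and $40/hour
# leaks roughly $24,576 per day:
cost = idle_gpu_cost_per_day(64, 0.60, 40.0)
```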

ABOUT CLOUDLYMELT
CloudlyMELT is an AI-native observability platform that correlates network, GPU, and application layers in a single unified view, reducing mean time to resolution (MTTR) from hours to seconds. It solves GPU underutilization, straggler bottlenecks in distributed training, silent thermal throttling, and the blind-spot problem where infrastructure teams cannot correlate network issues with GPU failures. Built on OpenTelemetry, Prometheus, and DCGM with no vendor lock-in, CloudlyMELT delivers ML-powered predictive failure detection, LLM-driven root cause analysis, cost attribution, and multi-tenant fairness controls for organizations running serious GPU workloads on Kubernetes.

RESPONSIBILITIES

  • Lead technical discovery with AI engineering, MLOps, and infrastructure teams to understand their Kubernetes environment, GPU cluster configuration, existing observability stack, and operational pain points
  • Design CloudlyMELT integration architectures that connect to customer telemetry infrastructure via OpenTelemetry, Prometheus, and DCGM without requiring lock-in to proprietary data formats or collectors
  • Own deployment engagements end to end: configure cross-layer correlation between network, GPU, and application telemetry, set up cost attribution models, define multi-tenant fairness parameters, and validate that the system is surfacing meaningful, actionable insight
  • Work alongside the sales team in pre-sales technical conversations and product demonstrations, showing CloudlyMELT in the context of the customer's actual infrastructure challenges
  • Configure straggler detection thresholds, failure prediction sensitivity, and LLM root cause analysis parameters appropriate for each customer's workload types
  • Help customers build custom dashboards, error budget tracking configurations, and service dependency maps for their specific GPU infrastructure
  • Build and maintain integration guides, deployment documentation, and configuration playbooks for CloudlyMELT customer engagements
  • Relay customer integration feedback, observability requirements, and feature requests to the CloudlyMELT product and engineering teams
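
As a flavor of the threshold-tuning work above: CloudlyMELT's straggler detection is described as ML-powered, so the following is only a minimal threshold heuristic, with hypothetical names, illustrating the kind of sensitivity knob this role would configure per customer workload:

```python
from statistics import median

def find_stragglers(step_times: dict[str, float], threshold: float = 1.25) -> list[str]:
    """Flag ranks whose per-step time exceeds `threshold` x the median.

    `step_times` maps a rank/pod identifier to its last training-step
    duration in seconds; `threshold` is the configurable sensitivity
    knob (1.25 means "25% slower than the median is a straggler").
    """
    if not step_times:
        return []
    baseline = median(step_times.values())
    return sorted(rank for rank, t in step_times.items() if t > threshold * baseline)

# Rank "gpu-7" is ~40% slower than its peers, so it is flagged:
find_stragglers({"gpu-0": 1.02, "gpu-1": 0.98, "gpu-7": 1.40})
# → ["gpu-7"]
```

Lowering `threshold` makes detection more aggressive for latency-sensitive workloads; raising it suppresses noise on clusters with naturally uneven step times.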

YOU MAY BE A GOOD FIT IF YOU HAVE

  • 2 to 4 years of experience in solution engineering, DevOps, MLOps, platform engineering, or observability systems integration
  • Strong working knowledge of Kubernetes including cluster architecture, pod scheduling, resource management, and common failure patterns
  • Hands-on familiarity with observability tooling including Prometheus, Grafana, OpenTelemetry, and DCGM
  • Understanding of GPU compute infrastructure including NVIDIA H100/A100 hardware, utilization metrics, thermal behavior, and distributed training workloads
  • Experience integrating monitoring or observability platforms into complex enterprise infrastructure environments
  • Proficiency in Python, YAML, and REST APIs for configuration, integration scripting, and validation
  • Strong technical communication skills: you can explain cross-layer correlation findings to an ML engineer and GPU cost attribution to a FinOps lead with equal clarity
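
To illustrate where the Prometheus and DCGM familiarity above overlaps, here is a minimal sketch of composing a PromQL query over the DCGM exporter's GPU-utilization gauge. The namespace filter and grouping label are illustrative assumptions, not CloudlyMELT's actual queries:

```python
def gpu_util_query(namespace: str, group_by: str = "pod") -> str:
    """PromQL for average GPU utilization, aggregated per pod.

    DCGM_FI_DEV_GPU_UTIL is the NVIDIA DCGM exporter's utilization
    metric; the namespace filter and `group_by` label are examples
    of per-customer configuration.
    """
    return f'avg by ({group_by}) (DCGM_FI_DEV_GPU_UTIL{{namespace="{namespace}"}})'

gpu_util_query("ml-training")
# → 'avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace="ml-training"})'
```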

PREFERRED QUALIFICATIONS
  • Experience with distributed training frameworks such as PyTorch Distributed, Horovod, or Ray
  • Familiarity with FinOps practices and GPU cost optimization strategies
  • Knowledge of NVIDIA DCGM metrics and GPU profiling tools
  • Experience with Helm chart deployment and Kubernetes operator patterns
  • Familiarity with competing observability platforms such as Datadog, New Relic, or Dynatrace and their limitations for GPU-specific use cases
  • Bachelor's degree in Computer Science, Engineering, or a related field

COMPENSATION & BENEFITS
  • Salary: Competitive base, negotiable based on experience
  • Performance-based commission structure: your earnings scale directly with your results
  • Two annual festive bonuses, each equivalent to half a month's salary
  • Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
  • Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
  • Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one