Solution Engineer, TorchBridge
JOB DESCRIPTION
As Solution Engineer for TorchBridge, you are the person who gets AI engineering teams from evaluation to production. Your customers are ML engineers, MLOps leads, and infrastructure architects at organizations running serious PyTorch workloads on Kubernetes and multi-cloud infrastructure. They have strong opinions about their tooling, high standards for what counts as a credible technical evaluation, and zero patience for vague capability claims.
You need to be able to run pip install torchbridge-ml, scaffold a project with tb-init, execute tb-doctor and tb-validate against a customer's hardware environment, benchmark across backends with real latency and memory figures, and then explain every result clearly to both the engineer who ran the training job and the infrastructure lead who is trying to justify the hardware procurement decision. That combination of hands-on technical depth and customer-facing clarity is what this role requires.
ABOUT TORCHBRIDGE
TorchBridge is a hardware abstraction layer (HAL) for PyTorch that lets AI teams write training and inference code once and run it unchanged on NVIDIA, AMD, AWS Trainium, Google TPU, and CPU. It auto-detects available accelerators, routes to the optimal backend and memory manager, and ships production-ready tooling out of the box, including CLI commands, Docker containers, CI/CD workflows, Prometheus monitoring, and model serving infrastructure.
The market problem TorchBridge solves is real and measurable: GPU vendor lock-in forces engineering teams to maintain separate Dockerfiles, CI pipelines, deployment configurations, and monitoring stacks for each hardware target. Hardware changes require months of rewriting. NVIDIA scarcity and pricing pressure drive enterprises toward alternatives, but the switching costs are enormous. TorchBridge eliminates those costs.
TorchBridge is currently at v0.5.22, with validation across six cloud platforms, a suite of 1,464 tests, and a 9/9 use case pass record. It is the only solution on the market covering NVIDIA, AMD, TPU, and Trainium for both training and inference through a single unified PyTorch-native interface.
JOB REQUIREMENTS
- Lead technical discovery with AI engineering teams, MLOps leads, and infrastructure architects to understand their current hardware environment, existing PyTorch codebase, vendor lock-in pain points, and multi-cloud or multi-accelerator ambitions
- Design and execute TorchBridge proof-of-concept engagements: configure the customer's backend environment, run tb-doctor system diagnostics, execute tb-validate at standard or full validation levels, and produce benchmark results using tb-benchmark with CSV or JSON output that quantifies latency, throughput, and memory across backends
- Demonstrate TorchBridge's full capability stack during pre-sales technical evaluations: backend auto-detection and priority routing, hardware-agnostic model training with AMP, cross-backend output consistency validation, CLI tooling, Docker container deployment, and Prometheus/Grafana monitoring
- Work alongside the sales team in technical discovery conversations, handling detailed questions about architecture, backend support matrix, precision management, memory optimization, distributed training, and competitive differentiation against Modular, vLLM, Lightning, DeepSpeed, and others
- Guide customers through integration of TorchBridge into existing PyTorch codebases using tb-migrate for automated CUDA-to-HAL migration where relevant, and through CI/CD integration using tb-validate with GitHub Actions
- Configure model export pipelines to TorchScript, ONNX (opset 17+), and SafeTensors, and support customers through deployment to FastAPI, TorchServe, and Triton Inference Server using TorchBridge's Docker serving infrastructure
- Build and maintain integration guides, deployment documentation, and POC playbooks for TorchBridge customer engagements covering each supported backend and hardware configuration
- Relay customer integration feedback, hardware compatibility observations, and feature requests to the TorchBridge engineering and product teams
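The backend auto-detection and priority routing you would demonstrate in evaluations can be pictured with a small sketch. This is a hypothetical illustration only: the priority order, function name, and backend labels below are assumptions for clarity, not TorchBridge's actual API.

```python
# Hypothetical sketch of priority-based backend routing of the kind described
# above. All names here are invented for illustration; TorchBridge's real
# detection logic and API are not reproduced.

BACKEND_PRIORITY = ["cuda", "rocm", "neuronx", "xla", "cpu"]  # assumed order

def select_backend(available, priority=BACKEND_PRIORITY):
    """Return the highest-priority backend present on the host.

    `available` is the set of accelerators a probe step detected;
    "cpu" is always a valid fallback.
    """
    for backend in priority:
        if backend in available or backend == "cpu":
            return backend
    raise RuntimeError("no usable backend found")

# A host with an AMD GPU and no NVIDIA GPU routes to ROCm; a host with
# no accelerator at all falls back to CPU.
print(select_backend({"rocm"}))  # rocm
print(select_backend(set()))     # cpu
```

In an evaluation, being able to walk a customer through this kind of routing decision, and why their hardware landed on the backend it did, is exactly the explanatory work the role calls for.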
YOU MAY BE A GOOD FIT IF YOU HAVE
- 2 to 4 years of experience in solution engineering, MLOps, AI platform engineering, or technical consulting for AI infrastructure teams
- Strong hands-on proficiency with PyTorch including training pipelines, model evaluation, and deployment workflows
- Working knowledge of at least two hardware backends among NVIDIA CUDA, AMD ROCm, AWS Trainium/NeuronX, and Google TPU/XLA
- Experience with Kubernetes, containerized ML workloads, and CI/CD pipeline configuration using GitHub Actions or equivalent
- Familiarity with observability tooling including Prometheus, Grafana, and structured logging in production ML systems
- Comfort running technical benchmarks and presenting results with the context and caveats required to make them meaningful for both engineering and infrastructure decision-makers
- Strong written and verbal communication skills with the ability to explain hardware-level behavior to ML engineers and infrastructure cost implications to FinOps or platform leadership
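Presenting cross-backend benchmark results with the right context often reduces to careful aggregation. A minimal sketch, assuming a benchmark CSV in roughly the shape described above; the column names and values here are invented for illustration and may not match tb-benchmark's actual output schema:

```python
import csv
import io
from statistics import mean

# Hypothetical CSV in the shape a cross-backend benchmark run might emit.
# Column names and figures are assumptions, not a documented schema.
RAW = """backend,latency_ms,peak_mem_mb
cuda,12.1,2048
cuda,11.9,2050
rocm,14.3,2210
rocm,14.1,2205
"""

def summarize(csv_text):
    """Aggregate per-backend mean latency and worst-case peak memory."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for backend in sorted({r["backend"] for r in rows}):
        sample = [r for r in rows if r["backend"] == backend]
        summary[backend] = {
            "mean_latency_ms": round(mean(float(r["latency_ms"]) for r in sample), 2),
            "max_peak_mem_mb": max(float(r["peak_mem_mb"]) for r in sample),
        }
    return summary

summary = summarize(RAW)
print(summary["cuda"]["mean_latency_ms"])  # 12.0
```

The point of a summary like this is the caveats that accompany it: sample size, warm-up behavior, and memory headroom all change what the numbers mean to an engineering team versus a procurement decision-maker.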
PREFERRED QUALIFICATIONS
- Experience with model export formats including ONNX, TorchScript, and SafeTensors and cross-platform deployment validation
- Familiarity with distributed training frameworks including FSDP, PyTorch Distributed, or Horovod
- Experience with LLM inference serving using FastAPI, TorchServe, or Triton Inference Server
- Knowledge of the competitive landscape for PyTorch HAL and inference tooling including Modular/MAX, vLLM, Lightning AI, DeepSpeed, and HuggingFace Optimum
- Experience with FinOps practices and GPU cost optimization for organizations evaluating hardware diversification from NVIDIA
- Bachelor's degree in Computer Science, Engineering, or a related field
COMPENSATION & BENEFITS
- Salary: Competitive base, negotiable based on experience
- Performance-based commission structure: your earnings scale directly with your results
- Two annual festive bonuses, each equivalent to half a month's salary
- Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
- Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
- Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one