Solution Engineer, TorchBridge
JOB DESCRIPTION
As Solution Engineer for TorchBridge, you are the person who gets AI engineering teams from evaluation to production. Your customers are ML engineers, MLOps leads, and infrastructure architects at organizations running serious PyTorch workloads on Kubernetes and multi-cloud infrastructure. They have strong opinions about their tooling, high standards for what counts as a credible technical evaluation, and zero patience for vague capability claims.
You need to be able to run pip install torchbridge-ml, scaffold a project with tb-init, execute tb-doctor and tb-validate against a customer's hardware environment, benchmark across backends with real latency and memory figures, and then explain every result clearly to both the engineer who ran the training job and the infrastructure lead who is trying to justify the hardware procurement decision. That combination of hands-on technical depth and customer-facing clarity is what this role requires.
ABOUT TORCHBRIDGE
TorchBridge is a hardware abstraction layer (HAL) for PyTorch that lets AI teams write training and inference code once and run it unchanged on NVIDIA, AMD, AWS Trainium, Google TPU, and CPU. It auto-detects available accelerators, routes to the optimal backend and memory manager, and ships production-ready tooling out of the box, including CLI commands, Docker containers, CI/CD workflows, Prometheus monitoring, and model serving infrastructure.
The market problem TorchBridge solves is real and measurable: GPU vendor lock-in forces engineering teams to maintain separate Dockerfiles, CI pipelines, deployment configurations, and monitoring stacks for each hardware target. Hardware changes require months of rewriting. NVIDIA scarcity and pricing pressure drive enterprises toward alternatives, but the switching costs are enormous. TorchBridge eliminates those costs.
TorchBridge is currently at v0.5.22, with validation across six cloud platforms, a suite of 1,464 tests, and a 9/9 use case pass record. It is the only solution on the market covering NVIDIA, AMD, TPU, and Trainium for both training and inference through a single unified PyTorch-native interface.
JOB REQUIREMENTS
- Lead technical discovery with AI engineering teams, MLOps leads, and infrastructure architects to understand their current hardware environment, existing PyTorch codebase, vendor lock-in pain points, and multi-cloud or multi-accelerator ambitions
- Design and execute TorchBridge proof-of-concept engagements: configure the customer's backend environment, run tb-doctor system diagnostics, execute tb-validate at standard or full validation levels, and produce benchmark results using tb-benchmark with CSV or JSON output that quantifies latency, throughput, and memory across backends
- Demonstrate TorchBridge's full capability stack during pre-sales technical evaluations: backend auto-detection and priority routing, hardware-agnostic model training with AMP, cross-backend output consistency validation, CLI tooling, Docker container deployment, and Prometheus/Grafana monitoring
- Work alongside the sales team in technical discovery conversations, handling detailed questions about architecture, backend support matrix, precision management, memory optimization, distributed training, and competitive differentiation against Modular, vLLM, Lightning, DeepSpeed, and others
- Guide customers through integration of TorchBridge into existing PyTorch codebases using tb-migrate for automated CUDA-to-HAL migration where relevant, and through CI/CD integration using tb-validate with GitHub Actions
- Configure model export pipelines to TorchScript, ONNX (opset 17+), and SafeTensors, and support customers through deployment to FastAPI, TorchServe, and Triton Inference Server using TorchBridge's Docker serving infrastructure
- Build and maintain integration guides, deployment documentation, and POC playbooks for TorchBridge customer engagements covering each supported backend and hardware configuration
- Relay customer integration feedback, hardware compatibility observations, and feature requests to the TorchBridge engineering and product teams
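The backend auto-detection and priority routing you would demonstrate in evaluations can be pictured with a small sketch. This is a hypothetical illustration only: the priority order, function name, and backend labels below are assumptions for clarity, not TorchBridge's actual API.

```python
# Hypothetical sketch of priority-based backend routing of the kind described
# above. All names here are invented for illustration; TorchBridge's real
# detection logic and API are not reproduced.

BACKEND_PRIORITY = ["cuda", "rocm", "neuronx", "xla", "cpu"]  # assumed order

def select_backend(available, priority=BACKEND_PRIORITY):
    """Return the highest-priority backend present on the host.

    `available` is the set of accelerators a probe step detected;
    "cpu" is always a valid fallback.
    """
    for backend in priority:
        if backend in available or backend == "cpu":
            return backend
    raise RuntimeError("no usable backend found")

# A host with an AMD GPU and no NVIDIA GPU routes to ROCm; a host with
# no accelerator at all falls back to CPU.
print(select_backend({"rocm"}))  # rocm
print(select_backend(set()))     # cpu
```

In an evaluation, being able to walk a customer through this kind of routing decision, and why their hardware landed on the backend it did, is exactly the explanatory work the role calls for.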
YOU MAY BE A GOOD FIT IF YOU HAVE
- 2 to 4 years of experience in solution engineering, MLOps, AI platform engineering, or technical consulting for AI infrastructure teams
- Strong hands-on proficiency with PyTorch including training pipelines, model evaluation, and deployment workflows
- Working knowledge of at least two hardware backends among NVIDIA CUDA, AMD ROCm, AWS Trainium/NeuronX, and Google TPU/XLA
- Experience with Kubernetes, containerized ML workloads, and CI/CD pipeline configuration using GitHub Actions or equivalent
- Familiarity with observability tooling including Prometheus, Grafana, and structured logging in production ML systems
- Comfort running technical benchmarks and presenting results with the context and caveats required to make them meaningful for both engineering and infrastructure decision-makers
- Strong written and verbal communication skills with the ability to explain hardware-level behavior to ML engineers and infrastructure cost implications to FinOps or platform leadership
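Presenting cross-backend benchmark results with the right context often reduces to careful aggregation. A minimal sketch, assuming a benchmark CSV in roughly the shape described above; the column names and values here are invented for illustration and may not match tb-benchmark's actual output schema:

```python
import csv
import io
from statistics import mean

# Hypothetical CSV in the shape a cross-backend benchmark run might emit.
# Column names and figures are assumptions, not a documented schema.
RAW = """backend,latency_ms,peak_mem_mb
cuda,12.1,2048
cuda,11.9,2050
rocm,14.3,2210
rocm,14.1,2205
"""

def summarize(csv_text):
    """Aggregate per-backend mean latency and worst-case peak memory."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for backend in sorted({r["backend"] for r in rows}):
        sample = [r for r in rows if r["backend"] == backend]
        summary[backend] = {
            "mean_latency_ms": round(mean(float(r["latency_ms"]) for r in sample), 2),
            "max_peak_mem_mb": max(float(r["peak_mem_mb"]) for r in sample),
        }
    return summary

summary = summarize(RAW)
print(summary["cuda"]["mean_latency_ms"])  # 12.0
```

The point of a summary like this is the caveats that accompany it: sample size, warm-up behavior, and memory headroom all change what the numbers mean to an engineering team versus a procurement decision-maker.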
PREFERRED QUALIFICATIONS
- Experience with model export formats including ONNX, TorchScript, and SafeTensors and cross-platform deployment validation
- Familiarity with distributed training frameworks including FSDP, PyTorch Distributed, or Horovod
- Experience with LLM inference serving using FastAPI, TorchServe, or Triton Inference Server
- Knowledge of the competitive landscape for PyTorch HAL and inference tooling including Modular/MAX, vLLM, Lightning AI, DeepSpeed, and HuggingFace Optimum
- Experience with FinOps practices and GPU cost optimization for organizations evaluating hardware diversification from NVIDIA
- Bachelor's degree in Computer Science, Engineering, or a related field
COMPENSATION & BENEFITS
- Salary: Competitive base, negotiable based on experience
- Performance-based commission structure: your earnings scale directly with your results
- Two annual festive bonuses, each equivalent to half a month's salary
- Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
- Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
- Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one