ML Engineer, TorchBridge

JOB DESCRIPTION

As an ML Engineer at TorchBridge, you will work at the intersection of ML systems engineering and hardware-level optimization. Your job is to ensure that PyTorch workloads run correctly, efficiently, and predictably across every backend TorchBridge supports, and to build the intelligence that makes cross-backend execution genuinely automatic rather than merely theoretically possible.

This is not a role for someone who works primarily at the model API level. You need to be comfortable thinking about what happens below model.train(): memory layout, kernel dispatch, precision management, attention mechanisms, distributed training communication, and the specific failure modes each hardware backend introduces. The ML work here is deeply systems-adjacent, and the quality bar is defined by 1,464 test functions that must pass across six hardware platforms.

ABOUT TORCHBRIDGE
TorchBridge is CloudlyIO's hardware abstraction layer for PyTorch. It solves one of the most expensive and underappreciated problems in enterprise AI: hardware vendor lock-in. Code written for NVIDIA CUDA does not run on AMD ROCm. Switching hardware targets requires months of rewriting. Each vendor ships its own SDK, compiler, memory manager, and optimization pipeline with no unified interface in sight.

TorchBridge changes that. Write your PyTorch training and inference code once. Run it unchanged on NVIDIA (B200, H100, A10G, T4), AMD (MI350X, MI325X, MI300X), AWS Trainium (Trn1, Trn2, Trn3), Google TPU (v4 through v7 Ironwood), and CPU fallback on x86, ARM, and Apple Silicon. TorchBridge auto-detects available accelerators, selects the optimal backend, memory manager, and execution strategy, and ships production-ready tooling for CLI, Docker, CI/CD, monitoring, and serving out of the box.

Currently at v0.5.22 with 73,521 lines of production code, 1,464 test functions, zero lint violations, zero type errors, and six-platform cloud validation with 9/9 use case pass results. It is the only solution on the market providing a complete HAL across NVIDIA, AMD, TPU, and Trainium with full training and inference support.
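
The auto-detection described above boils down to a priority-ordered probe-and-select pattern. The sketch below is purely illustrative, assuming a hypothetical registry and availability probes; it is not TorchBridge's actual API, and real probes would query torch.cuda, torch_xla, and the like:

```python
# Minimal sketch of priority-ordered backend auto-detection (illustrative
# only -- the Backend type and registry here are hypothetical, not
# TorchBridge's real interfaces).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Backend:
    name: str
    priority: int                      # higher wins when several are present
    is_available: Callable[[], bool]   # cheap runtime probe


def select_backend(registry: List[Backend]) -> Backend:
    """Pick the highest-priority backend whose availability probe succeeds."""
    candidates = [b for b in registry if b.is_available()]
    if not candidates:
        raise RuntimeError("no supported accelerator or CPU fallback found")
    return max(candidates, key=lambda b: b.priority)


# Hypothetical probes standing in for real hardware checks.
registry = [
    Backend("cuda", 100, lambda: False),
    Backend("rocm", 90, lambda: False),
    Backend("cpu", 0, lambda: True),   # CPU fallback is always available
]
print(select_backend(registry).name)   # falls back to "cpu" here
```

The useful property of this shape is that adding a new accelerator family is a registry entry plus a probe, with no changes to call sites.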

JOB REQUIREMENTS

  • Build and maintain backend-specific optimization pathways within TorchBridge's Hardware Abstraction Layer, covering NVIDIA CUDA with Flash Attention 3 and FP8/NVFP4, AMD ROCm/HIP with Composable Kernels, Trainium with NeuronX SDK and NKI kernels, TPU with XLA/PJRT and Pallas kernels, and CPU fallback
  • Develop and improve TorchBridge's memory optimization systems including gradient checkpointing (targeting 30 to 40% savings), activation offloading (20 to 50% GPU memory savings), optimizer state sharding via FSDP (50 to 80% savings), and KV-cache optimization for LLM inference
  • Implement and maintain cross-backend precision management: FP8 native training on H100+, mixed precision across FP32/FP16/BF16/FP8, automatic precision selection per backend, and gradient and loss scaling
  • Build and maintain support for pre-built model architectures including LLMs (Qwen3, DeepSeek R1/V3, Llama 4, Gemma 3), vision models (SAM 3), and distributed MoE architectures (Qwen3-30B-A3B, Llama 4 Scout, DeepSeek V3) with FSDP2 data parallelism and expert routing
  • Develop and improve the attention mechanism layer including Flash Attention 2 integration, sliding window and ring attention, dynamic sparse attention, and multi-head and grouped query attention
  • Execute and maintain real hardware benchmarks across the six validated cloud platforms: AWS A10G, GCP T4, AMD MI300X, H100 NVL, TPU v5e, and Apple Silicon MPS, ensuring cross-backend output consistency and performance regression detection
  • Build toward roadmap milestones including backend-aware quantization via torchao integration (Q1 2026), FlexAttention kernel routing and PagedAttention with KV-cache (Q2 2026), FSDP2 and DTensor integration (Q2 2026), and adapter training with LoRA/QLoRA
  • Maintain and extend TorchBridge's test infrastructure including GPU-aware pytest markers, the tb-validate cross-backend validation framework, and the tb-doctor system diagnostics tooling
  • Contribute to the hw-aware model optimization pipeline: five progressive optimization levels from basic baseline through JIT, torch.compile, custom Triton kernel compilation, and full production graph fusion
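
The cross-backend output consistency work in the list above comes down to comparing numerics under a tolerance that widens as precision drops. A minimal sketch, assuming illustrative (not validated) tolerance defaults and a hypothetical `outputs_match` helper:

```python
# Hedged sketch of cross-backend output verification: compare two flat
# output vectors under a relative tolerance chosen per precision format.
# The RTOL values are assumed illustrative defaults, not TorchBridge's
# actual thresholds.
import math

RTOL = {"fp32": 1e-5, "fp16": 1e-3, "bf16": 1e-2, "fp8": 5e-2}


def outputs_match(ref, other, precision="fp32"):
    """True if every element of `other` matches `ref` within the
    tolerance associated with `precision`; lengths must agree."""
    rtol = RTOL[precision]
    return len(ref) == len(other) and all(
        math.isclose(a, b, rel_tol=rtol, abs_tol=rtol)
        for a, b in zip(ref, other)
    )


cuda_out = [0.1234, -2.5, 3.14159]
rocm_out = [0.12341, -2.50002, 3.14157]
print(outputs_match(cuda_out, rocm_out, "fp16"))  # small drift is tolerated
```

In practice the same comparison would run over full model outputs per backend pair, with the tolerance derived from the active precision configuration rather than hard-coded.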

YOU MAY BE A GOOD FIT IF YOU HAVE

  • 2 to 4 years of hands-on experience building production ML systems, with meaningful work below the model API level
  • Strong proficiency in Python and PyTorch at a depth that includes custom training loops, mixed precision, gradient management, and model serialization
  • Genuine understanding of GPU and accelerator memory management: you know what gradient checkpointing, activation offloading, and optimizer state sharding actually do and why each tradeoff exists
  • Experience with at least one non-NVIDIA hardware backend: AMD ROCm, AWS Trainium/NeuronX, or Google TPU/XLA
  • Familiarity with distributed training frameworks including FSDP, FSDP2, and PyTorch Distributed
  • Comfort writing and debugging Triton kernels, custom CUDA extensions, or equivalent backend-specific optimization code
  • Strong instinct for correctness validation: you know how to verify that a model produces equivalent outputs across different hardware backends and precision configurations
  • Fluency with production engineering practices: type annotations, linting standards, test coverage, CI/CD integration, and documentation

PREFERRED QUALIFICATIONS
  • Experience with Flash Attention implementations, KV-cache optimization, or LLM inference systems
  • Familiarity with MoE model architectures and their specific hardware challenges including expert routing and load balancing
  • Experience with torch.compile, torch.export, TorchScript, ONNX, or SafeTensors model export pipelines
  • Knowledge of Prometheus metrics, Grafana dashboards, or structured production observability tooling
  • Familiarity with the competitive HAL and inference landscape including Modular, vLLM, Lightning AI, or DeepSpeed
  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, Machine Learning, or a related field

COMPENSATION & BENEFITS
  • Salary: Competitive base, negotiable based on experience
  • Performance-based commission structure: your earnings scale directly with your results
  • Two annual festive bonuses, each equivalent to half a month's salary
  • Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
  • Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
  • Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one