Product Manager, TorchBridge
Job Description
As Product Manager for TorchBridge, you will own the roadmap for the most technically complex and strategically significant product in CloudlyIO's portfolio. TorchBridge sits at the center of one of the fastest-moving markets in enterprise technology, where hardware vendor dynamics shift quarterly, new accelerator architectures ship constantly, and the competitive landscape attracts billions in venture funding.

Your buyers are ML engineers, MLOps leads, and AI infrastructure architects who will evaluate TorchBridge against tools with tens of thousands of GitHub stars and hundreds of millions in funding. They will test every claim, benchmark every capability, and dismiss any product story that does not match their operational reality. You need to understand their world well enough to define a roadmap that outpaces the competition on the dimensions that actually matter to the people running production AI infrastructure.

This role requires genuine technical depth in AI infrastructure, clear thinking about competitive strategy, and the ability to run a product development process that ships meaningful value every two to three weeks without losing sight of the longer arc.
ABOUT TORCHBRIDGE
TorchBridge is a hardware abstraction layer for PyTorch that lets engineering teams write once and run on any accelerator (NVIDIA, AMD, AWS Trainium, Google TPU, or CPU) with zero code modifications. It auto-detects available hardware, routes to the optimal backend and memory manager, and ships production-grade CLI tools, Docker containers, CI/CD workflows, monitoring, and model serving infrastructure as part of the package.

The market opportunity is substantial: a $182B AI infrastructure market in 2025 growing to an estimated $466B by 2030, with custom silicon projected to reach 45% of AI compute by 2028. NVIDIA's share is compressing. AMD, Trainium, and TPU are scaling rapidly. Multi-cloud is standard. No competitor covers all four major cloud accelerators with both training and inference support. TorchBridge does, and it does so as the only PyTorch-native, full-lifecycle solution: detection, optimization, training, export, and serving in a single coherent interface.

TorchBridge is currently at v0.5.22, shipping every two to three weeks, with 1,464 tests, 73,521 lines of production code, six-platform cloud validation, and a roadmap running through backend-aware quantization, PagedAttention, FSDP2, LoRA/QLoRA adapter training, and cross-backend profiling.

JOB REQUIREMENTS
- Own and maintain the TorchBridge product roadmap from current v0.5.22 through the quantization, inference differentiation, training excellence, and industry differentiation milestones, with clear prioritization rationale at each stage
- Conduct ongoing discovery with AI engineering teams, MLOps practitioners, infrastructure architects, and FinOps stakeholders to understand the hardware diversification pressures, vendor lock-in costs, and operational overhead that TorchBridge must solve better than any alternative
- Define detailed product requirements for technically complex HAL features including backend-aware quantization via torchao, FlexAttention kernel routing, PagedAttention with KV-cache, FSDP2 and DTensor integration, LoRA/QLoRA adapter training, and cross-backend profiling and energy monitoring
- Track and analyze the competitive landscape with rigorous specificity: Modular/MAX, vLLM, HuggingFace Optimum, Lightning AI, DeepSpeed, Fireworks, Together AI, Ray Serve, and NVIDIA TensorRT each have specific gaps that define TorchBridge's differentiation. Your job is to know exactly what those gaps are, validate that they remain gaps, and ensure TorchBridge's roadmap widens them
- Define success metrics for every TorchBridge capability in terms that engineering buyers and infrastructure leaders care about: benchmark performance on validated hardware, memory reduction percentages, MTTR for cross-backend migration, and test coverage and reliability standards
- Lead go-to-market planning for each release milestone in collaboration with marketing and sales, including technical benchmark publication, competitive positioning documentation, and developer community engagement strategy
- Manage TorchBridge's PyPI presence, documentation quality standards, and developer experience from pip install torchbridge-ml through full production deployment
- Maintain the two-to-three week release cadence and the internal culture of incremental, validated, production-quality shipping that the v0.5.x release train represents
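To ground the kind of work described above: the hardware auto-detection and backend routing at the core of the product can be sketched in a few lines of plain PyTorch. This is a hypothetical illustration using only standard PyTorch calls, not TorchBridge's actual API; Trainium and TPU routing (via torch_neuronx and torch_xla) are omitted for brevity.

```python
import torch


def detect_backend() -> str:
    """Pick the best available backend, in priority order.

    Hypothetical sketch of the auto-detection a PyTorch hardware
    abstraction layer performs. Trainium (torch_neuronx) and TPU
    (torch_xla) checks would slot in here in a real implementation.
    """
    if torch.cuda.is_available():
        # PyTorch's CUDA device API covers both NVIDIA CUDA and AMD
        # ROCm builds; torch.version.hip distinguishes the two.
        return "rocm" if torch.version.hip else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def to_best_device(module: torch.nn.Module) -> torch.nn.Module:
    """Move a model to the detected backend with no caller-side changes."""
    backend = detect_backend()
    # ROCm builds expose AMD GPUs through the "cuda" device type.
    device = {"cuda": "cuda", "rocm": "cuda", "mps": "mps"}.get(backend, "cpu")
    return module.to(device)
```

The "zero code modifications" promise comes from callers invoking something like to_best_device once and never branching on vendor themselves; the routing decision lives entirely inside the abstraction layer.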
YOU MAY BE A GOOD FIT IF YOU HAVE
- 2 to 4 years of product management experience, ideally at a developer tools, AI infrastructure, MLOps, or open-source software company
- Genuine technical depth in AI infrastructure: you understand what a hardware abstraction layer does, why backend-specific kernel dispatch matters, and what problems PyTorch teams face when they want to move a workload from NVIDIA to AMD or Trainium
- Demonstrated ability to define technically precise product requirements for complex systems engineering work that ML engineers and backend developers can implement without ambiguity
- Strong competitive analysis instincts: you can read a competitor's documentation, identify their actual capabilities and limitations, and translate that into clear positioning and roadmap decisions
- Analytical discipline: you define meaningful success metrics before building and evaluate outcomes honestly, including when something shipped but did not deliver the intended value
- Comfort operating in a fast release cadence with real production quality standards: shipping every two to three weeks means your requirements need to be right the first time, not refined across three sprint cycles
- Strong technical communication: you can write a roadmap document that an ML engineer trusts and a VP of Engineering can present to a board
PREFERRED QUALIFICATIONS
- Experience shipping developer tools, ML infrastructure, or open-source Python packages with a real engineering user base
- Familiarity with PyTorch internals, distributed training frameworks, or hardware-specific ML optimization
- Knowledge of the AI accelerator market including NVIDIA, AMD, AWS Trainium, Google TPU, and the hardware procurement dynamics that drive enterprise interest in hardware diversification
- Experience with developer community building, technical documentation strategy, and PyPI/open-source ecosystem positioning
- Familiarity with FinOps practices and how organizations justify GPU infrastructure cost decisions
- Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree is an advantage
COMPENSATION & BENEFITS
- Salary: Competitive base, negotiable based on experience
- Performance-based commission structure: your earnings scale directly with your results
- Two annual festive bonuses, each equivalent to half a month's salary
- Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
- Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
- Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one