Product Manager, TorchBridge
Job Description
As Product Manager for TorchBridge, you will own the roadmap for the most technically complex and strategically significant product in CloudlyIO's portfolio. TorchBridge sits at the center of one of the fastest-moving markets in enterprise technology, where hardware vendor dynamics shift quarterly, new accelerator architectures ship constantly, and the competitive landscape attracts billions in venture funding.

Your buyers are ML engineers, MLOps leads, and AI infrastructure architects who will evaluate TorchBridge against tools with tens of thousands of GitHub stars and hundreds of millions in funding. They will test every claim, benchmark every capability, and dismiss any product story that does not match their operational reality. You need to understand their world well enough to define a roadmap that outpaces the competition on the dimensions that actually matter to the people running production AI infrastructure.

This role requires genuine technical depth in AI infrastructure, clear thinking about competitive strategy, and the ability to run a product development process that ships meaningful value every two to three weeks without losing sight of the longer arc.
ABOUT TORCHBRIDGE
TorchBridge is a hardware abstraction layer for PyTorch that lets engineering teams write once and run on any accelerator (NVIDIA, AMD, AWS Trainium, Google TPU, or CPU) with zero code modifications. It auto-detects available hardware, routes to the optimal backend and memory manager, and ships production-grade CLI tools, Docker containers, CI/CD workflows, monitoring, and model serving infrastructure as part of the package.

The market opportunity is substantial: a $182B AI infrastructure market in 2025 growing to an estimated $466B by 2030, with custom silicon projected to reach 45% of AI compute by 2028. NVIDIA's share is compressing. AMD, Trainium, and TPU are scaling rapidly. Multi-cloud is standard. No competitor covers all four major cloud accelerators with both training and inference support. TorchBridge does, and it does so as the only PyTorch-native, full-lifecycle solution: detection, optimization, training, export, and serving in a single coherent interface.

TorchBridge is currently at v0.5.22, shipping every two to three weeks, with 1,464 tests, 73,521 lines of production code, six-platform cloud validation, and a roadmap running through backend-aware quantization, PagedAttention, FSDP2, LoRA/QLoRA adapter training, and cross-backend profiling.

JOB REQUIREMENTS
- Own and maintain the TorchBridge product roadmap from current v0.5.22 through the quantization, inference differentiation, training excellence, and industry differentiation milestones, with clear prioritization rationale at each stage
- Conduct ongoing discovery with AI engineering teams, MLOps practitioners, infrastructure architects, and FinOps stakeholders to understand the hardware diversification pressures, vendor lock-in costs, and operational overhead that TorchBridge must solve better than any alternative
- Define detailed product requirements for technically complex HAL features including backend-aware quantization via torchao, FlexAttention kernel routing, PagedAttention with KV-cache, FSDP2 and DTensor integration, LoRA/QLoRA adapter training, and cross-backend profiling and energy monitoring
- Track and analyze the competitive landscape with rigorous specificity: Modular/MAX, vLLM, HuggingFace Optimum, Lightning AI, DeepSpeed, Fireworks, Together AI, Ray Serve, and NVIDIA TensorRT each have specific gaps that define TorchBridge's differentiation. Your job is to know exactly what those gaps are, validate that they remain gaps, and ensure TorchBridge's roadmap widens them
- Define success metrics for every TorchBridge capability in terms that engineering buyers and infrastructure leaders care about: benchmark performance on validated hardware, memory reduction percentages, MTTR for cross-backend migration, and test coverage and reliability standards
- Lead go-to-market planning for each release milestone in collaboration with marketing and sales, including technical benchmark publication, competitive positioning documentation, and developer community engagement strategy
- Manage TorchBridge's PyPI presence, documentation quality standards, and developer experience from pip install torchbridge-ml through full production deployment
- Maintain the two-to-three week release cadence and the internal culture of incremental, validated, production-quality shipping that the v0.5.x release train represents
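To ground the kind of work described above: the hardware auto-detection and backend routing at the core of the product can be sketched in a few lines of plain PyTorch. This is a hypothetical illustration using only standard PyTorch calls, not TorchBridge's actual API; Trainium and TPU routing (via torch_neuronx and torch_xla) are omitted for brevity.

```python
import torch


def detect_backend() -> str:
    """Pick the best available backend, in priority order.

    Hypothetical sketch of the auto-detection a PyTorch hardware
    abstraction layer performs. Trainium (torch_neuronx) and TPU
    (torch_xla) checks would slot in here in a real implementation.
    """
    if torch.cuda.is_available():
        # PyTorch's CUDA device API covers both NVIDIA CUDA and AMD
        # ROCm builds; torch.version.hip distinguishes the two.
        return "rocm" if torch.version.hip else "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def to_best_device(module: torch.nn.Module) -> torch.nn.Module:
    """Move a model to the detected backend with no caller-side changes."""
    backend = detect_backend()
    # ROCm builds expose AMD GPUs through the "cuda" device type.
    device = {"cuda": "cuda", "rocm": "cuda", "mps": "mps"}.get(backend, "cpu")
    return module.to(device)
```

The "zero code modifications" promise comes from callers invoking something like to_best_device once and never branching on vendor themselves; the routing decision lives entirely inside the abstraction layer.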
YOU MAY BE A GOOD FIT IF YOU HAVE
- 2 to 4 years of product management experience, ideally at a developer tools, AI infrastructure, MLOps, or open-source software company
- Genuine technical depth in AI infrastructure: you understand what a hardware abstraction layer does, why backend-specific kernel dispatch matters, and what problems PyTorch teams face when they want to move a workload from NVIDIA to AMD or Trainium
- Demonstrated ability to define technically precise product requirements for complex systems engineering work that ML engineers and backend developers can implement without ambiguity
- Strong competitive analysis instincts: you can read a competitor's documentation, identify their actual capabilities and limitations, and translate that into clear positioning and roadmap decisions
- Analytical discipline: you define meaningful success metrics before building and evaluate outcomes honestly, including when something shipped but did not deliver the intended value
- Comfort operating in a fast release cadence with real production quality standards: shipping every two to three weeks means your requirements need to be right the first time, not refined across three sprint cycles
- Strong technical communication: you can write a roadmap document that an ML engineer trusts and a VP of Engineering can present to a board
PREFERRED QUALIFICATIONS
- Experience shipping developer tools, ML infrastructure, or open-source Python packages with a real engineering user base
- Familiarity with PyTorch internals, distributed training frameworks, or hardware-specific ML optimization
- Knowledge of the AI accelerator market including NVIDIA, AMD, AWS Trainium, Google TPU, and the hardware procurement dynamics that drive enterprise interest in hardware diversification
- Experience with developer community building, technical documentation strategy, and PyPI/open-source ecosystem positioning
- Familiarity with FinOps practices and how organizations justify GPU infrastructure cost decisions
- Bachelor's degree in Computer Science, Engineering, or a related field; Master's degree is an advantage
COMPENSATION & BENEFITS
- Salary: Competitive base, negotiable based on experience
- Performance-based commission structure: your earnings scale directly with your results
- Two annual festive bonuses, each equivalent to half a month's salary
- Two-day weekends, 10 days casual leave, 10 days sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
- Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
- Direct collaboration with US clients and teams, with real exposure to global enterprise AI deals from day one