CloudOps Engineer

Job Description

As a CloudOps Engineer at CloudlyIO, you will own the day-to-day operational health, reliability, and cost efficiency of our cloud environments across AWS, Azure, and GCP. Where the Cloud Solution Architect designs the systems, you keep them running at their best. You will monitor and optimize infrastructure, respond to incidents, manage capacity, and continuously improve the operational posture of our cloud infrastructure as it scales to support our growing AI product portfolio and enterprise customer base.
This is a role for someone who is deeply comfortable in cloud consoles and CLI alike, moves fast when incidents arise, and brings a continuous improvement mindset to everything they operate.

Job Responsibilities

Cloud Operations & Reliability
  • Own the operational health of CloudlyIO's cloud infrastructure across AWS, Azure, and GCP, ensuring high availability, performance, and reliability for all production systems
  • Monitor infrastructure and application health using CloudWatch, Grafana, Prometheus, and related tooling, and respond to alerts with urgency and precision
  • Manage and optimize cloud resources including compute, storage, networking, and database services across environments
  • Lead incident response for cloud infrastructure issues: triage, contain, resolve, and conduct thorough post-incident reviews
  • Maintain and improve infrastructure runbooks, operational playbooks, and on-call procedures
Cost Management & Optimization
  • Own cloud cost management and FinOps practices across all cloud accounts, identifying and acting on opportunities to reduce waste without sacrificing performance
  • Monitor resource utilization, right-size workloads, and implement reserved-capacity and Savings Plans strategies
  • Produce regular cloud cost reports and optimization recommendations for engineering and leadership teams
Capacity Planning & Scaling
  • Monitor capacity trends and proactively plan scaling strategies for growing AI workloads and customer deployments
  • Configure and manage auto-scaling, load balancing, and traffic routing across production environments
  • Support infrastructure provisioning for new product deployments and customer onboarding
Governance & Compliance Operations
  • Enforce tagging standards, access controls, and governance policies across cloud accounts
  • Support audit and compliance activities by maintaining accurate infrastructure documentation and access logs
  • Work with the SecOps team to ensure operational practices align with security and compliance requirements
Collaboration
  • Work closely with DevOps, SecOps, and MLOps teams to ensure cloud operations practices are integrated across the delivery and platform lifecycle
  • Provide infrastructure support and guidance to product engineering teams as a reliable operational partner
YOU MAY BE A GOOD FIT IF YOU HAVE
  • 3 to 5 years of hands-on cloud operations experience, primarily on AWS with working knowledge of Azure or GCP
  • Strong operational proficiency with AWS services including EC2, EKS, RDS, S3, VPC, Route 53, CloudWatch, Auto Scaling, ELB, and IAM
  • Experience with container operations including Docker and Kubernetes in production environments
  • Hands-on experience with infrastructure as code tools such as Terraform or CloudFormation for operational changes and provisioning
  • Proficiency with monitoring and observability tools including CloudWatch, Prometheus, Grafana, and Elasticsearch
  • Strong incident response instincts: you triage quickly, communicate clearly, and follow through to root cause
  • Experience with cloud cost management and FinOps practices
  • Scripting proficiency in Python or Bash for operational automation

PREFERRED QUALIFICATIONS
  • AWS certification such as SysOps Administrator or Solutions Architect
  • Experience supporting AI/ML workloads including GPU compute and large-scale data pipelines
  • Familiarity with site reliability engineering (SRE) principles and service level objective (SLO) frameworks
  • Experience with multi-account AWS Organizations and governance at scale
  • Knowledge of database operations including MySQL, PostgreSQL, Redis, and Elasticsearch
  • Bachelor's degree in Computer Science, Engineering, or a related field

COMPENSATION & BENEFITS
  • Salary: Competitive and negotiable based on experience
  • Two annual festive bonuses, each equivalent to half a month's salary
  • Two-day weekends, 10 days of casual leave, 10 days of sick leave, and 14 public holidays per CloudlyIO's global holiday calendar for Bangladesh
  • Fully subsidized lunch and evening snacks, plus tea and coffee throughout the day
  • Health insurance
  • Direct collaboration with US clients and teams, working on real enterprise AI infrastructure from day one