Found Description
Elevate your career as a Senior Site Reliability Engineer in Toronto, managing cutting-edge HPC infrastructure with NVIDIA GPUs. Join a dynamic team focusing on advanced AI and ML clusters.
You will oversee the lifecycle of our high-performance computing (HPC) infrastructure. This role requires hands-on experience in planning, deploying, and maintaining resilient systems. Collaborate with engineering and research teams to optimize operations and ensure seamless performance.
Key Responsibilities:
• Manage and optimize operations of HPC clusters
• Deploy and maintain infrastructure-as-code solutions
• Support research teams by optimizing cluster usage
• Operate and troubleshoot Ceph storage clusters
• Develop tooling and automation for efficiency
Requirements:
• 5+ years experience in SRE or HPC operations
• Proficiency in Linux systems (Ubuntu/Debian)
• Experience with Kubernetes container orchestration
• Knowledge of Ceph deployments over 1PB
• Skilled in Pyt...
Ready to Apply?
Submit your application for Senior Site Reliability Engineer in Toronto at Boson AI