Senior Site Reliability Engineer in Toronto

Boson AI

Winnipeg, MB, Canada | Full-time | June 07, 2026

Job Description

Elevate your career as a Senior Site Reliability Engineer in Toronto, managing cutting-edge HPC infrastructure with NVIDIA GPUs. Join a dynamic team focusing on advanced AI and ML clusters.

You will oversee the lifecycle of our high-performance computing (HPC) infrastructure. This role requires hands-on experience in planning, deploying, and maintaining resilient systems. You will also collaborate with engineering and research teams to optimize operations and keep cluster performance reliable.

Key Responsibilities:
• Manage and optimize operations of HPC clusters
• Deploy and maintain infrastructure-as-code solutions
• Support research teams by optimizing cluster usage
• Operate and troubleshoot Ceph storage clusters
• Develop tooling and automation for efficiency

Requirements:
• 5+ years experience in SRE or HPC operations
• Proficiency in Linux systems (Ubuntu/Debian)
• Experience with Kubernetes container orchestration
• Knowledge of Ceph deployments over 1PB
• Skilled in Pyt...

Ready to Apply?

Submit your application for Senior Site Reliability Engineer in Toronto at Boson AI

