Job Description
We are seeking a highly skilled Site Reliability Engineer (SRE) to support and enhance the reliability, scalability, and performance of our production systems. Some positions may require full-time on‑site attendance.
Responsibilities
- Own the reliability, availability, and performance of production systems in a containerized, microservices‑based environment.
- Monitor system health using Grafana dashboards, alerts, and observability tools; proactively identify and resolve issues.
- Manage and operate Kubernetes clusters (via Rancher), including deployments, scaling, and troubleshooting.
- Lead and participate in incident management using OpsGenie, including on‑call rotations, escalations, and post‑incident reviews.
- Troubleshoot issues across application, infrastructure, messaging, database, and container layers.
- Build and maintain automation scripts and tools using Bash, Go, and/or Python to improve operational ef...