Job Description
This role is responsible for building, operating, and scaling highly reliable AI/ML and cloud infrastructure platforms. The position combines site reliability engineering (SRE), platform engineering, and AI operations (AIOps) to ensure production systems remain stable, automated, and scalable.
Key Responsibilities
- Build and scale agentic AI systems for incident triage, anomaly detection, and self-healing automation.
- Maintain and improve the reliability and performance of AI/ML model-serving infrastructure.
- Operate, optimize, and scale distributed cloud-native systems.
- Drive automation initiatives to reduce manual operational work and improve efficiency.
- Define and manage SLOs, monitoring, observability, and incident response processes.
- Participate in troubleshooting, root-cause analysis, and continuous system improvement.
Required Skills & Experience
- 5+ years of experience in SRE, produc...
Ready to Apply?
Submit your application for Site Reliability Engineer (Xico) at Link-Worldwide