Job Description
This role is responsible for building, operating, and scaling highly reliable AI/ML and cloud infrastructure platforms. The position combines Site Reliability Engineering (SRE), Platform Engineering, and AI Operations (AIOps) to ensure production systems remain stable, automated, and scalable.
Key Responsibilities
- Build and scale agentic AI systems for incident triage, anomaly detection, and self-healing automation.
- Maintain and improve the reliability and performance of AI/ML model-serving infrastructure.
- Operate, optimize, and scale distributed cloud-native systems.
- Drive automation initiatives to reduce manual operational work and improve efficiency.
- Define and manage SLOs, monitoring, observability, and incident response processes.
- Participate in troubleshooting, root-cause analysis, and continuous system improvement.
Required Skills & Experience
Ready to Apply?
Submit your application for Site Reliability Engineer at Otyms Consultings Services Inc.