Found Description
Key Responsibilities
- Build and maintain automated tests for AI agent workflows, APIs, tools, telemetry-backed analysis, remediation flows, and ticketing behavior.
- Design evaluation suites for LLM and agentic behavior, including expected-answer checks, rubric-based grading, regression datasets, tool-call validation, and safety/approval checks.
- Use or help implement evaluation frameworks such as Pydantic Evals / Pydantic AI, Strands Evals, LangSmith, DeepEval, Ragas, promptfoo, or similar tools.
- Validate multi-turn support scenarios, clarification flows, knowledge retrieval, script/remediation recommendations, escalation paths, and failure handling.
- Test on-device agent behavior where needed, including Windows service/tray behavior, telemetry collection, anomaly detection, local remediation handoff, logs, and resource impact.
- Debug quality issues directly by reading logs, tracing requests, reproducing failures, and m...