AI systems are rapidly evolving from assistive tools into autonomous decision-makers operating in production environments. Yet most organizations still evaluate AI success primarily through accuracy, latency, or model performance, often overlooking reliability, failure modes, and human trust.
In this talk, I will introduce AI Reliability Engineering (AIRE)—an emerging discipline that applies Site Reliability Engineering principles to AI-driven systems. I will explore how AI fails differently than traditional software, why AI-related incidents often go undetected, and how foundational SRE concepts such as SLOs, error budgets, observability, and graceful degradation must evolve to support reliable AI workloads at scale.
Akash Thakur is a Site Reliability Engineering leader and IT Architect with 17+ years of experience modernizing mission-critical systems across finance, healthcare, and the public sector. He currently serves as an SRE Architect at Cognizant, where he leads automation-first SRE and AI-driven resilience initiatives for Fortune 500 enterprises. He writes and speaks on the evolving intersection of SRE and AI infrastructure.