Abstract: Modern cloud operations have reached a tipping point where hyperscale environments, characterized by decades of heterogeneous tooling and thousands of human-oriented runbooks, are becoming too complex for manual management. While organizations are rushing to deploy AI-for-Productivity, they often encounter significant friction because existing infrastructure lacks the feedback loops, safety validation buffers, and machine-readable knowledge required for autonomous agents. This talk argues that successful AIOps requires a paradigm shift: investing in “Operations-for-AI” to transform human-centric systems into agent-operable ones. By building these foundational layers, organizations can move beyond incremental patches toward a future where infrastructure natively understands intent and executes reliably.
We present a framework for agent-operable infrastructure centered on four technical layers: a Declarative Core to turn dangerous operations into reviewable code, Intent Translation to map high-level goals to validated actions, Tiered Guardrails for escalating safety controls, and Multi-Layer Observability to enable tight-loop validation. Beyond technical upgrades, this transition requires cross-functional governance and new engineering competencies in declarative patterns and agent supervision. Ultimately, the “missing half” of AIOps—the foundational investment in agent-operability—is what allows AI-driven diagnostics and automated incident response to be deployed safely and effectively at scale.
Bio: Ariane Lanier is a Software Engineer and technical lead in Reliability Engineering at Meta, where she leads efforts to make Meta’s global infrastructure safely operable by autonomous AI agents. Her work spans the full breadth of this challenge, from establishing safety guardrails for agentic systems to driving the operational standards and oversight models needed as agents take on increasingly complex infrastructure tasks at scale. She also serves as a Site Incident Manager on Call (IMOC), coordinating Meta’s response to critical production incidents, and has led cross-organizational initiatives in large-scale incident prevention and operational readiness for products serving billions of users.
Before joining Meta in 2015, Ariane spent a decade at Microsoft, where she worked on the Windows CE kernel and file system before moving into logging and telemetry—building the instrumentation system that shipped across Windows CE and Windows Phone, and leading the observability systems for Windows 10.
Ariane holds a B.S.E. in Computer Science from Princeton University (2005). Her career—from OS kernels to telemetry pipelines to hyperscale infrastructure reliability to AI agent governance—traces the evolving frontier of systems operations, where classical principles of observability, safe deployment, and operational discipline are being fundamentally re-examined for a world of increasingly autonomous software.