Program

7th Cloud Intelligence / AIOps Workshop @ ASPLOS'26 Program Schedule

March 22nd, 2026

Pittsburgh, PA

Note: All times listed on this page are local to Pittsburgh (ET).

9:00 - 9:10 am

Opening Remarks

Daehyeok Kim, University of Texas at Austin

9:10 - 9:55 am

Keynote 1: Towards AI-centric Cloud Platforms: Infusing AI into Azure Management and Operations

Ricardo Bianchini, Microsoft Azure

Abstract: Managing resources and operating cloud platforms at scale are extremely challenging given the complexity of these systems and their need for high efficiency and availability. Over the years, platform designers have applied a range of AI/ML techniques to tackle these challenges with varying levels of success. The advent of high-quality generative and agentic AI systems creates a new opportunity to significantly advance both management and operations. In this talk, I will describe some of Azure's efforts in this space, including its use of AI in resource management, and large language models (LLMs) for incident management and root-causing quality regressions at scale. I will conclude by discussing open research directions and challenges on the path towards AI-centric cloud platforms.

Bio: Dr. Ricardo Bianchini is a Technical Fellow and Corporate Vice President at Microsoft Azure, where he leads the team responsible for managing Azure's Compute workload, server capacity, and datacenter infrastructure, with a strong focus on efficiency and sustainability. During his tenure at Microsoft, he spearheaded research in AI-centric resource management that resulted in Azure's Intelligence Platform infrastructure, which he continues to oversee. Prior to joining Microsoft, he spent more than a decade in academia. He is a Fellow of both the ACM and IEEE.

10:00 - 10:55 am

Technical Session 1


Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
Taeyoon Kim, Woohyeok Park (Hanyang University); Hoyeong Yun (OKESTRO Co., Ltd.); Kyungyong Lee (Hanyang University)

ActionNex: A Virtual Outage Manager for Cloud Computing
Zhenfeng Lin, Haoji Hu, Ming Hao (Microsoft); Xuchao Zhang (Microsoft Research); Ryan Zhang, Junhao Li, Oleg Kulygin, Ze Li, Sheila Jiang, Chetan Bansal, Hatay Tuna, Salman Zafar (Microsoft)

Abstract: DAGR: Agentic RCA for Internet-Scale Services
Sayan Sinha (Georgia Tech/Conviva); Vipul Harsh (Conviva); B. Aditya Prakash (Georgia Tech); Vyas Sekar, Hui Zhang (Carnegie Mellon University/Conviva)

Abstract: Markov Models for Improved Outage Candidate Generation in Supervised Outage Prediction Models
Aditya Mate, Youjiang Wu, Joe Hu, Udaivir Yadav (Microsoft); Yingnong Dang (Microsoft Azure)

10:55 - 11:05 am

Break

11:05 - 11:50 am

Keynote 2: Putting the AI in AIOps: From Artificial Intelligence to Actionable Insights

Vyas Sekar, Carnegie Mellon University

Abstract: As Dickens famously wrote, “It was the best of times, it was the worst of times.” We are entering an era where the ability to develop and deploy cloud-scale applications has been democratized, yet the complexity of observing, managing, and troubleshooting these environments has far outpaced human cognition. In this landscape, AIOps, the automation of systems management, is critical for transitioning from “human timescales” to the “machine timescales” at which modern infrastructure operates. However, rather than blindly applying “Artificial Intelligence” to the problem, we argue for a fundamental shift: reframing the “AI” in AIOps to focus on Actionable Insights. In this keynote, we identify the foundational requirements for AIOps in internet-scale applications (Expressivity, Efficiency, Explainability, and Effort, the “4Es”) and present the design principles necessary to build the next generation of these systems. Finally, we will share illustrative examples from academic and industry collaborations that embody these principles to achieve the 4E goals.

Bio: Vyas Sekar is the Tan Family Professor of Electrical and Computer Engineering at Carnegie Mellon University. He is a member of CMU CyLab, where he co-directs the Future of Enterprise Security Initiative. He is also the co-founder and Chief Technologist at Rockfish Data, and the Chief Scientist at Conviva. His research sits at the intersection of networks, systems, and security. His work has been recognized with the SIGCOMM Rising Star Award, the NSA Science of Security prize, the Intel Outstanding Researcher Award, the SIGCOMM Test of Time Award, and multiple Best Paper awards.

11:50 am - 12:50 pm

Lunch Break

12:50 - 1:50 pm

Panel Discussion: The Future of Cloud Operations: Building Agent-Operable Infrastructure for Actionable AIOps

This panel brings together leaders from academia and industry to examine the next phase of cloud operations in the age of AI. Beyond applying AI to management and troubleshooting, the discussion will explore what it takes to build cloud platforms that are actionable, agentic, and inherently operable by AI. Panelists will discuss the foundations of autonomous operations, including declarative infrastructure, guardrails, observability, human oversight, and the open research challenges in making AIOps safe, scalable, and effective in practice.

Moderator: Mathew John, Microsoft
Panelists: Yingnong Dang, Microsoft; Vyas Sekar, Carnegie Mellon University; Ang Chen, University of Michigan; Ariane Lanier, Meta

2:00 - 2:45 pm

Keynote 3: Cloud and AIOps: A Match Made In Heaven

Ang Chen, University of Michigan

Abstract: Cloud operations are essential but challenging. DevOps engineers need to perform a substantial amount of “plumbing work”: creating cloud resources, monitoring infrastructure health, and rolling out updates. Compounding this difficulty, cloud providers evolve their services and APIs quickly, and the diversifying range of providers introduces further complexity. As such, it is increasingly untenable for DevOps engineers to master all cloud details or to develop static, symbolic tools, as these are easily outdated by cloud changes. This talk argues that AI automation is an ideal fit for cloud management tasks. These tasks are routinely structured, making their operations easier to plan and validate; they are already programmable via well-defined APIs; and they come with extensive documentation and usage examples, a treasure trove of information from which an AI agent can learn. Agentic operations can seamlessly generalize across cloud providers, and can easily capture service, API, and documentation changes by periodic relearning. By reliably automating via “AIOps,” engineers can focus on developing their key business applications without feeling the cloud. We are taking three steps to enable this vision: developing a high-fidelity cloud emulator as “gym,” curating a DevOps curriculum for agent learning, and specializing the agent to various downstream tasks.

Bio: Ang Chen is an associate professor at the University of Michigan, Ann Arbor. His recent work focuses on cloud operations with AI agents, with the goal of enabling high-velocity Cloud AIOps. His team has made a range of contributions: developing DevOps datasets and benchmarks, as well as systems for mining IaC semantic checks, reconciling infrastructure drift, performing reliable updates, and lifting cloud infrastructure.

2:45 - 3:45 pm

Technical Session 2


Towards Automated Monitor Configuration for Cloud Services: A Data-Driven AIOps Framework Informed by Industrial Practice
Anson Bastos, Anjaly Parayil (Microsoft); Ayush Choure (Independent); Chetan Bansal, Rujia Wang (Microsoft)

CAPES: Causal Analysis of Power Effect under Score-based Scheduling
Jiali Xing, William Meng, Ziqi Meng (University of Pennsylvania); Liangcheng Yu (Microsoft Research); Vincent Liu, Benjamin Lee (University of Pennsylvania)

An Empirical Study of Automation Gaps in LLM Serving Systems
Bhala Ranganathan, Minghua Ma, Mickey Zhang, Klein Hu, Rakesh Kelkar, Chetan Bansal (Microsoft)

Abstract: Speculative Load Fusion for Cloud Efficiency
Deepanjali Mishra (Carnegie Mellon University); Tanvir Ahmed Khan (Columbia University in the City of New York); Gilles Pokam (Intel); Heiner Litz (UC Santa Cruz); Akshitha Sriraman (Carnegie Mellon University)

Abstract: Multi-Modal Outage Detection via Graph-Enhanced Retrieval
Udaivir Yadav, Francisco Mandujano-Reyes, Youjiang Wu (Microsoft)

3:45 - 3:50 pm

Break

3:50 - 4:35 pm

Keynote 4: The Missing Half of AIOps

Ariane Lanier, Meta

Abstract: Modern cloud operations have reached a tipping point where hyperscale environments, characterized by decades of heterogeneous tooling and thousands of human-oriented runbooks, are becoming too complex for manual management. While organizations are rushing to deploy AI-for-Productivity, they often encounter significant friction because existing infrastructure lacks the feedback loops, safety validation buffers, and machine-readable knowledge required for autonomous agents. This talk argues that successful AIOps requires a paradigm shift: investing in “Operations-for-AI” to transform human-centric systems into agent-operable ones. By building these foundational layers, organizations can move beyond incremental patches toward a future where infrastructure natively understands intent and executes reliably.

We present a framework for agent-operable infrastructure centered on four technical layers: a Declarative Core to turn dangerous operations into reviewable code, Intent Translation to map high-level goals to validated actions, Tiered Guardrails for escalating safety controls, and Multi-Layer Observability to enable tight-loop validation. Beyond technical upgrades, this transition requires cross-functional governance and new engineering competencies in declarative patterns and agent supervision. Ultimately, the “missing half” of AIOps—the foundational investment in agent-operability—is what allows AI-driven diagnostics and automated incident response to be deployed safely and effectively at scale.

Bio: Ariane Lanier is a Software Engineer and technical lead in Reliability Engineering at Meta, where she leads efforts to make Meta’s global infrastructure safely operable by autonomous AI agents. Her work spans the full breadth of this challenge—from establishing safety guardrails for agentic systems, to driving the operational standards and oversight models needed as agents take on increasingly complex infrastructure tasks at scale. She also serves as a Site Incident Manager on Call (IMOC), coordinating Meta’s response to critical production incidents, and has led cross-organizational initiatives in large-scale incident prevention and operational readiness for products serving billions of users.

Before joining Meta in 2015, Ariane spent a decade at Microsoft, where she worked on the Windows CE kernel and file system before moving into logging and telemetry—building the instrumentation system that shipped across Windows CE and Windows Phone, and leading the observability systems for Windows 10.

Ariane holds a B.S.E. in Computer Science from Princeton University (2005). Her career—from OS kernels to telemetry pipelines to hyperscale infrastructure reliability to AI agent governance—traces the evolving frontier of systems operations, where classical principles of observability, safe deployment, and operational discipline are being fundamentally re-examined for a world of increasingly autonomous software.

4:35 - 5:00 pm

Closing & Networking Q&A