Program

6th Cloud Intelligence / AIOps Workshop @ ICSE'25 Program Schedule

May 3rd, 2025

Details of the program are also available in the ICSE system.

Venue: Rogers Centre, Ottawa (formerly Shaw Centre) Room 210
Note: All times listed on this page are local to Ottawa.

9:00 - 9:10 am

Opening

9:10 - 9:55 am

Keynote #1: AIOps Unleashed: Transforming Cloud Operations at Microsoft

Zhangwei Xu, Microsoft Azure

Zhangwei Xu is a Distinguished Engineer at Microsoft Azure Edge+Platform, where he leads the AIOps platform and experience team. In this role, he oversees the development of intelligent health and monitoring infrastructure for all Microsoft services including Azure and Office, and Azure Monitor customers. Prior to his work with Azure, he worked as a software engineer and engineering manager in various Microsoft divisions, including Windows, Xbox, and Bing.

Abstract: This presentation explores how Microsoft is pioneering the integration of artificial intelligence with service operations to drive tangible improvements in cloud service reliability and on-call efficiency. In this keynote, we delve into the transformative journey of applying AIOps to Azure and Microsoft services, sharing real-world insights and use cases that demonstrate enhancements in service health monitoring, predictive maintenance, and rapid incident resolution. Attendees will learn how advanced AI-driven automation and machine learning are not only optimizing operational workflows but also enabling more agile and proactive cloud management. Join us as we reveal the strategies behind a more resilient, scalable, and intelligent cloud operational platform at Microsoft.

Keynote deck [PDF]

10:00 - 10:30 am

Invited Talk #1: Towards Pragmatic Pre-processing of Logs

Weiyi Shang, University of Waterloo

Abstract: Software systems usually record important runtime information in their logs. Logs help practitioners understand system runtime behaviors and diagnose field failures. As logs are usually very large in size, automated log analysis is needed to assist practitioners in their software operation and maintenance efforts. The success of adopting log analysis in practice often depends on sophisticated preprocessing. In particular, to enable systematic analysis, logs are often parsed, which converts the raw logs from unstructured text to a structured format, and then grouped based on particular IDs or time slots. In this talk, I provide an overview of some of our recent work on automated techniques for log pre-processing, especially considering the adoption of log analysis in practice.

Weiyi Shang is an Associate Professor at the University of Waterloo. His research interests include AIOps, big data software engineering, software log analytics, and software performance engineering. He serves as a steering committee member of the SPEC Research Group. He was ranked among the top worldwide SE research stars in a recent bibliometric assessment of software engineering scholars. He is a recipient of various premium awards, including the SIGSOFT Distinguished Paper Award at ICSE 2013 and ICSE 2020, the Best Paper Award at WCRE 2011, and the Distinguished Reviewer Award for the Empirical Software Engineering journal. His research has been adopted by industrial collaborators (e.g., BlackBerry and Ericsson) to improve the quality and performance of their software systems that are used by millions of users worldwide. Contact him at wshang@uwaterloo.ca; uwaterloo.ca/electrical-computer-engineering/profile/wshang.

10:30 - 11:00 am

Morning Break

11:00 - 11:30 am

Invited Talk #2: Scaling Intelligence: AIOps for Complex Systems

Andriy Miranskyy, Toronto Metropolitan University

Abstract: Artificial Intelligence for IT Operations (AIOps) has made strong progress over the last decade, especially in monitoring and managing small to mid-sized systems. But applying AIOps at scale in large, complex enterprise systems remains a major challenge.

This talk looks at the core obstacles to scaling AIOps, from managing high-dimensional, fast-changing data to ensuring effective governance across distributed systems. We’ll also explore future directions and research opportunities for building smarter, more resilient operations at scale.

Andriy Miranskyy is an associate professor in the Department of Computer Science at Toronto Metropolitan University (formerly Ryerson University), Canada. His research interests are in mitigating risk in software engineering, with a focus on large-scale software systems. Andriy received his Ph.D. in Applied Mathematics from the University of Western Ontario, Canada. He has 20+ years of software engineering experience in various industries.

11:30 am - 12:30 pm

Accepted Paper Presentation Session #1

Towards Using LLMs for Distributed Trace Comparison

Vaastav Anand, Pedro Las-Casas, Rodrigo Fonseca, and Antoine Kaufmann

Abstract: Troubleshooting issues in modern cloud systems is a time-consuming and tedious task. Distributed Tracing captures relevant information such as the graphical structure as well as the temporal execution information to make it easier for operators to investigate and triage issues. To investigate incidents, operators often need to compare the structural and temporal properties of a trace representing an error execution with a reference trace representing a successful execution. Pair-wise comparison of traces is challenging as the traces are large and multi-dimensional. In this abstract, we propose a novel way of comparing distributed traces by leveraging the generative powers of LLMs. We present an initial design of our system, Parallax, which uses two different types of LLM agents - Summarizer and Comparer. Summarizer generates independent summaries of the two traces and feeds them as input into the Comparer which highlights the differences between the two summarized executions.

Automated Bug Discovery in Cloud Infrastructure-as-Code Updates with LLM Agents

Yiming Xiang, Zhenning Yang, Jingjia Peng, Hermann Bauer, Patrick Tser Jern Kon, Yiming Qiu, and Ang Chen

Abstract: Cloud environments are increasingly managed by Infrastructure-as-Code (IaC) platforms (e.g., Terraform), which allow developers to define their desired infrastructure as a configuration program that describes cloud resources and their dependencies. This shields developers from low-level operations for creating and maintaining resources, since they are automatically performed by IaC platforms when compiling and deploying the configuration. However, while IaC platforms are rigorously tested for initial deployments, they exhibit myriad errors for runtime updates, e.g., adding/removing resources and dependencies. IaC updates are common because cloud infrastructures are long-lived but user requirements fluctuate over time. Unfortunately, our experience shows that updates often introduce subtle yet impactful bugs. The update logic in IaC frameworks is hard to test due to the vast and evolving search space, which includes diverse infrastructure setups and a wide range of provided resources with new ones frequently added. We introduce TerraFault, an automated, efficient, LLM-guided system for discovering update bugs, and report our findings with an initial prototype. TerraFault incorporates various optimizations to navigate the large search space efficiently and employs techniques to accelerate the testing process. Our prototype has successfully identified bugs even in simple IaC updates, showing early promise in systematically identifying update bugs in today’s IaC frameworks to increase their reliability.

Breaking the Cycle of Recurring Failures: Applying Generative AI to Root Cause Analysis in Legacy Banking Systems

Siyuan Jin, Zhendong Bei, Bichao Chen, and Yong Xia

Abstract: Traditional banks face significant challenges in digital transformation, primarily due to legacy system constraints and fragmented ownership. Recent incidents show that such fragmentation often results in superficial incident resolutions, leaving root causes unaddressed and causing recurring failures. We introduce a novel approach to post-incident analysis, integrating knowledge-based GenAI agents with the "Five Whys" technique to examine problem descriptions and change request data. This method uncovered that approximately 70% of the incidents previously attributed to management or vendor failures were due to underlying internal code issues. We present a case study to show the impact of our method. By scanning over 5,000 projects, we identified over 400 files with a similar root cause. Overall, we leverage the knowledge-based agents to automate and elevate root cause analysis, transforming it into a more proactive process. These agents can be applied across other phases of the software development lifecycle, further improving development processes.

12:30 - 1:30 pm

Lunch Break

1:30 - 2:30 pm

Invited Talk #3: Collaborative Journey Towards AI Using Big Data

Lee Scarborough, Solutions Architect Big Data and Data Science Platforms, ExxonMobil

Ryan Hill, Principal Solutions Engineer, Cloudera

Tony Davis, Principal Solutions Engineer, Cloudera

Abstract: Over the past 10 years, ExxonMobil and Cloudera have collaborated on big data solutions, transitioning from disparate, trapped datasets to a centralized data lake. Hosted on an enterprise Hadoop platform and fed by automated data flows, the data lake laid the foundation for deploying global analytics applications that provide improved engineering solutions and predictions. This led to further platform advancements, enabling advanced analytics that combine modern consumption solutions and hybrid cloud integrations. Now the journey continues with the Cloudera AI data science platform, which leverages the latest LLM and GenAI developments to enhance our data operations across business units and realize continued joint achievements.

2:40 - 3:30 pm

Accepted Paper Presentation Session #2

Automated Lifting for Cloud Infrastructure-as-Code Programs

Jingjia Peng, Yiming Qiu, Patrick Tser Jern Kon, Pinhan Zhao, Yibo Huang, Zheng Guo, Xinyu Wang, and Ang Chen

Abstract: Infrastructure-as-code (IaC) is reshaping how cloud resources are managed. IaC users write high-level programs to define their desired infrastructure, and the underlying IaC platforms automatically deploy the constituent resources into the cloud. While proven powerful at creating greenfield deployments (i.e., new cloud deployments from scratch), existing IaC platforms provide limited support for managing brownfield infrastructure (i.e., transplanting an existing non-IaC deployment to an IaC platform). This hampers the migration from legacy cloud management approaches to an IaC workflow and hinders wider IaC adoption. Managing brownfield deployments requires techniques to lift low-level cloud resource states and translate them into corresponding IaC programs — the reversal of the regular deployment process. Existing tools rely heavily on rule-based reverse engineering, which suffers from the lack of automation, limited resource coverage, and prevalence of errors. In this work, we lay out a vision for Lilac, a new approach that frees IaC lifting from extensive manual engineering. Lilac brings the best of both worlds: leveraging Large Language Models to automate lifting rule extraction, coupled with symbolic methods to control the cloud environment and provide correctness assurance. We envision that Lilac would enable the construction of an automated and provider-agnostic lifting tool with high coverage and accuracy.

Orchestrating Cross-Layer Anomaly Detection and Mitigation to Address Gray Failures in Large-Scale Cloud Infrastructure

Ze Li, Chang Lou, Vignatha Yenugutala, Vivek Ramamurthy, Eion Blanchard, Minghua Ma, and Murali Chintalapati

Abstract: Cloud infrastructure in production constantly experiences gray failures: a degraded state in which failures go undetected by system mechanisms, yet adversely affect end-users. Addressing the underlying anomalies on host nodes is crucial to address gray failures. However, current approaches suffer from two key limitations: first, existing detection relies solely on singular-dimension signals from hosts, thus often suffering from biased views due to differential observability; second, existing mitigation actions are often insufficient, primarily consisting of host-level operations such as reboots, which leave most production issues to manual intervention. This paper presents PANACEA, a holistic framework to automatically detect and mitigate host anomalies, addressing gray failures in production cloud infrastructure. PANACEA expands beyond host-level scope: it aggregates and correlates insights from VMs and application layers to bridge the detection gap, and orchestrates fine-grained and safe mitigation across all levels. PANACEA is versatile, designed to support a wide range of anomalies. It has been deployed in production on millions of hosts.

Project showcase: Automated Service Design with Cerulean

Vaastav Anand, Alok Kumbhare, Celine Irvene, Chetan Bansal, Gagan Somashekar, Jonathan Mace, Pedro Las-Casas, and Rodrigo Fonseca

Abstract: Modern cloud applications are commonly developed as microservice systems. Automating the design, operation, and optimization of these systems has been a longstanding yet elusive goal due to the lack of standardized tools and techniques for converting high-level user requirements into low-level system implementations. With the recent rise of standardization tools such as Kubernetes, Blueprint, ServiceWeaver, and OpenTelemetry, along with the emergence of LLMs, we believe this goal is now within reach. In this project showcase, we introduce Cerulean, a modular, extensible, human-in-the-loop system that combines standardized tools with the expressive power of LLMs to automatically generate implementations of microservice systems.

3:30 - 4:00 pm

Afternoon Break

4:00 - 5:00 pm

Panel Interview: The Future of AIOps: From Reactive to Autonomous IT Operations

This panel explores how AIOps is evolving from simple automation and monitoring to enabling predictive, proactive, and ultimately autonomous IT operations. Experts from AI, DevOps, cloud services, and enterprise IT will discuss the transformative impact of AIOps on modern IT environments, the challenges in adoption, and what the next 5–10 years hold for the intersection of AI and Ops.

Moderator: Mathew John, Microsoft

Panelists: Zhangwei Xu, Microsoft; Abdelwahab Hamou-Lhadj, Concordia University; Weiyi Shang, University of Waterloo; Andriy Miranskyy, Toronto Metropolitan University