Cloud Intelligence Workshop Program

Schedule: 11:00am - 7:30pm CET on May 29th, 2021
Location: Virtual

11:00 - 11:05am
Opening Remarks

11:05 - 11:35am
Keynote: From Software Analytics to Cloud Intelligence - Reflection and Path Forward

Dongmei Zhang, Microsoft Research Asia

Dr. Dongmei Zhang is the Assistant Managing Director of Microsoft Research Asia (MSRA), where she leads research in Data, Knowledge, and Intelligence (DKI), an interdisciplinary area spanning data intelligence, knowledge computing, information visualization, and software engineering. Dr. Zhang founded the Software Analytics Group at MSRA in 2009 and has led the group's research on software analytics technologies ever since. In addition to conducting state-of-the-art research, her group collaborates closely with multiple product teams at Microsoft, and has developed and deployed software analytics tools that have created high business impact. In recent years, Dr. Zhang and her teams have expanded their impact into the business intelligence area, and helped Microsoft establish technology leadership in the direction of Smart Data Discovery.

Talk Abstract
Software Analytics focuses on utilizing data-driven approaches to help improve the quality of software systems, the user experience of interacting with software systems, and the productivity of the software development process. Software Analytics has been an important research area in the software engineering community for more than a decade, and it has already made a broad impact in the software industry. As the computing paradigm shifted towards cloud computing, we started to focus our Software Analytics research on cloud computing and created the research topic Cloud Intelligence. Cloud Intelligence aims to utilize AI/ML technologies to help design, build, and operate high-quality and high-efficiency cloud systems at scale. Due to the distributed nature, great complexity, and enormous scale of cloud systems, Cloud Intelligence presents unique challenges and opportunities for Software Analytics research. In this talk, I will first introduce the research landscape of Software Analytics and Cloud Intelligence. Then, using a couple of projects as examples, I will talk about our research on Cloud Intelligence and its impact, as well as the research challenges and opportunities in Cloud Intelligence moving forward.

11:35 - 11:55am
Invited Talk #1: Towards Autonomous IT Operations through Artificial Intelligence

Dan Pei, Tsinghua University

Dr. Dan Pei is a tenured Associate Professor in the Computer Science Department at Tsinghua University in Beijing, China. Before joining Tsinghua, Dr. Pei was a Principal Researcher at AT&T Research in the U.S. He received his PhD degree (with honor) from UCLA in 2005, and his Bachelor's and Master's degrees from Tsinghua University in 1997 and 2000. His current research interests are in AIOps (Artificial Intelligence for Operations). He is an ACM Senior Member and an IEEE Senior Member.

Title: Towards Autonomous IT Operations through Artificial Intelligence
IT operations are important for ensuring that our increasingly digitalized world runs reliably, efficiently, and safely, yet current IT operations are labor-intensive, very stressful, and ineffective. In this talk, I will first explain why AI is necessary for IT operations. I will then present a suite of unsupervised anomaly detection algorithms for time series, log, and trace data, followed by the lessons learned. I will conclude my talk by envisioning that future IT operations will become more and more autonomous, which will enable an even better digitalized world.

11:55am - 12:55pm
Technical paper session #1 (Detection and Diagnosis)

4 papers × 15 mins = 60 mins

Kmon: An In-kernel Transparent Monitoring System for Microservice Systems with eBPF

Tianjun Weng, Wanqi Yang, Guangba Yu, Pengfei Chen, Jieqi Cui, Chuanfu Zhang (Sun Yat-sen University)

PerfEstimator: A Generic and Extensible Performance Estimator for Data Parallel DNN Training

Chengru Yang, Zhehao Li, Chaoyi Ruan, Guanbin Xu, Cheng Li (University of Science and Technology of China); Ruichuan Chen (Nokia Bell Labs); Feng Yan (University of Nevada)

TraceLingo: Trace representation and learning for performance issue diagnosis in cloud services

Yong Xu, Yaokang Zhu, Bo Qiao, Hongshu Che, Pu Zhao, Xu Zhang (Microsoft Research); Ze Li, Yingnong Dang (Microsoft Azure); Qingwei Lin (Microsoft Research)

MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems

Li Wu (Elastisys AB; Technische Universität Berlin); Johan Tordsson (Elastisys AB; Umeå University); Jasmin Bogatinovski (Technische Universität Berlin); Erik Elmroth (Elastisys AB; Umeå University); Odej Kao (Technische Universität Berlin)

12:55 - 1:55pm
Break

1:55 - 2:15pm
Invited Talk #2: Large-Scale Trace Analysis for Microservice Architecture Understanding and Fault Analysis

Xin Peng, Fudan University

Xin Peng received his bachelor’s and PhD degrees in computer science from Fudan University in 2001 and 2006, respectively. He is a professor at the School of Computer Science, Fudan University, China. His research interests include data-driven intelligent software development, cloud-native software and AIOps, and software engineering for AI and cyber-physical-social systems. His work won the ICSM 2011 Best Paper Award, the ACM SIGSOFT Distinguished Paper Award at ASE 2018, the IEEE TCSE Distinguished Paper Awards at ICSME 2018/2019/2020, and the IEEE Transactions on Software Engineering 2018 Best Paper Award. He was a steering committee member of the International Conference on Software Maintenance and Evolution (ICSME) from 2017 to 2020. He is currently a co-editor of the Journal of Software: Evolution and Process (JSEP) and an editorial board member of ACM Transactions on Software Engineering and Methodology (TOSEM), Empirical Software Engineering (EMSE), and the Chinese Journal of Software.

Title: Large-Scale Trace Analysis for Microservice Architecture Understanding and Fault Analysis
Operations engineers and developers rely heavily on trace analysis to understand architectures and diagnose various problems such as service failures and quality degradation. However, the huge number of traces produced at runtime makes it challenging to capture the required information in real time. In this talk, I will present two recent works, GMTA and MicroHECL, developed in collaboration with industrial partners, on large-scale trace analysis for microservice architecture understanding and fault analysis. Built on a graph-based representation, GMTA abstracts traces into different paths, groups the paths into business flows, and supports various analytical applications through an efficient storage and access mechanism. MicroHECL is a highly efficient root cause localization approach for availability issues of microservice systems. It analyzes possible anomaly propagation chains and ranks candidate root causes based on correlation analysis. Both works have been applied in the production systems of our industrial partners.

2:15 - 3:00pm
Technical paper session #2 (Monitoring and prediction)

3 papers × 15 mins = 45 mins

Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models

Jasmin Bogatinovski, Harald Ott, Alexander Acker, Sasho Nedelkoski, Odej Kao (Technische Universität Berlin)

Rapid Trend Prediction for Large-Scale Cloud Database KPIs by Clustering

Xiaoling Wang, Ning Li, Lijun Zhang, Xiaofang Zhang (Northwestern Polytechnical University); Qiong Zhao (Bank of Communications)

Learning Dependencies in Distributed Cloud Applications to Identify and Localize Anomalies

Dominik Scheinert, Alexander Acker, Lauritz Thamsen, Morgan K. Geldenhuys, Odej Kao (Technische Universität Berlin)

3:00 - 4:00pm
Panel #1 (Europe + Asia):

How to accelerate collaboration on Cloud Intelligence across academia and industry?

Organizer: Xin Peng, Fudan University

Ranjita Bhagwan, Microsoft Research India

Dr. Ranjita Bhagwan is a Senior Principal Researcher at Microsoft Research India. Her research predominantly focuses on problems related to networked and distributed systems. Ranjita has worked for more than a decade on applying machine learning to improve system reliability, security, and performance. She is the recipient of the 2020 ACM India Outstanding Contributions to Computing by a Woman Award. She has also chaired multiple top conferences in the field of systems and networking. Ranjita received her PhD and MS in Computer Engineering from the University of California, San Diego and a BTech in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur.

Odej Kao, Technische Universität Berlin

Odej Kao is a full professor at Technische Universität Berlin, head of the research group on distributed and operating systems, chairman of the Einstein Center Digital Future, and chairman of the national research network board. Dr. Kao is a graduate of TU Clausthal. His research interests include AIOps, big data / streaming analytics, cloud computing, and fault tolerance. He has published over 350 papers in peer-reviewed proceedings and journals.

Shan Lu, University of Chicago

Shan Lu is a Professor in the Department of Computer Science at the University of Chicago. Her research focuses on software reliability and efficiency. Shan is an ACM Distinguished Member (2019 class) and an Alfred P. Sloan Research Fellow (2014). Shan has served as the Chair of ACM-SIGOPS (since 2019) and as technical program co-chair for ASPLOS'22, OSDI'20, and USENIX ATC'15. Her co-authored papers won Best Paper Awards at SOSP'19, OSDI'16, and FAST'13, SIGSOFT Distinguished Paper Awards at ICSE'19, ICSE'15, and FSE'14, a SIGPLAN Research Highlight Award at PLDI'11, an IEEE Micro Top Picks in ASPLOS'06, a CHI Honorable Mention Award 2021, and a Google Scholar Classic Paper 2017.

Dan Pei, Tsinghua University

Hongyu Zhang, University of Newcastle

Hongyu Zhang is interested in software analytics, intelligent software engineering, AIOps, software maintenance, and software quality. The main theme of his research is to improve software quality and developer productivity by mining software data. He has published more than 160 research papers in leading international journals and conferences. He received four ACM Distinguished Paper awards and three best paper awards. He has also served as a program committee member/track chair for many international conferences. He is a Distinguished Member of ACM, and a Fellow of Engineers Australia (FIEAust). More information about him can be found at: https://sites.google.com/site/hongyujohn.

4:00 - 4:30pm
Break

4:30 - 5:20pm
Project showcase session

4 papers × 12 mins = 48 mins

Building a Secured Data Intelligence Platform

Conan Yang (Salesforce)

Infusing ML into VM Provisioning in Cloud

Randolph Yao (Microsoft Azure); Chuan Luo, Bo Qiao, Qingwei Lin (Microsoft Research); Tri M Tran, Gil Shafriri, Yingnong Dang, Raphael Ghelman, Pulak Goyal, Eli Cortez, Daud Howlader, Sushant Rewaskar, Murali Chintalapati (Microsoft Azure); Dongmei Zhang (Microsoft Research)

F3: Fault Forecasting Framework for Cloud Systems

Pu Zhao, Chuan Luo, Bo Qiao (Microsoft Research); Youjiang Wu, Yingnong Dang, Murali Chintalapati (Microsoft Azure); Susy Yi, Paul Wang, Andrew Zhou, Saravanakumar Rajmohan (Microsoft 365); Qingwei Lin, Dongmei Zhang (Microsoft Research)

SEAT: statistically sound infra-side deployment and integration testing

Nutcha Temiyasathit, Tao Yang, Karan Luthra, Nick Ruff, Petar Zuljevic, Ethan Benowitz, Boris Baracaldo, Oytun Eskiyeneturk, Xin Fu (Facebook)

5:20 - 5:40pm
Invited Talk #3: Leveraging ML to Handle the Increasing Complexity of the Cloud

Christina Delimitrou, Cornell University

Christina Delimitrou is an Assistant Professor and the John and Norma Balen Sesquicentennial Faculty Fellow at Cornell University, where she works on computer architecture and computer systems. She specifically focuses on improving the performance predictability and resource efficiency of large-scale cloud infrastructures by revisiting the way these systems are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Scholarship, two Google Research Awards, and a Facebook Faculty Research Award. Her work has also received 4 IEEE Micro Top Picks awards and several best paper awards. Before joining Cornell, Christina received her PhD from Stanford University. She had previously earned an MS also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at: http://www.csl.cornell.edu/~delimitrou/

Title: Leveraging ML to Handle the Increasing Complexity of the Cloud
Cloud services are increasingly adopting new programming models, such as microservices and serverless compute. While these frameworks offer several advantages, such as better modularity, ease of maintenance and deployment, they also introduce new hardware and software challenges.

In this talk, I will briefly discuss the challenges that these new cloud models introduce in hardware and software, and present some of our work on employing ML to improve the cloud’s performance predictability and resource efficiency. I will first discuss Seer, a performance debugging system that identifies root causes of unpredictable performance in multi-tier interactive microservices, and Sage, which improves on Seer by taking a completely unsupervised learning approach to data-driven performance debugging, making it both practical and scalable.

5:40 - 6:00pm
Invited Talk #4: AIOps: Automating and Optimizing IT Operations Management with AI

Rama Akkiraju, IBM

Rama Akkiraju is an IBM Fellow, Master Inventor, and IBM Academy Member at IBM’s Automation Division, where she is the CTO of AI Operations. AI Operations is about optimizing information technology (IT) operations management using Artificial Intelligence (AI). Prior to this role, Rama led the AI mission of enabling natural, personalized, and compassionate conversations between computers and humans, where she and her team developed and delivered several differentiating AI services such as Personality Insights, Tone Analyzer, Emotion Analysis, and Sentiment Analysis to the IBM Watson platform. Before this, Rama led various projects and research teams at IBM Watson Research Center and IBM Almaden Research Center in the areas of AI, analytics, and business process optimization, and delivered innovative analytical assets to IBM’s Global Business Services (GBS) and Global Technology Services (GTS) organizations. Rama was named by Forbes as one of the ‘Top 20 Women in AI Research’ in May 2017, was featured in ‘A-Team in AI’ by Fortune magazine in July 2018, and was named among the ‘Top 10 pioneering women in AI and Machine Learning’ by Enterprise Management 360 in April 2019. Rama is also the recipient of the University of California, Berkeley’s Athena award for Technical and Executive Leadership for 2020.

In her career, Rama has worked on agent-based decision support systems, business process management, electronic marketplaces, and semantic Web services, for which she led a World Wide Web Consortium (W3C) standard. Rama has co-authored 4 book chapters and over 100 technical papers. Rama has 30+ issued patents and 25+ pending. She is the recipient of 4 best paper awards in AI and Operations Research. Rama served as the President of ISSIP, a Service Science professional society, for 2018, and continues to actively drive AI projects through this professional society. Rama holds a Master’s degree in Computer Science and received a gold medal from New York University for the highest academic excellence in her MBA.

Talk Abstract
The vision of self-aware, self-healing, and self-managing Information Technology (IT) systems remained elusive until recently. Recent advancements in Cloud computing, Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) in general are making it possible to realize this vision now. AI can optimize IT operations management processes by increasing application availability, predicting and detecting problems early, reducing the time it takes to resolve problems, proactively avoiding problems, and optimizing the resources and cost of running business applications on hybrid Clouds. In this talk, I will discuss the opportunity for AI to optimize IT Operations Management. I will describe how semi-structured application and infrastructure logs can be analyzed to predict anomalies early, how entities can be extracted and linked from logs, alerts, and events to reduce alert noise for IT operations admins, how NLP can be put to use on unstructured content in prior incident tickets to extract next-best-action recommendations to resolve problems, and how deployment change request descriptions can be analyzed in combination with past incident root cause information to predict risks of deployment changes and prevent issues from happening in the first place.

6:00 - 6:30pm
Break

6:30 - 7:30pm
Panel #2 (Europe + North America):

Cloud Intelligence Across Academia and Industry

Organizer: Siddhartha Sen, Microsoft Research New York

Panelists:

Rama Akkiraju, IBM

Christina Delimitrou, Cornell University

Jim Kleewein, Microsoft

In his more than 15 years at Microsoft, Kleewein has worked on Office servers and services. He began by working on the lower layers of the Exchange Server stack, such as indexing, data storage, high availability, disaster recovery, and other bits of deep plumbing; as Office moved to the cloud, so did he. He helped establish the architecture and implementation of O365 as both a hosted service and a self-hosted piece of enterprise software, started new technology initiatives such as the project that eventually became the Microsoft Graph and the advanced machine learning efforts that enable user action prediction like Focused Inbox, and regularly carried the ‘incident manager’ pager, helping pragmatically assure the service was running correctly. Kleewein is listed as an inventor on dozens of patents in the areas of electronic communications, databases, data analytics, and hyper-scale cloud services. He is currently director of development for the core O365 hosted services, leading the worldwide team responsible for delivering Office productivity services to hundreds of millions of people every hour of every day.

Martin Maas, Google

Neeraja Yadwadkar, Stanford University

Neeraja Yadwadkar is a post-doctoral research fellow in the Computer Science Department at Stanford University, working with Christos Kozyrakis. She is a Cloud Computing Systems researcher with a strong background in Machine Learning (ML). Neeraja's research focuses on using and developing ML techniques for systems, and building systems for ML. Neeraja graduated with a PhD in Computer Science from the RISE Lab at the University of California, Berkeley, where she was advised by Randy Katz and Joseph Gonzalez. Before starting her PhD, she received her master's in Computer Science from the Indian Institute of Science, Bangalore, India, and her bachelor's from the Government College of Engineering, Pune, India.