Cloud Intelligence / AIOps Workshop Program

September 1st, 2022
Venue: Santa Clara Convention Center
9:00 - 9:10am

Opening

9:10 - 10:00am

Keynote: Advances in ML for Systems in Microsoft Azure

Ricardo Bianchini, Microsoft Research

Bio: Dr. Ricardo Bianchini is a Distinguished Engineer at Microsoft, where he leads the team responsible for managing Azure's computing capacity and datacenter infrastructure with a strong emphasis on efficiency and sustainability. Before joining Azure in 2022, he led the Systems Research Group and the Cloud Efficiency team at Microsoft Research. During that time, he collaborated closely with Azure to create and deploy Resource Central, an ML and prediction-serving system that provides intelligence to other Azure components. He is an ACM Fellow and an IEEE Fellow.

Abstract: In this talk, I will describe our vision, experience, and latest advances in ML for Systems in Azure. Among other topics, I will discuss our latest research and production experience in infusing ML into cloud platforms and services.

Bio: Joseph Gonzalez is a founding member of the UC Berkeley Sky Computing Lab and the RISE Lab, where he studies the design of future cloud architectures and machine learning systems. He is also a member of the Berkeley AI Research Group, where he works on new neural architectures for computer vision, natural language processing, and robotics. Gonzalez's research addresses problems in neural network design, efficient inference, computer vision, prediction serving, autonomous vehicles, graph analytics, and distributed systems. Building on his research, Gonzalez co-founded Aqueduct to commercialize a radically simpler production data science platform. Finally, Gonzalez helped develop the Data Science program at UC Berkeley and co-created Data100, which is now taught to over 1500 students a semester.

Prior to joining Berkeley, Gonzalez co-founded Turi Inc. (formerly GraphLab) based on his thesis work and created the GraphX project (now part of Apache Spark). Gonzalez's innovative work has earned him significant recognition, including the Okawa Research Grant, the NSF Early Career Award, and the NSF Expedition Award.

Abstract: Over the past decade, I have worked on projects ranging from the early graph processing frameworks (GraphLab) and distributed data processing frameworks (Apache Spark) to more recent large-scale systems for machine learning and data processing (Clipper, Ray, CloudBurst, and Modin). I have seen the rise (and fall) of various ML systems and the importance of data and data systems driving the field forward. In this talk, I will present the evolution of prediction serving systems, what we got wrong, and where things are headed. I will introduce our new work on feature stores and try to explain why they exist in the first place. I will then conclude by presenting a new vision for the future of cloud computing, one in which we attempt to defy data gravity and disrupt the economics of the cloud.

10:30 - 11:00am

Break

A Survey of Multi-Tenant Deep Learning Inference on GPU

Fuxun Yu, Yongbo Yu (George Mason University); Di Wang (Microsoft); Minjia Zhang (Microsoft AI and Research); Longfei Shangguan (Microsoft); Chenchen Liu (University of Maryland, Baltimore County); Tolga Soyata (George Mason University); Xiang Chen (George Mason University)

CWP: A Machine Learning-based Approach to Detect Unknown Cloud Workloads

Derssie Mebratu, Mohammad Hossain, Niranjan Hasabnis, Jun Jin, Gaurav Chaudhary, Noah Shen (Intel)

Multi-level Explanation of Deep Reinforcement Learning-based Scheduling

Shaojun Zhang (The University of Sydney); Chen Wang (Data61, CSIRO); Albert Zomaya (The University of Sydney)

11:45 - 1:00pm

Lunch break

Bio: Martin Maas is a Staff Research Scientist at Google Research and part of the Brain team. His research interests are in language runtimes, computer architecture, systems, and machine learning, with a focus on applying machine learning to systems problems. Before joining Google, Martin completed his PhD in Computer Science at the University of California, Berkeley, where he worked on hardware support for managed languages and architectural support for memory-trace obliviousness.

Abstract: Machine learning has the potential to significantly improve computer systems. While recent research in this area has shown great promise, not all problems are equally well-suited for applying ML techniques, and some remaining challenges have prevented wider adoption of ML techniques in systems. In this talk, I will introduce a taxonomy to classify machine learning for systems approaches, discuss how to identify cases that are a good fit for machine learning, and lay out a longer-term vision of how we can improve systems using ML techniques, ranging from computer architecture to language runtimes.

Auto-scaling for Spot and On-demand VM Mixture

Fangkai Yang, Bo Qiao, Eli Cortez, Inigo Goiri, Chetan Bansal, Si Qin, Victor Rühle, Qingwei Lin, Dongmei Zhang (Microsoft)

LOGIC: Log Intelligence in Cloud

Lingling Zheng (Microsoft); Xu Zhang (Microsoft Research); Ze Li, Cong Chen (Microsoft); Shilin He, Liqun Li (Microsoft Research); Yu Kang (Microsoft Research Asia); Yudong Liu (Microsoft); Qingwei Lin (Microsoft Research); Yingnong Dang, Murali Chintalapati (Microsoft)
2:00 - 2:30pm

Break

2:30 - 3:00pm

Closing Keynote: ML for Building an Efficient Cloud

Neeraja J. Yadwadkar, University of Texas at Austin

Bio: Neeraja is an assistant professor in the Department of Electrical and Computer Engineering at UT Austin. She is a cloud computing systems researcher with a strong background in machine learning (ML). Most of her research straddles the boundary between systems and ML: using and developing ML techniques for systems, and building systems for ML. Before joining UT Austin, she was a postdoctoral fellow in the Computer Science department at Stanford University; before that, she received her PhD in Computer Science from UC Berkeley. She previously earned a bachelor's degree in Computer Engineering from the Government College of Engineering, Pune, India.

Abstract: The variety of user workloads, application requirements, heterogeneous hardware resources, and large number of management tasks have rendered today's cloud fairly complex. Recent work has shown promise in using machine learning for efficient resource management in such dynamically changing cloud execution environments. These approaches range from offline to online learning agents. In this talk, I will focus on the challenges that arise when building such agents and when deploying them in real systems. To do so, I will use some of my work, Wrangler, PARIS, and SmartHarvest, as examples. I will talk about my experience and draw attention to two key questions behind robust solutions in this context: first, how should we formulate a problem, and second, how and where should we deploy a model?

3:00 - 3:30pm

Break

3:30 - 4:30pm

Panel: AIOps: Challenges and Opportunities

Moderator: Christina Delimitrou, Cornell University
Panelists: Daniel O'Neill, Stanford University; Neeraja J. Yadwadkar, University of Texas at Austin; Dan Crankshaw, Microsoft; Martin Maas, Google