Cloud Intelligence Workshop Program

Schedule: 8am - 5pm on Feb. 7th, 2020
Location: New York, NY
9:00 - 9:10am
Opening Remarks
9:10 - 10:00am
Marcus Fontoura joined by Murali Chintalapati and Yingnong Dang, Microsoft Azure

Services that rely heavily on Artificial Intelligence (AI), such as speech understanding and image recognition, have been receiving an enormous amount of attention. Meanwhile, the ever-increasing scale and complexity of the cloud platform itself call for leveraging more AI to build and manage the platform, in order to deliver highly efficient and reliable cloud services, enable high customer satisfaction, and achieve high engineering productivity. In this talk, I will first share our vision of infusing AI into the Azure platform and DevOps process, which we call the intelligent cloud platform and AIOps. I will then give a brief overview of our AIOps efforts. I will also use Resource Central as a case study. Resource Central (RC) is a novel machine learning and prediction-serving system for improving cloud resource management. We are placing RC right at the center of Microsoft Azure. To conclude, I will discuss some lessons from deploying such research efforts in production and how they relate to academic research.

Cloud Intelligence Keynote Slide Deck

Dr. Marcus Fontoura is currently a Technical Fellow at Microsoft, where he works as chief architect for Azure compute. In his previous roles at Microsoft, Marcus worked on the production infrastructure for Bing and on several Bing Ads projects. Prior to Microsoft, Marcus worked as staff research scientist and researcher at Google, Yahoo, and IBM Almaden Research Center.

Marcus is an ACM Distinguished Member and an IEEE Senior Member. He has more than 25 issued patents (and many others filed) and more than 50 published papers. He has served on several program committees over the years, including SIGIR, WWW, WSDM, KDD, and CIKM. You can find more about him at his personal website Fontoura.org.

Murali Chintalapati is a partner group engineering manager at Microsoft Azure. His team develops end-to-end AI and analytics solutions for Azure Core. Their work spans infrastructure availability and reliability as well as performance. Their solutions enable early detection of issues and fast recovery, as well as prediction and proactive mitigation.

Yingnong Dang is a Principal Data Scientist Manager at Microsoft Azure. Yingnong’s focus at Azure is building innovative analytics and ML solutions for improving Azure Infrastructure quality and efficiency, boosting engineering productivity, and increasing customer satisfaction. Before joining Azure in December 2013, Yingnong was a researcher in the Microsoft Research Asia lab. His research areas include software analytics, data visualization, data mining, and human-computer interaction. He holds 45+ U.S. patents and has published papers in top conferences including ICSE, FSE, VLDB, USENIX ATC, and NSDI. Yingnong has a Ph.D. degree from Xi’an Jiaotong University, China.

Software Analytics focuses on utilizing data-driven approaches to help improve the quality of software systems, the user experience of interacting with software systems, and the productivity of the software development process. Software analytics has been an important research area in the software engineering community for more than a decade. It has already made a broad impact on the software industry.

In the era of cloud computing, the entire software stack, ranging from user-facing experiences to fundamental storage and computing platforms, is often manifested as cloud services. Due to their distributed nature, great complexity, and enormous scale, cloud services pose unique challenges and opportunities for software analytics research. In particular, AIOps and AI for Software Systems are two emerging topics that both researchers and practitioners have been actively working on in recent years.

In this talk, I will first introduce the research landscape of software analytics and cloud intelligence. Then, using a couple of projects as examples, I will talk about our research and impact in software analytics, as well as our experiences working with different product teams on joint innovations across Microsoft. I will also discuss the research challenges and opportunities in cloud intelligence moving forward.

Dr. Dongmei Zhang is the Assistant Managing Director of Microsoft Research Asia (MSRA), leading research in Data, Knowledge and Intelligence (DKI), an interdisciplinary area spanning data intelligence, knowledge computing, information visualization, and software engineering. Dr. Zhang founded the Software Analytics Group in MSRA in 2009. Since then she has been leading the group to research software analytics technologies. In addition to conducting state-of-the-art research, her group collaborates closely with multiple product teams in Microsoft, and has developed and deployed software analytics tools that have created high business impact. In recent years, Dr. Zhang and her teams have expanded the impact into the business intelligence area, and helped Microsoft establish technology leadership in the direction of Smart Data Discovery.

10:30 - 11:00am
Coffee Break - Group Photo
11:00am - 12:00pm

HotspotRank: Hotspot Detection in Large-Scale Microservice Architecture

Yi Zhen (LinkedIn Corporation)*; Yung-Yu Chung (LinkedIn); Yang Yang (LinkedIn Corporation); Lei Zhang (LinkedIn Corporation); Ruoying Wang (LinkedIn); Bo Long (LinkedIn Corporation); Tie Wang (LinkedIn Corporation); Pranay Kanwar (LinkedIn Corporation); Dong Wang (LinkedIn Corporation); Mike Snow (LinkedIn Corporation); Sanket Patel (LinkedIn Corporation); Stephen Bisordi (LinkedIn Corporation); Viji Nair (LinkedIn Corporation)

Moving Metric Detection and Alerting System at eBay [PDF]

Zezhong Zhang (eBay Inc)*; Keyu Nie (eBay Inc); Tao Yuan (eBay Inc)

CapPredictor: A Capacity Headroom Prediction Framework in Cloud [PDF]

Ruoying Wang (LinkedIn)*; Lei Zhang (LinkedIn Corporation); Yang Yang (LinkedIn Corporation); Yi Zhen (LinkedIn Corporation); Bo Long (LinkedIn Corporation); Tie Wang (LinkedIn Corporation); Vinoth Govindaraj (LinkedIn Corporation); Todd Palino (LinkedIn Corporation); Samir Tata (LinkedIn Corporation); Viji Nair (LinkedIn Corporation)

12:00 - 1:00pm
Lunch

All systems and applications are composed from basic data structures and algorithms, such as index structures, priority queues, and sorting algorithms. Most of these primitives have been around since the early beginnings of computer science (CS) and form the basis for every CS intro lecture. Yet we might be at an inflection point. A recent result by my group shows that machine learning has the potential to significantly alter the way those primitives are implemented and the performance they can provide.

In this talk, I will use index structures, such as B-Trees, Hash-Maps, and Bloom-Filters, as an example to explain the intuition behind learned data structures and algorithms, and outline opportunities and existing research challenges for using this technology in practice.

Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory and co-director of the Data System and AI Lab at MIT (DSAIL@CSAIL). Currently, his research focuses on building systems for machine learning, and using machine learning for systems.

Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Brain, and was a PostDoc in the AMPLab at UC Berkeley after he got his PhD from ETH Zurich. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and received several awards including the 2018 VLDB Early Career Research Contribution Award, the 2017 VMware Systems Research Award, an NSF CAREER Award, as well as several best paper and demo awards at VLDB and ICDE.

Cloud computing infrastructure is becoming ubiquitous worldwide. With the rapid growth of digitization and IoT devices, the need for large-scale Cloud infrastructure keeps increasing, which presents greater challenges to its management and operational efficiency. At Alibaba Cloud Intelligence, we focus on using data and the very best techniques that Cloud enables, such as AI algorithms, to manage the Cloud infrastructure itself in an autonomous fashion. In this talk, we give an overview of the top issues Cloud infrastructure operation is facing. Then we share some recent progress on specific topics such as Cloud resource capacity planning, fast datacenter anomaly detection, hardware failure prediction, and cluster-level self-healing.

Wendy Zhao is a Principal Engineer and Senior Director of Engineering at Alibaba Cloud Intelligence Business Group. She leads the Cloud infrastructure intelligent operation automation and system platforms team in both the US and China. Prior to joining Alibaba, she worked in Google’s Technical Infrastructure organization for nearly 10 years. Her technical career has touched many different areas, from hardware component/subsystem technologies to digital system design and integration, from datacenter cluster innovation to backbone network architecture, from content distribution network to infrastructure business process and operation automation platform design. She received her bachelor of science degree from Peking University, and her PhD in Electrical Engineering from the University of California at Berkeley. She holds numerous research publications and more than 10 patents.

2:00 - 3:00pm

AIOps Innovations of Incident Management for Cloud Services [PDF]

Zhuangbin Chen (The Chinese University of Hong Kong)*; Yu Kang (MSRA); Feng Gao (Microsoft, Redmond); Li Yang (Microsoft, Redmond); Jeffery Sun (Microsoft, Redmond); Zhangwei Xu (Microsoft, Redmond); Pu Zhao (Microsoft Research); Bo Qiao (Microsoft Research); Liqun Li (Microsoft Research); Xu Zhang (Microsoft Research); Qingwei Lin (Microsoft Research); Michael Lyu (The Chinese University of Hong Kong)

Batch Job Run Time Prediction for Auto-Scaling in the Cloud [PDF]

Minghua Ma (Tsinghua University)*; Christopher Zheng (McGill University); Junjie Chen (Tianjin University); Yilin Li (Tsinghua University); Xiao Peng (China EverBright Bank); Gang Wang (China EverBright Bank); Yong Wu (China EverBright Bank); Fang Zhou (China EverBright Bank); Wenchi Zhang (China EverBright Bank); Kaixin Sui (Bizseer Technology); Dan Pei (Tsinghua University)

Intelligent Cyber-attack Defense System using Virtual Honeynets for Cloud Security [PDF]

Jargalsaikhan Narantuya (Gwangju Institute of Science and Technology); Jiwon Yang (GIST); Jongwon Kim (GIST); Hyuk Lim (Gwangju Institute of Science and Technology)*

Scaling Performance Issue Detection and Diagnosis in Cloud Infrastructures [PDF]

Yigong Hu (Johns Hopkins University)*; Ze Li (Microsoft); Peng Huang (Johns Hopkins University); Suhas Pinnamaneni (Microsoft); Francis David (Microsoft); Yingnong Dang (Microsoft, USA); Murali Chintalapati (Microsoft)

Prediction-guided Design for Software Systems [PDF]

Si Qin (Microsoft Research)*; Yong Xu (Microsoft Research); Shandan Zhou (Microsoft, USA); Qingwei Lin (Microsoft Research); Thomas Moscibroda (Microsoft, USA); Hongyu Zhang (University of Newcastle); Saurabh Agarwal (Microsoft Azure); Karthikeyan Subramanian (Microsoft Azure); Eli Cortez (Microsoft Research); John Miller (Microsoft Azure); Chris Cowdery (Microsoft Azure); Shanti Kemburu (Microsoft Azure); Dongmei Zhang (Microsoft Research)

Salesforce is the world's #1 customer relationship management (CRM) platform. Salesforce Trusted Infrastructure is the foundation of our services. In this talk, we'll share the Salesforce infrastructure data science approach to augmenting and enhancing the efficiency of data center operations with an interpretable ML model.

Dr. Elena Novakovskaia is a principal data scientist at Salesforce in the infrastructure organization that manages data centers worldwide, supporting customer traffic at scale and providing business intelligence to ensure stability of operations as well as their optimal lifecycle performance. She works with interdisciplinary teams developing innovative and intelligent software tools for reliable hardware operations. Dr. Novakovskaia holds a Ph.D. degree in Computational Sciences and Informatics from George Mason University. Prior to Salesforce, she worked at NASA, IBM, and other technology companies, focusing on data science and modeling projects with numerous publications and patents. Areas of her professional interest also include business intelligence, resource management optimization, and sustainability.

3:30 - 4:00pm
Coffee Break

Panelists

Asaf recently joined Columbia University as an Assistant Professor of Electrical Engineering and Computer Science (jointly affiliated) and as a member of Columbia's Data Science Institute. His current primary research focus is on how to use machine learning to build adaptive networked systems. Before joining Columbia, Asaf was a Senior Vice President at Barracuda Networks, where he led the development of several machine learning-based security products. He joined Barracuda via the acquisition of his startup, Sookasa. Asaf completed his PhD at Stanford University and is a recipient of paper awards at USENIX Security and USENIX ATC, as well as SC Media's Rising Star award.

Siddhartha Sen is a Principal Researcher in the Microsoft Research New York City lab, and previously a researcher in the MSR Silicon Valley lab. He uses data structures, algorithms, and machine learning to build more powerful distributed systems. His current mission is to optimize cloud infrastructure decisions in a way that is minimally disruptive, synergistic with human solutions, and safe. Siddhartha received his BS degrees in computer science and mathematics and his MEng degree in computer science from MIT. From 2004 to 2007 he worked as a developer at Microsoft and built a network load balancer for Windows Server. He returned to academia and completed his PhD at Princeton University in 2013. Siddhartha received the inaugural Google Fellowship in Fault-Tolerant Computing in 2009, the best student paper award at PODC 2012, and the best paper award at ASPLOS 2017.

Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory and co-director of the Data System and AI Lab at MIT (DSAIL@CSAIL). Currently, his research focuses on building systems for machine learning, and using machine learning for systems.

Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Brain, and was a PostDoc in the AMPLab at UC Berkeley after he got his PhD from ETH Zurich. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and received several awards including the 2018 VLDB Early Career Research Contribution Award, the 2017 VMware Systems Research Award, an NSF CAREER Award, as well as several best paper and demo awards at VLDB and ICDE.

Wendy Zhao is a Principal Engineer and Senior Director of Engineering at Alibaba Cloud Intelligence Business Group. She leads the Cloud infrastructure intelligent operation automation and system platforms team in both the US and China. Prior to joining Alibaba, she worked in Google’s Technical Infrastructure organization for nearly 10 years. Her technical career has touched many different areas, from hardware component/subsystem technologies to digital system design and integration, from datacenter cluster innovation to backbone network architecture, from content distribution network to infrastructure business process and operation automation platform design. She received her bachelor of science degree from Peking University, and her PhD in Electrical Engineering from the University of California at Berkeley. She holds numerous research publications and more than 10 patents.

Dr. Xiaolin Andy Li is a Partner of Tongdun Technology, heading the AI Institute. With the vision of intelligent decision-making and analytics, the Tongdun AI Institute has a mission to make fundamental breakthroughs in AI, transform industries with cutting-edge AI technologies, and empower its clients with the best AI solutions, products, and services. The AI Institute is composed of research labs in deep learning, federated learning, reinforcement learning, computer vision, natural language processing, intelligent interaction, blockchain, and AI operating systems. AI Labs in the USA (Palo Alto Square) and Singapore are being established. Dr. Li was a Professor in Electrical and Computer Engineering at the University of Florida and the founding center director of the NSF Center for Big Learning (CBL), the first national center on deep learning in the USA, with four academic sites and dozens of industrial members. He received the NSF CAREER Award, the Internet2 Innovative Application Award, the NSF I-Corps Top Team Award, the Top Team Award (DeepBipolar) in the CAGI Challenge, and Best Paper Awards (IEEE ICMLA 2016, IEEE SECON 2016, ACM CAC 2013, and IEEE UbiSafe 2007). He has published over 150 peer-reviewed papers. His research interests include deep learning, cloud computing, security & privacy, IoT, precision medicine, and FinTech.

Moderator

Igal is focused on creating the best Cloud Compute platform in the industry. He manages the Fundamentals team of Azure Compute, which covers high availability, reliability, manageability, performance, scalability, security, and supportability of the Cloud Infrastructure as well as the IaaS and PaaS compute experiences on top of it. Igal’s team drives cross-team initiatives across Microsoft, encompassing customer app and infrastructure high availability and reliability, customer app lifecycle and performance, and platform scale and reliability. The team’s goal is to drive the creation and optimization of the right customer experiences in a competitive, world-class services platform. To that end, Igal and his team apply data science and ML techniques to create a self-managed and self-operated distributed platform that optimizes itself for the defined customer experiences and scenarios, with a heavy focus on generating insights, predicting customer and platform behaviors, performing failure recovery and blast radius reduction, optimizing cost and performance, and improving overall customer satisfaction. Igal is interested in partnering across industry and academia to advance solutions and methodologies for applying ML techniques in large-scale distributed systems to optimize all the Fundamentals aspects of the Cloud beyond what is feasible with manual operations, creating a Cloud that operates itself the way its users want it to.

Previously, Igal held various roles in the Azure and Security divisions at Microsoft, focusing on Azure Fabric Controller distributed resource management, datacenter management, and enterprise-grade distributed security challenges. Before that, Igal worked on the architecture and development of networking devices, drivers, and network management solutions.