Smart Cloud Operations

Motivation:

Ubiquitous digitalization and the widespread adoption of computer technologies over the last 10 years have significantly improved the quality of life by automating work processes. The availability of compute resources is the major enabling factor behind the worldwide digital revolution. With compute power becoming a commodity available over the Internet, people all over the world gained the opportunity to provide society with digital services that solve everyday problems as well as highly complex tasks. The technology that brings virtually unlimited compute power to society became known as cloud computing. Aside from the clear economic benefit of not having to buy expensive hardware, cloud computing promotes sustainability: its economics encourage companies and individuals to move from owning resources to renting them. This paradigm shift reduces the demand for energy.

However, as cloud computing penetrated every area of our lives, a significant drawback of the technology became obvious: cloud infrastructure is complex to operate and subject to failures, which hinders its adoption in critical domains such as healthcare and transportation. In this proposal we address the challenge of making cloud-based digital infrastructure highly reliable and fault-tolerant for these critical areas by using artificial intelligence (AI) techniques. High-quality operation and maintenance (O&M) of cloud services is at the heart of reliability and fault tolerance. Cloud O&M conventionally relies on the expertise of well-trained cloud professionals. As the infrastructure grows and cloud services diversify, it becomes hard to track down an infrastructure or service failure and resolve it promptly. In 2016, for example, the cloud services provider Salesforce suffered a 20-hour outage that resulted in significant losses of data and money.

Naturally, no one can be fully insured against power outages or human error. But with lives at stake, it is wise to assist professionals in operating and maintaining cloud services by making the cloud smarter. The key to making the cloud smarter is to predict the demand for cloud services and proactively adjust the resource capacity, thus decreasing the probability of a fault. A number of researchers have addressed forecasting as an enabling technology for the AI-operated cloud. This technology helps to automate the management of critical cloud services through forecasting of resource utilization, energy consumption, and user-level parameters. However, cloud automation is not yet mature enough to cope with the areas that are critical to society and the individual. Real-world cloud applications in areas such as healthcare, production, and transportation introduce new challenges: highly volatile application usage patterns, dynamic dependencies between a variety of cloud services, high-availability and area-specific non-functional requirements, and the timeliness of cloud maintenance. Accounting for these challenges at the scale of large clouds such as Open Telekom Cloud demands AI-based approaches to forecasting that combine adaptive time series analysis techniques with online machine learning models and the concept of graph dynamical systems. The project proposes an AI-based forecasting approach to capture the complexity of cloud-based digital infrastructure and thus accelerate the digitalization of the critical areas of human life.

Goal & Focus:

The cloud computing paradigm has existed since 2006. The cloud has secured its position as a key enabling technology for digital infrastructure, and numerous digital services are now provided via the cloud. For example, the cloud makes it possible to store both personal and business data (e.g. Google Docs, Dropbox), thus decreasing the demand for storage on the personal computer. Hospitals and other healthcare-related institutions have found the cloud useful for storing and analyzing patient data (e.g. UC Irvine Health, Beth Israel Deaconess Medical Center). Even highly complex computational tasks from the research and engineering domains are brought to the cloud for higher efficiency and reduced costs (e.g. online CAD by Autodesk, SimScale). In general, cloud-based digital services improve the quality of life for each individual: they make it easy to book an appointment with a public authority or a doctor, to access data from anywhere through the Internet, to book a car and find a route free of traffic jams, to communicate with friends and family from any part of the world, and they significantly extend job search opportunities for people all around the world.

However, even with these many applications making life considerably easier, some areas remain poorly covered by cloud services, e.g. healthcare, public transportation, manufacturing, logistics, and construction. The major cause is that these areas are considered critical both for the life of an individual and for societal well-being, and hence they impose strict requirements on the availability of computation resources, the persistence of data storage, and the timeliness of data processing results. These sectors prefer to have their own (i.e. on-premise) hardware and software to ensure full control over storing and processing the data, even though the overall cost of buying and maintaining this infrastructure may be considerably higher than the prices offered by cloud services providers. When critical applications are hosted in the cloud, these sectors have to rely on the service guarantees of the cloud services provider. But with the increasing demand for cloud services, the cloud infrastructure becomes more and more complex, making it difficult to maintain even with an army of highly trained professional engineers. The solution would be to automate the operation and maintenance of the cloud infrastructure as much as possible, making it possible to anticipate and resolve infrastructural issues in advance.

The goal of the proposed research project is to significantly increase the reliability of cloud-based digital infrastructure by means of AI-based autonomous maintenance, thus enabling cloud services in critical domains such as healthcare, logistics, and manufacturing. The long-term benefit for society lies in a significant improvement of the quality of life through reliable and easily accessible cloud services in the listed critical domains, as well as in supporting the professionals working in areas of high social value (e.g. public health and public transportation).

To achieve the stated goal, the project focuses on enabling automatic management of cloud services and virtual infrastructure using AI-based approaches. At the core of the research is the development of forecasting models and software that enable smart, data-centric cloud Operations and Maintenance (O&M), which is considered the main prerequisite for a highly available and reliable digital infrastructure.

A specific goal of the project is the implementation of a Forecasting Platform prototype for smart cloud O&M. The prototype will provide accurate and timely forecasts of monitored parameters to multiple cloud O&M automation services. These services include, but are not limited to, fault and anomaly detection, root cause analysis, predictive autoscaling, and optimal scheduling of virtual machines. Employing forecasting will make it possible to replace reactive management of the cloud with a proactive paradigm, thus increasing the availability and improving the reliability of the cloud. A number of researchers have addressed forecasting as the technology enabling smart cloud O&M by focusing on forecasting of resource utilization, energy consumption, or user-level parameters such as the request rate. In contrast to the existing techniques, we address the challenge of forecasting for the automation of large-scale public clouds such as Open Telekom Cloud.
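To illustrate what proactive management could look like for the predictive autoscaling case, the following is a minimal sketch, not the project's actual method. It assumes that a request-rate time series is available, uses Holt's linear exponential smoothing as a stand-in forecasting model, and derives a replica count from the forecast peak. The function names, the smoothing constants, the per-replica capacity, and the headroom factor are all hypothetical.

```python
import math


def holt_forecast(series, horizon, alpha=0.5, beta=0.3):
    """Forecast `horizon` future points with Holt's linear exponential smoothing.

    alpha and beta are illustrative smoothing constants, not tuned values.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise level and trend from the first two observations.
    level = series[1]
    trend = series[1] - series[0]
    for value in series[2:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Extrapolate the last level along the last trend.
    return [level + (h + 1) * trend for h in range(horizon)]


def replicas_needed(forecast, capacity_per_replica, headroom=0.2, min_replicas=1):
    """Replica count that covers the forecast peak plus a safety headroom."""
    peak = max(forecast)
    return max(min_replicas, math.ceil(peak * (1 + headroom) / capacity_per_replica))


# Hypothetical request rates (requests per second) observed so far.
observed = [100, 110, 120, 130, 140]
forecast = holt_forecast(observed, horizon=3)
print(replicas_needed(forecast, capacity_per_replica=50))
```

A reactive autoscaler would act only once the measured load has already exceeded the capacity; in this sketch the capacity decision is taken from the forecast values before the load arrives.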

The core of the project methodology is the coupling of hybrid adaptive statistical and machine learning models with a software-based approach to processing high-frequency monitoring data streams and with structural graph-based analysis of cloud service traces. This AI-based methodology is built around the key characteristics of real-world applications from critical domains:

  • dynamism: with volatile workloads, multitenancy, and elasticity, the cloud becomes a dynamic entity. Conventional statistical analysis coupled with machine learning approaches makes it possible to detect fine-grained features and patterns in time series as well as general trends.
  • interdependence of digital services: as the majority of cloud services depend on functionality provided by other services, structural information can provide insights into how some services influence other services of the cloud. Such information will make forecasts significantly more accurate when an event occurs in a service that other services depend on. The structural information can be captured by graph dynamical systems models.
  • complexity of maintenance: a number of activities need to be conducted to manage and maintain cloud services, including scaling of the virtual infrastructure and applications, failure detection, and root cause analysis. By testing the forecasting approach on the cases of predictive autoscaling and root cause identification, the project will demonstrate that the forecasting models generalize sufficiently well.
  • demand for timeliness: given the real-time requirements on the user's side, cloud services need to process the data in the shortest time possible and provide the results by the specified time. Extending the time series and machine learning models to make them adaptive will significantly reduce the time required for cloud automation activities.
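The interdependence characteristic above can be illustrated with a small sketch of the structural part only: given a dependency graph between services, it finds every service that transitively depends on a service in which an event occurred. This is an assumption-laden illustration, not a graph dynamical systems model; the function name, the graph representation, and the service names are hypothetical.

```python
from collections import deque


def affected_services(depends_on, source):
    """Return all services that directly or transitively depend on `source`.

    `depends_on` maps a service name to the list of services it calls.
    """
    # Invert the edges: for each service, who depends on it?
    dependents = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    # Breadth-first traversal over the inverted graph.
    affected = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != source and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical service topology.
topology = {
    "frontend": ["api"],
    "api": ["db", "cache"],
    "billing": ["db"],
    "cache": [],
}
print(sorted(affected_services(topology, "db")))
```

In a forecasting setting, the resulting set could mark the services whose forecasts should be revisited after an event in the source service; how the event actually changes those forecasts is left open here.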

To make the research feasible within the specified duration, the project scope is limited to building a prototype of the forecasting models and trace analysis framework that provides forecasting services for accurate and timely automation of cloud O&M. The implementations of root cause analysis, cloud predictive maintenance, and predictive autoscaling serve as demonstrator cases for this framework; they are intended to show that these automation activities are sufficient for a highly available and reliable cloud-based digital infrastructure.

The socially important innovation of the proposed project lies in enabling the use of cloud services in areas with real-time and high-availability demands (critical areas), such as healthcare, public transportation, manufacturing, and logistics. A reliable and self-adaptive cloud-based digital infrastructure will enable both large companies and SMEs to improve their services by means of digitalization through easily accessible and reliable compute and storage resources. This innovation is made achievable through an adaptive AI-based approach to forecasting cloud service parameters. An AI-based approach to forecasting that takes into account the structure of the dependencies between services enables a variety of automation actions for cloud operations and maintenance, which, in turn, improves the reliability of the cloud. In particular, we aim to enhance time series analysis and machine learning models with support for fast and cheap refitting of the model on incoming data. Our research will also incorporate a graph-based model of cloud service interactions to increase the accuracy of the forecasting models. Last but not least, the existing approaches do not address the challenge of processing the volumes of monitoring and tracing data generated by production-level cloud systems. The proposed research will address this problem by adapting the forecasting models to monitoring data arriving as a stream from multiple sources and by employing distributed stream processing frameworks such as Apache Flink.
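The idea of fast and cheap refitting on incoming data can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's model: it uses an autoregressive predictor whose weights are updated with one normalized least-mean-squares step per observation, so no full refit over the history is needed. The class name, the number of lags, and the step size are hypothetical, and the sketch is plain single-process Python rather than an Apache Flink job.

```python
from collections import deque


class OnlineARForecaster:
    """One-step-ahead autoregressive forecaster updated per observation."""

    def __init__(self, lags=3, step=0.5, eps=1e-8):
        self.lags = lags
        self.step = step          # illustrative learning rate, 0 < step < 2
        self.eps = eps            # avoids division by zero
        self.weights = [0.0] * (lags + 1)   # lag weights plus a bias term
        self.history = deque(maxlen=lags)   # most recent `lags` observations

    def _features(self):
        return list(self.history) + [1.0]

    def predict(self):
        """Predict the next value, or None until enough history is seen."""
        if len(self.history) < self.lags:
            return None
        return sum(w * x for w, x in zip(self.weights, self._features()))

    def update(self, value):
        """Consume one observation; return the prediction made before it."""
        prediction = self.predict()
        if prediction is not None:
            x = self._features()
            error = value - prediction
            norm = self.eps + sum(v * v for v in x)
            # Normalized LMS step: constant cost per incoming data point.
            self.weights = [
                w + self.step * error * v / norm
                for w, v in zip(self.weights, x)
            ]
        self.history.append(value)
        return prediction


# Hypothetical stream of CPU utilization samples.
forecaster = OnlineARForecaster(lags=2)
for sample in [0.40, 0.42, 0.41, 0.45, 0.44, 0.47]:
    forecaster.update(sample)
print(forecaster.predict())
```

Because each update touches only the current weights and the last few observations, the same logic could in principle run inside a keyed operator of a stream processing framework, with one forecaster instance per monitored metric.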

Industrial partner: Huawei

Funding: BMBF

Team: Vladimir Podolskiy, Anshul Jindal

Supervisor: Prof. Dr. Hans Michael Gerndt

Duration: 2018 - 2020

Contact: Vladimir Podolskiy