Search CORE

897 research outputs found

Toward Fine-Grained, Unsupervised, Scalable Performance Diagnosis for Production Cloud Computing Systems

Author: Haibo Mi
Hua Cai
Huaimin Wang
Michael Rung-Tsong Lyu
Yangfan Zhou
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

Author: Cheng Qian
Hoi Steven C. H.
Liu Chenghao
Saha Amrita
Sahoo Doyen
Saverese Silvio
Singh Manpreet
Woo Gerald
Yang Wenzhuo
Publication venue
Publication date: 10/04/2023
Field of study

Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively under explored topics, especially those that could significantly benefit from advances in AI literature. We also provide insights into the trends in this field, and what are the key investment opportunities

arXiv.org e-Print Archive

Towards effective performance diagnosis for distributed applications

Author: Xin R.
Publication venue
Publication date: 01/01/2023
Field of study

International Migration, Integration and Social Cohesion online publications

Online Anomaly Detection in HPC Systems

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/08/2019
Field of study

open4siReliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrator and final users have to discover it manually. Clearly this approach does not scale to large scale supercomputers and facilities: automated methods to detect faults and unhealthy conditions is needed. Our method uses a type of neural network called autoencoder trained to learn the normal behavior of a real, in-production HPC system and it is deployed on the edge of each computing node. We obtain a very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the computing units performance.embargoed_20200225Borghesi A.; Libri A.; Benini L.; Bartolini A.Borghesi A.; Libri A.; Benini L.; Bartolini A

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Computing at massive scale: Scalability and dependability challenges

Author: Xu J
Yang R
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2016
Field of study

Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. However, with the increase of data volume, together with the heterogeneity of workloads and resources, and the dynamic nature of massive user requests, the uncertainties and complexity of resource management and service provisioning increase dramatically, often resulting in poor resource utilization, vulnerable system dependability, and user-perceived performance degradations. In this paper we report our latest understanding of the current and future challenges in this particular area, and discuss both existing and potential solutions to the problems, especially those concerned with system efficiency, scalability and dependability. We first introduce a data-driven analysis methodology for characterizing the resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment. We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including for example incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request handling parallelism. We integrate these solutions with our data analysis methodology in order to establish an engineering approach that facilitates the optimization, tuning and verification of massive-scale distributed systems. We aim to develop and offer innovative methods and mechanisms for future computing platforms that will provide strong support for new big data and IoE (Internet of Everything) applications

Crossref

White Rose Research Online

Spectrum-Based Fault Localization for Microservices via Log Analysis

Author: João Miguel Ribeiro de Castro Silva Martins
Publication venue
Publication date: 25/07/2022
Field of study

Repositório Aberto da Universidade do Porto

Distributed Anomaly Detection and Prevention for Virtual Platforms

Author: Jehangiri Ali Imran
Publication venue
Publication date: 17/07/2015
Field of study

Georg-August-University Göttingen

Artificial Intelligence based Anomaly Detection of Energy Consumption in Buildings: A Review, Current Trends and New Perspectives

Author: Alsalemi Abdullah
Amira Abbes
Bensaali Faycal
Ghanem Khalida
Himeur Yassine
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021
Field of study

Enormous amounts of data are being produced everyday by sub-meters and smart sensors installed in residential buildings. If leveraged properly, that data could assist end-users, energy producers and utility companies in detecting anomalous power consumption and understanding the causes of each anomaly. Therefore, anomaly detection could stop a minor problem becoming overwhelming. Moreover, it will aid in better decision-making to reduce wasted energy and promote sustainable and energy efficient behavior. In this regard, this paper is an in-depth review of existing anomaly detection frameworks for building energy consumption based on artificial intelligence. Specifically, an extensive survey is presented, in which a comprehensive taxonomy is introduced to classify existing algorithms based on different modules and parameters adopted, such as machine learning algorithms, feature extraction approaches, anomaly detection levels, computing platforms and application scenarios. To the best of the authors' knowledge, this is the first review article that discusses anomaly detection in building energy consumption. Moving forward, important findings along with domain-specific problems, difficulties and challenges that remain unresolved are thoroughly discussed, including the absence of: (i) precise definitions of anomalous power consumption, (ii) annotated datasets, (iii) unified metrics to assess the performance of existing solutions, (iv) platforms for reproducibility and (v) privacy-preservation. Following, insights about current research trends are discussed to widen the applications and effectiveness of the anomaly detection technology before deriving future directions attracting significant attention. This article serves as a comprehensive reference to understand the current technological progress in anomaly detection of energy consumption based on artificial intelligence.Comment: 11 Figures, 3 Table

arXiv.org e-Print Archive

Qatar University Institutional Repository