897 research outputs found
AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
Artificial Intelligence for IT operations (AIOps) aims to combine the power
of AI with the big data generated by IT Operations processes, particularly in
cloud infrastructures, to provide actionable insights with the primary goal of
maximizing availability. There are a wide variety of problems to address, and
multiple use-cases, where AI capabilities can be leveraged to enhance
operational efficiency. Here we provide a review of the AIOps vision, trends
challenges and opportunities, specifically focusing on the underlying AI
techniques. We discuss in depth the key types of data emitted by IT Operations
activities, the scale and challenges in analyzing them, and where they can be
helpful. We categorize the key AIOps tasks as - incident detection, failure
prediction, root cause analysis and automated actions. We discuss the problem
formulation for each task, and then present a taxonomy of techniques to solve
these problems. We also identify relatively under explored topics, especially
those that could significantly benefit from advances in AI literature. We also
provide insights into the trends in this field, and what are the key investment
opportunities
Online Anomaly Detection in HPC Systems
open4siReliability is a cumbersome problem in High Performance Computing Systems and Data Centers evolution. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrator and final users have to discover it manually. Clearly this approach does not scale to large scale supercomputers and facilities: automated methods to detect faults and unhealthy conditions is needed. Our method uses a type of neural network called autoencoder trained to learn the normal behavior of a real, in-production HPC system and it is deployed on the edge of each computing node. We obtain a very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the computing units performance.embargoed_20200225Borghesi A.; Libri A.; Benini L.; Bartolini A.Borghesi A.; Libri A.; Benini L.; Bartolini A
Computing at massive scale: Scalability and dependability challenges
Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. However, with the increase of data volume, together with the heterogeneity of workloads and resources, and the dynamic nature of massive user requests, the uncertainties and complexity of resource management and service provisioning increase dramatically, often resulting in poor resource utilization, vulnerable system dependability, and user-perceived performance degradations. In this paper we report our latest understanding of the current and future challenges in this particular area, and discuss both existing and potential solutions to the problems, especially those concerned with system efficiency, scalability and dependability. We first introduce a data-driven analysis methodology for characterizing the resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment. We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including for example incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request handling parallelism. We integrate these solutions with our data analysis methodology in order to establish an engineering approach that facilitates the optimization, tuning and verification of massive-scale distributed systems. We aim to develop and offer innovative methods and mechanisms for future computing platforms that will provide strong support for new big data and IoE (Internet of Everything) applications
Artificial Intelligence based Anomaly Detection of Energy Consumption in Buildings: A Review, Current Trends and New Perspectives
Enormous amounts of data are being produced everyday by sub-meters and smart
sensors installed in residential buildings. If leveraged properly, that data
could assist end-users, energy producers and utility companies in detecting
anomalous power consumption and understanding the causes of each anomaly.
Therefore, anomaly detection could stop a minor problem becoming overwhelming.
Moreover, it will aid in better decision-making to reduce wasted energy and
promote sustainable and energy efficient behavior. In this regard, this paper
is an in-depth review of existing anomaly detection frameworks for building
energy consumption based on artificial intelligence. Specifically, an extensive
survey is presented, in which a comprehensive taxonomy is introduced to
classify existing algorithms based on different modules and parameters adopted,
such as machine learning algorithms, feature extraction approaches, anomaly
detection levels, computing platforms and application scenarios. To the best of
the authors' knowledge, this is the first review article that discusses anomaly
detection in building energy consumption. Moving forward, important findings
along with domain-specific problems, difficulties and challenges that remain
unresolved are thoroughly discussed, including the absence of: (i) precise
definitions of anomalous power consumption, (ii) annotated datasets, (iii)
unified metrics to assess the performance of existing solutions, (iv) platforms
for reproducibility and (v) privacy-preservation. Following, insights about
current research trends are discussed to widen the applications and
effectiveness of the anomaly detection technology before deriving future
directions attracting significant attention. This article serves as a
comprehensive reference to understand the current technological progress in
anomaly detection of energy consumption based on artificial intelligence.Comment: 11 Figures, 3 Table
- …