AIOps for a Cloud Object Storage Service
With the growing reliance on the ubiquitous availability of IT systems and
services, these systems become more global, scaled, and complex to operate. To
maintain business viability, IT service providers must put in place reliable
and cost efficient operations support. Artificial Intelligence for IT
Operations (AIOps) is a promising technology for alleviating operational
complexity of IT systems and services. AIOps platforms utilize big data,
machine learning and other advanced analytics technologies to enhance IT
operations with proactive actionable dynamic insight.
In this paper we share our experience applying the AIOps approach to a
production cloud object storage service to get actionable insights into
the system's behavior and health. We describe a real-life production cloud-scale
service and its operational data, present the AIOps platform we have created,
and show how it has helped us resolve operational pain points.
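The paper itself includes no code; as a rough, hedged illustration of the kind of proactive insight an AIOps pipeline can derive from operational data, the sketch below flags anomalous time windows in object-storage metrics with scikit-learn's IsolationForest. The metric names and CSV layout are assumptions, not the service's actual data schema.

```python
# Hypothetical sketch: flag anomalous windows in object-storage operational
# metrics. Column names and file layout are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

metrics = pd.read_csv("cos_ops_metrics.csv", parse_dates=["timestamp"])

# Aggregate raw samples into 5-minute windows per metric.
windows = (
    metrics.set_index("timestamp")
    .resample("5min")
    .agg({"request_latency_ms": "mean", "error_rate": "mean", "throughput_mbps": "mean"})
    .dropna()
)

# Unsupervised anomaly detection over the windowed feature vectors.
model = IsolationForest(contamination=0.01, random_state=0)
windows["anomaly"] = model.fit_predict(windows)

# -1 marks windows the model considers anomalous; surface them to operators.
print(windows[windows["anomaly"] == -1])
```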
Leveraging data-driven infrastructure management to facilitate AIOps for big data applications and operations
As institutions increasingly shift to distributed and containerized application deployments on remote heterogeneous cloud/cluster infrastructures, the cost and difficulty of efficiently managing and maintaining data-intensive applications have risen. A new emerging solution to this issue is Data-Driven Infrastructure Management (DDIM), where the decisions regarding the management of resources are taken based on data aspects and operations (both on the infrastructure and on the application levels). This chapter will introduce readers to the core concepts underpinning DDIM, based on experience gained from development of the Kubernetes-based BigDataStack DDIM platform (https://bigdatastack.eu/). This chapter involves multiple important BDV topics, including development, deployment, and operations for cluster/cloud-based big data applications, as well as data-driven analytics and artificial intelligence for smart automated infrastructure self-management. Readers will gain important insights into how next-generation DDIM platforms function, as well as how they can be used in practical deployments to improve quality of service for Big Data Applications.
This chapter relates to the technical priority Data Processing Architectures of the European Big Data Value Strategic Research & Innovation Agenda [33], as well as the Data Processing Architectures horizontal and Engineering and DevOps for building Big Data Value vertical concerns. The chapter relates to the Reasoning and Decision Making cross-sectorial technology enablers of the AI, Data and Robotics Strategic Research, Innovation & Deployment Agenda [34].
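As a minimal sketch of the data-driven decision loop a DDIM platform such as BigDataStack automates, the hypothetical snippet below scales a Kubernetes deployment based on an observed application-level metric; the deployment name, namespace, metric source, and thresholds are assumptions and not part of the chapter.

```python
# Hypothetical sketch of a data-driven scaling decision on Kubernetes.
# Metric source, thresholds, and deployment names are illustrative only.
from kubernetes import client, config

def scale_by_queue_depth(queue_depth: int, namespace: str = "bigdatastack",
                         deployment: str = "stream-processor") -> None:
    """Scale a deployment up or down based on an application-level metric."""
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    apps = client.AppsV1Api()

    current = apps.read_namespaced_deployment_scale(deployment, namespace)
    replicas = current.spec.replicas or 1

    # Simple data-driven policy: one replica per 100 queued items, capped at 10.
    desired = max(1, min(10, queue_depth // 100 + 1))

    if desired != replicas:
        apps.patch_namespaced_deployment_scale(
            deployment, namespace, {"spec": {"replicas": desired}}
        )

scale_by_queue_depth(queue_depth=450)
```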
An automated closed-loop framework to enforce security policies from anomaly detection
Due to the growing complexity and scale of IT systems, there is an increasing need to automate and streamline routine maintenance and security management procedures, to reduce costs and improve productivity. In the case of security incidents, the implementation and application of response actions require significant efforts from operators and developers in translating policies to code. Even if Machine Learning (ML) models are used to find anomalies, they need to be regularly trained/updated to avoid becoming outdated. In an evolving environment, an ML model with outdated training might put at risk the organization it was supposed to defend.
To overcome those issues, in this paper we propose an automated closed-loop process with three stages. The first stage focuses on obtaining the Decision Trees (DT) that classify anomalies. In the second stage, DTs are translated into security Policies as Code based on languages recognized by the Policy Engine (PE). In the last stage, the translated security policies feed the Policy Engines that enforce them by converting them into specific instruction sets. We also demonstrate the feasibility of the proposed framework, by presenting an example that encompasses the three stages of the closed-loop process.
The proposed framework may integrate a broad spectrum of domains and use cases, for instance supporting the decide and act stages of the ETSI Zero-touch Network & Service Management (ZSM) framework.
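To make the first two stages more concrete, here is a hedged sketch (not the authors' implementation) that trains a decision tree on labelled traffic features and exports its rules with scikit-learn; a separate translator could then map those rules to the policy language expected by a Policy Engine. Feature names, data, and the translation step are illustrative assumptions.

```python
# Hypothetical sketch of stages 1-2: learn a decision tree over traffic
# features, then export its rules as a starting point for policy-as-code
# translation. Feature names and data are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["pkt_rate", "conn_failures", "bytes_out"]
X = np.array([[120, 0, 5_000], [9_500, 40, 80_000], [150, 1, 6_200], [11_000, 55, 95_000]])
y = np.array([0, 1, 0, 1])  # 0 = benign, 1 = anomalous

dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Human-readable rules; a translator would walk these and emit policy-engine
# statements (e.g., rate-limit or isolate) instead of printing them.
print(export_text(dt, feature_names=features))
```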
Architecture for Enabling Edge Inference via Model Transfer from Cloud Domain in a Kubernetes Environment
Current approaches for energy consumption optimisation in buildings are mainly reactive or focus on scheduling of daily/weekly operation modes in heating. Machine Learning (ML)-based advanced control methods have been demonstrated to improve energy efficiency when compared to these traditional methods. However, placing ML-based models close to the buildings is not straightforward. Firstly, edge devices typically have lower capabilities in terms of processing power, memory, and storage, which may limit execution of ML-based inference at the edge. Secondly, associated building information should be kept private. Thirdly, network access may be limited for serving a large number of edge devices. The contribution of this paper is an architecture that enables training of ML-based models for energy consumption prediction in a private cloud domain and transfer of the models to edge nodes for prediction in a Kubernetes environment. Additionally, predictors at the edge nodes can be automatically updated without interrupting operation. Performance results with sensor-based devices (Raspberry Pi 4 and Jetson Nano) indicated that a satisfactory prediction latency (~7–9 s) can be achieved within the research context. However, model switching led to an increase in prediction latency (~9–13 s). Partial evaluation of a Reference Architecture for edge computing systems, which was used as a starting point for the architecture design, may be considered an additional contribution of the paper.
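The paper's architecture transfers trained models from a private cloud to edge nodes and swaps predictors without interrupting operation. The sketch below is a hedged, generic illustration of such a hot-swap wrapper, not the paper's actual mechanism: it watches a model file dropped onto the edge node and atomically replaces the in-memory predictor. File paths, the pickle model format, and the predictor interface are assumptions.

```python
# Hypothetical sketch of a hot-swappable edge predictor. The model path and
# pickle format are assumptions; the paper's actual model format may differ.
import os
import pickle
import threading

class HotSwapPredictor:
    """Serve predictions while transparently reloading an updated model file."""

    def __init__(self, model_path: str):
        self.model_path = model_path
        self._lock = threading.Lock()
        self._mtime = 0.0
        self._model = None
        self._maybe_reload()

    def _maybe_reload(self) -> None:
        mtime = os.path.getmtime(self.model_path)
        if mtime > self._mtime:
            with open(self.model_path, "rb") as f:
                new_model = pickle.load(f)
            with self._lock:  # swap atomically so in-flight calls finish cleanly
                self._model = new_model
                self._mtime = mtime

    def predict(self, features):
        self._maybe_reload()
        with self._lock:
            return self._model.predict([features])[0]

# Illustrative usage:
# predictor = HotSwapPredictor("/var/models/energy_predictor.pkl")
# print(predictor.predict([21.5, 0.4, 18.0]))  # e.g. temperature, occupancy, setpoint
```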
LWS: A Framework for Log-based Workload Simulation in Session-based SUT
Microservice-based applications and cloud-native systems have been widely
applied in large IT enterprises. The operation and management of
microservice-based applications and cloud-native systems have become the focus
of research. Essential and real workloads are the premise and basis of
prominent research topics including performance testing, dynamic resource
provisioning and scheduling, and AIOps. Due to privacy restrictions, the
complexity and variety of workloads, and the requirements for reasonable
intervention, it is difficult to copy or generate real workloads directly. In
this paper, we formulate the task of workload simulation and propose a
framework for Log-based Workload Simulation (LWS) in session-based application
systems. First, LWS collects session logs and transforms them into grouped and
well-organized sessions. Then LWS extracts the user behavior abstraction based
on a relational model and the intervenable workload intensity by three methods
from different perspectives. LWS combines the user behavior abstraction and the
workload intensity for simulated workload generation and designs a
domain-specific language for better execution. The experimental evaluation is
performed on an open-source cloud-native application and a public real-world
e-commerce workload. The experimental results show that the simulated workload
generated by LWS is effective and intervenable.
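LWS itself is not reproduced here; as a hedged and heavily simplified sketch of the log-to-workload idea, the snippet below groups session logs by session ID, derives per-minute arrival intensity, and replays each session's requests. The log format, field names, and replay target are assumptions.

```python
# Hypothetical, simplified sketch of log-based workload simulation: group log
# records into sessions, estimate per-minute intensity, and replay requests.
# The log schema and replay endpoint are illustrative assumptions.
import time
from collections import defaultdict
from datetime import datetime

def parse_log(lines):
    """Each line: '<iso-timestamp> <session_id> <endpoint>' (assumed format)."""
    sessions = defaultdict(list)
    for line in lines:
        ts, sid, endpoint = line.strip().split(" ", 2)
        sessions[sid].append((datetime.fromisoformat(ts), endpoint))
    return sessions

def per_minute_intensity(sessions):
    """Count session arrivals per minute as a simple workload-intensity model."""
    counts = defaultdict(int)
    for events in sessions.values():
        first_ts = min(ts for ts, _ in events)
        counts[first_ts.replace(second=0, microsecond=0)] += 1
    return counts

def replay(sessions, send, speedup: float = 10.0):
    """Replay each session's requests in order, compressing gaps by `speedup`."""
    for sid, events in sessions.items():
        events.sort()
        prev = events[0][0]
        for ts, endpoint in events:
            time.sleep((ts - prev).total_seconds() / speedup)
            prev = ts
            send(sid, endpoint)

# Illustrative usage:
# sessions = parse_log(open("sessions.log"))
# replay(sessions, send=lambda sid, ep: print(sid, ep))
```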
SLA-aware operational efficiency in AI-enabled service chains: challenges ahead
Service providers compose services in service chains that require deep integration of core operational information systems across organizations.
Additionally, advanced analytics inform data-driven decision-making in
corresponding AI-enabled business processes in today’s complex
environments. However, individual partner engagements with service
consumers and providers often entail individually negotiated, highly customized
Service Level Agreements (SLAs) comprising engagement-specific metrics that
semantically differ from general KPIs utilized on a broader operational (i.e.,
cross-client) level. Furthermore, the number of unique SLAs to be managed
increases with the size of such service chains. The resulting complexity pushes
large organizations to employ dedicated SLA management systems, but such
‘siloed’ approaches make it difficult to leverage insights from SLA evaluations
and predictions for decision-making in core business processes, and vice versa.
Consequently, simultaneous optimization for both global operational process
efficiency and engagement-specific SLA compliance is hampered. To address
these shortcomings, we propose our vision of supplying online, AI-supported SLA
analytics to data-driven, intelligent core workflows of the enterprise and discuss
current research challenges arising from this vision. Exemplified by two scenarios
derived from real use cases in industry and public administration, we demonstrate
the need for improved semantic alignment of heavily customized SLAs with
AI-enabled operational systems. Moreover, we discuss specific challenges of
prescriptive SLA analytics under multi-engagement SLA awareness and how the
dual role of AI in such scenarios demands bidirectional data exchange between
operational processes and SLA management. Finally, we discuss the implications
of federating AI-supported SLA analytics across organizations.
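As a small, hypothetical illustration of the semantic-alignment problem raised here, the snippet below maps engagement-specific SLA metric names onto canonical operational KPIs before compliance is evaluated; all metric names, mappings, and thresholds are invented for illustration and are not from the paper.

```python
# Hypothetical sketch: align engagement-specific SLA metrics to canonical KPIs
# and evaluate compliance. All names, mappings, and thresholds are invented.
CANONICAL_KPI = {
    "ticket_turnaround_hours": "resolution_time_hours",  # client A's wording
    "time_to_fix_h": "resolution_time_hours",            # client B's wording
    "uptime_percent": "availability_percent",
}

def evaluate_sla(engagement_metrics: dict, sla_targets: dict) -> dict:
    """Return per-KPI compliance after mapping engagement metrics to KPIs."""
    aligned = {CANONICAL_KPI.get(name, name): value
               for name, value in engagement_metrics.items()}
    # Here "compliant" means the measured value does not exceed the target.
    return {kpi: aligned.get(kpi) is not None and aligned[kpi] <= target
            for kpi, target in sla_targets.items()}

print(evaluate_sla({"ticket_turnaround_hours": 6.5},
                   {"resolution_time_hours": 8.0}))
```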
Managing Distributed Cloud Applications and Infrastructure
The emergence of the Internet of Things (IoT), combined with greater heterogeneity not only online in cloud computing architectures but across the cloud-to-edge continuum, is introducing new challenges for managing applications and infrastructure across this continuum. The scale and complexity are simply too great for IT teams to manually foresee the potential issues and manage the dynamism and dependencies across an increasingly inter-dependent chain of service provision. This Open Access Pivot explores these challenges and offers a solution for the intelligent and reliable management of physical infrastructure and the optimal placement of applications for the provision of services on distributed clouds. This book provides a conceptual reference model for reliable capacity provisioning for distributed clouds and discusses how data analytics and machine learning, application and infrastructure optimization, and simulation can deliver quality of service requirements cost-efficiently in this complex feature space. These are illustrated through a series of case studies in cloud computing, telecommunications, big data analytics, and smart cities.
TensorBank:Tensor Lakehouse for Foundation Model Training
Storing and streaming high dimensional data for foundation model training
became a critical requirement with the rise of foundation models beyond natural
language. In this paper we introduce TensorBank, a petabyte scale tensor
lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU
memory at wire speed based on complex relational queries. We use Hierarchical
Statistical Indices (HSI) for query acceleration. Our architecture allows
tensors to be addressed directly at the block level using HTTP range reads. Once in GPU
memory, data can be transformed using PyTorch transforms. We provide a generic
PyTorch dataset type with a corresponding dataset factory that translates
relational queries and requested transformations into a dataset instance. By making use
of the HSI, irrelevant blocks can be skipped without reading them as those
indices contain statistics on their content at different hierarchical
resolution levels. This is an opinionated architecture powered by open
standards and making heavy use of open-source technology. Although hardened
for production use on geospatial-temporal data, the architecture
generalizes to other use cases such as computer vision, computational neuroscience,
biological sequence analysis, and more.
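To illustrate the block-skipping and range-read pattern described in the abstract (not TensorBank's actual API), here is a hedged sketch of a PyTorch IterableDataset that consults per-block statistics to skip irrelevant blocks and fetches the remaining ones from object storage with HTTP range reads; the URL, block-index structure, and tensor encoding are assumptions.

```python
# Hypothetical sketch of HSI-style block skipping plus HTTP range reads.
# The index layout, URL, and tensor encoding are assumptions, not TensorBank's
# actual interfaces.
import io
import requests
import torch
from torch.utils.data import IterableDataset

class RangeReadBlockDataset(IterableDataset):
    def __init__(self, url: str, block_index, min_value: float):
        # block_index: list of dicts like {"offset": int, "length": int, "max": float}
        self.url = url
        self.block_index = block_index
        self.min_value = min_value

    def __iter__(self):
        for block in self.block_index:
            # Skip blocks whose precomputed statistics rule them out.
            if block["max"] < self.min_value:
                continue
            end = block["offset"] + block["length"] - 1
            headers = {"Range": f"bytes={block['offset']}-{end}"}
            resp = requests.get(self.url, headers=headers, timeout=30)
            resp.raise_for_status()
            # Assume each block is a serialized tensor; decode and yield it.
            yield torch.load(io.BytesIO(resp.content))
```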