    AIOps for a Cloud Object Storage Service

    With the growing reliance on the ubiquitous availability of IT systems and services, these systems have become more global, larger in scale, and more complex to operate. To maintain business viability, IT service providers must put in place reliable and cost-efficient operations support. Artificial Intelligence for IT Operations (AIOps) is a promising technology for alleviating the operational complexity of IT systems and services. AIOps platforms utilize big data, machine learning, and other advanced analytics technologies to enhance IT operations with proactive, actionable, and dynamic insight. In this paper we share our experience applying the AIOps approach to a production cloud object storage service to gain actionable insights into the system's behavior and health. We describe a real-life production cloud-scale service and its operational data, present the AIOps platform we have created, and show how it has helped us resolve operational pain points.
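
    As a minimal illustration of the kind of analytics such a platform runs over operational data, the sketch below flags anomalous latency windows with a rolling z-score. The file name, metric name, and threshold are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: rolling z-score anomaly detection over a latency
# metric, a common building block in AIOps pipelines. The CSV schema
# ("timestamp", "p99_latency_ms") is assumed for illustration.
import pandas as pd

def flag_latency_anomalies(df: pd.DataFrame, window: int = 60, z: float = 3.0) -> pd.DataFrame:
    """Return rows whose p99 latency deviates more than z sigma from a rolling baseline."""
    roll = df["p99_latency_ms"].rolling(window, min_periods=window)
    zscore = (df["p99_latency_ms"] - roll.mean()) / roll.std()
    return df[zscore.abs() > z]

metrics = pd.read_csv("object_store_metrics.csv", parse_dates=["timestamp"])
print(flag_latency_anomalies(metrics).head())
```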

    Leveraging data-driven infrastructure management to facilitate AIOps for big data applications and operations

    As institutions increasingly shift to distributed and containerized application deployments on remote heterogeneous cloud/cluster infrastructures, the cost and difficulty of efficiently managing and maintaining data-intensive applications have risen. A new emerging solution to this issue is Data-Driven Infrastructure Management (DDIM), where decisions regarding the management of resources are taken based on data aspects and operations, at both the infrastructure and application levels. This chapter introduces readers to the core concepts underpinning DDIM, based on experience gained from the development of the Kubernetes-based BigDataStack DDIM platform (https://bigdatastack.eu/). The chapter covers multiple important BDV topics, including development, deployment, and operations for cluster/cloud-based big data applications, as well as data-driven analytics and artificial intelligence for smart automated infrastructure self-management. Readers will gain important insights into how next-generation DDIM platforms function, as well as how they can be used in practical deployments to improve quality of service for big data applications. This chapter relates to the technical priority Data Processing Architectures of the European Big Data Value Strategic Research & Innovation Agenda [33], as well as to the Data Processing Architectures horizontal concern and the Engineering and DevOps for Building Big Data Value vertical concern. The chapter also relates to the Reasoning and Decision Making cross-sectorial technology enablers of the AI, Data and Robotics Strategic Research, Innovation & Deployment Agenda [34].
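
    As a minimal sketch of what a data-driven management action can look like in a Kubernetes setting, the example below derives a replica count from an application-level backlog metric and applies it with the official Kubernetes Python client. The deployment name, namespace, metric source, and scaling rule are illustrative assumptions, not BigDataStack APIs.

```python
# Hypothetical sketch: a data-driven scaling decision applied through
# the official Kubernetes Python client. The scaling rule (one replica
# per 100 queued items, capped) is an assumption for illustration.
from kubernetes import client, config

def replicas_for_backlog(backlog: int, per_replica: int = 100, max_replicas: int = 10) -> int:
    """Simple rule: one replica per `per_replica` queued items, capped, at least one."""
    return max(1, min(max_replicas, -(-backlog // per_replica)))  # ceiling division

def scale_deployment(name: str, namespace: str, backlog: int) -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": replicas_for_backlog(backlog)}}
    )

# scale_deployment("stream-processor", "bigdata", backlog=420)
```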

    An automated closed-loop framework to enforce security policies from anomaly detection

    Due to the growing complexity and scale of IT systems, there is an increasing need to automate and streamline routine maintenance and security management procedures, to reduce costs and improve productivity. In the case of security incidents, implementing and applying response actions requires significant effort from operators and developers to translate policies into code. Even if Machine Learning (ML) models are used to find anomalies, they need to be regularly trained and updated to avoid becoming outdated. In an evolving environment, an ML model with outdated training might put at risk the organization it was supposed to defend. To overcome these issues, in this paper we propose an automated closed-loop process with three stages. The first stage focuses on obtaining the Decision Trees (DT) that classify anomalies. In the second stage, DTs are translated into security Policies as Code based on languages recognized by the Policy Engine (PE). In the last stage, the translated security policies feed the Policy Engines, which enforce them by converting them into specific instruction sets. We also demonstrate the feasibility of the proposed framework by presenting an example that encompasses the three stages of the closed-loop process. The proposed framework can integrate a broad spectrum of domains and use cases, for instance supporting the decide and act stages of the ETSI Zero-touch Network & Service Management (ZSM) framework.
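
    The first two stages of the loop can be sketched concretely: fit a decision tree on labelled anomaly data, then walk the fitted tree and emit each root-to-leaf path as a textual rule. The feature names, labels, and IF/THEN syntax below are illustrative assumptions; the paper targets whatever languages the policy engine recognizes (for OPA, that would be Rego).

```python
# Hypothetical sketch of stages 1-2: train a DT on anomaly data, then
# translate its branches into textual policy rules. Features, labels,
# and rule syntax are illustrative, not the paper's own format.
from sklearn.tree import DecisionTreeClassifier, _tree

def tree_to_rules(clf, feature_names):
    t = clf.tree_
    rules = []

    def walk(node, conds):
        if t.feature[node] == _tree.TREE_UNDEFINED:  # leaf: emit one rule per path
            label = clf.classes_[t.value[node].argmax()]
            rules.append(f"IF {' AND '.join(conds) or 'TRUE'} THEN {label}")
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conds + [f"{name} <= {thr:.2f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.2f}"])

    walk(0, [])
    return rules

X = [[10, 0.1], [900, 0.9], [15, 0.2], [800, 0.8]]  # [req_rate, error_ratio]
y = ["allow", "block", "allow", "block"]
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print("\n".join(tree_to_rules(clf, ["req_rate", "error_ratio"])))
```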

    Architecture for Enabling Edge Inference via Model Transfer from Cloud Domain in a Kubernetes Environment

    Current approaches for optimizing energy consumption in buildings are mainly reactive or focus on scheduling daily/weekly operation modes in heating. Machine Learning (ML)-based advanced control methods have been demonstrated to improve energy efficiency over these traditional methods. However, placing ML-based models close to the buildings is not straightforward. First, edge devices typically have lower processing power, memory, and storage, which may limit execution of ML-based inference at the edge. Second, associated building information should be kept private. Third, network access may be limited when serving a large number of edge devices. The contribution of this paper is an architecture that enables training of ML-based models for energy consumption prediction in a private cloud domain and transfer of the models to edge nodes for prediction in a Kubernetes environment. Additionally, predictors at the edge nodes can be updated automatically without interrupting operation. Performance results with sensor-based devices (Raspberry Pi 4 and Jetson Nano) indicated that a satisfactory prediction latency (~7–9 s) can be achieved within the research context. However, model switching led to an increase in prediction latency (~9–13 s). A partial evaluation of a Reference Architecture for edge computing systems, which was used as a starting point for the architecture design, may be considered an additional contribution of the paper.
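
    A minimal sketch of the uninterrupted model-switching idea: serve predictions from the current model while a background thread watches for a newly transferred model file and swaps it in atomically. The file path, polling interval, and joblib serialization are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: hot-swapping an edge predictor without stopping
# the serving loop. A newly transferred model file (detected via mtime)
# is loaded in the background and swapped in under a lock.
import os
import threading
import time

import joblib  # assumed model serialization format

class HotSwapPredictor:
    def __init__(self, model_path: str):
        self.model_path = model_path
        self.mtime = os.path.getmtime(model_path)
        self.model = joblib.load(model_path)
        self.lock = threading.Lock()
        threading.Thread(target=self._watch, daemon=True).start()

    def _watch(self, interval: float = 10.0):
        while True:
            time.sleep(interval)
            mtime = os.path.getmtime(self.model_path)
            if mtime != self.mtime:                    # new model arrived from the cloud
                new_model = joblib.load(self.model_path)  # load outside the lock
                with self.lock:                        # the swap is the only locked step
                    self.model, self.mtime = new_model, mtime

    def predict(self, features):
        with self.lock:
            model = self.model                         # take a consistent reference
        return model.predict(features)
```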

    LWS: A Framework for Log-based Workload Simulation in Session-based SUT

    Microservice-based applications and cloud-native systems have been widely adopted in large IT enterprises, and their operation and management have become a focus of research. Real, representative workloads are the premise and basis of prominent research topics including performance testing, dynamic resource provisioning and scheduling, and AIOps. Due to privacy restrictions, the complexity and variety of workloads, and the requirement for reasonable intervention, it is difficult to copy or generate real workloads directly. In this paper, we formulate the task of workload simulation and propose a framework for Log-based Workload Simulation (LWS) in session-based application systems. First, LWS collects session logs and transforms them into grouped and well-organized sessions. Then LWS extracts a user behavior abstraction based on a relational model, together with an intervenable workload intensity obtained by three methods from different perspectives. LWS combines the user behavior abstraction and the workload intensity for simulated workload generation and provides a domain-specific language for better execution. The experimental evaluation is performed on an open-source cloud-native application and a public real-world e-commerce workload. The results show that the simulated workload generated by LWS is both effective and intervenable.
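
    The session-grouping and behavior-abstraction steps can be sketched in miniature: group request logs by session, fit a first-order Markov model of endpoint transitions, and sample synthetic sessions from it. The Markov model here stands in for the paper's relational behavior abstraction, and the log schema is an illustrative assumption.

```python
# Hypothetical sketch of an LWS-style pipeline step: group logs into
# sessions, learn endpoint-transition counts, and sample a synthetic
# session. The (session_id, endpoint) schema is assumed.
import random
from collections import defaultdict

def fit_transitions(log):  # log: time-ordered iterable of (session_id, endpoint)
    sessions = defaultdict(list)
    for sid, endpoint in log:
        sessions[sid].append(endpoint)
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sessions.values():
        for a, b in zip(["<start>"] + seq, seq + ["<end>"]):
            counts[a][b] += 1                 # first-order transition counts
    return counts

def sample_session(counts, max_len=20):
    state, out = "<start>", []
    while len(out) < max_len:
        nxt = random.choices(*zip(*counts[state].items()))[0]  # weighted draw
        if nxt == "<end>":
            break
        out.append(nxt)
        state = nxt
    return out

log = [(1, "/login"), (1, "/browse"), (1, "/buy"), (2, "/login"), (2, "/browse")]
print(sample_session(fit_transitions(log)))
```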

    SLA-aware operational efficiency in AI-enabled service chains: challenges ahead

    Service providers compose services in service chains that require deep integration of core operational information systems across organizations. Additionally, advanced analytics inform data-driven decision-making in corresponding AI-enabled business processes in today’s complex environments. However, individual partner engagements with service consumers and providers often entail individually negotiated, highly customized Service Level Agreements (SLAs) comprising engagement-specific metrics that semantically differ from general KPIs utilized on a broader operational (i.e., cross-client) level. Furthermore, the number of unique SLAs to be managed increases with the size of such service chains. The resulting complexity pushes large organizations to employ dedicated SLA management systems, but such ‘siloed’ approaches make it difficult to leverage insights from SLA evaluations and predictions for decision-making in core business processes, and vice versa. Consequently, simultaneous optimization for both global operational process efficiency and engagement-specific SLA compliance is hampered. To address these shortcomings, we propose our vision of supplying online, AI-supported SLA analytics to data-driven, intelligent core workflows of the enterprise and discuss current research challenges arising from this vision. Exemplified by two scenarios derived from real use cases in industry and public administration, we demonstrate the need for improved semantic alignment of heavily customized SLAs with AI-enabled operational systems. Moreover, we discuss specific challenges of prescriptive SLA analytics under multi-engagement SLA awareness and how the dual role of AI in such scenarios demands bidirectional data exchange between operational processes and SLA management. Finally, we discuss the implications of federating AI-supported SLA analytics across organizations.
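
    One way to make the alignment problem concrete: express each engagement-specific SLA metric as a function of the shared operational KPIs, so that compliance for every engagement can be evaluated against one KPI feed. Everything in the sketch below (names, thresholds, the mapping mechanism) is an illustrative assumption, not the authors' design.

```python
# Hypothetical sketch: per-engagement SLAs defined as expressions over
# shared KPIs, evaluated centrally. Names and thresholds are invented.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class SLA:
    engagement: str
    metric: Callable[[Dict[str, float]], float]  # maps shared KPIs to the SLA's own metric
    threshold: float

def compliant(sla: SLA, kpis: Dict[str, float]) -> bool:
    return sla.metric(kpis) <= sla.threshold

kpis = {"p95_latency_ms": 180.0, "error_rate": 0.004}
slas = [
    SLA("client-A", lambda k: k["p95_latency_ms"], threshold=200.0),
    SLA("client-B", lambda k: 1000 * k["error_rate"], threshold=5.0),  # errors per 1k requests
]
for sla in slas:
    print(sla.engagement, "compliant:", compliant(sla, kpis))
```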

    Managing Distributed Cloud Applications and Infrastructure

    The emergence of the Internet of Things (IoT), combined with greater heterogeneity both within cloud computing architectures and across the cloud-to-edge continuum, is introducing new challenges for managing applications and infrastructure across this continuum. The scale and complexity are such that it is no longer realistic for IT teams to manually foresee potential issues and manage the dynamism and dependencies across an increasingly interdependent chain of service provision. This Open Access Pivot explores these challenges and offers a solution for the intelligent and reliable management of physical infrastructure and the optimal placement of applications for the provision of services on distributed clouds. The book provides a conceptual reference model for reliable capacity provisioning for distributed clouds and discusses how data analytics and machine learning, application and infrastructure optimization, and simulation can deliver quality-of-service requirements cost-efficiently in this complex feature space. These are illustrated through a series of case studies in cloud computing, telecommunications, big data analytics, and smart cities.

    TensorBank: Tensor Lakehouse for Foundation Model Training

    Storing and streaming high-dimensional data for foundation model training has become a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte-scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows tensors to be addressed directly at the block level using HTTP range reads. Once in GPU memory, the data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory that translates relational queries and requested transformations into dataset instances. By making use of the HSI, irrelevant blocks can be skipped without being read, as the indices contain statistics on block content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although hardened for production use with geospatial-temporal data, the architecture generalizes to other use cases such as computer vision, computational neuroscience, biological sequence analysis, and more.
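
    The access pattern the abstract describes can be sketched as a PyTorch dataset: consult per-block index statistics to skip irrelevant blocks, fetch the remaining ones from object storage with HTTP range reads, and yield them as tensors. The URL, block layout, and index schema below are illustrative assumptions, not the TensorBank or HSI formats.

```python
# Hypothetical sketch: block-skipping via per-block statistics plus HTTP
# range reads into PyTorch tensors. Block metadata layout is invented.
import requests
import torch
from torch.utils.data import IterableDataset

class RangeReadDataset(IterableDataset):
    def __init__(self, url, blocks, predicate, dtype=torch.float32):
        self.url = url              # single object holding fixed-size tensor blocks
        self.blocks = blocks        # [{"offset": int, "length": int, "min": float, "max": float}]
        self.predicate = predicate  # decides from block statistics whether to read it
        self.dtype = dtype

    def __iter__(self):
        for b in self.blocks:
            if not self.predicate(b):          # skip the block without reading its bytes
                continue
            hdr = {"Range": f"bytes={b['offset']}-{b['offset'] + b['length'] - 1}"}
            resp = requests.get(self.url, headers=hdr)
            resp.raise_for_status()
            yield torch.frombuffer(bytearray(resp.content), dtype=self.dtype)

# ds = RangeReadDataset(url, index_blocks, predicate=lambda b: b["max"] >= 0.5)
```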
