
    Reconfiguration of optical-NFV network architectures based on cloud resource allocation and QoS degradation cost-aware prediction techniques

    The long time required to deploy cloud resources in Network Function Virtualization network architectures has led to the proposal and investigation of algorithms for predicting traffic or the necessary processing and memory resources. However, it is well known that, whatever approach is taken, a prediction error is inevitable. Two types of prediction error can occur, and they have a different impact on the increase in network operational costs. When the predicted values are higher than the real ones, the resource allocation algorithms allocate more resources than necessary, introducing an over-provisioning cost. Conversely, when the predicted values are lower than the real ones, the allocation of fewer resources leads to a degradation of QoS and introduces an under-provisioning cost. When over-provisioning and under-provisioning costs differ, most of the prediction algorithms proposed in the literature are not adequate because they are based on minimizing the mean square error or other symmetric cost functions. For this reason we propose and investigate a forecasting methodology that introduces an asymmetric cost function capable of weighting the costs of over-provisioning and under-provisioning differently. We have applied the proposed forecasting methodology to resource allocation in a Network Function Virtualization architecture where the Network Function Virtualization Infrastructure Points-of-Presence are interconnected by an elastic optical network. We have verified a cost saving of 40% compared to solutions that minimize the mean square error.
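
    To illustrate the kind of asymmetry the paper argues for, the following minimal sketch (not the paper's actual formulation; the function name, weights, and toy numbers are assumptions) charges a forecast differently for wasted capacity and for unmet demand:

```python
import numpy as np

def provisioning_cost(predicted, actual, c_over=1.0, c_under=4.0):
    """Asymmetric cost of a resource forecast.

    Over-provisioning (prediction above the real demand) is charged c_over
    per unit of wasted capacity; under-provisioning (prediction below the
    real demand) is charged c_under per unit of unmet demand, modelling the
    QoS-degradation penalty. Both weights are illustrative placeholders.
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    over = np.maximum(predicted - actual, 0.0)   # wasted capacity
    under = np.maximum(actual - predicted, 0.0)  # unmet demand
    return float(np.mean(c_over * over + c_under * under))

# The same absolute error costs more when it is an under-estimate.
print(provisioning_cost(predicted=[110], actual=[100]))  # 10 * c_over
print(provisioning_cost(predicted=[90], actual=[100]))   # 10 * c_under
```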

    Proposal and investigation of an artificial intelligence (AI)-based cloud resource allocation algorithm in network function virtualization architectures

    The long time needed to reconfigure cloud resources in Network Function Virtualization environments has led to the proposal of solutions in which prediction-based resource allocation is performed. All of them predict traffic or the needed resources by minimizing symmetric loss functions such as the Mean Squared Error. When the inevitable prediction errors occur, these methodologies cannot weigh positive and negative prediction errors differently, even though they have a different impact on the total network cost. In fact, if the predicted traffic is higher than the real one, an over-allocation cost, referred to as the over-provisioning cost, is paid by the network operator; conversely, in the opposite case, a Quality of Service degradation cost, referred to as the under-provisioning cost, is due to compensate the users for the resource under-allocation. In this paper we propose and investigate a resource allocation strategy based on a Long Short-Term Memory (LSTM) model whose training minimizes an asymmetric cost function that weighs positive and negative prediction errors, and the corresponding over-provisioning and under-provisioning costs, differently. In a typical traffic and network scenario, the proposed solution allows for a cost saving of 30% with respect to a solution with a symmetric cost function.
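
    As a hedged illustration of how such an asymmetric objective could drive the training of an LSTM forecaster, the sketch below uses PyTorch with an under-prediction weight larger than the over-prediction weight; the architecture, weights, and data are placeholders, not the paper's implementation:

```python
import torch
import torch.nn as nn

def asymmetric_loss(pred, target, w_over=1.0, w_under=3.0):
    # Penalize under-prediction (pred < target) more heavily than
    # over-prediction; the weights are illustrative assumptions.
    diff = pred - target
    over = torch.clamp(diff, min=0.0)
    under = torch.clamp(-diff, min=0.0)
    return (w_over * over + w_under * under).mean()

class TrafficForecaster(nn.Module):
    # Hypothetical one-step-ahead forecaster, not the paper's exact model.
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the next value

model = TrafficForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 24, 1)              # toy batch: 24 past samples per series
y = torch.rand(8, 1)                  # next-step ground truth
loss = asymmetric_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```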

    NFV orchestration in edge and fog scenarios

    Current network infrastructures handle a diverse range of network services such as video on demand, video conferencing, social networks, educational systems, or photo storage services. These services have been embraced by a significant share of the world population and are used on a daily basis. Cloud providers' and network operators' infrastructures accommodate the traffic that these services generate, and their management tasks involve not only traffic steering but also the processing of the network services' traffic. Traditionally, traffic processing has been carried out by applications/programs deployed on servers exclusively dedicated to a specific task, such as packet inspection. However, in recent years network services have started to be virtualized, and this has led to the Network Function Virtualization (NFV) paradigm, in which the network functions of a service run on containers or virtual machines decoupled from the hardware infrastructure. As a result, traffic processing has become more flexible because of the loose coupling between software and hardware, and the possibility of sharing common network functions, such as firewalls, across multiple network services. NFV eases the automation of network operations, since scaling and migration tasks are typically performed through a set of commands predefined by the virtualization technology, either containers or virtual machines. However, it is still necessary to decide the traffic steering and processing of every network service. In other words, which servers will perform the traffic processing, and which network links must be traversed so that the users' requests reach the final servers, i.e., the network embedding problem. Under the umbrella of NFV, this problem is known as Virtual Network Embedding (VNE), and this thesis uses the term “NFV orchestration algorithms” for the algorithms that solve it.
The VNE problem is NP-hard, meaning that optimal solutions cannot, in general, be found in polynomial time, regardless of the network size. As a consequence, the research and telecommunications communities rely on heuristics that find solutions more quickly than a commodity optimization solver. Traditionally, NFV orchestration algorithms have tried to minimize the deployment costs derived from their solutions. For example, they try not to exhaust the network bandwidth and use short paths so as to consume fewer network resources. Additionally, a recent tendency has led the research community towards algorithms that minimize the energy consumption of the deployed services, either by selecting more energy-efficient devices or by turning off network devices that remain unused. VNE problem constraints were typically summarized in a set of resource/energy constraints, and the solutions differed in the objective functions they pursued. But that was before the 5th generation of mobile networks (5G) was considered in the VNE problem. With the appearance of 5G, new network services and use cases started to emerge. The standards talked about Ultra-Reliable and Low Latency Communications (URLLC) with latencies below a few milliseconds and 99.999% reliability, enhanced Mobile Broadband (eMBB) with significant data rate increases, and even massive Machine-Type Communications (mMTC) among Internet of Things (IoT) devices. Moreover, paradigms such as edge and fog computing blended with the 5G technology to introduce the idea of having computing devices closer to the end users. As a result, the VNE problem had to incorporate the new requirements as constraints, and every solution had to satisfy low latencies, high reliability, or larger data rates. This thesis studies the VNE problem and proposes heuristics that tackle the constraints related to 5G services in edge and fog scenarios; that is, the proposed solutions handle the assignment of Virtual Network Functions to servers and the traffic steering across 5G infrastructures that include edge and fog devices. To evaluate the performance of the proposed solutions, the thesis first studies the generation of graphs that represent 5G networks. The proposed graph-generation mechanisms serve to represent diverse 5G scenarios, in particular federation scenarios in which several domains share resources among themselves. The generated graphs also represent edge servers, as well as fog devices with limited battery capacity. Additionally, these graphs take into account the standard requirements and the expected demand in 5G networks. Moreover, the graphs differ depending on the population density and the area of study, i.e., whether it is an industrial area, a highway, or an urban area. After detailing the generation of graphs representing 5G networks, this thesis proposes several NFV orchestration algorithms to tackle the VNE problem. First, it focuses on federation scenarios in which network services must be assigned not only to a single domain infrastructure, but also to the shared resources of the federation of domains. Two different problems are studied: the VNE itself over a federated infrastructure, and the delegation of network services.
That is, whether a network service should be deployed in the local domain or in the resource pool of the federation of domains, knowing that in the latter case the local domain is charged for hosting the network service. Second, the thesis proposes OKpi, an NFV orchestration algorithm that meets the quality of service of 5G network slices. Conceptually, network slicing consists in splitting the network so that network services are treated differently depending on the slice they belong to. For example, an eHealth network slice will allocate the network resources necessary to meet low latencies for network services such as remote surgery. Each network slice is devoted to specific services with very concrete requirements, such as high reliability, location constraints, or 1 ms latencies. OKpi is an NFV orchestration algorithm that meets the network service requirements across different slices. It is based on a multi-constrained shortest path heuristic, and its solutions satisfy latency, reliability, and location constraints. After presenting OKpi, the thesis tackles the VNE problem in 5G networks with static and mobile fog devices. The presented NFV orchestration algorithm takes into account the limited computing resources of fog devices, as well as the out-of-coverage problems derived from the devices' mobility. To conclude, this thesis studies the scaling of Vehicle-to-Network (V2N) services, which require low latencies for network services such as collision avoidance, hazard warning, and remote driving. For these services, traffic jams or high vehicular congestion can lead to violations of the latency requirements. Hence, it is necessary to anticipate such circumstances by using time-series techniques that estimate the incoming vehicular traffic flow in the next minutes or hours, so as to scale the V2N service accordingly. The 5G Exchange (5GEx) project (2015-2018) was an EU-funded project (H2020-ICT-2014-2 grant agreement 671636). The 5G-TRANSFORMER project (2017-2019) is an EU-funded project (H2020-ICT-2016-2 grant agreement 761536). The 5G-CORAL project (2017-2019) is an EU-Taiwan project (H2020-ICT-2016-2 grant agreement 761586). Programa de Doctorado en Ingeniería Telemática, Universidad Carlos III de Madrid. Committee: President: Ioannis Stavrakakis; Secretary: Pablo Serrano Yáñez-Mingot; Member: Paul Horatiu Patra.
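
    To give a concrete flavour of the greedy heuristics this thesis deals with, here is a minimal, hypothetical sketch (it is neither OKpi nor any of the thesis' algorithms) that embeds a chain of VNFs onto servers under CPU and end-to-end latency constraints; the topology, capacities, and latency budget are invented for illustration:

```python
import heapq

def shortest_latency_path(links, src, dst):
    """Dijkstra over a dict {node: [(neighbor, latency_ms), ...]}."""
    dist, prev, seen = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, lat in links.get(u, []):
            nd = d + lat
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None, float("inf")
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[dst]

def greedy_embed(chain, servers, links, ingress, latency_budget_ms):
    """Place each VNF of `chain` (a list of CPU demands) on the feasible
    server reachable with the lowest added latency, tracking the budget."""
    placement, total_latency, current = [], 0.0, ingress
    for cpu_demand in chain:
        best = None
        for server, free_cpu in servers.items():
            if free_cpu < cpu_demand:
                continue
            _, lat = shortest_latency_path(links, current, server)
            if best is None or lat < best[1]:
                best = (server, lat)
        if best is None or total_latency + best[1] > latency_budget_ms:
            return None  # reject the request: no feasible embedding
        server, lat = best
        servers[server] -= cpu_demand
        total_latency += lat
        placement.append(server)
        current = server
    return placement, total_latency

# Toy example: a 3-VNF chain over a 4-node topology (all values made up).
links = {"gw": [("edge1", 2.0), ("edge2", 3.0)],
         "edge1": [("gw", 2.0), ("cloud", 8.0)],
         "edge2": [("gw", 3.0), ("cloud", 6.0)],
         "cloud": [("edge1", 8.0), ("edge2", 6.0)]}
servers = {"edge1": 4, "edge2": 2, "cloud": 16}
print(greedy_embed([2, 2, 4], servers, links, "gw", latency_budget_ms=20.0))
```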

    Towards edge robotics: the progress from cloud-based robotic systems to intelligent and context-aware robotic services

    Current robotic systems handle a diverse range of applications such as video surveillance, delivery of goods, cleaning, material handling, assembly, painting, or pick-and-place services. These systems have been embraced not only by the general population but also by vertical industries to help them perform daily activities. Traditionally, robotic systems have been deployed as standalone robots exclusively dedicated to performing a specific task, such as cleaning the floor in indoor environments. In recent years, cloud providers started to offer their infrastructures to robotic systems for offloading some of the robots' functions. This form of distributed robotic system was first introduced ten years ago as cloud robotics, and nowadays many robotic solutions appear in this form. As a result, standalone robots became software-enhanced objects with increased reconfigurability as well as decreased complexity and cost. Moreover, by offloading the heavy processing from the robot to the cloud, it is easier to share services and information from various robots or agents to achieve better cooperation and coordination. Cloud robotics is suitable for human-scale responsive and delay-tolerant robotic functionalities (e.g., monitoring, predictive maintenance). However, there is a whole set of real-time robotic applications (e.g., remote control, motion planning, autonomous navigation) that cannot be executed with cloud robotics solutions, mainly because cloud facilities traditionally reside far away from the robots. While cloud providers can ensure a certain performance in their infrastructure, very little can be ensured in the network between the robots and the cloud, especially in the last hop where wireless radio access networks are involved. Over the last years, advances in edge computing, fog computing, 5G NR, network slicing, Network Function Virtualization (NFV), and network orchestration have been stimulating the interest of the industrial sector in satisfying the stringent real-time requirements of its applications. Robotic systems are a key piece in the industrial digital transformation, and their benefits are well studied in the literature. However, designing and implementing a robotic system that integrates all the emerging technologies and meets the connectivity requirements (e.g., latency, reliability) is an ambitious task. This thesis studies the integration of modern Information and Communication Technologies (ICTs) in robotic systems and proposes robotic enhancements that tackle the real-time constraints of robotic services. To evaluate the performance of the proposed enhancements, this thesis starts from the design and prototype implementation of an edge-native robotic system that embodies the concepts of edge computing, fog computing, orchestration, and virtualization. The proposed edge robotics system serves to represent two exemplary robotic applications, namely autonomous navigation of mobile robots and remote control of a robot manipulator, where the end-to-end robotic system is distributed between the robots and the edge server. The open-source prototype implementation of the designed edge-native robotic system resulted in the creation of two real-world testbeds that are used in this thesis as baseline scenarios for the evaluation of new, innovative solutions in robotic systems.
After detailing the design and prototype implementation of the end-to-end edge-native robotic system, this thesis proposes several enhancements that can be offered to robotic systems by adapting the concept of edge computing via the Multi-Access Edge Computing (MEC) framework. First, it proposes exemplary network context-aware enhancements in which real-time information about robot connectivity and location is used to dynamically adapt the end-to-end system behavior to the actual status of the communication (e.g., the radio channel). Three exemplary context-aware enhancements are proposed that aim to optimize the end-to-end edge-native robotic system. Later, the thesis studies the capability of the edge-native robotic system to offer potential savings by means of computation offloading for robot manipulators in different deployment configurations. Further, the impact of different wireless channels (e.g., 5G, 4G, and Wi-Fi) on supporting the data exchange between a robot manipulator and its remote controller is assessed. In the following part of the thesis, the focus is set on how orchestration solutions can support mobile robot systems in making high-quality decisions. The application of OKpi as an orchestration algorithm and of DLT-based federation is studied to meet the KPIs of autonomously controlled mobile robots and to provide uninterrupted connectivity over the radio access network. The elaborated solutions are highly compatible with the designed edge robotics system, where the robot driving range is extended without any interruption of the end-to-end edge robotics service. While the DLT-based federation extends the robot driving range by deploying an access-point extension on top of an external domain infrastructure, OKpi selects the most suitable access point and computing resource in the cloud-to-thing continuum in order to fulfill the latency requirements of autonomously controlled mobile robots. To conclude the thesis, the focus is set on how robotic systems can improve their performance by leveraging Artificial Intelligence (AI) and Machine Learning (ML) algorithms to generate smart decisions. To do so, the edge-native robotic system is presented as a true embodiment of a Cyber-Physical System (CPS) in Industry 4.0, showing the role of AI in such a concept. It presents the key enabling technologies of the edge robotic system, such as edge, fog, and 5G, where the physical processes are integrated with the computing and network domains. The role of AI in each technology domain is identified by analyzing a set of AI agents at the application and infrastructure levels. In the last part of the thesis, movement prediction is selected to study the feasibility of applying a forecast-based recovery mechanism for real-time remote control of robotic manipulators (FoReCo), which uses ML to infer lost commands caused by interference in the wireless channel. The obtained results showcase its potential in simulation and real-world experimentation. Programa de Doctorado en Ingeniería Telemática, Universidad Carlos III de Madrid. Committee: President: Karl Holger; Secretary: Joerg Widmer; Member: Claudio Cicconett.
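
    As a simplified illustration of the forecast-based recovery idea behind FoReCo (not its actual ML-based implementation), the following sketch fills in a lost control command by extrapolating from the recent command history; the class name, window length, and linear extrapolator are assumptions:

```python
from collections import deque

class ForecastRecovery:
    """Toy stand-in for a forecast-based recovery mechanism: when a periodic
    control command is lost, extrapolate it from the recent history instead
    of leaving the robot without input. A linear extrapolator replaces the
    ML forecaster used in FoReCo; the window size is an assumption."""

    def __init__(self, history_len=10):
        self.history = deque(maxlen=history_len)

    def push(self, command):
        # Called whenever a real command (e.g., a joint setpoint) arrives.
        self.history.append(command)

    def predict_next(self):
        # Called when the command deadline expires without a new sample.
        if len(self.history) < 2:
            return self.history[-1] if self.history else None
        last, prev = self.history[-1], self.history[-2]
        guess = [2 * l - p for l, p in zip(last, prev)]  # linear extrapolation
        self.history.append(guess)  # consecutive losses keep extrapolating
        return guess

recovery = ForecastRecovery()
for setpoint in ([0.00, 0.10], [0.01, 0.12], [0.02, 0.14]):
    recovery.push(setpoint)
print(recovery.predict_next())  # filled-in command for the lost sample
```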

    Dynamic VNF Placement, Resource Allocation and Traffic Routing in 5G  

    5G networks are going to support a variety of vertical services, with a diverse set of key performance indicators (KPIs), by using enabling technologies such as software-defined networking and network function virtualization. It is the responsibility of the network operator to efficiently allocate the available resources to the service requests in such a way as to honor the KPI requirements, while accounting for the limited quantity of available resources and their cost. A critical challenge is that requests may vary widely over time, requiring a solution that accounts for their dynamic generation and termination. With this motivation, we seek to make joint decisions for request admission, resource activation, VNF placement, resource allocation, and traffic routing. We do so by considering real-world aspects such as the setup times of virtual machines, with the goal of maximizing the mobile network operator's profit. To this end, we first formulate a one-shot optimization problem that can attain the optimal solution for small-size instances, given complete knowledge of the arrival and departure times of requests over the entire system lifespan. We then propose an efficient and practical heuristic solution that only requires this knowledge for the next time period and works for realistically sized scenarios. Finally, we evaluate the performance of these solutions using real-world services and large-scale network topologies. Results demonstrate that our heuristic solution performs better than a state-of-the-art online approach and close to the optimum.
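
    A minimal sketch of the kind of profit-aware admission decision described above, assuming per-core-hour running costs and a fixed virtual machine setup cost (all names and numbers are illustrative, not the paper's model):

```python
from dataclasses import dataclass

@dataclass
class Request:
    # A hypothetical service request: all fields and values are illustrative.
    cpu: int           # CPU cores needed
    duration_h: float  # expected lifetime in hours
    revenue: float     # what the vertical pays if the request is admitted

def admission_decision(req, free_cpu, cost_per_core_hour, vm_setup_cost):
    """Toy profit-aware admission rule: admit only if the request fits and
    its revenue exceeds the estimated running plus activation cost."""
    if req.cpu > free_cpu:
        return False, 0.0
    cost = req.cpu * req.duration_h * cost_per_core_hour + vm_setup_cost
    profit = req.revenue - cost
    return profit > 0.0, profit

admit, profit = admission_decision(
    Request(cpu=4, duration_h=2.0, revenue=10.0),
    free_cpu=8, cost_per_core_hour=0.5, vm_setup_cost=1.0)
print(admit, profit)  # True, 5.0
```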

    Data-Driven Methods for Data Center Operations Support

    During the last decade, cloud technologies have been evolving at an impressive pace, such that we are now living in a cloud-native era where developers can leverage an unprecedented landscape of (possibly managed) services for orchestration, compute, storage, load balancing, monitoring, etc. The possibility of on-demand access to a diverse set of configurable virtualized resources allows for building more elastic, flexible, and highly resilient distributed applications. Behind the scenes, cloud providers sustain the heavy burden of maintaining the underlying infrastructures, which consist of large-scale distributed systems, partitioned and replicated among many geographically dispersed data centers to guarantee scalability, robustness to failures, high availability, and low latency. The larger the scale, the more cloud providers have to deal with complex interactions among the various components, such that monitoring, diagnosing, and troubleshooting issues become incredibly daunting tasks. To keep up with these challenges, development and operations practices have undergone significant transformations, especially in terms of improving the automation that makes releasing new software, and responding to unforeseen issues, faster and sustainable at scale. The resulting paradigm is nowadays referred to as DevOps. However, while such automation can be very sophisticated, traditional DevOps practices fundamentally rely on reactive mechanisms that typically require careful manual tuning and supervision by human experts. To minimize the risk of outages, and the related costs, it is crucial to provide DevOps teams with suitable tools that enable a proactive approach to data center operations. This work presents a comprehensive data-driven framework to address the most relevant problems that can be experienced in large-scale distributed cloud infrastructures. These environments are characterized by a very large availability of diverse data, collected at each level of the stack, such as: time series (e.g., physical host measurements, virtual machine or container metrics, networking component logs, application KPIs); graphs (e.g., network topologies, fault graphs reporting dependencies among hardware and software components, performance issue propagation networks); and text (e.g., source code, system logs, version control system history, code review feedback). Such data are also typically updated with relatively high frequency and are subject to distribution drifts caused by continuous configuration changes to the underlying infrastructure. In such a highly dynamic scenario, traditional model-driven approaches alone may be inadequate at capturing the complexity of the interactions among system components. DevOps teams would certainly benefit from robust data-driven methods to support their decisions based on historical information. For instance, effective anomaly detection capabilities may also help in conducting more precise and efficient root-cause analysis. Likewise, leveraging accurate forecasting and intelligent control strategies would improve resource management. Given their ability to deal with high-dimensional, complex data, Deep Learning-based methods are the most straightforward option for realizing the aforementioned support tools. On the other hand, because of their complexity, these models often require substantial processing power, and suitable hardware, to be operated effectively at scale.
These aspects must be carefully addressed when applying such methods in the context of data center operations. Automated operations approaches must be dependable and cost-efficient, so as not to degrade the services they are built to improve.
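
    As a small, illustrative baseline for the anomaly detection task mentioned above (deliberately simpler than the Deep Learning methods this work focuses on), a rolling z-score detector over a univariate metric could look as follows; the window size and threshold are assumptions:

```python
import numpy as np

def rolling_zscore_anomalies(series, window=60, threshold=3.0):
    """Toy baseline anomaly detector for a univariate metric (e.g., CPU load):
    flag points deviating more than `threshold` standard deviations from the
    mean of the preceding `window` samples. The thesis relies on deep
    learning models; this sketch only illustrates the detection task."""
    series = np.asarray(series, dtype=float)
    anomalies = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        mu, sigma = past.mean(), past.std()
        if sigma > 0 and abs(series[t] - mu) > threshold * sigma:
            anomalies.append(t)
    return anomalies

# Example: a flat metric with one injected spike at t=80.
metric = np.random.default_rng(0).normal(50.0, 1.0, 120)
metric[80] += 15.0
print(rolling_zscore_anomalies(metric, window=60))  # expected to include 80
```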