82 research outputs found
Modern computing: Vision and challenges
Over the past six decades, the computing systems field has experienced significant transformations, profoundly impacting society with transformational developments, such as the Internet and the commodification of computing. Underpinned by technological advancements, computer systems, far from being static, have been continuously evolving and adapting to cover multifaceted societal niches. This has led to new paradigms such as cloud, fog, edge computing, and the Internet of Things (IoT), which offer fresh economic and creative opportunities. Nevertheless, this rapid change poses complex research challenges, especially in maximizing potential and enhancing functionality. As such, to maintain an economical level of performance that meets ever-tighter requirements, one must understand the drivers of new model emergence and expansion, and how contemporary challenges differ from past ones. To that end, this article investigates and assesses the factors influencing the evolution of computing systems, covering established systems and architectures as well as newer developments, such as serverless computing, quantum computing, and on-device AI on edge devices. Trends emerge when one traces technological trajectory, which includes the rapid obsolescence of frameworks due to business and technical constraints, a move towards specialized systems and models, and varying approaches to centralized and decentralized control. This comprehensive review of modern computing systems looks ahead to the future of research in the field, highlighting key challenges and emerging trends, and underscoring their importance in cost-effectively driving technological progress
Co-designing reliability and performance for datacenter memory
Memory is one of the key components that affects reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data. For example, a cache hierarchy in multi-core chips reduces access latency, non-uniform memory access (NUMA) in multi-socket servers improves scalability, and disaggregation increases memory capacity. In all these organizations, hardware coherence protocols are used to maintain memory consistency of this shared memory and implicitly move data to the requesting cores.
This thesis aims to provide fault-tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance the performance of applications. The solutions build on modern coherence protocols to achieve these properties.
First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes Dvé, a hardware driven replication mechanism where data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. Dvé can flexibly provide these benefits on-demand at runtime.
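The detect-then-correct-from-replica idea can be illustrated with a small software sketch. This is only a toy model under stated assumptions, not the thesis's hardware design: two Python dictionaries stand in for the two memory controllers, CRC32 stands in for the detection code, and the class name `ReplicatedMemory` and its methods are hypothetical.

```python
import zlib


class ReplicatedMemory:
    """Toy model: each block is stored twice, each copy with a CRC32 checksum.

    Assumption: CRC32 stands in for the "code with strong error detection
    capabilities"; the abstract does not say which code is used.
    """

    def __init__(self):
        # Two dicts stand in for two independent memory controllers.
        self.controllers = [{}, {}]

    def write(self, addr, data: bytes):
        # Keep both replicas in sync on every write.
        record = (data, zlib.crc32(data))
        for ctrl in self.controllers:
            ctrl[addr] = record

    def read(self, addr, preferred=0):
        # Try the preferred (for example, nearer) copy first, then the other.
        corrupted = []
        for i in (preferred, 1 - preferred):
            data, checksum = self.controllers[i][addr]
            if zlib.crc32(data) == checksum:
                # Correct any copy that failed detection using the good one.
                for j in corrupted:
                    self.controllers[j][addr] = (data, checksum)
                return data
            corrupted.append(i)
        raise IOError(f"both replicas of block {addr:#x} are corrupted")
```

In this sketch, a read that hits a corrupted copy falls back to the other controller and repairs the bad copy, while fault-free reads can be served from whichever copy the caller prefers.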
Next, we observe that the coherence protocol itself needs to be hardened against failures. Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the level of fault-tolerance necessary to operate at an inter-server scale within the datacenter. Compute servers in the datacenter can fail or become unresponsive, and therefore it is important that the coherence protocol remain available in the presence of such failures.
The thesis proposes Āpta, a CXL-based, shared disaggregated memory system for keeping the cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance fault-tolerant object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications
Data-Driven Methods for Data Center Operations Support
During the last decade, cloud technologies have been evolving at
an impressive pace, such that we are now living in a cloud-native
era where developers can leverage an unprecedented landscape
of (possibly managed) services for orchestration, compute, storage,
load-balancing, monitoring, etc. The possibility to have on-demand
access to a diverse set of configurable virtualized resources allows
for building more elastic, flexible and highly-resilient distributed
applications. Behind the scenes, cloud providers sustain the heavy
burden of maintaining the underlying infrastructures, consisting of
large-scale distributed systems, partitioned and replicated among
many geographically distributed data centers to guarantee scalability,
robustness to failures, high availability and low latency. The larger the
scale, the more cloud providers have to deal with complex interactions
among the various components, such that monitoring, diagnosing and
troubleshooting issues become incredibly daunting tasks.
To keep up with these challenges, development and operations
practices have undergone significant transformations, especially in
terms of improving the automations that make releasing new software,
and responding to unforeseen issues, faster and sustainable at scale.
The resulting paradigm is nowadays referred to as DevOps. However,
while such automations can be very sophisticated, traditional DevOps
practices fundamentally rely on reactive mechanisms that typically
require careful manual tuning and supervision from human experts.
To minimize the risk of outages—and the related costs—it is crucial to
provide DevOps teams with suitable tools that can enable a proactive
approach to data center operations.
This work presents a comprehensive data-driven framework to address
the most relevant problems that can be experienced in large-scale
distributed cloud infrastructures. These environments are indeed characterized
by a very large availability of diverse data, collected at each
level of the stack, such as: time-series (e.g., physical host measurements,
virtual machine or container metrics, networking components
logs, application KPIs); graphs (e.g., network topologies, fault graphs
reporting dependencies among hardware and software components,
performance issues propagation networks); and text (e.g., source code,
system logs, version control system history, code review feedback).
Such data are also typically updated with relatively high frequency,
and subject to distribution drifts caused by continuous configuration
changes to the underlying infrastructure. In such a highly dynamic scenario,
traditional model-driven approaches alone may be inadequate
at capturing the complexity of the interactions among system components. DevOps teams would certainly benefit from having robust
data-driven methods to support their decisions based on historical
information. For instance, effective anomaly detection capabilities may
also help in conducting more precise and efficient root-cause analysis.
Also, leveraging accurate forecasting and intelligent control
strategies would improve resource management.
Given their ability to deal with high-dimensional, complex data,
Deep Learning-based methods are the most straightforward option for
the realization of the aforementioned support tools. On the other hand,
because of their complexity, models of this kind often require huge
processing power, and suitable hardware, to be operated effectively
at scale. These aspects must be carefully addressed when applying
such methods in the context of data center operations. Automated
operations approaches must be dependable and cost-efficient, so as not
to degrade the services they are built to improve.
Reliability-oriented resource management for High-Performance Computing
Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures are a threat to large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly, by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system knowledge, whereas system-wide resource management systems can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the exploitation of checkpoint/restore mechanisms with proactive resource management policies
Technologies and Applications for Big Data Value
This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part “Technologies and Methods” contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part “Processes and Applications” details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences, first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI. Second, practitioners and industry experts engaged in data-driven systems, software design and deployment projects who are interested in employing these advanced methods to address real-world problems
Cybersecurity of Digital Service Chains
This open access book presents the main scientific results from the H2020 GUARD project. The GUARD project aims at filling the current technological gap between software management paradigms and cybersecurity models, the latter still lacking orchestration and agility to effectively address the dynamicity of the former. This book provides a comprehensive review of the main concepts, architectures, algorithms, and non-technical aspects developed during three years of investigation; the description of the Smart Mobility use case developed at the end of the project gives a practical example of how the GUARD platform and related technologies can be deployed in practical scenarios. We expect the book to be interesting for the broad group of researchers, engineers, and professionals daily experiencing the inadequacy of outdated cybersecurity models for modern computing environments and cyber-physical systems
Energy and Performance Management of Virtual Machines: Provisioning, Placement and Consolidation
Cloud computing is a new computing paradigm that offers scalable storage and compute resources to users on demand over the Internet. Public cloud providers operate large-scale data centers around the world to handle a large number of user requests. However, data centers consume an immense amount of electrical energy, which can lead to high operating costs and carbon emissions. One of the most common and effective methods for reducing energy consumption is Dynamic Virtual Machines Consolidation (DVMC), enabled by virtualization technology. DVMC dynamically consolidates Virtual Machines (VMs) onto the minimum number of active servers and then switches the idle servers into a power-saving mode to save energy. However, maintaining the desired level of Quality-of-Service (QoS) between data centers and their users is critical for satisfying users’ expectations concerning performance. Therefore, the main challenge is to minimize data center energy consumption while maintaining the required QoS.
This thesis addresses this challenge by presenting novel DVMC approaches to reduce the energy consumption of data centers and improve resource utilization under workload-independent quality-of-service constraints. These approaches can be divided into three main categories: heuristic, meta-heuristic and machine learning.
Our first contribution is a heuristic algorithm for solving the DVMC problem. The algorithm uses a linear regression-based prediction model to detect overloaded servers based on historical utilization data. It then migrates some VMs away from the overloaded servers to avoid further performance degradation. Moreover, our algorithm consolidates VMs onto fewer servers to save energy. The second and third contributions are two novel DVMC algorithms based on the Reinforcement Learning (RL) approach. RL is well suited to highly adaptive and autonomous management in dynamic environments. For this reason, we use RL to solve two main sub-problems in VM consolidation. The first sub-problem is server power mode detection (sleep or active). The second sub-problem is to find an effective solution for server status detection (overloaded or non-overloaded). The fourth contribution of this thesis is an online optimization meta-heuristic algorithm called Ant Colony System-based Placement Optimization (ACS-PO). ACS is a suitable approach for VM consolidation due to its ease of parallelization, its near-optimal solutions, and its polynomial worst-case time complexity. The simulation results show that ACS-PO provides substantial improvement over other heuristic algorithms in reducing energy consumption, the number of VM migrations, and performance degradation.
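The regression-based overload detection mentioned for the first contribution can be sketched as follows. This is a minimal illustration under assumptions, not the thesis's algorithm: it fits an ordinary least-squares line to a server's recent utilization samples, extrapolates one step ahead, and flags the server if the prediction exceeds a threshold. The function names and the 0.8 threshold are hypothetical.

```python
def predict_next_utilization(history):
    """Fit a least-squares line through (t, utilization) and extrapolate one step.

    `history` is a non-empty list of utilization samples in [0, 1],
    taken at equally spaced times t = 0, 1, ..., n - 1.
    """
    n = len(history)
    if n < 2:
        # Not enough points to fit a slope; fall back to the last sample.
        return history[-1]
    mean_t = (n - 1) / 2
    mean_u = sum(history) / n
    cov = sum((t - mean_t) * (u - mean_u) for t, u in enumerate(history))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    intercept = mean_u - slope * mean_t
    # Predicted utilization at the next time step, t = n.
    return slope * n + intercept


def is_overloaded(history, threshold=0.8):
    """Flag a server whose predicted utilization exceeds the threshold.

    The 0.8 default is an assumed value chosen for illustration only.
    """
    return predict_next_utilization(history) > threshold
```

A server with a rising trend, such as `[0.5, 0.6, 0.7, 0.8]`, would be flagged before it actually saturates, which is when migrating some of its VMs is still cheap.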
Our fifth contribution is a Hierarchical VM management (HiVM) architecture based on a three-tier data center topology, which is very commonly used in data centers. HiVM can scale across many thousands of servers while remaining energy efficient. Our sixth contribution is a Utilization Prediction-aware Best Fit Decreasing (UP-BFD) algorithm. UP-BFD can avoid SLA violations and needless migrations by taking into consideration the current and predicted future resource requirements for allocation, consolidation, and placement of VMs.
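A prediction-aware variant of Best Fit Decreasing can be sketched as below. This is a generic textbook-style BFD under stated assumptions, not the UP-BFD algorithm itself: each VM is sized by the larger of its current and predicted demand (an assumed way of combining the two), demands and capacities are integer CPU percentages, and the function name `place_vms` is hypothetical.

```python
def place_vms(vms, hosts):
    """Place VMs on hosts with Best Fit Decreasing over effective demand.

    vms:   dict of VM name -> (current_demand, predicted_demand)
    hosts: dict of host name -> capacity
    Returns a dict of VM name -> host name.
    """
    free = dict(hosts)
    # Assumption: size each VM by the larger of current and predicted demand,
    # so a placement does not become overloaded when the prediction holds.
    demand = {vm: max(cur, pred) for vm, (cur, pred) in vms.items()}
    placement = {}
    # "Decreasing": handle the largest VMs first.
    for vm in sorted(demand, key=demand.get, reverse=True):
        need = demand[vm]
        candidates = [h for h, cap in free.items() if cap >= need]
        if not candidates:
            raise ValueError(f"no host can fit VM {vm!r}")
        # "Best fit": pick the host left with the least spare capacity.
        best = min(candidates, key=lambda h: free[h] - need)
        free[best] -= need
        placement[vm] = best
    return placement
```

For example, `place_vms({"a": (50, 70), "b": (20, 20), "c": (30, 10)}, {"h1": 100, "h2": 100})` packs `a` and `c` onto `h1` and puts `b` on `h2`, because `a` is sized at its predicted 70 rather than its current 50.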
Finally, the seventh and last contribution is a novel Self-Adaptive Resource Management System (SARMS) for data centers. To achieve scalability, SARMS uses a hierarchical architecture that is partially inspired by HiVM. Moreover, SARMS provides a self-adaptive ability for resource management by dynamically adjusting the utilization thresholds for each server in data centers.
Teaching and Research Project, Original Research Work and Defense Presentation, prepared by Germán Moltó to compete for the position of Catedrático de Universidad (Full Professor), competition 082/22, position 6708, area of Computer Science and Artificial Intelligence
This document contains the teaching and research project of the candidate Germán Moltó Martínez, submitted as a requirement for the competition for access to positions in the Cuerpos Docentes Universitarios (university teaching corps). Specifically, the document concerns the competition for position 6708 of Catedrático de Universidad (Full Professor) in the area of Computer Science in the Departamento de Sistemas Informáticos y Computación of the Universitat Politécnica de València. The position is assigned to the Escola Técnica Superior d'Enginyeria Informàtica, and its profile covers the courses "Infraestructuras de Cloud Público" (Public Cloud Infrastructures) and "Estructuras de Datos y Algoritmos" (Data Structures and Algorithms). The candidate's academic, teaching and research record is also included, along with the presentation used during the defense. Germán Moltó Martínez (2022). Proyecto Docente e Investigador, Trabajo Original de Investigación y Presentación de la Defensa, preparado por Germán Moltó para concursar a la plaza de Catedrático de Universidad, concurso 082/22, plaza 6708, área de Ciencia de la Computación e Inteligencia Artificial. http://hdl.handle.net/10251/18903
A Survey of Intelligent Network Slicing Management for Industrial IoT: Integrated Approaches for Smart Transportation, Smart Energy, and Smart Factory
This is the author accepted manuscript. The final version is available from IEEE via the DOI in this record. Network slicing has been widely recognized as a promising technique to accommodate diverse services for the Industrial Internet of Things (IIoT). Smart transportation, smart energy, and smart factory/manufacturing are the three key services that form the backbone of IIoT. Network slicing management is of paramount importance in the face of IIoT services with diversified requirements. It is important to have a comprehensive survey on intelligent network slicing management to provide guidance for future research in this field. In this paper, we provide a thorough investigation and analysis of network slicing management in its general use cases as well as specific IIoT services including smart transportation, smart energy and smart factory, and highlight the advantages and drawbacks across many existing works/surveys and this current survey in terms of a set of important criteria. In addition, we present an architecture for intelligent network slicing management for IIoT focusing on the above three IIoT services. For each service, we provide a detailed analysis of the application requirements and network slicing architecture, as well as the associated enabling technologies. Further, we present an in-depth discussion of network slicing orchestration and management for each service, in terms of orchestration architecture, AI-assisted management and operation, edge computing empowered network slicing, reliability, and security. For the presented architecture for intelligent network slicing management and its application in each IIoT service, we identify the corresponding key challenges and open issues that can guide future research. To facilitate the understanding of the implementation, we provide a case study of the intelligent network slicing management for integrated smart transportation, smart energy, and smart factory.
Some lessons learnt include: 1) For smart transportation, it is necessary to explicitly identify service function chains (SFCs) for specific applications along with the orchestration of underlying VNFs/PNFs for supporting such SFCs; 2) For smart energy, it is crucial to guarantee both ultra-low latency and extremely high reliability; 3) For smart factory, resource management across heterogeneous network domains is of paramount importance. We hope that this survey is useful for both researchers and engineers on the innovation and deployment of intelligent network slicing management for IIoT. Engineering and Physical Sciences Research Council (EPSRC); Singapore University of Technology and Design (SUTD); Hong Kong RGC Research Impact Fund (RIF); National Natural Science Foundation of China; Shenzhen Science and Technology Innovation Commissio