    Performance characterization of multi-container deployment schemes for online learning inference

    Online machine learning (ML) inference services provide users with an interactive way to request predictions in real time. To meet the notable computational requirements of such services, they are increasingly being deployed in the Cloud. In this context, the efficient provisioning and optimization of ML inference services in the Cloud is critical to achieve the required performance and meet the dynamic queries of end users. Existing provisioning solutions focus on framework parameter tuning and infrastructure resource scaling, without considering deployments based on containerization technologies, which promise reproducibility and portability for ML inference services. There is limited knowledge about the impact of distinct container-level deployment schemes on the performance of online ML inference services, particularly on how to exploit multi-container deployments and their relation to processor and memory affinity. In light of this, in this paper we experimentally investigate the containerization of ML inference services and analyze the performance of multi-container deployments that partition the threads belonging to an online learning application into multiple containers in each node. This paper shares the findings and lessons learned from running realistic client patterns against an image classification model across numerous deployment configurations, with particular attention to the impact of container granularity and its potential to exploit processor and memory affinity. Our results indicate that fine-grained multi-container deployments and affinity are useful for improving performance (both throughput and latency).
In particular, our experiments on single-node and four-node clusters show up to 69% and 87% performance improvement compared to the single-container deployment, respectively. This work was partially supported by Lenovo as part of the Lenovo-BSC collaboration agreement, by the Spanish Government under contract PID2019-107255GB-C22, and by the Generalitat de Catalunya under contract 2021-SGR-00478 and under grant 2020 FI-B 00257. Peer reviewed. Postprint (author's final draft).
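
As an illustration of what partitioning an application's threads into several containers with processor affinity can look like, the sketch below splits a node's CPU ids into contiguous groups, one per container. It is not the deployment tooling used in the paper: the function names, the 16-CPU node, and the choice of four containers are assumptions made for this example.

```python
def partition_cpus(cpu_ids, num_containers):
    """Split cpu_ids into num_containers contiguous, near-equal groups."""
    if num_containers < 1 or num_containers > len(cpu_ids):
        raise ValueError("need 1 <= num_containers <= number of CPUs")
    base, extra = divmod(len(cpu_ids), num_containers)
    groups, start = [], 0
    for i in range(num_containers):
        # The first `extra` groups take one additional CPU each.
        size = base + (1 if i < extra else 0)
        groups.append(cpu_ids[start:start + size])
        start += size
    return groups


def cpuset_string(group):
    """Format a CPU group as a comma-separated list such as '0,1,2,3'."""
    return ",".join(str(c) for c in group)


# Hypothetical node with 16 hardware threads, split across 4 containers.
groups = partition_cpus(list(range(16)), 4)
cpusets = [cpuset_string(g) for g in groups]
```

Each resulting string could be passed to a container runtime's CPU pinning option, for example Docker's `--cpuset-cpus`, with `--cpuset-mems` set to the matching NUMA node for memory affinity. Which CPU ids share a NUMA node depends on the machine, so contiguous ids are only a starting assumption.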

    Adaptive CPU Allocation for Resource Isolation and Work Conservation

    Consolidating multiple workloads on the same physical machine is an effective measure for utilizing resources efficiently and reducing costs. The main objective is to execute multiple demanding workloads using no more resources than necessary while simultaneously maximizing performance. Conventional work-conserving resource managers are designed for this purpose. However, without adequate control, the performance of consolidated workloads may degrade dramatically or become unpredictable because of contention for shared resources. Hence, resource isolation should be enforced according to a sharing policy when there is resource contention among workloads, i.e., each workload should obtain a theoretical share of resources. In reality, it is challenging for state-of-the-art resource managers to achieve both resource isolation and work conservation simultaneously due to complex and dynamic workloads. This thesis proposes adaptive resource allocation to address this sharing problem and studies CPU management as an example. A novel feedback-based resource manager is designed to perform adaptive allocation of CPU resources, taking into account each workload's requirements. First, an application-agnostic metric is proposed as the feedback signal, which can be used to measure the performance change of various applications in a non-invasive and timely way. Second, two alternative feedback-based algorithms are designed to search for the optimal resource allocation for each workload. The adaptive allocation is modelled as a dynamic optimization problem. The algorithms solve this problem by assessing performance changes in response to a change in resource allocation. The algorithms are demonstrated to be capable of handling complex and dynamic workloads. The resource manager proposed in this thesis uses these algorithms to determine the CPU allocation for multiple tenants. A prototype is implemented with four different sharing policies.
For three common policies, the experimental evaluation confirms that the resource manager can achieve resource isolation and work conservation simultaneously, while the existing best-practice mechanisms cannot. Moreover, the resource manager can support a novel efficiency policy, which determines CPU sharing based on the overall system efficiency. In addition, a preliminary study shows that the feedback-based methodology for CPU management can be extended to control I/O bandwidth.
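
The idea of searching for an allocation by observing how performance responds to allocation changes can be sketched as a simple feedback loop. This is not either of the thesis's two algorithms, whose details the abstract does not give: the function `adapt_allocation`, its thresholds, and the saturating `toy_workload` are all hypothetical, and the loop only grows an allocation, whereas a fuller controller would also shrink it.

```python
def adapt_allocation(measure, start, lo, hi, step=1.0, min_gain=0.02, max_iters=50):
    """Grow a CPU allocation while each step yields a relative gain of at least min_gain."""
    alloc = max(lo, min(hi, start))
    perf = measure(alloc)
    for _ in range(max_iters):
        candidate = min(hi, alloc + step)
        if candidate == alloc:
            break  # already at the upper bound
        new_perf = measure(candidate)
        gain = (new_perf - perf) / perf if perf > 0 else float("inf")
        if gain < min_gain:
            break  # extra CPU no longer helps, so leave it to other tenants
        alloc, perf = candidate, new_perf
    return alloc


def toy_workload(cpus):
    # Hypothetical feedback signal: throughput that saturates at 4 CPUs.
    return 100.0 * min(cpus, 4.0)


result = adapt_allocation(toy_workload, start=1.0, lo=1.0, hi=8.0)
```

With this toy signal the loop stops at 4 CPUs, the point where an additional CPU stops improving the measured performance.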

    Data-Driven Intelligent Scheduling For Long Running Workloads In Large-Scale Datacenters

    Cloud computing is becoming a fundamental facility of society today. Large-scale public or private cloud datacenters, spanning millions of servers and operating as warehouse-scale computers, support most of the business of Fortune 500 companies and serve billions of users around the world. Unfortunately, the modern industry-wide average datacenter utilization is as low as 6% to 12%. Low utilization not only negatively impacts the operational and capital components of cost efficiency, but also becomes a scaling bottleneck because of limits on the electricity delivered by nearby utilities. Improving multi-resource efficiency for global datacenters is both critical and challenging. Additionally, with the great commercial success of diverse big data analytics services, enterprise datacenters are evolving to host heterogeneous computation workloads, including online web services, batch processing, machine learning, streaming computing, interactive query, and graph computation, on shared clusters. Most of these are long-running workloads that leverage long-lived containers to execute tasks. We surveyed datacenter resource scheduling work from the last 15 years. Most previous work is designed to maximize cluster efficiency for short-lived tasks in batch processing systems like Hadoop and is not suitable for modern long-running workloads such as microservices or systems like Spark, Flink, Pregel, Storm, or TensorFlow. New, effective scheduling and resource allocation approaches are urgently needed to improve efficiency in large-scale enterprise datacenters. In this dissertation, we are the first to define and identify the problems, challenges, and scenarios of scheduling and resource management for diverse long-running workloads in modern datacenters. Such workloads rely on predictive scheduling techniques to perform reservation, auto-scaling, migration, or rescheduling, which pushes us to explore more intelligent scheduling techniques backed by adequate predictive knowledge.
We specify what intelligent scheduling is, which abilities it requires, and how to leverage it to transform NP-hard online scheduling problems into solvable offline scheduling problems. We designed and implemented an intelligent cloud datacenter scheduler that automatically performs resource-to-performance modeling, predictive optimal reservation estimation, and QoS (interference)-aware predictive scheduling to maximize resource efficiency across multiple dimensions (CPU, memory, network, disk I/O) while strictly guaranteeing service level agreements (SLAs) for long-running workloads. Finally, we introduce a large-scale co-location technique for executing long-running and other workloads on the shared global datacenter infrastructure of Alibaba Group. It improves cluster utilization from 10% to 50% on average. This effort goes far beyond scheduling: it involves the evolution of IDC, network, physical datacenter topology, storage, server hardware, operating systems, and containerization. We demonstrate its effectiveness through an analysis of the latest Alibaba public cluster trace, from 2017. We are the first to reveal, with data, a global view of the scenarios, challenges, and status of Alibaba's large-scale global datacenters, including big promotion events like Double 11. Data-driven intelligent scheduling methodologies and effective infrastructure co-location techniques are critical and necessary for maximizing multi-resource efficiency in modern large-scale datacenters, especially for long-running workloads.

    Edge Video Analytics: A Survey on Applications, Systems and Enabling Techniques

    Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. The basic concepts of EVA (e.g., definition, architectures) were not fully elucidated due to the rapid development of this domain. To fill these gaps, we provide a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA. Comment: 31 pages, 13 figures.

    Resource optimization of edge servers dealing with priority-based workloads by utilizing service level objective-aware virtual rebalancing

    IoT enables profitable communication between sensor/actuator devices and the cloud. However, slow networks keep Edge data from reaching Cloud analytics in time, which hinders the adoption of real-time analytics. VRebalance addresses the performance of priority-based workloads for stream processing at the Edge. It uses Bayesian optimization (BO) to prioritize workloads and find optimal resource configurations for efficient resource management. VRebalance was evaluated on the Apache Storm platform with the RIoTBench IoT benchmark tool for real-time stream processing. The study shows that VRebalance is more effective than traditional methods and meets SLO targets despite system changes. Compared to a hill climbing algorithm, VRebalance decreased SLO violation rates by almost 30% for static priority-based workloads and by 52.2% for dynamic priority-based workloads. Compared to Apache Storm's default allocation, it decreased SLO violations by 66.1%.

    QoS-Based Optimization of Runtime Management of Sensing Cloud Applications

    This thesis presents approaches and techniques for quality-aware improvement of the runtime management of IoT applications. IoT applications perceive their environment through the sensors of smart devices in order to analyze it or interact with it. Smart devices are limited in computing and storage capacity, which is why many IoT applications are connected via an IoT platform to elastic and scalable cloud services. The load on the cloud service is generated by the connected smart devices, which continuously transfer messages. The resource configuration of the cloud service determines its capacity. A service operator who runs an IoT application faces the challenge of configuring the smart devices and the cloud service so that high data quality is achieved at low operating costs. To support the service operator at design time, we model cost functions for data qualities that are influenced by the interplay of the smart device and cloud service configurations. With these cost functions, a service operator can search for a cost-minimal configuration for specific scenarios. Existing approaches to optimizing applications at design time focus on traditional software architectures and therefore do not offer the concepts needed for cost modeling of IoT applications. Furthermore, we support the service operator with load control methods that react to capacity bottlenecks of the cloud service through a controlled reduction of the message rate. While this can adversely affect the accuracy of the measurements, time delays stabilize and the IoT application remains available even in severe overload scenarios. Existing runtime techniques focus on the automatic resource provisioning of cloud services through auto-scalers.
These make it possible to react to capacity bottlenecks and load fluctuations, but the achieved quality of service (QoS) can come with high operating costs. With the load control methods we therefore provide a further technique that can react dynamically to capacity bottlenecks and also use the available capacity of a cloud service efficiently. In addition, we present coupling techniques that combine auto-scaling and load control. Existing approaches to reconfiguring smart devices concentrate on qualities such as accuracy or energy efficiency and are therefore unsuitable for reacting to capacity bottlenecks. In summary, the dissertation makes the following contributions: 1. Investigation of performance metrics for scaling decisions: We evaluated infrastructure-level and application-level metrics for their suitability for scaling decisions for microservices with varying characteristics. Based on the results, a service operator can make an informed decision about which performance metric is best suited for scaling a particular microservice. 2. Design of QoS cost functions for IoT applications: We established a QoS cost model that captures the effect of the smart device and cloud service configurations on the qualities of an IoT application. Based on these cost models, the configuration of IoT applications can be optimized at design time. Furthermore, the cost functions can be used to evaluate runtime methods with respect to their contribution to QoS in different scenarios. 3. Development of load control methods for IoT applications: The presented methods offer a mechanism complementary to auto-scaling for maintaining QoS during capacity bottlenecks.
Here, the total load on the cloud service is reduced by adjusting the message rate of the smart devices. This gives a service operator the option of countering capacity bottlenecks by degrading data quality. 4. Coupling of load control with resource provisioning: We present rule-based coupling mechanisms that reactively activate load control methods or auto-scalers and thereby couple them. This makes it possible to react to capacity bottlenecks through a combination of reduced data quality and increased resource costs. 5. Design of a framework for developing self-adaptive systems: The self-adaptive framework offers an application model for IoT applications and concepts for reconfiguring microservices and smart devices. It can be set up in different cloud environments and accelerates the prototypical development of runtime methods. We validated the approaches using two case study systems of differing complexity. The first case study system consists of a cloud service that processes messages from virtual smart devices via an IoT platform. With this system, we analyzed the characteristics of the presented load control methods for different application scenarios in order to compare them against auto-scaling and a coupling of the two approaches. It turned out that the load control methods can address overload scenarios about as efficiently as auto-scalers and that the QoS lies in a comparable range. On average, the load control methods achieved about 50% lower total QoS costs in the scenarios examined. It also became apparent that both auto-scaling and the load control methods have clear disadvantages in certain application scenarios, for example when data accuracy or resource costs are the priority.
A coupling proved to be always advantageous here for maintaining the QoS. In the second case study system, we implemented an intelligent heating solution from Robert Bosch GmbH in order to validate the approaches on a more complex system. Here too, a combination of load control and auto-scaling proved most advantageous and contributed to high data quality at low resource costs. The results show that the presented load control methods are suitable for improving the QoS of IoT applications. They thus give a service operator a further tool for the runtime management of IoT applications, one that uses a mechanism complementary to auto-scaling. We instantiated the framework presented here for developing self-adaptive IoT systems to answer the research questions empirically, thereby demonstrating its suitability. We also show an exemplary use of the presented cost functions for different application scenarios and, in the course of the validation, integrate them into an optimization framework.
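
The load control idea described in this abstract, reducing the smart devices' message rate when the cloud service runs short of capacity, can be sketched as a proportional throttle. This is only an illustration under assumptions: the thesis does not state its control law here, and the function `throttle_rates`, the device names, and the numbers are hypothetical.

```python
def throttle_rates(rates, capacity):
    """Scale per-device message rates so that their sum does not exceed capacity."""
    total = sum(rates.values())
    if total <= capacity or total == 0:
        return dict(rates)  # no bottleneck, keep the configured rates
    factor = capacity / total
    # Every device gives up the same fraction of its rate (an assumed policy).
    return {device: rate * factor for device, rate in rates.items()}


# Hypothetical devices and rates in messages per second.
rates = {"dev-a": 40.0, "dev-b": 40.0, "dev-c": 20.0}
throttled = throttle_rates(rates, capacity=50.0)
```

Lower message rates trade measurement accuracy for availability, which is the trade-off the abstract names. An auto-scaler would instead raise `capacity` at a higher resource cost, and a coupled approach would do some of both.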

    Machine Learning Algorithms for Provisioning Cloud/Edge Applications

    International Mention in the doctoral degree. Reinforcement Learning (RL), in which an agent is trained to make the most favourable decisions in the long run, is an established technique in artificial intelligence. Its popularity has increased in the recent past, largely due to the development of deep neural networks spawning deep reinforcement learning algorithms such as Deep Q-Learning. The latter have been used to solve previously insurmountable problems, such as playing the famed game of “Go”, which earlier algorithms could not do. Many such problems suffer from the curse of dimensionality, in which the sheer number of possible states is so overwhelming that it is impractical to explore every possible option. While these recent techniques have been successful, they may not be strictly necessary or practical for some applications such as cloud provisioning. In these situations, the action space is not as vast, and the workload data required to train such systems is not as widely shared, as it is considered commercially sensitive by the Application Service Provider (ASP). Given that provisioning decisions evolve over time in sympathy with incident workloads, they fit into the sequential decision process problem that legacy RL was designed to solve. However, because of the high correlation of time series data, states are not independent of each other, and the legacy Markov Decision Processes (MDPs) have to be cleverly adapted to create robust provisioning algorithms. As the first contribution of this thesis, we exploit knowledge of both the application and the configuration to create an adaptive provisioning system leveraging stationary Markov distributions. We then develop algorithms that, with neither application nor configuration knowledge, solve the underlying MDP to create provisioning systems.
Our Q-Learning algorithms factor in the correlation between states and the consequent transitions between them to create provisioning systems that not only adapt to workloads but can also exploit similarities between them, thereby reducing the retraining overhead. Our algorithms also converge in fewer learning steps, because we restructure the state and action spaces to avoid the curse of dimensionality without needing the function approximation approach taken by deep Q-Learning systems. A crucial use case of future networks will be the support of low-latency applications involving highly mobile users. With these in mind, the European Telecommunications Standards Institute (ETSI) has proposed the Multi-access Edge Computing (MEC) architecture, in which computing capabilities can be located close to the network edge, where the data is generated. Provisioning for such applications therefore entails migrating them to the most suitable location on the network edge as the users move. In this thesis, we also tackle this type of provisioning by considering vehicle platooning or Cooperative Adaptive Cruise Control (CACC) on the edge. We show that our Q-Learning algorithm can be adapted to minimize the number of migrations required to effectively run such an application on MEC hosts, which may also be subject to traffic from other competing applications. This work has been supported by IMDEA Networks Institute. Doctoral Programme in Telematic Engineering, Universidad Carlos III de Madrid. President: Antonio Fernández Anta. Secretary: Diego Perino. Member: Ilenia Tinnirell
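
For readers unfamiliar with the tabular Q-Learning that this abstract builds on, the standard update rule is Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]. The sketch below applies one such update to a toy provisioning setting. It shows only the textbook rule, not the thesis's restructured state and action spaces; the state labels, the action set, the reward, and the hyperparameters are assumptions chosen for illustration.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-Learning update and return the new Q(state, action)."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Hypothetical setting: states are workload levels, actions are instance counts.
actions = [1, 2, 3]
q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
q[("high", 3)] = 10.0           # pretend this value was learned earlier
value = q_update(q, "low", 1, reward=2.0, next_state="high", actions=actions)
```

Here the new estimate is 0 + 0.5 × (2 + 0.9 × 10 − 0) = 5.5. Because the table has one entry per state-action pair, keeping both spaces small is what lets this approach avoid function approximation.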