307 research outputs found

    TANDEM: taming failures in next-generation datacenters with emerging memory

    Get PDF
    The explosive growth of online services, leading to unforeseen scales, has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery, minimizing service interruptions. Applications, owing to recovery, entail additional measures to maintain a recoverable state of data and computation logic during their failure-free execution. However, these precautionary measures have severe implications on performance, correctness, and programmability, making recovery incredibly challenging to realize in practice. Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges; Their distinct architectural attributes, differing significantly from traditional memory devices, introduce new semantic challenges for implementing recovery, complicating correctness and programmability. Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges. When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness including recovery; and (4) how to preserve programmer productivity (Programmability). This thesis aims to address these challenges through the following approaches: (a) defining precise consistency models that formally specify correct end-to-end semantics in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery. We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor’s load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data hardly persist in any specific order, jeop- ardizing recovery and correctness. Therefore, recovery needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model – Release Persistency (RP) – that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitecture mechanism, lazy release persistence (LRP). Using standard LFDs benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance. We continue our discussion with memory disaggregation which decouples memory from traditional monolithic servers, offering a promising pathway for achieving very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where compute and memory are independent. However, there is a challenge: disaggregated memory (DM) fails to work with RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes). Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVS. Our experimental implementation artifacts demonstrate that Pandora achieves fast recovery and high availability while causing minimal disruption to services. Finally, we introduce a novel target litmus-testing framework – DART – to validate the end-to-end correctness of transactional protocols with recovery. Using DART’s target testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from the programmers

    Real-world Machine Learning Systems: A survey from a Data-Oriented Architecture Perspective

    Full text link
    Machine Learning models are being deployed as parts of real-world systems with the upsurge of interest in artificial intelligence. The design, implementation, and maintenance of such systems are challenged by real-world environments that produce larger amounts of heterogeneous data and users requiring increasingly faster responses with efficient resource consumption. These requirements push prevalent software architectures to the limit when deploying ML-based systems. Data-oriented Architecture (DOA) is an emerging concept that equips systems better for integrating ML models. DOA extends current architectures to create data-driven, loosely coupled, decentralised, open systems. Even though papers on deployed ML-based systems do not mention DOA, their authors made design decisions that implicitly follow DOA. The reasons why, how, and the extent to which DOA is adopted in these systems are unclear. Implicit design decisions limit the practitioners' knowledge of DOA to design ML-based systems in the real world. This paper answers these questions by surveying real-world deployments of ML-based systems. The survey shows the design decisions of the systems and the requirements these satisfy. Based on the survey findings, we also formulate practical advice to facilitate the deployment of ML-based systems. Finally, we outline open challenges to deploying DOA-based systems that integrate ML models.Comment: Under revie

    Designing Scalable Mechanisms for Geo-Distributed Platform Services in the Presence of Client Mobility

    Get PDF
    Situation-awareness applications require low-latency response and high network bandwidth, hence benefiting from geo-distributed Edge infrastructures. The developers of these applications typically rely on several platform services, such as Kubernetes, Apache Cassandra and Pulsar, for managing their compute and data components across the geo-distributed Edge infrastructure. Situation-awareness applications impose peculiar requirements on the compute and data placement policies of the platform services. Firstly, the processing logic of these applications is closely tied to the physical environment that it is interacting with. Hence, the access pattern to compute and data exhibits strong spatial affinity. Secondly, the network topology of Edge infrastructure is heterogeneous, wherein communication latency forms a significant portion of the end-to-end compute and data access latency. Therefore, the placement of compute and data components has to be cognizant of the spatial affinity and latency requirements of the applications. However, clients of situation-awareness applications, such as vehicles and drones, are typically mobile – making the compute and data access pattern dynamic and complicating the management of data and compute components. Constant changes in the network connectivity and spatial locality of clients due to client mobility results in making the current placement of compute and data components unsuitable for meeting the latency and spatial affinity requirements of the application. Constant client mobility necessitates that client location and latency offered by the platform services be continuously monitored to detect when application requirements are violated and to adapt the compute and data placement. The control and monitoring modules of off-the-shelf platform services do not have the necessary primitives to incorporate spatial affinity and network topology awareness into their compute and data placement policies. The spatial location of clients is not considered as an input for decision- making in their control modules. Furthermore, they do not perform fine-grained end-to-end monitoring of observed latency to detect and adapt to performance degradations due to client mobility. This dissertation presents three mechanisms that inform the compute and data placement policies of platform services, so that application requirements can be met. M1: Dynamic Spatial Context Management for system entities – clients and data and compute components – to ensure spatial affinity requirements are satisfied. M2: Network Proximity Estimation to provide topology-awareness to the data and compute placement policies of platform services. M3: End-to-End Latency Monitoring to enable collection, aggregation and analysis of per-application metrics in a geo-distributed manner to provide end-to-end insights into application performance. The thesis of our work is that the aforementioned mechanisms are fundamental building blocks for the compute and data management policies of platform services, and that by incorporating them, platform services can meet application requirements at the Edge. Furthermore, the proposed mechanisms can be implemented in a way that offers high scalability to handle high levels of client activity. We demonstrate by construction the efficacy and scalability of the proposed mechanisms for building dynamic compute and data orchestration policies by incorporating them in the control and monitoring modules of three different platform services. Specifically, we incorporate these mechanisms into a topic-based publish-subscribe system (ePulsar), an application orchestration platform (OneEdge), and a key-value store (FogStore). We conduct extensive performance evaluation of these enhanced platform services to showcase how the new mechanisms aid in dynamically adapting the compute/data orchestration decisions to satisfy performance requirements of applicationsPh.D

    Models, methods, and tools for developing MMOG backends on commodity clouds

    Get PDF
    Online multiplayer games have grown to unprecedented scales, attracting millions of players worldwide. The revenue from this industry has already eclipsed well-established entertainment industries like music and films and is expected to continue its rapid growth in the future. Massively Multiplayer Online Games (MMOGs) have also been extensively used in research studies and education, further motivating the need to improve their development process. The development of resource-intensive, distributed, real-time applications like MMOG backends involves a variety of challenges. Past research has primarily focused on the development and deployment of MMOG backends on dedicated infrastructures such as on-premise data centers and private clouds, which provide more flexibility but are expensive and hard to set up and maintain. A limited set of works has also focused on utilizing the Infrastructure-as-a-Service (IaaS) layer of public clouds to deploy MMOG backends. These clouds can offer various advantages like a lower barrier to entry, a larger set of resources, etc. but lack resource elasticity, standardization, and focus on development effort, from which MMOG backends can greatly benefit. Meanwhile, other research has also focused on solving various problems related to consistency, performance, and scalability. Despite major advancements in these areas, there is no standardized development methodology to facilitate these features and assimilate the development of MMOG backends on commodity clouds. This thesis is motivated by the results of a systematic mapping study that identifies a gap in research, evident from the fact that only a handful of studies have explored the possibility of utilizing serverless environments within commodity clouds to host these types of backends. These studies are mostly vision papers and do not provide any novel contributions in terms of methods of development or detailed analyses of how such systems could be developed. Using the knowledge gathered from this mapping study, several hypotheses are proposed and a set of technical challenges is identified, guiding the development of a new methodology. The peculiarities of MMOG backends have so far constrained their development and deployment on commodity clouds despite rapid advancements in technology. To explore whether such environments are viable options, a feasibility study is conducted with a minimalistic MMOG prototype to evaluate a limited set of public clouds in terms of hosting MMOG backends. Foli lowing encouraging results from this study, this thesis first motivates toward and then presents a set of models, methods, and tools with which scalable MMOG backends can be developed for and deployed on commodity clouds. These are encapsulated into a software development framework called Athlos which allows software engineers to leverage the proposed development methodology to rapidly create MMOG backend prototypes that utilize the resources of these clouds to attain scalable states and runtimes. The proposed approach is based on a dynamic model which aims to abstract the data requirements and relationships of many types of MMOGs. Based on this model, several methods are outlined that aim to solve various problems and challenges related to the development of MMOG backends, mainly in terms of performance and scalability. Using a modular software architecture, and standardization in common development areas, the proposed framework aims to improve and expedite the development process leading to higher-quality MMOG backends and a lower time to market. The models and methods proposed in this approach can be utilized through various tools during the development lifecycle. The proposed development framework is evaluated qualitatively and quantitatively. The thesis presents three case study MMOG backend prototypes that validate the suitability of the proposed approach. These case studies also provide a proof of concept and are subsequently used to further evaluate the framework. The propositions in this thesis are assessed with respect to the performance, scalability, development effort, and code maintainability of MMOG backends developed using the Athlos framework, using a variety of methods such as small and large-scale simulations and more targeted experimental setups. The results of these experiments uncover useful information about the behavior of MMOG backends. In addition, they provide evidence that MMOG backends developed using the proposed methodology and hosted on serverless environments can: (a) support a very high number of simultaneous players under a given latency threshold, (b) elastically scale both in terms of processing power and memory capacity and (c) significantly reduce the amount of development effort. The results also show that this methodology can accelerate the development of high-performance, distributed, real-time applications like MMOG backends, while also exposing the limitations of Athlos in terms of code maintainability. Finally, the thesis provides a reflection on the research objectives, considerations on the hypotheses and technical challenges, and outlines plans for future work in this domain

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Network and Content Intelligence for 360 Degree Video Streaming Optimization

    Get PDF
    In recent years, 360° videos, a.k.a. spherical frames, became popular among users creating an immersive streaming experience. Along with the advances in smart- phones and Head Mounted Devices (HMD) technology, many content providers have facilitated to host and stream 360° videos in both on-demand and live stream- ing modes. Therefore, many different applications have already arisen leveraging these immersive videos, especially to give viewers an impression of presence in a digital environment. For example, with 360° videos, now it is possible to connect people in a remote meeting in an interactive way which essentially increases the productivity of the meeting. Also, creating interactive learning materials using 360° videos for students will help deliver the learning outcomes effectively. However, streaming 360° videos is not an easy task due to several reasons. First, 360° video frames are 4–6 times larger than normal video frames to achieve the same quality as a normal video. Therefore, delivering these videos demands higher bandwidth in the network. Second, processing relatively larger frames requires more computational resources at the end devices, particularly for end user devices with limited resources. This will impact not only the delivery of 360° videos but also many other applications running on shared resources. Third, these videos need to be streamed with very low latency requirements due their interactive nature. Inability to satisfy these requirements can result in poor Quality of Experience (QoE) for the user. For example, insufficient bandwidth incurs frequent rebuffer- ing and poor video quality. Also, inadequate computational capacity can cause faster battery draining and unnecessary heating of the device, causing discomfort to the user. Motion or cyber–sickness to the user will be prevalent if there is an unnecessary delay in streaming. These circumstances will hinder providing im- mersive streaming experiences to the much-needed communities, especially those who do not have enough network resources. To address the above challenges, we believe that enhancements to the three main components in video streaming pipeline, server, network and client, are essential. Starting from network, it is beneficial for network providers to identify 360° video flows as early as possible and understand their behaviour in the network to effec- tively allocate sufficient resources for this video delivery without compromising the quality of other services. Content servers, at one end of this streaming pipeline, re- quire efficient 360° video frame processing mechanisms to support adaptive video streaming mechanisms such as ABR (Adaptive Bit Rate) based streaming, VP aware streaming, a streaming paradigm unique to 360° videos that select only part of the larger video frame that fall within the user-visible region, etc. On the other end, the client can be combined with edge-assisted streaming to deliver 360° video content with reduced latency and higher quality. Following the above optimization strategies, in this thesis, first, we propose a mech- anism named 360NorVic to extract 360° video flows from encrypted video traffic and analyze their traffic characteristics. We propose Machine Learning (ML) mod- els to classify 360° and normal videos under different scenarios such as offline, near real-time, VP-aware streaming and Mobile Network Operator (MNO) level stream- ing. Having extracted 360° video traffic traces both in packet and flow level data at higher accuracy, we analyze and understand the differences between 360° and normal video patterns in the encrypted traffic domain that is beneficial for effec- tive resource optimization for enhancing 360° video delivery. Second, we present a WGAN (Wesserstien Generative Adversarial Network) based data generation mechanism (namely VideoTrain++) to synthesize encrypted network video traffic, taking minimal data. Leveraging synthetic data, we show improved performance in 360° video traffic analysis, especially in ML-based classification in 360NorVic. Thirdly, we propose an effective 360° video frame partitioning mechanism (namely VASTile) at the server side to support VP-aware 360° video streaming with dy- namic tiles (or variable tiles) of different sizes and locations on the frame. VASTile takes a visual attention map on the video frames as the input and applies a com- putational geometric approach to generate a non-overlapping tile configuration to cover the video frames adaptive to the visual attention. We present VASTile as a scalable approach for video frame processing at the servers and a method to re- duce bandwidth consumption in network data transmission. Finally, by applying VASTile to the individual user VP at the client side and utilizing cache storage of Multi Access Edge Computing (MEC) servers, we propose OpCASH, a mech- anism to personalize the 360° video streaming with dynamic tiles with the edge assistance. While proposing an ILP based solution to effectively select cached variable tiles from MEC servers that might not be identical to the requested VP tiles by user, but still effectively cover the same VP region, OpCASH maximize the cache utilization and reduce the number of requests to the content servers in congested core network. With this approach, we demonstrate the gain in latency and bandwidth saving and video quality improvement in personalized 360° video streaming

    Rise of the Planet of Serverless Computing: A Systematic Review

    Get PDF
    Serverless computing is an emerging cloud computing paradigm, being adopted to develop a wide range of software applications. It allows developers to focus on the application logic in the granularity of function, thereby freeing developers from tedious and error-prone infrastructure management. Meanwhile, its unique characteristic poses new challenges to the development and deployment of serverless-based applications. To tackle these challenges, enormous research efforts have been devoted. This paper provides a comprehensive literature review to characterize the current research state of serverless computing. Specifically, this paper covers 164 papers on 17 research directions of serverless computing, including performance optimization, programming framework, application migration, multi-cloud development, testing and debugging, etc. It also derives research trends, focus, and commonly-used platforms for serverless computing, as well as promising research opportunities

    Accelerating orchestration with in-network offloading

    Get PDF
    The demand for low-latency Internet applications has pushed functionality that was originally placed in commodity hardware into the network. Either in the form of binaries for the programmable data plane or virtualised network functions, services are implemented within the network fabric with the aim of improving their performance and placing them close to the end user. Training of machine learning algorithms, aggregation of networking traffic, virtualised radio access components, are just some of the functions that have been deployed within the network. Therefore, as the network fabric becomes the accelerator for various applications, it is imperative that the orchestration of their components is also adapted to the constraints and capabilities of the deployment environment. This work identifies performance limitations of in-network compute use cases for both cloud and edge environments and makes suitable adaptations. Within cloud infrastructure, this thesis proposes a platform that relies on programmable switches to accelerate the performance of data replication. It then proceeds to discuss design adaptations of an orchestrator that will allow in-network data offloading and enable accelerated service deployment. At the edge, the topic of inefficient orchestration of virtualised network functions is explored, mainly with respect to energy usage and resource contention. An orchestrator is adapted to schedule requests by taking into account edge constraints in order to minimise resource contention and accelerate service processing times. With data transfers consuming valuable resources at the edge, an efficient data representation mechanism is implemented to provide statistical insight on the provenance of data at the edge and enable smart query allocation to nodes with relevant data. Taking into account the previous state of the art, the proposed data plane replication method appears to be the most computationally efficient and scalable in-network data replication platform available, with significant improvements in throughput and up to an order of magnitude decrease in latency. The orchestrator of virtual network functions at the edge was shown to reduce event rejections, total processing time, and energy consumption imbalances over the default orchestrator, thus proving more efficient use of the infrastructure. Lastly, computational cost at the edge was further reduced with the use of the proposed query allocation mechanism which minimised redundant engagement of nodes

    Collected Papers (on Neutrosophics, Plithogenics, Hypersoft Set, Hypergraphs, and other topics), Volume X

    Get PDF
    This tenth volume of Collected Papers includes 86 papers in English and Spanish languages comprising 972 pages, written between 2014-2022 by the author alone or in collaboration with the following 105 co-authors (alphabetically ordered) from 26 countries: Abu Sufian, Ali Hassan, Ali Safaa Sadiq, Anirudha Ghosh, Assia Bakali, Atiqe Ur Rahman, Laura Bogdan, Willem K.M. Brauers, Erick González Caballero, Fausto Cavallaro, Gavrilă Calefariu, T. Chalapathi, Victor Christianto, Mihaela Colhon, Sergiu Boris Cononovici, Mamoni Dhar, Irfan Deli, Rebeca Escobar-Jara, Alexandru Gal, N. Gandotra, Sudipta Gayen, Vassilis C. Gerogiannis, Noel Batista Hernández, Hongnian Yu, Hongbo Wang, Mihaiela Iliescu, F. Nirmala Irudayam, Sripati Jha, Darjan Karabašević, T. Katican, Bakhtawar Ali Khan, Hina Khan, Volodymyr Krasnoholovets, R. Kiran Kumar, Manoranjan Kumar Singh, Ranjan Kumar, M. Lathamaheswari, Yasar Mahmood, Nivetha Martin, Adrian Mărgean, Octavian Melinte, Mingcong Deng, Marcel Migdalovici, Monika Moga, Sana Moin, Mohamed Abdel-Basset, Mohamed Elhoseny, Rehab Mohamed, Mohamed Talea, Kalyan Mondal, Muhammad Aslam, Muhammad Aslam Malik, Muhammad Ihsan, Muhammad Naveed Jafar, Muhammad Rayees Ahmad, Muhammad Saeed, Muhammad Saqlain, Muhammad Shabir, Mujahid Abbas, Mumtaz Ali, Radu I. Munteanu, Ghulam Murtaza, Munazza Naz, Tahsin Oner, ‪Gabrijela Popović‬‬‬‬‬, Surapati Pramanik, R. Priya, S.P. Priyadharshini, Midha Qayyum, Quang-Thinh Bui, Shazia Rana, Akbara Rezaei, Jesús Estupiñán Ricardo, Rıdvan Sahin, Saeeda Mirvakili, Said Broumi, A. A. Salama, Flavius Aurelian Sârbu, Ganeshsree Selvachandran, Javid Shabbir, Shio Gai Quek, Son Hoang Le, Florentin Smarandache, Dragiša Stanujkić, S. Sudha, Taha Yasin Ozturk, Zaigham Tahir, The Houw Iong, Ayse Topal, Alptekin Ulutaș, Maikel Yelandi Leyva Vázquez, Rizha Vitania, Luige Vlădăreanu, Victor Vlădăreanu, Ștefan Vlăduțescu, J. Vimala, Dan Valeriu Voinea, Adem Yolcu, Yongfei Feng, Abd El-Nasser H. Zaied, Edmundas Kazimieras Zavadskas.‬
    corecore