TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services to unforeseen scales has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions.
To be recoverable, applications must take additional measures during their failure-free execution to maintain a recoverable state of data and computation logic. However, these precautionary measures have
severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their distinct architectural attributes, which differ significantly from those of traditional memory devices, introduce new semantic challenges for
implementing recovery, complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor's load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data rarely persists in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model, Release Persistency (RP), that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism,
lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal performance overhead.
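The ordering discipline that a release-persistency model imposes can be illustrated with a toy Python model (all names and the buffering behaviour here are illustrative assumptions, not the thesis's actual microarchitecture): plain stores may drain to persistent memory lazily and in any order, but a release store must not persist before every store that precedes it.

```python
import random

class LazyReleasePersistence:
    """Toy model of a lazy release-persistence discipline: plain stores sit
    in a volatile buffer and may drain to simulated NVM in any order; a
    release store first drains everything buffered before it, then persists
    itself.  This is a sketch, not the LRP microarchitecture itself."""

    def __init__(self):
        self.buffer = []   # volatile pending stores: (addr, value)
        self.nvm = {}      # simulated persistent memory

    def store(self, addr, value):
        self.buffer.append((addr, value))

    def store_release(self, addr, value):
        # Enforce the ordering guarantee: drain all earlier stores first.
        random.shuffle(self.buffer)      # earlier stores persist in any order
        for a, v in self.buffer:
            self.nvm[a] = v
        self.buffer.clear()
        self.nvm[addr] = value           # the release itself persists last

    def crash(self):
        # On failure only NVM contents survive; the volatile buffer is lost.
        self.buffer.clear()
        return dict(self.nvm)

# A log-free insert: persist a node's payload, then release-store the link,
# so a crash can never leave a published link pointing at unpersisted data.
mem = LazyReleasePersistence()
mem.store("node.payload", 42)
mem.store_release("list.head", "node")
state = mem.crash()
assert state.get("list.head") != "node" or state.get("node.payload") == 42
```

The final assertion is the recovery invariant the model guarantees: if the link survived the crash, the payload it points to survived as well.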
We continue our discussion with memory disaggregation, which decouples memory from traditional monolithic servers and offers a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where
compute and memory are independent. However, there is a challenge: disaggregated memory (DM) cannot run RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level
ordering concerns to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in DKVSes. Our experimental implementation demonstrates that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
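The low-level ordering that one-sided transactions expose can be sketched with a toy commit path (plain Python; the lock/install/release scheme and all names are illustrative assumptions, not Pandora's actual protocol): because the memory node has no CPU to run an RPC handler, every step is an explicit one-sided operation, and swapping any two steps breaks correctness.

```python
class DisaggregatedMemory:
    """Simulated disaggregated memory exposing only one-sided primitives
    (read, write, compare-and-swap): there is no compute at the memory
    node to execute RPC-style request handlers."""

    def __init__(self):
        self.mem = {}

    def read(self, addr):
        return self.mem.get(addr)

    def write(self, addr, value):
        self.mem[addr] = value

    def cas(self, addr, expect, new):
        # One-sided atomic compare-and-swap, as RDMA NICs provide.
        if self.mem.get(addr) == expect:
            self.mem[addr] = new
            return True
        return False

def one_sided_commit(dm, key, new_value, txn_id):
    """Hedged sketch of a one-sided update: acquire ownership via CAS,
    install the value, then release.  The explicit step ordering is
    precisely the low-level concern such protocols expose to designers."""
    if not dm.cas(("lock", key), None, txn_id):   # 1. acquire ownership
        return False                              #    someone else holds it
    dm.write(("val", key), new_value)             # 2. install the update
    dm.write(("lock", key), None)                 # 3. release ownership
    return True

dm = DisaggregatedMemory()
assert one_sided_commit(dm, "k", 7, "txn-1")
```

A transaction that crashes between steps 1 and 3 leaves the lock word set, which is exactly the state a recovery procedure must detect and repair without blocking other clients.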
Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, requiring no intervention from programmers.
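The essence of black-box litmus testing with recovery can be sketched in a few lines (an illustrative harness, not DART itself; the operation/recovery interfaces are assumptions): replay the workload, inject a crash at every prefix, run recovery, and check that each recovered state is among the allowed outcomes.

```python
def run_litmus(ops, recover, allowed):
    """Black-box crash-injection litmus test: for every crash point,
    apply the operation prefix, run recovery on the surviving state,
    and flag any recovered state outside the allowed set.
    `ops` is a list of (name, apply_fn) pairs mutating a dict state."""
    failures = []
    for crash_at in range(len(ops) + 1):
        state = {}
        for name, apply_fn in ops[:crash_at]:   # crash after `crash_at` ops
            apply_fn(state)
        recovered = recover(state)
        if recovered not in allowed:
            failures.append((crash_at, recovered))
    return failures

# Toy transaction: write the value, then set the commit flag, in that order.
ops = [
    ("write",  lambda s: s.__setitem__("x", 1)),
    ("commit", lambda s: s.__setitem__("committed", True)),
]
# Recovery keeps x only if the commit flag survived; otherwise rolls back.
recover = lambda s: {"x": s.get("x", 0)} if s.get("committed") else {"x": 0}
allowed = [{"x": 0}, {"x": 1}]   # all-or-nothing outcomes
assert run_litmus(ops, recover, allowed) == []
```

The harness is black-box in the same spirit as the framework described above: it only drives operations and inspects recovered states, never the protocol's internals.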
Real-world Machine Learning Systems: A survey from a Data-Oriented Architecture Perspective
Machine Learning models are being deployed as parts of real-world systems
with the upsurge of interest in artificial intelligence. The design,
implementation, and maintenance of such systems are challenged by real-world
environments that produce larger amounts of heterogeneous data and users
requiring increasingly faster responses with efficient resource consumption.
These requirements push prevalent software architectures to the limit when
deploying ML-based systems. Data-oriented Architecture (DOA) is an emerging
concept that equips systems better for integrating ML models. DOA extends
current architectures to create data-driven, loosely coupled, decentralised,
open systems. Even though papers on deployed ML-based systems do not mention
DOA, their authors made design decisions that implicitly follow DOA. The
reasons why, how, and the extent to which DOA is adopted in these systems are
unclear. Implicit design decisions limit practitioners' knowledge of DOA for
designing ML-based systems in the real world. This paper answers these questions
by surveying real-world deployments of ML-based systems. The survey shows the
design decisions of the systems and the requirements these satisfy. Based on
the survey findings, we also formulate practical advice to facilitate the
deployment of ML-based systems. Finally, we outline open challenges to
deploying DOA-based systems that integrate ML models.
Designing Scalable Mechanisms for Geo-Distributed Platform Services in the Presence of Client Mobility
Situation-awareness applications require low-latency responses and high network bandwidth, and hence benefit from geo-distributed Edge infrastructures. The developers of these applications typically rely on several platform services, such as Kubernetes, Apache Cassandra, and Pulsar, to manage their compute and data components across the geo-distributed Edge infrastructure.
Situation-awareness applications impose peculiar requirements on the compute and data placement policies of these platform services. First, the processing logic of these applications is closely tied to the physical environment it interacts with, so the access pattern to compute and data exhibits strong spatial affinity. Second, the network topology of Edge infrastructure is heterogeneous, and communication latency forms a significant portion of the end-to-end compute and data access latency. The placement of compute and data components therefore has to be cognizant of the spatial affinity and latency requirements of the applications.
However, clients of situation-awareness applications, such as vehicles and drones, are typically mobile, making the compute and data access pattern dynamic and complicating the management of data and compute components. Constant changes in network connectivity and in the spatial locality of clients render the current placement of compute and data components unsuitable for meeting the latency and spatial affinity requirements of the application. Client mobility thus necessitates continuous monitoring of client location and of the latency offered by the platform services, so that violations of application requirements are detected and the compute and data placement is adapted.
The control and monitoring modules of off-the-shelf platform services lack the necessary primitives to incorporate spatial affinity and network topology awareness into their compute and data placement policies. The spatial location of clients is not considered as an input for decision-making in their control modules. Furthermore, they do not perform fine-grained end-to-end monitoring of observed latency to detect and adapt to performance degradations caused by client mobility.
This dissertation presents three mechanisms that inform the compute and data placement policies of platform services, so that application requirements can be met.
M1: Dynamic Spatial Context Management for system entities – clients and data and compute components – to ensure spatial affinity requirements are satisfied.
M2: Network Proximity Estimation to provide topology-awareness to the data and compute placement policies of platform services.
M3: End-to-End Latency Monitoring to enable collection, aggregation and analysis of per-application metrics in a geo-distributed manner to provide end-to-end insights into application performance.
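A minimal sketch of how mechanisms like M2 and M3 could feed a placement decision (all names, the scoring rule, and the data shapes are illustrative assumptions, not the dissertation's actual design): combine a spatial-proximity estimate with monitored latency samples and pick the best edge node.

```python
from math import dist

def place_component(client_pos, nodes, latency_samples):
    """Toy topology- and latency-aware placement: choose the edge node
    minimising a blended score of spatial proximity (a stand-in for
    network proximity estimation) and monitored end-to-end latency.
    `nodes` maps node name -> (x, y); `latency_samples` maps node
    name -> list of observed latencies in ms."""
    def score(node):
        spatial = dist(client_pos, nodes[node])          # proximity prior
        samples = latency_samples.get(node, [])
        observed = sum(samples) / len(samples) if samples else float("inf")
        return 0.5 * spatial + 0.5 * observed            # simple blend

    return min(nodes, key=score)

nodes = {"edge-a": (0, 0), "edge-b": (10, 0)}
samples = {"edge-a": [5.0, 6.0], "edge-b": [50.0]}
chosen = place_component((1, 0), nodes, samples)   # nearby, fast node wins
```

Re-evaluating this score as the client's position and latency samples change is, in miniature, the monitor-and-adapt loop the mechanisms above enable.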
The thesis of our work is that the aforementioned mechanisms are fundamental building blocks for the compute and data management policies of platform services, and that by incorporating them, platform services can meet application requirements at the Edge. Furthermore, the proposed mechanisms can be implemented in a way that offers high scalability to handle high levels of client activity. We demonstrate by construction the efficacy and scalability of the proposed mechanisms for building dynamic compute and data orchestration policies by incorporating them into the control and monitoring modules of three different platform services. Specifically, we incorporate these mechanisms into a topic-based publish-subscribe system (ePulsar), an application orchestration platform (OneEdge), and a key-value store (FogStore). We conduct extensive performance evaluations of these enhanced platform services to showcase how the new mechanisms aid in dynamically adapting compute/data orchestration decisions to satisfy the performance requirements of applications.
Models, methods, and tools for developing MMOG backends on commodity clouds
Online multiplayer games have grown to unprecedented scales, attracting millions of players
worldwide. The revenue from this industry has already eclipsed well-established entertainment
industries like music and films and is expected to continue its rapid growth in the future.
Massively Multiplayer Online Games (MMOGs) have also been extensively used in research
studies and education, further motivating the need to improve their development process.
The development of resource-intensive, distributed, real-time applications like MMOG backends
involves a variety of challenges. Past research has primarily focused on the development and
deployment of MMOG backends on dedicated infrastructures such as on-premise data centers
and private clouds, which provide more flexibility but are expensive and hard to set up and
maintain. A limited set of works has also focused on utilizing the Infrastructure-as-a-Service
(IaaS) layer of public clouds to deploy MMOG backends. These clouds can offer various advantages,
such as a lower barrier to entry and a larger pool of resources, but lack the resource elasticity,
standardization, and reduced development effort from which MMOG backends could greatly
benefit.
Meanwhile, other research has also focused on solving various problems related to consistency,
performance, and scalability. Despite major advancements in these areas, there is no standardized
development methodology that consolidates these features and streamlines the development of
MMOG backends on commodity clouds. This thesis is motivated by the results of a systematic
mapping study that identifies a gap in research, evident from the fact that only a handful
of studies have explored the possibility of utilizing serverless environments within commodity
clouds to host these types of backends. These studies are mostly vision papers and do
not provide any novel contributions in terms of methods of development or detailed analyses
of how such systems could be developed. Using the knowledge gathered from this mapping
study, several hypotheses are proposed and a set of technical challenges is identified, guiding
the development of a new methodology.
The peculiarities of MMOG backends have so far constrained their development and deployment
on commodity clouds despite rapid advancements in technology. To explore whether such
environments are viable options, a feasibility study is conducted with a minimalistic MMOG
prototype to evaluate a limited set of public clouds in terms of hosting MMOG backends. Following
encouraging results from this study, this thesis first motivates and then presents
a set of models, methods, and tools with which scalable MMOG backends can be developed
for and deployed on commodity clouds. These are encapsulated into a software development
framework called Athlos which allows software engineers to leverage the proposed development
methodology to rapidly create MMOG backend prototypes that utilize the resources of
these clouds to attain scalable states and runtimes. The proposed approach is based on a dynamic
model which aims to abstract the data requirements and relationships of many types of
MMOGs. Based on this model, several methods are outlined that aim to solve various problems
and challenges related to the development of MMOG backends, mainly in terms of performance
and scalability. Using a modular software architecture, and standardization in common development
areas, the proposed framework aims to improve and expedite the development process
leading to higher-quality MMOG backends and a lower time to market. The models and methods
proposed in this approach can be utilized through various tools during the development
lifecycle.
The proposed development framework is evaluated qualitatively and quantitatively. The thesis
presents three case study MMOG backend prototypes that validate the suitability of the proposed
approach. These case studies also provide a proof of concept and are subsequently used
to further evaluate the framework. The propositions in this thesis are assessed with respect to
the performance, scalability, development effort, and code maintainability of MMOG backends
developed using the Athlos framework, using a variety of methods such as small and large-scale
simulations and more targeted experimental setups. The results of these experiments uncover
useful information about the behavior of MMOG backends. In addition, they provide evidence
that MMOG backends developed using the proposed methodology and hosted on serverless
environments can: (a) support a very high number of simultaneous players under a given latency
threshold, (b) elastically scale both in terms of processing power and memory capacity
and (c) significantly reduce the amount of development effort. The results also show that this
methodology can accelerate the development of high-performance, distributed, real-time applications
like MMOG backends, while also exposing the limitations of Athlos in terms of code
maintainability.
Finally, the thesis reflects on the research objectives, considers the hypotheses
and technical challenges, and outlines plans for future work in this domain.
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters.
Network and Content Intelligence for 360 Degree Video Streaming Optimization
In recent years, 360° videos, a.k.a. spherical frames, have become popular among users, creating an immersive streaming experience. Along with the advances in smartphone and Head Mounted Device (HMD) technology, many content providers have begun to host and stream 360° videos in both on-demand and live streaming modes. Many different applications have therefore arisen around these immersive videos, especially ones that give viewers an impression of presence in a digital environment. For example, with 360° videos it is now possible to connect people in a remote meeting in an interactive way, which essentially increases the productivity of the meeting. Likewise, creating interactive learning materials with 360° videos helps deliver learning outcomes effectively.
However, streaming 360° videos is not an easy task, for several reasons. First, 360° video frames are 4-6 times larger than normal video frames at the same perceived quality, so delivering these videos demands higher network bandwidth. Second, processing these relatively larger frames requires more computational resources at the end devices, particularly end-user devices with limited resources. This impacts not only the delivery of 360° videos but also many other applications running on shared resources. Third, these videos need to be streamed with very low latency due to their interactive nature. Inability to satisfy these requirements can result in poor Quality of Experience (QoE) for the user. For example, insufficient bandwidth incurs frequent rebuffering and poor video quality, while inadequate computational capacity can cause faster battery draining and unnecessary heating of the device, causing discomfort to the user. Motion sickness or cybersickness will be prevalent if there is unnecessary delay in streaming. These circumstances hinder providing immersive streaming experiences to the communities that need them most, especially those that do not have enough network resources.
To address the above challenges, we believe that enhancements to the three main components of the video streaming pipeline, namely the server, the network, and the client, are essential. Starting with the network, it is beneficial for network providers to identify 360° video flows as early as possible and understand their behaviour in the network, so as to effectively allocate sufficient resources for video delivery without compromising the quality of other services. Content servers, at one end of this streaming pipeline, require efficient 360° video frame processing mechanisms to support adaptive streaming techniques such as ABR (Adaptive Bit Rate) streaming and viewport-aware (VP-aware) streaming, a paradigm unique to 360° videos that transmits only the parts of the larger video frame that fall within the user-visible region. At the other end, the client can be combined with edge-assisted streaming to deliver 360° video content with reduced latency and higher quality.
Following the above optimization strategies, in this thesis we first propose a mechanism named 360NorVic to extract 360° video flows from encrypted video traffic and analyze their traffic characteristics. We propose Machine Learning (ML) models to classify 360° and normal videos under different scenarios such as offline, near real-time, VP-aware streaming, and Mobile Network Operator (MNO) level streaming. Having extracted 360° video traffic traces at both packet and flow level with high accuracy, we analyze and characterize the differences between 360° and normal video patterns in the encrypted traffic domain, which is beneficial for effective resource optimization in 360° video delivery. Second, we present a WGAN (Wasserstein Generative Adversarial Network) based data generation mechanism (namely VideoTrain++) to synthesize encrypted network video traffic from minimal data. Leveraging synthetic data, we show improved performance in 360° video traffic analysis, especially in ML-based classification in 360NorVic.
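The flow-classification idea can be sketched as follows (an illustrative toy, not 360NorVic's actual feature set or model): derive flow-level features from packet sizes and timestamps, then apply a decision rule; here a single threshold stands in for a trained ML classifier.

```python
def flow_features(packet_sizes, timestamps):
    """Illustrative flow-level features of the kind an encrypted-traffic
    classifier might use: mean packet size and mean inter-arrival gap.
    Encryption hides payloads, but these statistics remain observable."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "mean_size": sum(packet_sizes) / len(packet_sizes),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }

def classify_360(features, size_threshold=1200.0):
    # Toy rule: 360-degree flows carry larger frames, hence bigger packets.
    # A real system would use a trained ML model, not a fixed threshold.
    return features["mean_size"] > size_threshold

f = flow_features([1400, 1380, 1420, 1390], [0.00, 0.01, 0.02, 0.03])
assert classify_360(f)
```

The same feature extraction applies at packet and flow granularity, which is what allows classification both offline and in near real time.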
Third, we propose an effective 360° video frame partitioning mechanism (namely VASTile) at the server side to support VP-aware 360° video streaming with dynamic tiles (or variable tiles) of different sizes and locations on the frame. VASTile takes a visual attention map on the video frames as input and applies a computational-geometric approach to generate a non-overlapping tile configuration that covers the video frames adaptively to the visual attention. We present VASTile as a scalable approach to video frame processing at the servers and a method to reduce bandwidth consumption in network data transmission. Finally, by applying VASTile to the individual user VP at the client side and utilizing the cache storage of Multi-Access Edge Computing (MEC) servers, we propose OpCASH, a mechanism to personalize 360° video streaming with dynamic tiles with edge assistance. Proposing an ILP-based solution to effectively select cached variable tiles from MEC servers that may not be identical to the VP tiles requested by the user but still cover the same VP region, OpCASH maximizes cache utilization and reduces the number of requests to the content servers over the congested core network. With this approach, we demonstrate gains in latency, bandwidth savings, and video quality in personalized 360° video streaming.
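The cached-tile selection problem can be sketched with a simplified greedy set cover (the actual mechanism above formulates it as an ILP; greedy cover is an illustrative stand-in, and all names and data shapes are assumptions): cover the viewport's cells with cached tiles, fetching only the uncovered remainder from the origin.

```python
def cover_viewport(viewport, cached_tiles):
    """Greedy sketch of covering a viewport with cached variable tiles.
    `viewport` is a set of frame cells; each cached tile is a set of
    cells.  Returns the chosen tiles and the cells still uncovered,
    which must be fetched from the content server."""
    remaining = set(viewport)
    chosen = []
    while remaining:
        # Pick the cached tile covering the most still-uncovered cells.
        best = max(cached_tiles, key=lambda t: len(remaining & t), default=None)
        if best is None or not (remaining & best):
            break                      # no cached tile helps any further
        chosen.append(best)
        remaining -= best
    return chosen, remaining           # leftovers go to the origin server

viewport = {(0, 0), (0, 1), (1, 0), (1, 1)}
cache = [{(0, 0), (0, 1)}, {(1, 0)}, {(2, 2)}]
tiles, miss = cover_viewport(viewport, cache)
```

An ILP can trade off cache hits against tile overlap and count optimally; the greedy version only conveys why non-identical cached tiles can still serve most of a requested viewport.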
Rise of the Planet of Serverless Computing: A Systematic Review
Serverless computing is an emerging cloud computing paradigm that is being adopted to develop a wide range of software applications.
It allows developers to focus on the application logic at the granularity of functions, thereby freeing developers from tedious and
error-prone infrastructure management. Meanwhile, its unique characteristics pose new challenges to the development and deployment
of serverless-based applications. To tackle these challenges, enormous research efforts have been devoted. This paper provides a
comprehensive literature review to characterize the current research state of serverless computing. Specifically, it covers 164
papers across 17 research directions of serverless computing, including performance optimization, programming frameworks, application
migration, multi-cloud development, testing and debugging, etc. It also derives research trends, focus areas, and commonly used platforms
for serverless computing, as well as promising research opportunities.
Accelerating orchestration with in-network offloading
The demand for low-latency Internet applications has pushed functionality that was originally placed in commodity hardware into the network. Either in the form of binaries for the programmable data plane or as virtualised network functions, services are implemented within the network fabric with the aim of improving their performance and placing them close to the end user. Training of machine learning algorithms, aggregation of network traffic, and virtualised radio access components are just some of the functions that have been deployed within the network. Therefore, as the network fabric becomes the accelerator for various applications, it is imperative that the orchestration of their components is adapted to the constraints and capabilities of the deployment environment.
This work identifies performance limitations of in-network compute use cases for both cloud and edge environments and makes suitable adaptations. Within cloud infrastructure, this thesis proposes a platform that relies on programmable switches to accelerate the performance of data replication. It then proceeds to discuss design adaptations of an orchestrator that will allow in-network data offloading and enable accelerated service deployment. At the edge, the topic of inefficient orchestration of virtualised network functions is explored, mainly with respect to energy usage and resource contention. An orchestrator is adapted to schedule requests by taking into account edge constraints in order to minimise resource contention and accelerate service processing times. With data transfers consuming valuable resources at the edge, an efficient data representation mechanism is implemented to provide statistical insight on the provenance of data at the edge and enable smart query allocation to nodes with relevant data.
Taking into account the previous state of the art, the proposed data plane replication method appears to be the most computationally efficient and scalable in-network data replication platform available, with significant improvements in throughput and up to an order-of-magnitude decrease in latency. The orchestrator of virtual network functions at the edge was shown to reduce event rejections, total processing time, and energy consumption imbalances relative to the default orchestrator, demonstrating more efficient use of the infrastructure. Lastly, computational cost at the edge was further reduced with the proposed query allocation mechanism, which minimised redundant engagement of nodes.
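The thesis does not spell out its data representation mechanism here, but the idea of a compact statistical summary guiding query allocation can be sketched with per-node Bloom filters (an assumed stand-in, not the work's actual design): each edge node advertises a small summary of the data keys it holds, and queries are routed only to nodes whose summaries may match.

```python
import hashlib

class BloomSummary:
    """Compact per-node summary of which data keys a node holds.  A Bloom
    filter is one plausible 'efficient data representation' for provenance
    hints: no false negatives, occasional false positives."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

def allocate_query(key, node_summaries):
    # Route the query only to nodes whose summaries may hold the key,
    # sparing all other nodes from redundant engagement.
    return [n for n, s in node_summaries.items() if s.may_contain(key)]

a, b = BloomSummary(), BloomSummary()
a.add("sensor-17")
b.add("sensor-99")
```

Because the summary admits no false negatives, a node holding relevant data is never skipped; false positives only cost an occasional redundant query, which the full mechanism would aim to keep rare.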
Collected Papers (on Neutrosophics, Plithogenics, Hypersoft Set, Hypergraphs, and other topics), Volume X
This tenth volume of Collected Papers includes 86 papers in English and Spanish languages comprising 972 pages, written between 2014-2022 by the author alone or in collaboration with the following 105 co-authors (alphabetically ordered) from 26 countries: Abu Sufian, Ali Hassan, Ali Safaa Sadiq, Anirudha Ghosh, Assia Bakali, Atiqe Ur Rahman, Laura Bogdan, Willem K.M. Brauers, Erick González Caballero, Fausto Cavallaro, Gavrilă Calefariu, T. Chalapathi, Victor Christianto, Mihaela Colhon, Sergiu Boris Cononovici, Mamoni Dhar, Irfan Deli, Rebeca Escobar-Jara, Alexandru Gal, N. Gandotra, Sudipta Gayen, Vassilis C. Gerogiannis, Noel Batista Hernández, Hongnian Yu, Hongbo Wang, Mihaiela Iliescu, F. Nirmala Irudayam, Sripati Jha, Darjan Karabašević, T. Katican, Bakhtawar Ali Khan, Hina Khan, Volodymyr Krasnoholovets, R. Kiran Kumar, Manoranjan Kumar Singh, Ranjan Kumar, M. Lathamaheswari, Yasar Mahmood, Nivetha Martin, Adrian Mărgean, Octavian Melinte, Mingcong Deng, Marcel Migdalovici, Monika Moga, Sana Moin, Mohamed Abdel-Basset, Mohamed Elhoseny, Rehab Mohamed, Mohamed Talea, Kalyan Mondal, Muhammad Aslam, Muhammad Aslam Malik, Muhammad Ihsan, Muhammad Naveed Jafar, Muhammad Rayees Ahmad, Muhammad Saeed, Muhammad Saqlain, Muhammad Shabir, Mujahid Abbas, Mumtaz Ali, Radu I. Munteanu, Ghulam Murtaza, Munazza Naz, Tahsin Oner, Gabrijela Popović, Surapati Pramanik, R. Priya, S.P. Priyadharshini, Midha Qayyum, Quang-Thinh Bui, Shazia Rana, Akbara Rezaei, Jesús Estupiñán Ricardo, Rıdvan Sahin, Saeeda Mirvakili, Said Broumi, A. A. Salama, Flavius Aurelian Sârbu, Ganeshsree Selvachandran, Javid Shabbir, Shio Gai Quek, Son Hoang Le, Florentin Smarandache, Dragiša Stanujkić, S. Sudha, Taha Yasin Ozturk, Zaigham Tahir, The Houw Iong, Ayse Topal, Alptekin Ulutaș, Maikel Yelandi Leyva Vázquez, Rizha Vitania, Luige Vlădăreanu, Victor Vlădăreanu, Ștefan Vlăduțescu, J. Vimala, Dan Valeriu Voinea, Adem Yolcu, Yongfei Feng, Abd El-Nasser H. Zaied, Edmundas Kazimieras Zavadskas.