
    A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems

    Deduplication has been widely employed in distributed storage systems to improve space efficiency. Traditional deduplication research ignores the design requirements of shared-nothing distributed storage systems, such as the absence of a central metadata bottleneck, scalability, and storage rebalancing. Further, deduplication introduces transactional changes, which are prone to errors in the event of a system failure, resulting in inconsistencies between data and deduplication metadata. In this paper, we propose a robust, fault-tolerant, and scalable cluster-wide deduplication scheme that can eliminate duplicate copies across the cluster. We design a distributed deduplication metadata shard that guarantees performance scalability while preserving the design constraints of shared-nothing storage systems. Chunks and deduplication metadata are placed cluster-wide based on the content fingerprints of the chunks. To ensure transactional consistency and garbage identification, we employ a flag-based asynchronous consistency mechanism. We implement the proposed deduplication scheme on Ceph. The evaluation shows high disk-space savings with minimal performance degradation, as well as high robustness in the event of sudden server failure.
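
    The fingerprint-based placement this abstract describes can be pictured with a short sketch: hash each chunk, and derive both the chunk's shard and the home of its deduplication metadata from the same fingerprint, so no central index is consulted. This is a minimal Python illustration under assumed simplifications (a fixed shard count, a dictionary standing in for the cluster, a reference count standing in for garbage identification); it is not the paper's Ceph implementation.

    import hashlib

    NUM_SHARDS = 8  # hypothetical cluster size

    def fingerprint(chunk: bytes) -> str:
        # Content fingerprint of a chunk (SHA-256 here).
        return hashlib.sha256(chunk).hexdigest()

    def shard_for(fp: str, num_shards: int = NUM_SHARDS) -> int:
        # Both the chunk and its dedup metadata live on the shard derived
        # from the fingerprint, so lookups need no central metadata server.
        return int(fp, 16) % num_shards

    def dedup_store(store: dict, chunk: bytes):
        fp = fingerprint(chunk)
        shard = shard_for(fp)
        entry = store.setdefault((shard, fp), {"refs": 0, "data": chunk})
        entry["refs"] += 1  # reference count doubles as a garbage marker
        return shard, fp

    store = {}
    for block in (b"hello", b"world", b"hello"):
        shard, fp = dedup_store(store, block)
        print(shard, fp[:8])  # the two b"hello" chunks share one entry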

    Data Auditing and Security in Cloud Computing: Issues, Challenges and Future Directions

    Cloud computing is one of the most significant developments, utilizing progressive computational power and upgrading data distribution and data storage facilities. With cloud information services, it is essential for information to be saved in the cloud and also distributed across numerous customers. A cloud information repository, however, is subject to issues of information integrity, data security, and information access by unauthorized users. Hence, an autonomous reviewing and auditing facility is necessary to guarantee that the information is effectively accommodated and used in the cloud. In this paper, a comprehensive survey of the state-of-the-art techniques in data auditing and security is presented. Challenging problems in information repository auditing and security are discussed. Finally, directions for future research in data auditing and security are outlined.
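
    To make the auditing idea concrete, the toy Python sketch below shows one common building block from this literature: the data owner tags each block with a keyed hash before outsourcing, and an auditor later spot-checks randomly sampled blocks against those tags. It is a hypothetical illustration, not any specific protocol from the surveyed work; real schemes add homomorphic tags, third-party auditors, and proofs that avoid retrieving the data itself.

    import hashlib, hmac, os, random

    def tag_blocks(key: bytes, blocks: list) -> list:
        # Owner computes a keyed tag per block before outsourcing.
        return [hmac.new(key, b, hashlib.sha256).digest() for b in blocks]

    def audit(key: bytes, stored: list, tags: list, sample: int = 2) -> bool:
        # Auditor spot-checks randomly chosen blocks against their tags.
        for i in random.sample(range(len(stored)), sample):
            expect = hmac.new(key, stored[i], hashlib.sha256).digest()
            if not hmac.compare_digest(expect, tags[i]):
                return False
        return True

    key = os.urandom(32)
    blocks = [b"record-1", b"record-2", b"record-3"]
    tags = tag_blocks(key, blocks)
    print(audit(key, blocks, tags))     # True: data intact
    blocks[1] = b"tampered"
    print(audit(key, blocks, tags, 3))  # False: corruption detected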

    CoAP Infrastructure for IoT

    The Internet of Things (IoT) can be seen as a large-scale network of billions of smart devices. IoT devices often exchange data in small but numerous messages, which requires IoT services to be more scalable and reliable than ever. Traditional protocols known from the Web world do not fit well in the constrained environments these devices operate in. Therefore, many lightweight protocols specialized for the IoT have been studied, among which the Constrained Application Protocol (CoAP) stands out for its well-known REST paradigm and easy integration with the existing Web. At the same time, new paradigms such as Fog Computing are emerging, attempting to avoid the centralized bottleneck in IoT services by moving computation to the edge of the network. Since a Fog node essentially belongs to a relatively constrained environment, CoAP fits it well. Among the many attempts at building scalable and reliable systems, Erlang, a typical concurrency-oriented programming (COP) language, has been battle-tested in the telecom industry, which has requirements similar to those of the IoT. To explore the possibility of applying Erlang, and COP in general, to the IoT, this thesis presents ecoap, an Erlang-based CoAP server/client prototype with a flexible concurrency model that can scale up to an unconstrained environment like the Cloud and scale down to a constrained environment like an embedded platform. The flexibility of the presented server renders the same architecture applicable from Fog to Cloud. To evaluate its performance, the proposed server is compared with a mainstream CoAP implementation on an Amazon Web Services (AWS) Cloud instance and on a Raspberry Pi 3, representing the unconstrained and constrained environments respectively. The ecoap server achieves comparable throughput, lower latency, and in general scales better than the other implementation both in the Cloud and on the Raspberry Pi. The thesis yields positive results and demonstrates the value of the Erlang philosophy in the IoT space.
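
    For readers unfamiliar with CoAP's wire format, the sketch below encodes a minimal confirmable GET per RFC 7252 and sends it over UDP. The thesis's ecoap server is written in Erlang, so this Python fragment is only an assumed stand-in for a simple client, and the public test server coap.me is used purely for illustration; the encoding also assumes short paths (option deltas and segment lengths below 13, which avoids CoAP's extended option encoding).

    import socket

    def coap_get(path: str, msg_id: int = 0x1234) -> bytes:
        # 4-byte header: Ver=1, Type=0 (CON), TKL=0, Code 0.01 (GET), ID.
        msg = bytes([0x40, 0x01, msg_id >> 8, msg_id & 0xFF])
        prev = 0
        for seg in path.strip("/").split("/"):
            raw = seg.encode()
            delta = 11 - prev          # Uri-Path is option number 11
            msg += bytes([(delta << 4) | len(raw)]) + raw
            prev = 11                  # later segments repeat option 11
        return msg

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(2.0)  # recvfrom raises socket.timeout if no reply
    sock.sendto(coap_get("test"), ("coap.me", 5683))
    reply, _ = sock.recvfrom(1500)
    print(hex(reply[1]))  # 0x45 means response code 2.05 Content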

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
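
    The computation-sharing idea for algebraic aggregation functions can be shown in a few lines: when the function is invertible, like sum, an evicted tuple is subtracted from the running aggregate instead of re-aggregating the whole window. This is a deliberately simplified Python sketch of one count-based sliding window, not the thesis's engine, which generalizes the idea to families of functions and to parallel execution.

    from collections import deque

    class SlidingSum:
        # Incremental sliding-window sum: O(1) work per tuple because
        # + is invertible, so eviction subtracts rather than recomputes.
        def __init__(self, size: int):
            self.size = size
            self.buf = deque()
            self.total = 0

        def insert(self, value: float) -> float:
            self.buf.append(value)
            self.total += value
            if len(self.buf) > self.size:
                self.total -= self.buf.popleft()  # evict oldest tuple
            return self.total

    w = SlidingSum(size=3)
    for v in [1, 2, 3, 4, 5]:
        print(w.insert(v))  # 1, 3, 6, 9, 12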

    Hyperscale Data Processing With Network-Centric Designs

    Today’s largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore’s Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. Enabling and expanding with these workloads are highly scalable data center networks that connect up to hundreds of thousands of networked servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable. Rather than optimize these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network-awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications and codesign data processing systems to achieve optimal query processing performance. To transform the network to future designs, we facilitate network innovation at scale. This dissertation presents a line of systems work that covers all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at data center scale with machine learning approximation. These systems demonstrate that our ideas in network-centric designs achieve orders of magnitude higher efficiency compared to the state of the art at hyperscale.
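
    The payoff of compute pushdown, the idea behind TELEPORT as described here, is easiest to see in the data-movement arithmetic: evaluating a filter where the data resides means only the result crosses the network. The Python sketch below simulates this with an in-process function; the name remote_filter and the byte accounting are illustrative assumptions, not TELEPORT's actual interface.

    import json

    def remote_filter(rows, predicate):
        # Runs where the data lives (the disaggregated memory node),
        # so only the matching rows cross the network.
        return [r for r in rows if predicate(r)]

    rows = [{"id": i, "temp": 20 + i % 10} for i in range(10_000)]
    hot = remote_filter(rows, lambda r: r["temp"] > 27)

    moved_naive = len(json.dumps(rows))  # shipping every row to compute
    moved_push = len(json.dumps(hot))    # shipping only the result
    print(f"without pushdown: {moved_naive} B, with: {moved_push} B")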

    Linear Scalability of Distributed Applications

    The explosion of social applications such as Facebook, LinkedIn and Twitter, of electronic commerce with companies like Amazon.com and Ebay.com, and of Internet search has created the need for new technologies and appropriate systems to effectively manage a considerable amount of data and users. These applications must run continuously every day of the year and must be capable of surviving sudden and abrupt load increases as well as all kinds of software, hardware, human and organizational failures. Increasing (or decreasing) the allocated resources of a distributed application in an elastic and scalable manner, while satisfying requirements on availability and performance in a cost-effective way, is essential for commercial viability, but it poses great challenges in today's infrastructures. Indeed, Cloud Computing can provide resources on demand: it is now easy to start dozens of servers in parallel (computational resources) or to store a huge amount of data (storage resources), even for a very limited period, paying only for the resources consumed. However, these complex infrastructures consisting of heterogeneous and low-cost resources are failure-prone. Also, although cloud resources are deemed to be virtually unlimited, only adequate resource management and demand multiplexing can meet customer requirements and avoid performance deterioration. In this thesis, we deal with adaptive management of cloud resources under specific application requirements. First, in the intra-cloud environment, we address the problem of cloud storage resource management with availability guarantees and find the optimal resource allocation in a decentralized way by means of a virtual economy. Data replicas migrate, replicate or delete themselves according to their economic fitness. Our approach responds effectively to sudden load increases or failures and makes the best use of the geographical distance between nodes to improve application-specific data availability. We then propose a decentralized approach for adaptive management of computational resources for applications requiring high availability and performance guarantees under load spikes, sudden failures or cloud resource updates. Our approach involves a virtual economy among service components (similar to the one among data replicas) and an innovative cascading scheme for setting up the performance goals of individual components so as to meet the overall application requirements. Our approach manages to meet application requirements with the minimum resources, by allocating new ones or releasing redundant ones. Finally, as cloud storage vendors offer online services at different rates, which can vary widely due to second-degree price discrimination, we present an inter-cloud storage resource allocation method to aggregate resources from different storage vendors and provide the user with a system that guarantees the best rate for hosting and serving their data, while satisfying the user requirements on availability, durability, latency, etc. Our system continuously optimizes the placement of data according to its type and usage pattern, and minimizes migration costs from one provider to another, thereby avoiding vendor lock-in.
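
    The virtual-economy mechanism sketched in this abstract boils down to a per-replica decision rule: compare the income a replica earns from serving queries with the rent of the node hosting it, then replicate, stay, or delete accordingly. The Python fragment below is a hedged caricature of that rule; the thresholds and the income and rent signals are hypothetical, and the thesis additionally handles migration and cascading performance goals.

    def fitness(income: float, rent: float) -> float:
        # Economic fitness: query income earned minus storage rent paid.
        return income - rent

    def decide(income: float, rent: float,
               replicate_at: float = 5.0, delete_at: float = -1.0) -> str:
        f = fitness(income, rent)
        if f > replicate_at:
            return "replicate"  # popular data pays for extra copies
        if f < delete_at:
            return "delete"     # unprofitable replicas free resources
        return "stay"           # or migrate toward cheaper nodes

    for income, rent in [(9.0, 2.0), (1.0, 3.0), (2.5, 2.0)]:
        print(decide(income, rent))  # replicate, delete, stay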

    Data management and Data Pipelines: An empirical investigation in the embedded systems domain

    Context: Companies are increasingly collecting data from all possible sources to extract insights that help in data-driven decision-making. Increased data volume, variety, and velocity, together with the impact of poor-quality data on the development of data products, are leading companies to look for an improved data management approach that can accelerate the development of high-quality data products. Further, AI is being applied in a growing number of fields, and thus it is evolving into a horizontal technology. Consequently, AI components are increasingly being integrated into embedded systems along with electronics and software. We refer to these systems as AI-enhanced embedded systems. Given the strong dependence of AI on data, this expansion also creates a new space for applying data management techniques. Objective: The overall goal of this thesis is to empirically identify the data management challenges encountered during the development and maintenance of AI-enhanced embedded systems, propose an improved data management approach, and empirically validate the proposed approach. Method: To achieve this goal, we conducted the research in close collaboration with Software Center companies, using a combination of different empirical research methods: case studies, literature reviews, and action research. Results and conclusions: This research provides five main results. First, it identifies the key data management challenges specific to Deep Learning models developed at embedded system companies. Second, it examines practices such as DataOps and data pipelines that help to address data management challenges. We observed that DataOps is the best data management practice for improving data quality and reducing the time to develop data products. The data pipeline is the critical component of DataOps that manages the data life cycle activities. The study also identifies the potential faults at each step of the data pipeline and the corresponding mitigation strategies. Finally, the data pipeline model was realized in a small data pipeline implementation, and the percentage of saved data dumps was calculated through that implementation. Future work: As future work, we plan to realize the conceptual data pipeline model so that companies can build customized robust data pipelines. We also plan to analyze the impact and value of data pipelines in cross-domain AI systems and data applications, and to develop an AI-based fault detection and mitigation system suitable for data pipelines.
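
    The fault-and-mitigation view of a pipeline step described above can be expressed as a small wrapper: each stage transforms records, validates the output, and routes failures to a mitigation action instead of silently dropping them. This Python sketch is an assumed illustration of the pattern, not the thesis's conceptual model; the step, check, and mitigate names are invented for the example.

    def step(name, transform, check, mitigate):
        # One pipeline stage: transform, validate, mitigate bad records.
        def run(records):
            out, quarantined = [], []
            for r in records:
                v = transform(r)
                (out if check(v) else quarantined).append(v)
            if quarantined:
                mitigate(name, quarantined)  # e.g. alert and quarantine
            return out
        return run

    ingest = step(
        "ingest",
        transform=lambda r: {**r, "temp_c": (r["temp_f"] - 32) * 5 / 9},
        check=lambda r: -40 <= r["temp_c"] <= 125,
        mitigate=lambda s, bad: print(f"{s}: {len(bad)} quarantined"),
    )
    print(ingest([{"temp_f": 70}, {"temp_f": 9999}]))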