
In-Network Redundancy Generation for Opportunistic Speedup of Backup

Author: Lluis Pamies-Juarez
Anwitaman Datta
Frédérique Oggier
Publication venue: 'Elsevier BV'
Publication date: 19/02/2013

Erasure coding is a storage-efficient alternative to replication for achieving reliable data backup in distributed storage systems. During the storage process, traditional erasure codes require a single source node to create and upload all the redundant data to the different storage nodes. However, such a source node may have limited communication and computation capabilities, which constrain the throughput of the storage process. Moreover, the source node and the different storage nodes might not be able to send and receive data simultaneously (e.g., nodes might be busy in a datacenter setting, or simply offline in a peer-to-peer setting), which can further reduce the efficacy of the overall storage process. In this paper we propose an "in-network" redundancy generation process which distributes the data insertion load among the source and storage nodes by allowing the storage nodes to generate new redundant data by exchanging partial information among themselves, improving the throughput of the storage process. The process is carried out asynchronously, utilizing spare bandwidth and computing resources from the storage nodes. The proposed approach leverages the local repairability property of newly proposed erasure codes tailor-made for the needs of distributed storage systems. We analytically show that the performance of this technique relies on an efficient usage of the spare node resources, and we derive a set of scheduling algorithms to maximize that usage. We experimentally show, using availability traces from real peer-to-peer applications as well as Google data center availability and workload traces, that our algorithms can, depending on the environment characteristics, increase the throughput of the storage process significantly (up to 90% in data centers, and 60% in peer-to-peer settings) with respect to the classical naive data insertion approach.
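The idea of storage nodes generating redundancy among themselves can be illustrated with a minimal sketch. It assumes a plain XOR parity over two data blocks, which is far simpler than the locally repairable codes the abstract refers to; the node names and the `xor_blocks` helper are hypothetical and chosen only for illustration.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


# The source uploads only the two data blocks (hypothetical node names).
d1 = b"backup-block-001"
d2 = b"backup-block-002"
nodes = {"node_a": d1, "node_b": d2}

# A third node builds the parity block from what its peers already hold,
# so the source never has to compute or upload it.
nodes["node_c"] = xor_blocks(nodes["node_a"], nodes["node_b"])

# If node_a is lost, its block is rebuilt from the two surviving nodes.
recovered = xor_blocks(nodes["node_b"], nodes["node_c"])
```

In this toy setting the source sends two blocks instead of three; the scheduling question studied in the paper (when nodes have spare bandwidth to exchange blocks) is not modeled here.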

arXiv.org e-Print Archive

Crossref

Multi-Core Parallel Routing

Author: Soran Ahmet
Publication date: 06/09/2017

The recent increase in the amount of data (i.e., big data) has led to higher data volumes being transferred and processed over the network. Also, in recent years, the deployment of multi-core routers has grown rapidly. However, such big data transfers are not leveraging the powerful multi-core routers to the extent possible, particularly in the key function of routing. Our main goal is to find a way to use these cores more effectively and efficiently in routing big data transfers. In this dissertation, we propose a novel approach to parallelize data transfers by leveraging the multi-core CPUs in the routers. Legacy routing protocols, e.g. OSPF for intra-domain routing, send data from source to destination on a single shortest path. We describe an end-to-end method to distribute data optimally on flows by using multiple paths. We generate new virtual topology substrates from the underlying router topology and perform shortest path routing on each substrate. With this framework, even though calculating shortest paths can be done with well-known techniques such as OSPF's Dijkstra implementation, finding optimal substrates so as to maximize the aggregate throughput over multiple end-to-end paths is still an NP-hard problem. We focus our efforts on solving this problem and design heuristics for substrate generation from a given router topology. Our heuristics' interim goal is to generate substrates such that the shortest paths between a source-destination pair on different substrates overlap minimally. Once these substrates are determined, we assign each substrate to a core in the routers and employ a multi-path transport protocol, like MPTCP, to perform end-to-end parallel transfers.
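One way to picture substrate generation is the sketch below. It is an illustrative penalty-based heuristic, not the dissertation's algorithm: each round runs Dijkstra on a working copy of the topology, then multiplies the weights of the links it used so that the next round prefers other links. The function names, the `penalty` parameter, and the four-node topology are assumptions made for this example, and the graph is assumed to be undirected.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra over graph[u][v] = weight; returns a node list or None."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def low_overlap_paths(graph, src, dst, k, penalty=10.0):
    """Return up to k paths, penalizing links used by earlier rounds."""
    work = {u: dict(nbrs) for u, nbrs in graph.items()}  # working copy
    paths = []
    for _ in range(k):
        path = shortest_path(work, src, dst)
        if path is None:
            break
        paths.append(path)
        for u, v in zip(path, path[1:]):
            work[u][v] *= penalty
            work[v][u] *= penalty  # assumes an undirected topology
    return paths


# Hypothetical four-router topology with symmetric link weights.
topology = {
    "s": {"a": 1, "b": 1},
    "a": {"s": 1, "t": 1},
    "b": {"s": 1, "t": 2},
    "t": {"a": 1, "b": 2},
}
paths = low_overlap_paths(topology, "s", "t", k=2)
```

Each reweighted copy of the topology plays the role of one substrate; in the setting described above, each would then be assigned to a router core.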

University of Nevada, Reno ScholarWorks Repository

Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

Author: Kambatla Karthik Shashank
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2016

The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources: web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

Purdue E-Pubs

Improving Data Management and Data Movement Efficiency in Hybrid Storage Systems

Author: GE XIONGZI
Publication date: 01/07/2017

University of Minnesota Ph.D. dissertation. July 2017. Major: Computer Science. Advisor: David Du. 1 computer file (PDF); ix, 116 pages. In the big data era, large volumes of continuously generated data drive the emergence of high-performance, large-capacity storage systems. To reduce the total cost of ownership, storage systems are built in a more composite way with many different types of emerging storage technologies/devices, including Storage Class Memory (SCM), Solid State Drives (SSD), Shingled Magnetic Recording (SMR), Hard Disk Drives (HDD), and even off-premise cloud storage. To make better use of each type of storage, industry has provided multi-tier storage by dynamically placing hot data in the faster tiers and cold data in the slower tiers. Data movement happens between devices on one single device as well as between devices connected via various networks. Toward improving data management and data movement efficiency in such hybrid storage systems, this work makes the following contributions: To bridge the giant semantic gap between applications and modern storage systems, passing a small piece of useful information (I/O access hints) from upper layers to the block storage layer may greatly improve application performance or ease data management in heterogeneous storage systems. We present and develop a generic and flexible framework, called HintStor, to execute and evaluate various I/O access hints on heterogeneous storage systems with minor modifications to the kernel and applications. The design of HintStor contains a new application/user-level interface, a file system plugin and a block storage data manager. With HintStor, storage systems composed of various storage devices can perform pre-devised data placement, space reallocation and data migration policies assisted by the added access hints.
Each storage device/technology has its own unique price-performance tradeoffs and idiosyncrasies with respect to the workload characteristics it prefers to support. To explore the internal access patterns and thus efficiently place data on storage systems with fully connected (i.e., data can move from one device to any other device instead of moving tier by tier) differential pools (each pool consists of storage devices of a particular type), we propose a chunk-level storage-aware workload analyzer framework, abbreviated as ChewAnalyzer. With ChewAnalyzer, the storage manager can adequately distribute and move the data chunks across different storage pools. To reduce the duplicate content transferred between local storage devices and devices in remote data centers, an inline Network Redundancy Elimination (NRE) process with a Content-Defined Chunking (CDC) policy can obtain a higher Redundancy Elimination (RE) ratio but may suffer from a considerably higher computational requirement than fixed-size chunking. We build an inline NRE appliance which incorporates an improved FPGA-based scheme to speed up CDC processing. To efficiently utilize the hardware resources, the whole NRE process is handled by a Virtualized NRE (VNRE) controller. The uniqueness of the VNRE that we developed lies in its ability to exploit the redundancy patterns of different TCP flows and customize the chunking process to achieve a higher RE ratio.
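To make the contrast between content-defined and fixed-size chunking concrete, the following is a minimal software sketch of CDC. It assumes a Gear-style rolling hash with a randomly generated lookup table; this is a stand-in for illustration and not the FPGA-based scheme described in the dissertation. The names `GEAR` and `cdc_chunks` and the default size parameters are hypothetical.

```python
import random

# Fixed-seed table so that chunk boundaries are reproducible across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data: bytes, min_size=64, avg_bits=8, max_size=1024):
    """Split data where the rolling hash's low avg_bits bits are all zero."""
    mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 32-bit hash, so a boundary depends
        # only on recent content, not on the absolute byte offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Pseudo-random payload used only to exercise the function.
payload = bytes(random.Random(1).getrandbits(8) for _ in range(5000))
chunks = cdc_chunks(payload)
```

Because boundaries follow the content, an insertion near the start of a stream tends to disturb only nearby chunks, whereas fixed-size chunking shifts every later boundary. The per-byte hashing is the extra computational cost that the abstract notes.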

University of Minnesota Digital Conservancy

CERN openlab Whitepaper on Future IT Challenges in Scientific Research

Author: Di Meglio Alberto
Gaillard Melissa
Purcell Andrew
Publication date: 01/01/2014

This whitepaper describes the major IT challenges in scientific research at CERN and several other European and international research laboratories and projects. Each challenge is exemplified through a set of concrete use cases drawn from the requirements of large-scale scientific programs. The paper is based on contributions from many researchers and IT experts of the participating laboratories, as well as input from the existing CERN openlab industrial sponsors. The views expressed in this document are those of the individual contributors and do not necessarily reflect the view of their organisations and/or affiliates.

ZENODO

CERN Document Server


Making Data Storage Efficient in the Era of Cloud Computing

Author: Tang Yang
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2020

We entered the era of cloud computing in the last decade, as many paradigm shifts happened in how people write and deploy applications. Despite the advancement of cloud computing, data storage abstractions have not evolved much, causing inefficiencies in performance, cost, and security. This dissertation proposes a novel approach to make data storage efficient in the era of cloud computing by building new storage abstractions and systems that bridge the gap between cloud computing and data storage and simplify development. We build four systems to address four data inefficiencies in cloud computing. The first system, Grandet, solves the data storage inefficiency caused by the paradigm shift from upfront provisioning to a variety of pay-as-you-go cloud services. Grandet is an extensible storage system that significantly reduces storage costs for web applications deployed in the cloud. Under the hood, it supports multiple heterogeneous stores and unifies them by placing each data object at the store deemed most economical. Our results show that Grandet reduces their costs by an average of 42.4%, and it is fast, scalable, and easy to use. The second system, Unic, solves the data inefficiency caused by the paradigm shift from single-tenancy to multi-tenancy. Unic securely deduplicates general computations. It exports a cache service that allows cloud applications running on behalf of mutually distrusting users to memoize and reuse computation results, thereby improving performance. Unic achieves both integrity and secrecy through a novel use of code attestation, and it provides a simple yet expressive API that enables applications to deduplicate their own rich computations. Our results show that Unic is easy to use and speeds up applications by an average of 7.58x with little storage overhead.
The third system, Lambdata, solves the data inefficiency caused by the paradigm shift to serverless computing, where developers only write core business logic, and cloud service providers maintain all the infrastructure. Lambdata is a novel serverless computing system that enables developers to declare a cloud function's data intents, including both data read and data written. Once data intents are made explicit, Lambdata performs a variety of optimizations to improve speed, including caching data locally and scheduling functions based on code and data locality. Our results show that Lambdata achieves an average speedup of 1.51x on the turnaround time of practical workloads and reduces monetary cost by 16.5%. The fourth system, CleanOS, solves the data inefficiency caused by the paradigm shift from desktop computers to smartphones always connected to the cloud. CleanOS is a new Android-based operating system that manages sensitive data rigorously and maintains a clean environment at all times. It identifies and tracks sensitive data, encrypts it with a key, and evicts that key to the cloud when the data is not in active use on the device. Our results show that CleanOS limits sensitive-data exposure drastically while incurring acceptable overheads on mobile networks.

Columbia University Academic Commons