In-Network Redundancy Generation for Opportunistic Speedup of Backup
Erasure coding is a storage-efficient alternative to replication for
achieving reliable data backup in distributed storage systems. During the
storage process, traditional erasure codes require a unique source node to
create and upload all the redundant data to the different storage nodes.
However, such a source node may have limited communication and computation
capabilities, which constrain the storage process throughput. Moreover, the
source node and the different storage nodes might not be able to send and
receive data simultaneously -- e.g., nodes might be busy in a datacenter
setting, or simply be offline in a peer-to-peer setting -- which can further
threaten the efficacy of the overall storage process. In this paper we propose
an "in-network" redundancy generation process which distributes the data
insertion load among the source and storage nodes by allowing the storage nodes
to generate new redundant data by exchanging partial information among
themselves, improving the throughput of the storage process. The process is
carried out asynchronously, utilizing spare bandwidth and computing resources
from the storage nodes. The proposed approach leverages the local
repairability property of newly proposed erasure codes tailor-made for the
needs of distributed storage systems. We analytically show that the performance
of this technique relies on efficient usage of the spare node resources, and
we derive a set of scheduling algorithms to maximize that usage. We
experimentally show, using availability traces from real peer-to-peer
applications as well as Google data center availability and workload traces,
that our algorithms can, depending on the environment characteristics, increase
the throughput of the storage process significantly (up to 90% in data centers,
and 60% in peer-to-peer settings) with respect to the classical naive data
insertion approach.
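The load-shifting idea in this abstract can be illustrated with a minimal sketch. It assumes a plain XOR parity over small groups of blocks as a stand-in for the locally repairable codes the abstract refers to; the function names and the grouping are hypothetical and are not taken from the paper. The sketch only counts who uploads what, and ignores scheduling and node availability.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def source_only_insertion(data_blocks, parity_groups):
    """Classical insertion: the source computes and uploads every block."""
    parities = [xor_blocks([data_blocks[i] for i in group]) for group in parity_groups]
    uploaded_by_source = len(data_blocks) + len(parities)
    return list(data_blocks) + parities, uploaded_by_source


def in_network_insertion(data_blocks, parity_groups):
    """Assumed in-network variant: the source uploads only the data blocks,
    and each parity block is built from blocks sent by the storage nodes."""
    stored = list(data_blocks)          # node i holds data block i
    uploaded_by_source = len(data_blocks)
    node_to_node_transfers = 0
    for group in parity_groups:
        received = [stored[i] for i in group]   # peers send their blocks
        node_to_node_transfers += len(group)
        stored.append(xor_blocks(received))     # a new node stores the parity
    return stored, uploaded_by_source, node_to_node_transfers
```

Both functions end up storing the same blocks. The difference is that in the second one the source uploads only the data blocks, while the parity traffic moves onto spare bandwidth between storage nodes. A lost block can be rebuilt by XOR-ing the surviving members of its group, which is the local repairability property in its simplest form.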
Multi-Core Parallel Routing
The recent increase in the amount of data (i.e., big data) has led to higher data volumes being transferred and processed over the network. In recent years, the deployment of multi-core routers has also grown rapidly. However, such big data transfers do not leverage these powerful multi-core routers to the extent possible, particularly in the key function of routing. Our main goal is to use these cores more effectively and efficiently when routing big data transfers. In this dissertation, we propose a novel approach to parallelize data transfers by leveraging the multi-core CPUs in the routers. Legacy routing protocols, e.g., OSPF for intra-domain routing, send data from source to destination on a single shortest path. We describe an end-to-end method to distribute data optimally across flows by using multiple paths. We generate new virtual topology substrates from the underlying router topology and perform shortest path routing on each substrate. With this framework, even though shortest paths can be calculated with well-known techniques such as OSPF's Dijkstra implementation, finding optimal substrates that maximize the aggregate throughput over multiple end-to-end paths is still an NP-hard problem. We focus our efforts on solving this problem and design heuristics for substrate generation from a given router topology. The interim goal of our heuristics is to generate substrates such that the shortest paths between a source-destination pair on the different substrates overlap minimally. Once these substrates are determined, we assign each substrate to a core in the routers and employ a multi-path transport protocol, such as MPTCP, to perform end-to-end parallel transfers.
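One simple way to obtain minimally overlapping shortest paths is a greedy pruning heuristic, sketched below. This is an illustrative assumption and not the dissertation's heuristic: each round runs Dijkstra on the current topology, records the path, and removes the links it used, so the pruned topology of each round plays the role of one substrate. The graph format and function names are hypothetical.

```python
import heapq


def shortest_path(graph, src, dst):
    """Dijkstra on graph = {node: {neighbor: weight}}; returns a node list or None."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def greedy_disjoint_paths(graph, src, dst, max_substrates):
    """Assumed heuristic: one shortest path per substrate, with the links of
    earlier paths removed so that later paths cannot overlap with them."""
    remaining = {u: dict(nbrs) for u, nbrs in graph.items()}
    paths = []
    for _ in range(max_substrates):
        path = shortest_path(remaining, src, dst)
        if path is None:
            break  # no further link-disjoint path exists
        paths.append(path)
        for u, v in zip(path, path[1:]):
            remaining[u].pop(v, None)
            remaining.get(v, {}).pop(u, None)
    return paths
```

Removing links outright yields fully link-disjoint paths but can give up early on sparse topologies. Raising the weight of used links instead of deleting them would be a softer variant of the same idea.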
Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks
The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources - web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.
Improving Data Management and Data Movement Efficiency in Hybrid Storage Systems
University of Minnesota Ph.D. dissertation. July 2017. Major: Computer Science. Advisor: David Du. 1 computer file (PDF); ix, 116 pages. In the big data era, the large volumes of data being continuously generated drive the emergence of high-performance, large-capacity storage systems. To reduce the total cost of ownership, storage systems are built in a more composite way, combining many different types of emerging storage technologies and devices, including Storage Class Memory (SCM), Solid State Drives (SSD), Shingled Magnetic Recording (SMR) drives, Hard Disk Drives (HDD), and even off-premise cloud storage. To make better use of each type of storage, industry has provided multi-tier storage that dynamically places hot data in the faster tiers and cold data in the slower tiers. Data movement happens between devices on a single machine as well as between devices connected via various networks. Toward improving data management and data movement efficiency in such hybrid storage systems, this work makes the following contributions: To bridge the large semantic gap between applications and modern storage systems, passing small but useful pieces of information (I/O access hints) from upper layers to the block storage layer may greatly improve application performance or ease data management in heterogeneous storage systems. We present and develop a generic and flexible framework, called HintStor, to execute and evaluate various I/O access hints on heterogeneous storage systems with minor modifications to the kernel and applications. The design of HintStor contains a new application/user-level interface, a file system plugin and a block storage data manager. With HintStor, storage systems composed of various storage devices can perform pre-devised data placement, space reallocation and data migration policies assisted by the added access hints.
Each storage device/technology has its own unique price-performance tradeoffs and idiosyncrasies with respect to the workload characteristics it prefers to support. To explore the internal access patterns and thus efficiently place data on storage systems with fully connected (i.e., data can move from one device to any other device instead of moving tier by tier) differential pools (each pool consists of storage devices of a particular type), we propose a chunk-level storage-aware workload analyzer framework, abbreviated as ChewAnalyzer. With ChewAnalyzer, the storage manager can adequately distribute and move the data chunks across different storage pools. To reduce the duplicate content transferred between local storage devices and devices in remote data centers, an inline Network Redundancy Elimination (NRE) process with a Content-Defined Chunking (CDC) policy can obtain a higher Redundancy Elimination (RE) ratio but may suffer from a considerably higher computational requirement than fixed-size chunking. We build an inline NRE appliance which incorporates an improved FPGA-based scheme to speed up CDC processing. To efficiently utilize the hardware resources, the whole NRE process is handled by a Virtualized NRE (VNRE) controller. The uniqueness of the VNRE that we developed lies in its ability to exploit the redundancy patterns of different TCP flows and customize the chunking process to achieve a higher RE ratio.
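The contrast between content-defined and fixed-size chunking can be shown in a short software sketch. It assumes a rolling byte sum over a fixed window as a deliberately weak stand-in for the fingerprinting a real CDC engine would use, and it omits minimum and maximum chunk sizes; the names and parameters are hypothetical and unrelated to the FPGA design described above.

```python
def cut_points(data, window=16, divisor=64):
    """Return chunk end offsets chosen from content alone.

    A cut is placed after byte i when the sum of the last `window` bytes
    satisfies sum % divisor == divisor - 1. The rolling sum is an assumed,
    simplistic fingerprint used only for illustration.
    """
    cuts = []
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]   # slide the window forward
        if i + 1 >= window and rolling % divisor == divisor - 1:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))            # the final chunk ends at the data end
    return cuts


def cdc_chunks(data, window=16, divisor=64):
    """Split data at the content-defined cut points."""
    chunks, start = [], 0
    for end in cut_points(data, window, divisor):
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because each cut depends only on the preceding `window` bytes, inserting bytes at the front of a stream shifts the cut offsets without changing which content positions are cut, so later chunks stay identical and can be deduplicated. With fixed-size chunking, the same insertion shifts every chunk boundary relative to the content. The per-byte work in the loop is also where the extra computational cost of CDC comes from.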
CERN openlab Whitepaper on Future IT Challenges in Scientific Research
This whitepaper describes the major IT challenges in scientific research at CERN and several other European and international research laboratories and projects. Each challenge is exemplified through a set of concrete use cases drawn from the requirements of large-scale scientific programs. The paper is based on contributions from many researchers and IT experts of the participating laboratories and also input from the existing CERN openlab industrial sponsors. The views expressed in this document are those of the individual contributors and do not necessarily reflect the view of their organisations and/or affiliates.
Making Data Storage Efficient in the Era of Cloud Computing
We have entered the era of cloud computing in the last decade, with many paradigm shifts happening in how people write and deploy applications. Despite the advancement of cloud computing, data storage abstractions have not evolved much, causing inefficiencies in performance, cost, and security.
This dissertation proposes a novel approach to make data storage efficient in the era of cloud computing by building new storage abstractions and systems that bridge the gap between cloud computing and data storage and simplify development. We build four systems to address four data inefficiencies in cloud computing.
The first system, Grandet, solves the data storage inefficiency caused by the paradigm shift from upfront provisioning to a variety of pay-as-you-go cloud services. Grandet is an extensible storage system that significantly reduces storage costs for web applications deployed in the cloud. Under the hood, it supports multiple heterogeneous stores and unifies them by placing each data object at the store deemed most economical. Our results show that Grandet reduces these applications' costs by an average of 42.4%, and it is fast, scalable, and easy to use.
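Placing each object at the most economical store can be sketched as a per-object cost comparison. The cost model below (a storage price plus a per-read price) and all names and prices are illustrative assumptions; they are neither Grandet's actual model nor real cloud prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """A hypothetical pay-as-you-go store with two price components."""
    name: str
    storage_per_gb_month: float   # assumed price per GB stored per month
    cost_per_1k_reads: float      # assumed price per 1000 read requests


def monthly_cost(store, size_gb, reads_per_month):
    """Estimated monthly cost of keeping one object in the given store."""
    storage = size_gb * store.storage_per_gb_month
    requests = reads_per_month / 1000 * store.cost_per_1k_reads
    return storage + requests


def cheapest_store(stores, size_gb, reads_per_month):
    """Pick the store with the lowest estimated monthly cost for this object."""
    return min(stores, key=lambda s: monthly_cost(s, size_gb, reads_per_month))
```

Under this model a rarely read object lands in the store with cheap capacity, while a frequently read object of the same size lands in the store with cheap requests. A fuller model would also need to account for writes, data transfer, and the cost of migrating an object when its access pattern changes.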
The second system, Unic, solves the data inefficiency caused by the paradigm shift from single-tenancy to multi-tenancy. Unic securely deduplicates general computations. It exports a cache service that allows cloud applications running on behalf of mutually distrusting users to memoize and reuse computation results, thereby improving performance. Unic achieves both integrity and secrecy through a novel use of code attestation, and it provides a simple yet expressive API that enables applications to deduplicate their own rich computations. Our results show that Unic is easy to use, speeds up applications by an average of 7.58x, and incurs little storage overhead.
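The memoize-and-reuse part of this design can be sketched as a cache keyed by a digest of the code and its inputs. The sketch deliberately omits the code attestation and secrecy guarantees that are central to Unic, and the class, its key derivation, and its API are hypothetical rather than Unic's own.

```python
import hashlib


class ComputationCache:
    """Naive computation cache: identical code and inputs reuse one result."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def _key(self, func, args):
        # Assumed identity of a computation: bytecode, constants, names, inputs.
        code = func.__code__
        digest = hashlib.sha256()
        digest.update(code.co_code)
        digest.update(repr(code.co_consts).encode())
        digest.update(repr(code.co_names).encode())
        digest.update(repr(args).encode())
        return digest.hexdigest()

    def run(self, func, *args):
        """Return a cached result if this computation was seen before."""
        key = self._key(func, args)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = func(*args)
        self._results[key] = result
        return result
```

Calling `run` twice with the same function and arguments executes the function once and serves the second call from the cache. In a multi-tenant setting this naive keying is not enough, since a client could claim a result for code it never ran, which is the gap that the attestation-based integrity and secrecy mechanisms are meant to close.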
The third system, Lambdata, solves the data inefficiency caused by the paradigm shift to serverless computing, where developers only write core business logic, and cloud service providers maintain all the infrastructure. Lambdata is a novel serverless computing system that enables developers to declare a cloud function's data intents, including both data read and data written. Once data intents are made explicit, Lambdata performs a variety of optimizations to improve speed, including caching data locally and scheduling functions based on code and data locality. Our results show that Lambdata achieves an average speedup of 1.51x on the turnaround time of practical workloads and reduces monetary cost by 16.5%.
The fourth system, CleanOS, solves the data inefficiency caused by the paradigm shift from desktop computers to smartphones always connected to the cloud. CleanOS is a new Android-based operating system that manages sensitive data rigorously and maintains a clean environment at all times. It identifies and tracks sensitive data, encrypts it with a key, and evicts that key to the cloud when the data is not in active use on the device. Our results show that CleanOS limits sensitive-data exposure drastically while incurring acceptable overheads on mobile networks.