1,176 research outputs found
A Robust Fault-Tolerant and Scalable Cluster-wide Deduplication for Shared-Nothing Storage Systems
Deduplication has been largely employed in distributed storage systems to
improve space efficiency. Traditional deduplication research ignores the design
specifications of shared-nothing distributed storage systems such as no central
metadata bottleneck, scalability, and storage rebalancing. Further,
deduplication introduces transactional changes, which are prone to errors in
the event of a system failure, resulting in inconsistencies in data and
deduplication metadata. In this paper, we propose a robust, fault-tolerant and
scalable cluster-wide deduplication that can eliminate duplicate copies across
the cluster. We design a distributed deduplication metadata shard which
guarantees performance scalability while preserving the design constraints of
shared- nothing storage systems. The placement of chunks and deduplication
metadata is made cluster-wide based on the content fingerprint of chunks. To
ensure transactional consistency and garbage identification, we employ a
flag-based asynchronous consistency mechanism. We implement the proposed
deduplication on Ceph. The evaluation shows high disk-space savings with
minimal performance degradation as well as high robustness in the event of
sudden server failure.Comment: 6 Pages including reference
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves comparable
performance as specialized graph computation systems, while outperforming them
in end-to-end graph pipelines. Moreover, GraphX achieves a balance between
expressiveness, performance, and ease of use
Transactional filesystems
Dissertação de Mestrado em Engenharia InformáticaThe task of implementing correct software is not trivial; mainly when facing the need for supporting concurrency. To overcome this difficulty, several researchers proposed the technique of providing the well known database transactional models as an abstraction for existing programming languages, allowing a software programmer to define groups of computations as transactions and benefit from the expectable semantics of the underlying transactional model. Prototypes for this programming model are nowadays made available by many research teams but are still far from perfection due to a considerable number of operational restrictions. Mostly, these restrictions derive from the limitations on the use of input-output functions inside a transaction. These functions are frequently irreversible which disables their compatibility with a transactional engine due to its impossibility to undo their effects in the event of aborting a transaction.
However, there is a group of input-output operations that are potentially reversible and that can produce a valuable tool when provided within the transactional programming model explained above: the file system operations. A programming model that would involve in a transaction not only a set of memory operations but also a set of file operations, would allow the software programmer to define algorithms in a much flexible and simple way, reaching greater stability and consistency in each application.
In this document we purpose to specify and allow the use of this type of operations inside a transactional programming model, as well as studying the advantages and disadvantages of this approach
- …