Distributed exact deduplication for primary storage infrastructures
Lecture Notes in Computer Science, Volume 8460, 2014
Deduplication of primary storage volumes in a cloud computing environment is increasingly desirable, as the resulting space savings contribute to the cost effectiveness of a large-scale multi-tenant infrastructure. However, traditional archival and backup deduplication systems impose prohibitive overhead on latency-sensitive applications deployed at these infrastructures, while current primary deduplication systems rely on special cluster filesystems, centralized components, or restrictive workload assumptions.
We present DEDIS, a fully distributed and dependable system that performs exact, cluster-wide background deduplication of primary storage. DEDIS does not depend on data locality and works on top of any unsophisticated storage backend, centralized or distributed, that exports a basic shared block device interface. The evaluation of an open-source prototype shows that DEDIS scales out and adds negligible overhead even when deduplication and intensive storage I/O run simultaneously.
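As a rough illustration of the exact, content-based deduplication that DEDIS performs, the sketch below hashes each written block and stores identical content only once, remapping duplicates to the existing physical address. This is a minimal single-process model under assumed names (BlockStore, DedupIndex) and an assumed 4 KB block size; it is not the paper's distributed, fault-tolerant implementation.

    import hashlib

    BLOCK_SIZE = 4096  # illustrative block granularity, an assumption

    class BlockStore:
        """A toy backend exporting a basic block interface."""
        def __init__(self):
            self.blocks = []
        def allocate(self, data):
            self.blocks.append(data)
            return len(self.blocks) - 1  # physical block address

    class DedupIndex:
        """Maps content hash -> physical address, with reference counts."""
        def __init__(self, store):
            self.store = store
            self.by_hash = {}   # SHA-256 digest -> physical address
            self.refcount = {}  # physical address -> logical references
        def write(self, data):
            digest = hashlib.sha256(data).digest()
            addr = self.by_hash.get(digest)
            if addr is None:            # unseen content: store it once
                addr = self.store.allocate(data)
                self.by_hash[digest] = addr
                self.refcount[addr] = 0
            self.refcount[addr] += 1    # duplicate: only remap a pointer
            return addr

    store = BlockStore()
    index = DedupIndex(store)
    a = index.write(b"x" * BLOCK_SIZE)
    b = index.write(b"x" * BLOCK_SIZE)  # same content: no new block stored
    assert a == b and len(store.blocks) == 1

In a cluster-wide system the index itself must be partitioned and kept consistent across hosts, which is where the hard distributed-systems work lies.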
Hadoop MapReduce for Mobile Cloud
The new generations of mobile devices have high processing power and storage, but they lag behind in terms of software systems for big data storage and processing. Hadoop is a scalable platform that provides distributed storage and computational capabilities on clusters of commodity hardware. Building Hadoop on a mobile network enables the devices to run data-intensive computing applications without direct knowledge of the underlying distributed systems complexities. However, these applications have severe energy and reliability constraints (e.g., caused by unexpected device failures or topology changes in a dynamic network). As mobile devices are more susceptible to unauthorized access than traditional servers, security is also a concern for sensitive data. Hence, it is paramount to consider reliability, energy efficiency, and security for such applications. The goal of this thesis is to bring the Hadoop MapReduce framework to a mobile cloud environment such that it addresses these bottlenecks in big data processing. The Mobile Distributed File System (MDFS) addresses these issues for big data processing in mobile clouds. We have developed the Hadoop MapReduce framework over MDFS and have evaluated its performance by varying input workloads in a real heterogeneous mobile cluster. Our evaluation shows that the implementation addresses all constraints in processing large amounts of data in mobile clouds. Thus, our system is a viable solution to meet the growing demands of data processing in a mobile environment.
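To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases that Hadoop distributes across a cluster (over MDFS, in this thesis); the word-count job and all names are illustrative assumptions, not code from the MDFS port.

    from collections import defaultdict

    def map_phase(record):
        # Emit (key, value) pairs; word count emits (word, 1).
        for word in record.split():
            yield word, 1

    def shuffle(pairs):
        # Group values by key, as the framework does between map and reduce.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Combine all values observed for one key.
        return key, sum(values)

    records = ["big data on mobile devices", "big data processing"]
    pairs = [p for r in records for p in map_phase(r)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # e.g. {'big': 2, 'data': 2, ...}

On a mobile cluster, each phase additionally has to survive device failures and topology changes, which is exactly where the energy, reliability, and security constraints discussed above enter.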
Towards trusted volunteer grid environments
Extensive experience shows that grid environments are among the most promising ways to solve several kinds of problems, whether in cooperative work, especially where the collaborators involved are geographically dispersed, or in resource-hungry applications that require large amounts of computing power and/or storage. Such environments can be classified into two categories: first, dedicated grids, where the federated computers are devoted solely to a specific task until its completion; second, volunteer grids, where the federated computers are not fully devoted to a specific task but can be used randomly and intermittently for any other purpose, or can be connected or disconnected at will by their owners without prior notification. Each category of grid certainly has advantages and disadvantages; nevertheless, we believe that volunteer grids are very promising and more convenient, especially for building a general-purpose, scalable distributed environment. The big challenge in such environments, however, is security and trust. Indeed, because every federated computer in such an environment can be used by several users at the same time or can be disconnected suddenly, several security problems automatically arise. In this paper, we propose a novel solution based on identity federation, agent technology, and the dynamic enforcement of access control policies that leads to the design and implementation of trusted volunteer grid environments.
Comment: 9 pages, IJCNC Journal 201
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigm. We propose
a basis, common terminology and functional factors upon which to analyze the
two approaches of both paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementations and approaches of these paradigms, shed light on the reasons for their current "architecture", and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and widely used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.
Comment: 8 pages, 2 figures
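For reference, the Ogre used in the semi-quantitative comparison, K-means clustering, can be stated in a few lines. The sketch below is a plain single-machine version for exposition only; the experiments in the paper benchmark distributed implementations of this algorithm from both paradigms.

    import random

    def kmeans(points, k, iters=20):
        centers = random.sample(points, k)
        for _ in range(iters):
            # Assignment step: attach each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for x, y in points:
                i = min(range(k), key=lambda c: (x - centers[c][0]) ** 2
                                                + (y - centers[c][1]) ** 2)
                clusters[i].append((x, y))
            # Update step: move each center to the mean of its cluster.
            for i, cl in enumerate(clusters):
                if cl:
                    centers[i] = (sum(p[0] for p in cl) / len(cl),
                                  sum(p[1] for p in cl) / len(cl))
        return centers

    pts = ([(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
           + [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(100)])
    print(kmeans(pts, k=2))

The iterative assign/update structure is what makes K-means a useful probe: the assignment step is embarrassingly parallel, while the update step requires a global reduction, stressing both paradigms' communication layers.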
Survey and Analysis of Production Distributed Computing Infrastructures
This report has two objectives. First, we describe a set of the production
distributed infrastructures currently available, so that the reader has a basic
understanding of them. This includes explaining why each infrastructure was
created and made available and how it has succeeded and failed. The set is not
complete, but we believe it is representative.
Second, we describe the infrastructures in terms of their use, which is a
combination of how they were designed to be used and how users have found ways
to use them. Applications are often designed and created with specific
infrastructures in mind, with both an appreciation of the existing capabilities
provided by those infrastructures and an anticipation of their future
capabilities. Here, the infrastructures we discuss were often designed and
created with specific applications in mind, or at least specific types of
applications. The reader should understand how the interplay between the
infrastructure providers and the users leads to such usages, which we call
usage modalities. These usage modalities are really abstractions that exist
between the infrastructures and the applications; they influence the
infrastructures by representing the applications, and they influence the applications by representing the infrastructures.
ASCR/HEP Exascale Requirements Review Report
This draft report summarizes and details the findings, results, and
recommendations derived from the ASCR/HEP Exascale Requirements Review meeting
held in June 2015. The main conclusions are as follows. 1) Larger, more capable computing and data facilities are needed to support HEP science goals in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of the demand at the 2025 timescale is at least two orders of magnitude greater -- and in some cases more -- than what is currently available. 2) The growth rate of data produced by simulations is overwhelming the ability of both facilities and researchers to store and analyze it. Additional resources and new techniques for data analysis are urgently needed. 3) Data rates and volumes from HEP experimental facilities are also straining the ability to store and analyze such large and complex datasets. Appropriately configured
leadership-class facilities can play a transformational role in enabling
scientific discovery from these datasets. 4) A close integration of HPC
simulation and data analysis will aid greatly in interpreting results from HEP
experiments. Such an integration will minimize data movement and facilitate
interdependent workflows. 5) Long-range planning between HEP and ASCR will be
required to meet HEP's research needs. To make the best use of ASCR HPC resources, the experimental HEP program needs a) an established long-term plan for access to ASCR computational and data resources, b) the ability to map workflows onto HPC resources, c) the ability of ASCR facilities to accommodate workflows run by collaborations that can have thousands of individual members, d) support for transitioning codes to the next-generation HPC platforms that will be available at ASCR facilities, and e) a workforce built up and trained to develop and use simulations and analysis in support of HEP scientific research on next-generation systems.
Comment: 77 pages, 13 figures; draft report, subject to further revision