The Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo.
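The separation of metadata service and data transfer described in this abstract can be sketched as follows. This is a hedged, minimal model of the HDFS read path, not the real HDFS API: all class and method names (`NameNode`, `DataNode`, `read_file`, `get_block_locations`) are illustrative assumptions.

```python
# Minimal model of the HDFS read path: the client asks the NameNode for
# block locations, then streams each block from a DataNode that holds a
# replica. Names here are illustrative, not the actual HDFS client API.

class NameNode:
    """Holds filesystem metadata: which DataNodes store each block."""
    def __init__(self):
        self.block_map = {}  # filename -> list of (block_id, [DataNode, ...])

    def add_file(self, name, blocks):
        self.block_map[name] = blocks

    def get_block_locations(self, name):
        return self.block_map[name]


class DataNode:
    """Stores the actual block bytes."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, name):
    """Client-side read: one metadata lookup, then per-block transfers."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(name):
        data += replicas[0].read_block(block_id)  # pick one replica
    return data
```

The design point this models is that the NameNode never touches file contents; it only answers "where are the blocks", and all bulk traffic flows directly between clients and DataNodes.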
A CACHING EVALUATION OF THE HADOOP DISTRIBUTED FILE SYSTEM
The Hadoop Distributed File System (HDFS) is a distributed file system used to support multiple widely-used big data frameworks, including Apache Hadoop and Apache Spark. Since these frameworks are often run across many compute nodes, it is possible that multiple nodes will read the same data. In addition, since data is replicated across multiple nodes for storage, the same data will be written multiple times across the network. In this paper, we conduct an evaluation of the caching potential present in HDFS in order to determine if in-network caching, particularly of the type seen in Named Data Networking (NDN), would reduce the amount of traffic seen in a Spark cluster network, as well as the average load on each data storage node. Our results show that for most benchmarks running on Apache Spark, a majority of the large read operations were done to transfer the Spark and application dependency libraries to each compute node. In addition, there was not a significant amount of read traffic in the network for most of the applications we evaluated, making the benefits of in-network caching for HDFS questionable.
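The kind of analysis this abstract describes can be approximated with a simple trace model. The sketch below is an assumption for illustration (the trace format and block size are invented): given a log of network block reads, it bounds how much traffic an ideal in-network cache could absorb, since only the first fetch of each distinct block must cross the network.

```python
# Best-case estimate of in-network cache savings from a read trace.
# The trace is simply a sequence of block ids read over the network;
# this is an illustrative model, not the paper's actual methodology.

from collections import Counter

def cache_savings(trace, block_size):
    """Return (total bytes transferred, best-case bytes served from cache)."""
    counts = Counter(trace)
    total = sum(counts.values()) * block_size
    # The first fetch of each distinct block is always a cache miss.
    misses = len(counts) * block_size
    return total, total - misses
```

On a trace dominated by one-off reads (as the paper found for most evaluated applications), `total - misses` stays small, which is exactly why the benefit of caching was judged questionable.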
High Performance Fault-Tolerant Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. Huge amounts of data are generated from many sources daily, and maintaining such data is a challenging task. One proposed solution is to use Hadoop: building on work published by Google, Doug Cutting and his team developed an open source project called Hadoop. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. HDFS is designed to be a scalable, fault-tolerant, distributed storage system and, like Hadoop in general, to be deployed on low-cost hardware. HDFS stores filesystem metadata and application data separately: metadata is kept on a dedicated server called the NameNode, while application data is stored on separate servers called DataNodes. The file system is accessed via HDFS clients, which first contact the NameNode for data locations and then transfer data to (write) or from (read) the specified DataNodes. A download request chooses only one of the servers to download from; the other replicated servers are not used, so as the file size increases, the download time increases. In this paper we study three policies for block selection: "first", "random", and "load-based". Our results show that "first" runs slower than "random", and "random" runs slower than "load-based".
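The three selection policies compared in the abstract can be sketched as replica pickers. This is a hedged illustration: the function names and the load metric (a per-server load dictionary) are assumptions, not the paper's implementation.

```python
# Sketch of the three replica-selection policies: "first" always takes the
# first listed replica, "random" picks uniformly, and "load-based" picks
# the least-loaded server. The load metric here is an assumed dictionary.

import random

def pick_first(replicas, load):
    # "first": ignore load, always use the first replica in the list.
    return replicas[0]

def pick_random(replicas, load, rng=random.Random(0)):
    # "random": uniform choice among replicas (seeded rng for repeatability).
    return rng.choice(replicas)

def pick_loadbased(replicas, load):
    # "load-based": choose the replica on the least-loaded server.
    return min(replicas, key=lambda r: load[r])
```

The reported ordering (first < random < load-based in speed) matches intuition: "first" concentrates all downloads on one server, "random" spreads them blindly, and "load-based" actively avoids hot servers.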
Tinzenite: Encrypted Peer to Peer File Synchronization via the Tox Protocol
We proposed and implemented an open source, peer-to-peer file synchronization software based on the Tox communication protocol. Targeted features include fully secure communication between peers, an encrypted server peer, and a focus on ease of use while retaining data security. The software suite was implemented on top of the Tox protocol, with Go (Golang) as the programming language, and the server peer was built to offer a free choice of storage mechanisms, for which we implemented support for the Hadoop Distributed File System. The proof-of-concept implementation was shown to work, and further possible work is discussed.
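The "free choice of storage mechanisms" design can be sketched as a storage-backend interface with interchangeable implementations, of which an HDFS-backed one would be a plug-in. This is purely illustrative: Tinzenite itself is written in Go, and the interface below is an assumption, not its actual API.

```python
# Pluggable storage-backend sketch: the server peer writes through an
# abstract interface, so local disk, HDFS, or anything else can back it.
# All names here are illustrative assumptions.

from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def put(self, path, data):
        """Store bytes under a path."""

    @abstractmethod
    def get(self, path):
        """Retrieve previously stored bytes."""


class InMemoryBackend(StorageBackend):
    """Stand-in for a local-disk or HDFS-backed implementation."""
    def __init__(self):
        self.store = {}

    def put(self, path, data):
        self.store[path] = data

    def get(self, path):
        return self.store[path]
```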
Lustre, Hadoop, Accumulo
Data processing systems impose multiple views on data as it is processed by
the system. These views include spreadsheets, databases, matrices, and graphs.
There are a wide variety of technologies that can be used to store and process
data through these different steps. The Lustre parallel file system, the Hadoop
distributed file system, and the Accumulo database are all designed to address
the largest and the most challenging data storage problems. There have been
many ad-hoc comparisons of these technologies. This paper describes the
foundational principles of each technology, provides simple models for
assessing their capabilities, and compares the various technologies on a
hypothetical common cluster. These comparisons indicate that Lustre provides 2x
more storage capacity, is less likely to lose data during 3 simultaneous drive
failures, and provides higher bandwidth on general purpose workloads. Hadoop
can provide 4x greater read bandwidth on special purpose workloads. Accumulo
provides 10,000x lower latency on random lookups than either Lustre or Hadoop,
but Accumulo's bulk bandwidth is 10x less. Significant recent work has been
done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo
to be combined in different ways.
Comment: 6 pages; accepted to IEEE High Performance Extreme Computing conference, Waltham, MA, 201
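A minimal version of the kind of capacity model the paper builds can illustrate the "2x more storage" claim: HDFS-style 3x replication keeps one third of raw capacity usable, while a parity scheme like the RAID-6 configurations common under Lustre loses only two drives per stripe. The drive counts and stripe geometry below are assumptions for illustration, not the paper's actual cluster.

```python
# Usable-capacity model: triple replication vs. RAID-6 parity.
# raw_tb, replica count, and stripe width are illustrative assumptions.

def usable_replication(raw_tb, replicas=3):
    # Every byte is stored `replicas` times, so usable = raw / replicas.
    return raw_tb / replicas

def usable_raid6(raw_tb, stripe_width=10):
    # RAID-6: two parity drives per stripe of `stripe_width` drives.
    return raw_tb * (stripe_width - 2) / stripe_width
```

With 600 TB of raw disk, 3x replication yields 200 TB usable while 8+2 RAID-6 yields 480 TB, i.e. roughly the 2x-or-better capacity advantage reported for Lustre.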
Power consumption has become a critical issue in large scale clusters. Existing solutions for addressing the servers' energy consumption suggest "shrinking" the set of active machines, at least until more power-proportional hardware devices become available. This paper demonstrates that leveraging the sleeping state, however, may lead to unacceptably poor performance and low data availability if the distributed services are not aware of the power management's actions. Therefore, we present an architecture for cluster services in which the deployed services overcome this problem by actively participating in any action taken by the power management. We propose, implement, and evaluate modifications for the Hadoop Distributed File System and the MapReduce clone that make them capable of operating efficiently under limited power budgets.
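The availability problem described here can be made concrete with a small check. The sketch below is an assumption for illustration: given a block-to-replica layout, it finds blocks that would lose every replica if the power manager put a set of nodes to sleep, which is exactly the situation a power-aware service would veto or re-replicate around.

```python
# Find HDFS blocks that become unavailable when nodes are put to sleep.
# The layout format (block id -> set of hosting nodes) is an assumption.

def unavailable_blocks(block_replicas, sleeping):
    """Return blocks whose every replica lives on a sleeping node."""
    asleep = set(sleeping)
    return [b for b, nodes in block_replicas.items() if nodes <= asleep]
```

A power manager that is unaware of the service could shrink the cluster past this point; a coordinated one would only sleep node sets for which this list stays empty.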
Optimal Locally Repairable Codes and Connections to Matroid Theory
Petabyte-scale distributed storage systems are currently transitioning to
erasure codes to achieve higher storage efficiency. Classical codes like
Reed-Solomon are highly sub-optimal for distributed environments due to their
high overhead in single-failure events. Locally Repairable Codes (LRCs) form a
new family of codes that are repair efficient. In particular, LRCs minimize the
number of nodes participating in single node repairs during which they generate
small network traffic. Two large-scale distributed storage systems have already
implemented different types of LRCs: Windows Azure Storage and the Hadoop
Distributed File System RAID used by Facebook. The fundamental bounds for LRCs,
namely the best possible distance for a given code locality, were recently
discovered, but few explicit constructions exist. In this work, we present
explicit and optimal LRCs that are simple to construct. Our construction is
based on grouping Reed-Solomon (RS) coded symbols to obtain RS coded symbols
over a larger finite field. We then partition these RS symbols into small groups,
and re-encode them using a simple local code that offers low repair locality.
For the analysis of the optimality of the code, we derive a new result on the
matroid represented by the code generator matrix.
Comment: Submitted for publication; a shorter version was presented at ISIT 201
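The locality idea behind LRCs can be shown with a toy code: partition coded symbols into small groups and add one XOR parity per group, so a single lost symbol is repaired from its few group mates instead of contacting many nodes. This sketch illustrates only the locality mechanism over GF(2); it is not the paper's optimal RS-based construction.

```python
# Toy local-repair code: groups of r symbols, each closed by an XOR parity.
# Repairing one lost symbol reads only the r survivors in its own group.

def add_local_parity(symbols, r):
    """Split `symbols` (ints) into groups of size r and append XOR parity."""
    groups = []
    for i in range(0, len(symbols), r):
        g = symbols[i:i + r]
        parity = 0
        for s in g:
            parity ^= s
        groups.append(g + [parity])
    return groups

def repair(group, lost_index):
    """Recover one lost symbol by XORing the other symbols in its group."""
    val = 0
    for j, s in enumerate(group):
        if j != lost_index:
            val ^= s
    return val
```

This is the sense in which LRCs make single-node repairs cheap: the repair degree is the group size r, not the code dimension k, at the cost of storing one extra parity per group.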