The Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo.
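The separation of metadata service and data transfer described in this abstract can be sketched as follows. This is a hedged, minimal model of the HDFS read path, not the real HDFS API: all class and method names (`NameNode`, `DataNode`, `read_file`, `get_block_locations`) are illustrative assumptions.

```python
# Minimal model of the HDFS read path: the client asks the NameNode for
# block locations, then streams each block from a DataNode that holds a
# replica. Names here are illustrative, not the actual HDFS client API.

class NameNode:
    """Holds filesystem metadata: which DataNodes store each block."""
    def __init__(self):
        self.block_map = {}  # filename -> list of (block_id, [DataNode, ...])

    def add_file(self, name, blocks):
        self.block_map[name] = blocks

    def get_block_locations(self, name):
        return self.block_map[name]


class DataNode:
    """Stores the actual block bytes."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, name):
    """Client-side read: one metadata lookup, then per-block transfers."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(name):
        data += replicas[0].read_block(block_id)  # pick one replica
    return data
```

The design point this models is that the NameNode never touches file contents; it only answers "where are the blocks", and all bulk traffic flows directly between clients and DataNodes.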
A CACHING EVALUATION OF THE HADOOP DISTRIBUTED FILE SYSTEM
The Hadoop Distributed File System (HDFS) is a distributed file system used to support multiple widely-used big data frameworks, including Apache Hadoop and Apache Spark. Since these frameworks are often run across many compute nodes, it is possible that multiple nodes will read the same data. In addition, since data is replicated across multiple nodes for storage, the same data will be written multiple times across the network. In this paper, we conduct an evaluation of the caching potential present in HDFS in order to determine if in-network caching, particularly of the type seen in Named Data Networking (NDN), would reduce the amount of traffic seen in a Spark cluster network, as well as the average load on each data storage node. Our results show that for most benchmarks running on Apache Spark, a majority of the large read operations were done to transfer the Spark and application dependency libraries to each compute node. In addition, there was not a significant amount of read traffic in the network for most of the applications we evaluated, making the benefits of in-network caching for HDFS questionable.
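The kind of analysis this abstract describes can be approximated with a simple trace model. The sketch below is an assumption for illustration (the trace format and block size are invented): given a log of network block reads, it bounds how much traffic an ideal in-network cache could absorb, since only the first fetch of each distinct block must cross the network.

```python
# Best-case estimate of in-network cache savings from a read trace.
# The trace is simply a sequence of block ids read over the network;
# this is an illustrative model, not the paper's actual methodology.

from collections import Counter

def cache_savings(trace, block_size):
    """Return (total bytes transferred, best-case bytes served from cache)."""
    counts = Counter(trace)
    total = sum(counts.values()) * block_size
    # The first fetch of each distinct block is always a cache miss.
    misses = len(counts) * block_size
    return total, total - misses
```

On a trace dominated by one-off reads (as the paper found for most evaluated applications), `total - misses` stays small, which is exactly why the benefit of caching was judged questionable.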
High Performance Fault-Tolerant Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. Huge amounts of data are generated from many sources daily, and maintaining such data is a challenging task. One proposed solution is to use Hadoop: building on work published by Google, Doug Cutting and his team developed an open source project called Hadoop. Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. HDFS is designed to be a scalable, fault-tolerant, distributed storage system and, like Hadoop in general, to be deployed on low-cost hardware. HDFS stores filesystem metadata and application data separately: metadata is kept on a dedicated server called the NameNode, while application data is stored on separate servers called DataNodes. The file system is accessed via HDFS clients, which first contact the NameNode for data locations and then transfer data to (write) or from (read) the specified DataNodes. A download request chooses only one of the servers to download from; the other replicated servers are not used, so as the file size increases, the download time increases. In this paper we study three policies for block selection: "first", "random", and "load-based". Our results show that "first" runs slower than "random", and "random" runs slower than "load-based".
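The three selection policies compared in the abstract can be sketched as replica pickers. This is a hedged illustration: the function names and the load metric (a per-server load dictionary) are assumptions, not the paper's implementation.

```python
# Sketch of the three replica-selection policies: "first" always takes the
# first listed replica, "random" picks uniformly, and "load-based" picks
# the least-loaded server. The load metric here is an assumed dictionary.

import random

def pick_first(replicas, load):
    # "first": ignore load, always use the first replica in the list.
    return replicas[0]

def pick_random(replicas, load, rng=random.Random(0)):
    # "random": uniform choice among replicas (seeded rng for repeatability).
    return rng.choice(replicas)

def pick_loadbased(replicas, load):
    # "load-based": choose the replica on the least-loaded server.
    return min(replicas, key=lambda r: load[r])
```

The reported ordering (first < random < load-based in speed) matches intuition: "first" concentrates all downloads on one server, "random" spreads them blindly, and "load-based" actively avoids hot servers.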
Tinzenite: Encrypted Peer to Peer File Synchronization via the Tox Protocol
We proposed and implemented an open source, peer-to-peer file synchronization software based on the Tox communication protocol. Targeted features include fully secure communication between peers, an encrypted server peer, and a focus on ease of use while retaining data security. The software suite was implemented on top of the Tox protocol, with Go (Golang) as the programming language, and the server peer was built to offer a free choice of storage mechanisms, for which we implemented support for the Hadoop Distributed File System. The proof-of-concept implementation was shown to work, and further possible work is discussed.
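The "free choice of storage mechanisms" design can be sketched as a storage-backend interface with interchangeable implementations, of which an HDFS-backed one would be a plug-in. This is purely illustrative: Tinzenite itself is written in Go, and the interface below is an assumption, not its actual API.

```python
# Pluggable storage-backend sketch: the server peer writes through an
# abstract interface, so local disk, HDFS, or anything else can back it.
# All names here are illustrative assumptions.

from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def put(self, path, data):
        """Store bytes under a path."""

    @abstractmethod
    def get(self, path):
        """Retrieve previously stored bytes."""


class InMemoryBackend(StorageBackend):
    """Stand-in for a local-disk or HDFS-backed implementation."""
    def __init__(self):
        self.store = {}

    def put(self, path, data):
        self.store[path] = data

    def get(self, path):
        return self.store[path]
```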
Lustre, Hadoop, Accumulo
Data processing systems impose multiple views on data as it is processed by
the system. These views include spreadsheets, databases, matrices, and graphs.
There are a wide variety of technologies that can be used to store and process
data through these different steps. The Lustre parallel file system, the Hadoop
distributed file system, and the Accumulo database are all designed to address
the largest and the most challenging data storage problems. There have been
many ad-hoc comparisons of these technologies. This paper describes the
foundational principles of each technology, provides simple models for
assessing their capabilities, and compares the various technologies on a
hypothetical common cluster. These comparisons indicate that Lustre provides 2x
more storage capacity, is less likely to lose data during 3 simultaneous drive
failures, and provides higher bandwidth on general purpose workloads. Hadoop
can provide 4x greater read bandwidth on special purpose workloads. Accumulo
provides 10,000x lower latency on random lookups than either Lustre or Hadoop,
but Accumulo's bulk bandwidth is 10x less. Significant recent work has been
done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo
to be combined in different ways.
Comment: 6 pages; accepted to IEEE High Performance Extreme Computing conference, Waltham, MA, 201
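A minimal version of the kind of capacity model the paper builds can illustrate the "2x more storage" claim: HDFS-style 3x replication keeps one third of raw capacity usable, while a parity scheme like the RAID-6 configurations common under Lustre loses only two drives per stripe. The drive counts and stripe geometry below are assumptions for illustration, not the paper's actual cluster.

```python
# Usable-capacity model: triple replication vs. RAID-6 parity.
# raw_tb, replica count, and stripe width are illustrative assumptions.

def usable_replication(raw_tb, replicas=3):
    # Every byte is stored `replicas` times, so usable = raw / replicas.
    return raw_tb / replicas

def usable_raid6(raw_tb, stripe_width=10):
    # RAID-6: two parity drives per stripe of `stripe_width` drives.
    return raw_tb * (stripe_width - 2) / stripe_width
```

With 600 TB of raw disk, 3x replication yields 200 TB usable while 8+2 RAID-6 yields 480 TB, i.e. roughly the 2x-or-better capacity advantage reported for Lustre.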
Power consumption has become a critical issue in large scale clusters. Existing solutions for addressing the servers' energy consumption suggest "shrinking" the set of active machines, at least until more power-proportional hardware devices become available. This paper demonstrates that leveraging the sleeping state, however, may lead to unacceptably poor performance and low data availability if the distributed services are not aware of the power management's actions. Therefore, we present an architecture for cluster services in which the deployed services overcome this problem by actively participating in any action taken by the power management. We propose, implement, and evaluate modifications for the Hadoop Distributed File System and the MapReduce clone that make them capable of operating efficiently under limited power budgets.
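The availability problem described here can be made concrete with a small check. The sketch below is an assumption for illustration: given a block-to-replica layout, it finds blocks that would lose every replica if the power manager put a set of nodes to sleep, which is exactly the situation a power-aware service would veto or re-replicate around.

```python
# Find HDFS blocks that become unavailable when nodes are put to sleep.
# The layout format (block id -> set of hosting nodes) is an assumption.

def unavailable_blocks(block_replicas, sleeping):
    """Return blocks whose every replica lives on a sleeping node."""
    asleep = set(sleeping)
    return [b for b, nodes in block_replicas.items() if nodes <= asleep]
```

A power manager that is unaware of the service could shrink the cluster past this point; a coordinated one would only sleep node sets for which this list stays empty.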
Optimal Locally Repairable Codes and Connections to Matroid Theory
Petabyte-scale distributed storage systems are currently transitioning to
erasure codes to achieve higher storage efficiency. Classical codes like
Reed-Solomon are highly sub-optimal for distributed environments due to their
high overhead in single-failure events. Locally Repairable Codes (LRCs) form a
new family of codes that are repair efficient. In particular, LRCs minimize the
number of nodes participating in single node repairs during which they generate
small network traffic. Two large-scale distributed storage systems have already
implemented different types of LRCs: Windows Azure Storage and the Hadoop
Distributed File System RAID used by Facebook. The fundamental bounds for LRCs,
namely the best possible distance for a given code locality, were recently
discovered, but few explicit constructions exist. In this work, we present
explicit and optimal LRCs that are simple to construct. Our construction is
based on grouping Reed-Solomon (RS) coded symbols to obtain RS coded symbols
over a larger finite field. We then partition these RS symbols into small groups,
and re-encode them using a simple local code that offers low repair locality.
For the analysis of the optimality of the code, we derive a new result on the
matroid represented by the code generator matrix.
Comment: Submitted for publication; a shorter version was presented at ISIT 201
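The locality idea behind LRCs can be shown with a toy code: partition coded symbols into small groups and add one XOR parity per group, so a single lost symbol is repaired from its few group mates instead of contacting many nodes. This sketch illustrates only the locality mechanism over GF(2); it is not the paper's optimal RS-based construction.

```python
# Toy local-repair code: groups of r symbols, each closed by an XOR parity.
# Repairing one lost symbol reads only the r survivors in its own group.

def add_local_parity(symbols, r):
    """Split `symbols` (ints) into groups of size r and append XOR parity."""
    groups = []
    for i in range(0, len(symbols), r):
        g = symbols[i:i + r]
        parity = 0
        for s in g:
            parity ^= s
        groups.append(g + [parity])
    return groups

def repair(group, lost_index):
    """Recover one lost symbol by XORing the other symbols in its group."""
    val = 0
    for j, s in enumerate(group):
        if j != lost_index:
            val ^= s
    return val
```

This is the sense in which LRCs make single-node repairs cheap: the repair degree is the group size r, not the code dimension k, at the cost of storing one extra parity per group.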