
3 research outputs found

Critical Analysis of Solutions to Hadoop Small File Problem

Author: Dr. Chandramouli H
Prof. Shwetha K S
Publication venue: Global Journals Inc. (US)
Publication date: 28/10/2023

The Hadoop big data platform is designed to process large volumes of data. The small file problem is a performance bottleneck in Hadoop processing: files smaller than the Hadoop block size create a large storage overhead at the NameNodes and waste computational resources by spawning many map tasks. Various solutions, such as merging small files and mapping multiple map threads to the same Java virtual machine instance, have been proposed to solve the small file problem in Hadoop. This survey critically analyzes existing work addressing the small file problem in Hadoop and in variant platforms such as Spark. The aim is to understand how effectively these approaches reduce storage and computational overhead and to identify open issues for further research.
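The merging idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from the surveyed papers: the function name `pack_small_files`, the greedy grouping strategy, the 128 MB block size, and the 1 MB sample files are all assumptions chosen for illustration. It only shows how grouping small files into block-sized containers shrinks the number of objects the NameNode has to track and, under the simplifying assumption of one map task per file, the number of map tasks.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (128 MB)


def pack_small_files(sizes, block_size=BLOCK_SIZE):
    """Greedily group file sizes (in bytes) into containers of at most block_size.

    Hypothetical helper for illustration; real merge tools also write an
    index so that each small file can be located inside its container.
    """
    containers = []          # each container is a list of file sizes
    current, used = [], 0
    for size in sizes:
        # Start a new container when the next file would overflow this one.
        if current and used + size > block_size:
            containers.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        containers.append(current)
    return containers


# 1,000 hypothetical small files of 1 MB each.
sizes = [1024 * 1024] * 1000
merged = pack_small_files(sizes)
print(len(sizes), "small files ->", len(merged), "merged containers")
```

With these assumed sizes, 128 files fit in each container, so the 1,000 files collapse into 8 containers.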

Global Journal of Computer Science and Technology (GJCST)

Hadoop Perfect File: A fast and memory-efficient metadata access archive file to face small files problem in HDFS

Author: Du Xiaojiang
Guizani Mohsen
Lin Kwei Jay
Tao Wenjun
Tchaye-Kondi Jude
Zhai Yanlong
Zhu Liehuang
Publication venue: Elsevier BV
Publication date: 01/10/2021

HDFS faces several issues when handling a large number of small files. These issues are well addressed by archive systems, which combine small files into larger ones and use index files to hold the information needed to retrieve a small file's content from the large archive file. However, existing archive-based solutions incur significant overhead when retrieving a file's content, since additional processing and I/O are needed to acquire the retrieval information before the actual content can be accessed, thereby degrading access efficiency. This paper presents a new archive file named Hadoop Perfect File (HPF). HPF minimizes access overhead by directly accessing metadata from the part of the index file that contains the information. It consequently reduces the additional processing and I/O needed and improves the efficiency of access to archive files. Our index system uses two hash functions. Metadata records are distributed across index files using a dynamic hash function. We further build an order-preserving perfect hash function that memorizes the position of a small file's metadata record within the index file. The authors thank the anonymous reviewers for their insightful suggestions. This work is supported by the National Natural Science Foundation of China (Grant No. 61602037).
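The two-level lookup described in the abstract above can be sketched as follows. This is an illustrative stand-in, not the HPF implementation: the class `TwoLevelIndex` and its methods are hypothetical names, a fixed CRC32-modulo bucket choice replaces the paper's dynamic hash function, and a plain dictionary built over sorted names replaces the order-preserving perfect hash function. The sketch only shows the access pattern, where one hash selects the index file and a second maps the file name straight to its record position, so no scan of the index is needed.

```python
import zlib


class TwoLevelIndex:
    """Hypothetical two-level index over an archive of merged small files."""

    def __init__(self, records, num_index_files=4):
        # records: {file_name: (offset, length)} within the merged archive.
        buckets = [[] for _ in range(num_index_files)]
        for name in records:
            buckets[self._bucket(name, num_index_files)].append(name)

        self.index_files = []  # one sorted list of metadata records per bucket
        self.positions = []    # stand-in for the order-preserving perfect hash
        for names in buckets:
            names.sort()
            self.index_files.append([(n,) + tuple(records[n]) for n in names])
            self.positions.append({n: i for i, n in enumerate(names)})

    @staticmethod
    def _bucket(name, num_index_files):
        # Level 1: choose the index file (static stand-in for a dynamic hash).
        return zlib.crc32(name.encode("utf-8")) % num_index_files

    def lookup(self, name):
        # Level 2: jump directly to the record position inside the index file.
        bucket = self._bucket(name, len(self.index_files))
        pos = self.positions[bucket].get(name)
        if pos is None:
            return None
        _, offset, length = self.index_files[bucket][pos]
        return offset, length


# Build a tiny in-memory "archive" from hypothetical small files.
small_files = {"a.txt": b"alpha", "b.txt": b"bravo!", "c.txt": b"charlie"}
archive, records, offset = b"", {}, 0
for name, data in small_files.items():
    records[name] = (offset, len(data))
    archive += data
    offset += len(data)

index = TwoLevelIndex(records)
off, length = index.lookup("b.txt")
print(archive[off:off + length])
```

In a real archive each index file would live on disk, and the benefit of the position lookup is that only the part of the index file holding the requested record has to be read.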

Qatar University Institutional Repository