An approach for storing and accessing small files on Hadoop

Abstract

HDFS (Hadoop Distributed File System), benefiting from its high fault tolerance, scalability, and low-cost storage, has gained wide adoption in cloud computing application scenarios. However, HDFS is primarily designed for streaming access to very large files; when managing massive numbers of small files, it suffers a performance penalty in both storage and access, largely due to the memory overhead on the NameNode. This paper proposes HIFM (Hierarchy Index File Merging), an approach based on merging small files. HIFM jointly considers the correlations between small files and the directory structure of the data to merge small files into large ones and to generate a hierarchical index. Index files are managed by combining centralized and distributed storage, and index-file preloading is implemented. In addition, HIFM adopts a data-prefetching mechanism to improve the efficiency of sequential access to small files. Experimental results show that HIFM effectively improves the efficiency of storing and accessing small files and significantly reduces the memory overhead of the NameNode and DataNodes. It is well suited to applications that store massive numbers of small files organized in a directory structure.

Funding: Major Science and Technology Project of the Press and Publication Industry (0610-1041BJNF2328/23) | National Key Technology R&D Program (2011BAH14B02) | Knowledge Innovation Program of the Chinese Academy of Sciences (KGCX2-YW-174)
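The abstract does not give enough detail to reproduce HIFM itself (its hierarchical index format, the split between centralized and distributed index storage, or the prefetching policy), but the core merge-plus-index idea can be illustrated with standard Hadoop APIs. The following is a minimal sketch, assuming small files are grouped by their parent directory and merged into a SequenceFile, with a per-directory index mapping each file name to the byte offset of its record; the class and method names (HifmMerger, mergeDirectory, readSmallFile) are hypothetical and not from the paper.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class HifmMerger {

    /**
     * Merges all small files directly under srcDir into one SequenceFile and
     * returns an index: file name -> byte offset of its record in the merged
     * file. One merged file per directory mirrors the data's hierarchy.
     */
    public static Map<String, Long> mergeDirectory(FileSystem fs,
                                                   Configuration conf,
                                                   Path srcDir,
                                                   Path mergedFile)
            throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(srcDir)) {
                if (status.isDirectory()) {
                    continue; // subdirectories get their own merged file
                }
                byte[] content = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, content, 0, content.length);
                }
                // Record the offset *before* appending, so a reader can
                // later seek straight to this record.
                index.put(status.getPath().getName(), writer.getLength());
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        }
        return index;
    }

    /** Reads one small file back via the index, seeking directly to its record. */
    public static byte[] readSmallFile(Configuration conf, Path mergedFile,
                                       Map<String, Long> index, String name)
            throws IOException {
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(mergedFile))) {
            reader.seek(index.get(name)); // jump to the record's start offset
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            reader.next(key, value);
            return value.copyBytes();
        }
    }
}

Because the merge preserves the files' order within a directory, sequential access maps to sequential reads of adjacent records; the prefetching the abstract describes could be approximated by reading ahead a few records after each hit, and preloading would amount to loading each directory's index map into memory before serving requests.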
