538 research outputs found

    Critical Analysis of Solutions to Hadoop Small File Problem

    The Hadoop big data platform is designed to process large volumes of data. The small file problem is a performance bottleneck in Hadoop processing: files smaller than the Hadoop block size create huge storage overhead at the NameNode and also waste computational resources because many map tasks are spawned. Various solutions, such as merging small files and mapping multiple map threads to the same Java virtual machine instance, have been proposed to solve the small file problem in Hadoop. This survey does a critical analysis of existing works addressing small file problems in Hadoop and its variant platforms like Spark. The aim is to understand their effectiveness in reducing the storage and computational overhead and to identify the open issues for further research
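    As an illustration of the file-merging approach mentioned in this abstract, the following is a minimal sketch (not taken from any of the surveyed papers) that packs a local directory of small files into a single Hadoop SequenceFile keyed by file name; the HDFS target path and the local directory are assumptions made for the example.

        import java.io.File;
        import java.nio.file.Files;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class SmallFileMerger {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Target container on HDFS: one large file instead of many small ones (hypothetical path).
                Path target = new Path("hdfs:///user/demo/merged.seq");
                SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(target),
                        SequenceFile.Writer.keyClass(Text.class),
                        SequenceFile.Writer.valueClass(BytesWritable.class));
                try {
                    // Append each local small file as one key/value record (hypothetical source directory).
                    for (File f : new File("/data/small-files").listFiles()) {
                        byte[] bytes = Files.readAllBytes(f.toPath());
                        writer.append(new Text(f.getName()), new BytesWritable(bytes));
                    }
                } finally {
                    IOUtils.closeStream(writer);
                }
            }
        }

    A downstream MapReduce job can then read the merged container with one input split per block, instead of spawning one map task per tiny file.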

    A REVIEW ON SMALL FILES IN HADOOP: A NOVEL APPROACH TO UNDERSTAND SMALL FILES PROBLEM IN HADOOP

    Hadoop is an open-source data management system designed for storing and processing large volumes of data, with a minimum block size of 64 MB. Files smaller than this minimum block size cannot be handled efficiently by Hadoop, because small files result in many seeks and much hopping between datanodes. A survey of the existing literature has been carried out to analyze the effects of, and solutions for, the small files problem in Hadoop. This paper presents that survey, lists many effective solutions to the problem, and argues that substantial further research on the small file problem is needed in order to attain effective and efficient solutions
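    For context, a quick way to gauge how pronounced the problem is on a given cluster is to count the files whose length falls below the block size. The sketch below uses the standard HDFS FileSystem API; the input directory is an assumption for illustration.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class SmallFileCounter {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                Path dir = new Path("/user/demo/input");        // hypothetical directory
                long blockSize = fs.getDefaultBlockSize(dir);   // e.g. 64 MB or 128 MB
                long small = 0, total = 0;
                for (FileStatus st : fs.listStatus(dir)) {
                    if (st.isFile()) {
                        total++;
                        if (st.getLen() < blockSize) small++;   // smaller than one block
                    }
                }
                System.out.printf("%d of %d files are smaller than the %d-byte block size%n",
                        small, total, blockSize);
            }
        }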

    Concentric Layout, A New Scientific Data Layout For Matrix Data Set In Hadoop File System

    The data generated by scientific simulations, sensors, monitors, and optical telescopes has grown at a dramatic speed. In order to analyze the raw data in a time- and space-efficient way, a data preprocessing step is needed to achieve better performance in the data analysis phase. Current research shows an increasing trend of adopting the MapReduce framework for large-scale data processing. However, the data access patterns generally applied to scientific data sets are not directly supported by the current MapReduce framework. The gap between the requirements of analytics applications and the properties of the MapReduce framework motivates us to provide support for these data access patterns in MapReduce. In our work, we studied the data access patterns in matrix files and proposed a new concentric data layout solution to facilitate matrix data access and analysis in the MapReduce framework. Concentric data layout is a data layout that maintains the dimensional property at the chunk level. Contrary to the continuous data layout adopted by default in the current Hadoop framework, the concentric data layout stores the data from the same sub-matrix in one chunk. This matches well with common matrix operations. The concentric data layout preprocesses the data beforehand and optimizes the subsequent runs of MapReduce applications. The experiments indicate that the concentric data layout improves overall performance, reducing execution time by 38% when the file size is 16 GB; it also relieves the data overhead phenomenon and increases the effective data retrieval rate by 32% on average
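    To make the sub-matrix-per-chunk idea concrete, here is a small illustrative sketch (not the authors' implementation) of how a matrix element maps to a chunk and an offset when the matrix is laid out tile by tile rather than row by row; the tile dimensions are assumptions.

        // Illustrative mapping of (row, col) to a chunk id and an offset within the chunk
        // when a matrix is stored tile by tile (sub-matrix by sub-matrix).
        public class TiledLayout {
            final long cols;        // total number of columns in the matrix
            final int tileRows;     // rows per tile (assumed, e.g. 1024)
            final int tileCols;     // columns per tile (assumed, e.g. 1024)

            TiledLayout(long cols, int tileRows, int tileCols) {
                this.cols = cols;
                this.tileRows = tileRows;
                this.tileCols = tileCols;
            }

            long chunkId(long row, long col) {
                long tilesPerRow = (cols + tileCols - 1) / tileCols;   // ceil(cols / tileCols)
                return (row / tileRows) * tilesPerRow + (col / tileCols);
            }

            long offsetInChunk(long row, long col) {
                // Row-major order inside the tile; all elements of one sub-matrix stay in one chunk.
                return (row % tileRows) * tileCols + (col % tileCols);
            }
        }

    With such a layout, an operation that touches one sub-matrix reads a single chunk, whereas a row-major layout would scatter the same elements across many chunks.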

    Tiny datablock in saving Hadoop distributed file system wasted memory

    The Hadoop distributed file system (HDFS) is the file system that Hadoop uses to store all incoming data. Since its introduction, HDFS has consumed a huge amount of memory in order to serve a normal dataset. Moreover, the current file-saving mechanism in HDFS stores only one file per datablock. Thus, a file of just 5 MB will take up a whole datablock, leaving the rest of that capacity unavailable for other incoming files, which is a huge waste of memory when serving a normal-sized dataset. This paper proposes a method called tiny datablock-HDFS (TD-HDFS) to increase the usability of HDFS memory and increase its file hosting capability by reducing the datablock size to the minimum capacity and then merging all the related datablocks into one master datablock. This master datablock consists of tiny virtual datablocks that hold the related small files together and exploit the full memory of the master datablock. The result of this study is a running HDFS with a minimum amount of wasted memory and the same read/write performance. The results were examined through a comparison between standard HDFS file hosting and the proposed solution of this study.
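    The following is a minimal sketch of the bookkeeping such a merge implies: an index from each small file's name to its (offset, length) inside one merged container. The class and field names are illustrative and do not reflect the actual TD-HDFS implementation.

        import java.util.HashMap;
        import java.util.Map;

        // Illustrative index for small files packed back to back into one master block/container.
        public class MergedBlockIndex {
            // Describes where one small file lives inside the merged container.
            public static final class Extent {
                final long offset;
                final long length;
                Extent(long offset, long length) { this.offset = offset; this.length = length; }
            }

            private final Map<String, Extent> index = new HashMap<>();
            private long nextOffset = 0;

            // Record a small file appended at the current end of the container.
            public Extent add(String fileName, long length) {
                Extent e = new Extent(nextOffset, length);
                index.put(fileName, e);
                nextOffset += length;
                return e;
            }

            // Look up where to read a small file from within the container.
            public Extent locate(String fileName) {
                return index.get(fileName);
            }
        }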

    Nomadic fog storage

    Mobile services increasingly demand more processing and storage. However, mobile devices are known for their constraints in terms of processing, storage, and energy. Early proposals addressed these aspects by having mobile devices access remote clouds, but such proposals suffer from long latencies and backhaul bandwidth limitations when retrieving data. To mitigate these issues, edge clouds have been proposed. Under this paradigm, intermediate nodes are placed between the mobile devices and the remote cloud. These intermediate nodes should fulfill the end users' resource requests, namely data and processing capability, and reduce the energy consumption of the mobile devices' batteries. Then again, mobile traffic demand is increasing exponentially and the resources available on mobile devices are evolving faster than ever. This urges the use of mobile nodes' spare capabilities to fulfill the requirements imposed by new mobile applications. In this new scenario, the mobile devices should become both consumers and providers of the emerging services. The current work investigates this possibility by designing, implementing, and testing a novel nomadic fog storage system that uses fog and mobile nodes to support the upcoming applications. In addition, a novel resource allocation algorithm has been developed that considers the available energy on mobile devices and the network topology. The system also includes a replica management module based on data popularity. The comprehensive evaluation of the fog proposal has shown that it is responsive, offloads traffic from the backhaul links, and enables a fair energy depletion among mobile nodes by storing content in neighbor nodes with higher battery autonomy.
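    As a rough illustration of allocation that weighs both residual energy and topology, the sketch below scores candidate storage nodes by battery level and hop distance and picks the best one; the weights and fields are assumptions and do not reproduce the thesis' actual algorithm.

        import java.util.Comparator;
        import java.util.List;
        import java.util.Optional;

        // Illustrative selection of a storage node that favors high battery and nearby nodes.
        public class ReplicaPlacer {
            public static final class Node {
                final String id;
                final double batteryLevel;   // 0.0 .. 1.0 remaining energy
                final int hops;              // hop distance from the requesting device
                public Node(String id, double batteryLevel, int hops) {
                    this.id = id; this.batteryLevel = batteryLevel; this.hops = hops;
                }
            }

            // Higher battery raises the score, each extra hop lowers it (weights are assumed).
            static double score(Node n) {
                return 0.7 * n.batteryLevel - 0.3 * (n.hops / 10.0);
            }

            public static Optional<Node> choose(List<Node> candidates) {
                return candidates.stream().max(Comparator.comparingDouble(ReplicaPlacer::score));
            }
        }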

    Early Accurate Results for Advanced Analytics on MapReduce

    Approximate results based on samples often provide the only way in which advanced analytical applications on very massive data sets can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported in MapReduce-oriented systems, although these systems are intended for `big data'. Therefore, we proposed and implemented a non-parametric extension of Hadoop that allows the incremental computation of early results for arbitrary workflows, along with reliable on-line estimates of the degree of accuracy achieved so far in the computation. These estimates are based on a technique called bootstrapping that has been widely employed in statistics and can be applied to arbitrary functions and data distributions. In this paper, we describe our Early Accurate Result Library (EARL) for Hadoop, which was designed to minimize the changes required to the MapReduce framework. Various tests of EARL on Hadoop are presented to characterize the frequent situations where EARL can provide major speed-ups over the current version of Hadoop. Comment: VLDB201
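    For readers unfamiliar with the bootstrapping technique the abstract refers to, here is a generic sketch (not EARL's code) that estimates the standard error of a sample mean by resampling with replacement; the sample values and the number of resamples are arbitrary assumptions.

        import java.util.Random;

        // Generic bootstrap: resample with replacement, recompute the statistic,
        // and use the spread of the resampled statistics as an error estimate.
        public class BootstrapErrorEstimate {
            public static void main(String[] args) {
                double[] sample = {4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4};  // assumed sample
                int resamples = 1000;
                Random rnd = new Random(42);

                double[] means = new double[resamples];
                for (int b = 0; b < resamples; b++) {
                    double sum = 0;
                    for (int i = 0; i < sample.length; i++) {
                        sum += sample[rnd.nextInt(sample.length)];   // draw with replacement
                    }
                    means[b] = sum / sample.length;
                }

                double mean = 0;
                for (double m : means) mean += m;
                mean /= resamples;
                double var = 0;
                for (double m : means) var += (m - mean) * (m - mean);
                double stdErr = Math.sqrt(var / (resamples - 1));
                System.out.printf("bootstrap estimate of the mean: %.3f +/- %.3f%n", mean, stdErr);
            }
        }

    The same idea extends to more complex statistics over MapReduce outputs: each resample yields one value of the statistic, and their dispersion serves as the running accuracy estimate.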