10 research outputs found

    Efficient Ways to Improve the Performance of HDFS for Small Files

    Hadoop, an open-source implementation of MapReduce for dealing with big data, is widely used for short jobs that require low response time. Facebook, Yahoo, Google and others use Hadoop to process more than 15 terabytes of new data per day. MapReduce gathers results across multiple nodes and returns a single result or result set. Fault tolerance is provided by the MapReduce platform and is entirely transparent to programmers. HDFS (Hadoop Distributed File System) is a single-master, multiple-slave framework. It is one of the core components of Hadoop, and it does not perform well for small files: huge numbers of small files place a heavy burden on the NameNode of HDFS and decrease its performance. HDFS is a distributed file system that can process large amounts of data. It is designed to handle large files and suffers a performance penalty when dealing with a large number of small files. This paper introduces HDFS, the small-file problem and ways to deal with it. Keywords: Hadoop; Hadoop Distributed File System; MapReduce; small file
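    A rough back-of-the-envelope illustration of the NameNode burden described above (the figure of roughly 150 bytes of NameNode heap per namespace object is a widely cited rule of thumb, not a number from this paper): every file and every block is held as an in-memory object on the NameNode, so

        10,000,000 small files  ≈  10,000,000 file objects + 10,000,000 block objects
                                ≈  20,000,000 objects × ~150 B  ≈  3 GB of NameNode heap

    while the same data packed into a few thousand multi-gigabyte files would need only a few megabytes of namespace metadata.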

    Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System

    Hadoop is a distributed framework that uses a simple programming model to process huge datasets over a network of computers. Hadoop is used across multiple machines to store very large files, typically in the range of gigabytes to terabytes. HDFS provides high-throughput access for applications with huge datasets. In the Hadoop Distributed File System (HDFS), a small file is one that is smaller than 64 MB, the default block size of HDFS. Hadoop performs better with a small number of large files than with a huge number of small files. Many organizations, such as financial firms, need to handle a large number of small files daily. Low performance and high resource consumption are the bottlenecks of the traditional approach. To reduce the processing time and memory required to handle a large set of small files, an efficient solution is needed that makes HDFS work better with large collections of small files. Such a solution should combine many small files into a large file and treat each large file as an individual file. It should also be able to store these large files in HDFS and retrieve any small file when needed.
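    The merge-and-retrieve idea sketched above can be illustrated with Hadoop's stock SequenceFile container; this is one possible realization rather than the scheme proposed in the paper, and the paths and directory names below are hypothetical:

        import java.io.File;
        import java.nio.file.Files;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        /** Packs a local directory of small files into one large HDFS SequenceFile. */
        public class SmallFilePacker {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();                  // picks up core-site.xml / hdfs-site.xml
                Path container = new Path("/data/packed/small-files.seq"); // hypothetical HDFS path
                try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(container),
                        SequenceFile.Writer.keyClass(Text.class),               // key   = original file name
                        SequenceFile.Writer.valueClass(BytesWritable.class))) { // value = raw file contents
                    File[] smallFiles = new File("/local/small-files").listFiles(); // hypothetical local dir
                    if (smallFiles == null) return;
                    for (File f : smallFiles) {
                        byte[] bytes = Files.readAllBytes(f.toPath());
                        writer.append(new Text(f.getName()), new BytesWritable(bytes));
                    }
                }
                // Any small file can later be retrieved by scanning the container with a
                // SequenceFile.Reader and matching on its key, so the NameNode tracks one
                // large file instead of millions of tiny ones.
            }
        }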

    A REVIEW ON SMALL FILES IN HADOOP: A NOVEL APPROACH TO UNDERSTAND THE SMALL FILES PROBLEM IN HADOOP

    Hadoop is an open-source data management system designed for storing and processing large volumes of data, with a minimum block size of 64 MB. Storing and processing small files, i.e. files smaller than the minimum block size, cannot be handled efficiently by Hadoop, because small files result in many seeks and a lot of hopping between DataNodes. A survey of the existing literature has been carried out to analyse the effects of, and solutions to, the small-files problem in Hadoop. This paper presents that survey, lists many effective solutions to the problem, and argues that considerable further research on the small-file problem is needed in order to reach effective and efficient solutions.

    Analysis of the activities in the Moodle database before and after the outbreak of the Covid-19 pandemic

    Education systems around the world faced an unprecedented challenge: it became necessary to provide distance education through a mix of technologies in order to ensure continuity of curriculum-based study and learning for all. This forced a migration away from the previous mode of education, with face-to-face teaching replaced by online distance learning. School closures were mandated as part of public-health recommendations to prevent the spread of Covid-19 from February 2020 in most countries. Distance learning, including online teaching and learning, has been studied and applied for decades, but with the onset of the pandemic it became the only way to continue the educational process. Numerous research studies, theories, models, standards and evaluation criteria focus on quality online learning, online teaching and online course design. In this context, in 2020 the study process at the Goce Delcev University, previously conducted in person, was switched to distance learning because of the new situation caused by the Covid-19 pandemic. The primary goal of this research is to analyse the number of user activities on the Moodle platform before and after the pandemic. The Moodle e-learning system has been in use for almost 10 years. For this purpose, the data in the Moodle database was analysed using big data tools. According to the results obtained, the total number of activities in 2020 increased threefold compared with the same period in 2019. The research also produced results for the individual activities of teaching staff and students, that is, results for the specific modules that were analysed. Those results show that there is a difference in the number of activities of the users of the Moodle platform. Keywords: big data, Moodle, e-learning system, COVID-19

    Agile modelling for Big Data Warehousing systems

    Master's dissertation in Engineering and Management of Information Systems. With the popularization of the Big Data concept, Information Systems have started to consider the infrastructures capable of handling the collection, storage, processing and analysis of vast amounts of heterogeneous data, with little or no structure and generated at ever-increasing speed. These are the challenges inherent in the transition from data modelling in traditional Data Warehouses to Big Data environments. The state of the art shows that the scientific field of Big Data Warehousing is recent and ambiguous and has gaps regarding approaches to the design and implementation of these systems; thus, in recent years, several authors, motivated by the lack of scientific and technical work, have developed studies in this area to explore suitable models (representations of logical and technological components, data flows and data structures), methods and instantiations (demonstration cases using prototypes and benchmarks). This dissertation builds on the general proposal of design patterns for Big Data Warehousing systems (M. Y. Santos & Costa, 2019) and proposes a method to semi-automate that design proposal, consisting of seven computational rules that are presented, demonstrated and validated with examples based on real contexts. To present the agile modelling process, a flowchart was created for each rule, showing all of its steps. Comparing the results obtained by applying the method with those of a fully manual modelling exercise, the proposed work yields a correct but general modelling suggestion for Big Data Warehouses, best used as a first modelling effort that the user should then validate and adjust, taking into account the context of the case under analysis, the queries to be used and the characteristics of the data.

    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Distributed analytics architectures are often composed of two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data in the form of files or key-value pairs. This abstraction simplifies how the data is accessed and reasoned about by an application developer. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design, the storage system is isolated from the workload and its performance requirements, such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult as the storage layer is hidden behind such abstractions. Using a clean-slate approach, this thesis proposes a modular distributed analytics system design centered around a unified interface for distributed data objects, named the DDO. The interface couples key mechanisms that utilize storage, memory and compute resources. This coupling makes it possible to optimize data access requests across all memory-hierarchy levels with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation controls the logical view of DDOs, their replication and their distribution across the cluster. A proof-of-concept implementation shows a 3-6x improvement in mean query time on the TPC-H and TPC-DS benchmarks, and more than an order of magnitude improvement in many cases.
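    As a purely illustrative sketch (not the thesis's actual API), a unified distributed-data-object interface of the kind described above might couple data access with the placement and replication hints a workload-aware storage layer can exploit:

        import java.io.Closeable;
        import java.io.IOException;
        import java.util.Iterator;
        import java.util.List;

        /**
         * Hypothetical unified interface for a distributed data object (DDO):
         * it exposes both fine-grained data access and the placement/replication
         * knobs that let the storage layer be optimized for the workload.
         */
        public interface DistributedDataObject<R> extends Closeable {
            String id();                                    // logical name registered with a DDO controller
            List<String> partitionLocations(int partition); // nodes holding a partition (block co-location)
            void setReplication(short replicas) throws IOException; // per-object replication chosen by the workload
            Iterator<R> scan(int partition) throws IOException;     // partition-local, fine-grained reads
            void cache(int partition);                      // pin a hot partition in memory near the compute task
        }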

    Improving Metadata Management for Small Files in HDFS

    Scientific applications are adopting HDFS/MapReduce to perform large-scale data analytics. One of the major challenges is that an overabundance of small files is common in these applications, and HDFS manages all of its files through a single server, the NameNode. Small files can therefore significantly impact the performance of the NameNode. In this work we propose a mechanism to store small files in HDFS efficiently and to improve the space utilization for metadata. Our scheme is based on the assumption that each client is assigned a quota in the file system, for both space and number of files. In our approach, we utilize the archiving method 'harballing' provided by Hadoop to make better use of HDFS. We provide new job functionality that allows in-job archival of directories and files, so that running MapReduce programs can complete without being killed by the JobTracker due to quota policies. This approach leads to better behaviour of metadata operations and more efficient usage of HDFS. Our analysis shows that we can reduce the metadata footprint in main memory by a factor of 42. © 2009 IEEE
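    For reference, the Hadoop Archive ('har') mechanism that 'harballing' builds on is driven from the command line, and archived files stay readable through the har:// file-system scheme; a minimal sketch with hypothetical paths:

        // Create the archive first with the stock tool (hypothetical paths):
        //   hadoop archive -archiveName logs.har -p /user/alice/small-logs /user/alice/archived
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class ReadFromHar {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // The har:// scheme layers HarFileSystem over the cluster's default file system.
                Path archived = new Path("har:///user/alice/archived/logs.har/part-00042.log");
                FileSystem harFs = archived.getFileSystem(conf);
                try (FSDataInputStream in = harFs.open(archived)) {
                    IOUtils.copyBytes(in, System.out, 4096, false); // dump one archived small file to stdout
                }
            }
        }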

    A new concept for the scalable, exploratory analysis of large time-series data, with application to extensive power-grid measurement data

    This thesis deals with the development and application of a new concept for the scalable exploratory analysis of large time-series data. To this end, numerous data-intensive methods from the fields of data mining and time-series analysis are examined with respect to their scalability as data volume grows, and new techniques and data representations are presented that allow the exploration of very large time-series data that cannot be analysed efficiently with conventional methods and that fall under the term Big Data. Methods for managing and visualizing large multivariate time series are combined with methods for detecting rare and frequent patterns, so-called discords and motifs, into a powerful exploration system called ViAT (Visual Analysis of Time series). To also enable analyses of time-series data whose volume amounts to hundreds of terabytes and more, a data-parallel, distributed processing scheme based on Apache Hadoop was developed. It allows the derivation of data-reduced metadata containing statistical properties and novel structural descriptions of the time series. On this basis, new content-based queries and evaluations, as well as searches for known and previously unknown patterns in the data, become possible. The design of the newly developed methods and their integration into an overall system called FraScaTi (Framework for Scalable management and analysis of Time series data) is presented. The system is evaluated and tested in the application field of power-grid analysis, which benefits from the scalability and the novel analysis capabilities. For this purpose, an exploratory analysis of high-frequency power-grid measurement data is carried out, and the results are presented and discussed in the context of the application domain.
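    The Hadoop-based derivation of data-reduced statistical metadata mentioned above can be illustrated with a generic MapReduce job (not FraScaTi's actual implementation; the CSV field layout and paths are assumptions) that condenses raw sensor readings into per-sensor summary statistics:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        /** Reduces raw rows of the form "sensorId,timestamp,value" to per-sensor min/max/mean. */
        public class TimeSeriesStats {

            public static class ValueMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
                @Override
                protected void map(LongWritable offset, Text line, Context ctx)
                        throws IOException, InterruptedException {
                    String[] f = line.toString().split(",");   // assumed CSV layout
                    if (f.length < 3) return;                  // skip malformed rows
                    ctx.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[2])));
                }
            }

            public static class StatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {
                @Override
                protected void reduce(Text sensor, Iterable<DoubleWritable> values, Context ctx)
                        throws IOException, InterruptedException {
                    long n = 0; double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
                    for (DoubleWritable v : values) {
                        double d = v.get();
                        n++; sum += d; min = Math.min(min, d); max = Math.max(max, d);
                    }
                    ctx.write(sensor, new Text(String.format("n=%d min=%.3f max=%.3f mean=%.3f",
                            n, min, max, sum / n)));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "time-series-stats");
                job.setJarByClass(TimeSeriesStats.class);
                job.setMapperClass(ValueMapper.class);
                job.setReducerClass(StatsReducer.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(DoubleWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));   // raw measurement files
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // data-reduced metadata
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }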