10 research outputs found

    Efficient Ways to Improve the Performance of HDFS for Small Files

    Hadoop, an open-source implementation of MapReduce for dealing with big data, is widely used for short jobs that require low response time. Facebook, Yahoo, Google and others use Hadoop to process more than 15 terabytes of new data per day. MapReduce gathers results across multiple nodes and returns a single result or result set. Fault tolerance is provided by the MapReduce platform and is entirely transparent to programmers. HDFS (Hadoop Distributed File System) is a single-master, multiple-slave framework. It is one of the core components of Hadoop, and it does not perform well for small files: huge numbers of small files place a heavy burden on the NameNode of HDFS and decrease its performance. HDFS is a distributed file system that can process large amounts of data. It is designed to handle large files and suffers a performance penalty when dealing with a large number of small files. This paper introduces HDFS, the small-file problem and ways to deal with it. Keywords: Hadoop; Hadoop Distributed File System; MapReduce; small file
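    A rough back-of-the-envelope illustration of the NameNode burden described above (the figure of roughly 150 bytes of NameNode heap per namespace object is a widely cited rule of thumb, not a number from this paper): every file and every block is held as an in-memory object on the NameNode, so

        10,000,000 small files  ≈  10,000,000 file objects + 10,000,000 block objects
                                ≈  20,000,000 objects × ~150 B  ≈  3 GB of NameNode heap

    while the same data packed into a few thousand multi-gigabyte files would need only a few megabytes of namespace metadata.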

    Optimization Scheme for Storing and Accessing Huge Number of Small Files on HADOOP Distributed File System

    Hadoop is a distributed framework that uses a simple programming model to process huge datasets over a network of computers. Hadoop is used across multiple machines to store very large files, typically in the range of gigabytes to terabytes. HDFS provides high-throughput access for applications with huge datasets. In the Hadoop Distributed File System (HDFS), a small file is one that is smaller than 64 MB, the default block size of HDFS. Hadoop performs better with a small number of large files than with a huge number of small files. Many organizations, such as financial firms, need to handle a large number of small files daily. Low performance and high resource consumption are the bottlenecks of the traditional approach. To reduce the processing time and memory required to handle a large set of small files, an efficient solution is needed that makes HDFS work better with large collections of small files. Such a solution should combine many small files into a large file and treat each large file as an individual file. It should also be able to store these large files in HDFS and retrieve any small file when needed.
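    The merge-and-retrieve idea sketched above can be illustrated with Hadoop's stock SequenceFile container; this is one possible realization rather than the scheme proposed in the paper, and the paths and directory names below are hypothetical:

        import java.io.File;
        import java.nio.file.Files;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        /** Packs a local directory of small files into one large HDFS SequenceFile. */
        public class SmallFilePacker {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();                  // picks up core-site.xml / hdfs-site.xml
                Path container = new Path("/data/packed/small-files.seq"); // hypothetical HDFS path
                try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(container),
                        SequenceFile.Writer.keyClass(Text.class),               // key   = original file name
                        SequenceFile.Writer.valueClass(BytesWritable.class))) { // value = raw file contents
                    File[] smallFiles = new File("/local/small-files").listFiles(); // hypothetical local dir
                    if (smallFiles == null) return;
                    for (File f : smallFiles) {
                        byte[] bytes = Files.readAllBytes(f.toPath());
                        writer.append(new Text(f.getName()), new BytesWritable(bytes));
                    }
                }
                // Any small file can later be retrieved by scanning the container with a
                // SequenceFile.Reader and matching on its key, so the NameNode tracks one
                // large file instead of millions of tiny ones.
            }
        }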

    A REVIEW ON SMALL FILES IN HADOOP: A NOVEL APPROACH TO UNDERSTAND THE SMALL FILES PROBLEM IN HADOOP

    Hadoop is an open-source data management system designed for storing and processing large volumes of data, with a minimum block size of 64 MB. Storing and processing small files, i.e. files smaller than the minimum block size, cannot be handled efficiently by Hadoop, because small files result in many seeks and a lot of hopping between DataNodes. A survey of the existing literature has been carried out to analyse the effects of, and solutions to, the small-files problem in Hadoop. This paper presents that survey, lists many effective solutions to the problem, and argues that considerable further research on the small-file problem is needed in order to reach effective and efficient solutions.

    Analysis of the activities in the Moodle database before and after the outbreak of the Covid-19 pandemic

    Education systems around the world faced an unprecedented challenge: it became necessary to provide distance education through a mix of technologies in order to ensure continuity of curriculum-based study and learning for all. This forced a migration away from the previous mode of education, with face-to-face teaching replaced by online distance learning. School closures were mandated as part of public-health recommendations to prevent the spread of Covid-19 from February 2020 in most countries. Distance learning, including online teaching and learning, has been studied and applied for decades, but with the onset of the pandemic it became the only way to continue the educational process. Numerous research studies, theories, models, standards and evaluation criteria focus on quality online learning, online teaching and online course design. In this context, in 2020 the study process at the Goce Delcev University, previously conducted in person, was switched to distance learning because of the new situation caused by the Covid-19 pandemic. The primary goal of this research is to analyse the number of user activities on the Moodle platform before and after the pandemic. The Moodle e-learning system has been in use for almost 10 years. For this purpose, the data in the Moodle database was analysed using big data tools. According to the results obtained, the total number of activities in 2020 increased threefold compared with the same period in 2019. The research also produced results for the individual activities of teaching staff and students, that is, results for the specific modules that were analysed. Those results show that there is a difference in the number of activities of the users of the Moodle platform. Keywords: big data, Moodle, e-learning system, COVID-19

    Agile modelling for Big Data Warehousing systems

    Master's dissertation in Engineering and Management of Information Systems. With the popularization of the Big Data concept, Information Systems have started to consider the infrastructures capable of handling the collection, storage, processing and analysis of vast amounts of heterogeneous data, with little or no structure and generated at ever-increasing speed. These are the challenges inherent in the transition from data modelling in traditional Data Warehouses to Big Data environments. The state of the art shows that the scientific field of Big Data Warehousing is recent and ambiguous and has gaps regarding approaches to the design and implementation of these systems; thus, in recent years, several authors, motivated by the lack of scientific and technical work, have developed studies in this area to explore suitable models (representations of logical and technological components, data flows and data structures), methods and instantiations (demonstration cases using prototypes and benchmarks). This dissertation builds on the general proposal of design patterns for Big Data Warehousing systems (M. Y. Santos & Costa, 2019) and proposes a method to semi-automate that design proposal, consisting of seven computational rules that are presented, demonstrated and validated with examples based on real contexts. To present the agile modelling process, a flowchart was created for each rule, showing all of its steps. Comparing the results obtained by applying the method with those of a fully manual modelling exercise, the proposed work yields a correct but general modelling suggestion for Big Data Warehouses, best used as a first modelling effort that the user should then validate and adjust, taking into account the context of the case under analysis, the queries to be used and the characteristics of the data.

    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Distributed analytics architectures are often composed of two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data in the form of files or key-value pairs. This abstraction simplifies how the data is accessed and reasoned about by an application developer. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design, the storage system is isolated from the workload and its performance requirements, such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult as the storage layer is hidden behind such abstractions. Using a clean-slate approach, this thesis proposes a modular distributed analytics system design centered around a unified interface for distributed data objects, named the DDO. The interface couples key mechanisms that utilize storage, memory and compute resources. This coupling makes it possible to optimize data access requests across all memory-hierarchy levels with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation controls the logical view of DDOs, their replication and their distribution across the cluster. A proof-of-concept implementation shows a 3-6x improvement in mean query time on the TPC-H and TPC-DS benchmarks, and more than an order of magnitude improvement in many cases.
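    As a purely illustrative sketch (not the thesis's actual API), a unified distributed-data-object interface of the kind described above might couple data access with the placement and replication hints a workload-aware storage layer can exploit:

        import java.io.Closeable;
        import java.io.IOException;
        import java.util.Iterator;
        import java.util.List;

        /**
         * Hypothetical unified interface for a distributed data object (DDO):
         * it exposes both fine-grained data access and the placement/replication
         * knobs that let the storage layer be optimized for the workload.
         */
        public interface DistributedDataObject<R> extends Closeable {
            String id();                                    // logical name registered with a DDO controller
            List<String> partitionLocations(int partition); // nodes holding a partition (block co-location)
            void setReplication(short replicas) throws IOException; // per-object replication chosen by the workload
            Iterator<R> scan(int partition) throws IOException;     // partition-local, fine-grained reads
            void cache(int partition);                      // pin a hot partition in memory near the compute task
        }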

    Improving Metadata Management for Small Files in HDFS

    Scientific applications are adopting HDFS/MapReduce to perform large-scale data analytics. One of the major challenges is that an overabundance of small files is common in these applications, and HDFS manages all of its files through a single server, the NameNode. Small files can therefore significantly impact the performance of the NameNode. In this work we propose a mechanism to store small files in HDFS efficiently and to improve the space utilization for metadata. Our scheme is based on the assumption that each client is assigned a quota in the file system, for both space and number of files. In our approach, we utilize the archiving method 'harballing' provided by Hadoop to make better use of HDFS. We provide new job functionality that allows in-job archival of directories and files, so that running MapReduce programs can complete without being killed by the JobTracker due to quota policies. This approach leads to better behaviour of metadata operations and more efficient usage of HDFS. Our analysis shows that we can reduce the metadata footprint in main memory by a factor of 42. © 2009 IEEE
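    For reference, the Hadoop Archive ('har') mechanism that 'harballing' builds on is driven from the command line, and archived files stay readable through the har:// file-system scheme; a minimal sketch with hypothetical paths:

        // Create the archive first with the stock tool (hypothetical paths):
        //   hadoop archive -archiveName logs.har -p /user/alice/small-logs /user/alice/archived
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class ReadFromHar {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // The har:// scheme layers HarFileSystem over the cluster's default file system.
                Path archived = new Path("har:///user/alice/archived/logs.har/part-00042.log");
                FileSystem harFs = archived.getFileSystem(conf);
                try (FSDataInputStream in = harFs.open(archived)) {
                    IOUtils.copyBytes(in, System.out, 4096, false); // dump one archived small file to stdout
                }
            }
        }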

    A new concept for the scalable, exploratory analysis of large time-series data, with application to extensive power-grid measurement data

    This thesis deals with the development and application of a new concept for the scalable exploratory analysis of large time-series data. To this end, numerous data-intensive methods from the fields of data mining and time-series analysis are examined with respect to their scalability as data volume grows, and new techniques and data representations are presented that allow the exploration of very large time-series data that cannot be analysed efficiently with conventional methods and that fall under the term Big Data. Methods for managing and visualizing large multivariate time series are combined with methods for detecting rare and frequent patterns, so-called discords and motifs, into a powerful exploration system called ViAT (Visual Analysis of Time series). To also enable analyses of time-series data whose volume amounts to hundreds of terabytes and more, a data-parallel, distributed processing scheme based on Apache Hadoop was developed. It allows the derivation of data-reduced metadata containing statistical properties and novel structural descriptions of the time series. On this basis, new content-based queries and evaluations, as well as searches for known and previously unknown patterns in the data, become possible. The design of the newly developed methods and their integration into an overall system called FraScaTi (Framework for Scalable management and analysis of Time series data) is presented. The system is evaluated and tested in the application field of power-grid analysis, which benefits from the scalability and the novel analysis capabilities. For this purpose, an exploratory analysis of high-frequency power-grid measurement data is carried out, and the results are presented and discussed in the context of the application domain.
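    The Hadoop-based derivation of data-reduced statistical metadata mentioned above can be illustrated with a generic MapReduce job (not FraScaTi's actual implementation; the CSV field layout and paths are assumptions) that condenses raw sensor readings into per-sensor summary statistics:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        /** Reduces raw rows of the form "sensorId,timestamp,value" to per-sensor min/max/mean. */
        public class TimeSeriesStats {

            public static class ValueMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
                @Override
                protected void map(LongWritable offset, Text line, Context ctx)
                        throws IOException, InterruptedException {
                    String[] f = line.toString().split(",");   // assumed CSV layout
                    if (f.length < 3) return;                  // skip malformed rows
                    ctx.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[2])));
                }
            }

            public static class StatsReducer extends Reducer<Text, DoubleWritable, Text, Text> {
                @Override
                protected void reduce(Text sensor, Iterable<DoubleWritable> values, Context ctx)
                        throws IOException, InterruptedException {
                    long n = 0; double sum = 0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
                    for (DoubleWritable v : values) {
                        double d = v.get();
                        n++; sum += d; min = Math.min(min, d); max = Math.max(max, d);
                    }
                    ctx.write(sensor, new Text(String.format("n=%d min=%.3f max=%.3f mean=%.3f",
                            n, min, max, sum / n)));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "time-series-stats");
                job.setJarByClass(TimeSeriesStats.class);
                job.setMapperClass(ValueMapper.class);
                job.setReducerClass(StatsReducer.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(DoubleWritable.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));   // raw measurement files
                FileOutputFormat.setOutputPath(job, new Path(args[1])); // data-reduced metadata
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }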