
    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache Hadoop paradigms. We propose a basis, a common terminology and functional factors, upon which to analyze the two paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations of these paradigms, shed light on the reasons for their current "architecture", and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.
    Comment: 8 pages, 2 figures
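    The K-means Ogre used as the benchmark above can be sketched in a few lines. The following is a generic Lloyd's-algorithm sketch in plain Python, not the paper's actual benchmark code; the function name and point representation are illustrative.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm for 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        # Update step: each centroid becomes the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new = [(sum(p[0] for p in cl) / len(cl),
                sum(p[1] for p in cl) / len(cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:   # converged: assignments no longer change
            break
        centroids = new
    return centroids
```

    Distributed implementations on either paradigm (e.g. MPI or MapReduce) parallelize the assignment step over partitions of the points and then combine the partial sums in the update step.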

    STUDY OF BIG DATA ARCHITECTURE: LAMBDA ARCHITECTURE

    The lambda architecture introduced by Marz is a generic, scalable and fault-tolerant data processing architecture. It aims to satisfy the need for a robust system that is fault-tolerant against both hardware failures and human mistakes, while being able to serve a wide range of workloads and use cases. The architecture decomposes the problem into three layers: a) the batch layer focuses on fault tolerance and optimizes for precise results; b) the speed layer is optimized for short response times and only takes into account the most recent data; and c) the serving layer provides low-latency views of the results of the batch layer. The reason for dividing the architecture into three layers is the flexibility it offers to potential applications: the fast but possibly inaccurate results of the speed layer are eventually replaced by the precise results of the batch layer. The evaluation of the designed architecture measured its capabilities based on the DEBS Grand Challenge 2014 and the percentile calculation for milestones task. As part of the project we implement the lambda architecture in different ways (i.e. using different systems). We compare these implementations and derive the strengths and weaknesses of each system used in the lambda architecture.
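    The interplay of the three layers can be illustrated with a toy event counter. This is a hypothetical sketch using plain dicts, not any of the systems evaluated in the study; all names (batch_view, speed_view, query) are illustrative.

```python
batch_view = {}   # precise counts, recomputed periodically from the master log
speed_view = {}   # approximate counts, updated incrementally from recent events

def batch_recompute(master_events):
    """Batch layer: recompute exact counts from the full, immutable event log."""
    view = {}
    for key in master_events:
        view[key] = view.get(key, 0) + 1
    return view

def speed_update(key):
    """Speed layer: increment on arrival, covering data not yet batch-processed."""
    speed_view[key] = speed_view.get(key, 0) + 1

def query(key):
    """Serving layer: merge the authoritative batch view with the recent speed view."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)
```

    When the batch layer finishes a recomputation, the speed view entries it now covers are discarded, so speed-layer inaccuracies are eventually replaced by precise batch results.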

    Hadoop for EEG Storage and Processing: A Feasibility Study

    Large volumes of heterogeneous, complex data are collected for diagnostic purposes. For its full potential to be harnessed, such data should be shared among all caregivers and, due to its complexity, often at least partly automatically processed. This paper is a feasibility study that assesses the potential of Hadoop as a medical data storage and processing platform, using EEGs as an example of medical data.

    Aligning Machine Learning for the Lambda Architecture

    We live in the era of Big Data. Web logs, internet media, social networks and sensor devices generate petabytes of data every day. Traditional data storage and analysis methodologies have become insufficient to handle this rapidly increasing volume, and the development of complex machine learning techniques has led to the proliferation of advanced analytics solutions. Together these have produced a paradigm shift in the way we store, process and analyze data, and given rise to numerous platforms and solutions satisfying various business analytics needs. It is therefore imperative for business practitioners and consultants to choose the right solution, one that provides the best performance and maximizes the utilization of the available data. In this thesis, we develop and implement a Big Data architectural framework called the Lambda Architecture. It consists of three major components: batch data processing, real-time data processing and a reporting layer. We develop and implement analytics use cases using machine learning techniques for each of these layers. The objective is to build a system in which the data storage and processing platforms and the analytics frameworks are integrated seamlessly.
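    One common way to align machine learning with these layers is to train a model offline in the batch layer and score incoming data in the speed layer. The following is a minimal sketch, assuming a simple mean-plus-three-sigma anomaly threshold as the "model"; it does not reflect the specific use cases of the thesis, and the function names are illustrative.

```python
import statistics

def train_batch(history):
    """Batch layer: fit a simple anomaly threshold (mean + 3 * std dev)
    over the full historical dataset."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return mu + 3 * sigma

def score_stream(threshold, new_values):
    """Speed layer: flag each incoming value against the batch-trained model,
    returning True for values above the threshold."""
    return [v > threshold for v in new_values]
```

    The reporting layer would then aggregate the flags, and the threshold is refreshed each time the batch layer reruns over the grown history.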

    Hadoop Performance Analysis on Raspberry Pi for DNA Sequence Alignment

    The rapid development of electronic data has brought two major challenges: how to store big data and how to process it. Two main problems in processing big data are the high cost and the required computational power. Hadoop, one of the open source frameworks for processing big data, uses a distributed computing model designed to run on commodity hardware. The aim of this research is to analyze a Hadoop cluster on Raspberry Pi, as commodity hardware, for DNA sequence alignment. Six Raspberry Pi Model B units and the Biodoop library were used for DNA sequence alignment. The length of the DNA sequences used in this research is between 5,639 bp and 13,271 bp. The results showed that the Hadoop cluster ran on the Raspberry Pi with an average processor usage of 73.08%, memory usage of 334.69 MB and a job completion time of 19.89 minutes. Distributing the Hadoop data file blocks was found to reduce processor usage by 24.14% and memory usage by 8.49%; however, it increased job processing time by 31.53%.
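    The MapReduce pattern that Hadoop applies to sequence data can be illustrated with k-mer counting, a common preprocessing step for alignment. This is an in-process Python sketch of the map and reduce steps, not the Biodoop pipeline used in the study; function names and the k=3 default are illustrative.

```python
from collections import Counter
from itertools import chain

def map_kmers(read, k=3):
    """Map step: emit every k-mer (substring of length k) in one read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def reduce_counts(mapped):
    """Reduce step: sum the counts of identical k-mers across all reads."""
    return Counter(chain.from_iterable(mapped))

reads = ["GATTACA", "ATTAC"]
counts = reduce_counts(map_kmers(r) for r in reads)
```

    On a Hadoop cluster, the map calls would run in parallel on the nodes holding each file block, which is why block distribution affects per-node processor and memory usage as measured above.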