
    Only Aggressive Elephants are Fast Elephants

    Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
    Comment: VLDB201
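
    HAIL's core idea, a different clustered index on each block replica, can be sketched in a few lines of Python. This is an illustrative sketch under assumed names, not HAIL's actual code: the record fields, replica structure, and choose_replica() helper are hypothetical, and the real system builds its indexes inside the HDFS upload pipeline.

    REPLICATION_FACTOR = 3

    def upload_block(records, index_attributes):
        """Create one differently clustered copy of the block per replica."""
        replicas = []
        for attr in index_attributes[:REPLICATION_FACTOR]:
            # Sorting the block on `attr` yields a clustered index for free.
            replicas.append({"indexed_on": attr,
                             "records": sorted(records, key=lambda r: r[attr])})
        return replicas

    def choose_replica(replicas, filter_attr):
        """Route a selection query to the replica clustered on its filter."""
        for replica in replicas:
            if replica["indexed_on"] == filter_attr:
                return replica
        return replicas[0]  # no matching index: fall back to a full scan

    records = [{"id": 3, "age": 25, "city": "Oslo"},
               {"id": 1, "age": 40, "city": "Bonn"},
               {"id": 2, "age": 30, "city": "Pune"}]
    replicas = upload_block(records, ["id", "age", "city"])
    hit = choose_replica(replicas, "age")
    print(hit["indexed_on"], [r["age"] for r in hit["records"]])  # age [25, 30, 40]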

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as issues of data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several follow-up works since its introduction. This article provides a comprehensive survey of the family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
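
    For readers new to the model, the following minimal sketch illustrates the MapReduce programming contract in plain Python (not the Hadoop API): the user supplies only a map and a reduce function, while the framework, simulated here by a trivial in-memory shuffle, takes care of grouping, distribution, and fault tolerance.

    from collections import defaultdict

    def map_fn(_key, line):
        # Emit one (word, 1) pair per word in the input line.
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Sum all counts collected for the same word.
        yield word, sum(counts)

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Shuffle phase: group all intermediate values by key.
        groups = defaultdict(list)
        for key, value in inputs:
            for k, v in map_fn(key, value):
                groups[k].append(v)
        return [out for k, vs in sorted(groups.items())
                    for out in reduce_fn(k, vs)]

    docs = [(0, "map reduce map"), (1, "reduce scale reduce")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('map', 2), ('reduce', 3), ('scale', 1)]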

    A data-driven situation-aware framework for predictive analysis in smart environments

    In the era of the Internet of Things (IoT), it is vital for smart environments to efficiently provide effective predictions of users' situations and to take actions in a proactive manner to achieve the highest performance. However, there are two main challenges. First, the sensor environment is equipped with a heterogeneous set of data sources, including hardware and software sensors, and oftentimes complex humans as sensors, too. These sensors generate a huge amount of raw data. In order to extract knowledge and perform predictive analysis, the raw sensor data must be cleaned, understood, analyzed, and interpreted. The second challenge concerns predictive modeling. Traditional predictive models predict situations that are likely to happen in the near future by keeping and analyzing the history of the user's past situations. These approaches have become less effective because of the massive amount of data, which both affects data processing efficiency and complicates the data semantics. In this study, we propose a data-driven, situation-aware framework for predictive analysis in smart environments that addresses the above challenges.
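
    As a hypothetical illustration of the traditional history-based prediction the abstract refers to (not the proposed framework itself), a first-order Markov model predicts the next situation from the history of past situations:

    from collections import Counter, defaultdict

    def train(history):
        """Count situation-to-situation transitions in the history."""
        transitions = defaultdict(Counter)
        for current, nxt in zip(history, history[1:]):
            transitions[current][nxt] += 1
        return transitions

    def predict_next(transitions, current):
        """Return the most frequent successor of the current situation."""
        followers = transitions.get(current)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    history = ["sleeping", "cooking", "eating", "working",
               "cooking", "eating", "watching_tv"]
    model = train(history)
    print(predict_next(model, "cooking"))  # -> "eating"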

    High Performance CDR Processing with MapReduce

    A call detail record (CDR) is a data record produced by telecommunication equipment, consisting of call detail transaction logs. It contains valuable information for many purposes in several domains, such as billing, fraud detection, and analytics. However, in the real world these needs face a big data challenge: billions of CDRs are generated every day, and processing systems are expected to deliver results in a timely manner. The capacity of our current production system is not enough to meet these needs, so a better-performing system based on MapReduce and running on a Hadoop cluster was designed and implemented. This paper presents an analysis of the previous system and the design and implementation of the new system, called MS2. The paper also provides empirical evidence of the efficiency and linear scalability of MS2. Tests have shown that MS2 reduces overhead by 44% and nearly doubles performance compared to the previous system. In benchmarks against several related large-scale data processing technologies, MS2 also performed better for CDR batch processing. Running on a cluster of eight CPU cores and two conventional disks, MS2 is able to process 67,000 CDRs/second.
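
    A CDR batch job maps naturally onto the MapReduce model. The sketch below is a hypothetical example, not MS2's actual record format or code: it totals call seconds per caller, with the shuffle simulated in memory.

    from collections import defaultdict

    def map_cdr(line):
        # Assumed CDR layout: "caller_id,callee_id,duration_seconds"
        caller, _callee, duration = line.split(",")
        yield caller, int(duration)

    def reduce_cdr(caller, durations):
        yield caller, sum(durations)

    def run(lines):
        groups = defaultdict(list)  # in-memory stand-in for the shuffle phase
        for line in lines:
            for caller, seconds in map_cdr(line):
                groups[caller].append(seconds)
        return dict(out for c, ds in groups.items() for out in reduce_cdr(c, ds))

    print(run(["alice,bob,120", "alice,carol,60", "bob,alice,30"]))
    # {'alice': 180, 'bob': 30}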

    Research on High-performance and Scalable Data Access in Parallel Big Data Computing

    To facilitate big data processing, many dedicated data-intensive storage systems such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), and the Quantcast File System (QFS) have been developed. Currently, HDFS [20] is the state-of-the-art and most popular open-source distributed file system for big data processing. It is widely deployed as the bedrock for many big data processing systems and frameworks, such as the script-based Pig system, MPI-based parallel programs, graph processing systems, and the Scala/Java-based Spark framework. These systems employ parallel processes or executors to speed up data processing within scale-out clusters.

    Job and task schedulers in parallel big data applications such as mpiBLAST and ParaView can maximize the usage of computing resources such as memory and CPU by tracking resource consumption and availability for task assignment. However, since these schedulers do not take distributed I/O resources and global data distribution into consideration, the data requests from parallel processes or executors will unfortunately be served in an imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise because (a) unlike conventional parallel file systems, which use striping policies to evenly distribute data among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk or block file, in several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it will be selected to serve the data. Consequently, on nodes serving multiple chunk files, the data requests from different processes or executors compete for shared resources such as the hard disk head and network bandwidth, so the makespan of the entire program can be significantly prolonged and overall I/O performance can degrade.

    The first part of my dissertation addresses aspects of these problems by creating an I/O middleware system and designing matching-based algorithms to optimize data access in parallel big data processing. To address the problem of remote data movement, we develop an I/O middleware system, called SLAM, which allows MPI-based analysis and visualization programs to benefit from locality reads, i.e., each MPI process can access its required data from a local or nearby storage node. This can greatly improve execution performance by reducing the amount of data movement over the network. Furthermore, to address the problem of imbalanced data access, we propose a method called Opass, which models the data read requests issued by parallel applications to cluster nodes as a graph in which edge weights encode the demands on load capacity. We then employ matching-based algorithms to map processes to data so that data access is balanced, as sketched below. The final part of my dissertation focuses on optimizing sub-dataset analyses in parallel big data processing. Our proposed methods can benefit different analysis applications with various computational requirements, and experiments on different cluster testbeds show their applicability and scalability.
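
    The following greedy sketch is a simplified stand-in for Opass's matching-based algorithms (the chunk placement and request data are hypothetical): each process is assigned to the least-loaded node among those holding a replica of its chunk, which keeps per-node load balanced.

    def balanced_assignment(requests, replica_locations):
        """requests: list of (process, chunk); replica_locations: chunk -> nodes."""
        load = {}   # node -> number of requests assigned so far
        plan = {}   # process -> node it will read from
        for process, chunk in requests:
            # Pick the least-loaded node among those holding a replica.
            candidates = replica_locations[chunk]
            node = min(candidates, key=lambda n: load.get(n, 0))
            load[node] = load.get(node, 0) + 1
            plan[process] = node
        return plan, load

    replicas = {"c1": ["n1", "n2"], "c2": ["n1", "n3"], "c3": ["n1", "n2"]}
    reqs = [("p1", "c1"), ("p2", "c2"), ("p3", "c3")]
    plan, load = balanced_assignment(reqs, replicas)
    print(plan)  # {'p1': 'n1', 'p2': 'n3', 'p3': 'n2'}
    print(load)  # one request per node instead of all three hitting n1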

    Parallel and Distributed Stream Processing: Systems Classification and Specific Issues

    Deploying an infrastructure to execute queries over distributed data stream sources requires identifying a scalable and robust solution able to provide results whose quality can be assessed. Over the last decade, various Data Stream Management Systems have been designed, exploiting new paradigms and technologies to improve the performance of solutions facing the specific features of data streams and their growing number. However, tradeoffs are often made between processing performance, resource consumption, and quality of results. This survey offers an overview of existing solutions among distributed and parallel systems, classified according to criteria that allow readers to efficiently identify the existing Distributed Stream Management Systems relevant to their needs and resources.

    Metocean Big Data Processing Using Hadoop

    This report discusses MapReduce and how it handles big data. Metocean (meteorology and oceanography) data is used, as it consists of very large data sets. As the number and type of data acquisition devices grow annually, the sheer size and rate of data being collected are rapidly expanding. These big data sets can contain gigabytes or terabytes of data, and can grow on the order of megabytes or gigabytes per day. While the collection of this information presents opportunities for insight, it also presents many challenges: most algorithms are not designed to process big data sets in a reasonable amount of time or with a reasonable amount of memory. MapReduce allows us to meet many of these challenges and gain important insights from large data sets. The objective of this project is to use MapReduce, a programming technique for analysing data sets that do not fit in memory, to handle big data. The problem statement chapter discusses how MapReduce offers an advantage in dealing with large data. The literature review explains the definitions of NoSQL and RDBMS, Hadoop MapReduce and big data, considerations when selecting a database, NoSQL database deployments, scenarios for using Hadoop, and a real-world Hadoop example. The methodology chapter explains the waterfall method used in the project's development. The results and discussion chapter presents the project's results in detail. The last chapter of this report is the conclusion and recommendations.

    Computational methods to engineer process-structure-property relationships in organic electronics: The case of organic photovoltaics

    Ever since the Nobel prize winning work by Heeger and his colleagues, organic electronics have enjoyed increasing attention from researchers all over the world. While there is large potential for organic electronics in the areas of transistors, solar cells, diodes, flexible displays, RFIDs, smart textiles, smart tattoos, artificial skin, bio-electronics, medical devices, and many more, very few applications have reached the market. Organic photovoltaics in particular could serve a large untapped market for solar power with portable and affordable solar conversion devices. While there are several reasons for their unavailability, a major one is the challenge of controlling device morphology at several scales simultaneously. The morphology is intricately related to the processing of the device and strongly influences performance. Added to this is the unending development of new polymeric materials in search of high power conversion efficiencies. Fully understanding this intricate relationship between materials, processing conditions, and power conversion is highly resource- and time-intensive. The goal of this work is to provide tightly coupled computational routes to these expensive experiments and to demonstrate process control using in-silico experiments. This goal is achieved in multiple stages and is commonly called the process-structure-property loop in the materials science community. We leverage recent advances in high performance computing (HPC) and high throughput computing (HTC) toward this end. Two open-source software packages were developed: GRATE and PARyOpt. GRATE provides a means to reliably and repeatably quantify TEM images for identifying transport characteristics, solving the problem of manually quantifying a large number of large images with fine detail. PARyOpt is a Gaussian-process-based optimization library that is especially useful for optimizing expensive-to-evaluate phenomena. Both are highly modular and designed to be easily integrated with existing software. It is anticipated that the organic electronics community will use these tools to accelerate the discovery and development of new-age devices.
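
    To illustrate the Gaussian-process-based optimization technique that PARyOpt implements, the sketch below uses the scikit-optimize package and a toy objective as stand-ins; PARyOpt's own API differs. A small budget of expensive evaluations is spent where the surrogate model predicts the most promise.

    from skopt import gp_minimize

    def expensive_experiment(x):
        # Stand-in for a costly simulation or fabrication run.
        return (x[0] - 2.0) ** 2 + 1.0

    result = gp_minimize(
        expensive_experiment,   # black-box objective to minimize
        [(-5.0, 5.0)],          # search bounds for the single parameter
        n_calls=15,             # total budget of (expensive) evaluations
        random_state=0)
    print(result.x, result.fun)  # best parameter and best objective found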