
    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
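    As a rough illustration of the programming model the survey builds on, the sketch below simulates the map, shuffle, and reduce phases locally in Python for a word-count job; the function names and the in-memory "shuffle" are illustrative stand-ins for what a real MapReduce framework such as Hadoop performs across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Minimal local simulation of the MapReduce programming model: user code
# supplies only a map function and a reduce function, while the "framework"
# groups intermediate keys (the shuffle) and applies the reducer.
# Names are illustrative, not a real Hadoop API.

def map_word_count(record):
    """Map phase: emit (word, 1) for every word in an input line."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_word_count(key, values):
    """Reduce phase: sum the counts emitted for one word."""
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group all intermediate values by key, as the framework would
    # do across the cluster before the reduce phase starts.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)
    return dict(chain.from_iterable(
        reduce_fn(k, vs) for k, vs in groups.items()))

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(lines, map_word_count, reduce_word_count))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```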

    Benchmarking SQL on MapReduce systems using large astronomy databases

    In the era of big data, with massive sets of digital information of unprecedented volume being collected and/or produced in several application domains, it becomes more and more difficult to manage and query large data repositories. In the framework of the PetaSky project (http://com.isima.fr/Petasky), we focus on the problem of managing scientific data in the field of cosmology. The data we consider are those of the LSST project (http://www.lsst.org/). The overall size of the database that will be produced is expected to exceed 60 PB [28]. In order to evaluate the performance of existing SQL-on-MapReduce data management systems, we conducted extensive experiments using data and queries from the area of cosmology. The goal of this work is to report on the ability of such systems to support large-scale declarative queries. We mainly investigated the impact of data partitioning, indexing, and compression on query execution performance.
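    To make the partitioning effect concrete, here is a toy Python sketch (not part of the PetaSky benchmark itself) that times the same selective aggregation against an unpartitioned layout and a layout partitioned on the filter column; the column names (object_id, ra_bucket, flux) are invented for illustration.

```python
import random
import time
from collections import defaultdict

# Toy illustration of why data partitioning matters for SQL-on-MapReduce
# engines: a selective filter on the partitioning column lets the engine
# scan only the matching partition instead of the whole table.

random.seed(42)
ROWS = [{"object_id": i, "ra_bucket": i % 64, "flux": random.random()}
        for i in range(300_000)]

# Layout 1: one unpartitioned "table" -> every query is a full scan.
FULL_TABLE = ROWS

# Layout 2: the table partitioned on ra_bucket, as a partitioned HDFS layout would be.
PARTITIONS = defaultdict(list)
for row in ROWS:
    PARTITIONS[row["ra_bucket"]].append(row)

def query_full_scan(bucket):
    return sum(r["flux"] for r in FULL_TABLE if r["ra_bucket"] == bucket)

def query_partition_pruned(bucket):
    return sum(r["flux"] for r in PARTITIONS[bucket])

for name, fn in [("full scan", query_full_scan),
                 ("partition pruning", query_partition_pruned)]:
    start = time.perf_counter()
    result = fn(17)
    print(f"{name:18s} result={result:10.2f} time={time.perf_counter() - start:.4f}s")
```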

    HaoLap: a Hadoop based OLAP system for big data

    In recent years, facing the information explosion, industry and academia have adopted distributed file systems and the MapReduce programming model to address the new challenges brought by big data. Based on these technologies, this paper presents HaoLap (Hadoop based oLap), an OLAP (OnLine Analytical Processing) system for big data. Drawing on the experience of Multidimensional OLAP (MOLAP), HaoLap adopts a specified multidimensional model to map the dimensions and the measures; a dimension coding and traversal algorithm to achieve the roll-up operation on dimension hierarchies; a partition and linearization algorithm to store dimensions and measures; a chunk selection algorithm to optimize OLAP performance; and MapReduce to execute OLAP. The paper illustrates the key techniques of HaoLap, including the system architecture, dimension definition, dimension coding and traversal, partitioning, data storage, OLAP, and the data loading algorithm. We evaluated HaoLap on a real application and compared it with Hive, HadoopDB, HBaseLattice, and Olap4Cloud. The experimental results show that HaoLap boosts the efficiency of data loading and has a great advantage in OLAP performance regardless of data set size and query complexity, while also fully supporting dimension operations.
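    The sketch below gives one possible reading of two of the techniques named in the abstract, dimension coding with linearization and a hierarchy roll-up, in plain Python; the dimensions, their sizes, and the day-to-month hierarchy are invented examples and do not reproduce HaoLap's actual algorithms.

```python
from collections import defaultdict

# MOLAP-style sketch: (1) linearization maps a multidimensional cell
# coordinate to a single array offset, so measures can be stored in flat
# chunks; (2) roll-up regroups coded dimension values at a coarser level.

DIM_SIZES = {"day": 365, "store": 50, "product": 200}   # dimension cardinalities
DIM_ORDER = ["day", "store", "product"]

def linearize(coord):
    """Map a dict of dimension codes to one integer offset (row-major)."""
    offset = 0
    for dim in DIM_ORDER:
        offset = offset * DIM_SIZES[dim] + coord[dim]
    return offset

def delinearize(offset):
    """Inverse of linearize: recover the per-dimension codes."""
    coord = {}
    for dim in reversed(DIM_ORDER):
        coord[dim] = offset % DIM_SIZES[dim]
        offset //= DIM_SIZES[dim]
    return coord

def roll_up_day_to_month(cells):
    """Roll up measures from the day level to a coarser (approximate) month level."""
    totals = defaultdict(float)
    for offset, measure in cells.items():
        coord = delinearize(offset)
        month = coord["day"] // 31          # coarser level of the hierarchy
        totals[(month, coord["store"], coord["product"])] += measure
    return dict(totals)

if __name__ == "__main__":
    cells = {linearize({"day": d, "store": 3, "product": 7}): 10.0
             for d in range(90)}
    print(len(roll_up_day_to_month(cells)), "month-level cells")  # 3
```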

    SmallClient for big data: an indexing framework towards fast data retrieval

    Numerous applications are continuously generating massive amounts of data, and it has become critical to extract useful information while maintaining acceptable computing performance. The objective of this work is to design an indexing framework which minimizes indexing overhead and improves query execution and data search performance with an optimal aggregation of computing performance. We propose SmallClient, an indexing framework to speed up query execution. SmallClient has three modules: block creation, index creation, and query execution. The block creation module improves data retrieval performance with minimal data uploading overhead. The index creation module allows a maximum number of indexes on a dataset to increase the index hit ratio with minimal indexing overhead. Finally, the query execution module lets incoming queries utilize these indexes. The evaluation shows that SmallClient outperforms a Hadoop full scan with a search performance improvement of more than 90%, while the indexing overhead of SmallClient is reduced to approximately 50% and 80% for index size and indexing time, respectively.
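    The following Python sketch mirrors the three modules described above (block creation, index creation, query execution) with a simple per-block min/max index that lets a range query skip non-matching blocks; it is an illustrative reconstruction, not SmallClient's actual design, and the column names are made up.

```python
# Each block records the min/max of a search column, so a range query can
# skip blocks that cannot contain matches instead of performing a full scan.

BLOCK_SIZE = 1000

def create_blocks(records):
    """Block creation: split the dataset into fixed-size blocks."""
    return [records[i:i + BLOCK_SIZE] for i in range(0, len(records), BLOCK_SIZE)]

def create_index(blocks, column):
    """Index creation: store (min, max) of `column` for every block."""
    return [(min(r[column] for r in b), max(r[column] for r in b)) for b in blocks]

def execute_query(blocks, index, column, lo, hi):
    """Query execution: scan only blocks whose [min, max] overlaps [lo, hi]."""
    hits, scanned = [], 0
    for block, (bmin, bmax) in zip(blocks, index):
        if bmax < lo or bmin > hi:
            continue                      # index hit: block skipped entirely
        scanned += 1
        hits.extend(r for r in block if lo <= r[column] <= hi)
    return hits, scanned

if __name__ == "__main__":
    data = [{"id": i, "temp": i % 500} for i in range(100_000)]
    data.sort(key=lambda r: r["temp"])    # clustering makes the index selective
    blocks = create_blocks(data)
    index = create_index(blocks, "temp")
    rows, scanned = execute_query(blocks, index, "temp", 490, 495)
    print(f"matched {len(rows)} rows by scanning {scanned} of {len(blocks)} blocks")
```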

    Layout Optimization for Distributed Relational Databases Using Machine Learning

    A common problem when running Web-based applications is how to scale up the database. The solution usually involves having a skilled database administrator determine how to spread the database tables across computers that will work in parallel. Laying out database tables across multiple machines so they can act together as a single efficient database is hard, and automated methods are needed to eliminate the time database administrators must spend creating optimal configurations.
    We consider four operators that create a search space of possible database layouts: 1) denormalizing, 2) horizontally partitioning, 3) vertically partitioning, and 4) fully replicating. Textbooks offer general advice that is useful for dealing with extreme cases; for instance, a table should be fully replicated if the ratio of inserts to selects is close to zero. But even this seemingly obvious statement will not necessarily lead to a speedup once you take into account that some nodes might become a bottleneck, and there can be complex interactions between the four operators which make it even more difficult to predict the best choice.
    Instead of relying on best practices for database layout, we need a system that collects empirical data on when these four operators are effective. We have implemented a state-based search technique to try different operators and then used the empirically measured data to see whether any speedup occurred. We recognize that the cost of creating each physical database layout is potentially large, but it is necessary because we want to know the ground truth about what is effective and under what conditions. After creating a dataset in which these four operators have been applied to build different databases, we can employ machine learning to induce rules that govern the physical design of the database across an arbitrary number of computer nodes. This learning process, in turn, allows the database placement algorithm to improve over time as it trains on a set of examples. The algorithm tries to learn: 1) What is a good database layout for a particular application given a query workload? and 2) Can the algorithm automatically improve its recommendations by using machine-learned rules to generalize when it makes sense to apply each of these operators?
    There has been considerable research on parallelizing databases where large amounts of data are shipped from one node to another to answer a single query. Since the cost of shipping data back and forth can be high, in this work we assume that it may be more efficient to create a database layout in which each query can be answered by a single node. This assumption requires that all incoming query templates are known beforehand, a requirement easily satisfied for a Web-based application because users typically interact with the system through a web interface such as web forms. In this case, unseen queries are not necessarily answerable without possibly first reconstructing the data on a single machine. Prior knowledge of the exact query templates allows us to select the best possible database table placements across multiple nodes. But in the case of improving the efficiency of a Web-based application, a web site provider might be willing to accept the inconvenience of not being able to answer an arbitrary query if they are in turn provided with a system that runs more efficiently.
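    A compact sketch of the state-based search idea is shown below: it enumerates one of the four operators per table, asks a cost function for the measured workload cost of each candidate layout, and records every (layout, cost) pair as a labelled example a learner could later generalize from. The real system measures costs on a physically built layout; here measure_workload_cost is a stand-in stub and the table names are invented.

```python
import itertools

# State-based search over candidate layouts. Each state assigns one of the
# four layout operators to each table; every state is "measured" and kept
# as a training example for the machine-learned placement rules.

OPERATORS = ["denormalize", "horizontal_partition", "vertical_partition", "replicate"]
TABLES = ["users", "orders"]

def measure_workload_cost(layout):
    """Stand-in for empirically timing the query workload on this layout."""
    # Toy scoring only: pretend replication helps one table and horizontal
    # partitioning helps the other. Replace with real measurements.
    cost = 100.0
    cost -= 20.0 if layout["users"] == "replicate" else 0.0
    cost -= 30.0 if layout["orders"] == "horizontal_partition" else 0.0
    return cost

def search_layouts():
    """Enumerate one operator per table, measure each state, keep the best."""
    examples = []                                  # training data for the learner
    best = None
    for choice in itertools.product(OPERATORS, repeat=len(TABLES)):
        layout = dict(zip(TABLES, choice))
        cost = measure_workload_cost(layout)
        examples.append((layout, cost))
        if best is None or cost < best[1]:
            best = (layout, cost)
    return best, examples

if __name__ == "__main__":
    (best_layout, best_cost), examples = search_layouts()
    print("best layout:", best_layout, "cost:", best_cost)
    print("collected", len(examples), "labelled examples")   # 16
```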

    Supporting Efficient Database Processing in MapReduce

    Ph.D. thesis (Doctor of Philosophy)