
    Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

    MapReduce is a popular programming paradigm for developing large-scale, data-intensive computations. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. Casper is a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed in a high-level intermediate language resembling the MapReduce paradigm and is verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2x faster than the originals.
    Comment: 12 pages, additional 4 pages of references and appendix
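    To make the translation concrete, here is a minimal sketch of the kind of rewrite such a tool performs. The sequential loop and its Spark Java API counterpart below are illustrative only, not code from the paper, and the class and method names are hypothetical:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TranslationSketch {
    // Original sequential fragment: sum of squares over a list.
    static long sumOfSquares(List<Integer> data) {
        long total = 0;
        for (int x : data) {
            total += (long) x * x;
        }
        return total;
    }

    // The same computation expressed against the Spark Java API:
    // roughly the shape of code a translator could emit from the
    // synthesized summary (map each element, then reduce).
    static long sumOfSquaresSpark(JavaSparkContext sc, List<Integer> data) {
        return sc.parallelize(data)
                 .map(x -> (long) x * x)  // "map" phase
                 .reduce(Long::sum);      // "reduce" phase
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("translation-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Integer> data = Arrays.asList(1, 2, 3, 4);
            System.out.println(sumOfSquares(data));          // 30
            System.out.println(sumOfSquaresSpark(sc, data)); // 30
        }
    }
}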

    Tupleware: Redefining Modern Analytics

    There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world: petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders-of-magnitude performance improvements over alternative systems.

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications that process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that are based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
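    The canonical illustration of the model these systems build on is word count. The following is the standard textbook mapper/reducer pair against the Hadoop MapReduce Java API (a conventional example, not code taken from the survey; the job driver that wires these classes together is omitted for brevity):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word. The framework handles
    // data distribution, shuffling, scheduling, and fault tolerance in
    // between the two phases, which is exactly what the model abstracts away.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}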

    POC on Credit Card “e-Statement” Details Generation for ANZ Bank

    The storage and processing of data are major issues in information technology today. Organizations accumulate data rapidly, and it becomes difficult for information systems to process it and respond to the queries required of them. Banking is one such industry, handling millions of records at a time. Hadoop is one way to handle these records more effectively and in less time: this Proof of Concept (POC) shows that queries execute in far less time than on the existing database system.
    The growth of data first challenged cutting-edge companies such as Google, Yahoo, Amazon, and Microsoft, which had to sift through terabytes and even petabytes of data to understand the behavior of their popular websites. The tools available at the time were not equipped to cope, so Google presented MapReduce, the system it had built for the task. Most companies faced the same problem and did not want to develop comparable systems of their own, so the model was later released as open source under the name Hadoop, and today it is a major part of the computing world. Because of its efficiency, more and more companies rely on Hadoop and deploy it in-house. Hadoop is used for running large distributed programs, and its simplicity and accessibility give it an edge over writing and running distributed programs from scratch: a competent programmer can set up a Hadoop instance in minutes and at low cost, and Hadoop is also scalable and robust. Modern developments demand systems with ever larger data storage, and these features have made Hadoop very popular in both academia and industry.
    MapReduce is a data processing model in which computation scales easily over many machines. Processing is expressed in terms of two kinds of functions, mappers and reducers. Decomposing an application into mappers and reducers is sometimes nontrivial, but once an application is written in the MapReduce format, scaling it to run over many hundreds of machines requires only minor changes; this efficiency and scalability is what draws programmers to MapReduce.
    One challenge remains: how existing data moves to the Hadoop infrastructure when that data lives in traditional relational databases queried with Structured Query Language (SQL). This is where Hive comes in. Hive provides a dialect of SQL, the Hive Query Language, for querying data stored in a Hadoop cluster. Hive is not a database; it is bound by the constraints Hadoop imposes. The most surprising limitation is that it provides no record-level updates, inserts, or deletes: you can only create new tables or run queries that write results to files. Hive also does not support transactions.
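    As a hedged sketch of how such a POC might issue HiveQL from Java, the snippet below uses the standard Hive JDBC interface; the HiveServer2 endpoint, credentials, table, and column names are hypothetical, not taken from the ANZ Bank system:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveStatementQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and schema; adjust for a real cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Aggregate monthly charges per card; table and columns are illustrative.
            ResultSet rs = stmt.executeQuery(
                "SELECT card_no, SUM(amount) AS total " +
                "FROM transactions WHERE stmt_month = '2014-01' " +
                "GROUP BY card_no");
            while (rs.next()) {
                System.out.println(rs.getString("card_no") + "\t" + rs.getLong("total"));
            }
        }
    }
}

    Consistent with the limitations described above, the sketch only reads and aggregates; materializing e-statement details would go through statements such as CREATE TABLE ... AS SELECT or INSERT OVERWRITE DIRECTORY rather than row-level inserts or updates.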