
    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and in large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled in many follow-up research efforts since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data processing that have been built on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework but target different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
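
    To make the programming model concrete, here is a minimal single-process sketch of MapReduce-style word counting in Python. It only illustrates the map and reduce phases the abstract refers to; the function names are hypothetical, and everything a real framework provides (data distribution, scheduling, fault tolerance) is omitted.

        # Minimal sketch of the MapReduce programming model: word count.
        # Single-process illustration only; not a distributed implementation.
        from collections import defaultdict

        def map_phase(document):
            # Map: emit an intermediate (key, value) pair per word.
            for word in document.split():
                yield (word.lower(), 1)

        def reduce_phase(pairs):
            # Shuffle: group intermediate values by key.
            groups = defaultdict(list)
            for key, value in pairs:
                groups[key].append(value)
            # Reduce: aggregate each group's values.
            return {key: sum(values) for key, values in groups.items()}

        docs = ["the quick brown fox", "the fox sleeps"]
        pairs = [p for doc in docs for p in map_phase(doc)]
        print(reduce_phase(pairs))  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 2, 'sleeps': 1}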

    Incremental scoping study and implementation plan

    This report is one of the first deliverables from the Incremental project, which seeks to investigate and improve the research data management infrastructure at the universities of Glasgow and Cambridge, and to learn lessons and develop resources of value to other institutions. Coming at the end of the project's scoping study, this report identifies the key themes and issues that emerged and proposes a set of activities to address those needs.

    As its name suggests, Incremental deliberately adopts a stepped, pragmatic approach to supporting research data management. It recognises that solutions will vary across different departmental and institutional contexts, and that top-down, policy-driven or centralised solutions are unlikely to prove as effective as practical support delivered in a clear and timely manner, where the benefits can be clearly understood and will justify any effort or resources required. The findings of the scoping study have confirmed the value of this approach, and the main recommendations of this report concern the development and delivery of suitable resources.

    Although some differences were observed between disciplines, these seemed to be as much a feature of different organisational cultures as of the nature of the research being undertaken. Our study found that there were many common issues across the groups and that the responses to these issues need not be highly technical or expensive to implement. What is required is that these resources employ jargon-free language, use examples of relevance to researchers, and can be accessed easily at the point of need. There are resources already available (institutionally and externally) that can address researchers' data management needs, but these are not being fully exploited. So in many cases Incremental will be enabling efficient and contextualised access, or tailoring resources to specific environments, rather than developing resources from scratch.

    While Incremental will concentrate on developing, repurposing and leveraging practical resources to support researchers in their management of data, it recognises that this will be best achieved within a supportive institutional context (both in terms of policy and provision). The need for institutional support is especially evident when long-term preservation and data sharing are considered: these activities are clearly more effective and sustainable if addressed at more aggregated levels (e.g. repositories) rather than left to individual researchers or groups. So in addition to its work in developing resources, the Incremental project will seek to inform the development of a more comprehensive data management infrastructure at each institution. In Cambridge, this will be connected with the library's CUPID project (Cambridge University Preservation Development); at Glasgow, it will be carried out in conjunction with the Digital Preservation Advisory Board.

    A novel cloud based elastic framework for big data preprocessing

    A number of analytical big data services based on the cloud computing paradigm, such as Amazon Redshift and Google BigQuery, have recently emerged. These services are based on columnar databases rather than traditional Relational Database Management Systems (RDBMS) and are able to analyse massive datasets in mere seconds. This has led many organisations to retain and analyse their massive log, sensor or marketing datasets, which were previously discarded due to the inability to either store or analyse them. Although these big data services have addressed the issue of big data analysis, efficiently de-normalising and preparing this data into a format that can be imported into these services remains a challenge. This paper describes and implements a novel, generic and scalable cloud-based elastic framework for Big Data Preprocessing (BDP). Since the approach described in this paper is entirely based on cloud computing, it is also possible to measure the overall cost incurred by these preprocessing activities.
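
    As an illustration of the kind of de-normalisation step such a framework automates, the sketch below joins normalised records into flat rows ready for bulk import into a columnar service. The dataset, table and column names are invented for this example and are not taken from the paper.

        # Hypothetical sketch: flatten normalised records into denormalised
        # CSV rows of the kind columnar warehouses ingest efficiently.
        import csv
        import io

        customers = {1: {"name": "Acme", "region": "EU"}}
        orders = [
            {"order_id": 100, "customer_id": 1, "amount": 25.0},
            {"order_id": 101, "customer_id": 1, "amount": 40.0},
        ]

        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["order_id", "name", "region", "amount"])
        writer.writeheader()
        for order in orders:
            customer = customers[order["customer_id"]]
            # Repeat the customer attributes on every order row; the join is
            # paid once here so the analytical service never has to do it.
            writer.writerow({
                "order_id": order["order_id"],
                "name": customer["name"],
                "region": customer["region"],
                "amount": order["amount"],
            })
        print(buf.getvalue())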