The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase in computational power has
produced an overwhelming flow of data, which has called for a paradigm shift in
computing architectures and large-scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program, such as data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large-scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining
momentum in both the research and industrial communities. We also cover a set of
systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large-scale data processing systems that resemble some of the
ideas of the MapReduce framework, for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
Incremental scoping study and implementation plan
This report is one of the first deliverables from the Incremental project, which seeks to investigate
and improve the research data management infrastructure at the universities of Glasgow and
Cambridge and to learn lessons and develop resources of value to other institutions. Coming at the
end of the project’s scoping study, this report identifies the key themes and issues that emerged
and proposes a set of activities to address those needs.
As its name suggests, Incremental deliberately adopts a stepped, pragmatic approach to supporting
research data management. It recognises that solutions will vary across different departmental and
institutional contexts; and that top-down, policy-driven or centralised solutions are unlikely to prove
as effective as practical support delivered in a clear and timely manner where the benefits can be
clearly understood and will justify any effort or resources required. The findings of the scoping
study have confirmed the value of this approach and the main recommendations of this report are
concerned with the development and delivery of suitable resources.
Although some differences were observed between disciplines, these seemed to be as much a
feature of differing organisational cultures as of the nature of the research being undertaken. Our
study found that there were many common issues across the groups and that the responses to
these issues need not be highly technical or expensive to implement. What is required is that these
resources employ jargon-free language, use examples of relevance to researchers, and
can be accessed easily at the point of need. There are resources already available
(institutionally and externally) that can address researchers’ data management needs but these are
not being fully exploited. So in many cases Incremental will be enabling efficient and contextualised
access, or tailoring resources to specific environments, rather than developing resources from
scratch.
While Incremental will concentrate on developing, repurposing and leveraging practical resources to
support researchers in their management of data, it recognises that this will be best achieved within
a supportive institutional context (both in terms of policy and provision). The need for institutional
support is especially evident when long-term preservation and data sharing are considered – these
activities are clearly more effective and sustainable if addressed at more aggregated levels (e.g.
repositories) rather than left to individual researchers or groups. So in addition to its work in
developing resources, the Incremental project will seek to inform the development of a more
comprehensive data management infrastructure at each institution. In Cambridge, this will be
connected with the library’s CUPID project (Cambridge University Preservation Development) and
at Glasgow it will be pursued in conjunction with the Digital Preservation Advisory Board.
A novel cloud based elastic framework for big data preprocessing
A number of analytical big data services based on the cloud computing paradigm, such as Amazon Redshift and Google BigQuery, have recently emerged. These services are based on columnar databases rather than traditional Relational Database Management Systems (RDBMS) and are able to analyse massive datasets in mere seconds. This has led many organisations to retain and analyse their massive log, sensory or marketing datasets, which were previously discarded due to the inability to either store or analyse them. Although these big data services have addressed the issue of big data analysis, the ability to efficiently de-normalise and prepare this data in a format that can be imported into these services remains a challenge. This paper describes and implements a novel, generic and scalable cloud-based elastic framework for Big Data Preprocessing (BDP). Since the approach described by this paper is entirely based on cloud computing, it is also possible to measure the overall cost incurred by these preprocessing activities.
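The de-normalisation step the abstract identifies as a challenge can be sketched as flattening nested records into single-level rows suitable for a columnar import (a simplified illustration, not the paper's actual framework; the field names and the record are hypothetical):

```python
def flatten(record, prefix=""):
    # Recursively flatten a nested dict into one flat row whose column
    # names encode the original nesting path (e.g. "user_id").
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "_"))
        else:
            row[name] = value
    return row

event = {"user": {"id": 7, "country": "UK"}, "action": "click"}
flat = flatten(event)
# flat == {"user_id": 7, "user_country": "UK", "action": "click"}
```

In the elastic setting the paper targets, many such records would be flattened in parallel across cloud workers before bulk-loading the resulting rows into the columnar store.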