The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase in computational power has produced an overwhelming flow of data, calling for a paradigm shift in computing architectures and large-scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many follow-up research efforts since its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that build on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that borrow some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
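To make the programming model concrete, the canonical illustration is word counting: the map function emits a (word, 1) pair per token, and the reduce function sums the counts for each word. The following is a minimal, single-process sketch of that contract in Python; an actual framework such as Hadoop runs the same two functions distributed across a cluster and handles the shuffle, scheduling, and fault tolerance itself.

from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

def map_fn(document: str) -> Iterator[Tuple[str, int]]:
    # Map phase: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    # Reduce phase: fold all values sharing a key into a single result.
    return word, sum(counts)

def run_mapreduce(documents: Iterable[str]) -> Dict[str, int]:
    # Shuffle phase: group intermediate values by key, as the framework
    # would do between the map and reduce phases.
    groups: Dict[str, List[int]] = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}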
Development of a large-scale neuroimages and clinical variables data atlas in the neuGRID4You (N4U) project
Exceptional growth in the availability of large-scale clinical imaging datasets has led to the development of computational infrastructures that offer scientists access to image repositories and associated clinical variables data. The EU FP7 neuGRID and its follow-on neuGRID4You (N4U) projects provide a leading e-Infrastructure where neuroscientists can find core services and resources for brain image analysis. The core component of this e-Infrastructure is the N4U Virtual Laboratory, which offers neuroscientists easy access to a wide range of datasets, algorithms, pipelines, computational resources, services, and associated support services. The foundation of this virtual laboratory is a massive data store plus a set of Information Services, collectively called the 'Data Atlas'. This Data Atlas stores datasets, clinical study data, data dictionaries, and algorithm/pipeline definitions, and provides interfaces for parameterised querying so that neuroscientists can perform analyses on the datasets they require. This paper presents the overall design and development of the Data Atlas and its associated dataset indexing and retrieval services, which originated from the development of the N4U Virtual Laboratory in the EU FP7 N4U project in light of detailed user requirements.
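To illustrate what parameterised querying over such a data atlas could look like, the sketch below filters an in-memory index of clinical records by diagnosis and age range. The query function, field names, and records are hypothetical illustrations, not the actual N4U interface or schema.

from typing import Any, Dict, Iterable, List

def query_atlas(index: Iterable[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    # Keep records whose fields satisfy every criterion: a (lo, hi) tuple
    # expresses an inclusive range, any other value an exact match.
    results = []
    for record in index:
        ok = True
        for field, wanted in criteria.items():
            value = record.get(field)
            if isinstance(wanted, tuple):
                lo, hi = wanted
                ok = value is not None and lo <= value <= hi
            else:
                ok = value == wanted
            if not ok:
                break
        if ok:
            results.append(record)
    return results

# Hypothetical indexed records: pointers to images plus clinical variables.
atlas_index = [
    {"subject": "s001", "diagnosis": "MCI", "age": 71, "scan": "/store/s001/mri.nii"},
    {"subject": "s002", "diagnosis": "AD",  "age": 79, "scan": "/store/s002/mri.nii"},
]
print(query_atlas(atlas_index, diagnosis="MCI", age=(65, 80)))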
The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes
In the current context of Big Data, a multitude of new NoSQL solutions for
storing, managing, and extracting information and patterns from semi-structured
data have been proposed and implemented. These solutions were developed to
relieve the issue of rigid data structures present in relational databases, by
introducing semi-structured and flexible schema design. Because current data generated by different sources and devices, especially IoT sensors and actuators, is encoded in either XML or JSON depending on the application, database technologies that store and query semi-structured data in XML format are still needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized query languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Database Systems. Currently, the majority of these solutions have been replaced by the more modern JSON-based Database Management Systems. However, we believe that XML-based solutions
can still deliver performance in executing complex queries on heterogeneous
collections. Unfortunately, current research lacks a clear comparison of the scalability and performance of database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison of selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences, we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.
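To ground the comparison, consider the same DBLP-style lookup expressed for both families of systems: the titles of articles by a given author published after 2015. The document layout (author, year, title fields) is illustrative, not the benchmark's actual schema; the XQuery string would be submitted to an XQuery engine such as BaseX, while the equivalent JSON filter runs against MongoDB through pymongo.

from pymongo import MongoClient

# XQuery formulation for a Native XML DODBMS (e.g., BaseX or eXist-db);
# the collection name and element layout are illustrative.
XQUERY = """
for $a in collection('dblp')//article
where $a/author = 'Jane Doe' and xs:integer($a/year) gt 2015
return $a/title/text()
"""

# Equivalent filter-plus-projection for a JSON DODBMS (MongoDB),
# assuming one JSON document per article with the same fields.
client = MongoClient("mongodb://localhost:27017")
titles = client.dblp.articles.find(
    {"author": "Jane Doe", "year": {"$gt": 2015}},
    {"title": 1, "_id": 0},
)
for doc in titles:
    print(doc["title"])

The contrast is the crux of the benchmark: the XQuery engine navigates nested element structure, whereas the JSON store matches fields on flat documents.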
The design considerations and development of a simulator for the backtesting of investment strategies
The skill of accurately predicting the optimal time to buy or sell shares on the stock market is one that has been actively sought by both experienced and novice investors since the advent of the stock exchange. Since then, the finance industry has employed a plethora of techniques to improve the predictive power of the investor. This thesis is an investigation into one of those techniques and its advancement through the use of computational power. The technique of portfolio strategy backtesting as a vehicle to achieve improved predictive power is one that has existed within financial services for decades. Portfolio backtesting, as its name suggests, is the empirical testing of an investment strategy to determine how the strategy would have performed historically, with the view that past performance may be indicative of future performance.
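As a sketch of the core loop such a simulator executes, the example below replays historical closing prices bar by bar, trades a simple moving-average crossover rule, and records the portfolio's equity at each step. The strategy, parameters, and synthetic prices are illustrative stand-ins, not the strategies studied in the thesis.

from typing import List

def backtest(prices: List[float], short: int = 20, long: int = 50,
             cash: float = 10_000.0) -> List[float]:
    # Replay the price history one bar at a time, trading a simple
    # moving-average crossover and tracking total portfolio value.
    position = 0.0  # number of shares currently held
    equity = []
    for t in range(long, len(prices)):
        short_ma = sum(prices[t - short:t]) / short
        long_ma = sum(prices[t - long:t]) / long
        price = prices[t]
        if short_ma > long_ma and position == 0.0:
            position, cash = cash / price, 0.0      # bullish crossover: buy
        elif short_ma < long_ma and position > 0.0:
            cash, position = position * price, 0.0  # bearish crossover: sell
        equity.append(cash + position * price)
    return equity

# Hypothetical price series; in practice this would be loaded from
# historical market data.
closes = [100 + 0.1 * t + 5 * ((t // 30) % 2) for t in range(300)]
curve = backtest(closes)
print(f"final equity: {curve[-1]:.2f}")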
Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., a phone number, a paper PDF, a date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of the data entities inside pages, a significant departure from traditional document retrieval. We study the essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report on a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning, entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, toward efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we have obtained so far show clear promise for entity-aware search in its usefulness, effectiveness, efficiency, and scalability.
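The dual-inversion idea can be pictured with a toy pair of indexes: a conventional inverted index maps keywords to documents, while an entity-inverted index maps recognized entity instances to the documents containing them, so intersecting the two lets a query return entities rather than pages. The sketch below is an illustrative simplification that scores entities by co-occurrence counts; it is not EntityRank's actual probabilistic model or the paper's index layout.

from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

def build_indexes(docs: List[List[str]]) -> Tuple[Dict[str, Set[int]], Dict[str, Set[int]]]:
    # Tokens prefixed with '#' stand in for recognized entity instances,
    # e.g. '#phone:555-0100'; everything else is an ordinary keyword.
    keyword_index: Dict[str, Set[int]] = defaultdict(set)
    entity_index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for tok in tokens:
            (entity_index if tok.startswith("#") else keyword_index)[tok].add(doc_id)
    return keyword_index, entity_index

def entity_search(keyword_index, entity_index, terms: Iterable[str], entity_type: str):
    # Score each entity instance of the requested type by how many
    # documents it shares with all of the query keywords.
    matching = set.intersection(*(keyword_index[t] for t in terms))
    scores = {e: len(ids & matching)
              for e, ids in entity_index.items()
              if e.startswith(entity_type) and ids & matching}
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = [["acme", "support", "#phone:555-0100"],
        ["acme", "sales", "#phone:555-0199"],
        ["acme", "support", "#phone:555-0100"]]
print(entity_search(*build_indexes(docs), terms=["acme", "support"], entity_type="#phone"))
# [('#phone:555-0100', 2)]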