3,819 research outputs found
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
An Expressive Language and Efficient Execution System for Software Agents
Software agents can be used to automate many of the tedious, time-consuming
information processing tasks that humans currently have to complete manually.
However, to do so, agent plans must be capable of representing the myriad of
actions and control flows required to perform those tasks. In addition, since
these tasks can require integrating multiple sources of remote information ?
typically, a slow, I/O-bound process ? it is desirable to make execution as
efficient as possible. To address both of these needs, we present a flexible
software agent plan language and a highly parallel execution system that enable
the efficient execution of expressive agent plans. The plan language allows
complex tasks to be more easily expressed by providing a variety of operators
for flexibly processing the data as well as supporting subplans (for
modularity) and recursion (for indeterminate looping). The executor is based on
a streaming dataflow model of execution to maximize the amount of operator and
data parallelism possible at runtime. We have implemented both the language and
executor in a system called THESEUS. Our results from testing THESEUS show that
streaming dataflow execution can yield significant speedups over both
traditional serial (von Neumann) as well as non-streaming dataflow-style
execution that existing software and robot agent execution systems currently
support. In addition, we show how plans written in the language we present can
represent certain types of subtasks that cannot be accomplished using the
languages supported by network query engines. Finally, we demonstrate that the
increased expressivity of our plan language does not hamper performance;
specifically, we show how data can be integrated from multiple remote sources
just as efficiently using our architecture as is possible with a
state-of-the-art streaming-dataflow network query engine
On transformation of query scheduling strategies in distributed and heterogeneous database systems
This work considers a problem of optimal query processing in heterogeneous and distributed database systems. A global query sub- mitted at a local site is decomposed into a number of queries processed at the remote sites. The partial results returned by the queries are in- tegrated at a local site. The paper addresses a problem of an optimal scheduling of queries that minimizes time spend on data integration of the partial results into the final answer. A global data model defined in this work provides a unified view of the heterogeneous data structures located at the remote sites and a system of operations is defined to ex- press the complex data integration procedures. This work shows that the transformations of an entirely simultaneous query processing strate- gies into a hybrid (simultaneous/sequential) strategy may in some cases lead to significantly faster data integration. We show how to detect such cases, what conditions must be satisfied to transform the schedules, and how to transform the schedules into the more efficient ones
The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing: Extended Survey
Graph processing is becoming increasingly prevalent across many application
domains. In spite of this prevalence, there is little research about how graphs
are actually used in practice. We performed an extensive study that consisted
of an online survey of 89 users, a review of the mailing lists, source
repositories, and whitepapers of a large suite of graph software products, and
in-person interviews with 6 users and 2 developers of these products. Our
online survey aimed at understanding: (i) the types of graphs users have; (ii)
the graph computations users run; (iii) the types of graph software users use;
and (iv) the major challenges users face when processing their graphs. We
describe the participants' responses to our questions highlighting common
patterns and challenges. Based on our interviews and survey of the rest of our
sources, we were able to answer some new questions that were raised by
participants' responses to our online survey and understand the specific
applications that use graph data and software. Our study revealed surprising
facts about graph processing in practice. In particular, real-world graphs
represent a very diverse range of entities and are often very large,
scalability and visualization are undeniably the most pressing challenges faced
by participants, and data integration, recommendations, and fraud detection are
very popular applications supported by existing graph software. We hope these
findings can guide future research
Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems
Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up. As knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer. In this scheme the memory becomes the processor (a smart memory ).
This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree-machine), and the Butterfly (a coarse-grained MIMD Butterflyswitch machine)
- …