1,724 research outputs found
A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures
Scientific problems that depend on processing large amounts of data require
overcoming challenges in multiple areas: managing large-scale data
distribution, co-placement and scheduling of data with compute resources, and
storing and transferring large volumes of data. We analyze the ecosystems of
the two prominent paradigms for data-intensive applications, hereafter referred
to as the high-performance computing and the Apache-Hadoop paradigm. We propose
a basis, common terminology and functional factors upon which to analyze the
two approaches of both paradigms. We discuss the concept of "Big Data Ogres"
and their facets as means of understanding and characterizing the most common
application workloads found across the two paradigms. We then discuss the
salient features of the two paradigms, and compare and contrast the two
approaches. Specifically, we examine common implementation/approaches of these
paradigms, shed light upon the reasons for their current "architecture" and
discuss some typical workloads that utilize them. In spite of the significant
software distinctions, we believe there is architectural similarity. We discuss
the potential integration of different implementations, across the different
levels and components. Our comparison progresses from a fully qualitative
examination of the two paradigms, to a semi-quantitative methodology. We use a
simple and broadly used Ogre (K-means clustering), characterize its performance
on a range of representative platforms, covering several implementations from
both paradigms. Our experiments provide an insight into the relative strengths
of the two paradigms. We propose that the set of Ogres will serve as a
benchmark to evaluate the two paradigms along different dimensions.Comment: 8 pages, 2 figure
Muppet: MapReduce-Style Processing of Fast Data
MapReduce has emerged as a popular method to process big data. In the past
few years, however, not just big data, but fast data has also exploded in
volume and availability. Examples of such data include sensor data streams, the
Twitter Firehose, and Facebook updates. Numerous applications must process fast
data. Can we provide a MapReduce-style framework so that developers can quickly
write such applications and execute them over a cluster of machines, to achieve
low latency and high scalability? In this paper we report on our investigation
of this question, as carried out at Kosmix and WalmartLabs. We describe
MapUpdate, a framework like MapReduce, but specifically developed for fast
data. We describe Muppet, our implementation of MapUpdate. Throughout the
description we highlight the key challenges, argue why MapReduce is not well
suited to address them, and briefly describe our current solutions. Finally, we
describe our experience and lessons learned with Muppet, which has been used
extensively at Kosmix and WalmartLabs to power a broad range of applications in
social media and e-commerce.Comment: VLDB201
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
CPL: A Core Language for Cloud Computing -- Technical Report
Running distributed applications in the cloud involves deployment. That is,
distribution and configuration of application services and middleware
infrastructure. The considerable complexity of these tasks resulted in the
emergence of declarative JSON-based domain-specific deployment languages to
develop deployment programs. However, existing deployment programs unsafely
compose artifacts written in different languages, leading to bugs that are hard
to detect before run time. Furthermore, deployment languages do not provide
extension points for custom implementations of existing cloud services such as
application-specific load balancing policies.
To address these shortcomings, we propose CPL (Cloud Platform Language), a
statically-typed core language for programming both distributed applications as
well as their deployment on a cloud platform. In CPL, application services and
deployment programs interact through statically typed, extensible interfaces,
and an application can trigger further deployment at run time. We provide a
formal semantics of CPL and demonstrate that it enables type-safe, composable
and extensible libraries of service combinators, such as load balancing and
fault tolerance.Comment: Technical report accompanying the MODULARITY '16 submissio
- …