3 research outputs found
Integrative Dynamic Reconfiguration in a Parallel Stream Processing Engine
Load balancing, operator instance collocations and horizontal scaling are
critical issues in Parallel Stream Processing Engines to achieve low data
processing latency, optimized cluster utilization and minimized communication
cost respectively. In previous work, these issues are typically tackled
separately and independently. We argue that these problems are tightly coupled
in the sense that they all need to determine the allocations of workloads and
migrate computational states at runtime. Optimizing them independently would
result in suboptimal solutions. Therefore, in this paper, we investigate how
these three issues can be modeled as one integrated optimization problem. In
particular, we first consider jobs where workload allocations have little
effect on the communication cost, and model the load-balancing problem as a
Mixed-Integer Linear Program. Afterwards, we present an extended solution
called ALBIC, which supports general jobs. We implement the proposed techniques
on top of Apache Storm, an open-source Parallel Stream Processing Engine. The
extensive experimental results over both synthetic and real datasets show that
our techniques clearly outperform existing approaches.
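
As a rough illustration of the load-balancing formulation mentioned in this abstract, the sketch below assigns workloads to operator instances so that the most loaded instance carries as little as possible. It is neither ALBIC nor the paper's Mixed-Integer Linear Program: exhaustive search over a tiny instance stands in for a solver, and the names `balance_load`, `loads`, and `n_instances` are hypothetical.

```python
from itertools import product


def balance_load(loads, n_instances):
    """Return (assignment, max_load) minimizing the most loaded instance.

    loads: per-key-group workloads in arbitrary (hypothetical) units.
    Exhaustive search is an assumption made for illustration; it is only
    practical for very small inputs.
    """
    best_assignment, best_max = None, float("inf")
    # Enumerate every mapping of key groups to instances.
    for assignment in product(range(n_instances), repeat=len(loads)):
        totals = [0] * n_instances
        for group, instance in enumerate(assignment):
            totals[instance] += loads[group]
        if max(totals) < best_max:
            best_assignment, best_max = assignment, max(totals)
    return best_assignment, best_max
```

For example, `balance_load([5, 3, 2, 4], 2)` searches all 16 assignments and returns one whose heavier instance carries half of the total load.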
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed.
STRETCH: Virtual Shared-Nothing Parallelism for Scalable and Elastic Stream Processing
Stream processing applications extract value from raw data through Directed
Acyclic Graphs of data analysis tasks. Shared-nothing (SN) parallelism is the
de-facto standard to scale stream processing applications. Given an
application, SN parallelism instantiates several copies of each analysis task,
making each instance responsible for a dedicated portion of the overall
analysis, and relies on dedicated queues to exchange data among connected
instances. On the one hand, SN parallelism can scale the execution of
applications both up and out since threads can run task instances within and
across processes/nodes. On the other hand, its lack of sharing can cause
unnecessary overheads and hinder the scaling up when threads operate on data
that could be jointly accessed in shared memory. This trade-off motivated us to
study a way for stream processing applications to leverage shared memory and
boost the scale up (before the scale out) while adhering to the widely-adopted
and SN-based APIs for stream processing applications.
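
The shared-nothing scheme described above can be sketched in a few lines. This is a minimal illustration, not STRETCH or any engine's actual API: it assumes tuples are `(key, value)` pairs with string keys, uses CRC32 as a stand-in partitioning hash, and the names `route` and `count_per_key` are hypothetical.

```python
from collections import deque
from zlib import crc32


def route(tuples, n_instances):
    """Distribute (key, value) tuples over one dedicated queue per instance."""
    queues = [deque() for _ in range(n_instances)]
    for key, value in tuples:
        # A deterministic hash keeps every tuple of a key on the same instance.
        queues[crc32(key.encode()) % n_instances].append((key, value))
    return queues


def count_per_key(queue):
    """Per-instance analysis: the state is private, nothing is shared."""
    counts = {}
    for key, _ in queue:
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Because each key always lands in the same queue, every instance can keep its state private. The price is that changing `n_instances` changes the key-to-instance mapping, so state generally has to move between instances.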
We introduce STRETCH, a framework that maximizes the scale up and offers
instantaneous elastic reconfigurations (without state transfer) for stream
processing applications. We propose the concept of Virtual Shared-Nothing (VSN)
parallelism and elasticity and provide formal definitions and correctness
proofs for the semantics of the analysis tasks supported by STRETCH, showing
they extend the ones found in common Stream Processing Engines. We also provide
a fully implemented prototype and show that STRETCH's performance exceeds that
of state-of-the-art frameworks such as Apache Flink and offers, to the best of
our knowledge, unprecedented ultra-fast reconfigurations, taking less than 40
ms even when provisioning tens of new task instances.