4,522 research outputs found
On-the-fly pipeline parallelism
Pipeline parallelism organizes a parallel program as a linear sequence of s stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism.
Whereas most concurrency platforms that support pipeline parallelism use a "construct-and-run" approach, this paper investigates "on-the-fly" pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding "runaway" pipelines. Given a pipeline computation with T[subscript 1] work and T[subscript ∞] span (critical-path length), Piper executes the computation on P processors in T[subscript P]≤ T[subscript 1]/P + O(T[subscript ∞] + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time.
We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as lazy enabling and dependency folding. We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.National Science Foundation (U.S.) (Grant CNS-1017058)National Science Foundation (U.S.) (Grant CCF-1162148)National Science Foundation (U.S.). Graduate Research Fellowshi
Active Image-based Modeling with a Toy Drone
Image-based modeling techniques can now generate photo-realistic 3D models
from images. But it is up to users to provide high quality images with good
coverage and view overlap, which makes the data capturing process tedious and
time consuming. We seek to automate data capturing for image-based modeling.
The core of our system is an iterative linear method to solve the multi-view
stereo (MVS) problem quickly and plan the Next-Best-View (NBV) effectively. Our
fast MVS algorithm enables online model reconstruction and quality assessment
to determine the NBVs on the fly. We test our system with a toy unmanned aerial
vehicle (UAV) in simulated, indoor and outdoor experiments. Results show that
our system improves the efficiency of data acquisition and ensures the
completeness of the final model.Comment: To be published on International Conference on Robotics and
Automation 2018, Brisbane, Australia. Project Page:
https://huangrui815.github.io/active-image-based-modeling/ The author's
personal page: http://www.sfu.ca/~rha55
DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge
The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for
processing large astronomical datasets at a scale required by the Square
Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex
data reduction pipelines consisting of both data sets and algorithmic
components and an implementation run-time to execute such pipelines on
distributed resources. By mapping the logical view of a pipeline to its
physical realisation, DALiuGE separates the concerns of multiple stakeholders,
allowing them to collectively optimise large-scale data processing solutions in
a coherent manner. The execution in DALiuGE is data-activated, where each
individual data item autonomously triggers the processing on itself. Such
decentralisation also makes the execution framework very scalable and flexible,
supporting pipeline sizes ranging from less than ten tasks running on a laptop
to tens of millions of concurrent tasks on the second fastest supercomputer in
the world. DALiuGE has been used in production for reducing interferometry data
sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide
Spectral Radioheliograph; and is being developed as the execution framework
prototype for the Science Data Processor (SDP) consortium of the Square
Kilometre Array (SKA) telescope. This paper presents a technical overview of
DALiuGE and discusses case studies from the CHILES and MUSER projects that use
DALiuGE to execute production pipelines. In a companion paper, we provide
in-depth analysis of DALiuGE's scalability to very large numbers of tasks on
two supercomputing facilities.Comment: 31 pages, 12 figures, currently under review by Astronomy and
Computin
Low power techniques for video compression
This paper gives an overview of low-power techniques proposed in the literature for mobile multimedia and Internet applications. Exploitable aspects are discussed in the behavior of different video compression tools. These power-efficient solutions are then classified by synthesis domain and level of abstraction. As this paper is meant to be a starting point for further research in the area, a lowpower hardware & software co-design methodology is outlined in the end as a possible scenario for video-codec-on-a-chip implementations on future mobile multimedia platforms
Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
Production-quality parallel applications are often a mixture of diverse
operations, such as computation- and communication-intensive, regular and
irregular, tightly coupled and loosely linked operations. In conventional
construction of parallel applications, each process performs all the
operations, which might result inefficient and seriously limit scalability,
especially at large scale. We propose a decoupling strategy to improve the
scalability of applications running on large-scale systems.
Our strategy separates application operations onto groups of processes and
enables a dataflow processing paradigm among the groups. This mechanism is
effective in reducing the impact of load imbalance and increases the parallel
efficiency by pipelining multiple operations. We provide a proof-of-concept
implementation using MPI, the de-facto programming system on current
supercomputers. We demonstrate the effectiveness of this strategy by decoupling
the reduce, particle communication, halo exchange and I/O operations in a set
of scientific and data-analytics applications. A performance evaluation on
8,192 processes of a Cray XC40 supercomputer shows that the proposed approach
can achieve up to 4x performance improvement.Comment: The 46th International Conference on Parallel Processing (ICPP-2017
Parallel Simulations for Analysing Portfolios of Catastrophic Event Risk
At the heart of the analytical pipeline of a modern quantitative
insurance/reinsurance company is a stochastic simulation technique for
portfolio risk analysis and pricing process referred to as Aggregate Analysis.
Support for the computation of risk measures including Probable Maximum Loss
(PML) and the Tail Value at Risk (TVAR) for a variety of types of complex
property catastrophe insurance contracts including Cat eXcess of Loss (XL), or
Per-Occurrence XL, and Aggregate XL, and contracts that combine these measures
is obtained in Aggregate Analysis.
In this paper, we explore parallel methods for aggregate risk analysis. A
parallel aggregate risk analysis algorithm and an engine based on the algorithm
is proposed. This engine is implemented in C and OpenMP for multi-core CPUs and
in C and CUDA for many-core GPUs. Performance analysis of the algorithm
indicates that GPUs offer an alternative HPC solution for aggregate risk
analysis that is cost effective. The optimised algorithm on the GPU performs a
1 million trial aggregate simulation with 1000 catastrophic events per trial on
a typical exposure set and contract structure in just over 20 seconds which is
approximately 15x times faster than the sequential counterpart. This can
sufficiently support the real-time pricing scenario in which an underwriter
analyses different contractual terms and pricing while discussing a deal with a
client over the phone.Comment: Proceedings of the Workshop at the International Conference for High
Performance Computing, Networking, Storage and Analysis (SC), 2012, 8 page
- …