Exploiting Fine-Grain Ordered Parallelism in Dense Matrix Algorithms
Dense linear algebra kernels are critical for wireless applications, and the
oncoming proliferation of 5G only amplifies their importance. Many such matrix
algorithms are inductive and exhibit ample fine-grain ordered
parallelism: multiple computations flow with fine-grain
producer/consumer dependences, and the iteration domain is not easily
tileable. Synchronization overheads make multi-core parallelism ineffective, and
the non-tileable iterations make the vector-VLIW approach less effective,
especially for the typically modest-sized matrices. Because CPUs and DSPs lose
an order of magnitude in performance and hardware utilization, costly and
inflexible ASICs are often employed in signal processing pipelines. A programmable
accelerator with similar performance/power/area would be highly desirable. We
find that fine-grain ordered parallelism can be exploited by supporting: 1.
fine-grain stream-based communication/synchronization; 2. inductive data-reuse
and memory access patterns; 3. implicit vector-masking for partial vectors; 4.
hardware specialization of dataflow criticality. In this work, we propose
REVEL, a next-generation DSP architecture. It supports the above features in
its ISA and microarchitecture, and further uses a novel vector-stream control
paradigm to reduce control overheads. Across a suite of linear algebra kernels,
REVEL outperforms equally provisioned DSPs by 4.6x-37x in latency and achieves
a performance per mm^2 of 8.3x. It requires only 2.2x the power of ideal
ASICs to achieve the same performance, at about 55% of their combined area.
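To make the notion of an inductive, non-tileable kernel concrete, below is a minimal illustrative sketch (plain Python, not REVEL's ISA or vector-stream model) of forward substitution, a representative kernel whose triangular iteration domain and fine-grain producer/consumer chains match the description above.

```python
# Sketch of an inductive kernel: forward substitution, solving L @ x = b
# for lower-triangular L. Row i consumes every x[j] with j < i, so the
# iteration domain is triangular (not easily tileable) and the work is a
# chain of fine-grain producer/consumer dependences.
import numpy as np

def forward_substitution(L, b):
    n = len(b)
    x = np.zeros(n)
    for i in range(n):              # row i waits on values from earlier rows
        acc = b[i]
        for j in range(i):          # inner trip count grows with i, so the
            acc -= L[i, j] * x[j]   # trailing partial vectors need masking
        x[i] = acc / L[i, i]        # produces x[i] for all later iterations
    return x

L_mat = np.tril(np.random.rand(4, 4)) + np.eye(4)
print(forward_substitution(L_mat, np.ones(4)))
```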
A Scalable Platform for Distributed Object Tracking across a Many-camera Network
Advances in deep neural networks (DNN) and computer vision (CV) algorithms
have made it feasible to extract meaningful insights from large-scale
deployments of urban cameras. Tracking an object of interest across the camera
network in near real-time is a canonical problem. However, current tracking
platforms have two key limitations: 1) They are monolithic, proprietary and
lack the ability to rapidly incorporate sophisticated tracking models; and 2)
They are less responsive to dynamism across wide-area computing resources that
include edge, fog and cloud abstractions. We address these gaps using Anveshak,
a runtime platform for composing and coordinating distributed tracking
applications. It provides a domain-specific dataflow programming model to
intuitively compose a tracking application, supporting contemporary CV advances
like query fusion and re-identification, and enabling dynamic scoping of the
camera network's search space to avoid wasted computation. We also offer
tunable batching and data-dropping strategies for dataflow blocks deployed on
distributed resources to respond to network and compute variability. These
strategies balance tracking accuracy, real-time performance, and the active
camera-set size. We illustrate the concise expressiveness of the programming
model for tracking applications. Our detailed experiments for a network of
1000 camera-feeds on modest resources exhibit the tunable scalability,
performance and quality trade-offs enabled by our dynamic tracking, batching
and dropping strategies.
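As a rough feel for what composing dataflow blocks with tunable batching might look like, here is a hypothetical sketch; the Block class, the stage functions, and the batch-size knob are inventions for exposition, not Anveshak's actual programming interface.

```python
# Hypothetical illustration of composing a tracking dataflow from blocks
# with tunable batching; these names are invented for exposition and are
# NOT Anveshak's real API.
from typing import Callable, List

class Block:
    def __init__(self, fn: Callable, batch_size: int = 1):
        self.fn, self.batch_size = fn, batch_size   # tunable batching knob

    def run(self, items: List):
        out = []
        for i in range(0, len(items), self.batch_size):
            out.extend(self.fn(items[i:i + self.batch_size]))
        return out

def detect(frames):      # dynamic scoping: drop feeds with no candidate object
    return [f for f in frames if f["has_object"]]

def reidentify(frames):  # stand-in for a re-identification model
    return [dict(f, matched=True) for f in frames]

pipeline = [Block(detect, batch_size=8), Block(reidentify, batch_size=4)]
feeds = [{"cam": i, "has_object": i % 3 == 0} for i in range(12)]
for block in pipeline:
    feeds = block.run(feeds)
print(feeds)  # frames surviving scoping, tagged as re-identified
```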
On reconfigurable tiled multi-core programming
For a generic, flexible, and efficient array antenna receiver platform, a hierarchical reconfigurable tiled architecture has been proposed. The architecture provides a flexible reconfigurable solution, but partitioning, mapping, modelling and programming such systems remain an issue. A semantic model has been presented to support the specification, design and implementation. The semantic model is used for partitioning the application, evaluating the consequences, and mapping to an architecture. Design space exploration allows us to adapt the partitioning and mapping to an architecture, or vice versa.

With tiled reconfigurable cores as the basis for the architecture, this paper explores the different options for processing cores and their suitability with respect to the design flow of the semantic model approach. Trade-offs in granularity, depending on flexibility and efficiency, allow interesting design evaluations, especially for programmability. This work therefore represents an important step forward in the design flow for designing and using multi-core tiled architectures.
Parallel Wavelet Schemes for Images
In this paper, we introduce several new schemes for the calculation of discrete
wavelet transforms of images. These schemes reduce the number of steps and, as
a consequence, make it possible to reduce the number of synchronizations on parallel
architectures. As an additional useful property, the proposed schemes can
also reduce the number of arithmetic operations. The schemes are primarily
demonstrated on the CDF 5/3 and CDF 9/7 wavelets employed in the JPEG 2000 image
compression standard. However, the presented method is general, and it can be
applied to any wavelet transform. As a result, our scheme requires only two
memory barriers for 2-D CDF 5/3 transform compared to four barriers in the
original separable form or three barriers in the non-separable scheme recently
published. Our reasoning is supported by exhaustive experiments on high-end
graphics cards.
Comment: This is a preprint of the article that appeared in the Journal of
Real-Time Image Processing. The final publication is available at Springer
via http://doi.org/10.1007/s11554-016-0646-
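For context, a minimal sketch of the 1-D CDF 5/3 forward transform in its lifting form: each lifting pass is a parallel step that ends in a synchronization barrier on GPUs, which is exactly the cost the proposed schemes reduce. The symmetric boundary handling below is a common convention, assumed here rather than taken from the paper.

```python
# Sketch of the 1-D CDF 5/3 forward transform via lifting (two passes:
# predict, then update). On parallel hardware, each pass is a data-parallel
# step followed by a synchronization barrier. Symmetric boundary extension
# is assumed; input length must be even.
import numpy as np

def cdf53_forward(x):
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2].copy(), x[1::2].copy()
    # Predict pass: each detail coefficient consumes its two even neighbours.
    odd -= 0.5 * (even + np.append(even[1:], even[-1]))
    # Update pass: each approximation coefficient consumes two detail values.
    even += 0.25 * (np.insert(odd[:-1], 0, odd[0]) + odd)
    return even, odd  # (low-pass, high-pass) subbands

lo, hi = cdf53_forward(np.arange(8.0))
print(lo, hi)
```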
Current Trends and Future Research Directions for Interactive Music
This review explains and compares different software and
formalisms used in music interaction: sequencers, computer-assisted
improvisation, meta-instruments, score-following, asynchronous dataflow
languages, synchronous dataflow languages, process calculi, temporal
constraints and interactive scores. Formal approaches have the advantage of
providing rigorous semantics of the behavior of the model and proving
correctness during execution. The main disadvantage of formal approaches is
the lack of commercial tools.
A Coprocessor for Accelerating Visual Information Processing
Visual information processing will play an increasingly important role in
future electronics systems. In many applications, e.g. video surveillance
cameras, data throughput of microprocessors is not sufficient and power
consumption is too high. Instruction profiling on a typical test algorithm has
shown that pixel address calculations are the dominant operations to be
optimized. Therefore AddressLib, a structured scheme for pixel addressing, was
developed, which can be accelerated by AddressEngine, a coprocessor for visual
information processing. In this paper, the architectural design of
AddressEngine is described, which in the first step supports a subset of the
AddressLib. Dataflow and memory organization are optimized during architectural
design. AddressEngine was implemented in an FPGA and was tested with the MPEG-7
Global Motion Estimation algorithm. Results on processing speed and circuit
complexity are given and compared to a pure software implementation. The next
step will be the support for the full AddressLib, including segment addressing.
An outlook on further investigations on dynamic reconfiguration capabilities is
given.
Comment: Submitted on behalf of EDAA (http://www.edaa.com/
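Since pixel address calculations are identified as the dominant operation, a minimal sketch of that arithmetic may help; the constants and helper below are illustrative assumptions and do not reproduce AddressLib's interface.

```python
# Minimal sketch of the pixel-address arithmetic a coprocessor like
# AddressEngine offloads: in a row-major frame buffer, every pixel access
# costs a multiply-add (base + y*stride + x), which dominates the inner
# loops that instruction profiling identified. All constants are
# illustrative assumptions.
WIDTH, HEIGHT, BYTES_PER_PIXEL = 640, 480, 1
STRIDE = WIDTH * BYTES_PER_PIXEL          # bytes per image row

def pixel_address(base: int, x: int, y: int) -> int:
    return base + y * STRIDE + x * BYTES_PER_PIXEL

# A software inner loop re-derives this address for every pixel; dedicated
# hardware can instead step it incrementally (+BYTES_PER_PIXEL per column,
# +STRIDE per row), removing the per-pixel multiply.
addr = pixel_address(0x1000, x=10, y=2)
print(hex(addr))  # 4096 + 2*640 + 10 = 5386 (0x150a)
```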
Exploring the Equivalence between Dynamic Dataflow Model and Gamma - General Abstract Model for Multiset mAnipulation
As the search grows for computational models in which the expression
of parallelism occurs naturally, some paradigms arise as options for the next
generation of computers. In this context, dynamic Dataflow and Gamma (General
Abstract Model for Multiset mAnipulation) emerge as interesting choices of
computational model. In the dynamic Dataflow model, operations are performed as soon
as their associated operands are available, without relying on a Program Counter
to dictate the execution order of instructions. The Gamma paradigm is based on
a parallel multiset rewriting scheme. It provides a non-deterministic execution
model inspired by an abstract chemical machine metaphor, where operations are
formulated as reactions that occur freely among matching elements belonging to
the multiset. In this work, equivalence relations between the dynamic Dataflow
and Gamma paradigms are exposed and explored, while methods to convert from
Dataflow to Gamma paradigm and vice versa are provided. It is shown that
vertices and edges of a dynamic Dataflow graph can correspond, respectively, to
reactions and multiset elements in the Gamma paradigm. Implementation aspects
of execution environments that could be mutually beneficial to both models are
also discussed. This work provides the scientific community with the
possibility of profiting from both parallel programming models, contributing
versatility to researchers and developers. Finally, it is
important to state that, to the best of our knowledge, the similarity relations
between both dynamic Dataflow and Gamma models presented here have not been
reported in any previous work.
Comment: Study submitted to the IPDPS 2019 - IEEE International Parallel and
Distributed Processing Symposium
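A minimal sketch of the claimed correspondence: tokens on dataflow edges become tagged multiset elements, and each dataflow vertex becomes a Gamma reaction that fires whenever matching operands are present in the multiset. The tagging encoding below is an illustrative assumption, not the paper's exact construction.

```python
# Dataflow <-> Gamma correspondence, sketched: the multiset holds
# (edge_name, value) tokens for a dataflow graph computing (a+b)*c, and
# each vertex is a reaction that consumes its input tokens and produces
# an output token. Reactions fire freely in any order (chemical metaphor).
import random

multiset = [("a", 2), ("b", 3), ("c", 4)]

def react_add(ms):      # vertex '+': consumes a and b, produces s = a + b
    a = next((t for t in ms if t[0] == "a"), None)
    b = next((t for t in ms if t[0] == "b"), None)
    if a and b:
        ms.remove(a); ms.remove(b); ms.append(("s", a[1] + b[1]))
        return True
    return False

def react_mul(ms):      # vertex '*': consumes s and c, produces the result
    s = next((t for t in ms if t[0] == "s"), None)
    c = next((t for t in ms if t[0] == "c"), None)
    if s and c:
        ms.remove(s); ms.remove(c); ms.append(("out", s[1] * c[1]))
        return True
    return False

reactions = [react_add, react_mul]
progress = True
while progress:                  # non-deterministic execution: any enabled
    random.shuffle(reactions)    # reaction may fire at each step
    progress = any(r(multiset) for r in reactions)
print(multiset)                  # [('out', 20)]
```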
Research Challenges for Heterogeneous CPS Design
Heterogeneous computing is widely used at all levels of computing from data
center to edge due to its power/performance characteristics. However,
heterogeneity presents challenges. Interoperability---the management of
workloads across heterogeneous resources---requires more careful design than is
the case for homogeneous platforms. Cyber-physical systems present additional
challenges. This article considers research challenges in heterogeneous CPS
design, including interoperability, physical modeling, models of computation,
self-awareness and adaptation, architecture, and scheduling.Comment: This is a pre-publication version of a paper that has been accepted
for publication in IEEE Computer. The official/final version of the paper
will be posted on IEEE Xplor
Multiprocessor Scheduling of a Multi-mode Dataflow Graph Considering Mode Transition Delay
The Synchronous Data Flow (SDF) model is widely used for specifying signal
processing or streaming applications. Since modern embedded applications become
more complex with dynamic behavior changes at run-time, several extensions of
the SDF model have been proposed to specify the dynamic behavior changes while
preserving static analyzability of the SDF model. They assume that an
application has a finite number of behaviors (or modes) and each behavior
(mode) is represented by an SDF graph. They are classified as multi-mode
dataflow models in this paper. While there exist several scheduling techniques
for multi-mode dataflow models, none of them allows task migration between modes. By
observing that the resource requirement can be additionally reduced if task
migration is allowed, we propose a multiprocessor scheduling technique of a
multi-mode dataflow graph considering task migration between modes. Based on a
genetic algorithm, the proposed technique schedules all SDF graphs in all modes
simultaneously to minimize the resource requirement. To satisfy the throughput
constraint, the proposed technique calculates the actual throughput requirement
of each mode and the output buffer size for tolerating throughput jitter. We
compare the proposed technique, on synthetic examples and three real applications
(an H.264 decoder, a vocoder, and an LTE receiver), with a method that analyzes
SDF graphs in each execution mode separately and with a method that does not
allow task migration.
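The static analyzability these multi-mode extensions preserve rests on SDF balance equations; as a minimal sketch, the repetition vector for one mode can be computed as below, assuming a consistent, connected graph (error handling omitted).

```python
# Sketch of SDF static analyzability: solve the balance equations
# q[src]*prod == q[dst]*cons per edge to obtain the repetition vector,
# the basis for per-mode throughput analysis. Assumes a consistent,
# connected graph; no consistency-error handling.
from fractions import Fraction
from math import lcm

def repetition_vector(actors, edges):
    # edges: list of (src, production_rate, dst, consumption_rate)
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                           # propagate rates over edges
        changed = False
        for src, p, dst, c in edges:
            if src in q and dst not in q:
                q[dst] = q[src] * p / c; changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * c / p; changed = True
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}

# One mode of a multi-mode graph: A -(2:3)-> B -(1:2)-> C
print(repetition_vector(["A", "B", "C"],
                        [("A", 2, "B", 3), ("B", 1, "C", 2)]))
# {'A': 3, 'B': 2, 'C': 1}
```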
TensorFlow: A system for large-scale machine learning
TensorFlow is a machine learning system that operates at large scale and in
heterogeneous environments. TensorFlow uses dataflow graphs to represent
computation, shared state, and the operations that mutate that state. It maps
the nodes of a dataflow graph across many machines in a cluster, and within a
machine across multiple computational devices, including multicore CPUs,
general-purpose GPUs, and custom designed ASICs known as Tensor Processing
Units (TPUs). This architecture gives flexibility to the application developer:
whereas in previous "parameter server" designs the management of shared state
is built into the system, TensorFlow enables developers to experiment with
novel optimizations and training algorithms. TensorFlow supports a variety of
applications, with particularly strong support for training and inference on
deep neural networks. Several Google services use TensorFlow in production; we
have released it as an open-source project, and it has become widely used for
machine learning research. In this paper, we describe the TensorFlow dataflow
model in contrast to existing systems, and demonstrate the compelling
performance that TensorFlow achieves for several real-world applications.
Comment: 18 pages, 9 figures; v2 has a spelling correction in the metadata
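A minimal illustration of the dataflow model described above, in the graph-and-session style of the TensorFlow 1.x API that the paper documents (reached through tf.compat.v1 on current installs): the graph, including variables as mutable shared state, is built first, and a session then maps it onto devices and executes it.

```python
# Minimal TensorFlow dataflow example: nodes are operations, edges carry
# tensors, and variables hold mutable shared state inside the graph itself.
import tensorflow as tf
tf1 = tf.compat.v1
tf1.disable_eager_execution()

g = tf1.Graph()
with g.as_default():
    x = tf1.placeholder(tf.float32, shape=[None, 3], name="x")
    w = tf1.get_variable("w", shape=[3, 1])          # mutable shared state
    y = tf1.matmul(x, w)                             # pure dataflow node
    loss = tf1.reduce_mean(tf1.square(y))
    train = tf1.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf1.Session(graph=g) as sess:                   # maps graph to devices
    sess.run(tf1.global_variables_initializer())
    for _ in range(5):
        _, l = sess.run([train, loss], feed_dict={x: [[1., 2., 3.]]})
    print("loss:", l)
```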