Acceleration of stereo-matching on multi-core CPU and GPU
This paper presents an accelerated version of a dense stereo-correspondence algorithm for two different parallel architectures: multi-core CPU and GPU. The algorithm is part of the vision system developed for a binocular robot head in the context of the CloPeMa research project, which focuses on the design of a new clothes-folding robot with real-time and high-resolution requirements for its vision system. The performance analysis shows that the parallelised stereo-matching algorithm is significantly accelerated, achieving 12x and 176x speed-ups on the multi-core CPU and the GPU respectively, compared with a non-SIMD single-threaded CPU implementation. To analyse the origin of the speed-up and gain a deeper understanding of the optimal hardware choice, the algorithm was broken into key sub-tasks and its performance was tested on four different hardware architectures.
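The abstract does not reproduce the algorithm, but the kind of per-pixel work that parallelises this well can be illustrated with a minimal OpenMP sketch of a sum-of-absolute-differences (SAD) block matcher. The function name, window size, and disparity range below are illustrative assumptions, not the CloPeMa pipeline itself.

    // Hedged sketch: SAD block matching parallelised with OpenMP.
    // Each output row is independent, so rows are distributed across cores.
    #include <cstdint>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    void match_disparity(const uint8_t* left, const uint8_t* right,
                         int width, int height, int max_disp, int win,
                         std::vector<int>& disparity) {
        #pragma omp parallel for schedule(dynamic)
        for (int y = win; y < height - win; ++y) {
            for (int x = win + max_disp; x < width - win; ++x) {
                int best_cost = std::numeric_limits<int>::max();
                int best_d = 0;
                for (int d = 0; d < max_disp; ++d) {        // candidate disparities
                    int cost = 0;
                    for (int dy = -win; dy <= win; ++dy)     // SAD over the window
                        for (int dx = -win; dx <= win; ++dx) {
                            int l = left [(y + dy) * width + (x + dx)];
                            int r = right[(y + dy) * width + (x + dx - d)];
                            cost += std::abs(l - r);
                        }
                    if (cost < best_cost) { best_cost = cost; best_d = d; }
                }
                disparity[y * width + x] = best_d;
            }
        }
    }

The same loop nest maps naturally to a GPU by assigning one thread per output pixel, which is consistent with the much larger GPU speed-up reported above.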
The Parallel Algorithm for the 2-D Discrete Wavelet Transform
The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. Since the lifting scheme consists of a small number of operations, it is preferred for processing on single-core CPUs. For parallel processing on multi-core processors, however, this scheme is inappropriate due to its large number of steps: on such architectures, the number of steps corresponds to the number of synchronisation points at which data are exchanged, and these points often form a performance bottleneck. Our approach appropriately rearranges the calculations inside the transform and thereby reduces the number of steps; in other words, we propose a new scheme that is friendly to parallel environments. When evaluated on multi-core CPUs, our scheme consistently outperforms the original lifting scheme. The evaluation was performed on a 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
Comment: accepted for publication at ICGIP 201
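To make the synchronisation argument concrete, here is a minimal sketch of one 1-D pass of the baseline separable lifting scheme (CDF 5/3) that the paper improves upon; the rearranged scheme itself is not reproduced here. Each lifting step must finish over the whole signal before the next begins, so in a parallel implementation every step is one barrier, which is exactly the cost the proposed scheme reduces. Boundary handling is simplified.

    // Hedged sketch: one 1-D CDF 5/3 lifting pass, in place.
    // Step boundaries are the synchronisation points discussed above.
    #include <vector>

    void lifting_53(std::vector<float>& x) {
        const int n = static_cast<int>(x.size());  // assume n even, n >= 4
        // Predict step: odd samples become detail coefficients.
        for (int i = 1; i < n - 1; i += 2)
            x[i] -= 0.5f * (x[i - 1] + x[i + 1]);
        x[n - 1] -= x[n - 2];                      // mirrored boundary
        // Update step: even samples become approximation coefficients.
        x[0] += 0.5f * x[1];                       // mirrored boundary
        for (int i = 2; i < n; i += 2)
            x[i] += 0.25f * (x[i - 1] + x[i + 1]);
    }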
An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics
Near-sensor data analytics is a promising direction for IoT endpoints, as it minimizes the energy spent on communication and reduces network load - but it also poses security concerns, as valuable data is stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address these data-security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a System-on-Chip based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, while supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65nm technology, consumes less than 20mW on average at 0.8V, achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life, flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep CNN consuming 3.16pJ per equivalent RISC op; local CNN-based face detection with secured remote recognition at 5.74pJ/op; and seizure detection with encrypted data collection from EEG within 12.7pJ/op.
Comment: 15 pages, 12 figures, accepted for publication in the IEEE Transactions on Circuits and Systems I: Regular Papers
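A quick back-of-the-envelope check shows what the quoted efficiency figures imply: under a fixed power envelope P, an energy efficiency E in pJ/B bounds sustained throughput at roughly P / E. The sketch below uses only the 20mW and 70pJ/B numbers from the abstract and ignores that the envelope covers the whole SoC, so it is an upper bound, not a measured figure.

    // Rough throughput bound implied by the abstract's figures.
    #include <cstdio>

    int main() {
        const double power_w      = 20e-3;  // 20 mW envelope (from abstract)
        const double enc_pj_per_b = 70.0;   // 70 pJ/B encryption (from abstract)
        const double bytes_per_s  = power_w / (enc_pj_per_b * 1e-12);
        std::printf("max encryption throughput ~ %.0f MB/s\n", bytes_per_s / 1e6);
        return 0;                           // prints roughly 286 MB/s
    }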
GPU Acceleration of Image Convolution using Spatially-varying Kernel
Image subtraction in astronomy is a tool for the discovery of transient objects such as asteroids, extra-solar planets, and supernovae. A convolution technique is used to match the point spread functions (PSFs) between images of the same field taken at different times. Particularly suitable for large-scale images is a computationally intensive spatially-varying kernel. The underlying algorithm is inherently massively parallel because a unique kernel is generated at every pixel location. The spatially-varying kernel cannot be efficiently computed through the Convolution Theorem and thus does not lend itself to acceleration by the Fast Fourier Transform (FFT). This work presents results of an accelerated implementation of spatially-varying kernel image convolution on multi-core CPUs with OpenMP and on graphics processing units (GPUs). Typical speedups were a factor of 50 over ANSI C and a factor of 1000 over the initial IDL implementation, demonstrating that these techniques are a practical and high-impact path to terabyte-per-night image pipelines and petascale processing.
Comment: 4 pages. Accepted to IEEE-ICIP 201
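The structure of the computation can be sketched as follows: because the kernel changes at every pixel, the FFT shortcut does not apply, but every output pixel is independent, which is why the problem parallelises so well. The kernel model below (a Gaussian whose width drifts across the frame) is a hypothetical stand-in for the real PSF-matching kernel, not the paper's implementation.

    // Hedged sketch: spatially-varying convolution with OpenMP.
    #include <cmath>
    #include <vector>

    // Hypothetical per-pixel kernel: a normalised Gaussian whose width
    // varies smoothly across the frame (stand-in for a real PSF model).
    void make_kernel(int x, int y, int w, int h, int ksize,
                     std::vector<float>& k) {
        const int r = ksize / 2;
        const float sigma = 1.0f + float(x) / w + float(y) / h;  // assumed
        float sum = 0.0f;
        for (int dy = -r; dy <= r; ++dy)
            for (int dx = -r; dx <= r; ++dx) {
                const float v =
                    std::exp(-(dx * dx + dy * dy) / (2.0f * sigma * sigma));
                k[(dy + r) * ksize + (dx + r)] = v;
                sum += v;
            }
        for (float& v : k) v /= sum;        // normalise to preserve flux
    }

    void convolve_varying(const std::vector<float>& img, int w, int h,
                          int ksize, std::vector<float>& out) {
        const int r = ksize / 2;
        #pragma omp parallel for schedule(static)
        for (int y = r; y < h - r; ++y) {
            std::vector<float> k(ksize * ksize);   // per-thread scratch
            for (int x = r; x < w - r; ++x) {
                make_kernel(x, y, w, h, ksize, k); // unique kernel per pixel
                float acc = 0.0f;
                for (int dy = -r; dy <= r; ++dy)
                    for (int dx = -r; dx <= r; ++dx)
                        acc += k[(dy + r) * ksize + (dx + r)]
                             * img[(y + dy) * w + (x + dx)];
                out[y * w + x] = acc;
            }
        }
    }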
Hyperdrive: A Multi-Chip Systolically Scalable Binary-Weight CNN Inference Engine
Deep neural networks have achieved impressive results in computer vision and machine learning. Unfortunately, state-of-the-art networks are extremely compute- and memory-intensive, which makes them unsuitable for mW-devices such as IoT end-nodes. Aggressive quantization of these networks dramatically reduces their computation and memory footprint. Binary-weight neural networks (BWNs) follow this trend, pushing weight quantization to the limit. Hardware accelerators for BWNs presented up to now have focused on core efficiency, disregarding the I/O bandwidth and system-level efficiency that are crucial for the deployment of accelerators in ultra-low-power devices. We present Hyperdrive, a BWN accelerator that dramatically reduces I/O bandwidth by exploiting a novel binary-weight streaming approach. It supports arbitrarily sized convolutional neural network architectures and input resolutions by exploiting the natural scalability of its compute units at both chip and system level: Hyperdrive chips are arranged systolically in a 2D mesh, processing the entire feature map together in parallel. Hyperdrive achieves a system-level efficiency of 4.3 TOp/s/W (i.e., including I/Os), 3.1x higher than state-of-the-art BWN accelerators, even though its core uses resource-intensive FP16 arithmetic for increased robustness.
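The arithmetic that makes BWNs attractive for such accelerators can be shown in a few lines: with weights constrained to +1/-1, a multiply collapses into a conditional add or subtract of the activation, so only activations need wide arithmetic. The sketch below uses float as a stand-in for the FP16 used on-chip and is an illustration of the general BWN principle, not Hyperdrive's datapath.

    // Hedged sketch: a binary-weight dot product. Weights are packed one
    // bit per value (bit set = +1, bit clear = -1); caller supplies
    // ceil(n/8) weight bytes for n activations.
    #include <cstdint>
    #include <vector>

    float bwn_dot(const std::vector<uint8_t>& w_bits,  // packed weights
                  const std::vector<float>& act) {     // activations
        float acc = 0.0f;
        for (std::size_t i = 0; i < act.size(); ++i) {
            const bool bit = (w_bits[i / 8] >> (i % 8)) & 1;  // unpack weight
            acc += bit ? act[i] : -act[i];                    // +1/-1 "multiply"
        }
        return acc;
    }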