81 research outputs found
Parallel stereo vision algorithm
Integrating a stereo-photogrammetric robot
head into a real-time system requires software
solutions that rapidly resolve the stereo correspondence
problem. The stereo-matcher presented in this
paper therefore uses code parallelisation and was
tested on three different processors with x87 and AVX.
The results show that a 5-megapixel colour image can
be matched in 5.55 seconds, or as monochrome in 3.3
seconds.
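The abstract does not spell out the matcher's cost function, so as a rough illustration only, the sketch below uses a sum-of-absolute-differences (SAD) block matcher over one rectified grayscale scanline; window size, disparity range and all names are assumptions, not the paper's code.

```python
# Hypothetical SAD stereo matching on a single rectified scanline.

def sad(left_row, right_row, x_l, x_r, half):
    """Sum of absolute differences over a small 1-D window."""
    return sum(abs(left_row[x_l + k] - right_row[x_r + k])
               for k in range(-half, half + 1))

def disparity_row(left_row, right_row, max_disp=16, half=2):
    """For each pixel, find the horizontal shift minimising SAD."""
    width = len(left_row)
    disp = [0] * width
    for x in range(half, width - half):
        best_cost, best_d = float("inf"), 0
        for d in range(min(max_disp, x - half) + 1):
            cost = sad(left_row, right_row, x, x - d, half)
            if cost < best_cost:
                best_cost, best_d = cost, d
        disp[x] = best_d
    return disp

# Tiny example: the right view is the left row shifted by 3 pixels.
left = [10, 10, 50, 80, 50, 10, 10, 10, 10, 10, 10, 10]
right = left[3:] + [10, 10, 10]
disparities = disparity_row(left, right, max_disp=5, half=1)
```

In the textured region around the bright pixel, the recovered disparity is 3, matching the simulated shift; real implementations vectorise the inner SAD loop, which is where SIMD instruction sets such as AVX pay off.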
Acceleration of stereo-matching on multi-core CPU and GPU
This paper presents an accelerated version of a
dense stereo-correspondence algorithm for two different parallelism
enabled architectures, multi-core CPU and GPU. The
algorithm is part of the vision system developed for a binocular
robot-head in the context of the CloPeMa research project.
This research project focuses on the conception of a new clothes
folding robot with real-time and high resolution requirements
for the vision system. The performance analysis shows that
the parallelised stereo-matching algorithm has been significantly
accelerated, achieving 12x and 176x speed-ups for
multi-core CPU and GPU respectively, compared with a non-SIMD
single-threaded CPU implementation. To analyse the origin of the
speed-up and gain a deeper understanding of the optimal hardware
choice, the algorithm was broken into key sub-tasks and its
performance was tested on four different hardware architectures.
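The parallelisation strategy itself is not detailed in the abstract; a common decomposition for dense stereo correspondence is that each image row can be matched independently. The sketch below illustrates that row-wise decomposition only; the per-row matcher, worker count and all names are placeholders, and a real implementation would use processes, SIMD or a GPU kernel rather than a Python thread pool.

```python
# Illustrative row-wise data-parallel decomposition (not the paper's code).
from concurrent.futures import ThreadPoolExecutor

def match_row(y, left, right):
    # Placeholder per-row matcher: returns (row index, dummy disparities).
    return y, [0] * len(left[y])

def match_image(left, right, workers=4):
    """Distribute independent scanlines across a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(match_row, y, left, right)
                   for y in range(len(left))]
        results = dict(f.result() for f in futures)
    # Reassemble the disparity map with rows in their original order.
    return [results[y] for y in range(len(left))]

left = [[0] * 8 for _ in range(6)]
right = [[0] * 8 for _ in range(6)]
disparity_map = match_image(left, right)
```

The GPU case maps the same decomposition onto thousands of threads (one per pixel or per row), which is why the speed-up gap between multi-core CPU and GPU is so large for this workload.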
Facilitating High Performance Code Parallelization
With the surge of social media on one hand and the ease of obtaining information from cheap sensing devices and open-source APIs on the other, the amount of data available for processing is vastly increasing as well. In addition, the world of computing has recently been witnessing a growing shift towards massively parallel distributed systems, due to the increasing importance of transforming data into knowledge in today’s data-driven world. At the core of data analysis for all sorts of applications lies pattern matching. Therefore, parallelizing pattern matching algorithms should be made efficient in order to cater to this ever-increasing abundance of data. We propose a method that automatically detects a user’s single-threaded function call that searches for a pattern using Java’s standard regular expression library, and replaces it with our own data-parallel implementation using Java bytecode injection. Our approach facilitates parallel processing on different platforms, consisting of shared-memory systems (using multithreading and NVIDIA GPUs) and distributed systems (using MPI and Hadoop). The major contributions of our implementation consist of reducing the execution time while at the same time being transparent to the user. In the same spirit of facilitating high-performance code parallelization, we also present a tool that automatically generates Spark Java code from minimal user-supplied inputs. Spark has emerged as the tool of choice for efficient big data analysis. However, users still have to learn the complicated Spark API in order to write even a simple application. Our tool is easy to use, interactive, and offers Spark’s native Java API performance. To the best of our knowledge and at the time of this writing, no such tool has yet been implemented.
Accelerating Event Stream Processing in On- and Offline Systems
Due to a growing number of data producers and their ever-increasing data volume, the ability to ingest, analyze, and store potentially never-ending streams of data is a mission-critical task in today's data processing landscape.
A widespread form of data streams are event streams, which consist of continuously arriving notifications about some real-world phenomena. For example, a temperature sensor naturally generates an event stream by periodically measuring the temperature and reporting it with measurement time in case of a substantial change to the previous measurement.
In this thesis, we consider two kinds of event stream processing: online and offline. Online refers to processing events solely in main memory as soon as they arrive, while offline means processing event data previously persisted to non-volatile storage. Both modes are supported by widely used scale-out general-purpose stream processing engines (SPEs) like Apache Flink or Spark Streaming. However, such engines suffer from two significant deficiencies that severely limit their processing performance. First, for offline processing, they load the entire stream from non-volatile secondary storage and replay all data items into the associated online engine in order of their original arrival. While this naturally ensures unified query semantics for on- and offline processing, the costs for reading the entire stream from non-volatile storage quickly dominate the overall processing costs.
Second, modern SPEs focus on scaling out computations across the nodes of a cluster, but use only a fraction of the available resources of individual nodes. This thesis tackles those problems with three different approaches.
First, we present novel techniques for the offline processing of two important query types (windowed aggregation and sequential pattern matching). Our methods utilize well-understood indexing techniques to reduce the total amount of data to read from non-volatile storage. We show that this improves the overall query runtime significantly. In particular, this thesis develops the first index-based algorithms for pattern queries expressed with the Match_Recognize clause, a new and powerful language feature of SQL that has received little attention so far.
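The indexing idea in this paragraph can be illustrated with a deliberately simple stand-in: if persisted events are ordered by timestamp, a sorted index lets a windowed aggregation locate and read only the events inside the window instead of replaying the whole stream. The thesis' actual index structures and algorithms are more sophisticated; everything below is an assumed toy model.

```python
# Toy illustration of index-based offline windowed aggregation.
import bisect

# "Persisted" stream of (timestamp, value) pairs, ordered by arrival.
stream = [(t, t * 2) for t in range(0, 1000, 5)]
timestamps = [t for t, _ in stream]   # the index: sorted timestamps

def windowed_sum(lo, hi):
    """Sum values with lo <= timestamp < hi, touching only that range
    of the persisted stream instead of replaying all of it."""
    i = bisect.bisect_left(timestamps, lo)
    j = bisect.bisect_left(timestamps, hi)
    return sum(v for _, v in stream[i:j])

result = windowed_sum(100, 120)   # reads 4 events out of 200
```

The binary search narrows the read to the window's events, which is the source of the runtime improvement: I/O becomes proportional to the window size rather than the stream length.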
Second, we show how to maximize resource utilization of single nodes by exploiting the capabilities of modern hardware. To this end, we develop a prototypical shared-memory CPU-GPU-enabled event processing system. The system provides implementations of all major event processing operators (filtering, windowed aggregation, windowed join, and sequential pattern matching). Our experiments reveal that, regarding resource utilization and processing throughput, such a hardware-enabled system is superior to hardware-agnostic general-purpose engines.
Finally, we present TPStream, a new operator for pattern matching over temporal intervals. TPStream achieves low processing latency and, in contrast to sequential pattern matching, is easily parallelizable even for unpartitioned input streams. This results in maximized resource utilization, especially for modern CPUs with multiple cores.
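To make the notion of "pattern matching over temporal intervals" concrete, the toy sketch below derives intervals during which a predicate holds on each stream and then tests a temporal relation (Allen's "overlaps") between them. This is only in the spirit of the approach; TPStream's actual operator, semantics and names differ.

```python
# Toy interval derivation and temporal-relation test (illustrative only).

def derive_intervals(events, predicate):
    """events: time-ordered (timestamp, value) pairs. Returns [start, end)
    intervals covering maximal runs where predicate(value) is true."""
    intervals, start = [], None
    for t, v in events:
        if predicate(v) and start is None:
            start = t
        elif not predicate(v) and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, events[-1][0] + 1))
    return intervals

def overlaps(a, b):
    """Allen 'overlaps': a starts first and the intervals share time."""
    return a[0] < b[0] < a[1] < b[1]

temp = [(t, 20 + (5 if 3 <= t <= 7 else 0)) for t in range(10)]
pres = [(t, 1 if 5 <= t <= 9 else 0) for t in range(10)]
hot = derive_intervals(temp, lambda v: v > 22)     # high-temperature run
high = derive_intervals(pres, lambda v: v == 1)    # high-pressure run
match = overlaps(hot[0], high[0])
```

Because each stream's intervals can be derived independently before the relation test, this formulation parallelizes naturally, which is the property the abstract highlights.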
Image Stream Similarity Search in GPU Clusters
Images are an important part of today’s society. They are everywhere on the internet
and in computing, from news articles to areas as diverse as medicine, autonomous
vehicles and social media. This enormous volume of images requires massive amounts
of processing power to process, upload, download and search. The ability to search
for an image, and find similar images in a library of millions of others, gives users
great advantages. Different fields have different constraints, but all benefit from the
quick processing that can be achieved.
Problems arise when creating a solution for this. Computing the similarity between
many images, performing thousands of comparisons every second, is a challenge. The
results of such computations are very large and pose a further challenge to process.
Solutions to these problems often take advantage of graphs to index images and their
similarity; the graph can then be used for the querying process. Creating
and processing such a graph in an acceptable time frame poses yet another challenge.
In order to tackle these challenges, we take advantage of a cluster of machines equipped
with Graphics Processing Units (GPUs), enabling us to parallelize the process of describing
an image visually and finding other images similar to it in an acceptable time frame.
GPUs are highly efficient at processing data such as images and graphs through
heavily parallelizable algorithms. We propose a scalable and modular system that
takes advantage of GPUs, distributed computing and fine-grained parallelism to detect
image features, index images in a graph and allow users to search for similar images.
The solution we propose is able to compare up to 5000 images every second. It can
also query a graph with thousands of nodes and millions of edges in a matter
of milliseconds, achieving a very efficient query speed. The modularity of our solution
allows algorithms and individual steps to be interchanged, which
provides great adaptability to any needs.
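The indexing step can be sketched in miniature: reduce each image to a feature vector, build a k-nearest-neighbour graph by pairwise cosine similarity, and answer queries by following graph edges. The vectors, dimensionality and names below are made up for illustration; in the actual system, GPU-computed descriptors and a far larger graph take their place.

```python
# Toy k-NN similarity graph over hand-made feature vectors (illustrative).
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn_graph(features, k=2):
    """For each image, keep edges to its k most similar images."""
    graph = {}
    for i, f in enumerate(features):
        sims = [(cosine(f, g), j) for j, g in enumerate(features) if j != i]
        sims.sort(reverse=True)
        graph[i] = [j for _, j in sims[:k]]
    return graph

features = [
    [1.0, 0.0, 0.1],   # image 0
    [0.9, 0.1, 0.0],   # image 1 (similar to 0)
    [0.0, 1.0, 0.0],   # image 2
    [0.1, 0.9, 0.2],   # image 3 (similar to 2)
]
graph = knn_graph(features, k=1)
```

A query then starts from the node of a described image and walks its edges; because the pairwise similarity computation is embarrassingly parallel, it is the part offloaded to the GPU cluster.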
A Study of Reconfigurable Accelerators for Cloud Computing
Due to the exponential increase in network traffic in data centers, thousands of servers interconnected with high-bandwidth switches are required. Field Programmable Gate Arrays (FPGAs) integrated into the cloud ecosystem offer high performance and energy efficiency, making them active resources that are easy to program and reconfigure. This paper looks at FPGAs as reconfigurable accelerators for cloud computing and presents the main hardware accelerators that have been proposed for widely used cloud computing applications such as MapReduce, Spark, Memcached and databases.
Real Time Panoramic Image Processing
Image stitching algorithms join sets of images together and provide a wider field of view than an image from a single standard camera. Traditional techniques can adequately produce a stitch for a static set of images, but suffer when lighting conditions differ between the two images. Additionally, traditional techniques suffer from processing times that are too slow for real-time use cases. We propose a solution which resolves the issues encountered by traditional image stitching techniques. To resolve the issues with lighting differences, two blending schemes have been implemented: a standard approach and a superpixel approach. To avoid recomputing the stitch for every frame, the solution is cached; to verify the integrity of the cached solution, a validation scheme has been implemented. Using this scheme, invalid solutions can be detected and the cache regenerated. Finally, these components are packaged together in a parallel processing architecture to ensure that frame processing is never interrupted.
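The "standard" blending idea can be shown in one dimension: in the region where the two images overlap, the pixel weight ramps linearly from the left image to the right image (feathering), which hides brightness differences across the seam. The abstract does not give the exact scheme, so the function below, its name and its parameters are an assumed minimal version; the real system works on full frames and adds a superpixel variant.

```python
# Minimal 1-D feather blend across an overlap region (illustrative).

def feather_blend(left, right, overlap):
    """left and right are pixel rows sharing `overlap` pixels at the seam.
    In the overlap, the weight ramps from the left image to the right."""
    out = list(left[:-overlap]) if overlap else list(left)
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)           # ramps toward the right image
        l = left[len(left) - overlap + i]
        r = right[i]
        out.append((1 - w) * l + w * r)
    out.extend(right[overlap:])
    return out

left = [100, 100, 100, 100]    # darker exposure
right = [200, 200, 200, 200]   # brighter exposure
blended = feather_blend(left, right, overlap=2)
```

Instead of a hard jump from 100 to 200 at the seam, the blended row steps through intermediate values, which is precisely what masks differing lighting conditions between the source images.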