888 research outputs found
SkelCL: enhancing OpenCL for high-level programming of multi-GPU systems
Application development for modern high-performance systems with Graphics Processing Units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs.
In this paper, we present SkelCL – a high-level programming approach for systems with multiple GPUs and its implementation as a library on top of OpenCL. SkelCL provides three main enhancements to the OpenCL standard: 1) computations are conveniently expressed using parallel algorithmic patterns (skeletons); 2) memory management is simplified using parallel
container data types (vectors and matrices); 3) an automatic data (re)distribution mechanism allows for implicit data movements between
GPUs and ensures scalability when using multiple GPUs. We demonstrate how SkelCL is used to implement parallel applications on one- and two-dimensional data. We report experimental results to evaluate our approach in terms of programming effort and performance
Large scale evaluation of local image feature detectors on homography datasets
We present a large scale benchmark for the evaluation of local feature
detectors. Our key innovation is the introduction of a new evaluation protocol
which extends and improves the standard detection repeatability measure. The
new protocol is better for assessment on a large number of images and reduces
the dependency of the results on unwanted distractors such as the number of
detected features and the feature magnification factor. Additionally, our
protocol provides a comprehensive assessment of the expected performance of
detectors under several practical scenarios. Using images from the
recently-introduced HPatches dataset, we evaluate a range of state-of-the-art
local feature detectors on two main tasks: viewpoint and illumination invariant
detection. Contrary to previous detector evaluations, our study contains an
order of magnitude more image sequences, resulting in a quantitative evaluation
significantly more robust to over-fitting. We also show that traditional
detectors are still very competitive when compared to recent deep-learning
alternatives.Comment: Accepted to BMVC 201
Accelerating incoherent dedispersion
Incoherent dedispersion is a computationally intensive problem that appears
frequently in pulsar and transient astronomy. For current and future transient
pipelines, dedispersion can dominate the total execution time, meaning its
computational speed acts as a constraint on the quality and quantity of science
results. It is thus critical that the algorithm be able to take advantage of
trends in commodity computing hardware. With this goal in mind, we present
analysis of the 'direct', 'tree' and 'sub-band' dedispersion algorithms with
respect to their potential for efficient execution on modern graphics
processing units (GPUs). We find all three to be excellent candidates, and
proceed to describe implementations in C for CUDA using insight gained from the
analysis. Using recent CPU and GPU hardware, the transition to the GPU provides
a speed-up of 9x for the direct algorithm when compared to an optimised
quad-core CPU code. For realistic recent survey parameters, these speeds are
high enough that further optimisation is unnecessary to achieve real-time
processing. Where further speed-ups are desirable, we find that the tree and
sub-band algorithms are able to provide 3-7x better performance at the cost of
certain smearing, memory consumption and development time trade-offs. We finish
with a discussion of the implications of these results for future transient
surveys. Our GPU dedispersion code is publicly available as a C library at:
http://dedisp.googlecode.com/Comment: 15 pages, 4 figures, 2 tables, accepted for publication in MNRA
Hardware Acceleration in Image Stitching: GPU vs FPGA
Image stitching is a process where two or more images with an overlapping field of view are combined. This process is commonly used to increase the field of view or image quality of a system. While this process is not particularly difficult for modern personal computers, hardware acceleration is often required to achieve real-time performance in low-power image stitching solutions. In this thesis, two separate hardware accelerated image stitching solutions are developed and compared. One solution is accelerated using a Xilinx Zynq UltraScale+ ZU3EG FPGA and the other solution is accelerated using an Nvidia RTX 2070 Super GPU. The image stitching solutions implemented in this paper increase the system’s field of view and involve the end-to-end process of feature detection, image registration, and image mixing. The latency, resource utilization, and power consumption for the accelerated portions of each system are compared and each systems tradeoffs and use cases are considered
Latency and accuracy optimized mobile face detection
Abstract. Face detection is a preprocessing step in many computer vision applications. Important factors are accuracy, inference duration, and energy efficiency of the detection framework. Computationally light detectors that execute in real-time are a requirement for many application areas, such as face tracking and recognition. Typical operating platforms in everyday use are smartphones and embedded devices, which have limited computation capacity.
The capability of face detectors is comparable to the ability of a human in easy detection tasks. When the conditions change, the challenges become different. Current challenges in face detection include atypically posed and tiny faces. Partially occluded faces and dim or bright environments pose challenges for detection systems. State-of-the-art performance in face detection research employs deep learning methods called neural networks, which loosely imitate the mammalian brain system. The most relevant technologies are convolutional neural networks, which are designed for local feature description.
In this thesis, the main computational optimization approach is neural network quantization. The network models were delegated to digital signal processors and graphics processing units. Quantization was shown to reduce the latency of computation substantially. The most energy-efficient inference was achieved through digital signal processor delegation. Multithreading was used for inference acceleration. It reduced the amount of energy consumption per algorithm run.Latenssi- ja tarkkuusoptimoitu kasvontunnistus mobiililaitteilla. Tiivistelmä. Kasvojen ilmaisu on esikäsittelyvaihe monelle konenäön sovellukselle. Tärkeitä kasvoilmaisimen ominaisuuksia ovat tarkkuus, energiatehokkuus ja suoritusnopeus. Monet sovellukset vaativat laskennallisesti kevyitä ilmaisimia, jotka toimivat reaaliajassa. Esimerkkejä sovelluksista ovat kasvojen seuranta- ja tunnistusjärjestelmät. Yleisiä käyttöalustoja ovat älypuhelimet ja sulautetut järjestelmät, joiden laskentakapasiteetti on rajallinen.
Kasvonilmaisimien tarkkuus vastaa ihmisen kykyä helpoissa ilmaisuissa. Nykyiset ongelmat kasvojen ilmaisussa liittyvät epätyypillisiin asentoihin ja erityisen pieniin kasvokokoihin. Myös kasvojen osittainen peittyminen, ja pimeät ja kirkkaat ympäristöt, vaikeuttavat ilmaisua. Neuroverkkoja käytetään tekoälyjärjestelmissä, joiden lähtökohtana on ollut mallintaa nisäkkäiden aivojen toimintaa. Konvoluutiopohjaiset neuroverkot ovat erikoistuneet paikallisten piirteiden analysointiin.
Tässä opinnäytetyössä käytetty laskennallisen optimoinnin menetelmä on neuroverkkojen kvantisointi. Neuroverkkojen ajo delegoitiin digitaalisille signaalinkäsittely- ja grafiikkasuorittimille. Kvantisoinnin osoitettiin vähentävän laskenta-aikaa huomattavasti ja suurin energiatehokkuus saavutettiin digitaalisen signaaliprosessorin avulla. Suoritusnopeutta lisättiin monisäikeistyksellä, jonka havaittiin vähentävän energiankulutusta
- …