Template matching method for the analysis of interstellar cloud structure
The structure of the interstellar medium can be characterised at large scales in
terms of its global statistics (e.g. power spectra) and at small scales by the
properties of individual cores. Interest in structures at intermediate scales
has been increasing, and a number of methods have been developed for the
analysis of filamentary structures. We describe the application of the generic
template-matching (TM) method to the analysis of maps. Our aim is to show that
it provides a fast and still relatively robust way to identify elongated
structures or other image features. We present the implementation of a TM
algorithm for map analysis. The results are compared against the rolling Hough
transform (RHT), one of the methods previously used to identify filamentary
structures. We illustrate the method by applying it to Herschel surface
brightness data. The performance of the TM method is found to be comparable to
that of RHT, but TM appears to be more robust with regard to the input parameters,
for example those related to the selected spatial scales. Small modifications
of TM enable one to target structures at different size and intensity levels.
In addition to elongated features, we demonstrate the possibility of using TM
to identify other types of structures as well. The TM method is a viable tool for
data quality control, exploratory data analysis, and even quantitative analysis
of structures in image data.
Comment: 12 pages, accepted to A&
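As a toy illustration of the template-matching idea (not the authors' implementation), the sketch below correlates a noisy synthetic map with two elongated bar templates and checks which orientation responds most strongly. The map, the filament, and all sizes are invented for illustration:

```python
import numpy as np

def match_template(image, template):
    """Slide the template over the image and return the
    normalized cross-correlation response map."""
    th, tw = template.shape
    t = template - template.mean()
    tn = np.sqrt((t ** 2).sum())
    out = np.zeros((image.shape[0] - th + 1, image.shape[1] - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * tn
            out[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return out

# Synthetic surface-brightness map with one horizontal "filament".
rng = np.random.default_rng(0)
image = rng.normal(0.0, 0.05, (32, 32))
image[15, 4:28] += 1.0

# Elongated templates at two orientations: horizontal vs. vertical bar.
horiz = np.zeros((5, 11))
horiz[2, :] = 1.0
vert = horiz.T

r_h = match_template(image, horiz).max()
r_v = match_template(image, vert).max()
print(r_h > r_v)  # the horizontal template responds more strongly
```

A real pipeline would use a bank of templates at many orientations and scales; the maximum response over the bank then encodes both the presence and the orientation of elongated structure at each pixel.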
MELT - a Translated Domain Specific Language Embedded in the GCC Compiler
The GCC free compiler is a very large piece of software, compiling source in several
languages for many targets on various systems. It can be extended by plugins,
which may take advantage of its power to provide extra specific functionality
(warnings, optimizations, source refactoring or navigation) by processing
various GCC internal representations (Gimple, Tree, ...). Writing plugins in C
is a complex and time-consuming task, while customizing GCC by embedding an
existing scripting language is impractical. We describe MELT, a specific
Lisp-like DSL which fits well into the existing GCC technology and offers
high-level features (functional, object-oriented and reflective programming,
pattern matching). MELT is translated to C tailored to the GCC internals and
provides various features to facilitate this. This work shows that even huge
legacy software can be extended a posteriori by specifically tailored and
translated high-level DSLs.
Comment: In Proceedings DSL 2011, arXiv:1109.032
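To illustrate the general translate-to-C idea in miniature (MELT's actual syntax and translator are far richer), here is a toy Python sketch that lowers s-expressions of an invented Lisp-like mini-language into C expressions; the mini-language is not MELT:

```python
def to_c(expr):
    """Recursively lower one expression; lists are (operator, operands...)."""
    if isinstance(expr, (int, float)):
        return str(expr)
    if isinstance(expr, str):               # a variable reference
        return expr
    op, *args = expr
    if op in ("+", "-", "*", "/", "<", ">", "=="):
        # binary/variadic operators become parenthesised infix C
        return "(" + (" %s " % op).join(to_c(a) for a in args) + ")"
    if op == "if":                          # conditional expression -> C ternary
        cond, then, other = (to_c(a) for a in args)
        return "(%s ? %s : %s)" % (cond, then, other)
    raise ValueError("unknown form: %r" % (op,))

print(to_c(["if", [">", "n", 0], ["*", "n", 2], 0]))
# ((n > 0) ? (n * 2) : 0)
```

The point of the translation approach is visible even here: the DSL stays expression-oriented and pattern-friendly, while the emitted C is plain code the host compiler's toolchain already handles.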
Visual Analysis Algorithms for Embedded Systems
Visual search systems are very popular applications, but on-line versions in 3G wireless environments suffer from network constraints, such as unstable or limited bandwidth, that entail latency in query delivery and significantly degrade the user's experience. An alternative is to exploit the ability of the newest mobile devices to perform heterogeneous activities: not only creating but also processing images. Visual feature extraction and compression can be performed on on-board Graphics Processing Units (GPUs), making smartphones capable of detecting a generic object (matching) in an exact way or of performing a classification activity.
The latest trends in visual search have resulted in dedicated standardization efforts within MPEG, namely the MPEG CDVS (Compact Descriptors for Visual Search) standard. CDVS is an ISO/IEC standard that specifies how to extract a compressed descriptor.
As regards classification, in recent years neural networks have acquired impressive importance and have been applied to several domains. This thesis focuses on the use of deep neural networks to classify images by means of deep learning.
Implementing visual search algorithms and deep-learning-based classification in embedded environments is not a mere code-porting activity. Recent embedded devices, such as development boards with GPGPUs, are equipped with powerful but limited resources. GPU architectures fit particularly well because they allow many operations to be executed in parallel, following the SIMD (Single Instruction, Multiple Data) paradigm. Nonetheless, good design choices are necessary to make the best use of the available hardware and memory.
For visual search, following the MPEG CDVS standard, the contribution of this thesis is an efficient feature-computation phase and a parallel CDVS detector, completely implemented on embedded devices supporting the OpenCL framework. Algorithmic choices and implementation details targeting the intrinsic characteristics of the selected embedded platforms are presented and discussed. Experimental results on several GPUs show that the GPU-based solution is up to 7× faster than the CPU-based one. This speed-up opens new visual search scenarios that exploit fully on-board, real-time computation with no data transfer.
As regards the use of deep convolutional neural networks for off-line image classification, their computational and memory requirements are huge, which is an issue on embedded devices. Most of the complexity derives from the convolutional layers, and in particular from the matrix multiplications they entail. The contribution of this thesis is a self-contained implementation of image classification providing the common layers used in neural networks. The approach relies on a heterogeneous CPU-GPU scheme for performing convolutions in the transform domain. Experimental results show that the heterogeneous scheme described in this thesis achieves a 50× speedup over the CPU-only reference and outperforms a GPU-based reference by 2×, while cutting power consumption by nearly 30%.
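The transform-domain convolution idea can be sketched in a few lines: a "valid" 2-D convolution computed directly is compared with the same result obtained via FFTs, where the sliding-window multiply-accumulate becomes a pointwise product of spectra. This NumPy sketch only illustrates the mathematical equivalence, not the thesis' CPU-GPU implementation:

```python
import numpy as np

def conv2d_direct(x, k):
    """Direct 'valid' 2-D convolution (kernel flipped, signal-processing style)."""
    kh, kw = k.shape
    kf = k[::-1, ::-1]
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * kf).sum()
    return out

def conv2d_fft(x, k):
    """Same convolution in the Fourier domain: pointwise spectral products
    replace the sliding-window multiply-accumulate."""
    h = x.shape[0] + k.shape[0] - 1
    w = x.shape[1] + k.shape[1] - 1
    full = np.fft.irfft2(np.fft.rfft2(x, (h, w)) * np.fft.rfft2(k, (h, w)), (h, w))
    kh, kw = k.shape
    return full[kh - 1:h - kh + 1, kw - 1:w - kw + 1]  # crop to the 'valid' part

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 16))
k = rng.normal(size=(3, 3))
print(np.allclose(conv2d_direct(x, k), conv2d_fft(x, k)))  # True
```

For large inputs the FFT route trades O(n·k) work per output pixel for O(log n) amortised work, which is why transform-domain schemes pay off for the convolutional layers that dominate CNN cost.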
Co-Evaluation of Pattern Matching Algorithms on IoT Devices with Embedded GPUs
Pattern matching is an important building block for many security applications, including Network Intrusion Detection Systems (NIDS). As NIDS grow in functionality and complexity, the time overhead and energy consumption of pattern matching become a significant consideration that limits the deployability of such systems, especially on resource-constrained devices. On the other hand, the emergence of new computing platforms, such as embedded devices with integrated, general-purpose Graphics Processing Units (GPUs), brings new and interesting challenges and opportunities for algorithm design in this setting: how to make use of new architectural features and how to evaluate their effect on algorithm performance. Up to now, work that focuses on pattern matching for such platforms has been limited to specific algorithms in isolation.
In this work, we present a systematic and comprehensive benchmark that allows us to co-evaluate both existing and new pattern matching algorithms on heterogeneous devices equipped with embedded GPUs, suitable for medium- to high-level IoT deployments. We evaluate the algorithms on such a heterogeneous device, in close connection with the architectural features of the platform, and provide insights on how these features affect the algorithms' behavior. We find that, on our target embedded platform, GPU-based pattern matching algorithms have competitive performance compared to the CPU and consume half as much energy as the CPU-based variants.
Based on these insights, we also propose HYBRID, a new pattern matching approach that efficiently combines techniques from existing approaches and outperforms them by 1.4×, across a range of realistic and synthetic data sets. Our benchmark details the effect of various optimizations, thus providing a path forward to making existing security mechanisms such as NIDS deployable on IoT devices.
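As a concrete reference point for the multi-pattern matching that NIDS signature scanning requires, here is a minimal pure-Python Aho-Corasick automaton — a classic CPU baseline; the paper's GPU variants and the HYBRID approach are not reproduced here:

```python
from collections import deque

class AhoCorasick:
    """Minimal multi-pattern matcher: a single pass over the input reports
    every occurrence of every pattern."""

    def __init__(self, patterns):
        self.goto = [{}]   # per-state transition maps (the trie)
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns ending at each state
        for p in patterns:
            s = 0
            for ch in p:
                if ch not in self.goto[s]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[s][ch] = len(self.goto) - 1
                s = self.goto[s][ch]
            self.out[s].append(p)
        # breadth-first construction of failure links
        queue = deque(self.goto[0].values())
        while queue:
            s = queue.popleft()
            for ch, nxt in self.goto[s].items():
                queue.append(nxt)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] += self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start_index, pattern) for every match, in scan order."""
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(ch, 0)
            for p in self.out[s]:
                hits.append((i - len(p) + 1, p))
        return hits

ac = AhoCorasick(["he", "she", "his", "hers"])
print(ac.search("ushers"))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The single-pass, table-driven structure is exactly what makes this family of algorithms attractive for GPU offload: the automaton tables are read-only and many input streams can be scanned in parallel.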
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and NVIDIA, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO)
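The decomposition behind the paper's approach — an elementwise product over all nonzeros followed by a segmented sum over row boundaries — can be sketched sequentially in NumPy. The speculative GPU execution and the CPU fix-up pass are not modelled here; this only shows the two-phase structure on a toy matrix:

```python
import numpy as np

def spmv_csr_segsum(vals, col_idx, row_ptr, x):
    """CSR sparse matrix-vector product as two phases: a flat elementwise
    product over all nonzeros, then a segmented sum over row boundaries."""
    products = vals * x[col_idx]              # phase 1: fully parallel
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    nonempty = row_ptr[:-1] < row_ptr[1:]     # rows with at least one nonzero
    # consecutive nonempty row starts delimit the segments; empty rows stay 0
    y[nonempty] = np.add.reduceat(products, row_ptr[:-1][nonempty])
    return y

# Toy 4x4 matrix with an empty second row:
# [[1 0 2 0],
#  [0 0 0 0],
#  [0 3 0 4],
#  [5 0 0 0]]
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 3, 0])
row_ptr = np.array([0, 2, 2, 4, 5])
x = np.ones(4)
y = spmv_csr_segsum(vals, col_idx, row_ptr, x)
print(y)  # [3. 0. 7. 5.]
```

Phase 1 is load-balanced across nonzeros rather than rows, which is what lets the GPU run it speculatively without knowing the row boundaries up front; the paper's CPU pass then repairs the partial sums where the speculation was wrong.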
A CUDA backend for Marrow and its Optimisation via Machine Learning
Today, various industries such as business and science deal with collecting, processing
and storing massive amounts of data. Conventional CPUs, which are optimised
for sequential performance, struggle to keep up with processing so much data;
GPUs, designed for parallel computation, are far better suited to the task.
Using GPUs for general-purpose processing has become more popular in recent years due
to the need for fast parallel processing, but developing programs that execute on the
GPU can be difficult and time-consuming. Various high-level APIs that compile into
GPU programs exist, but due to the abstraction of lower-level concepts and the lack
of algorithm-specific optimisations, it may not be possible to reach peak performance with them.
Optimisation in particular is an interesting problem: optimisation patterns can very rarely
be applied uniformly to different algorithms, and manually tuning individual programs
is extremely time-consuming.
Machine learning compilation is a concept that has gained attention in recent
years, with good reason. The idea is to train a model using a machine learning
algorithm and have it estimate how to optimise an input program. Predicting
the best optimisations for a program is much faster than finding them manually, and
work using this technique has shown that it can even produce better optimisations.
In this thesis, we work with the Marrow framework and develop a CUDA-based
backend for it, so that low-level GPU code may be generated. Additionally, we
train a machine learning model and use it to automatically optimise the CUDA
code generated from Marrow programs.
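A machine-learned optimiser can be as simple as a nearest-neighbour cost model over previously benchmarked configurations. The sketch below is purely illustrative: the features (problem size, arithmetic intensity), the block sizes, and the training points are all invented and bear no relation to Marrow's actual model:

```python
import math

# Hypothetical training set: (program features, best CUDA block size found
# by exhaustive benchmarking). Every number here is invented for illustration.
samples = [
    ((1e4, 1.0), 64),
    ((1e6, 1.0), 128),
    ((1e6, 8.0), 256),
    ((1e8, 8.0), 256),
]

def predict_block_size(features):
    """1-nearest-neighbour prediction in log-scaled feature space:
    about the simplest possible machine-learned cost model."""
    def dist(a, b):
        return math.dist([math.log10(v + 1) for v in a],
                         [math.log10(v + 1) for v in b])
    return min(samples, key=lambda s: dist(s[0], features))[1]

print(predict_block_size((5e5, 1.2)))  # 128: closest to the (1e6, 1.0) sample
```

Real ML-compilation systems replace the lookup with a trained regressor or ranking model, but the workflow is the same: extract features from the program, predict a configuration, and emit code specialised for it instead of benchmarking every candidate.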