    Symbolic crosschecking of data-parallel floating-point code

    Template matching method for the analysis of interstellar cloud structure

    The structure of the interstellar medium can be characterised at large scales in terms of its global statistics (e.g. power spectra) and at small scales by the properties of individual cores. Interest has been increasing in structures at intermediate scales, resulting in a number of methods being developed for the analysis of filamentary structures. We describe the application of the generic template-matching (TM) method to the analysis of maps. Our aim is to show that it provides a fast and still relatively robust way to identify elongated structures or other image features. We present the implementation of a TM algorithm for map analysis. The results are compared against the rolling Hough transform (RHT), one of the methods previously used to identify filamentary structures. We illustrate the method by applying it to Herschel surface brightness data. The performance of the TM method is found to be comparable to that of RHT, but TM appears to be more robust with respect to the input parameters, for example those related to the selected spatial scales. Small modifications of TM enable one to target structures at different size and intensity levels. In addition to elongated features, we demonstrate the possibility of using TM to identify other types of structures. The TM method is a viable tool for data quality control, exploratory data analysis, and even quantitative analysis of structures in image data. Comment: 12 pages, accepted to A&
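
    As a rough illustration of the generic template-matching idea described in this abstract, the sketch below slides a small mean-subtracted template over a 2D map and records a correlation score at each offset. It is a minimal example under stated assumptions, not the paper's implementation: the function name `match_template`, the plain-list image representation, and the unnormalised score are all hypothetical choices made for illustration.

```python
def match_template(image, template):
    """Return a 2D list of correlation scores, one per valid template offset.

    Hypothetical helper for illustration only. The template is mean-subtracted
    so that flat background regions score zero regardless of their brightness.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])

    # Subtract the template mean so the template sums to zero.
    t_mean = sum(sum(row) for row in template) / (th * tw)
    t = [[v - t_mean for v in row] for row in template]

    scores = []
    for y in range(ih - th + 1):
        row_scores = []
        for x in range(iw - tw + 1):
            # Correlate the template with the image window at offset (y, x).
            s = sum(
                t[j][i] * image[y + j][x + i]
                for j in range(th)
                for i in range(tw)
            )
            row_scores.append(s)
        scores.append(row_scores)
    return scores
```

    With an elongated template (for example a 3×3 array whose middle column is 1 and the rest 0), the score peaks where the window is centred on a vertical ridge. Rotated copies of the template would be needed to pick up other orientations.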

    MELT - a Translated Domain Specific Language Embedded in the GCC Compiler

    GCC, the free compiler, is a very large piece of software that compiles source code in several languages for many targets on various systems. It can be extended by plugins, which may take advantage of its power to provide extra specific functionality (warnings, optimizations, source refactoring or navigation) by processing various GCC internal representations (Gimple, Tree, ...). Writing plugins in C is a complex and time-consuming task, but customizing GCC by embedding an existing scripting language is impractical. We describe MELT, a Lisp-like DSL which fits well into existing GCC technology and offers high-level features (functional, object or reflective programming, pattern matching). MELT is translated to C code suited to GCC internals and provides various features to facilitate this. This work shows that even huge legacy software can be extended a posteriori by specifically tailored and translated high-level DSLs. Comment: In Proceedings DSL 2011, arXiv:1109.032

    Visual Analysis Algorithms for Embedded Systems

    Visual search systems are very popular applications, but on-line versions in 3G wireless environments suffer from network constraints such as unstable or limited bandwidth, which add latency to query delivery and significantly degrade the user experience. An alternative is to exploit the ability of the newest mobile devices to perform heterogeneous activities, such as processing images as well as creating them. Visual feature extraction and compression can be performed on on-board Graphical Processing Units (GPUs), making smartphones capable of detecting a generic object (matching) in an exact way or of performing a classification activity. The latest trends in visual search have resulted in dedicated efforts in MPEG standardization, namely the MPEG CDVS (Compact Descriptor for Visual Search) standard. CDVS is an ISO/IEC standard used to extract a compressed descriptor. As regards classification, in recent years neural networks have gained considerable importance and have been applied to several domains. This thesis focuses on the use of deep neural networks to classify images by means of deep learning. Implementing visual search algorithms and deep learning-based classification in embedded environments is not a mere code-porting activity. Recent embedded devices, such as development boards with GPGPUs, are equipped with powerful but limited resources. GPU architectures fit particularly well, because they allow more operations to be executed in parallel, following the SIMD (Single Instruction Multiple Data) paradigm. Nonetheless, good design choices are necessary to make the best use of the available hardware and memory. For visual search, following the MPEG CDVS standard, the contribution of this thesis is an efficient feature computation phase, a parallel CDVS detector, implemented entirely on embedded devices supporting the OpenCL framework. Algorithmic choices and implementation details that target the intrinsic characteristics of the selected embedded platforms are presented and discussed. Experimental results on several GPUs show that the GPU-based solution is up to 7× faster than the CPU-based one. This speed-up opens new visual search scenarios based entirely on real-time on-board computation with no data transfer. As regards the use of deep convolutional neural networks for off-line image classification, their computational and memory requirements are huge, and this is an issue on embedded devices. Most of the complexity derives from the convolutional layers and in particular from the matrix multiplications they entail. The contribution of this thesis is a self-contained implementation for image classification providing common layers used in neural networks. The approach relies on a heterogeneous CPU-GPU scheme for performing convolutions in the transform domain. Experimental results show that the heterogeneous scheme described in this thesis achieves a 50× speedup over the CPU-only reference and outperforms a GPU-based reference by 2×, while reducing power consumption by nearly 30%.

    Co-Evaluation of Pattern Matching Algorithms on IoT Devices with Embedded GPUs

    Pattern matching is an important building block for many security applications, including Network Intrusion Detection Systems (NIDS). As NIDS grow in functionality and complexity, the time overhead and energy consumption of pattern matching become a significant consideration that limits the deployability of such systems, especially on resource-constrained devices. On the other hand, the emergence of new computing platforms, such as embedded devices with integrated, general-purpose Graphics Processing Units (GPUs), brings new, interesting challenges and opportunities for algorithm design in this setting: how to make use of new architectural features and how to evaluate their effect on algorithm performance. Up to now, work that focuses on pattern matching for such platforms has been limited to specific algorithms in isolation. In this work, we present a systematic and comprehensive benchmark that allows us to co-evaluate both existing and new pattern matching algorithms on heterogeneous devices equipped with embedded GPUs, suitable for medium- to high-level IoT deployments. We evaluate the algorithms on such a heterogeneous device, in close connection with the architectural features of the platform and provide insights on how these features affect the algorithms' behavior. We find that, in our target embedded platform, GPU-based pattern matching algorithms have competitive performance compared to the CPU and consume half as much energy as the CPU-based variants. Based on these insights, we also propose HYBRID, a new pattern matching approach that efficiently combines techniques from existing approaches and outperforms them by 1.4x, across a range of realistic and synthetic data sets. Our benchmark details the effect of various optimizations, thus providing a path forward to make existing security mechanisms such as NIDS deployable on IoT devices.

    Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

    Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR Comment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
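
    To make the two-phase idea in this abstract concrete, the following sketch computes a CSR-format SpMV as per-tile segmented sums followed by a fix-up pass that merges partial sums belonging to rows that span several tiles. It is a sequential, simplified illustration under stated assumptions, not the authors' speculative algorithm: the function name `csr_spmv_segmented` and the `tile` parameter are hypothetical, and the tile loop merely stands in for work that the paper assigns to the GPU, with the final merge standing in for the CPU step.

```python
def csr_spmv_segmented(row_ptr, col_idx, values, x, tile=4):
    """Compute y = A @ x for a CSR matrix using tiled segmented sums.

    Hypothetical helper for illustration only; runs sequentially.
    """
    nnz = len(values)
    n_rows = len(row_ptr) - 1

    # Phase 0: one product per stored nonzero.
    products = [values[k] * x[col_idx[k]] for k in range(nnz)]

    # Expand row_ptr into a per-nonzero row index (empty rows get no entries).
    row_of = [0] * nnz
    for r in range(n_rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row_of[k] = r

    # Phase 1: each tile independently sums runs of equal row index.
    # A row that crosses a tile boundary yields several partial sums.
    partials = []  # list of (row, partial_sum)
    for start in range(0, nnz, tile):
        end = min(start + tile, nnz)
        cur_row, acc = row_of[start], 0.0
        for k in range(start, end):
            if row_of[k] != cur_row:
                partials.append((cur_row, acc))
                cur_row, acc = row_of[k], 0.0
            acc += products[k]
        partials.append((cur_row, acc))

    # Phase 2: fix-up pass that merges partial sums into the result vector.
    y = [0.0] * n_rows
    for r, s in partials:
        y[r] += s
    return y
```

    For example, the 3×3 matrix with rows (1, 0, 2), (0, 0, 0), (3, 4, 5) is stored as `row_ptr=[0, 2, 2, 5]`, `col_idx=[0, 2, 0, 1, 2]`, `values=[1, 2, 3, 4, 5]`. A small `tile` forces the last row to be split across tiles, which exercises the fix-up pass.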

    A CUDA backend for Marrow and its Optimisation via Machine Learning

    Nowadays, fields such as business and science deal with collecting, processing and storing massive amounts of data. Conventional CPUs, which are optimised for sequential performance, struggle to keep up with processing so much data, whereas GPUs, designed for parallel computation, are well suited to the task. Using GPUs for general processing has become more popular in recent years due to the need for fast parallel processing, but developing programs that execute on the GPU can be difficult and time consuming. Various high-level APIs that compile into GPU programs exist; however, because they abstract away lower-level concepts and lack algorithm-specific optimisations, it may not be possible to reach peak performance. Optimisation in particular is an interesting problem: optimisation patterns can very rarely be applied uniformly to different algorithms, and manually tuning individual programs is extremely time consuming. Machine learning compilation is a concept that has gained some attention in recent years, with good reason. The idea is to train a model using a machine learning algorithm and have it estimate how to optimise an input program. Predicting the best optimisations for a program is much faster than finding them manually, and work using this technique has shown that it can yield even better optimisations. In this thesis, we work with the Marrow framework and develop a CUDA-based backend for it, so that low-level GPU code may be generated. Additionally, we train a machine learning model and use it to automatically optimise the CUDA code generated from Marrow programs.