
    Efficient execution of the irregular wavefront propagation pattern on the Many Integrated Core architecture

    MSc dissertation, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, Programa de Pós-Graduação em Informática, 2016. The efficient execution of image processing algorithms is an active area of Bioinformatics. One common class of algorithms, or computing pattern, in this area is the Irregular Wavefront Propagation Pattern (IWPP), in which elements propagate information to their neighbors in the form of propagation waves. This propagation results in irregular data accesses and expansions. Because of this irregularity, current parallel implementations of this class of algorithms require atomic operations, which are very costly and also prevent implementations based on Single Instruction, Multiple Data (SIMD) instructions on the Many Integrated Core (MIC) architecture, even though such instructions are critical to attain high performance on this processor. The objective of this work is to redesign the Irregular Wavefront Propagation Pattern algorithm so that it can execute efficiently on Many Integrated Core processors using SIMD instructions. Using the Intel® Xeon Phi™ coprocessor, we implemented a vectorized version of IWPP with gains of up to 5.63x over the non-vectorized version; a parallel version using a First In, First Out (FIFO) queue that scales well, with speedups of around 55x over a single core of the coprocessor; a version using a priority queue that was 1.62x faster than the fastest GPU implementation reported in the literature; and a cooperative version for heterogeneous processors that can process images larger than the Intel® Xeon Phi™ memory and also allows multiple devices to be used in the computation.
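
    As a point of reference for the propagation pattern described above, the following is a minimal scalar C++ sketch of IWPP with a FIFO queue, using grayscale morphological reconstruction as the example operation. The function name, 4-neighbourhood, and image layout are illustrative assumptions, not the vectorized Xeon Phi™ implementation from the dissertation.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Minimal scalar sketch of the Irregular Wavefront Propagation Pattern (IWPP)
// with a FIFO queue: active elements propagate values to their neighbours, and
// neighbours whose values change are re-queued as the new wavefront.
// Example operation: grayscale morphological reconstruction of `marker` under
// `mask` (precondition: marker <= mask element-wise).
void iwpp_reconstruct(std::vector<uint8_t>& marker,
                      const std::vector<uint8_t>& mask,
                      int width, int height) {
    std::queue<int> wavefront;
    for (int i = 0; i < width * height; ++i) wavefront.push(i);  // seed with all pixels

    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};

    while (!wavefront.empty()) {
        const int p = wavefront.front();
        wavefront.pop();
        const int px = p % width, py = p / width;
        for (int k = 0; k < 4; ++k) {
            const int nx = px + dx[k], ny = py + dy[k];
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            const int q = ny * width + nx;
            // Propagate: the neighbour rises toward marker[p] but never above mask[q].
            const uint8_t candidate = std::min(marker[p], mask[q]);
            if (candidate > marker[q]) {
                marker[q] = candidate;   // value changed: q joins the next wavefront
                wavefront.push(q);
            }
        }
    }
}
```

    In this scalar form each queue update touches one element at a time; the dissertation's contribution is to reorganize exactly this propagation step so it can run with SIMD instructions and without atomic operations.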

    Accelerating Sensitivity Analysis in Microscopy Image Segmentation Workflows

    With the increasing availability of digital microscopy imaging equipment, there is a demand for efficient execution of whole-slide tissue image applications. Through sensitivity analysis it is possible to improve the output quality of such applications and, thus, the quality of the desired analysis. Because these analyses are computationally expensive and sensitivity analysis methods repeatedly re-execute tasks, opportunities for computation reuse arise. By performing computation reuse we can reduce the run time of sensitivity analysis applications. This work therefore focuses on new ways to exploit computation reuse opportunities at multiple task abstraction levels, presenting a coarse-grain merging strategy and new fine-grain merging algorithms, implemented on top of the Region Templates Framework.
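
    To illustrate the computation-reuse idea behind these merging strategies, the sketch below caches stage results keyed by their parameters so that repeated executions during sensitivity analysis are served from the cache. The class, key scheme, and result type are assumptions for illustration, not the Region Templates Framework API.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Illustrative computation-reuse cache: during sensitivity analysis the same
// pipeline stage is re-executed many times, often with repeated parameter
// values, so results keyed by (stage name, serialized parameters) can be
// reused instead of recomputed.
class ReuseCache {
public:
    using Key = std::pair<std::string, std::string>;  // (stage, serialized parameters)

    double run(const std::string& stage, const std::string& params,
               const std::function<double()>& compute) {
        const auto it = cache_.find(Key{stage, params});
        if (it != cache_.end()) return it->second;  // reuse a previous result
        const double result = compute();            // first execution: do the work
        cache_[Key{stage, params}] = result;
        return result;
    }

private:
    std::map<Key, double> cache_;
};

// Usage sketch: the second call with identical parameters skips the expensive work.
// ReuseCache cache;
// double a = cache.run("segmentation", "threshold=0.5", [] { return expensive(); });
// double b = cache.run("segmentation", "threshold=0.5", [] { return expensive(); });
```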

    Fast algorithm for real-time rings reconstruction

    The GAP project is dedicated to studying the application of GPUs in several contexts in which real-time response is important for decision making. The definition of real-time depends on the application under study, ranging from response times of a few μs up to several hours for very compute-intensive tasks. At this conference we presented our work on low-level triggers [1] [2] and high-level triggers [3] in high-energy physics experiments, as well as specific applications to nuclear magnetic resonance (NMR) [4] [5] and cone-beam CT [6]. Apart from the study of dedicated solutions to decrease the latency due to data transport and preparation, the computing algorithms play an essential role in any GPU application. In this contribution, we show an original algorithm developed for trigger applications that accelerates ring reconstruction in RICH detectors when it is not possible to obtain reconstruction seeds from external trackers.
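
    The abstract does not spell out the ring-finding algorithm itself. Purely as an illustration of seedless ring reconstruction, the sketch below fits a single circle to hit coordinates with the classic Kåsa algebraic least-squares method; it is a generic textbook fit, not the GAP trigger algorithm, and the struct and function names are assumptions.

```cpp
#include <array>
#include <cmath>
#include <utility>
#include <vector>

struct Ring { double cx, cy, r; };

// Illustrative seedless ring fit (Kåsa algebraic least squares): fit
// x^2 + y^2 + a*x + b*y + c = 0 to the hit coordinates, then recover the
// centre (-a/2, -b/2) and radius sqrt(a^2/4 + b^2/4 - c).
Ring fit_ring(const std::vector<double>& x, const std::vector<double>& y) {
    // Normal equations M * [a b c]^T = v for the linear parameters.
    double M[3][3] = {{0, 0, 0}, {0, 0, 0}, {0, 0, 0}};
    double v[3] = {0, 0, 0};
    for (size_t i = 0; i < x.size(); ++i) {
        const double row[3] = {x[i], y[i], 1.0};
        const double z = -(x[i] * x[i] + y[i] * y[i]);
        for (int r = 0; r < 3; ++r) {
            v[r] += row[r] * z;
            for (int c = 0; c < 3; ++c) M[r][c] += row[r] * row[c];
        }
    }
    // Solve the 3x3 system by Gaussian elimination with partial pivoting.
    double p[3];
    for (int col = 0; col < 3; ++col) {
        int piv = col;
        for (int r = col + 1; r < 3; ++r)
            if (std::fabs(M[r][col]) > std::fabs(M[piv][col])) piv = r;
        std::swap(v[col], v[piv]);
        for (int c = 0; c < 3; ++c) std::swap(M[col][c], M[piv][c]);
        for (int r = col + 1; r < 3; ++r) {
            const double f = M[r][col] / M[col][col];
            for (int c = col; c < 3; ++c) M[r][c] -= f * M[col][c];
            v[r] -= f * v[col];
        }
    }
    for (int r = 2; r >= 0; --r) {
        double s = v[r];
        for (int c = r + 1; c < 3; ++c) s -= M[r][c] * p[c];
        p[r] = s / M[r][r];
    }
    const double a = p[0], b = p[1], c = p[2];
    return {-a / 2.0, -b / 2.0, std::sqrt(a * a / 4.0 + b * b / 4.0 - c)};
}
```

    A fit of this kind is attractive for GPU triggers because each candidate ring reduces to small, fixed-size linear algebra that can be evaluated independently per event.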

    Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing

    Book of Abstracts of CSC14, edited by Bora Uçar. The Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France, from 21 to 23 July 2014. This two-and-a-half-day event marked the sixth in a series that started ten years ago in San Francisco, USA. The focus of the CSC14 workshop was on combinatorial mathematics and algorithms in high-performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks, and eight poster presentations. All three invited talks focused on two fields of research: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multipole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program "Investissements d'Avenir" ANR-11-IDEX-0007, operated by the French National Research Agency) and by SIAM.

    Morphological Image Reconstruction Implementation Using a Hardware-Software Approach in FPGA

    MSc dissertation, Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Mecânica, 2018. This MSc dissertation implements a morphological image reconstruction algorithm using a hardware/software approach in an FPGA-based embedded system. The hardware part was developed in VHDL on a Cyclone® V FPGA from Altera that has an ARM® Cortex™-A9 processor on the same chip, allowing the execution of the software part in the C language. The algorithm receives as input two images, called marker and mask, and produces as output a morphologically reconstructed image based on the content of the two inputs. This work implements the algorithm through an image-partitioning strategy, executing the sequential reconstruction (SR) version of the algorithm in hardware while a queue-based reconstruction of the sub-image borders runs in software. The purpose of this work is to combine inherent properties of hardware implementations, such as parallelism and high performance, with the flexibility and speed of a software design, achieving a final solution that performs well given the limitations of a memory-constrained embedded system and remains flexible, allowing the user to choose the size of the image to be processed through software settings. The hardware architecture runs at 150 MHz, uses the Avalon protocol to communicate, has a 1 GB external DDR3 memory for temporary image storage, and is connected to an ARM Cortex-A9 through an AMBA® AXI3 bus. The software part runs on this processor at 925 MHz and has another 1 GB DDR3 memory. The program first configures the internal hardware registers according to the chosen parameters (image size, image memory addresses, and others), commands the hardware to run the morphological reconstruction algorithm several times on smaller sub-images, and then performs the propagation of the algorithm across the borders of the sub-images processed by the hardware. As a functional verification method, the resulting images were compared with those produced by MATLAB®. The best solution proposed by this work achieved a speedup of around 2x compared to the best theoretically possible solution of the sequential morphological reconstruction (SR) algorithm implemented in hardware running at 150 MHz, a speedup of up to 12x relative to the Fast Hybrid (FH) morphological reconstruction algorithm proposed by Vincent [1] running on an ARM® Cortex™-A9 processor, and a speedup of up to 2x compared with the same algorithm running on an Intel® Core™ i5 CPU. The final tests to validate the solution show correct results for 8-bit grayscale images of up to 8192x8192 pixels (8K images). To the best of the author's knowledge, this is the first hardware/software implementation of the morphological image reconstruction algorithm described in the literature.
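
    For reference, a software-only sketch of the sequential reconstruction (SR) scheme that the hardware block applies to each sub-image is given below: alternating raster and anti-raster sweeps propagate the marker under the mask until nothing changes. The function and data layout are illustrative assumptions, not the VHDL design.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal software sketch of sequential morphological reconstruction (SR):
// alternate raster and anti-raster sweeps, each pixel taking the maximum of
// its already-visited neighbours clipped to the mask, until a full double
// sweep produces no change (precondition: marker <= mask element-wise).
void sequential_reconstruction(std::vector<uint8_t>& marker,
                               const std::vector<uint8_t>& mask,
                               int w, int h) {
    const auto at = [&](int x, int y) { return y * w + x; };
    bool changed = true;
    while (changed) {
        changed = false;
        // Raster sweep: propagate from the left and top neighbours.
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                uint8_t m = marker[at(x, y)];
                if (x > 0) m = std::max(m, marker[at(x - 1, y)]);
                if (y > 0) m = std::max(m, marker[at(x, y - 1)]);
                m = std::min(m, mask[at(x, y)]);      // clip to the mask
                if (m != marker[at(x, y)]) { marker[at(x, y)] = m; changed = true; }
            }
        // Anti-raster sweep: same idea from the right and bottom neighbours.
        for (int y = h - 1; y >= 0; --y)
            for (int x = w - 1; x >= 0; --x) {
                uint8_t m = marker[at(x, y)];
                if (x < w - 1) m = std::max(m, marker[at(x + 1, y)]);
                if (y < h - 1) m = std::max(m, marker[at(x, y + 1)]);
                m = std::min(m, mask[at(x, y)]);
                if (m != marker[at(x, y)]) { marker[at(x, y)] = m; changed = true; }
            }
    }
}
```

    The regular, predictable sweep order is what makes SR a good fit for a hardware pipeline, while the irregular border propagation between sub-images is left to the queue-based software stage.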

    Optimizing Sparse Linear Algebra on Reconfigurable Architecture

    The rise of cloud computing and deep machine learning in recent years has led to a tremendous growth of workloads that are not only large but also have highly sparse representations. A large fraction of machine learning problems are formulated as sparse linear algebra problems in which the entries of the matrices are mostly zeros. Not surprisingly, optimizing linear algebra algorithms to take advantage of this sparsity is critical for efficient computation on these large datasets. This thesis presents a detailed analysis of several approaches to sparse matrix-matrix multiplication, a core computation of linear algebra kernels. While the arithmetic operation count for the nonzero elements of the matrices is the same regardless of the algorithm used to perform matrix-matrix multiplication, there is significant variation in the overhead of navigating the sparse data structures to match the nonzero elements with the correct indices. This work explores approaches to minimize these overheads as well as the number of memory accesses for sparse matrices. To provide concrete numbers, the thesis examines inner product, outer product, and row-wise algorithms on Transmuter, a flexible accelerator that can reconfigure its caches and crossbars at runtime to meet the demands of the program. This thesis shows how the reconfigurability of Transmuter can improve complex workloads that contain multiple kernels with varying compute and memory requirements, such as the Graphsage deep neural network and the Sinkhorn algorithm for optimal transport distance. Finally, we examine a novel Transmuter feature: register-to-register queues for efficient communication between its processing elements, enabling systolic-array-style computation for signal processing algorithms. PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169877/1/dohypark_1.pd
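
    As a concrete reference for the row-wise formulation analysed in the thesis, the sketch below is a plain CPU implementation of Gustavson-style row-by-row sparse matrix-matrix multiplication on CSR operands with a dense accumulator; it is a generic illustration, not the Transmuter mapping, and the struct and function names are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row matrix: row_ptr has n_rows + 1 entries.
struct CSR {
    int n_rows = 0, n_cols = 0;
    std::vector<int> row_ptr, col_idx;
    std::vector<double> vals;
};

// Row-wise (Gustavson) SpGEMM sketch: each row of C is built by scaling and
// accumulating the rows of B selected by the nonzeros of the matching row of A.
// A dense accumulator per row makes index matching cheap at the cost of
// O(n_cols) extra space; column indices within each output row are unsorted.
CSR spgemm_rowwise(const CSR& A, const CSR& B) {
    CSR C;
    C.n_rows = A.n_rows;
    C.n_cols = B.n_cols;
    C.row_ptr.assign(A.n_rows + 1, 0);

    std::vector<double> acc(B.n_cols, 0.0);
    std::vector<char> touched(B.n_cols, 0);
    std::vector<int> touched_cols;

    for (int i = 0; i < A.n_rows; ++i) {
        touched_cols.clear();
        for (int a = A.row_ptr[i]; a < A.row_ptr[i + 1]; ++a) {
            const int k = A.col_idx[a];
            const double a_ik = A.vals[a];
            for (int b = B.row_ptr[k]; b < B.row_ptr[k + 1]; ++b) {
                const int j = B.col_idx[b];
                if (!touched[j]) { touched[j] = 1; touched_cols.push_back(j); }
                acc[j] += a_ik * B.vals[b];     // C(i,j) += A(i,k) * B(k,j)
            }
        }
        for (const int j : touched_cols) {      // flush the accumulator into row i of C
            C.col_idx.push_back(j);
            C.vals.push_back(acc[j]);
            acc[j] = 0.0;
            touched[j] = 0;
        }
        C.row_ptr[i + 1] = static_cast<int>(C.col_idx.size());
    }
    return C;
}
```

    The inner-product and outer-product variants discussed in the thesis perform the same multiply-accumulate work but replace this dense accumulator with index matching or merging of partial products, which changes the memory-access pattern rather than the arithmetic count.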

    Doctor of Philosophy

    Partial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, solving PDEs requires tessellating computational domains into unstructured meshes and entails computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, the faster growth in speed and the greater power efficiency of SIMD streaming processors, such as GPUs, have given them an increasingly important role in high-performance computing. By combining suitable parallel algorithms with these streaming processors, we can develop very efficient numerical PDE solvers. The contributions of this dissertation are twofold: two general strategies for designing efficient PDE solvers on GPUs, and specific applications of these strategies to solve different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies: the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for efficiently solving the eikonal equation on fully unstructured meshes. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the level-set equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrow-band scheme with domain decomposition for efficient level-set equation solving.
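
    The dissertation targets fully unstructured meshes; as a simpler illustration of the local solver that fast-iterative and fast-sweeping eikonal schemes repeatedly apply, the sketch below shows the standard upwind (Godunov) update on a uniform 2D grid. The grid setting and function name are assumptions, not the unstructured triangle update developed in the dissertation.

```cpp
#include <algorithm>
#include <cmath>

// Standard 2D upwind (Godunov) local solver for the eikonal equation
// |grad u| = f on a uniform grid with spacing h. Fast-sweeping and
// fast-iterative methods repeatedly apply this update over the active set
// until the solution stops changing.
double eikonal_update_2d(double u_left, double u_right,
                         double u_down, double u_up,
                         double f, double h) {
    const double a = std::min(u_left, u_right);   // best horizontal neighbour
    const double b = std::min(u_down, u_up);      // best vertical neighbour
    if (std::fabs(a - b) >= f * h) {
        // The characteristic arrives from only one axis: one-dimensional update.
        return std::min(a, b) + f * h;
    }
    // Two-dimensional update: solve ((u-a)/h)^2 + ((u-b)/h)^2 = f^2 for u.
    return 0.5 * (a + b + std::sqrt(2.0 * f * f * h * h - (a - b) * (a - b)));
}
```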