    An efficient and accurate framework for large-scale sequences of DNA barcodes

    Integrated master's dissertation in Informatics Engineering.
    DNA barcodes are short sequences of pre-defined gene regions that contain a sufficient amount of intra- and inter-species genetic information. High-throughput sequencing techniques are currently used to identify large sequences of DNA barcodes in a species' genome in a relatively short time. Domain experts require adequate, self-contained tools to process DNA barcode data accurately and efficiently in a reasonable time, taking advantage of current parallel and heterogeneous computing systems. They also expect to use these tools on different computing platforms, from laptops to high-performance servers, without needing broad software engineering knowledge to develop efficient computational applications. The main goal of this project was to develop a framework and associated user-friendly tools that let domain experts carry out DNA barcoding studies efficiently while abstracting away the performance issues. 4SpecID is the key outcome of this work: an application that integrates a semi-automated auditing and annotation tool for reference libraries, which ensures the quality standards of the compiled data and aims to enable a grounded decision when identifying species from DNA barcodes. Its graphical interface helps the end user specify the operations and simplifies data filtering and remote file handling. The C++ version, ported from MATLAB, was fully tested and is more robust than the original. Architecture features common to laptops and compute servers were exploited, namely parallel programming techniques and memory models. The validation and performance results presented show significant improvements in execution time, not only in the sequential version but also when using the parallel capabilities of the underlying computing platforms.

    Performance and Microarchitectural Analysis for Image Quality Assessment

    This thesis presents a performance analysis of five mature Image Quality Assessment (IQA) algorithms, VSNR, MAD, MSSIM, BLIINDS, and VIF, using the VTune ... from Intel. The main performance parameter considered is execution time. First, we conduct Hotspot Analysis to find the most time-consuming sections of the five algorithms. Second, we perform Microarchitectural Analysis to study the behavior of the algorithms on Intel's Sandy Bridge microarchitecture and find architectural bottlenecks. Existing research on improving the performance of IQA algorithms is based on advanced signal processing techniques. Our research focuses instead on the interaction of IQA algorithms with the underlying hardware and architectural resources. We propose coding techniques that exploit the hardware resources and consequently improve execution time and computational performance. Along with software tuning methods, we also propose a generic custom IQA hardware engine based on the microarchitectural analysis and on the behavior of these five IQA algorithms with the underlying microarchitectural resources.
    School of Electrical & Computer Engineerin

    End to end Multi-Objective Optimisation of H.264 and HEVC Codecs

    All multimedia devices now incorporate video CODECs that comply with international video coding standards such as H.264/MPEG-4 AVC and the newer High Efficiency Video Coding (HEVC) standard, also known as H.265. Although the standard CODECs have been designed to include algorithms with optimal efficiency, a large number of coding parameters can be used to fine-tune their operation within known constraints such as available computational power, bandwidth and consumer QoS requirements. With so many parameters involved, determining which of them play a significant role in providing optimal quality of service within given constraints is a challenge in its own right. A further important question is how to select the values of the significant parameters so that the CODEC performs optimally under those constraints. This thesis proposes a framework that uses machine learning algorithms to model the performance of a video CODEC based on the significant coding parameters. Means of modelling both encoder and decoder performance are proposed. We define objective functions that model the performance-related properties of a CODEC, i.e., video quality, bit rate and CPU time. We show that these objective functions can be used in practice in video encoder and decoder designs, in particular for performance optimisation within given operational and practical constraints. A multi-objective optimisation framework based on Genetic Algorithms is then proposed to optimise the performance of a video CODEC. The framework is designed to jointly minimise CPU time and bit rate and to maximise the quality of the compressed video stream. The thesis presents the use of this framework in the performance modelling and multi-objective optimisation of the video coding standard most widely used in practice at present, H.264, and of the latest video coding standard, H.265/HEVC.
When a communication network is used to transmit video, performance-related parameters of the communication channel affect the end-to-end performance of the video CODEC. Network delays and packet loss degrade the quality of the video received at the decoder, so even if a video CODEC is optimally configured, network conditions can make the experience sub-optimal. Given the above, the thesis proposes the design, integration and testing of a novel approach to simulating a wired network that uses the UDP protocol to transmit video data. This network is subsequently used to simulate the impact of packet loss and network delays on video coded optimally with the framework previously proposed for the modelling and optimisation of video CODECs. The quality of the received video under different levels of packet loss and network delay is simulated, and conclusions are drawn on the impact on transmitted video according to its content and features.
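
The joint minimisation of CPU time and bit rate together with the maximisation of quality can be illustrated with a Pareto-dominance check, which is the usual way multi-objective genetic algorithms compare candidate solutions. The sketch below is only an illustration under stated assumptions: the abstract does not say which dominance rule or algorithm variant the thesis uses, and the `Config` type, its field names and any numbers fed to it are hypothetical.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One hypothetical encoder configuration with its measured objectives."""
    name: str
    cpu_time: float  # seconds, lower is better
    bit_rate: float  # kbit/s, lower is better
    quality: float   # e.g. PSNR in dB, higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on every objective and
    strictly better on at least one."""
    no_worse = (a.cpu_time <= b.cpu_time
                and a.bit_rate <= b.bit_rate
                and a.quality >= b.quality)
    strictly_better = (a.cpu_time < b.cpu_time
                       or a.bit_rate < b.bit_rate
                       or a.quality > b.quality)
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the thesis.
    candidates = [
        Config("fast", 1.0, 900.0, 34.0),
        Config("balanced", 2.0, 700.0, 36.0),
        Config("slow", 4.0, 650.0, 38.0),
        Config("wasteful", 4.5, 950.0, 33.0),
    ]
    for c in pareto_front(candidates):
        print(c.name)
```

With these made-up values only the "wasteful" configuration is filtered out, because "fast" is at least as good on all three objectives. The remaining configurations are trade-offs that an optimiser would leave to the designer's constraints.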

    Performance Analysis and Optimization of a Hybrid Distributed Reverse Time Migration Application

    To fully exploit emerging processor architectures, programs will need to employ threaded parallelism within a node and message passing across nodes. Today, MPI+OpenMP is the preferred programming model for this task. However, tuning MPI+OpenMP programs for clusters is difficult. Performance tools can help users identify bottlenecks and uncover opportunities for improvement. Applications that analyze seismic data employ scalable parallel systems to produce timely results. This thesis describes our experience of applying performance tools to gain insight into an MPI+OpenMP code that performs Reverse Time Migration (RTM) to analyze seismic data. It also assesses the capabilities of available tools for analyzing the performance of a sophisticated application that employs both message-passing and threaded parallelism. The tools gave us insight into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from Rice University's HPCToolkit and hardware performance counters, we were able to improve the performance of the RTM code by roughly 30 percent.

    Efficient computation of the matrix square root in heterogeneous platforms

    Master's dissertation in Informatics Engineering.
    Matrix algorithms often deal with large amounts of data at a time, which impairs efficient cache memory usage. Recent collaborative work between the Numerical Algorithms Group and the University of Minho led to a blocked approach to the matrix square root algorithm with significant efficiency improvements, particularly in a multicore shared memory environment. Distributed memory architectures were left unexplored. In these systems data is distributed across multiple memory spaces, including those associated with specialized accelerator devices such as GPUs. Systems with these devices are known as heterogeneous platforms. This dissertation studies the blocked matrix square root algorithm, first in a multicore environment and then on heterogeneous platforms. Two types of hardware accelerators are explored: Intel Xeon Phi coprocessors and NVIDIA CUDA-enabled GPUs. The initial implementation confirmed the advantages of the blocked method and showed excellent scalability in a multicore environment. The same implementation was also used on the Intel Xeon Phi, but the performance obtained lagged behind both the expected behaviour and the CPU-only alternative. Several optimization techniques were applied to the common implementation, which reduced the gap between the two environments. The implementation for CUDA-enabled devices followed a different programming model and could not benefit from any of the previous solutions. It also required implementing BLAS and LAPACK routines, since no existing package fits the requirements of this application. The measured performance also showed that the CPU-only implementation is still the fastest.
    Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portuga
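
For readers unfamiliar with the underlying computation, the sketch below shows the classical element-by-element recurrence for the square root of an upper triangular matrix, which is the kind of kernel a blocked method reorganises into block operations. It is a minimal illustration, not the dissertation's implementation: it assumes the input is already upper triangular with a positive diagonal (in practice that form would come from a Schur decomposition), it does not do any blocking, and the function name is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def sqrtm_upper_triangular(T: Matrix) -> Matrix:
    """Return upper triangular U with U @ U == T.

    Assumes T is upper triangular with a strictly positive diagonal.
    Columns are processed left to right and each column bottom to top,
    so every entry of U that is read has already been computed.
    """
    n = len(T)
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        U[j][j] = math.sqrt(T[j][j])
        for i in range(j - 1, -1, -1):
            # (U @ U)[i][j] = U[i][i]*U[i][j] + U[i][j]*U[j][j] + inner sum
            inner = sum(U[i][k] * U[k][j] for k in range(i + 1, j))
            U[i][j] = (T[i][j] - inner) / (U[i][i] + U[j][j])
    return U


if __name__ == "__main__":
    T = [[4.0, 2.0, 1.0],
         [0.0, 9.0, 3.0],
         [0.0, 0.0, 16.0]]
    for row in sqrtm_upper_triangular(T):
        print(row)
```

The inner sum walks one row and one column of U per entry, so for large matrices the access pattern has poor cache locality. That is the cost the blocked approach described above is meant to reduce.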

    Acceleration of Axisymetric Ultrasound Simulations

    The simulation of ultrasound propagation through soft biological tissue has a wide range of practical applications. These include the design of transducers for diagnostic and therapeutic ultrasound, the development of new signal processing and imaging techniques, the study of the aberration of ultrasound beams in heterogeneous media, ultrasonic tissue classification, training ultrasonographers to use ultrasound equipment and interpret ultrasound images, model-based medical image registration, and treatment planning and dosimetry for high-intensity focused ultrasound. However, ultrasound simulation is a computationally difficult problem, because simulation domains are very large compared with the acoustic wavelengths of interest. If the problem is axisymmetric, the governing equations can be solved in 2D. This allows simulations to run on larger grids, with fewer computational resources and in a shorter time. This work models and implements an accelerated version of the full-wave nonlinear ultrasound simulation in an axisymmetric coordinate system, originally implemented in Matlab using MEX files for the FFTW discrete sine and cosine transforms (DST and DCT). The axisymmetric simulation was implemented in C++ as an extension to the open source K-WAVE toolbox. The code was optimized to run on one node of the Salomon supercomputer (IT4Innovations, Ostrava, Czechia) with two twelve-core Intel Xeon E5-2680v3 processors. To maximize computational efficiency, several stages of code optimization were performed. First, the Fourier transforms were computed using the real-to-complex FFT from the FFTW library. Compared with the complex-to-complex FFT, this reduced the compute time and memory associated with the FFT by nearly 50%. The real-to-real DCTs and DSTs were also computed directly with the FFTW library, whereas the Matlab version had to invoke them from dynamically loaded MEX files. Second, to reduce the load on memory bandwidth, all operations were computed in single-precision floating point. Third, element-wise operations were parallelized using OpenMP and then vectorized using streaming SIMD extensions (SSE). Overall, the C++ version is up to 34 times faster than the Matlab version and uses less than one third of its memory. A simulation that would take nearly two days with the Matlab implementation can now be computed in one and a half hours. Together, these improvements make it possible to run the simulation on a computational grid of 16384 × 8192 points within a reasonable time.
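
The saving of nearly 50% from the real-to-complex transform follows from a symmetry of the discrete Fourier transform: for real input of length N, bin N-k is the complex conjugate of bin k, so only the first N/2+1 bins need to be computed and stored (this is the output size FFTW's real-to-complex transforms use). The sketch below demonstrates the symmetry with a naive O(N²) DFT in pure Python. It is an illustration only: it is neither an FFT nor the thesis code, and the function names are hypothetical.

```python
import cmath
from typing import List


def dft_bins(x: List[float], bins: int) -> List[complex]:
    """Naive DFT of a real sequence, computing only the first `bins` outputs."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(bins)]


def dft(x: List[float]) -> List[complex]:
    """Full complex-to-complex style result: all n bins."""
    return dft_bins(x, len(x))


def rdft(x: List[float]) -> List[complex]:
    """Real-to-complex style result: only the n//2 + 1 non-redundant bins."""
    return dft_bins(x, len(x) // 2 + 1)


def full_from_half(half: List[complex], n: int) -> List[complex]:
    """Rebuild all n bins from the half spectrum via X[n-k] = conj(X[k])."""
    full = list(half)
    for k in range(n // 2 + 1, n):
        full.append(half[n - k].conjugate())
    return full


if __name__ == "__main__":
    signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 4.0]
    half = rdft(signal)
    rebuilt = full_from_half(half, len(signal))
    error = max(abs(a - b) for a, b in zip(rebuilt, dft(signal)))
    print(len(half), "of", len(signal), "bins stored; max error", error)
```

The half spectrum carries the same information as the full one, which is why both the work and the storage roughly halve when the input is known to be real.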

    Efficient Algorithms for Large-Scale Image Analysis

    This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FPGA/GPU hardware, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm engineering methodology, so that both may also be applied to other applications.