47 research outputs found
An efficient and accurate framework for large-scale sequences of DNA barcodes
Integrated master's dissertation in Informatics Engineering
DNA barcodes are short sequences of pre-defined gene regions that contain a sufficient
amount of intra- and inter-species genetic information. High-throughput sequencing techniques are currently used to identify large sequences of DNA barcodes in a species' genome in a relatively short time.
Domain experts require adequate self-contained tools to accurately and efficiently process
DNA barcode data in a reasonable time, taking advantage of current parallel and heterogeneous computing systems. They also expect to use these tools on different computing platforms, from laptops to high-performance servers, without needing the broad software engineering knowledge that developing efficient computational applications requires.
The main goal of this project was to develop a framework and associated user-friendly tools
for domain experts to efficiently support DNA barcoding studies, providing an abstraction
of the performance issues.
4SpecID is the key outcome of this work: an application software that integrates a
semi-automated auditing and annotation tool for reference libraries, to ensure the quality
standards of the compiled data, aiming to enable a grounded decision when identifying
species from DNA barcodes. Its graphical interface helps the end user specify the operations
and also simplifies data filtering and remote file handling.
The C++ version, ported from MATLAB, was fully tested and is more robust than
the original version. Architecture features common to laptops and compute servers were
exploited, namely parallel programming techniques and memory models.
The presented validation and performance results show significant improvements in
execution times, not only on the sequential version, but also by using the available parallel
capabilities of the underlying computing platforms.
DNA barcodes are short sequences of predefined gene regions that contain a sufficient amount of intra- and inter-species genetic information. High-throughput sequencing techniques are used to identify large sequences of DNA barcodes in a species' genome.
However, adequate tools must be developed so that domain experts can process DNA barcode data accurately and within a feasible time, using the parallel and heterogeneous computing systems that exist. These tools are expected to be usable on different computing platforms, from laptops to high-performance servers, without requiring broad software engineering knowledge either to use them or to build other tools from them.
The main goal of this project is to develop a framework that provides an abstraction of the possible performance challenges and gives domain experts a computationally efficient way to carry out a DNA barcoding study.
This project developed a tool, 4SpecID, that aims to enable a grounded decision when identifying species from DNA barcodes: a semi-automated auditing and annotation tool for reference libraries, to ensure the quality standards of the compiled data.
This project also exploited the advantages of the most common compute-server and laptop architectures, such as parallel programming techniques and memory models. The presented validation and performance results show that better execution times can be obtained by using the features available in the underlying platforms.
Performance and Microarchitectural Analysis for Image Quality Assessment
This thesis presents a performance analysis of five mature Image Quality Assessment (IQA) algorithms: VSNR, MAD, MSSIM, BLIINDS, and VIF, using the VTune ... from Intel. The main performance parameter considered is execution time. First, we conduct Hotspot Analysis to find the most time-consuming sections of the five algorithms. Second, we perform Microarchitectural Analysis to examine the behavior of the algorithms on Intel's Sandy Bridge microarchitecture and find architectural bottlenecks. Existing research on improving the performance of IQA algorithms is based on advanced signal processing techniques. Our research focuses instead on the interaction of IQA algorithms with the underlying hardware and architectural resources. We propose coding techniques that exploit the hardware resources and consequently improve execution time and computational performance. Along with software tuning methods, we also propose a generic custom IQA hardware engine based on the microarchitectural analysis and on the behavior of these five IQA algorithms with the underlying microarchitectural resources.
School of Electrical & Computer Engineering
End to end Multi-Objective Optimisation of H.264 and HEVC Codecs
All multimedia devices now incorporate video CODECs that comply with international video coding standards such as H.264 / MPEG4-AVC and the new High Efficiency Video Coding Standard (HEVC), otherwise known as H.265. Although the standard CODECs have been designed to include algorithms with optimal efficiency, a large number of coding parameters can be used to fine-tune their operation within known constraints such as available computational power, bandwidth, and consumer QoS requirements. With so many parameters involved, determining which of them play a significant role in providing optimal quality of service within given constraints is a challenge that needs to be met. A further important question is how to select the values of the significant parameters so that the CODEC performs optimally under the given constraints.
This thesis proposes a framework that uses machine learning algorithms to model the performance of a video CODEC based on the significant coding parameters. Means of modelling both Encoder and Decoder performance are proposed. We define objective functions that can be used to model the performance-related properties of a CODEC, i.e., video quality, bit-rate and CPU time. We show that these objective functions can be practically utilised in video Encoder/Decoder designs, in particular in their performance optimisation within given operational and practical constraints. A Multi-objective Optimisation framework based on Genetic Algorithms is thus proposed to optimise the performance of a video CODEC. The framework is designed to jointly minimise the CPU time and bit-rate and to maximise the quality of the compressed video stream. The thesis presents the use of this framework in the performance modelling and multi-objective optimisation of the most widely used video coding standard in practice at present, H.264, and of the latest video coding standard, H.265/HEVC.
When a communication network is used to transmit video, performance-related parameters of the communication channel will impact the end-to-end performance of the video CODEC. Network delays and packet loss will impact the quality of the video that is received at the decoder via the communication channel, i.e., even if a video CODEC is optimally configured, network conditions will make the experience sub-optimal. Given the above, the thesis proposes the design, integration and testing of a novel approach to simulating a wired network and the use of the UDP protocol for the transmission of video data. This network is subsequently used to simulate the impact of packet loss and network delays on optimally coded video, based on the framework previously proposed for the modelling and optimisation of video CODECs. The quality of received video under different levels of packet loss and network delay is simulated, and conclusions are drawn on how the impact on transmitted video depends on its content and features.
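As a rough illustration of the joint objective described in this abstract (minimise CPU time and bit-rate, maximise quality), the sketch below filters a set of candidate encoder configurations down to the non-dominated (Pareto-optimal) ones. This is the selection notion multi-objective genetic algorithms commonly build on; it is not taken from the thesis. The `Config` type, the configuration names and all numbers are invented for the example.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One hypothetical encoder configuration and its measured objectives."""
    name: str
    cpu_time: float  # seconds, lower is better
    bitrate: float   # kbit/s, lower is better
    quality: float   # e.g. PSNR in dB, higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (a.cpu_time <= b.cpu_time and a.bitrate <= b.bitrate
                and a.quality >= b.quality)
    better = (a.cpu_time < b.cpu_time or a.bitrate < b.bitrate
              or a.quality > b.quality)
    return no_worse and better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]


# Made-up measurements, for illustration only.
candidates = [
    Config("fast", 1.0, 900.0, 34.0),
    Config("balanced", 2.0, 700.0, 36.0),
    Config("slow", 4.0, 650.0, 37.0),
    Config("wasteful", 4.5, 950.0, 33.0),  # worse than "fast" on all three
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Returning a front rather than a single winner reflects that the three objectives trade off against each other; choosing one point on the front requires the operational constraints the abstract mentions.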
Performance Analysis and Optimization of a Hybrid Distributed Reverse Time Migration Application
To fully exploit emerging processor architectures, programs will need to employ threaded parallelism within a node and message passing across nodes. Today, MPI+OpenMP is the preferred programming model for this task. However, tuning MPI+OpenMP programs for clusters is difficult. Performance tools can help users identify bottlenecks and uncover opportunities for improvement. Applications that analyze seismic data employ scalable parallel systems to produce timely results. This thesis describes our experiences of applying performance tools to gain insight into an MPI+OpenMP code that performs Reverse Time Migration (RTM) to analyze seismic data, and to assess the capabilities of available tools for analyzing the performance of a sophisticated application that employs both message-passing and threaded parallelism. The tools provided us with insights into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from Rice University's HPCToolkit and hardware performance counters, we were able to improve the performance of the RTM code by roughly 30 percent.
Efficient computation of the matrix square root in heterogeneous platforms
Master's dissertation in Informatics Engineering
Matrix algorithms often deal with large amounts of data at a time, which impairs efficient
cache memory usage. Recent collaborative work between the Numerical Algorithms
Group and the University of Minho led to a blocked approach to the matrix square root algorithm
with significant efficiency improvements, particularly in a multicore shared memory
environment.
Distributed memory architectures were left unexplored. In these systems data is distributed
across multiple memory spaces, including those associated with specialized accelerator
devices, such as GPUs. Systems with these devices are known as heterogeneous
platforms.
This dissertation focuses on studying the blocked matrix square root algorithm, first
in a multicore environment, and then in heterogeneous platforms. Two types of hardware
accelerators are explored: Intel Xeon Phi coprocessors and NVIDIA CUDA-enabled GPUs.
The initial implementation confirmed the advantages of the blocked method and showed
excellent scalability in a multicore environment. The same implementation was also used in
the Intel Xeon Phi, but the obtained performance results lagged behind the expected behaviour
and the CPU-only alternative. Several optimization techniques were applied to the
common implementation, which managed to reduce the gap between the two environments.
The implementation for CUDA-enabled devices followed a different programming model
and was not able to benefit from any of the previous solutions. It also required the implementation
of BLAS and LAPACK routines, since no existing package fits the requirements of
this application. The measured performance also showed that the CPU-only implementation
is still the fastest.
Matrix algorithms regularly deal with large amounts of data at once, which hinders efficient cache usage. Recent collaborative work between the Numerical Algorithms Group and the University of Minho led to a blocked approach to the matrix square root algorithm with significant efficiency improvements, particularly in a shared-memory multicore environment.
Distributed memory architectures remained unexplored. In these systems data is distributed across several memory spaces, including those associated with specialized accelerator devices such as GPUs. Systems with these devices are known as heterogeneous platforms.
This dissertation focuses on studying the blocked matrix square root algorithm, first in a multicore environment and then using heterogeneous platforms. Two types of accelerators are explored: Intel Xeon Phi coprocessors and CUDA-enabled NVIDIA GPUs.
The initial implementation confirmed the advantages of the blocked method and showed excellent scalability in a multicore environment. The same implementation was also used for the Intel Xeon Phi, but the performance results obtained fell short of the expected behaviour and of the CPU-only alternative. Several optimizations were applied to this common implementation, reducing the gap between the two environments.
The implementation for CUDA devices followed a different programming model and could not benefit from any of the previous solutions. It also required implementing BLAS and LAPACK routines, since none of the existing packages suits the requirements of this implementation. The measured performance also showed that the CPU-only alternative is still the fastest.
Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portugal
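The abstract does not spell out the algorithm itself. As a hedged illustration only, the sketch below shows the standard point-wise recurrence for the square root of an upper-triangular matrix, the kind of kernel that Schur-based matrix square root methods rely on and that blocked variants reorganise into sub-block operations for better cache use. That the dissertation builds on exactly this recurrence is an assumption. The helper names and the test matrix are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def sqrt_upper_triangular(t: Matrix) -> Matrix:
    """Return upper-triangular U with U*U == T, for T with a positive diagonal."""
    n = len(t)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):                   # process one column at a time
        u[j][j] = math.sqrt(t[j][j])
        for i in range(j - 1, -1, -1):   # move up the column from the diagonal
            # Entries u[i][k] and u[k][j] for i < k < j are already known.
            s = sum(u[i][k] * u[k][j] for k in range(i + 1, j))
            u[i][j] = (t[i][j] - s) / (u[i][i] + u[j][j])
    return u


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop matrix product, used here only to check the result."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Arbitrary upper-triangular example matrix (not from the dissertation).
T = [[4.0, 2.0, 1.0],
     [0.0, 9.0, 3.0],
     [0.0, 0.0, 16.0]]
U = sqrt_upper_triangular(T)
residual = max(abs(x - y)
               for row_uu, row_t in zip(matmul(U, U), T)
               for x, y in zip(row_uu, row_t))
print(residual < 1e-12)
```

Each entry depends only on entries to its left in the same row and below it in the same column, which is why the computation can be grouped into blocks.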
Acceleration of Axisymetric Ultrasound Simulations
The simulation of ultrasound propagation through soft biological tissues has a wide range of practical applications. These include the design of transducers for diagnostic and therapeutic ultrasound, the development of new signal processing methods and imaging techniques, the study of ultrasound beam anomalies in heterogeneous media, ultrasonic tissue classification, training radiologists to use ultrasound equipment and interpret ultrasound images, model-based medical image registration, and treatment planning for high-intensity ultrasound. However, ultrasound simulation is a computationally demanding problem, because the simulation domains are very large compared with the acoustic wavelengths of interest. But if the problem is axially symmetric, it can be solved in 2D. This allows simulations to run on a grid with more points, using fewer computational resources and in a shorter time. This work models and implements an acceleration of the full-wave nonlinear ultrasound simulation in an axisymmetric coordinate system, implemented in Matlab using MEX files for discrete sine and cosine transforms. The axisymmetric simulation was implemented in C++ as an open-source extension of the K-WAVE toolbox. The code is optimized to run on a single node of the Salomon supercomputer (IT4Innovations, Ostrava, Czech Republic) with two twelve-core Intel Xeon E5-2680v3 processors. Several code optimizations were carried out to maximize computational efficiency. First, Fourier transforms were computed using the real-to-complex FFT from the FFTW library. Compared with the complex-to-complex FFT, this reduced the compute time and memory associated with the FFT by almost 50%. The discrete sine and cosine transforms were likewise computed with the FFTW library, whereas in the Matlab version they had to be invoked from dynamically loaded MEX files. Second, to reduce the load on memory bandwidth, all operations were computed in single-precision floating point. Third, element-wise operations were parallelized with OpenMP and then vectorized using SIMD extensions (SSE). The overall computation of the C++ version is up to 34 times faster and uses less than a third of the memory of the Matlab version of the simulation. A simulation that would take almost two days can thus be computed in an hour and a half. All of this makes it possible to compute the simulation on a computational grid of 16384 × 8192 points in a reasonable time.
The simulation of ultrasound propagation through soft biological tissue has a wide range of practical applications. These include the design of transducers for diagnostic and therapeutic ultrasound, the development of new signal processing and imaging techniques, studying the aberration of ultrasound beams in heterogeneous media, ultrasonic tissue classification, training ultrasonographers to use ultrasound equipment and interpret ultrasound images, model-based medical image registration, and treatment planning and dosimetry for high-intensity focused ultrasound. However, ultrasound simulation presents a computationally difficult problem, as simulation domains are very large compared with the acoustic wavelengths of interest. But if the problem is axisymmetric, the governing equations can also be solved in 2D. This allows running simulations with a larger grid size, with fewer computational resources and in a shorter time. This work models and implements an acceleration of the Full-wave Nonlinear Ultrasound Simulation in an Axisymmetric Coordinate System, implemented in Matlab using MEX files for the FFTW DST and DCT transformations. The axisymmetric simulation was implemented in C++ as an extension to the open source K-WAVE toolbox. The code was optimized to run on one node of the Salomon supercomputer cluster (IT4Innovations, Ostrava, Czechia) with two twelve-core Intel Xeon E5-2680v3 processors. To maximize computational efficiency, several stages of code optimization were performed.
First, the FFTs were computed using the real-to-complex FFT from the FFTW library. Compared to the complex-to-complex FFT, this reduced the compute time and memory associated with the FFT by nearly 50%. Also, real-to-real DCTs and DSTs were computed using the FFTW library, whereas in the Matlab version they had to be invoked from dynamically loaded MEX files. Second, to save memory bandwidth, all operations were computed in single precision. Third, element-wise operations were parallelized using OpenMP and then optimized using streaming SIMD extensions (SSE). The overall computation of the C++ k-space model is up to 34 times faster and uses less than one-third of the memory of the Matlab version. A simulation that would take nearly two days with the Matlab implementation can now be computed in one and a half hours. All of this allows running the simulation on a computational grid with 16384 × 8192 grid points within a reasonable time.
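The near-50% saving attributed to the real-to-complex FFT follows from a general property of the discrete Fourier transform: for real input of length N, bin N-k is the complex conjugate of bin k, so only bins 0 to N/2 need to be stored. The sketch below checks this with a naive DFT in plain Python. It does not use FFTW and is not the thesis code; the function names and the sample signal are made up.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive O(N^2) complex-to-complex DFT, for illustration only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def half_spectrum(x: List[float]) -> List[complex]:
    """Keep only bins 0..N/2, the part a real-to-complex transform stores.

    This sketch still computes every bin first; it demonstrates the storage
    saving only, not the reduced work of a true real-to-complex algorithm.
    """
    return dft(x)[: len(x) // 2 + 1]


def expand(half: List[complex], n: int) -> List[complex]:
    """Rebuild all N bins from the stored half using X[N-k] = conj(X[k])."""
    return [half[k] if k <= n // 2 else half[n - k].conjugate()
            for k in range(n)]


signal = [0.0, 1.0, 0.5, -0.25, 2.0, -1.0, 0.75, 0.1]  # arbitrary real samples
full = dft(signal)
half = half_spectrum(signal)
rebuilt = expand(half, len(signal))
max_error = max(abs(a - b) for a, b in zip(full, rebuilt))
print(len(full), len(half), max_error < 1e-9)
```

For large N the stored half holds N/2 + 1 of the N bins, which is consistent with the "nearly 50%" memory figure quoted in the abstract.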
Efficient Algorithms for Large-Scale Image Analysis
This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FPGA/GPU hardware, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm engineering methodology, so that both may also be applied to other applications.