    Standardized Construction of HPC Clusters for Academic Usage

    A model for standardizing the design and implementation of HPC clusters for university use is presented. Standardization is achieved by using an open-source operating system, network infrastructure, and software packages. The cluster is configured for universities intending to implement an HPC cluster for research or teaching. No prior understanding of clusters is assumed, but a basic understanding of programming, networking, and computers in general is required.

    Application of HPC in eddy current electromagnetic problem solution

    As engineering problems become more advanced, the size of an average model described by partial differential equations is growing rapidly and, in order to keep simulation times within reasonable bounds, both faster computers and more efficient software implementations are needed. In the first part of this thesis, the full potential of simulation software has been exploited through high-performance parallel computing techniques. In particular, the simulation of induction heating processes is accomplished within reasonable solution times by implementing different parallel direct solvers for large sparse linear systems in the solution process of a commercial software package. The performance of such a library on shared-memory systems has been remarkably improved by implementing a multithreaded version of the MUMPS (MUltifrontal Massively Parallel Solver) library, which has been tested on benchmark matrices arising from typical induction heating process simulations. A new multithreading approach and a low-rank approximation technique have been developed and implemented by the MUMPS team in Lyon and Toulouse. In the context of a collaboration between the MUMPS team and DII-University of Padova, a preliminary version of these functionalities could be tested on induction heating benchmark problems, and a substantial reduction of the computational cost and memory requirements could be achieved. In the second part of this thesis, some examples of design methodology by virtual prototyping are described. Complex multiphysics simulations involving electromagnetic, circuital, thermal, and mechanical problems have been performed by exploiting the parallel solvers developed in the first part of this thesis. Finally, multiobjective stochastic optimization algorithms have been applied to multiphysics 3D model simulations in search of a set of improved induction heating device configurations.

    A Multi-Core Numerical Framework for Characterizing Flow in Oil Reservoirs

    Presented at the SCS Spring Simulation Multi-Conference – SpringSim 2011, April 4-7, 2011 – Boston, USA. Awarded Best Paper in the 19th High Performance Computing Symposium and Best Overall Paper at SpringSim 2011. This paper presents a numerical framework that enables scalable, parallel execution of engineering simulations on multi-core, shared-memory architectures. The simulations are distributed by selective hash-tabling of the model domain, which spatially decomposes it into a number of orthogonal computational tasks. These tasks, the size of which is critical to optimal cache blocking and consequently to performance, are then distributed for execution to multiple threads using the previously presented task management algorithm, H-Dispatch. Two numerical methods, smoothed particle hydrodynamics (SPH) and the lattice Boltzmann method (LBM), are discussed in the present work, although the framework is general enough to be used with any explicit time integration scheme. The implementation of both SPH and the LBM within the parallel framework is outlined, and the performance of each is presented in terms of speed-up and efficiency. On the 24-core server used in this research, near-linear scalability was achieved for both numerical methods, with utilization efficiencies of up to 95%. To close, the framework is employed to simulate fluid flow in a porous rock specimen, which is of broad geophysical significance, particularly in enhanced oil recovery.
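
The spatial decomposition described in this abstract can be illustrated with a minimal sketch. It assumes 2D points and square cells, and it uses a plain thread pool rather than the paper's H-Dispatch algorithm, which is not reproduced here. The function names (`build_cells`, `process_cell`, `run_tasks`) and the per-cell work are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_cells(points, cell_size):
    """Hash 2D points into square cells keyed by integer cell coordinates."""
    cells = defaultdict(list)
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y))
    return dict(cells)


def process_cell(item):
    """Hypothetical per-cell task: here it only counts the points in the cell."""
    key, members = item
    return key, len(members)


def run_tasks(cells, workers=4):
    """Hand each cell to a worker thread as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_cell, cells.items()))


points = [(0.1, 0.2), (0.4, 0.9), (1.5, 0.3), (2.2, 2.7), (2.9, 2.1)]
cells = build_cells(points, cell_size=1.0)
counts = run_tasks(cells)
```

In this sketch the cell size sets the task granularity, which is the quantity the abstract identifies as critical to cache blocking.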

    Heterogeneous multicore systems for signal processing

    This thesis explores the capabilities of heterogeneous multi-core systems based on multiple Graphics Processing Units (GPUs) in a standard desktop framework. Multi-GPU accelerated deskside computers are an appealing alternative to other high performance computing (HPC) systems: being composed of commodity hardware components fabricated in large quantities, their price-performance ratio is unparalleled in the world of high performance computing. Essentially bringing “supercomputing to the masses”, this opens up new possibilities for application fields where investing in HPC resources had previously been considered unfeasible. One of these is the field of bioelectrical imaging, a class of medical imaging technologies that occupy a low-cost niche next to million-dollar systems like functional Magnetic Resonance Imaging (fMRI). In the scope of this work, several computational challenges encountered in bioelectrical imaging are tackled with this new kind of computing resource, striving to help these methods approach their true potential. Specifically, the following main contributions were made. Firstly, a novel dual-GPU implementation of parallel triangular matrix inversion (TMI) is presented, addressing a crucial kernel in the computation of multi-mesh head models for electroencephalographic (EEG) source localization. This includes not only a highly efficient implementation of the routine itself, achieving excellent speedups versus an optimized CPU implementation, but also a novel GPU-friendly compressed storage scheme for triangular matrices. Secondly, a scalable multi-GPU solver for non-Hermitian linear systems was implemented. It is integrated into a simulation environment for electrical impedance tomography (EIT) that requires frequent solution of complex systems with millions of unknowns, a task that this solution can perform within seconds. In terms of computational throughput, it outperforms not only a highly optimized multi-CPU reference, but related GPU-based work as well. Finally, a GPU-accelerated graphical software package for real-time EEG source localization was implemented. Thanks to acceleration, it can meet real-time requirements in unprecedented anatomical detail while running more complex localization algorithms. Additionally, a novel implementation to extract anatomical priors from static Magnetic Resonance (MR) scans has been included.
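
The abstract does not describe its compressed storage scheme for triangular matrices, so the following is only a generic illustration of the idea, not the thesis's GPU-friendly layout. It assumes a common row-major packed layout in which only the lower triangle is stored; the function names are hypothetical.

```python
def packed_index(i, j):
    """Position of element (i, j), with j <= i, in row-major packed storage."""
    if j > i:
        raise IndexError("upper-triangle entries are not stored")
    return i * (i + 1) // 2 + j


def pack_lower(matrix):
    """Store only the lower triangle of a square matrix as a flat list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1)]


dense = [
    [1, 0, 0],
    [2, 3, 0],
    [4, 5, 6],
]
packed = pack_lower(dense)          # n * (n + 1) // 2 entries instead of n * n
value = packed[packed_index(2, 1)]  # look up element (2, 1) of the dense matrix
```

Packing roughly halves the memory footprint, which matters when the matrix has to fit in limited GPU memory.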

    FPGA acceleration of sequence analysis tools in bioinformatics

    Thesis (Ph.D.)--Boston University. With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high-impact production biosequence analysis tools. Compared with other alternatives, FPGAs offer huge compute power, lower power consumption, and reasonable flexibility. BLAST has become the de facto standard in bioinformatic approximate string matching, and so its acceleration is of fundamental importance. It is a complex, highly optimized system, consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on an FPGA. Utilizing our FPGA engine, we quickly reduce the size of the database to a small fraction, and then use the original code to process the query. Using a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code. Multiple Sequence Alignment (MSA)--the extension of pairwise sequence alignment to multiple sequences--is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics. The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not fall into this category, and so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance tradeoffs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedup over reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study introduces the pros and cons of these acceleration models for biosequence analysis tools.
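
The prefiltering idea in this abstract (shrink the database first, then run the original code on what remains) can be sketched in software. This is a simplified illustration using exact k-mer seed matches, not the thesis's FPGA design or BLAST's actual heuristics. The function names, the word length `k`, and the `min_hits` threshold are assumptions made for the example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (seed words) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query, database, k=3, min_hits=2):
    """Keep only database sequences that share at least min_hits seeds with the query."""
    query_words = kmers(query, k)
    survivors = []
    for name, seq in database.items():
        hits = sum(
            1 for i in range(len(seq) - k + 1) if seq[i:i + k] in query_words
        )
        if hits >= min_hits:
            survivors.append(name)
    return survivors


database = {
    "seq_a": "ACGTACGTGA",
    "seq_b": "TTTTTTTTTT",
    "seq_c": "GGACGTAC",
}
kept = prefilter("ACGTAC", database)
```

Only the surviving sequences would then be passed to the full, unmodified alignment code.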

    Overlapping communication and computation by using a hybrid MPI/SMPSs approach

    A previous version of this document was submitted for publication in October 2008. Communication overhead is one of the dominant factors that affect performance in high-performance computing systems. To reduce the negative impact of communication, programmers overlap communication and computation by using asynchronous communication primitives. This increases code complexity, requiring more effort to write parallel code and making the code less readable. This paper presents the hybrid use of MPI and SMPSs (SMP superscalar), a task-based shared-memory programming model, enhanced with a restart mechanism that allows the programmer to introduce the asynchronism necessary to enable effective communication/computation overlap in a productive way. We demonstrate the hybrid use of MPI/SMPSs with the high-performance LINPACK benchmark, which uses the look-ahead technique to overlap communication and computation. MPI/SMPSs improves the performance of a pure MPI version with look-ahead by 7.6% on a 1024-processor machine. In addition to better performance, hybrid MPI/SMPSs substantially reduces code complexity, is less sensitive to network bandwidth and operating system noise, and improves the use of main memory. Postprint (published version).
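
The overlap pattern this abstract refers to (start a transfer, compute on independent data, then wait) can be sketched without MPI. The example below simulates communication with a background thread and a short sleep. It does not use MPI or SMPSs, and every function name in it is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def exchange_halo(boundary):
    """Stand-in for a non-blocking send/receive; the sleep mimics network latency."""
    time.sleep(0.05)
    return [v * 2 for v in boundary]  # pretend the neighbor returned doubled values


def compute_interior(interior):
    """Work that does not depend on the data still in flight."""
    return sum(v * v for v in interior)


def step(boundary, interior):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(exchange_halo, boundary)  # start the communication
        interior_result = compute_interior(interior)    # overlap: compute meanwhile
        halo = pending.result()                         # wait for the transfer
    return interior_result + sum(halo)


total = step([1, 2], [3, 4])
```

The extra bookkeeping visible even in this small example (a pending handle and an explicit wait) is the kind of added complexity the abstract attributes to hand-written asynchronous code.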

    SIMD based multicore processor for image and video processing

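    Degree system: new; report number: Kō (甲) No. 3602; degree type: Doctor of Engineering; date conferred: 2012/3/15; Waseda University degree certificate number: Shin (新) 595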

    Processamento de imagens médicas usando GPU

    Master's degree in Computer and Telematics Engineering. The CapView application uses a classification algorithm based on SVMs (Support Vector Machines) for automatic topographic segmentation of gastrointestinal tract videos obtained through capsule endoscopy. This work explores the use of graphics processors (GPUs) to parallelize the segmentation algorithm. After an optimization phase of the sequential version, the performance of two approaches was compared: (1) development of the host code only, with support from specialized GPU libraries, and (2) development of all the code, including the part that runs on the GPU. Both approaches yielded substantial gains, with speedups between 1.4 and 7 times in tests on several different individual GPU models. On a cluster of 4 GPUs of the most capable model, speedups between 26.2 and 27.2 times were achieved in all tested cases, compared to the optimized sequential version. The methods developed were integrated into the CapView application, which is used routinely in hospital environments.