
    GPU acceleration for statistical gene classification

    The use of bioinformatic tools in routine clinical diagnostics still faces a number of issues. The more complex and advanced bioinformatic tools become, the more performance is required of the computing platforms. Unfortunately, the cost of parallel computing platforms is usually prohibitive for both public and small private medical practices. This paper presents a successful experience in using the parallel processing capabilities of Graphics Processing Units (GPUs) to speed up bioinformatic tasks such as the statistical classification of gene expression profiles. The results show that using open source CUDA programming libraries makes it possible to obtain a significant increase in performance and therefore to shorten the gap between advanced bioinformatic tools and real medical practice.
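    The abstract does not say which statistic the classifier relies on, so the following is only a minimal CUDA sketch of the kind of per-gene computation involved: a two-sample t-statistic for each gene expression profile, computed with one GPU thread per gene. The kernel name, data layout and group sizes are illustrative assumptions, not code from the paper.

```cuda
// Hedged sketch: per-gene Welch t-statistic on the GPU, one thread per gene.
// Assumed layout: expr[g * (nA + nB) + s], group A samples stored first.
#include <cuda_runtime.h>
#include <math.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

__global__ void geneTStat(const float *expr, float *t, int nGenes, int nA, int nB)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= nGenes) return;
    const float *row = expr + (size_t)g * (nA + nB);

    float mA = 0.f, mB = 0.f;
    for (int s = 0; s < nA; ++s) mA += row[s];
    for (int s = 0; s < nB; ++s) mB += row[nA + s];
    mA /= nA;  mB /= nB;

    float vA = 0.f, vB = 0.f;
    for (int s = 0; s < nA; ++s) { float d = row[s] - mA;      vA += d * d; }
    for (int s = 0; s < nB; ++s) { float d = row[nA + s] - mB; vB += d * d; }
    vA /= (nA - 1);  vB /= (nB - 1);

    t[g] = (mA - mB) / sqrtf(vA / nA + vB / nB + 1e-12f);
}

int main(void)
{
    const int nGenes = 10000, nA = 20, nB = 20, nS = nA + nB;
    std::vector<float> hExpr((size_t)nGenes * nS);
    for (size_t i = 0; i < hExpr.size(); ++i)
        hExpr[i] = (float)rand() / RAND_MAX;              // toy expression values

    float *dExpr, *dT;
    cudaMalloc((void **)&dExpr, hExpr.size() * sizeof(float));
    cudaMalloc((void **)&dT, nGenes * sizeof(float));
    cudaMemcpy(dExpr, hExpr.data(), hExpr.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256, blocks = (nGenes + threads - 1) / threads;
    geneTStat<<<blocks, threads>>>(dExpr, dT, nGenes, nA, nB);

    std::vector<float> hT(nGenes);
    cudaMemcpy(hT.data(), dT, nGenes * sizeof(float), cudaMemcpyDeviceToHost);
    printf("t-statistic of gene 0: %f\n", hT[0]);
    cudaFree(dExpr); cudaFree(dT);
    return 0;
}
```

    The per-gene scores returned to the host could feed any downstream classifier; the point of the sketch is only the one-thread-per-gene mapping that makes this kind of workload a natural fit for CUDA.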

    Genomic co-processor for long read assembly

    Genomics data is transforming medicine and our understanding of life in fundamental ways; however, its growth is far outpacing Moore's Law. Third-generation sequencing technologies produce reads 100X longer than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. In order to unlock the vast potential of exponentially growing genomics data, domain-specific acceleration provides one of the few remaining approaches to continue scaling compute performance and efficiency, since general-purpose architectures are struggling to handle the huge amount of data needed for genome alignment. The aim of this project is to implement a genomic co-processor targeting HPC FPGAs, starting from the Darwin FPGA co-processor. In this scenario, the final objective is the simulation and implementation of the algorithms described by Darwin on Alveo boards, exploiting High Bandwidth Memory (HBM) to increase performance.

    Registration using Graphics Processor Unit

    Data point set registration is an important operation in coordinate metrology. Registration is the operation by which sampled point clouds are aligned with a CAD model through a 4x4 homogeneous transformation (i.e., a rotation and translation). This alignment permits validation of the produced artifact's geometry. State-of-the-art metrology systems are now capable of generating thousands, if not millions, of data points during an inspection operation, so increased computational power is needed to fully utilize these larger data sets. The registration process is an iterative nonlinear optimization whose execution time is directly related to the number of points processed and the complexity of the CAD model. The objective function to be minimized is the sum of the squared distances between each point in the point cloud and the closest surface in the CAD model. A brute-force approach to registration, which is often used, is to compute the minimum distance between each point and every surface in the CAD model. As point cloud sizes and CAD model complexity increase, this approach becomes intractable and inefficient. Highly efficient numerical and analytical gradient-based algorithms exist whose goal is to converge to an optimal solution in minimum time. This thesis presents a new approach that performs the registration process efficiently by employing readily available computer hardware, the graphics processing unit (GPU). The data point set registration time on the GPU shows a significant improvement (around 15-20 times) over typical CPU performance. Efficient GPU programming decreases the complexity of the steps and improves the rate of convergence of the existing algorithms. The experiments reveal the exponentially increasing execution time of the CPU and the linear behaviour of the GPU across various aspects of the algorithm, and the continuing importance of the CPU in GPU programming is highlighted. Possible extensions of the GPU approach to higher-order and more complex coordinate metrology algorithms are also discussed.
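    As a hedged illustration of the brute-force distance step described above, the sketch below assigns one CUDA thread per measured point and lets it search a sampled representation of the CAD model for the closest point. The thesis itself works against analytic CAD surfaces, so the point-sampled model, kernel name and data layout here are assumptions, not the author's implementation.

```cuda
// Hedged sketch of the brute-force step only: for each measured point, find the
// minimum squared distance to a point-sampled stand-in for the CAD model.
#include <cuda_runtime.h>
#include <float.h>
#include <cstdio>
#include <vector>

struct Point3 { float x, y, z; };

__global__ void minSqDist(const Point3 *cloud, int nCloud,
                          const Point3 *model, int nModel, float *d2min)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nCloud) return;

    Point3 p = cloud[i];
    float best = FLT_MAX;
    for (int j = 0; j < nModel; ++j) {          // brute force over the model sample
        float dx = p.x - model[j].x;
        float dy = p.y - model[j].y;
        float dz = p.z - model[j].z;
        float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 < best) best = d2;
    }
    d2min[i] = best;    // summed on the host, this gives the registration objective
}

int main(void)
{
    const int nCloud = 1 << 14, nModel = 1 << 12;
    std::vector<Point3> hCloud(nCloud), hModel(nModel);
    for (int i = 0; i < nCloud; ++i) hCloud[i] = { i * 0.01f, 0.f, 0.f };   // toy data
    for (int j = 0; j < nModel; ++j) hModel[j] = { j * 0.04f, 0.1f, 0.f };

    Point3 *dCloud, *dModel; float *dD2;
    cudaMalloc((void **)&dCloud, nCloud * sizeof(Point3));
    cudaMalloc((void **)&dModel, nModel * sizeof(Point3));
    cudaMalloc((void **)&dD2, nCloud * sizeof(float));
    cudaMemcpy(dCloud, hCloud.data(), nCloud * sizeof(Point3), cudaMemcpyHostToDevice);
    cudaMemcpy(dModel, hModel.data(), nModel * sizeof(Point3), cudaMemcpyHostToDevice);

    int threads = 256, blocks = (nCloud + threads - 1) / threads;
    minSqDist<<<blocks, threads>>>(dCloud, nCloud, dModel, nModel, dD2);

    std::vector<float> hD2(nCloud);
    cudaMemcpy(hD2.data(), dD2, nCloud * sizeof(float), cudaMemcpyDeviceToHost);
    double objective = 0.0;
    for (float d : hD2) objective += d;
    printf("sum of squared closest distances: %f\n", objective);
    return 0;
}
```

    Summing the per-point minima on the host yields the objective (the sum of squared closest distances) that the iterative optimizer then minimizes over the 4x4 rigid transformation.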

    CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

    Background: Searching for similarities in protein and DNA databases has become a routine procedure in molecular biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all possible alignments between two sequences and, as a result, returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching for similarities in large sets of sequences. For these reasons, heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphics cards to develop high-performance solutions for sequence alignment. Results: In this paper we present what we believe is the fastest solution to the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last generation of Graphics Processing Units (GPUs), the G80. Speeds of more than 3.5 GCUPS (giga cell updates per second) are achieved on a workstation running two GeForce 8800 GTX cards. Exhaustive tests have been carried out to compare our implementation to SSEARCH and BLAST running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware. Conclusions: The results show that graphics cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large-scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the widely adopted heuristic approaches.
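    The paper's kernels are not reproduced here, so the following is only a rough sketch of one common GPU strategy for this workload: one thread scores the query against one database sequence, keeping two dynamic-programming rows in per-thread memory. It uses a simple match/mismatch score with a linear gap penalty and reports scores only (no traceback), so it is much simpler than the substitution-matrix implementation the abstract describes; all names and sizes are assumptions.

```cuda
// Hedged sketch: Smith-Waterman local-alignment *scores*, one thread per
// database sequence, linear gap penalty, rolling-row dynamic programming.
#include <cuda_runtime.h>
#include <cstdio>
#include <string>
#include <vector>

#define MAX_SUBJECT 256   // assumed cap on database sequence length

__global__ void swScore(const char *query, int qLen,
                        const char *db, const int *offsets, const int *lengths,
                        int nSeqs, int *scores, int match, int mismatch, int gap)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= nSeqs) return;

    const char *subj = db + offsets[t];
    int sLen = lengths[t];
    if (sLen > MAX_SUBJECT) sLen = MAX_SUBJECT;

    int Hprev[MAX_SUBJECT + 1], Hcur[MAX_SUBJECT + 1];
    for (int j = 0; j <= sLen; ++j) Hprev[j] = 0;

    int best = 0;
    for (int i = 1; i <= qLen; ++i) {
        Hcur[0] = 0;
        for (int j = 1; j <= sLen; ++j) {
            int diag = Hprev[j - 1] + (query[i - 1] == subj[j - 1] ? match : mismatch);
            int up   = Hprev[j] - gap;
            int left = Hcur[j - 1] - gap;
            int h = diag;
            if (up > h) h = up;
            if (left > h) h = left;
            if (h < 0) h = 0;                  // local alignment floor
            Hcur[j] = h;
            if (h > best) best = h;
        }
        for (int j = 0; j <= sLen; ++j) Hprev[j] = Hcur[j];
    }
    scores[t] = best;
}

int main(void)
{
    std::string query = "GATTACA";
    std::vector<std::string> subjects = { "GCATGCU", "GATTACAGATTACA", "TTTTTTT" };

    std::string flat;
    std::vector<int> offsets, lengths;
    for (const auto &s : subjects) {
        offsets.push_back((int)flat.size());
        lengths.push_back((int)s.size());
        flat += s;
    }
    int nSeqs = (int)subjects.size();

    char *dQuery, *dDb; int *dOff, *dLen, *dScores;
    cudaMalloc((void **)&dQuery, query.size());
    cudaMalloc((void **)&dDb, flat.size());
    cudaMalloc((void **)&dOff, nSeqs * sizeof(int));
    cudaMalloc((void **)&dLen, nSeqs * sizeof(int));
    cudaMalloc((void **)&dScores, nSeqs * sizeof(int));
    cudaMemcpy(dQuery, query.data(), query.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dDb, flat.data(), flat.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dOff, offsets.data(), nSeqs * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dLen, lengths.data(), nSeqs * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128, blocks = (nSeqs + threads - 1) / threads;
    swScore<<<blocks, threads>>>(dQuery, (int)query.size(), dDb, dOff, dLen,
                                 nSeqs, dScores, 2, -1, 2);

    std::vector<int> scores(nSeqs);
    cudaMemcpy(scores.data(), dScores, nSeqs * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nSeqs; ++i)
        printf("subject %d: score %d\n", i, scores[i]);
    return 0;
}
```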

    A performance focused, development friendly and model aided parallelization strategy for scientific applications

    The amelioration of high performance computing platforms has provided unprecedented computing power through the evolution of multi-core CPUs, massively parallel architectures such as General Purpose Graphics Processing Units (GPGPUs) and Many Integrated Core (MIC) architectures such as Intel's Xeon Phi coprocessor. However, it is a great challenge to leverage the capabilities of such advanced supercomputing hardware, as it requires efficient and effective parallelization of scientific applications. This task is difficult mainly due to the complexity of scientific algorithms coupled with the variety of available hardware and disparate programming models. To address these challenges, this thesis presents a parallelization strategy for accelerating scientific applications that maximizes the opportunities for speedup while minimizing development effort. Parallelization is a three-step process: (1) choose a compatible combination of architecture and parallel programming language, (2) translate the base code/algorithm to the parallel language and (3) optimize and tune the application. In this research, a quantitative comparison of run times for various implementations of the k-means algorithm is used to establish that native languages (OpenMP, MPI, CUDA) perform better on their respective architectures than vendor-neutral languages such as OpenCL. A qualitative model is used to select an optimal architecture for a given application by aligning the capabilities of accelerators with the characteristics of the application. Once the optimal architecture is chosen, the corresponding native language is employed. This approach provides the best performance with reasonable accuracy (78%) in predicting a fitting combination, while eliminating the need to explore different architectures individually. It reduces the required development effort considerably, as the application need not be rewritten in multiple languages; the focus can be solely on optimization and tuning to achieve the best performance on the available architectures with minimal investment in cost and effort. To verify the prediction accuracy of the qualitative model, the OpenDwarfs benchmark suite, which implements the Berkeley dwarfs in OpenCL, is used. A dwarf is an algorithmic method that captures a pattern of computation and communication. For the purpose of this research, the focus is on nine applications from various algorithmic domains that cover the seven dwarfs identified by Phillip Colella as omnipresent in scientific and engineering applications. To validate the parallelization strategy as a whole, a case study is undertaken: parallelization of the lower-upper (LU) decomposition used in Gaussian elimination, from the linear algebra domain, using conventional trial-and-error methods as well as the proposed 'Architecture First, Language Later' strategy, with the development efforts incurred contrasted for both methods. The proposed strategy is observed to reduce development effort by an average of 50%.
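    The run-time comparison above uses k-means as its test case. As a hedged illustration of what a CUDA-native version of such a kernel looks like, the sketch below shows only the assignment step of k-means (one thread per data point, nearest centroid by squared Euclidean distance); the names, dimensions and launch configuration are assumptions rather than the thesis code.

```cuda
// Hedged sketch: k-means assignment step, one thread per data point.
// Assumed layouts: points[i * dim + d], centroids[c * dim + d].
#include <cuda_runtime.h>
#include <float.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

__global__ void assignClusters(const float *points, const float *centroids,
                               int *labels, int nPoints, int k, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPoints) return;

    int best = 0;
    float bestDist = FLT_MAX;
    for (int c = 0; c < k; ++c) {
        float d2 = 0.f;
        for (int d = 0; d < dim; ++d) {
            float diff = points[i * dim + d] - centroids[c * dim + d];
            d2 += diff * diff;
        }
        if (d2 < bestDist) { bestDist = d2; best = c; }
    }
    labels[i] = best;
}

int main(void)
{
    const int nPoints = 100000, k = 8, dim = 4;
    std::vector<float> hPoints(nPoints * dim), hCentroids(k * dim);
    for (auto &v : hPoints)    v = (float)rand() / RAND_MAX;   // toy data
    for (auto &v : hCentroids) v = (float)rand() / RAND_MAX;

    float *dPoints, *dCentroids; int *dLabels;
    cudaMalloc((void **)&dPoints, hPoints.size() * sizeof(float));
    cudaMalloc((void **)&dCentroids, hCentroids.size() * sizeof(float));
    cudaMalloc((void **)&dLabels, nPoints * sizeof(int));
    cudaMemcpy(dPoints, hPoints.data(), hPoints.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dCentroids, hCentroids.data(), hCentroids.size() * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256, blocks = (nPoints + threads - 1) / threads;
    assignClusters<<<blocks, threads>>>(dPoints, dCentroids, dLabels, nPoints, k, dim);

    std::vector<int> hLabels(nPoints);
    cudaMemcpy(hLabels.data(), dLabels, nPoints * sizeof(int), cudaMemcpyDeviceToHost);
    printf("point 0 assigned to cluster %d\n", hLabels[0]);
    return 0;
}
```

    A full k-means run would alternate this kernel with a centroid-update step until the assignments stop changing; the OpenMP, MPI and OpenCL variants benchmarked in the thesis parallelize the same loop structure in their respective models.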

    Accelerating Malware Detection via a Graphics Processing Unit

    Real-time malware analysis requires processing large amounts of stored data to look for suspicious files. This is a time-consuming process that requires a large amount of processing power and often affects other applications running on a personal computer. This research investigates the viability of using Graphics Processing Units (GPUs), present in many personal computers, to offload the workload normally processed by the standard Central Processing Unit (CPU). Three experiments are conducted using an industry-standard GPU, the NVIDIA GeForce 9500 GT card. The goal of the first experiment is to find the optimal number of threads per block for calculating an MD5 file hash. The goal of the second experiment is to find the optimal number of threads per block for searching an MD5 hash database for matches. In the third experiment, the size of the executable, the executable type (benign or malicious), and the processing hardware are varied in a full factorial experimental design. The experiment records whether the file is benign or malicious and measures the time required to identify the executable. This information is used to compare the performance of GPU hardware against CPU hardware. Experimental results show that a GPU can calculate an MD5 signature hash and scan a database of malicious signatures 82% faster than a CPU for files between 0 and 96 kB. If the file size is increased to 97-192 kB, the GPU is 85% faster than the CPU. This demonstrates that the GPU can provide a significant performance increase over a CPU. These results could lead to faster anti-malware products, faster network intrusion detection system response times, and faster firewall applications.
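    A full MD5 kernel is too long to reproduce here, so the sketch below covers only the second experiment's workload: scanning a database of 16-byte MD5 signatures for a match, with one thread per signature and an adjustable threads-per-block value, the parameter the thesis tunes. The signature data, names and sizes are illustrative assumptions.

```cuda
// Hedged sketch of the signature-lookup step only: each thread compares one
// 16-byte MD5 digest in the signature database with the digest of the file
// under test. The MD5 computation itself is omitted.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

__global__ void scanSignatures(const unsigned char *db, int nSigs,
                               const unsigned char *target, int *matchIdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nSigs) return;

    const unsigned char *sig = db + i * 16;
    bool equal = true;
    for (int b = 0; b < 16; ++b)
        if (sig[b] != target[b]) { equal = false; break; }

    if (equal) *matchIdx = i;    // benign race: any matching index will do
}

int main(void)
{
    const int nSigs = 1 << 20;                       // one million fake signatures
    std::vector<unsigned char> hDb(nSigs * 16);
    for (auto &b : hDb) b = (unsigned char)(rand() & 0xFF);

    // Pretend the scanned file hashed to the digest stored at index 12345.
    unsigned char hTarget[16];
    memcpy(hTarget, &hDb[12345 * 16], 16);

    unsigned char *dDb, *dTarget; int *dMatch;
    cudaMalloc((void **)&dDb, hDb.size());
    cudaMalloc((void **)&dTarget, 16);
    cudaMalloc((void **)&dMatch, sizeof(int));
    cudaMemcpy(dDb, hDb.data(), hDb.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dTarget, hTarget, 16, cudaMemcpyHostToDevice);
    int init = -1;
    cudaMemcpy(dMatch, &init, sizeof(int), cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;                       // the tunable parameter
    int blocks = (nSigs + threadsPerBlock - 1) / threadsPerBlock;
    scanSignatures<<<blocks, threadsPerBlock>>>(dDb, nSigs, dTarget, dMatch);

    int hMatch;
    cudaMemcpy(&hMatch, dMatch, sizeof(int), cudaMemcpyDeviceToHost);
    printf("matching signature index: %d\n", hMatch);
    return 0;
}
```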

    A computationally efficient framework for large-scale distributed fingerprint matching

    A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science, School of Computer Science and Applied Mathematics, May 2017. Biometric features have been widely adopted for forensic and civil applications. Among many different kinds of biometric characteristics, the fingerprint is globally accepted and remains the most widely used biometric characteristic in commercial and industrial settings due to its easy acquisition, uniqueness, stability and reliability. Various effective solutions are currently available; however, fingerprint identification is still not considered a fully solved problem, mainly due to accuracy and computational time requirements. Although many minutiae-based fingerprint recognition systems provide good accuracy, systems with very large databases require fast, real-time comparison of fingerprints and often either fail to meet the high speed requirements or compromise accuracy. For fingerprint matching involving databases containing millions of fingerprints, real-time identification can only be obtained through optimal algorithms that utilize the given hardware as robustly and efficiently as possible. There is currently no known distributed database and computing framework that provides a real-time solution to the fingerprint recognition problem for databases containing as many as sixty million fingerprints, a size close to that of the South African population. This research serves two main purposes: 1) exploit and scale the best known minutiae matching algorithm to a minimum of sixty million fingerprints; and 2) design a framework for a distributed database to deal with large fingerprint databases based on the results obtained in the former item.

    Iris localization using parallel computing

    In this thesis, we propose a parallel iris localization technique that implements Canny edge detection in parallel on a Graphics Processing Unit (GPU) using the Compute Unified Device Architecture (CUDA) platform. The output of the Canny edge detector, a binary edge image, is transferred from the GPU (device) to the CPU (host) and given as input to a serial circular Hough transform that locates the iris region in the image. We follow Wildes' approach to iris recognition, which applies an edge detector followed by a circular Hough transform to detect the iris region in an eye image. The Canny edge detection stage of iris localization is processed in parallel on the GPU, while the Hough transform runs serially on the CPU. In the edge detection stage, pixels are processed in parallel across the GPU cores, organized into blocks and threads, which reduces the execution time. We then compare the execution time of our parallel technique with the existing serial one: in our experiments, execution time is reduced by 10 to 12 percent compared with the serial approach. We use a 96-core NVidia GeForce GT 630 GPU for the implementation.
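    Canny edge detection consists of several stages (smoothing, gradient computation, non-maximum suppression and hysteresis thresholding). As a hedged sketch of the per-pixel block-and-thread decomposition described above, the kernel below implements only the gradient (Sobel) stage, one thread per pixel; the image size, names and launch geometry are assumptions, not the thesis code.

```cuda
// Hedged sketch: Sobel gradient magnitude (one stage of Canny), one thread per pixel.
#include <cuda_runtime.h>
#include <math.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

__global__ void sobelMagnitude(const unsigned char *gray, float *mag,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;  // interior only

    int idx = y * width + x;
    // 3x3 Sobel operators centred on (x, y)
    float gx = -gray[idx - width - 1] + gray[idx - width + 1]
               - 2.f * gray[idx - 1]  + 2.f * gray[idx + 1]
               - gray[idx + width - 1] + gray[idx + width + 1];
    float gy = -gray[idx - width - 1] - 2.f * gray[idx - width] - gray[idx - width + 1]
               + gray[idx + width - 1] + 2.f * gray[idx + width] + gray[idx + width + 1];
    mag[idx] = sqrtf(gx * gx + gy * gy);
}

int main(void)
{
    const int width = 640, height = 480;
    std::vector<unsigned char> hGray(width * height);
    for (auto &p : hGray) p = (unsigned char)(rand() & 0xFF);   // toy grayscale image

    unsigned char *dGray; float *dMag;
    cudaMalloc((void **)&dGray, hGray.size());
    cudaMalloc((void **)&dMag, width * height * sizeof(float));
    cudaMemcpy(dGray, hGray.data(), hGray.size(), cudaMemcpyHostToDevice);
    cudaMemset(dMag, 0, width * height * sizeof(float));        // border pixels stay 0

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    sobelMagnitude<<<grid, block>>>(dGray, dMag, width, height);

    std::vector<float> hMag(width * height);
    cudaMemcpy(hMag.data(), dMag, hMag.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("gradient magnitude at (100, 100): %f\n", hMag[100 * width + 100]);
    return 0;
}
```

    In the thesis pipeline, the gradient image produced on the device would feed the remaining Canny stages before the binary edge map is copied back to the host for the serial circular Hough transform.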