
    Parallel Computation of Nonrigid Image Registration

    Automatic intensity-based nonrigid image registration has a significant impact on medical applications such as multimodality image fusion, serial comparison for monitoring disease progression or regression, and minimally invasive image-guided interventions. However, due to the memory- and compute-intensive nature of the operations, intensity-based image registration has remained too slow to be practical for clinical adoption, with its use limited primarily to pre-operative tools. Efficient registration methods can open new possibilities for the development of improved and interactive intraoperative tools and capabilities. In this thesis, we propose an efficient parallel implementation of intensity-based three-dimensional nonrigid image registration on a commodity graphics processing unit. Optimization techniques are developed to accelerate the compute-intensive mutual information computation. The study is performed on the hierarchical volume subdivision-based algorithm, which is inherently faster than other nonrigid registration algorithms and structurally well suited to data-parallel computation platforms. The proposed implementation achieves more than a 50-fold runtime improvement over a standard CPU implementation. The execution time of nonrigid image registration is reduced from hours to minutes while retaining the same level of registration accuracy.
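
    The thesis accelerates the mutual information computation on a GPU; those implementation details are not reproduced here. As a minimal CPU-side sketch of the similarity measure itself, the NumPy snippet below computes mutual information from a joint intensity histogram, which is what intensity-based registration maximizes. The function name, bin count, and image shapes are illustrative assumptions, not the thesis code.

```python
# Illustrative sketch (not the thesis implementation): mutual information
# between a fixed and a moving image via a joint intensity histogram.
import numpy as np

def mutual_information(fixed, moving, bins=64):
    """MI of two same-shaped images, in nats."""
    # Joint histogram of paired voxel intensities.
    joint, _, _ = np.histogram2d(fixed.ravel(), moving.ravel(), bins=bins)
    pxy = joint / joint.sum()              # joint probability
    px = pxy.sum(axis=1, keepdims=True)    # marginal of fixed image
    py = pxy.sum(axis=0, keepdims=True)    # marginal of moving image
    nonzero = pxy > 0                      # avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

# Example: MI of a volume with a slightly noisy copy of itself is high.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 64))
print(mutual_information(img, img + 0.05 * rng.standard_normal(img.shape)))
```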

    A highly parameterized and efficient FPGA-based skeleton for pairwise biological sequence alignment


    Design and Evaluation of a BLAST Ungapped Extension Accelerator, Master's Thesis

    The amount of biosequence data being produced each year is growing exponentially. Extracting useful information from this massive amount of data is becoming an increasingly difficult task. This thesis focuses on accelerating the most widely used software tool for analyzing genomic data, BLAST. It presents Mercury BLAST, a novel method for accelerating searches through massive DNA databases. Mercury BLAST takes a streaming approach to the BLAST computation by offloading the performance-critical sections onto reconfigurable hardware. This hardware is then used in combination with the processor of the host system to deliver BLAST results in a fraction of the time of the general-purpose processor alone. Mercury BLAST combines new algorithms with reconfigurable hardware to accelerate BLAST-like similarity search. An evaluation of this method for use in real BLAST-like searches is presented, along with a characterization of the quality of results obtained when using these new algorithms in specialized hardware. The primary focus of this thesis is the design of the ungapped extension stage of Mercury BLAST. The architecture of the ungapped extension stage is described, along with the context of this stage within the Mercury BLAST system. The design is compact and runs over 20× faster than the standard software ungapped extension, yielding close to a 50× speedup over the complete software BLAST application. The quality of Mercury BLAST results is essentially equivalent to the standard BLAST results.
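
    The ungapped extension stage that the thesis moves into hardware follows BLAST's X-drop heuristic: extend a seed hit in both directions, accumulating a match/mismatch score, and stop a direction once the running score falls a fixed amount below the best seen so far. The following is a minimal software sketch of that heuristic, not the Mercury BLAST design; the scoring values and X-drop threshold are illustrative.

```python
# Minimal sketch of BLAST-style ungapped ("X-drop") extension around a
# seed hit. Scores and threshold are illustrative, not thesis parameters.
def ungapped_extend(query, subject, q_pos, s_pos, seed_len,
                    match=1, mismatch=-3, x_drop=10):
    """Extend an exact-match seed at (q_pos, s_pos) in both directions."""
    def extend(step, qi, si):
        # Walk outward one residue at a time, keeping the best score seen;
        # stop when the running score drops more than x_drop below it.
        best = score = off = 0
        while 0 <= qi + step * (off + 1) < len(query) and \
              0 <= si + step * (off + 1) < len(subject):
            off += 1
            same = query[qi + step * off] == subject[si + step * off]
            score += match if same else mismatch
            if score > best:
                best = score
            elif best - score > x_drop:
                break
        return best

    seed_score = seed_len * match  # seed assumed to be an exact match
    right = extend(+1, q_pos + seed_len - 1, s_pos + seed_len - 1)
    left = extend(-1, q_pos, s_pos)
    return seed_score + left + right

print(ungapped_extend("ACGTACGTAC", "ACGTACGAAC", 0, 0, 4))  # -> 7
```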

    A Framework for the Design and Analysis of High-Performance Applications on FPGAs using Partial Reconfiguration

    The field-programmable gate array (FPGA) is a dynamically reconfigurable digital logic chip used to implement custom hardware. The large densities of modern FPGAs and the capability of on-the-fly reconfiguration have made the FPGA a viable alternative to fixed-logic hardware chips such as the ASIC. In high-performance computing, FPGAs are used as co-processors to speed up computationally intensive processes or as autonomous systems that realize a complete hardware application. However, due to the limited capacity of FPGA logic resources, denser FPGAs must be purchased if more logic resources are required to realize all the functions of a complex application. Alternatively, partial reconfiguration (PR) can be used to swap, on demand, idle components of the application with active components. This research uses PR to swap components to improve the performance of the application given the limited logic resources available on smaller but economical FPGAs. The swap is called "resource sharing PR". In a pipelined design of multiple hardware modules (pipeline stages), resource sharing PR is a technique that uses PR to improve the performance of pipeline bottlenecks. This is done by reconfiguring other pipeline stages, typically those that are idle waiting for data from a bottleneck, into an additional parallel bottleneck module. The target pipeline of this research is a two-stage "slow-to-fast" pipeline, in which the flow of data traversing the pipeline transitions from a relatively slow, bottleneck stage to a fast stage. A two-stage pipeline that combines FPGA-based hardware implementations of well-known bioinformatics search algorithms, the X! Tandem algorithm and the Smith-Waterman algorithm, is implemented for this research; the implemented pipeline demonstrates the slow-to-fast characteristics of these algorithms. The experimental results show that, in a database of unknown peptide spectra, when matching spectra with 388 peaks or more, performing resource sharing PR to instantiate a parallel X! Tandem module is worth the cost of PR. In addition, from timings gathered during the experiments, a general formula was derived for determining the value of performing PR on a fast module.
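
    The abstract mentions a derived formula for when PR pays off but does not state it. As a hedged illustration of the underlying break-even reasoning only, the sketch below compares the remaining bottleneck time with and without reconfiguring the idle fast stage into a second bottleneck module; the function, its parameters, and all values are hypothetical, not the thesis's formula.

```python
# Simplified break-even model (not the thesis's derived formula): is it
# worth reconfiguring an idle fast stage into a second, parallel copy of
# the bottleneck stage? All names and numbers below are hypothetical.
def pr_is_worth_it(n_items, t_bottleneck, t_reconfig, n_parallel=2):
    """Compare remaining bottleneck time with and without PR.

    n_items      -- items still queued at the bottleneck stage
    t_bottleneck -- seconds per item in one bottleneck module
    t_reconfig   -- seconds to partially reconfigure the fast stage
    n_parallel   -- bottleneck modules available after PR
    """
    without_pr = n_items * t_bottleneck
    with_pr = t_reconfig + n_items * t_bottleneck / n_parallel
    return with_pr < without_pr

# PR pays off only when enough work remains to amortize reconfiguration.
print(pr_is_worth_it(n_items=200, t_bottleneck=0.2, t_reconfig=8.0))  # True
print(pr_is_worth_it(n_items=20,  t_bottleneck=0.2, t_reconfig=8.0))  # False
```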

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (Graphics Processing Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. The articles included in this book thus constitute an excellent reference for engineers and researchers with particular interests in each of these topics in parallel and distributed computing.

    Architecture for Document Clustering in Reconfigurable Hardware, Master's Thesis, December 2006

    High-performance document clustering systems enable similar documents to self-organize automatically into groups. In the past, the large amount of computational time needed to cluster documents prevented practical use of such systems with large document collections. A full hardware implementation of K-means clustering has been designed and implemented in reconfigurable hardware that rapidly clusters half a million documents. Documents and concepts are represented as vectors with 4000 dimensions. The circuit was implemented in Field Programmable Gate Array (FPGA) logic and uses four parallel cosine distance metrics to cluster document vectors together. The effect of the integer approximation of the cosine distance metric was investigated: measurements were performed to determine the effect of utilizing different numeric representations for the concept vectors. Compared to a full K-means implementation in software, it was found that using carefully chosen integer representations yielded clustering results that were nearly identical to those obtained using full floating-point representations. The hardware was synthesized and run on the Field Programmable Port Extender (FPX) platform. This implementation on the Virtex-E 2000 FPGA ran 26 times faster than algorithmically equivalent software running on a 3.60 GHz Intel Xeon. The same architecture was scaled to implement a faster and larger design for the Xilinx Virtex-4 LX200. This larger implementation outperformed the equivalent software version on the same Xeon by a factor of 328.
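
    The abstract combines two ideas: K-means with a cosine distance metric, and integer-quantized concept vectors. The NumPy sketch below illustrates both on toy data, clustering once in floating point and once after quantization; the bit width, cluster count, and data are illustrative choices, not the FPX design.

```python
# Illustrative sketch: cosine-similarity K-means plus integer quantization
# of the vectors, mirroring the comparison described in the abstract.
import numpy as np

def quantize(vectors, bits=8):
    """Round unit-norm vectors to signed integers of the given bit width."""
    scale = (1 << (bits - 1)) - 1   # e.g. 127 for 8-bit
    return np.round(vectors * scale).astype(np.int32)

def kmeans_cosine(docs, k, iters=20, seed=0):
    """K-means where similarity is the cosine of the angle between vectors."""
    rng = np.random.default_rng(seed)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    centers = docs[rng.choice(len(docs), size=k, replace=False)]
    for _ in range(iters):
        # With unit-norm rows, the dot product is the cosine similarity,
        # so argmax assigns each document to its most similar center.
        labels = np.argmax(docs @ centers.T, axis=1)
        for j in range(k):
            members = docs[labels == j]
            if len(members):
                c = members.mean(axis=0)
                centers[j] = c / np.linalg.norm(c)
    return labels

# Toy data: 500 non-negative "document vectors" of dimension 4000.
rng = np.random.default_rng(1)
docs = rng.random((500, 4000))
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
float_labels = kmeans_cosine(docs, k=8)
int_labels = kmeans_cosine(quantize(docs).astype(np.float64), k=8)
# Same seed and initial centers, so labels are directly comparable.
print("agreement:", np.mean(float_labels == int_labels))
```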
