8 research outputs found

    BLAST+: architecture and applications

    BACKGROUND: Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. RESULTS: We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. CONCLUSION: The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications
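The query chunking mentioned in the abstract above can be pictured with a minimal Python sketch. The function name `split_query`, the chunk size, and the overlap are illustrative assumptions; they are not values or APIs taken from BLAST+.

```python
def split_query(query, chunk_size=10_000, overlap=100):
    """Return (offset, chunk) pairs that cover `query` with overlapping windows.

    chunk_size and overlap are hypothetical defaults chosen for illustration.
    The overlap lets an alignment that straddles a chunk boundary still be
    seen whole inside at least one chunk, provided it is shorter than `overlap`.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, query[start:start + chunk_size]))
        # Stop once the current window reaches the end of the query.
        if start + chunk_size >= len(query):
            break
        start += step
    return chunks
```

Each chunk keeps its offset so that hit coordinates found within a chunk can be mapped back to positions in the full query.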

    Regular Expression Synthesis for BLAST Two-Hit Filtering

    Genomic databases are exhibiting a growth rate that is outpacing Moore's Law, which has made database search algorithms a popular application for use on emerging processor technologies. NCBI BLAST is the standard tool for performing searches against these databases; it operates by transforming each database query into a filter that is subsequently applied to the database. This requires a database scan for every query, fundamentally limiting its performance by I/O bandwidth. In this dissertation we present a functionally equivalent variation on the NCBI BLAST algorithm that maps more suitably to an FPGA implementation. This variation of the algorithm attempts to reduce the I/O requirement by leveraging FPGA-specific capabilities, such as high pattern matching throughput and explicit on-chip memory structure and allocation. Our algorithm transforms the database, not the query, into a filter that is stored as a hierarchical arrangement of three tables, the first two of which are stored on-chip and the third off-chip. Our results show that it is possible to achieve speedups of up to 8x based on the relative reduction in I/O of our approach versus that of NCBI BLAST, with a minimal impact on sensitivity. More importantly, the performance relative to NCBI BLAST improves with larger databases and query workload sizes.

    A deterministic finite automaton for faster protein hit detection in BLAST

    BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, typically taking minutes on modern workstations. Therefore, continuing evolution of BLAST, by improving its algorithms and optimizations, is essential to improve search times in the face of exponentially increasing collection sizes. We present an optimization to the first stage of the BLAST algorithm specifically designed for protein search. It produces the same results as NCBI-BLAST but in around 59% of the time on Intel-based platforms; we also present results for other popular architectures. Overall, this is a saving of around 15% of the total typical BLAST search time. Our approach uses a deterministic finite automaton (DFA), inspired by the original scheme used in the 1990 BLAST algorithm. The techniques are optimized for modern hardware, making careful use of cache-conscious approaches to improve speed. Our optimized DFA approach has been integrated into a new version of BLAST that is freely available for download at http://www.fsa-blast.org/
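To make the idea of automaton-based hit detection concrete, here is a deliberately simplified Python sketch. It assumes exact word matches only (real protein BLAST also considers high-scoring neighbourhood words), and the names `build_word_table` and `find_hits` are hypothetical; it does not reproduce the cache-conscious structure described in the abstract.

```python
from collections import defaultdict

def build_word_table(query, w=3):
    """Map every length-w word of the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def find_hits(query, subject, w=3):
    """Scan the subject once, reporting (query_pos, subject_pos) word hits.

    The automaton state is the last w-1 subject symbols; reading one more
    symbol completes a word, emits any query positions for it, and moves
    to the next state by dropping the oldest symbol.
    """
    table = build_word_table(query, w)
    hits = []
    state = subject[:w - 1]
    for j in range(w - 1, len(subject)):
        word = state + subject[j]
        for i in table.get(word, ()):
            hits.append((i, j - w + 1))
        state = word[1:]  # transition to the next state
    return hits
```

For example, `find_hits("MKVLA", "AAKVLAA")` reports the two shared words `KVL` and `VLA` with their positions in both sequences.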

    Design and analysis of an accelerated seed generation stage for BLASTP on the Mercury system - Master's Thesis, August 2006

    NCBI BLASTP is a popular sequence analysis tool used to study the evolutionary relationship between two protein sequences. Protein databases continue to grow exponentially as entire genomes of organisms are sequenced, making sequence analysis a computationally demanding task. For example, a search of the E. coli K12 proteome against the GenBank Non-Redundant database takes 36 hours on a standard workstation. In this thesis, we look to address the problem by accelerating protein searching using Field Programmable Gate Arrays. We focus our attention on the BLASTP heuristic, building on work done earlier to accelerate DNA searching on the Mercury platform. We analyze the performance characteristics of the BLASTP algorithm and explore the design space of the seed generation stage in detail. We propose a hardware/software architecture and evaluate the performance of the individual stage, and its effect on the overall BLASTP pipeline running on the Mercury system. The seed generation stage is 13x faster than the software equivalent, and the integrated BLASTP pipeline is predicted to yield a speedup of 50x over NCBI BLASTP. Mercury BLASTP also shows a 2.5x speed improvement over the only other BLASTP-like accelerator for FPGAs while consuming far fewer logic resources.

    Efficient homology search for genomic sequence databases

    Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. 
Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance; however, existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy. 
Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research
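The byte-packed representation mentioned in this abstract (four nucleotide bases per byte) can be illustrated with a small Python sketch. The 2-bit code assignment, the function names, and the table-driven comparison are assumptions made for illustration; they are not taken from the thesis or from the blast source.

```python
# Hypothetical 2-bit encoding; the real on-disk code assignment may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string whose length is a multiple of 4 into bytes."""
    if len(seq) % 4:
        raise ValueError("length must be a multiple of 4 in this sketch")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

def _zero_fields(x):
    """Count the 2-bit fields of x that are zero (bases that matched)."""
    return sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 3 == 0)

# One table lookup per byte replaces four per-base comparisons.
MATCH_TABLE = [_zero_fields(x) for x in range(256)]

def count_matches(packed_a, packed_b):
    """Count identical bases at the same positions without unpacking."""
    return sum(MATCH_TABLE[a ^ b] for a, b in zip(packed_a, packed_b))
```

XORing two packed bytes leaves a zero 2-bit field wherever the bases agree, so a precomputed 256-entry table yields the number of matching bases for four positions at once.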

    High performance bioinformatics and computational biology on general-purpose graphics processing units

    Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. 
However, both biological sequence alignment and phylogenetic analysis are computationally expensive applications as their computing and memory requirements grow polynomially or even worse with the size of sequence databases. The thesis first presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to solve the restriction on the length of the query sequence in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between two main task parallelization approaches (Inter-task and Intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. flexible length of sequences in real world applications. It also outperforms an equivalent GPP-based implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speed-up compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speed-up compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method, however, only gives one possible tree, which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. 
The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speed up compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGA) technology
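For readers unfamiliar with the Smith-Waterman algorithm that this thesis parallelizes, the following is a plain sequential Python sketch of the local alignment score. It is not the GPU design from the thesis; the linear gap penalty and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming matrix, so memory grows with len(b) rather than
    len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That independence is what makes parallelizing a single alignment possible, while many separate alignments can also be run side by side.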

    Genome evolution in Prochlorococcus and marine Synechococcus


    Detection and management of redundancy for information retrieval

    The growth of the web, authoring software, and electronic publishing has led to the emergence of a new type of document collection that is decentralised, amorphous, dynamic, and anarchic. In such collections, redundancy is a significant issue. Documents can spread and propagate across such collections without any control or moderation. Redundancy can interfere with the information retrieval process, leading to decreased user amenity in accessing information from these collections, and thus must be effectively managed. The precise definition of redundancy varies with the application. We restrict ourselves to documents that are co-derivative: those that share a common heritage, and hence contain passages of common text. We explore document fingerprinting, a well-known technique for the detection of co-derivative document pairs. Our new lossless fingerprinting algorithm improves the effectiveness of a range of document fingerprinting approaches. We empirically show that our algorithm can be highly effective at discovering co-derivative document pairs in large collections. We study the occurrence and management of redundancy in a range of application domains. On the web, we find that document fingerprinting is able to identify widespread redundancy, and that this redundancy has a significant detrimental effect on the quality of search results. Based on user studies, we suggest that redundancy is most appropriately managed as a postprocessing step on the ranked list and explain how and why this should be done. In the genomic area of sequence homology search, we explain why the existing techniques for redundancy discovery are increasingly inefficient, and present a critique of the current approaches to redundancy management. We show how document fingerprinting with a modified version of our algorithm provides significant efficiency improvements, and propose a new approach to redundancy management based on wildcards. 
We demonstrate that our scheme provides the benefits of existing techniques but does not have their deficiencies. Redundancy in distributed information retrieval systems - where different parts of the collection are searched by autonomous servers - cannot be effectively managed using traditional fingerprinting techniques. We thus propose a new data structure, the grainy hash vector, for redundancy detection and management in this environment. We show in preliminary tests that the grainy hash vector is able to accurately detect a good proportion of redundant document pairs while maintaining low resource usage
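Document fingerprinting, as referred to in this abstract, can be sketched in a few lines of Python. This is a generic word n-gram scheme with a Jaccard-style overlap measure, not the lossless algorithm or the grainy hash vector proposed in the thesis; the function names, the n-gram length, and the use of SHA-1 are illustrative assumptions.

```python
import hashlib

def fingerprint(text, n=3):
    """Return the set of 64-bit hashes of all word n-grams in the text."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {
        int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:8], "big")
        for g in grams
    }

def resemblance(doc_a, doc_b, n=3):
    """Estimate shared content as |A & B| / |A | B| over the fingerprints."""
    fa, fb = fingerprint(doc_a, n), fingerprint(doc_b, n)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Two documents that share passages of text share n-grams, so a high resemblance score flags a candidate co-derivative pair. Practical systems usually keep only a selected subset of the hashes to limit storage.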