6 research outputs found

    HMM-FRAME: accurate protein domain classification for metagenomic sequences containing frameshift errors

    BACKGROUND: Protein domain classification is an important step in metagenomic annotation. The state-of-the-art method for protein domain classification is profile HMM-based alignment. However, the relatively high rates of insertions and deletions in homopolymer regions of pyrosequencing reads create frameshifts, causing conventional profile HMM alignment tools to generate alignments with marginal scores. This makes error-containing gene fragments unclassifiable with conventional tools. Thus, there is a need for an accurate domain classification tool that can detect and correct sequencing errors. RESULTS: We introduce HMM-FRAME, a protein domain classification tool based on an augmented Viterbi algorithm that can incorporate error models from different sequencing platforms. HMM-FRAME corrects sequencing errors and classifies putative gene fragments into domain families. It achieved high error detection sensitivity and specificity in a data set with annotated errors. We applied HMM-FRAME in Targeted Metagenomics and a published metagenomic data set. The results showed that our tool can correct frameshifts in error-containing sequences, generate much longer alignments with significantly smaller E-values, and classify more sequences into their native families. CONCLUSIONS: HMM-FRAME provides a complementary protein domain classification tool to conventional profile HMM-based methods for data sets containing frameshifts. Its current implementation is best used for small-scale metagenomic data sets. The source code of HMM-FRAME can be downloaded at http://www.cse.msu.edu/~zhangy72/hmmframe/ and at https://sourceforge.net/projects/hmm-frame/.

    Comparison of Coding DNA

    We discuss a model for the evolutionary distance between two coding DNA sequences that specializes to the DNA/protein model proposed in Hein [3]. We discuss the DNA/protein model in detail and present a quadratic-time algorithm that computes an optimal alignment of two coding DNA sequences in the model under the assumption of affine gap cost. The algorithm resolves a conjecture in [3], and we believe that the constant factor of the running time is small enough to make the algorithm feasible in practice.

    Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

    BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms
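
As a rough illustration of what "translated in all six frames" means in the abstract above, the following minimal sketch translates a DNA string in the three forward and three reverse-complement reading frames. It assumes the standard genetic code; the function names (`reverse_complement`, `translate`, `six_frame_translations`) and the frame labels are hypothetical and are not taken from TBLASTN.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# no alternative translation tables are needed for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map hypothetical frame labels '+1'..'+3', '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames["+1"])  # MA*
```

A translated search then compares the protein query against each of the six translations; the statistics and heuristics described in the abstract are outside the scope of this sketch.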

    Semantic Integration of Biomedical Analysis Systems

    This proposal suggests building a knowledge system that allows biomedical researchers to query and integrate complex bioinformatics data and image data through natural language queries. The goal of our database is to make data entry, organization, retrieval, manipulation, and integration efficient. Alzheimer's disease was chosen as our case study. A fundamental distinction between this knowledge system and conventional ones is that it supports both complex data organization and a powerful querying facility. SemanticObjects is an object-relational platform that has been jointly developed by the University of California, Irvine and NEC Soft, Japan as a tool for building object knowledge systems. It allows users to efficiently organize and store biological models and data as hierarchically structured complex objects. Users can query and manipulate the data in Structured Natural Language (SNL). Finally, we will rapidly deploy this SemanticObjects database as a web application, which makes it easy for the research community to share the results obtained from the proposed research. Our proposed system consists of: a) a text mining module, b) a microarray/SNP module, c) a gene network module, d) an image module, and e) a web laboratory module.

    Algorithms for the search of amino acid patterns in nucleic acid sequences.

    Several algorithms are described for finding regions in a nucleic acid sequence that, when translated into amino acids, are homologous to a given amino acid pattern. All algorithms are modifications of the dynamic programming method for sequence comparison in which the translation of codons is taken into account. One of the algorithms has been implemented as a FORTRAN 77 program. The program operates on files that follow the format of the EMBL Nucleotide Sequence Data Library.
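
The abstract above does not spell out its recurrences, so the following is only a sketch of one way dynamic programming can take codon translation into account, not the published algorithm. Matching one pattern residue consumes three nucleotides, and steps of one or two nucleotides model frameshifts. The function name `best_pattern_hit` and all scoring parameters are assumptions chosen for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def best_pattern_hit(pattern, dna, match=2, mismatch=-1, gap=-2, frameshift=-3):
    """Return (score, end) of the best hit of an amino acid pattern in DNA.

    score[i][j] is the best score for aligning pattern[:i] to a DNA region
    that ends at nucleotide position j and may start anywhere.
    All penalties are hypothetical defaults, not values from the paper.
    """
    dna = dna.upper()
    m, n = len(pattern), len(dna)
    score = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        score[0][j] = 0  # the hit may start at any DNA position for free
    for i in range(1, m + 1):
        for j in range(n + 1):
            best = score[i - 1][j] + gap  # pattern residue with no codon
            if j >= 3:
                aa = CODON_TABLE.get(dna[j - 3:j], "X")
                step = match if aa == pattern[i - 1] else mismatch
                best = max(best, score[i - 1][j - 3] + step)  # codon vs residue
                best = max(best, score[i][j - 3] + gap)  # extra codon in DNA
            for k in (1, 2):  # frameshift: skip one or two nucleotides
                if j >= k:
                    best = max(best, score[i][j - k] + frameshift)
            score[i][j] = best
    end = max(range(n + 1), key=lambda j: score[m][j])
    return score[m][end], end


print(best_pattern_hit("MA", "CCATGGCCTT"))  # (4, 8)
```

In the example, the codons ATG and GCC encode M and A and end at nucleotide position 8, so both residues match without penalty. The table has (m + 1) × (n + 1) cells with constant work per cell, which keeps the search quadratic in the input sizes.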