28 research outputs found

    ProFAT: a web-based tool for the functional annotation of protein sequences

    BACKGROUND: The functional annotation of proteins relies on published information concerning their close and remote homologues in sequence databases. Evidence for remote sequence similarity can be further strengthened by a similar biological background of the query sequence and identified database sequences. However, few tools exist so far that provide a means to include functional information in sequence database searches. RESULTS: We present ProFAT, a web-based tool for the functional annotation of protein sequences based on remote sequence similarity. ProFAT combines sensitive sequence database search methods and a fold recognition algorithm with a simple text-mining approach. ProFAT extracts identified hits based on their biological background by keyword-mining of annotations, features and, most importantly, literature associated with a sequence entry. A user-provided keyword list enables the user to search specifically for weak but biologically relevant homologues of an input query. The ProFAT server has been evaluated using the complete set of proteins from three different domain families, including their weak relatives, and could correctly identify between 90% and 100% of all domain family members studied in this context. ProFAT has furthermore been applied to a variety of proteins from different cellular contexts, and we provide evidence on how ProFAT can help in the functional prediction of proteins based on remotely conserved proteins. CONCLUSION: By employing sensitive database search programs as well as exploiting the functional information associated with database sequences, ProFAT can detect remote but biologically relevant relationships between proteins and will assist researchers in the prediction of protein function based on remote homologies.
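
The keyword-mining step described in this abstract can be pictured with a minimal sketch. It is not ProFAT's code: the hit records, their field names (`id`, `annotation`, `literature`) and the function `keyword_hits` are hypothetical, and plain substring matching stands in for whatever text mining the server actually performs.

```python
def keyword_hits(hits, keywords):
    """Keep only database hits whose associated text mentions a user keyword."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for hit in hits:
        # Pool the annotation and literature text of one sequence entry.
        text = " ".join([hit["annotation"], hit["literature"]]).lower()
        matched = [k for k in wanted if k in text]
        if matched:
            selected.append((hit["id"], matched))
    return selected


# Hypothetical weak hits from a relaxed database search.
example_hits = [
    {"id": "P1", "annotation": "putative kinase", "literature": "role in endocytosis"},
    {"id": "P2", "annotation": "hypothetical protein", "literature": ""},
    {"id": "P3", "annotation": "membrane protein", "literature": "Endocytosis and trafficking"},
]
example_selected = keyword_hits(example_hits, ["endocytosis", "kinase"])
```

Under these assumptions, a weak hit survives only when its text supports the biological context supplied by the user.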

    SeLOX: a locus of recombination site search tool for the detection and directed evolution of site-specific recombination systems

    Site-specific recombinases have become a resourceful tool for genome engineering, allowing sophisticated in vivo DNA modifications and rearrangements, including the precise removal of integrated retroviruses from host genomes. In a recent study, a mutant form of Cre recombinase was used to excise the provirus of a specific HIV-1 strain from the human genome. To achieve provirus excision, the Cre recombinase had to be evolved to recombine an asymmetric locus of recombination (lox)-like sequence present in the long terminal repeat (LTR) regions of an HIV-1 strain. One prerequisite for this type of work is the identification of degenerate lox-like sites in genomic sequences. Given their nature (two inverted repeats flanking a spacer of variable length), existing search tools like BLAST or RepeatMasker perform poorly. To address this lack of available algorithms, we have developed the web server SeLOX, which can identify degenerate lox-like sites within genomic sequences. SeLOX calculates a position weight matrix based on lox-like sequences, which is used to search genomic sequences. For computational efficiency, we transform sequences into binary space, which allows us to use a bit-wise AND Boolean operator for comparisons. In addition to finding lox-like sites for Cre-type recombinases in HIV LTR sequences, we have used SeLOX to identify lox-like sites in HIV LTRs for six yeast recombinases. Finally, we demonstrate the general usefulness of SeLOX in identifying lox-like sequences in large genomes by searching for Cre-type recombination sites in the entire human genome. SeLOX is freely available at http://selox.mpi-cbg.de/cgi-bin/selox/index

    BioBuilder as a database development and functional annotation platform for proteins

    BACKGROUND: The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and easy to manipulate by annotators, who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. RESULTS: BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms, with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop the Human Protein Reference Database, a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming the standard format for data exchange. CONCLUSIONS: As the proteomic data for several organisms begin to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein-centric databases. BioBuilder is open source and is available under the terms of the LGPL.

    HMMerThread: Detecting Remote, Functional Conserved Domains in Entire Genomes by Combining Relaxed Sequence-Database Searches with Fold Recognition

    Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false-positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ~4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response.
    The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de

    Reference Data

    No full text
    Flybase D. melanogaster Release 5.34 mRNA and polyadenylated ncRNA reference as a GTF file for use with the bowtie suite of tools.

    Prepare Flybase GTF file

    No full text
    Script using cufflinks' gffread to generate a clean GTF file for use with cufflinks and htseq.

    Preprocessors

    No full text
    Scripts to deal with reference files.

    morFeus: a web-based program to detect remotely conserved orthologs using symmetrical best hits and orthology network scoring

    Background: Searching for the orthologs of a given protein or DNA sequence is one of the most important and most commonly used bioinformatics methods in biology. Programs like BLAST or the orthology search engine Inparanoid can be used to find orthologs when the similarity between two sequences is sufficiently high. However, they fail when the level of conservation is low. The detection of remotely conserved proteins often involves sophisticated manual intervention that is difficult to automate. Results: Here, we introduce morFeus, a search program to find remotely conserved orthologs. Based on relaxed sequence similarity searches, morFeus selects sequences based on the similarity of their alignments to the query, tests for orthology by iterative reciprocal BLAST searches and calculates a network score for the resulting network of orthologs that is a measure of orthology independent of the E-value. Detecting remotely conserved orthologs of a protein using morFeus thus requires no manual intervention. We demonstrate the performance of morFeus by comparing it to state-of-the-art orthology resources and methods. We provide an example of remotely conserved orthologs, which were experimentally shown to be functionally equivalent in the respective organisms and therefore meet the criteria of the orthology-function conjecture. Conclusions: Based on our results, we conclude that morFeus is a powerful and specific search method for detecting remotely conserved orthologs.
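
The symmetrical best-hit test named in the title can be sketched as follows. This shows only the plain reciprocal check, not morFeus's iterative BLAST searches or its network score; the score dictionaries and the functions `best_hit` and `reciprocal_best_hits` are hypothetical and stand in for parsed search results between two proteomes A and B.

```python
def best_hit(scores, query):
    """Return the highest-scoring subject for query, or None if it has no hits."""
    hits = scores.get(query, {})
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit in B and a is b's best hit in A."""
    pairs = []
    for a in sorted(a_vs_b):
        b = best_hit(a_vs_b, a)
        if b is not None and best_hit(b_vs_a, b) == a:
            pairs.append((a, b))
    return pairs


# Hypothetical similarity scores: query -> {subject: score}.
a_vs_b = {"a1": {"b1": 50.0, "b2": 20.0}, "a2": {"b1": 45.0}}
b_vs_a = {"b1": {"a1": 48.0, "a2": 40.0}, "b2": {"a1": 18.0}}
example_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy data `a2` also hits `b1`, but `b1` prefers `a1`, so only the pair `a1`/`b1` passes the symmetric test.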

    Rapid Validation of Protein Identifications with the Borderline Statistical Confidence via De Novo Sequencing and MS BLAST Searches

    No full text
    Protein identifications with borderline statistical confidence are typically produced by matching a few marginal-quality MS/MS spectra to database peptide sequences and represent a significant bottleneck in the reliable and reproducible characterization of proteomes. Here, we present a method for rapid validation of borderline hits that circumvents the need for often biased manual inspection of raw MS/MS spectra. The approach takes advantage of the independent interpretation of corresponding MS/MS spectra by PepNovo de novo sequencing software followed by mass spectrometry-driven BLAST (MS BLAST) sequence-similarity database searches that utilize all partially inaccurate, degenerate and redundant candidate peptide sequences. In a case study involving the identification of more than 180 Caenorhabditis elegans proteins by nanoLC-MS/MS analysis on a linear ion trap LTQ mass spectrometer, the approach enabled rapid assignment (confirmation or rejection) of more than 70% of Mascot hits of borderline statistical confidence.
    Keywords: de novo sequencing • database searching • borderline hits • MS/MS • MS BLAST • PepNovo
    Nanoflow liquid chromatography-tandem mass spectrometry (nanoLC-MS/MS) is employed in a variety of bottom-up proteomics projects (reviewed in refs 1-4). Individual protein