
    FASTA Herder: a web application to trim protein sequence sets

    The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While homology information can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity, which hinders the computation and, especially, the visualization of multiple sequence alignments. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near full-length homologs, allowing for lower sequence identity thresholds. The method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca
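
The abstract does not spell out the FASTA Herder algorithm, so the following is only a minimal sketch of the general idea of culling redundant, near full-length homologs. Everything in it is an assumption made for illustration: the function names, the greedy longest-first strategy, the two thresholds, and the use of `difflib.SequenceMatcher` as a crude stand-in for alignment-based sequence identity.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def similarity(a, b):
    # Assumption: a string-matching ratio stands in for the identity that a
    # real pairwise alignment would report.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cull_redundant(records, min_identity=0.6, min_length_ratio=0.9):
    """Greedy culling (hypothetical): keep a sequence only if no kept
    representative is both near full-length and similar enough to it."""
    kept = []
    # Longest sequences first, so each representative is at least as long
    # as any sequence later compared against it.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        redundant = False
        for _, rep in kept:
            length_ratio = len(seq) / len(rep) if rep else 0.0
            if length_ratio >= min_length_ratio and similarity(seq, rep) >= min_identity:
                redundant = True
                break
        if not redundant:
            kept.append((header, seq))
    return kept
```

The length-ratio check is what lets the identity threshold be lowered: a short fragment is never merged into a long representative, however similar the overlapping region is.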

    Peer2ref: a peer-reviewer finding web tool that uses author disambiguation

    Background: Selecting reviewers and editors for peer review is getting harder for authors and publishers, because the progressive growth of the body of knowledge drives specialization into ever narrower areas of research. Examining the literature helps to find appropriate reviewers, but it is time-consuming and complicated by author name ambiguities. Results: We have developed a method called peer2ref to support authors and editors in selecting suitable reviewers for scientific manuscripts. Peer2ref works from a text input, usually the abstract of the manuscript, from which important concepts are extracted as keywords using a fuzzy binary relations approach. The keywords are searched against indexed word profiles constructed from the bibliography attributed to authors in MEDLINE. The names of these scientists have previously been disambiguated using coauthors identified across the whole of MEDLINE. The methods have been implemented in a web server that automatically suggests experts for peer review among scientists who have authored manuscripts published during the last decade in more than 3,800 journals indexed in MEDLINE. Conclusion: The peer2ref web server is publicly available at http://www.ogic.ca/projects/peer2ref/.

    Detection of alpha-solenoid proteins using a neural network.

    (A) A repeat is made of two helices (H1 and H2) separated by a linker sequence (L). Two detection windows of 19 amino acids are considered, one for each helix. During detection, different window shifts are tested by sliding the input windows H1 or H2 one residue away from the middle residue (red box), as indicated by the gaps between red and green boxes. (B) Precision-recall curves comparing the performance of ARD2 in identifying alpha-solenoids in our PDB set using different sets of parameters. A protein was identified as containing an alpha-solenoid if it had 3 or more hits above a given score threshold spaced between 30 and 135 amino acids of each other. This restricts the hits to an expected periodic range within 30 to 40 amino acids. The blue discontinuous and continuous curves show performance for the ARD and ARD2 training sets, respectively, without using window shifts. The discontinuous and continuous red curves show performance for the ARD and ARD2 training sets, respectively, for a window shift of 1. The points along each curve correspond to score thresholds from 0.80 to 0.90, in steps of 0.01. The best recall at 100% precision is obtained when using the window shift and a score threshold of 0.87 (precision = 1.00, recall = 0.28). The ARD2 training set produced generally better results than the ARD training set, and gave the best value of precision × recall at a score threshold of 0.86 (precision = 0.93, recall = 0.32). (C) Comparison of structures recalled from the positive set (Table S1) by the Armadillo profile from InterPro and by ARD2. Proteins detected outside the positive set circle (green) are consequently false positives (see Table S2 for a detailed list of the proteins detected).
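
The identification rule in panel (B) of the caption above can be sketched as follows. This is an illustrative reading, not the ARD2 implementation: the function name, the `(position, score)` input format and the interpretation of the spacing rule as a chain of hits whose successive members lie 30 to 135 residues apart are all assumptions.

```python
def has_alpha_solenoid(hits, threshold=0.86, min_gap=30, max_gap=135, min_hits=3):
    """Return True if at least `min_hits` hits scoring above `threshold`
    form a chain whose successive members are min_gap..max_gap residues apart.

    hits: iterable of (position, score) pairs (hypothetical input format).
    """
    positions = sorted(pos for pos, score in hits if score > threshold)
    # best[i] = length of the longest valid chain ending at positions[i]
    best = [1] * len(positions)
    for i, pos in enumerate(positions):
        for j in range(i):
            if min_gap <= pos - positions[j] <= max_gap:
                best[i] = max(best[i], best[j] + 1)
    return any(length >= min_hits for length in best)
```

For example, under these assumptions `has_alpha_solenoid([(10, 0.90), (50, 0.88), (95, 0.91)])` is true, because the three hits are 40 and 45 residues apart, whereas three hits packed within 20 residues of each other would be rejected.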

    Examples of detected alpha-solenoid structures.

    Each repeat consists of two alpha-helices, depicted here in red and green. (A) HEAT repeats buried in the core of the PI3KC catalytic subunit p110alpha (cyan), in complex with p85alpha (orange) (PDB ID 3HHM [22]). (B) Alpha-solenoid binding RNA in exportin5 (PDB ID 3A6P [23]). (C) Lipid-binding protein. Isoprenoid lipid directly binding the HEAT repeats is colored in magenta, zinc atom in blue (PDB ID 3DRA [25]). (D) TPR repeats protein, virulence regulator from Bacillus thuringiensis (PDB ID 2QFC [26]). (E) Ankyrin repeats protein Q5ZSV0 from Legionella pneumophila (PDB ID 2AJA [57]). (F) Irregular alpha-solenoid, glutamyl-tRNA synthetase from Thermotoga maritima (PDB ID 3AL0 [61]).

    Percentage of alpha-solenoids versus number of genes.

    Two-dimensional box plot of the percentage of alpha-solenoids against genome size, averaged for several representative species with completely sequenced genomes from four bacterial groups: Cyanobacteria, Planctomycetes, Firmicutes and Chlamydiae. Each box shows one of these four groups and summarizes two distributions: the percentage of alpha-solenoids associated with the genomes of the species of that group in the vertical direction, and the size of the genomes of the species of that group in the horizontal direction. In each direction, the box is limited by the first and third quartiles of the distribution. The middle line (horizontal or vertical) inside each box indicates the median value.
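
As a rough sketch of how one box in such a two-dimensional box plot could be derived, the snippet below computes the first quartile, median and third quartile separately for each axis. The function name and the quartile convention (`method="inclusive"`) are assumptions; the caption does not say which convention was used.

```python
from statistics import quantiles


def box_2d(genome_sizes, solenoid_percentages):
    """Summarize one group as a 2D box: (Q1, median, Q3) along each axis."""
    # quantiles(..., n=4) returns the three quartile cut points.
    x_q1, x_med, x_q3 = quantiles(genome_sizes, n=4, method="inclusive")
    y_q1, y_med, y_q3 = quantiles(solenoid_percentages, n=4, method="inclusive")
    return {
        "x": (x_q1, x_med, x_q3),  # horizontal extent: genome size
        "y": (y_q1, y_med, y_q3),  # vertical extent: percentage of alpha-solenoids
    }
```

The box for a group then spans from Q1 to Q3 on both axes, with the two median lines crossing inside it.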

    Alignment of rotatin homologs.

    A multiple sequence alignment of human rotatin and homologs in other species was produced and represented using BiasViz [62]. Top lane: Jpred3 2D prediction for human rotatin (red: gaps, green: alpha-helix, blue: beta-strand). Bottom part: multiple sequence alignment (red: gaps, black to white: score of ARD2 prediction from 0 to 1). Most of the secondary structure prediction is alpha-helical. Clusters of periodic alpha-solenoid hits can be seen at the positions indicated by the blue bars. Other scattered hits are distributed through the entire alignment.

    Functions of proteins with alpha-solenoids.

    Each protein is displayed with its PDB ID and the type of interaction its repeats are involved in. Although most of the structures dock to proteins, we point out here the involvement of alpha-solenoids in protein-protein (P/P), protein-lipid (P/L) and protein-nucleic acid (P/N, either DNA or RNA) interactions. The diversity of function is broader than previously known.