274 research outputs found

Measuring the accuracy of genome-size multiple alignments

Author: Prakash Amol
Tompa Martin
Publication venue: BioMed Central
Publication date: 01/01/2007

A novel computational approach can assess the accuracy of genomic alignments, and reveals suspicious regions in the 17-vertebrate MULTIZ alignment available on the UCSC Genome Browser

Crossref

Springer - Publisher Connector

PubMed Central

MicroFootPrinter: a tool for phylogenetic footprinting in prokaryotic genomes

Author: Neph Shane
Tompa Martin
Publication venue: Oxford University Press
Publication date: 14/07/2006

Phylogenetic footprinting is a method for the discovery of regulatory elements in a set of homologous regulatory regions, usually collected from multiple species. It does so by identifying the most conserved motifs in those homologous regions. This note describes web software that has been designed specifically for this purpose in prokaryotic genomes, making use of the phylogenetic relationships among the homologous sequences in order to make more accurate predictions. The software is called MicroFootPrinter and is available at

Crossref

PubMed Central

PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences

Author: Blanchette Mathieu
Sinha Saurabh
Tompa Martin
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: This paper addresses the problem of discovering transcription factor binding sites in heterogeneous sequence data, which includes regulatory sequences of one or more genes, as well as their orthologs in other species. RESULTS: We propose an algorithm that integrates two important aspects of a motif's significance – overrepresentation and cross-species conservation – into one probabilistic score. The algorithm allows the input orthologous sequences to be related by any user-specified phylogenetic tree. It is based on the Expectation-Maximization technique, and scales well with the number of species and the length of input sequences. We evaluate the algorithm on synthetic data, and also present results for data sets from yeast, fly, and human. CONCLUSIONS: The results demonstrate that the new approach improves motif discovery by exploiting multiple species information

Springer - Publisher Connector

PubMed Central

eScholarship@McGill

Meta-analysis of Inter-species Liver Co-expression Networks Elucidates Traits Associated with Common Human Diseases

Author: Narayanan Manikandan
Schadt Eric E.
Tompa Martin
Wang Kai
Zhong Hua
Zhu Jun
Publication venue: Public Library of Science
Publication date: 01/12/2009

Co-expression networks are routinely used to study human diseases like obesity and diabetes. Systematic comparison of these networks between species has the potential to elucidate common mechanisms that are conserved between human and rodent species, as well as those that are species-specific characterizing evolutionary plasticity. We developed a semi-parametric meta-analysis approach for combining gene-gene co-expression relationships across expression profile datasets from multiple species. The simulation results showed that the semi-parametric method is robust against noise. When applied to human, mouse, and rat liver co-expression networks, our method out-performed existing methods in identifying gene pairs with coherent biological functions. We identified a network conserved across species that highlighted cell-cell signaling, cell-adhesion and sterol biosynthesis as main biological processes represented in genome-wide association study candidate gene sets for blood lipid levels. We further developed a heterogeneity statistic to test for network differences among multiple datasets, and demonstrated that genes with species-specific interactions tend to be under positive selection throughout evolution. Finally, we identified a human-specific sub-network regulated by RXRG, which has been validated to play a different role in hyperlipidemia and Type 2 diabetes between human and mouse. Taken together, our approach represents a novel step forward in integrating gene co-expression networks from multiple large scale datasets to leverage not only common information but also differences that are dataset-specific

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Analysis of computational approaches for motif discovery

Author: CT Workman
DK Neal
G Pavesi
GEP Box
GZ Hertz
JD Hughes
M Burset
M Tompa
Martin Tompa
Nan Li
P McCullagh
S Sinha
TL Bailey
V Matys
Publication venue: BioMed Central
Publication date: 01/01/2006

Recently, we performed an assessment of 13 popular computational tools for discovery of transcription factor binding sites (M. Tompa, N. Li, et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites", Nature Biotechnology, Jan. 2005). This paper contains follow-up analysis of the assessment results, and raises and discusses some important issues concerning the state of the art in motif discovery methods: 1. We categorize the objective functions used by existing tools, and design experiments to evaluate whether any of these objective functions is the right one to optimize. 2. We examine various features of the data sets that were used in the assessment, such as sequence length and motif degeneracy, and identify which features make data sets hard for current motif discovery tools. 3. We identify an important feature that has not yet been used by existing tools and propose a new objective function that incorporates this feature

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A Computational Pipeline for High-Throughput Discovery of cis-Regulatory Noncoding RNA in Prokaryotes

Author: Gary Stormo
Jeffrey Barrick
Martin Tompa
Ronald Breaker
Shane Neph
Walter L Ruzzo
Zasha Weinberg
Zizhen Yao
Publication venue: Public Library of Science
Publication date: 01/07/2007

Noncoding RNAs (ncRNAs) are important functional RNAs that do not code for proteins. We present a highly efficient computational pipeline for discovering cis-regulatory ncRNA motifs de novo. The pipeline differs from previous methods in that it is structure-oriented, does not require a multiple-sequence alignment as input, and is capable of detecting RNA motifs with low sequence conservation. We also integrate RNA motif prediction with RNA homolog search, which improves the quality of the RNA motifs significantly. Here, we report the results of applying this pipeline to Firmicute bacteria. Our top-ranking motifs include most known Firmicute elements found in the RNA family database (Rfam). Comparing our motif models with Rfam's hand-curated motif models, we achieve high accuracy in both membership prediction and base-pair–level secondary structure prediction (at least 75% average sensitivity and specificity on both tasks). Of the ncRNA candidates not in Rfam, we find compelling evidence that some of them are functional, and analyze several potential ribosomal protein leaders in depth

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

CSpritz: accurate prediction of protein disorder segments with annotation for homology, secondary structure and linear motifs

Author: Alberto J. M. Martin
Albrecht
Alessandro Vullo
Ali
Altschul
Baldi
Berman
Cheng
Diella
Dosztanyi
Dunker
Dyson
Fuxreiter
Gianluca Pollastri
Gibson
Gould
Hemsley
Ian Walsh
Jones
Linding
Lise
Lobanov
Marsella
McGuffin
Melamud
Mika
Mizianty
Noivirt-Brik
Obradovic
Pollastri
Russell
Schaefer
Schlessinger
Sickmeier
Siltberg-Liberles
Silvio C. E. Tosatto
Sirota
Sollich
Tompa
Tomàs Di Domenico
Trovato
Uversky
Vanhee
Vullo
Ward
Weiss
Wright
Xue
Publication venue: Oxford University Press
Publication date: 01/01/2011

CSpritz is a web server for the prediction of intrinsic protein disorder. It is a combination of previous Spritz with two novel orthogonal systems developed by our group (Punch and ESpritz). Punch is based on sequence and structural templates trained with support vector machines. ESpritz is an efficient single sequence method based on bidirectional recursive neural networks. Spritz was extended to filter predictions based on structural homologues. After extensive testing, predictions are combined by averaging their probabilities. The CSpritz website can elaborate single or multiple predictions for either short or long disorder. The server provides a global output page, for download and simultaneous statistics of all predictions. Links are provided to each individual protein where the amino acid sequence and disorder prediction are displayed along with statistics for the individual protein. As a novel feature, CSpritz provides information about structural homologues as well as secondary structure and short functional linear motifs in each disordered segment. Benchmarking was performed on the very recent CASP9 data, where CSpritz would have ranked consistently well with a Sw measure of 49.27 and AUC of 0.828. The server, together with help and methods pages including examples, are freely available at URL: http://protein.bio.unipd.it/cspritz/

Crossref

PubMed Central

Archivio istituzionale della ricerca - Università di Padova

Algorithms for locating extremely conserved elements in multiple sequence alignments

Author: A Derti
A Siepel
A Visel
B Zhang
D Johnson
G Bejerano
Huei-Hun E Tseng
KM Chung
KY Chen
L Allison
L Wang
M Blanchette
M Garey
Martin Tompa
MH Goldwasser
S Prabhakar
WJ Kent
Y Sakuraba
Z Qin
Publication venue: BioMed Central
Publication date: 01/01/2009

BACKGROUND: In 2004, Bejerano et al. announced the startling discovery of hundreds of "ultraconserved elements", long genomic sequences perfectly conserved across human, mouse, and rat. Their announcement stimulated a flurry of subsequent research. RESULTS: We generalize the notion of ultraconserved element in a natural way from extraordinary human-rodent conservation to extraordinary conservation over an arbitrary set of species. We call these "Extremely Conserved Elements". There is a linear time algorithm to find all such Extremely Conserved Elements in any multiple sequence alignment, provided that the conservation is required to be across all the aligned species. For the general case of conservation across an arbitrary subset of the aligned species, we show that the question of whether there exists an Extremely Conserved Element is NP-complete. We illustrate the linear time algorithm by cataloguing all 177 Extremely Conserved Elements in the currently available 44-vertebrate whole-genome alignment, and point out some of the characteristics of these elements. CONCLUSIONS: The NP-completeness in the case of conservation across an arbitrary subset of the aligned species implies that it is unlikely an efficient algorithm exists for this general case. Despite this fact, for the interesting case of conservation across all or most of the aligned species, our algorithm is efficient enough to be practical. The 177 Extremely Conserved Elements that we catalog demonstrate many of the characteristics of the original ultraconserved elements of Bejerano et al.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A comprehensive assessment of long intrinsic protein disorder from the DisProt database

Author: Alfonso Valencia
Atkins
Bellay
Cilia
Damiano Piovesan
Davey
Dosztányi
Galzitskaya
Habchi
He
Hu
Ishida
Jones
Joshi
Kozlowski
Linding
Marco Necci
Martin
Metallo
Mizianty
Monastyrskyy
Necci
Oldfield
Pancsa
Peng
Peter Tompa
Piovesan
Prilusky
Schlessinger
Sickmeier
Silvio C E Tosatto
Sormanni
The UniProt Consortium
Tompa
Uversky
van der Lee
Velankar
Vucetic
Vullo
Walsh
Wang
Xue
Yang
Zhang
Zsuzsanna Dosztányi
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2018

MOTIVATION: Intrinsic disorder (ID), i.e. the lack of a unique folded conformation at physiological conditions, is a common feature for many proteins, which requires specialized biochemical experiments that are not high-throughput. Missing X-ray residues from the PDB have been widely used as a proxy for ID when developing computational methods. This may lead to a systematic bias, where predictors deviate from biologically relevant ID. Large benchmarking sets on experimentally validated ID are scarce. Recently, the DisProt database has been renewed and expanded to include manually curated ID annotations for several hundred new proteins. This provides a large benchmark set which has not yet been used for training ID predictors. RESULTS: Here, we describe the first systematic benchmarking of ID predictors on the new DisProt dataset. In contrast to previous assessments based on missing X-ray data, this dataset contains mostly long ID regions and a significant amount of fully ID proteins. The benchmarking shows that ID predictors work quite well on the new dataset, especially for long ID segments. However, a large fraction of ID still goes virtually undetected and the ranking of methods is different than for PDB data. In particular, many predictors appear to confound ID and regions outside X-ray structures. This suggests that the ID prediction methods capture different flavors of disorder and can benefit from highly accurate curated examples. © The Author 2017

Crossref

Repository of the Academy's Library

Archivio istituzionale della ricerca - Università di Padova