
Integration of curated databases to identify genotype-phenotype associations

Author: Gerstein Mark
Gianoulis Tara A
Goh Chern-Sing
Li Jianrong
Liu Yang
Lussier Yves A
Paccanaro Alberto
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The ability to rapidly characterize an unknown microorganism is critical both in responding to infectious disease and in biodefense. To do this, we need some way of anticipating an organism's phenotype based on the molecules encoded by its genome. However, the link between molecular composition (i.e., genotype) and phenotype for microbes is not obvious. While several studies have addressed this challenge, none has yet proposed a large-scale method integrating curated biological information. Here we use a systematic approach to discover genotype-phenotype associations that combines phenotypic information from a biomedical informatics database, GIDEON, with the molecular information contained in the National Center for Biotechnology Information's Clusters of Orthologous Groups database (NCBI COGs). RESULTS: Integrating the information in the two databases, we are able to correlate the presence or absence of a given protein in a microbe with its phenotype as measured by certain morphological characteristics or survival in a particular growth medium. At a correlation score threshold of 0.8, 66% of the associations found were confirmed by the literature, and at a threshold of 0.9, 86% were positively verified. CONCLUSION: Our results suggest possible phenotypic manifestations for proteins biochemically associated with sugar metabolism and electron transport. Moreover, we believe our approach can be extended to linking pathogenic phenotypes with functionally related proteins.
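
The correlation step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which correlation score was used, so the sketch assumes a plain Pearson correlation on 0/1 vectors (the phi coefficient). The function names, the protein and phenotype labels, and the toy data are all hypothetical and are not taken from GIDEON or NCBI COGs.

```python
from math import sqrt


def phi_correlation(x, y):
    """Pearson correlation of two equal-length 0/1 vectors (phi coefficient)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no association signal
    return cov / sqrt(var_x * var_y)


def associations(protein_profiles, phenotype_profiles, threshold=0.8):
    """Return (protein, phenotype, score) triples scoring at or above threshold."""
    hits = []
    for protein, presence in protein_profiles.items():
        for phenotype, trait in phenotype_profiles.items():
            score = phi_correlation(presence, trait)
            if score >= threshold:
                hits.append((protein, phenotype, score))
    return hits


# Hypothetical toy data: one entry per organism, 1 = present / trait observed.
proteins = {
    "COG_A": [1, 1, 1, 0, 0, 0],
    "COG_B": [1, 0, 1, 0, 1, 0],
}
phenotypes = {"grows_on_medium_X": [1, 1, 1, 0, 0, 0]}

hits = associations(proteins, phenotypes, threshold=0.8)
for protein, phenotype, score in hits:
    print(f"{protein} <-> {phenotype}: {score:.2f}")
```

Raising the threshold trades coverage for precision, which matches the pattern the abstract reports: fewer associations pass at 0.9 than at 0.8, but a larger share of them is confirmed by the literature.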

Columbia University Academic Commons

Directory of Open Access Journals

Clustering of Pseudomonas aeruginosa transcriptomes from planktonic cultures, developing and mature biofilms reveals distinct expression profiles

Author: Curtis Michael A
Hurst Jacob M
Littler Eddie
Paccanaro Alberto
Papakonstantinopoulou Anastasia
Saqi Mansoor
Waite Richard D
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Pseudomonas aeruginosa is a genetically complex bacterium which can adopt and switch between a free-living and a biofilm lifestyle, a versatility that enables it to thrive in many different environments and contributes to its success as a human pathogen. RESULTS: Transcriptomes derived from growth states relevant to the lifestyle of P. aeruginosa were clustered using three different methods (K-means, K-means spectral and hierarchical clustering). The culture conditions used for this study were: biofilms incubated for 8, 14, 24 and 48 h, and planktonic culture (logarithmic and stationary phase). This cluster analysis revealed, and clearly illustrated, the distinct expression profiles present in the dataset. Moreover, it gave an insight into which genes are up-regulated in planktonic, developing biofilm and confluent biofilm states. In addition, this analysis confirmed the contribution of quorum sensing (QS) and RpoS-regulated genes to the biofilm mode of growth, and enabled the identification of a 60.69 kbp region of the genome associated with stationary phase growth (stationary phase planktonic culture and confluent biofilms). CONCLUSION: This is the first study to use clustering to separate a large P. aeruginosa microarray dataset, consisting of transcriptomes obtained from diverse conditions relevant to its growth, into different expression profiles. These distinct expression profiles not only reveal novel aspects of P. aeruginosa gene expression but also provide a growth-specific transcriptomic reference dataset for the research community.

Queen Mary Research Online

Network modeling of patients' biomolecular profiles for clinical phenotype/outcome prediction

Author: A. Paccanaro
A. Petrini
E. Casiraghi
E. Vergani
G. Grossi
G. Valentini
J. Gliozzo
M. Frasca
M. Mesiti
M. Re
P. Perlasca
V. Vallacchi
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2020

Methods for phenotype and outcome prediction are largely based on inductive supervised models that use selected biomarkers to make predictions, without explicitly considering the functional relationships between individuals. We introduce a novel network-based approach named Patient-Net (P-Net) in which biomolecular profiles of patients are modeled in a graph-structured space that represents gene expression relationships between patients. A kernel-based semi-supervised transductive algorithm is then applied to the graph to explore its overall topology and to predict the phenotype/clinical outcome of patients. Experimental tests involving several publicly available datasets of patients afflicted with pancreatic, breast, colon and colorectal cancer show that our proposed method is competitive with state-of-the-art supervised and semi-supervised predictive systems. Importantly, P-Net also provides interpretable models that can be easily visualized to gain clues about the relationships between patients, and to formulate hypotheses about their stratification.

AIR Universita degli studi di Milano

jClust: a clustering and visualization toolbox

Author: Andreopoulos
C. N. Moschopoulos
Enright
G. A. Pavlopoulos
Gavin
Paccanaro
R. Schneider
S. D. Hooper
S. Kossida
Winter
Zhao
Publication venue: Oxford University Press
Publication date: 01/01/2009

jClust is a user-friendly application which provides access to a set of widely used clustering and clique finding algorithms. The toolbox allows a range of filtering procedures to be applied and is combined with an advanced implementation of the Medusa interactive visualization module. The implemented algorithms are k-Means, Affinity propagation, Bron–Kerbosch, MULIC, Restricted neighborhood search cluster algorithm, Markov clustering and Spectral clustering, while the supported filtering procedures are haircut, outside–inside, best neighbors and density control operations. The combination of a simple input file format and a set of clustering and filtering algorithms, linked together with the visualization tool, provides a powerful tool for data analysis and information extraction.

Open Repository and Bibliography - Luxembourg

SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale

Author: A Biegert
A Murzin
A Paccanaro
A Ruepp
AJ Enright
Alberto Paccanaro
AY Ng
B Everitt
D Arthur
D Ballard
EL Hong
G Wang
JJ Forman
JM Chandonia
K Verkhedkar
LJ Jensen
M Ashburner
M Meilă
M Newman
N Kannan
O Krishnadev
P Pipenbacher
P Shannon
Rajkumar Sasidharan
RB Lehoucq
S van Dongen
SF Altschul
T Fruchterman
Tamás Nepusz
Y Benjamini
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering have better performance on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community. Results: SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters as indicated by their F-scores is consistently better than all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences). Conclusions: Besides the spectral method, SCPS also implements connected component analysis and hierarchical clustering, integrates TribeMCL, provides different cluster quality tools, can extract human-readable protein descriptions using GI numbers from NCBI, interfaces with external tools such as BLAST and Cytoscape, and can produce publication-quality graphical representations of the clusters obtained, thus constituting a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.
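
Connected component analysis, one of the baseline methods named in this abstract, can be sketched in a few lines. This is an illustrative sketch only: it is not SCPS code and it is not the spectral method. It assumes the similarities arrive as a dictionary of scored protein pairs and that a single cutoff decides which pairs count as linked; the function name, identifiers, scores and cutoff are all hypothetical.

```python
def connected_components(items, similarities, threshold):
    """Group items linked by chains of pairwise similarities >= threshold."""
    parent = {item: item for item in items}

    def find(item):
        # Follow parent links to the set representative, halving the path.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (a, b), score in similarities.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two sets

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical similarity scores between five made-up protein identifiers.
proteins = ["p1", "p2", "p3", "p4", "p5"]
similarities = {
    ("p1", "p2"): 0.9,
    ("p2", "p3"): 0.7,
    ("p3", "p4"): 0.1,
    ("p4", "p5"): 0.8,
}

families = connected_components(proteins, similarities, threshold=0.5)
print(families)
```

Because any single above-threshold link merges two groups, this baseline tends to chain loosely related proteins together, which is one motivation for methods that consider the whole similarity matrix at once.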

Royal Holloway Research Online

Directory of Open Access Journals

Digital Repository @ Iowa State University (ISU)

Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling

Author: A Paccanaro
AP Dempster
AY Ng
C Caragea
C Yan
Cornelia Caragea
Drena Dobbs
H Berman
IS Dhillon
J Allers
J Davis
J Shi
JH Kim
Jivko Sinapov
M Terribilini
MI Jordan
N Qian
P Baldi
R Duda
S Russell
TG Dietterich
TG Diettrich
TM Mitchell
Vasant Honavar
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Identification of functionally important sites in biomolecular sequences has broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental determination of such sites lags far behind the number of known biomolecular sequences. Hence, there is a need to develop reliable computational methods for identifying functionally important sites from biomolecular sequences. Results: We present a mixture of experts approach to biomolecular sequence labeling that takes into account the global similarity between biomolecular sequences. Our approach combines unsupervised and supervised learning techniques. Given a set of sequences and a similarity measure defined on pairs of sequences, we learn a mixture of experts model by using spectral clustering to learn the hierarchical structure of the model and by using Bayesian techniques to combine the predictions of the experts. We evaluate our approach on two biomolecular sequence labeling problems: RNA-protein and DNA-protein interface prediction problems. The results of our experiments show that global sequence similarity can be exploited to improve the performance of classifiers trained to label biomolecular sequence data. Conclusion: The mixture of experts model helps improve the performance of machine learning methods for identifying functionally important sites in biomolecular sequences. This is a proceeding from IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 10 (2009): S4, doi: 10.1186/1471-2105-10-S4-S4. Posted with permission.

The Jackson Laboratory: The Mouseion at the JAXlibrary

A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative.

Author: Blau Hannah
Bramante Carolyn T
Buse John B
Callahan Tiffany J
Casiraghi Elena
Chan Lauren E
Coleman Ben D
Evans Michael D
Hall Margaret
Huling Jared D
Johnson Steven G
Laraway Bryan
Moffitt Richard A
Notaro Marco
Paccanaro Alberto
Raymond Shao Yu
Reese Justin
Robinson Peter N
Stürmer Til
Tronieri Jena S
Valentini Giorgio
Wilkins Kenneth J
Wong Rachel
Publication venue: The Mouseion at the JAXlibrary
Publication date: 01/03/2023

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm’s parameters and data-related modeling choices are also both crucial and challenging.

Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms

Author: A Paccanaro
AD Gonzalez Perez
AE Kel
AG Perez
AJ Enright
Andreas Tauch
CO Pabo
DJ Galas
I Brune
I Matic
J Baumbach
Jan Baumbach
K Brinkrolf
LM Hellman
LV Sun
M Beckstette
M Madan Babu
M Tompa
RL Tatusov
S Balaji
S Rahmann
SA Teichmann
SF Altschul
Sven Rahmann
T Wittkop
V Espinosa
WB Alkema
Publication venue: BioMed Central
Publication date: 01/01/2009

Baumbach J, Rahmann S, Tauch A. Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms. BMC Systems Biology. 2009;3(1):8.
Background: Transcriptional regulation of gene activity is essential for any living organism. Transcription factors therefore recognize specific binding sites within the DNA to regulate the expression of particular target genes. The genome-scale reconstruction of the emerging regulatory networks is important for biotechnology and human medicine but cost-intensive, time-consuming, and impossible to perform for any species separately. By using bioinformatics methods one can partially transfer networks from well-studied model organisms to closely related species. However, the prediction quality is limited by the low level of evolutionary conservation of the transcription factor binding sites, even within organisms of the same genus. Results: Here we present an integrated bioinformatics workflow that assures the reliability of transferred gene regulatory networks. Our approach combines three methods that can be applied on a large scale: re-assessment of annotated binding sites, subsequent binding site prediction, and homology detection. A gene regulatory interaction is considered to be conserved if (1) the transcription factor, (2) the adjusted binding site, and (3) the target gene are conserved. The power of the approach is demonstrated by transferring gene regulations from the model organism Corynebacterium glutamicum to the human pathogens C. diphtheriae, C. jeikeium, and the biotechnologically relevant C. efficiens. For these three organisms we identified reliable transcriptional regulations for approximately 40% of the common transcription factors, compared to approximately 5% for which knowledge was available before. Conclusion: Our results suggest that trustworthy genome-scale transfer of gene regulatory networks between organisms is feasible in general but still limited by the level of evolutionary conservation.
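
The three-part conservation criterion stated in this abstract can be written down as a simple filter. The sketch below only encodes the rule that all three elements must be conserved; it assumes the three conservation calls have already been made upstream (by homology detection and binding site prediction, which are not shown). The class, field and function names and the example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConservationEvidence:
    """Upstream conservation calls for one regulatory interaction (assumed given)."""
    tf_conserved: bool
    binding_site_conserved: bool
    target_gene_conserved: bool


def is_transferable(evidence):
    """An interaction is transferred only if all three elements are conserved."""
    return (
        evidence.tf_conserved
        and evidence.binding_site_conserved
        and evidence.target_gene_conserved
    )


def transfer_network(interactions):
    """Keep the (regulator, target) pairs whose evidence passes all three checks."""
    return {
        pair: evidence
        for pair, evidence in interactions.items()
        if is_transferable(evidence)
    }


# Hypothetical interactions keyed by made-up (regulator, target) identifiers.
candidate_interactions = {
    ("tfA", "gene1"): ConservationEvidence(True, True, True),
    ("tfA", "gene2"): ConservationEvidence(True, False, True),
    ("tfB", "gene3"): ConservationEvidence(False, True, True),
}

transferred = transfer_network(candidate_interactions)
print(sorted(transferred))
```

Requiring all three checks makes the filter conservative: an interaction is dropped as soon as any one element lacks support, which fits the abstract's emphasis on reliability over coverage.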