Predicting genetic interactions in Caenorhabditis elegans using machine learning

Author: Missiuro Patrycja Vasilyev, 1976-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2010

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student submitted PDF version of thesis. Includes bibliographical references (p. 191-204).

The presented work develops a set of machine learning and other computational techniques to investigate and predict gene properties across a variety of biological datasets. In particular, our main goal is the discovery of genetic interactions based on sparse and incomplete information. In our development, we use gene data from two model organisms, Caenorhabditis elegans and Saccharomyces cerevisiae.

Our first method, information flow, uses circuit theory to evaluate the importance of a protein in an interactome. We find that proteins with high information flow (i-flow) scores mediate information exchange between functional modules. We also show that increasing information flow scores strongly correlate with the likelihood of observing lethality or pleiotropy as well as observing genetic interactions. Our metric significantly outperforms other established network metrics such as degree or betweenness.

Next, we show how Bayesian sets can be applied to gain intuition as to which datasets are the most relevant for predicting genetic interactions. In order to directly apply this method to microarray data, we extend Bayesian sets to handle continuous variables. Using Bayesian sets, we show that genetically interacting genes tend to share phenotypes but are not necessarily co-localized. Additionally, they have similar temporal expression profiles during development and aging.

One of the major difficulties in dealing with biological data is the problem of incomplete datasets. We describe a novel application of collaborative filtering (CF) to predict missing values in biological datasets. We adapt the factorization-based and the neighborhood-aware CF [13] to deal with a mixture of continuous and discrete entries. We use collaborative filtering to impute missing values, to assess how much information relevant to genetic interactions is present, and, finally, to predict genetic interactions. We also show how CF can reduce input dimensionality.

Our last development is the application of Support Vector Machines (SVM), an adapted machine learning classification method, to predicting genetic interactions. We find that SVM with a nonlinear radial basis function (RBF) kernel has greater predictive power than CF. Its performance, however, greatly benefits from using CF to fill in missing entries in the input data. We show that SVM performance further improves if we constrain the group of genes to a specific functional category.

Throughout this thesis, we emphasize the features of the studied datasets and explain our findings from a biological perspective. In this respect, we hope that this work possesses an independent biological significance. The final step would be to confirm our predictions experimentally. This would allow us to gain new insights into C. elegans biology: specific genes orchestrating developmental and regulatory pathways, response to stress, etc.

by Patrycja Vasilyev Missiuro. Ph.D.
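The abstract above mentions factorization-based collaborative filtering for filling in missing entries. The following is a minimal sketch of that general idea, not the thesis's actual implementation: it fits a low-rank factorization to the observed entries by stochastic gradient descent and uses it to predict the missing ones. The function name `impute_by_factorization`, its parameters, and the toy matrix are hypothetical, and every entry is treated as continuous, whereas the thesis adapts CF to a mixture of continuous and discrete entries.

```python
import random


def impute_by_factorization(matrix, rank=2, epochs=500, lr=0.01, reg=0.02, seed=0):
    """Fill None entries of a gene-by-feature matrix with low-rank predictions.

    Illustrative sketch only: all hyperparameters are assumptions.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Small random initial factors: one latent vector per row and per column.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    observed = [(i, j, matrix[i][j])
                for i in range(n_rows) for j in range(n_cols)
                if matrix[i][j] is not None]

    for _ in range(epochs):
        rng.shuffle(observed)
        for i, j, value in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = value - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)

    # Keep observed values as they are; predict only the missing ones.
    return [[matrix[i][j] if matrix[i][j] is not None
             else sum(P[i][k] * Q[j][k] for k in range(rank))
             for j in range(n_cols)]
            for i in range(n_rows)]


# Toy data (made up for illustration): rows are genes, columns are features.
toy = [
    [1.0, 0.9, None],
    [0.8, None, 0.2],
    [None, 0.1, 0.9],
]
filled = impute_by_factorization(toy)
```

The latent vectors in `P` also give a lower-dimensional representation of each row, which is one way a factorization of this kind can reduce input dimensionality before a downstream classifier is applied.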

DSpace@MIT

Information Flow Analysis of Interactome Networks

Author: A Warner
AC Gavin
AHY Tong
AL Barabasi
AM Dudley
B Lehner
Brian C. Ross
C Stark
C von Mering
D Dupuy
DG Moerman
Donna Slonim
DS Fay
DS Goldberg
EW Dijkstra
F Simmer
G Zhao
GF Berriz
Guoyan Zhao
H Ge
H Jeong
H Yu
HR Horvitz
Hui Ge
J Ahringer
J Ptacek
JD Han
JF Rual
JL Hartman IV
JS Bader
Jun S. Liu
KC Gunsalus
Kesheng Liu
L Giot
L Zou
LC Freeman
Lihua Zou
LR Baugh
M Girvan
MEJ Newman
MP Joy
MW Hahn
N Simonis
NJ Krogan
NN Batada
Patrycja Vasilyev Missiuro
PG Doyle
PR Towers
RS Kamath
S Li
S Suthram
SJ McKay
SK Kim
T Beissbarth
U Stelzl
Y Qi
Publication venue: Public Library of Science (PLoS)
Publication date: 01/09/2008

Recent studies of cellular networks have revealed modular organizations of genes and proteins. For example, in interactome networks, a module refers to a group of interacting proteins that form molecular complexes and/or biochemical pathways and together mediate a biological process. However, it is still poorly understood how biological information is transmitted between different modules. We have developed information flow analysis, a new computational approach that identifies proteins central to the transmission of biological information throughout the network. In the information flow analysis, we represent an interactome network as an electrical circuit, where interactions are modeled as resistors and proteins as interconnecting junctions. Construing the propagation of biological signals as flow of electrical current, our method calculates an information flow score for every protein. Unlike previous metrics of network centrality such as degree or betweenness that only consider topological features, our approach incorporates confidence scores of protein–protein interactions and automatically considers all possible paths in a network when evaluating the importance of each protein. We apply our method to the interactome networks of Saccharomyces cerevisiae and Caenorhabditis elegans. We find that the likelihood of observing lethality and pleiotropy when a protein is eliminated is positively correlated with the protein's information flow score. Even among proteins of low degree or low betweenness, high information flow scores serve as a strong predictor of loss-of-function lethality or pleiotropy. The correlation between information flow scores and phenotypes supports our hypothesis that the proteins of high information flow reside in central positions in interactome networks. We also show that the ranks of information flow scores are more consistent than those of betweenness when a large amount of noisy data is added to an interactome.
Finally, we combine gene expression data with interaction data in C. elegans and construct an interactome network for muscle-specific genes. We find that genes that rank high in terms of information flow in the muscle interactome network but not in the entire network tend to play important roles in muscle function. This framework for studying tissue-specific networks by the information flow model can be applied to other tissues and other organisms as well.
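The circuit analogy in the abstract above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' published procedure: each interaction is a resistor whose conductance equals its confidence score, a unit current is injected at one protein and withdrawn at another, and a protein's score is the current passing through it, summed over all source/sink pairs. The function names, the choice to count only intermediate proteins (not the source or sink themselves), and the requirement that the network be connected are assumptions of this sketch.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def information_flow(edges):
    """Score each node by the current it carries, summed over all node pairs.

    edges: list of (protein_a, protein_b, confidence); confidence is used
    as conductance. Assumes a connected network (hypothetical input format).
    """
    nodes = sorted({u for u, _, _ in edges} | {v for _, v, _ in edges})
    idx = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # Weighted graph Laplacian: degree on the diagonal, -conductance elsewhere.
    L = [[0.0] * n for _ in range(n)]
    for u, v, w in edges:
        i, j = idx[u], idx[v]
        L[i][i] += w
        L[j][j] += w
        L[i][j] -= w
        L[j][i] -= w

    scores = [0.0] * n
    for s in range(n):
        for t in range(s + 1, n):
            # Ground the sink t (voltage 0) and inject unit current at s.
            keep = [k for k in range(n) if k != t]
            A = [[L[i][j] for j in keep] for i in keep]
            b = [1.0 if i == s else 0.0 for i in keep]
            volt = [0.0] * n
            for k, value in zip(keep, solve(A, b)):
                volt[k] = value
            for k in range(n):
                if k in (s, t):
                    continue  # assumption: endpoints are not scored
                # Current through k is half the sum of absolute edge currents.
                scores[k] += 0.5 * sum(
                    -L[k][j] * abs(volt[k] - volt[j])
                    for j in range(n) if j != k
                )
    return dict(zip(nodes, scores))


# Toy networks (made up): a three-protein chain and a triangle.
chain = information_flow([("a", "b", 1.0), ("b", "c", 1.0)])
triangle = information_flow([("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0)])
```

In the chain, only the middle protein carries current between the other two, so it alone receives a nonzero score. In the triangle, current between any two proteins splits between the direct interaction and the detour through the third, which shows how a circuit model accounts for all paths rather than only the shortest one.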

Public Library of Science (PLOS)

CiteSeerX

DSpace@MIT

Crossref

Harvard University - DASH

Directory of Open Access Journals

PubMed Central