57 research outputs found

    Exploring hierarchical and overlapping modular structure in the yeast protein interaction network

    Background: Developing effective strategies to reveal modular structures in protein interaction networks is crucial for a better understanding of the molecular mechanisms underlying biological processes. In this paper, we propose a new density-based algorithm (ADHOC) for clustering vertices of a protein interaction network using a novel subgraph density measurement. Results: By statistically evaluating several independent criteria, we found that ADHOC significantly improved the outcome compared with five previously reported density-dependent methods. We further applied ADHOC to investigate the hierarchical and overlapping modular structure in the yeast PPI network. Our method effectively detected both protein modules and the overlaps between them, and thus greatly promotes the precise prediction of protein functions. Moreover, by further assaying the intermodule layer of the yeast PPI network, we classified hubs into two types, module hubs and inter-module hubs. Each type presents distinct characteristics in both network topology and biological function, which could contribute to a better understanding of the relationship between network architecture and biological implications. Conclusions: Our proposed algorithm, based on the novel subgraph density measurement, makes it possible to detect hierarchical and overlapping modular structures in protein interaction networks more precisely. In addition, our method is strongly robust against noise in the network, which is critical for analyzing such a high-noise network.

    Identifying protein complexes and disease genes from biomolecular networks

    With advances in high-throughput measurement techniques, large-scale biological data, such as protein-protein interaction (PPI) data, gene expression data, gene-disease association data, cellular pathway data, and so on, have been and will continue to be produced. Those data contain insightful information for understanding the mechanisms of biological systems and have been proved useful for developing new methods in disease diagnosis, disease treatment and drug design. This study focuses on two main research topics: (1) identifying protein complexes and (2) identifying disease genes from biomolecular networks. Firstly, protein complexes are groups of proteins that interact with each other at the same time and place within living cells. They are molecular entities that carry out cellular processes. The identification of protein complexes plays a primary role for understanding the organization of proteins and the mechanisms of biological systems. Many previous algorithms are designed based on the assumption that protein complexes are densely connected sub-graphs in PPI networks. In this research, a dense sub-graph detection algorithm is first developed following this assumption by using clique seeds and graph entropy. Although the proposed algorithm generates a large number of reasonable predictions and its f-score is better than many previous algorithms, it still cannot identify many known protein complexes. After that, we analyze characteristics of known yeast protein complexes and find that not all of the complexes exhibit dense structures in PPI networks. Many of them have a star-like structure, which is a very special case of the core-attachment structure and it cannot be identified by many previous core-attachment-structure-based algorithms. To increase the prediction accuracy of protein complex identification, a multiple-topological-structure-based algorithm is proposed to identify protein complexes from PPI networks. 
Four single-topological-structure-based algorithms are first employed to detect raw predictions with clique, dense, core-attachment and star-like structures, respectively. A merging and trimming step is then adopted to generate final predictions based on topological information or GO annotations of predictions. A comprehensive review about the identification of protein complexes from static PPI networks to dynamic PPI networks is also given in this study. Secondly, genetic diseases often involve the dysfunction of multiple genes. Various types of evidence have shown that similar disease genes tend to lie close to one another in various biomolecular networks. The identification of disease genes via multiple data integration is indispensable towards the understanding of the genetic mechanisms of many genetic diseases. However, the number of known disease genes related to similar genetic diseases is often small. It is not easy to capture the intricate gene-disease associations from such a small number of known samples. Moreover, different kinds of biological data are heterogeneous and no widely acceptable criterion is available to standardize them to the same scale. In this study, a flexible and reliable multiple data integration algorithm is first proposed to identify disease genes based on the theory of Markov random fields (MRF) and the method of Bayesian analysis. A novel global-characteristic-based parameter estimation method and an improved Gibbs sampling strategy are introduced, such that the proposed algorithm has the capability to tune parameters of different data sources automatically. However, the Markovianity characteristic of the proposed algorithm means it only considers information of direct neighbors to formulate the relationship among genes, ignoring the contribution of indirect neighbors in biomolecular networks. 
To overcome this drawback, a kernel-based MRF algorithm is further proposed to take advantage of the global characteristics of biological data via graph kernels. The kernel-based MRF algorithm generates better predictions than many previous disease gene identification algorithms in terms of the area under the receiver operating characteristic curve (AUC score). However, it is very time-consuming, since the Gibbs sampling process of the algorithm has to maintain a long Markov chain for every single gene. Finally, to reduce the computational time of the MRF-based algorithm, a fast, high-performance logistic-regression-based algorithm is developed for identifying disease genes from biomolecular networks. Numerical experiments show that the proposed algorithm outperforms many existing methods in terms of the AUC score and running time. To summarize, this study has developed several computational algorithms for identifying protein complexes and disease genes from biomolecular networks. These proposed algorithms perform better than many other existing algorithms in the literature.

    Beyond hairballs: depicting complexity of a kinase-phosphatase network in the budding yeast

    Kinases and phosphatases (KP) form the largest family of enzymes in living cells. 
They regulate each other and 60% of the proteome, forming complex kinase-phosphatase networks (KP-Net) essential for cell signaling. Such networks, having a command-execution aspect, tend to have a hierarchical structure. Despite extensive study of the KP-Net in budding yeast, the hierarchical structure and the functional principles of this network are still not known. In this context, this thesis aims to perform an integrative analysis of multi-omics data with the hierarchical structure of a bona fide KP-Net in the budding yeast Saccharomyces cerevisiae, in order to generate hypotheses about the functional principles of each layer in the KP-Net hierarchy. Based on a literature curation effort accomplished in this and in other studies, the largest bona fide KP-Net of S. cerevisiae known to date was assembled in this thesis. By assessing the hierarchical level of the KP-Net using the global reaching centrality and by elucidating its hierarchical structure using the vertex-sort (VS) algorithm, we found that the KP-Net has a moderate hierarchical structure made of three disjoint layers (top, core and bottom) resembling a bow tie shape. The top layer, which is large, was found enriched for signaling regulation; the core layer, made of few strongly connected KPs, was found enriched mostly for cell cycle regulation; and the bottom layer, also large, was found enriched for diverse biological processes. On overlaying a wide range of KP biological properties on the KP-Net hierarchical structure, the top layer was found enriched for phosphatases and the bottom layer depleted of them, suggesting that phosphatases are less regulated by phosphorylation and dephosphorylation interactions (PDI) than kinases. 
Moreover, the core layer was found enriched for KPs representing bottlenecks, pathway-shared components, essential genes and for the most tightly regulated KPs in time and space, implying that KPs playing an essential role in the KP-Net should be firmly controlled. Interestingly, KP proteins in the top layer were found to be more abundant and less noisy than those of the bottom layer, suggesting that the availability of enzymes at an invariable protein expression level at the top of the network might be important to ensure robust signaling. Analysis of the VS algorithm showed that node degrees affect their classification in the different layers of a network hierarchical structure without biasing biological results of the sorted network. Robustness analysis of the KP-Net showed that KP-Net layers are moderately stable in noisy networks generated by adding edges to the KP-Net. However, layers of these noisy networks overlap significantly with those of the KP-Net. Moreover, topological and biological properties of the KP-Net were retained in the noisy networks to different levels. These findings indicate that despite the observed partial robustness of our results, they mostly represent our current knowledge about KP-Nets. Finally, enhancement of techniques dedicated to identifying KP substrates will improve our understanding of how KP-Nets function. As an example, I describe here a strategy that we devised to help in determining KP-substrate interactions and the regulatory subunits on which these interactions depend. The strategy is based on a protein-fragment complementation assay based on the optimized yeast cytosine deaminase (OyCD PCA). The OyCD PCA represents a large-scale in vivo screen that promises a substantial improvement in delineating the complex KP-Nets. We applied the strategy to determine substrates of the cyclin-dependent kinase 1 (Cdk1; also called Cdc28) and cyclins implicated in phosphorylation of these substrates by Cdk1 in S. cerevisiae. 
The OyCD PCA showed a wide compensatory behavior of cyclins for most of the substrates and the phosphorylation of γ-tubulin specifically by Clb3-Cdk1, thus establishing the timing of the latter event in controlling assembly of the mitotic spindle.
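
The abstract above names the global reaching centrality as its measure of how hierarchical a directed network is. As a rough illustration only, the sketch below computes the unweighted form of that measure (the local reaching centrality of a node is the fraction of other nodes it can reach; the global value averages each node's shortfall from the maximum). The function names and the toy graphs are hypothetical and are not taken from the thesis.

```python
from collections import deque


def reachable_count(adj, source):
    """Count nodes reachable from `source` along directed edges (source excluded)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) - 1


def global_reaching_centrality(adj):
    """Unweighted global reaching centrality of a directed graph.

    `adj` maps each node to an iterable of the nodes it points to.
    Returns a value between 0 (no hierarchy, e.g. a cycle) and 1 (a pure out-star).
    """
    nodes = set(adj)
    for targets in adj.values():
        nodes.update(targets)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Local reaching centrality: share of the other n - 1 nodes each node can reach.
    local = [reachable_count(adj, v) / (n - 1) for v in nodes]
    c_max = max(local)
    return sum(c_max - c for c in local) / (n - 1)


# Hypothetical toy networks (edges read "regulator -> target").
star = {"K1": ["K2", "K3", "K4"]}
cycle = {"K1": ["K2"], "K2": ["K3"], "K3": ["K1"]}
chain = {"K1": ["K2"], "K2": ["K3"]}
```

A value near 1 indicates a strongly top-down network and a value near 0 a flat one; the "moderate" hierarchy reported in the abstract would fall between these extremes.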

    Discovering meaning from biological sequences: focus on predicting misannotated proteins, binding patterns, and G4-quadruplex secondary

    Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters, and molecular machines in cells. Experimental determination of protein function is expensive in time and resources compared to computational methods. Hence, assigning protein function, predicting protein binding patterns, and understanding protein regulation are important problems in functional genomics and key challenges in bioinformatics. This dissertation comprises three studies. In the first two papers, we apply machine-learning methods to (1) identify misannotated sequences and (2) predict the binding patterns of proteins. The third paper is (3) a genome-wide analysis of G4-quadruplex sequences in the maize genome. The first two papers are based on two-stage classification methods. The first stage uses machine-learning approaches that combine composition-based and sequence-based features. We use either decision trees (HDTree) or support vector machines (SVM) as second-stage classifiers and show that classification performance matches or exceeds that of more computationally expensive approaches. For study (1), our method identified potentially misannotated sequences within a well-characterized set of proteins in a popular bioinformatics database. We identified misannotated proteins and show that these proteins have contradictory AmiGO and UniProt annotations. For study (2), we developed a three-phase approach: Phase I classifies whether a protein binds with another protein. Phase II determines whether a protein-binding protein is a hub. Phase III classifies hub proteins based on the number of binding sites and the number of concurrent binding partners. For study (3), we carried out a computational genome-wide screen to identify non-telomeric G4-quadruplex (G4Q) elements in maize to explore their potential role in gene regulation for flowering plants. 
Analysis of G4Q-containing genes uncovered a striking tendency for their enrichment in genes of networks and pathways associated with electron transport, sugar degradation, and hypoxia responsiveness. The maize G4Q elements may play a previously unrecognized role in coordinating global regulation of gene expression in response to hypoxia to control carbohydrate metabolism for anaerobic metabolism. We demonstrated that our three studies can predict and provide new insights in classifying misannotated proteins, understanding protein binding patterns, and identifying a potentially new model for gene regulation.

    Mining real-world networks in systems biology and economics

    Recent advances in biotechnology have yielded an explosion of data describing biological systems, creating rich opportunities for new insights into cellular inner-workings and therapeutic discoveries. To keep up with this rapid growth and increase in data complexity, we need novel static, integrative, and dynamic methodologies to continue mining these networked systems. In this thesis we introduce new static, integrative, and dynamic computational frameworks for network analysis, and combine existing ones in new ways, to elucidate the biotechnological biases and functional principles governing molecular interactions and their implications in disease. We focus on mining new knowledge from the yeast and human interactomes, since these are currently the most complete data in biology. We perform three lines of experimental work: 1) the macro-scale study, where we model the yeast and human interactomes and show that their interactome data are growing in structurally and functionally principled ways, characterised by a non-random dual topological nature; 2) the micro-scale study, where we zoom into the specifics of wiring patterns around individual genes and uncover a unique core sub-structure within the human interactome, which contains driver genes dubbed to be the main triggers for disease onset; and 3) the data integration study, where we introduce a new computational framework for fusing multiple types of molecular interaction data and use it to construct the first unified model of the cell’s functional organisation and cross-communication lines. Similarly, a new field of systems economics has gained recent attention, with more financial and economic network data emerging at an increasing pace. 
Hence, we introduce a new computational methodology for tracking network dynamics and use it to quantify the micro- and macro-scale topological changes in the world trade network over the past 50 years, and to demonstrate the fundamental relationship between topological perturbations and indicators of countries’ political and economic stabilities.

    Improving biomarker list stability by integration of biological knowledge in the learning process

    BACKGROUND: The identification of robust lists of molecular biomarkers related to a disease is a fundamental step for early diagnosis and treatment. However, methodologies for biomarker discovery using microarray data often provide results with limited overlap. It has been suggested that one reason for these inconsistencies may be that in complex diseases, such as cancer, multiple genes belonging to one or more physiological pathways are associated with the outcomes. Thus, a possible approach to improve list stability is to integrate biological information from genomic databases in the learning process; however, a comprehensive assessment based on different types of biological information is still lacking in the literature. In this work we have compared the effect of using different kinds of biological information in the learning process, such as functional annotations, protein-protein interactions and expression correlation among genes. RESULTS: Biological knowledge has been codified by means of gene similarity matrices, and expression data have been linearly transformed in such a way that the more similar two features are, the more closely they are mapped. Two semantic similarity matrices, based on Biological Process and Molecular Function Gene Ontology annotation, and geodesic distance applied on protein-protein interaction networks, are the best performers in improving list stability while maintaining almost equal prediction accuracy. CONCLUSIONS: The performed analysis supports the idea that when some features are strongly correlated to each other, for example because they are close in the protein-protein interaction network, then they might have similar importance and be equally relevant for the task at hand. The results obtained can be a starting point for additional experiments on combining similarity matrices in order to obtain even more stable lists of biomarkers. The implementation of the classification algorithm is available at the link: http://www.math.unipd.it/~dasan/biomarkers.html
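
The abstract above uses geodesic (shortest-path) distance on a protein-protein interaction network as one source of gene similarity. The sketch below shows one way such distances could be computed with breadth-first search on an undirected, unweighted graph. It is not the authors' implementation: the protein labels are hypothetical, and the mapping from distance to similarity, 1 / (1 + d), is an assumption chosen only for illustration.

```python
from collections import deque


def geodesic_distances(edges):
    """All-pairs shortest-path lengths in an undirected, unweighted graph.

    `edges` is an iterable of (u, v) pairs. Returns {source: {target: hops}};
    unreachable targets are simply absent from the inner dictionary.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    dist = {}
    for source in adj:
        hops = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in hops:
                    hops[nxt] = hops[node] + 1
                    queue.append(nxt)
        dist[source] = hops
    return dist


def similarity(dist, a, b):
    """Assumed similarity 1 / (1 + d); 0.0 when the two proteins are disconnected."""
    d = dist[a].get(b)
    return 0.0 if d is None else 1.0 / (1.0 + d)


# Hypothetical toy interaction network with two components.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
toy_dist = geodesic_distances(toy_edges)
```

Under this assumed mapping, directly interacting proteins receive similarity 0.5, and the value decays with each additional hop, so that proteins close in the network are treated as nearly interchangeable features.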

    Searching for novel gene functions in yeast : identification of thousands of novel molecular interactions by protein-fragment complementation assay followed by automated gene function prediction and high-throughput lipidomics

    Understanding complex biological processes requires sophisticated experimental and computational approaches. 
Advances in functional genomics strategies provide powerful tools for collecting diverse types of information on the interconnectivity of genes, proteins and small molecules for studying the organizational principles of cellular networks. Integration of that knowledge into a systems biology framework enables prediction of novel functions of uncharacterized genes. To perform such predictions on a genome-wide scale in the yeast Saccharomyces cerevisiae, we have developed a novel strategy that combines a high-throughput interactomics screen for protein-protein interactions, in silico gene function prediction, and validation of predictions with high-throughput lipidomics. We started by performing a large-scale screen for protein-protein interactions using a protein-fragment complementation assay. The method allowed interactions to be monitored in vivo between proteins expressed from their natural promoters. Furthermore, the method did not suffer from bias against membrane interactions compared with established genome-wide techniques for detecting protein interactions. As a result, we detected many novel interactions and increased coverage of an interactome of lipid homeostasis that has not yet been comprehensively explored. Next, we applied a machine learning algorithm to identify eight previously uncharacterized genes with a potential role in lipid metabolism. Finally, we investigated whether these genes and a set of distinct transcriptional regulators, not previously implicated with lipids, have a role in lipid homeostasis. For that purpose, we analyzed the lipidomes of deletion mutants of the selected genes. In order to probe a large number of strains, we have developed a high-throughput platform for high-content lipidomic screening of yeast mutant libraries that consists of high-resolution Orbitrap mass spectrometry and a dedicated data processing framework to support lipid phenotyping across hundreds of Saccharomyces cerevisiae mutants. 
Lipidomics experiments confirmed functional predictions by demonstrating differences in the lipid metabolic phenotypes of deletion mutants lacking the YBR141C and YJR015W genes, which were predicted to be involved in lipid metabolism. An altered lipid phenotype was also observed for a deletion mutant of the transcription factor KAR4, which has not previously been linked with lipid metabolism. These results demonstrate that a workflow that integrates the acquisition of novel molecular interactions, computational gene function prediction and a novel high-throughput shotgun lipidomics platform is a valuable contribution to the arsenal of methods for systems biology. The developments of functional genomic methods and lipidomics technologies provide means to study biological networks of higher eukaryotes, including mammals. Therefore, the presented workflow has the potential to be applied in more complex organisms.

    Developing an integrated system for biological network exploration

    Network analysis and visualization have been used in systems biology to extract biological insight from complex datasets. Many existing network analysis tools either focus on visualization but have limited scalability, or focus on analysis but have limited visualizations. The separation of analyzing the raw data from visualizing the analysis results causes systems biologists to jump between forming a question, building a massive network, identifying a subnetwork for visualization, and using the visualization as feedback and inspiration for the next question. This iterative process can take several days, making it difficult for researchers to maintain the mental map of the questions queried. In addition, biological data is stored in different formats and has differing annotations, thus systems biologists often run into hurdles when merging large or heterogeneous networks. The polymorphic nature of the datasets presents a challenge for researchers to integrate data to answer biological questions. A more systematic method for merging networks, resolving data conflicts, and analyzing networks may improve the efficiency and scalability of heterogeneous multi-network analysis. Towards improving and pushing forward multi-network analysis to help a researcher easily combine multiple heterogeneous biological data networks to answer biological questions, this dissertation reports several accomplishments that provide (i) a set of standard multi-network operations, (ii) standard merging rules for heterogeneous networks, (iii) standard methods to reproduce network analyses, (iv) a single integrated software environment that allows users to visualize and explore the network analysis results and (v) several examples applying these methods in biological analysis. These efforts have culminated in three academic publications

    Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process

    The analysis of high-throughput sequencing, microarray and mass spectrometry data has proved extremely helpful for identifying the genes and proteins, called biomarkers, that are useful for answering both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both to understand the biological mechanisms underlying diseases and to gain sufficient reliability for clinical/pharmaceutical applications. Recently, several studies have shown that lists of identified biomarkers are poorly reproducible, so the validation of biomarkers as robust predictors of a disease remains an open issue. These differences are attributable both to data dimensionality (few subjects relative to the number of features) and to the heterogeneity of complex diseases, which are characterized by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically, in an experimental design, the data to be analyzed come from different subjects and different phenotypes (e.g. normal and pathological). The most widely used methodologies for identifying significant disease-related genes from microarray data compute differential gene expression between phenotypes using univariate statistical tests. Such an approach provides information on the effect of specific genes as independent features, whereas it is now recognized that the interplay among weakly up/down-regulated genes, although not significantly differentially expressed, might be extremely important for characterizing a disease status. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and can therefore select a more complete set of experimentally relevant features. 
In this context, supervised classification methods are often used to select biomarkers, and different methods, such as discriminant analysis, random forests and support vector machines among others, have been used, especially in cancer studies. Although classification approaches often achieve high accuracy, the reproducibility of biomarker lists remains an open issue: many possible sets of biological features (i.e. genes or proteins) can be considered equally relevant in terms of prediction, so a lack of stability is in principle possible even when the best accuracy is achieved. This thesis studies several computational aspects of biomarker discovery in genomic studies, from the classification and feature selection strategies to the type and reliability of the biological information used, and proposes new approaches able to cope with the problem of the reproducibility of biomarker lists. The study has highlighted that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker list stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches to improve biomarker list stability by using prior information related to biological interplay and functional correlation among the analyzed features. Both approaches were able to improve biomarker selection. The first approach, which uses prior information to divide the application of the method into different subproblems, improves the interpretability of results and offers an alternative way to assess list reproducibility. The second, which integrates prior information in the kernel function of the learning algorithm, improves list stability. 
Finally, the interpretability of results is strongly affected by the quality of the available biological information. The analysis of heterogeneities in the Gene Ontology database has revealed the importance of providing new methods able to verify the reliability of the biological properties assigned to a specific feature, discriminating missing or less specific information from possible inconsistencies among the annotations. These aspects will be investigated in greater depth in the future, as new sequencing technologies will monitor an increasing number of features and the number of functional annotations in genomic databases will grow considerably in the coming years.
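The abstract does not say how the stability of biomarker lists is measured. One common and simple choice, used here purely as an assumption, is the mean pairwise Jaccard similarity between the lists selected on different resamplings of the data. The function names `jaccard` and `list_stability` are hypothetical, and the sketch is not the thesis's own measure.

```python
from itertools import combinations
from typing import Iterable, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def list_stability(lists: Iterable[Iterable[str]]) -> float:
    """Mean pairwise Jaccard similarity over all pairs of biomarker lists.

    Returns 1.0 when every run selects the same features and 0.0 when no
    two runs share a feature. This is an assumed stability measure.
    """
    sets = [set(items) for items in lists]
    if len(sets) < 2:
        raise ValueError("at least two biomarker lists are needed")
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

In practice each list would come from running the same feature selection method on a different bootstrap or cross-validation split. Two methods with comparable classification accuracy can then be compared by this score, which is the situation the abstract describes.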