73 research outputs found

DETECTING CANCER-RELATED GENES AND GENE-GENE INTERACTIONS BY MACHINE LEARNING METHODS

Author: Han Bing
Publication venue: 'Paleontological Institute at The University of Kansas'
Publication date: 01/01/2011

To understand the underlying molecular mechanisms of cancer and therefore to improve pathogenesis, prevention, diagnosis and treatment of cancer, it is necessary to explore the activities of cancer-related genes and the interactions among these genes. In this dissertation, I use machine learning and computational methods to identify differential gene relations and detect gene-gene interactions. To identify gene pairs that have different relationships in normal versus cancer tissues, I develop an integrative method based on the bootstrapping K-S test to evaluate a large number of microarray datasets. The experimental results demonstrate that my method can find meaningful alterations in gene relations. For gene-gene interaction detection, I propose two Bayesian Network-based methods, DASSO-MB (Detection of ASSOciations using Markov Blanket) and EpiBN (Epistatic interaction detection using Bayesian Network model), to address the two critical challenges: searching and scoring. DASSO-MB is based on the concept of the Markov Blanket in Bayesian Networks. In EpiBN, I develop a new scoring function, which can reflect higher-order gene-gene interactions and detect the true number of disease markers, and apply a fast Branch-and-Bound (B&B) algorithm to learn the structure of the Bayesian Network. Both DASSO-MB and EpiBN outperform some other commonly used methods and are scalable to genome-wide data.

KU ScholarWorks
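
The abstract above does not spell out how the bootstrapping K-S test is applied, so the following Python sketch is only one plausible reading and not the dissertation's method. It assumes that a gene pair's relationship is summarised by the Pearson correlation, that samples within each tissue class are resampled with replacement, and that the two resulting correlation distributions are compared with the two-sample K-S statistic. All function names and the toy expression values are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def bootstrap_correlations(x, y, n_boot, rng):
    """Correlation of the gene pair in n_boot resamples (with replacement)."""
    n = len(x)
    result = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        result.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    return result


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def gene_pair_shift(normal_x, normal_y, cancer_x, cancer_y, n_boot=200, seed=0):
    """K-S distance between bootstrapped correlations in the two tissue classes."""
    rng = random.Random(seed)
    normal = bootstrap_correlations(normal_x, normal_y, n_boot, rng)
    cancer = bootstrap_correlations(cancer_x, cancer_y, n_boot, rng)
    return ks_statistic(normal, cancer)


# Toy data (hypothetical): the pair is positively related in "normal"
# samples and negatively related in "cancer" samples.
_rng = random.Random(42)
normal_x = [_rng.gauss(0, 1) for _ in range(30)]
normal_y = [v + _rng.gauss(0, 0.1) for v in normal_x]
cancer_x = [_rng.gauss(0, 1) for _ in range(30)]
cancer_y = [-v + _rng.gauss(0, 0.1) for v in cancer_x]
shift = gene_pair_shift(normal_x, normal_y, cancer_x, cancer_y)
print(f"K-S distance between bootstrapped correlations: {shift:.2f}")
```

A distance near 1 would suggest that the pair's relationship differs between the two classes, while a distance near 0 would suggest no change. How evidence is combined across many microarray datasets is not described in the abstract and is left out of this sketch.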

Exploring causal networks underlying fat deposition and muscularity in pigs through the integration of phenotypic, genotypic and transcriptomic data

Author: A Breslin, B Liu, Bruno D. Valente, C Ovilo, Catherine W. Ernst, CS Haley, DB Edwards, E Chaibub Neto, EE Schadt, Francisco Peñagaricano, GA Churchill, Guilherme JM Rosa, Hasan Khatib, HN Kadarmideen, I Tur, J Pearl, Juan P. Steibel, K Suzuki, L Varona, M Civelek, M Damon, M Scutari, P Affentranger, RC Jansen, RH Li, Ronald O. Bates, SM Lonergan, Y Benjamini, ZL Hu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

Author: Glaab Enrico
Publication date: 15/10/2011

Motivation: Rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models are optimized in addition to the predictive accuracy and robustness.
The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results: (1) the identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining); (2) the prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre); (3) the prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs); (4) the discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre). In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use software package and has provided new biological insights in a wide variety of practical settings.

Nottingham eTheses

Network-Based Biomarker Discovery : Development of Prognostic Biomarkers for Personalized Medicine by Integrating Data and Prior Knowledge

Author: Cun Yupeng
Publication venue: Universitäts- und Landesbibliothek Bonn

Advances in genome science and technology offer a deeper understanding of biology while at the same time improving the practice of medicine. The expression profiling of some diseases, such as cancer, allows for identifying marker genes, which may help to diagnose a disease or predict future disease outcomes. Marker genes (biomarkers) are selected by scoring how well their expression levels can discriminate between different classes of disease or between groups of patients with different clinical outcomes (e.g. therapy response, survival time, etc.). A current challenge is to identify new markers that are directly related to the underlying disease mechanism.

bonndoc – Der Publikationsserver der Universität Bonn

Bayesian networks for omics data analysis

Author: Gavai A.K.
Publication venue: S.n.
Publication date: 01/01/2009

This thesis focuses on two aspects of high-throughput technologies, i.e. data storage and data analysis, in particular in transcriptomics and metabolomics. Both technologies are part of a research field that is generally called ‘omics’ (or ‘-omics’, with a leading hyphen), which refers to genomics, transcriptomics, proteomics, or metabolomics. Although these techniques study different entities (genes, gene expression, proteins, or metabolites), they all have in common that they use high-throughput technologies such as microarrays and mass spectrometry, and thus generate huge amounts of data. Experiments conducted using these technologies allow one to compare different states of a living cell, for example a healthy cell versus a cancer cell or the effect of food on cell condition, and at different levels. The tools needed to apply omics technologies, in particular microarrays, are often manufactured by different vendors and require separate storage and analysis software for the data generated by them. Moreover, experiments conducted using different technologies cannot be analyzed simultaneously to answer a biological question. Chapter 3 presents MADMAX, our software system which supports storage and analysis of data from multiple microarray platforms. It consists of a vendor-independent database which is tightly coupled with vendor-specific analysis tools. Upcoming technologies like metabolomics, proteomics and high-throughput sequencing can easily be incorporated in this system. Once the data are stored in this system, one obviously wants to deduce a biologically relevant meaning from these data, and here statistical and machine learning techniques play a key role. The aim of such analysis is to search for relationships between entities of interest, such as genes, metabolites or proteins. One of the major goals of these techniques is to search for causal relationships rather than mere correlations.
It is often emphasized in the literature that "correlation is not causation" because people tend to jump to conclusions by making inferences about causal relationships when they actually only see correlations. Statistics are often good at finding these correlations; techniques called linear regression and analysis of variance form the core of applied multivariate statistics. However, these techniques cannot find causal relationships, nor are they able to incorporate prior knowledge of the biological domain. Graphical models, a machine learning technique, on the other hand do not suffer from these limitations. Graphical models, a combination of graph theory, statistics and information science, are one of the most exciting things happening today in the field of machine learning applied to biological problems (see chapter 2 for a general introduction). This thesis deals with a special type of graphical models known as probabilistic graphical models, belief networks or Bayesian networks. The advantage of Bayesian networks over classical statistical techniques is that they allow the incorporation of background knowledge from a biological domain, and that analysis of data is intuitive as it is represented in the form of graphs (nodes and edges). Standard statistical techniques are good at describing the data but are not able to find non-linear relations, whereas Bayesian networks allow future prediction and the discovery of non-linear relations. Moreover, Bayesian networks allow hierarchical representation of data, which makes them particularly useful for representing biological data, since most biological processes are hierarchical by nature. Once we have such a causal graph, made either by a computer program or constructed manually, we can predict the effects of a certain entity by manipulating the state of other entities, or make backward inferences from effects to causes.
Of course, if the graph is big, doing the necessary calculations can be very difficult and CPU-expensive, and in such cases approximate methods are used. Chapter 4 demonstrates the use of Bayesian networks to determine the metabolic state of feeding and fasting mice to determine the effect of a high fat diet on gene expression. This chapter also shows how selection of genes based on key biological processes generates more informative results than standard statistical tests. In chapter 5 the use of Bayesian networks is shown on the combination of gene expression data and clinical parameters, to determine the effect of smoking on gene expression and which genes are responsible for the DNA damage and the rise in plasma cotinine levels in the blood of a smoking population. This study was conducted at Maastricht University where 22 twin smokers were profiled. Chapter 6 presents the reconstruction of a key metabolic pathway which plays an important role in ripening of tomatoes, thus showing the versatility of the use of Bayesian networks in metabolomics data analysis. The general trend in research shows a flood of data emerging from sequencing and metabolomics experiments. This means that to perform data mining on these data one requires intelligent techniques that are computationally feasible and able to take the knowledge of experts into account to generate relevant results. Graphical models fit this paradigm well and we expect them to play a key role in mining the data generated from omics experiments.

Wageningen University & Research Publications
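
The abstract above mentions predicting effects from causes and making backward inferences from effects to causes. As a hedged illustration only, the Python sketch below does both on the smallest possible Bayesian network, a single edge cause -> effect with binary states. The function names and the probabilities are hypothetical and are not taken from the thesis.

```python
def effect_probability(p_cause, p_effect_given_cause, p_effect_given_no_cause):
    """Forward prediction: P(effect), summing over both states of the cause."""
    return (p_cause * p_effect_given_cause
            + (1 - p_cause) * p_effect_given_no_cause)


def posterior_cause_given_effect(p_cause, p_effect_given_cause,
                                 p_effect_given_no_cause):
    """Backward inference with Bayes' rule: P(cause | effect observed)."""
    joint_cause = p_cause * p_effect_given_cause
    evidence = effect_probability(p_cause, p_effect_given_cause,
                                  p_effect_given_no_cause)
    return joint_cause / evidence


# Hypothetical numbers: the cause is present in 30% of cases, and the
# effect appears in 80% of cases with the cause and 10% without it.
p_effect = effect_probability(0.3, 0.8, 0.1)
p_cause_after = posterior_cause_given_effect(0.3, 0.8, 0.1)
print(f"P(effect) = {p_effect:.2f}")
print(f"P(cause | effect) = {p_cause_after:.3f}")
```

With larger graphs the same sums range over many variables, which is where the computational cost noted in the abstract comes from.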

Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

Author: Glaab Enrico

Nottingham ePrints

Machine Learning Methods for Cancer Immunology

Author: Chlon Leon
Publication venue: University of Cambridge
Publication date: 02/11/2017

Tumours are highly heterogeneous collections of tissues characterised by a repertoire of heavily mutated and rapidly proliferating cells. Evading immune destruction is a fundamental hallmark of cancer, and elucidating the contextual basis of tumour-infiltrating leukocytes is pivotal for improving immunotherapy initiatives. However, progress in this domain is hindered by an incomplete characterisation of the regulatory mechanisms involved in cancer immunity. Addressing this challenge, this thesis is formulated around a fundamental line of inquiry: how do we quantitatively describe the immune system with respect to tumour heterogeneity? Describing the molecular interactions between cancer cells and the immune system is a fundamental goal of cancer immunology. The first part of this thesis describes a three-stage association study to address this challenge in pancreatic ductal adenocarcinoma (PDAC). Firstly, network-based approaches are used to characterise PDAC on the basis of transcription factor regulators of an oncogenic KRAS signature. Next, gene expression tools are used to resolve the leukocyte subset mixing proportions, stromal contamination, immune checkpoint expression and immune pathway dysregulation from the data. Finally, partial correlations are used to characterise immune features in terms of KRAS master regulator activity. The results are compared across two independent cohorts for consistency. Moving beyond associations, the second part of the dissertation introduces a causal modelling approach to infer directed interactions between signaling pathway activity and immune agency. This is achieved by anchoring the analysis on somatic genomic changes. In particular, copy number profiles, transcriptomic data, image data and a protein-protein interaction network are integrated using graphical modelling approaches to infer directed relationships. Generated models are compared between independent cohorts and orthogonal datasets to evaluate consistency. 
Finally, proposed mechanisms are cross-referenced against literature examples to test for legitimacy. In summary, this dissertation provides methodological contributions, at the levels of associative and causal inference, for inferring the contextual basis for tumour-specific immune agency. This PhD was supported by the Cancer Research UK and Engineering and Physical Sciences Research Council Imaging Centre in Cambridge and Manchester (C197/A16465).

Apollo (Cambridge)
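
The abstract above reports using partial correlations to relate immune features to regulator activity but gives no formula. As a rough sketch, the Python below computes the standard first-order partial correlation between two variables after controlling for a third. The variable names and the synthetic data are hypothetical, and the thesis may well condition on more than one variable.

```python
import random
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def partial_correlation(x, y, z):
    """Correlation of x and y with the linear effect of z removed."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


# Synthetic example (hypothetical): x and y are related only through
# a shared driver z, so the raw correlation is high while the partial
# correlation given z should be close to zero.
rng = random.Random(1)
z = [rng.gauss(0, 2) for _ in range(500)]
x = [v + rng.gauss(0, 1) for v in z]
y = [v + rng.gauss(0, 1) for v in z]
raw = pearson(x, y)
partial = partial_correlation(x, y, z)
print(f"raw correlation: {raw:.2f}, partial given z: {partial:.2f}")
```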

An Integrated, Module-based Biomarker Discovery Framework

Author: Huang Grace T.
Publication date: 09/01/2014

Identification of biomarkers that contribute to complex human disorders is a principal and challenging task in computational biology. Prognostic biomarkers are useful for risk assessment of disease progression and patient stratification. Since treatment plans often hinge on patient stratification, better disease subtyping has the potential to significantly improve survival for patients. Additionally, a thorough understanding of the roles of biomarkers in cancer pathways facilitates insights into complex disease formation, and provides potential druggable targets in the pathways. Many statistical methods have been applied toward biomarker discovery, often combining feature selection with classification methods. Traditional approaches are mainly concerned with statistical significance and fail to consider the clinical relevance of the selected biomarkers. Two additional problems impede meaningful biomarker discovery: gene multiplicity (several maximally predictive solutions exist) and instability (inconsistent gene sets from different experiments or cross-validation runs). Motivated by the need for a more biologically informed, stable biomarker discovery method, I introduce an integrated module-based biomarker discovery framework for analyzing high-throughput genomic disease data. The proposed framework addresses the aforementioned challenges in three components. First, a recursive spectral clustering algorithm specifically tailored toward high-dimensional, heterogeneous data (ReKS) is developed to partition genes into clusters that are treated as single entities for subsequent analysis. Next, the problems of gene multiplicity and instability are addressed through a group variable selection algorithm (T-ReCS) based on local causal discovery methods. Guided by the tree-like partition created from the clustering algorithm, this algorithm selects gene clusters that are predictive of a clinical outcome.
We demonstrate that the group feature selection method facilitates the discovery of biologically relevant genes through their association with a statistically predictive driver. Finally, we elucidate the biological relevance of the biomarkers by leveraging available prior information to identify regulatory relationships between genes and between clusters, and deliver the information in the form of a user-friendly web server, mirConnX.

D-Scholarship@Pitt

Converging models for transcriptome studies of human diseases : the case of oculopharyngeal muscular dystrophy

Author: Anvar S.A.
Publication date: 06/06/2012

This dissertation mainly focuses on interdisciplinary approaches for biomedical knowledge discovery. This required special efforts in developing systematic strategies to integrate various data sources and techniques, leading to improved discovery of mechanistic insights on human diseases. Chapter one examines whether combining various bioinformatics-based strategies can significantly improve the characterization of the OPMD mouse model. We discuss how this approach to knowledge discovery, on the basis of our extensive analysis, helped us to shed some light on how this model system relates to OPMD pathophysiology in humans. In Chapter two, we expand on this combinatory approach by conducting a cross-species data analysis. In this study, we have looked for common patterns that emerge by assessing the transcriptome data from three OPMD model systems and patients. This strategy led to unravelling the most prominent molecular pathway involved in OPMD pathology. The third chapter pursues a similar goal: identifying shared molecular and pathophysiological features between OPMD and the common process of skeletal muscle ageing. Engaging in a study that focused on the universality of biological processes, in the light of evolutionary mechanisms and common functional features, led to novel discoveries. This work helped us uncover remarkable insights on molecular mechanisms of ageing muscles and protein aggregation. Chapters four and five take a different route by tackling the field of computational biology. These chapters aim to extend network inference by providing novel strategies for the exploitation and integration of multiple data sources. We show that these developments allow more robust regulatory mechanisms to be identified while translations and predictions are made across very different datasets, platforms, and organisms.
Finally, the dissertation is concluded by providing an outlook on ways the field of systems biology can evolve in order to offer enhanced, diversified and robust strategies for knowledge discovery.

Leiden University Scholarly Publications

Graphical models for de novo and pathway-based network prediction over multi-modal high-throughput biological data

Author: Sedgewick Andrew
Publication date: 07/09/2016

It is now standard practice in the study of complex disease to perform many high-throughput -omic experiments (genome-wide SNP, copy number, mRNA and miRNA expression) on the same set of patient samples. These multi-modal data should allow researchers to form a more complete, systems-level picture of a sample, but this is only possible if they have a suitable model for integrating the data. Due to the variety of data modalities and possible combinations of data, general, flexible integration methods that will be widely applicable in many settings are desirable. In this dissertation I will present my work using graphical models for de novo structure learning of both undirected and directed sparse graphs over a mixture of Gaussian and categorical variables. Using synthetic and biological data I will show that these models are useful for both variable selection and inference. Selecting the regularization parameters is an important challenge for these models, so I will also cover stability-based methods for efficiently setting these parameters, and for controlling the false discovery rate of edge predictions. I will also show results from a biological application to data from metastatic melanoma patients where our methods identified a PARP1 splice site variant that is predictive of response to chemotherapy. Finally, I present work incorporating miRNA into a pathway-based graphical model called PARADIGM. This extension of the model allows us to study patient-specific changes in miRNA-induced silencing in cancer.

D-Scholarship@Pitt