    Detection of regulator genes and eQTLs in gene networks

    Genetic differences between individuals associated with quantitative phenotypic traits, including disease states, are usually found in non-coding genomic regions. These genetic variants are often also associated with differences in expression levels of nearby genes (they are "expression quantitative trait loci" or eQTLs for short) and presumably play a gene regulatory role, affecting the status of molecular networks of interacting genes, proteins and metabolites. Computational systems biology approaches to reconstruct causal gene networks from large-scale omics data have therefore become essential to understand the structure of networks controlled by eQTLs together with other regulatory genes, and to generate detailed hypotheses about the molecular mechanisms that lead from genotype to phenotype. Here we review the main analytical methods and software to identify eQTLs and their associated genes, to reconstruct co-expression networks and modules, to reconstruct causal Bayesian gene and module networks, and to validate predicted networks in silico.
    Comment: minor revision with typos corrected; review article; 24 pages, 2 figures
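
As an illustration of what an eQTL association means in practice, here is a minimal sketch that regresses the expression of one gene on the genotype dosage at one variant. It is not taken from the review: the function name `eqtl_association`, the additive 0/1/2 dosage coding, and the toy numbers are all assumptions made for this example, and real analyses would add covariates, significance testing and multiple-testing correction.

```python
from math import sqrt


def eqtl_association(genotypes, expression):
    """Simple linear regression of expression on genotype dosage.

    genotypes: allele dosages per individual (assumed coded 0, 1 or 2).
    expression: expression level of one gene in the same individuals.
    Returns (slope, pearson_r). Hypothetical helper, for illustration only.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 individuals")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    # Sums of squares and cross-products around the means.
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    see = sum((e - mean_e) ** 2 for e in expression)
    sge = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sgg == 0 or see == 0:
        raise ValueError("genotype or expression has zero variance")
    slope = sge / sgg             # expression change per extra allele copy
    r = sge / sqrt(sgg * see)     # strength of the linear association
    return slope, r


# Toy data, made up for this sketch: six individuals, one variant, one gene.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
slope, r = eqtl_association(genotypes, expression)
print(f"slope per allele: {slope:.3f}, Pearson r: {r:.3f}")
```

A variant with a clearly non-zero slope for a nearby gene would be a candidate eQTL for that gene under this simplified model.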

    Statistical inference from large-scale genomic data

    This thesis explores the potential of statistical inference methodologies applied to functional genomics. In essence, it summarises algorithmic findings in this field, providing step-by-step analytical methodologies for deciphering biological knowledge from large-scale genomic data, mainly microarray gene expression time series. This thesis covers a range of topics in the investigation of complex multivariate genomic data. One focus involves using clustering as a method of inference and another is cluster validation to extract meaningful biological information from the data. Information gained from the application of these various techniques can then be used conjointly in the elucidation of gene regulatory networks, the ultimate goal of this type of analysis. First, a new tight clustering method for gene expression data is proposed to obtain tighter and potentially more informative gene clusters. Next, to fully utilise biological knowledge in clustering validation, a validity index is defined based on one of the most important ontologies within the bioinformatics community, Gene Ontology. The method bridges a gap in current literature, in the sense that it takes into account not only the variations of Gene Ontology categories in biological specificities and their significance to the gene clusters, but also the complex structure of the Gene Ontology. Finally, Bayesian probability is applied to making inference from heterogeneous genomic data, integrated with previous efforts in this thesis, with the aim of large-scale gene network inference. The proposed system comes with a stochastic process to achieve robustness to noise, yet remains efficient enough for large-scale analysis. Ultimately, the solutions presented in this thesis serve as building blocks of an intelligent system for interpreting large-scale genomic data and understanding the functional organisation of the genome.

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    Multi-omics data integration for the detection and characterization of smoking related lung diseases

    Lung cancer is the leading cause of death from cancer in the world. First, we hypothesized that microRNA expression is altered in the bronchial epithelium of patients with lung cancer and that incorporating microRNA expression into an existing mRNA biomarker may improve its performance. Using bronchial brushings collected from current and former smokers, we profiled microRNA expression via small RNA sequencing for 347 patients with available mRNA data. We found that four microRNAs were under-expressed in cancer patients compared to controls (p<0.002, FDR<0.2). We explored the role of these microRNAs and their gene targets in cancer. In addition, we found that adding a microRNA feature to an existing 23-gene biomarker significantly improves its performance (AUC) in a test set (p<0.05). Next, we generalized the biomarker discovery process, and developed a visualization tool for biomarker selection. We built upon an existing biomarker discovery pipeline and created a web-based interface to visualize the performance of multiple predictors. The “visualization” component is the key to sorting through a thousand potential biomarkers, and developing clinically useful molecular predictors. Finally, we explored the molecular events leading to the development of COPD and ILD, two heterogeneous diseases with high mortality. We hypothesized that integrative genetic and expression networks can help identify drivers and elucidate mechanisms of genetic susceptibility. We utilized 262 lung tissue specimens profiled with microRNA sequencing, microarray gene expression and SNP chip genotyping. Next, we built condition-specific integrative networks using a causality inference test for predicting SNP-microRNA-mRNA associations, where the microRNA is a predicted mediator of the SNP’s effect on gene expression. We identified the microRNAs predicted to affect the most genes within each network.
Members of the miR-34/449 family, known to promote airway differentiation by repressing the Notch pathway, were among the top ranked microRNAs in COPD and ILD networks, but not in the non-disease network. In addition, the miR-34/449 gene module was enriched among genes that increase in expression over time when airway basal cells are differentiated at an air-liquid interface and among genes that increase in expression with the airway wall thickening in patients with emphysema.
2019-07-31T00:00:00
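
The idea of a microRNA mediating a SNP's effect on an mRNA can be illustrated with a partial correlation: if the SNP acts on the gene only through the microRNA, the SNP and mRNA should be correlated marginally but close to uncorrelated once the microRNA is accounted for. The sketch below is a simplified stand-in and is not the causality inference test used in the abstract; the function names and the toy data are hypothetical and constructed so that mediation is complete.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data, made up for this sketch: SNP dosage drives the microRNA,
# and the microRNA represses the mRNA (SNP -> microRNA -> mRNA).
snp = [0, 0, 0, 1, 1, 1, 2, 2, 2]
mirna_noise = [0.0, 0.1, -0.1, 0.0, 0.1, -0.1, 0.0, 0.1, -0.1]
mrna_noise = [0.05, -0.05, 0.0, -0.05, 0.0, 0.05, 0.0, 0.05, -0.05]
mirna = [1.0 + s + d for s, d in zip(snp, mirna_noise)]
mrna = [10.0 - 2.0 * m + e for m, e in zip(mirna, mrna_noise)]

marginal = pearson(snp, mrna)             # strong SNP-mRNA association
adjusted = partial_corr(snp, mrna, mirna)  # association left after conditioning
print(f"marginal r = {marginal:.3f}, partial r given microRNA = {adjusted:.3f}")
```

In real data the adjusted correlation would rarely vanish exactly, and a formal test would also need to rule out the reverse and independent models before calling the microRNA a mediator.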

    Inference of Temporally Varying Bayesian Networks

    When analysing gene expression time series data, an often overlooked but crucial aspect of the model is that the regulatory network structure may change over time. Whilst some approaches have addressed this problem previously in the literature, many are not well suited to the sequential nature of the data. Here we present a method that allows us to infer regulatory network structures that may vary between time points, utilising a set of hidden states that describe the network structure at a given time point. To model the distribution of the hidden states we have applied the Hierarchical Dirichlet Process Hidden Markov Model, a nonparametric extension of the traditional Hidden Markov Model that does not require us to fix the number of hidden states in advance. We apply our method to existing microarray expression data and demonstrate its efficacy on simulated test data.