
    ENHANCED INTER-STUDY PREDICTION AND BIOMARKER DETECTION IN MICROARRAY WITH APPLICATION TO CANCER STUDIES

    Although microarray technology has been widely applied to the analysis of many malignancies, integrative analyses across multiple studies are rarely investigated, especially for studies on different platforms or of different diseases. Difficulties include differing experimental designs between studies, gene matching, inter-study normalization, and disease heterogeneity. This dissertation is motivated by these issues and investigates two aspects of inter-study analysis. First, we aimed to enhance the inter-study prediction of microarray data from different platforms. Normalization is a critical step for direct inter-study prediction because such prediction applies a model established in one study to data from another study. We found that gene-specific discrepancies in expression intensity levels across studies often persist even after proper sample-wise normalization, which causes major difficulties in direct inter-study prediction. We proposed a sample-wise normalization followed by a ratio-adjusted gene-wise normalization (SN+rGN) method to solve this issue. For both binary classification and survival risk prediction, simulation results, as well as applications to three lung cancer data sets and two prostate cancer data sets, showed a significant and robust improvement with our method. Second, we performed an integrative analysis of the expression profiles of four published studies to detect the biomarkers common among them. The identified predictive biomarkers achieved high predictive accuracy, similar to that obtained using the whole genome sequence, in within-cancer-type prediction. They also outperformed the method using whole genome sequences in inter-cancer-type prediction. The results suggest that the compact lists of predictive biomarkers are important in cancer development and represent common signatures of malignancies of multiple cancer types. Pathway analysis revealed important tumorigenesis functional categories. Our research improved predictions across clinical centers and across diseases and is a necessary step for clinical translation research.
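The two-stage idea in this abstract (sample-wise normalization, then a gene-wise step to remove gene-specific offsets between studies) can be illustrated with a minimal sketch. This is not the SN+rGN method itself: the ratio adjustment is not described in the abstract and is not reproduced here. The function names and the choice of plain standardization and mean-centering are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def sample_normalize(matrix):
    """Standardize each sample (column) to mean 0 and standard deviation 1.

    `matrix` is a list of rows, one row per gene, one column per sample.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        column = [matrix[i][j] for i in range(n_genes)]
        mu, sd = mean(column), pstdev(column)
        for i in range(n_genes):
            out[i][j] = (column[i] - mu) / sd if sd > 0 else 0.0
    return out


def gene_center(matrix):
    """Center each gene (row) on its own mean within one study.

    Assumption: simple mean-centering stands in for the gene-wise step;
    the ratio adjustment of the original method is NOT implemented.
    """
    centered = []
    for row in matrix:
        mu = mean(row)
        centered.append([x - mu for x in row])
    return centered


def normalize_study(matrix):
    """Apply the sample-wise step, then the gene-wise step, to one study."""
    return gene_center(sample_normalize(matrix))


# Hypothetical toy study: 3 genes (rows) measured in 3 samples (columns).
study = [
    [5.0, 6.0, 7.0],
    [1.0, 3.0, 2.0],
    [9.0, 8.0, 10.0],
]
normalized = normalize_study(study)
```

Applying `normalize_study` separately to the training study and the test study puts every gene on a comparable scale in both, which is the precondition for applying a model fitted in one study to the other.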

    The ROS wheel: refining ROS transcriptional footprints

    In the last decade, microarray studies have delivered extensive inventories of transcriptome-wide changes in messenger RNA levels provoked by various types of oxidative stress in Arabidopsis (Arabidopsis thaliana). Previous cross-study comparisons indicated how different types of reactive oxygen species (ROS) and their subcellular accumulation sites are able to reshape the transcriptome in specific manners. However, these analyses often employed simplistic statistical frameworks that are not compatible with large-scale analyses. Here, we reanalyzed a total of 79 Affymetrix ATH1 microarray studies of redox homeostasis perturbation experiments. To impose a hierarchy on such a large number of transcriptomic data sets, all transcriptional profiles were clustered by the extent of overlap between their differentially expressed transcripts. Subsequently, meta-analysis determined a single magnitude of differential expression across studies and identified common transcriptional footprints per cluster. The resulting transcriptional footprints revealed the regulation of various metabolic pathways and gene families. The RESPIRATORY BURST OXIDASE HOMOLOG F-mediated respiratory burst had a major impact and was a converging point among several studies. Conversely, the timing of the oxidative stress response was a determining factor in shaping different transcriptome footprints. Our study emphasizes the need to interpret transcriptomic data sets in a systematic context, where initial, specific stress triggers can converge to common, nonspecific transcriptional changes. We believe that these refined transcriptional footprints provide a valuable resource for assessing the involvement of ROS in biological processes in plants.

    MyoMiner: explore gene co-expression in normal and pathological muscle

    Background: High-throughput transcriptomics measures mRNA levels for thousands of genes in a biological sample. Most gene expression studies aim to identify genes that are differentially expressed between different biological conditions, such as between healthy and diseased states. However, these data can also be used to identify genes that are co-expressed within a biological condition. Gene co-expression is used in a guilt-by-association approach to prioritize candidate genes that could be involved in disease, and to gain insights into the functions of genes, protein relations, and signaling pathways. Most existing gene co-expression databases are generic, amalgamating data for a given organism regardless of tissue type.
    Methods: To study muscle-specific gene co-expression in both normal and pathological states, publicly available gene expression data were acquired for 2376 mouse and 2228 human striated muscle samples, and separated into 142 categories based on species (human or mouse), tissue origin, age, gender, anatomic part, and experimental condition. Co-expression values were calculated for each category to create the MyoMiner database.
    Results: Within each category, users can select a gene of interest, and the MyoMiner web interface will return all correlated genes. For each co-expressed gene pair, an adjusted p-value and confidence intervals are provided as measures of expression correlation strength. A standardized expression-level scatterplot is available for every gene pair r-value. MyoMiner has two extra functions: (a) a network interface for creating a 2-shell correlation network, based either on the most highly correlated genes or on a list of genes provided by the user, with the option to include linked genes from the database; and (b) a comparison tool with which users can test whether any two correlation coefficients from different conditions are significantly different.
    Conclusions: These co-expression analyses will help investigators to delineate the tissue-, cell-, and pathology-specific elements of muscle protein interactions, cell signaling, and gene regulation. Changes in co-expression between pathologic and healthy tissue may suggest new disease mechanisms and help define novel therapeutic targets. Thus, MyoMiner is a powerful muscle-specific database for the discovery of genes that are associated with related functions based on their co-expression. MyoMiner is freely available at https://www.sys-myo.com/myominer
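Testing whether two correlation coefficients from different conditions differ is commonly done with Fisher's z-transformation. The sketch below shows that standard test for two independent samples; it is an assumption that a test of this kind is appropriate here, and the abstract does not say which test MyoMiner actually implements. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference between two Pearson
    correlations estimated from independent samples of size n1 and n2.

    Returns the z statistic and its two-sided p-value.
    """
    # Fisher transformation: atanh(r) is approximately normal
    # with variance 1 / (n - 3).
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical example: the same gene pair in healthy and diseased muscle.
z_stat, p = compare_correlations(0.9, 100, 0.1, 100)
```

The test requires more than three samples per condition and correlations strictly between -1 and 1, since `math.atanh` is undefined at the endpoints.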

    Small data: practical modeling issues in human-model -omic data

    This thesis is based on the following articles:
    Chapter 2: Holsbø, E., Perduca, V., Bongo, L.A., Lund, E. & Birmelé, E. (Manuscript). Stratified time-course gene preselection shows a pre-diagnostic transcriptomic signal for metastasis in blood cells: a proof of concept from the NOWAC study. Available at https://doi.org/10.1101/141325.
    Chapter 3: Bøvelstad, H.M., Holsbø, E., Bongo, L.A. & Lund, E. (Manuscript). A Standard Operating Procedure For Outlier Removal In Large-Sample Epidemiological Transcriptomics Datasets. Available at https://doi.org/10.1101/144519.
    Chapter 4: Holsbø, E. & Perduca, V. (2018). Shrinkage estimation of rate statistics. Case Studies in Business, Industry and Government Statistics 7(1), 14-25. Also available at http://hdl.handle.net/10037/14678.

    Human-model data are very valuable and important in biomedical research. Ethical and economic constraints limit access to such data, and consequently these datasets rarely comprise more than a few hundred observations. As measurements are comparatively cheap, the tendency is to measure as many things as possible for the few, valuable participants in a study. With -omics technologies it is cheap and simple to make hundreds of thousands of measurements simultaneously. This few observations–many measurements setting is, in technical language, a high-dimensional problem. Most gene expression experiments measure the expression levels of 10 000–15 000 genes for fewer than 100 subjects. I refer to this as the small data setting. This dissertation is an exercise in practical data analysis as it happens in a large epidemiological cohort study. It comprises three main projects: (i) predictive modeling of breast cancer metastasis from whole-blood transcriptomics measurements; (ii) standardizing a microarray data quality assessment in the Norwegian Women and Cancer (NOWAC) postgenome cohort; and (iii) shrinkage estimation of rates. All three are small data analyses, for various reasons. Predictive modeling in the small data setting is very challenging. There are several modern methods built to tackle high-dimensional data, but there is a need to evaluate these methods against one another when analyzing data in practice. Through the metastasis prediction work we learned first-hand that common practices in machine learning can be inefficient or harmful, especially for small data. I will outline some of the more important issues. In a large project such as NOWAC there is a need to centralize and disseminate knowledge and procedures. The standardization of NOWAC quality assessment was a project born of necessity. The standard operating procedure for outlier removal was developed so that preprocessing of the NOWAC microarray material happens in the same way every time. We take this procedure from an archaic R script that resided in people's email inboxes to a well-documented, open-source R package and present the NOWAC guidelines for microarray quality control. The procedure is built around the inherent high value of a single observation. Small data are plagued by high variance. When working with small data it is usually profitable to bias models by shrinkage or by borrowing information from elsewhere. We present a pseudo-Bayesian estimator of rates in an informal crime rate study. We exhibit the value of such procedures in a small data setting and demonstrate some novel considerations about the coverage properties of such a procedure. In short, I gather some common practices in predictive modeling as applied to small data and assess their practical implications. I argue that with more focus on human-based datasets in biomedicine there is a need for particular consideration of these data in a small data paradigm to allow for reliable analysis. I will present what I believe to be sensible guidelines.
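The idea of biasing a rate estimate by borrowing information from elsewhere can be sketched as follows. This is a generic shrinkage estimator that pulls each unit's raw rate toward the pooled rate; it is an illustration under stated assumptions, not the pseudo-Bayesian estimator developed in the thesis. The function name and the `prior_strength` parameter (the number of pseudo-observations given to the pooled rate) are hypothetical.

```python
def shrunken_rates(events, populations, prior_strength):
    """Shrink each unit's raw rate (events / population) toward the pooled rate.

    Each estimate is a weighted average of the raw rate and the pooled rate,
    with weights `population` and `prior_strength` respectively. Units with
    few observations are therefore pulled hardest toward the pooled rate.
    """
    pooled = sum(events) / sum(populations)
    return [
        (e + prior_strength * pooled) / (n + prior_strength)
        for e, n in zip(events, populations)
    ]


# Hypothetical counts: a tiny unit with zero events and two larger units.
events = [0, 5, 50]
populations = [10, 100, 1000]
estimates = shrunken_rates(events, populations, prior_strength=100)
```

With `prior_strength=0` the function returns the raw rates. Larger values trade variance for bias, which is the trade-off the abstract describes as usually profitable for small data.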

    Sensitivity Analysis of the MGMT-STP27 Model and Impact of Genetic and Epigenetic Context to Predict the MGMT Methylation Status in Gliomas and Other Tumors.

    The methylation status of the O(6)-methylguanine-DNA methyltransferase (MGMT) gene is an important predictive biomarker for benefit from alkylating agent therapy in glioblastoma. Our model MGMT-STP27 allows prediction of the methylation status of the MGMT promoter using data from Illumina's Human Methylation BeadChips (HM-27K and HM-450K), which are publicly available for many cancer data sets. Here, we investigate the impact of the context of genetic and epigenetic alterations and tumor type on the classification and report on technical aspects, such as robustness of the cutoff definition and preprocessing of the data. The association between gene copy number variation, predicted MGMT methylation, and MGMT expression revealed a gene dosage effect on MGMT expression in lower grade glioma (World Health Organization grade II/III), which, in contrast to glioblastoma, usually carries two copies of chromosome 10, on which MGMT resides (10q26.3). This implies some MGMT expression, potentially conferring residual repair function that blunts the therapeutic effect of alkylating agents. A sensitivity analysis corroborated the performance of the original cutoff for various optimization criteria and for most data preprocessing methods. Finally, we propose an R package, mgmtstp27, that allows prediction of the methylation status of the MGMT promoter and calculation of appropriate confidence and/or prediction intervals. Overall, MGMT-STP27 is a robust model for MGMT classification that is independent of tumor type and is adapted for single-sample prediction.

    Bioinformatics applied to human genomics and proteomics: development of algorithms and methods for the discovery of molecular signatures derived from omic data and for the construction of co-expression and interaction networks

    The present PhD dissertation develops and applies bioinformatic methods and tools to address key current problems in the analysis of human omic data. The work is organised by main objective into four chapters focused on: (i) development of an algorithm for the analysis of changes and heterogeneity in large-scale omic data; (ii) development of a method for non-parametric feature selection; (iii) integration and analysis of human protein-protein interaction networks; and (iv) integration and analysis of human co-expression networks derived from tissue expression data and evolutionary profiles of proteins. In the first chapter, we developed and tested a new robust algorithm in R, called DECO, for the discovery of subgroups of features and samples within large-scale omic datasets, exploring all feature differences and possible heterogeneity through the integration of both data dispersion and predictor-response information in a new statistical parameter called h (heterogeneity score). In the second chapter, we present a simple non-parametric statistic to measure the cohesiveness of categorical variables along any quantitative variable, applicable to feature selection in all types of big data sets. In the third chapter, we describe an analysis of the human interactome integrating two global datasets from high-quality proteomics technologies: HuRI (a human protein-protein interaction network generated by a systematic experimental screening based on Yeast-Two-Hybrid technology) and Cell-Atlas (a comprehensive map of subcellular localization of human proteins generated by antibody imaging). This analysis aims to create a framework for characterizing subcellular localization supported by the human protein-protein interactome. In the fourth chapter, we developed a full integration of three high-quality proteome-wide resources (Human Protein Atlas, OMA and TimeTree) to generate a robust human co-expression network across tissues, assigning each human protein a position along the evolutionary timeline. In this way, we investigate how old in evolution and how correlated the different human proteins are, and we place them all in a common interaction network. Overall, the work presented in this PhD uses and develops a wide variety of bioinformatic and statistical tools for the analysis, integration, and elucidation of molecular signatures and biological networks using human omic data. Most of these data correspond to sample cohorts generated in recent biomedical studies on specific human diseases.

    Unified Transcriptomic Signature of Arbuscular Mycorrhiza Colonization in Roots of Medicago truncatula by Integration of Machine Learning, Promoter Analysis, and Direct Merging Meta-Analysis

    Plant root symbiosis with arbuscular mycorrhizal (AM) fungi improves uptake of water and mineral nutrients, improving plant development under stressful conditions. Unraveling the unified transcriptomic signature of a successful colonization provides a better understanding of symbiosis. We developed a framework for finding the transcriptomic signature of arbuscular mycorrhiza colonization and its regulating transcription factors in roots of Medicago truncatula. Expression profiles of roots in response to AM species were collected from four separate studies and were combined by direct merging meta-analysis. Batch effect, the major concern in expression meta-analysis, was reduced by three normalization steps: Robust Multi-array Average algorithm, Z-standardization, and quartiling normalization. Then, the expression profiles of 33,685 genes in 18 root samples of Medicago as numerical features, as well as study ID and arbuscular mycorrhiza type as categorical features, were mined by seven models: RELIEF, UNCERTAINTY, GINI INDEX, Chi Squared, RULE, INFO GAIN, and INFO GAIN RATIO. In total, 73 genes selected by machine learning models were up-regulated in response to AM (Z-value difference > 0.5). Feature weighting models also documented that this signature is independent of study (batch) effect. The AM inoculation signature obtained was able to differentiate efficiently between AM-inoculated and non-inoculated samples. The AP2 domain class transcription factor, GRAS family transcription factors, and cyclin-dependent kinase were among the highly expressed meta-genes identified in the signature. We found high correspondence between the AM colonization signature obtained in this study and independent RNA-seq experiments on AM colonization, validating the repeatability of the colonization signature. Promoter analysis of upregulated genes in the transcriptomic signature led to the key regulators of AM colonization, including the essential transcription factors for endosymbiosis establishment and development such as NF-YA factors. The approach developed in this study offers three distinct novel features: (I) it improves direct merging meta-analysis by integrating supervised machine learning models and normalization steps to reduce study-specific batch effects; (II) seven attribute weighting models assess the suitability of each gene for the transcriptomic signature, which contributes to the robustness of the signature; and (III) the approach is justifiable, easy to apply, and useful in practice. Our integrative framework of meta-analysis, promoter analysis, and machine learning provides a foundation to reveal the transcriptomic signature and regulatory circuits governing arbuscular mycorrhizal symbiosis and is transferable to other biological settings.
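The Z-standardization step and the "Z-value difference > 0.5" criterion mentioned in this abstract can be sketched as below. This is a simplified illustration, not the authors' pipeline: it omits the Robust Multi-array Average and quartiling steps and the seven attribute weighting models, and it assumes that standardization is applied per sample across genes, which the abstract does not specify. All function names, gene names, and values are hypothetical.

```python
from statistics import mean, pstdev


def z_standardize_samples(expression):
    """Convert each sample (column) to z-scores across genes.

    `expression` maps a gene name to its list of values, one per sample.
    """
    genes = list(expression)
    n_samples = len(expression[genes[0]])
    z = {gene: [0.0] * n_samples for gene in genes}
    for j in range(n_samples):
        column = [expression[gene][j] for gene in genes]
        mu, sd = mean(column), pstdev(column)
        for gene in genes:
            z[gene][j] = (expression[gene][j] - mu) / sd if sd > 0 else 0.0
    return z


def upregulated_genes(expression, inoculated, threshold=0.5):
    """Return genes whose mean z-score in inoculated samples exceeds the
    mean z-score in control samples by more than `threshold`."""
    z = z_standardize_samples(expression)
    selected = []
    for gene, values in z.items():
        treated = [v for v, flag in zip(values, inoculated) if flag]
        control = [v for v, flag in zip(values, inoculated) if not flag]
        if mean(treated) - mean(control) > threshold:
            selected.append(gene)
    return selected


# Hypothetical toy data: two control samples followed by two inoculated ones.
expression = {
    "geneA": [1.0, 1.0, 5.0, 5.0],
    "geneB": [3.0, 3.0, 3.0, 3.0],
    "geneC": [5.0, 5.0, 1.0, 1.0],
}
inoculated = [False, False, True, True]
signature = upregulated_genes(expression, inoculated)
```

Because z-scores are unitless, the same threshold can be applied to samples merged from different studies, which is what makes direct merging of the four data sets workable.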

    Feature selection and modelling methods for microarray data from acute coronary syndrome

    Acute coronary syndrome (ACS) represents a leading cause of mortality and morbidity worldwide. Providing better diagnostic solutions and developing therapeutic strategies customized to the individual patient represent societal and economic urgencies. Progressive improvement in diagnosis and treatment procedures requires a thorough understanding of the underlying genetic mechanisms of the disease. Recent advances in microarray technologies, together with the decreasing costs of the specialized equipment, have enabled affordable harvesting of time-course gene expression data. The high-dimensional data generated demand computational tools able to extract the underlying biological knowledge. This thesis is concerned with developing new methods for analysing time-course gene expression data, focused on identifying differentially expressed genes, deconvolving heterogeneous gene expression measurements, and inferring dynamic gene regulatory interactions. The main contributions include: a novel multi-stage feature selection method; a new deconvolution approach for estimating cell-type-specific signatures and quantifying the contribution of each cell type to the variance of the gene expression patterns; a novel approach to identify the cellular sources of differential gene expression; a new approach to model gene expression dynamics using sums of exponentials; and a novel method to estimate stable linear dynamical systems from noisy and unequally spaced time series data. The performance of the proposed methods was demonstrated on a time-course dataset consisting of microarray gene expression levels collected from the blood samples of patients with ACS and associated blood count measurements. The results of the feature selection study are of significant biological relevance. For the first time, high diagnostic performance for the ACS subtypes was reported up to three months after hospital admission. The deconvolution study exposed features of within- and between-group variation in expression measurements and identified potential cell type markers and cellular sources of differential gene expression. It was shown that the dynamics of post-admission gene expression data can be accurately modelled using sums of exponentials, suggesting that gene expression levels undergo a transient response to the ACS events before returning to equilibrium. The linear dynamical models capturing the gene regulatory interactions exhibit high predictive performance and can serve as platforms for system-level analysis, numerical simulations, and intervention studies.

    Genes and Gene Networks Related to Age-associated Learning Impairments

    The incidence of cognitive impairments, including age-associated spatial learning impairment (ASLI), has risen dramatically in past decades due to increasing human longevity. To better understand the genes and gene networks involved in ASLI, data from a number of past gene expression microarray studies in rats are integrated and used to perform a meta-analysis and a network analysis. Results from the data selection and preprocessing steps show that, for effective downstream analysis to take place, both batch effects and outlier samples must be properly removed. The meta-analysis undertaken in this research has identified significant differentially expressed genes across both age and ASLI in rats. Knowledge-based gene network analysis shows that these genes affect many key functions and pathways in aged compared to young rats. The resulting changes might manifest as various neurodegenerative diseases/disorders or syndromic memory impairments at old age. Other changes might result in altered synaptic plasticity, thereby leading to normal, non-syndromic learning impairments such as ASLI. Next, I apply weighted gene co-expression network analysis (WGCNA) to the datasets. I identify several reproducible network modules, each highly significant, with genes functioning in specific biological functional categories. The analysis identifies a "learning and memory" specific module containing many potential key ASLI hub genes. Functions of these ASLI hub genes link a different set of mechanisms to learning and memory formation, which the meta-analysis was unable to detect. This study generates some new hypotheses related to the new candidate genes and networks in ASLI, which could be investigated through future research.

    Chemometrics and statistical analysis in Raman spectroscopy-based biological investigations

    As mentioned in Chapter 1, chemometrics has become an essential tool in Raman spectroscopy-based biological investigations and has significantly enhanced the sensitivity of Raman spectroscopy-based detection. However, there are some open issues in applying chemometrics in Raman spectroscopy-based biological investigations. An automatic procedure is needed to optimize the parameters of the mathematical baseline correction. A spectral reconstruction algorithm is required to recover a fluorescence-free Raman spectrum from the two Raman spectra measured with different excitation wavelengths for the shifted-excitation Raman difference spectroscopy (SERDS) technique. Guidelines are necessary for reliable model optimization and rigorous model evaluation to ensure high accuracy and robustness in Raman spectroscopy-based biological detection. Computational methods are required to enable a trained model to successfully predict new data that differ significantly from the training data due to inter-replicate variations. These four tasks, which are examples of open issues in chemometrics, were tackled in this thesis. The associated investigations addressed three main topics: baseline correction, statistical modeling, and model transfer.