26 research outputs found

    Gridsemble: Selective Ensembling for False Discovery Rates

    In this paper, we introduce Gridsemble, a data-driven selective ensembling algorithm for estimating local false discovery rates (fdr) in large-scale multiple hypothesis testing. Existing methods for estimating fdr often yield different conclusions, yet the unobservable nature of fdr values prevents the use of traditional model selection. There is limited guidance on choosing a method for a given dataset, making this an arbitrary decision in practice. Gridsemble circumvents this challenge by ensembling a subset of methods with weights based on their estimated performances, which are computed on synthetic datasets generated to mimic the observed data while including ground truth. We demonstrate through simulation studies and an experimental application that this method outperforms three popular R software packages with their default parameter values, which are common choices given the current landscape. While our applications are in the context of high throughput transcriptomics, we emphasize that Gridsemble is applicable to any use of large-scale multiple hypothesis testing, an approach that is utilized in many fields. We believe that Gridsemble will be a useful tool for computing reliable estimates of fdr and for improving replicability in the presence of multiple hypotheses by eliminating the need for an arbitrary choice of method. Gridsemble is implemented in an open-source R software package available on GitHub at jennalandy/gridsemblefdr. Comment: 12 pages, 3 figures (+ references and supplement). For open-source R software package, see https://github.com/jennalandy/gridsemblefdr. For all code used in the simulation studies and experimental application, see https://github.com/jennalandy/gridsemble_PAPE
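The selective-ensembling idea in this abstract can be illustrated with a minimal sketch. This is not the Gridsemble implementation or its R API: the function name `selective_ensemble`, the error metric, and the inverse-error weighting rule are all assumptions made for illustration. The sketch only shows the general pattern of keeping the best-scoring methods (as judged on synthetic data with known ground truth) and averaging their fdr estimates with performance-based weights.

```python
def selective_ensemble(estimates, errors, n_keep):
    """Weighted average of the n_keep best-scoring fdr estimates.

    estimates: list of per-method lists of fdr estimates on the observed data.
    errors:    list of per-method errors measured on synthetic data
               (hypothetical metric; lower is better).
    n_keep:    number of methods to retain in the ensemble.
    """
    # Rank methods by their error on the synthetic datasets.
    ranked = sorted(range(len(errors)), key=lambda i: errors[i])
    keep = ranked[:n_keep]

    # Assumed weighting rule: weights proportional to inverse error.
    raw = [1.0 / (errors[i] + 1e-12) for i in keep]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Combine the retained estimates test by test.
    n_tests = len(estimates[0])
    return [
        sum(w * estimates[i][j] for w, i in zip(weights, keep))
        for j in range(n_tests)
    ]


if __name__ == "__main__":
    per_method = [[0.1, 0.9], [0.3, 0.7], [0.9, 0.1]]
    synthetic_errors = [0.1, 0.1, 0.5]
    print(selective_ensemble(per_method, synthetic_errors, n_keep=2))
```

In this toy call the third method has the largest synthetic-data error and is dropped, and the two remaining methods receive equal weight because their errors are equal.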

    Predictions Generated from a Simulation Engine for Gene Expression Micro-arrays for use in Research Laboratories

    In this paper we introduce the technical components, the biology and the data science involved in the use of microarray technology in biological and clinical research. We discuss how the laborious experimental protocols used in laboratories to obtain these data could benefit from simulations of the data. We discuss the approach used in the simulation engine from [7]. We use this simulation engine to generate a prediction tool in Power BI, a Microsoft business intelligence tool for analytics and data visualization [22]. This tool could be used in any laboratory using micro-arrays to improve experimental design by comparing predicted signal intensity with observed signal intensity. Signal intensity in micro-arrays is a proxy for the level of gene expression in cells. We suggest further development avenues for the prediction tool.

    CMRF: analyzing differential gene regulation in two group perturbation experiments

    Background: Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often difficult to evaluate the true performance and applicability of computational tools, as some knowledge of computer science and statistics is needed. Results: Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets that contain cases with known outcomes, together with suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. There is no single measure that alone could describe all the aspects of method performance. Predictions of genetic variation effects at the DNA, RNA and protein levels are important because information about variants can be produced much faster than their disease relevance can be experimentally verified. Numerous prediction tools have therefore been developed; however, systematic analyses of their performance and comparisons between them have only just started to emerge. Conclusions: The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture of the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
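The six performance measures named in this abstract all derive from the four cells of a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `classification_measures` and the example counts are illustrative assumptions, not taken from the paper or from VariBench.

```python
import math


def classification_measures(tp, tn, fp, fn):
    """Six standard measures from binary confusion-matrix counts.

    tp, tn, fp, fn: true positives, true negatives,
                    false positives, false negatives.
    """
    def ratio(num, den):
        # Return NaN rather than raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation
    }


if __name__ == "__main__":
    # Hypothetical counts for a two-state classifier on a benchmark set.
    for name, value in classification_measures(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Reporting all six together reflects the abstract's point that no single measure describes every aspect of method performance.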

    Robust Significance Analysis of Microarrays by Minimum β


    Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster

    Comparison of normalization methods across conditions. Boxplots show the differences in the coefficient of variation across flies in each genotype/sex/environment condition. (PDF 245 kb)

    Mining large collections of gene expression data to elucidate transcriptional regulation of biological processes

    A vast amount of gene expression data is available to biological researchers. As of October 2010, the GEO database has 45,777 chips of publicly available gene expression profiling data from the Affymetrix (HGU133v2) GeneChip platform, representing 2.5 billion numerical measurements. Given this wealth of data, 'meta-analysis' methods allowing inferences to be made from combinations of samples from different experiments are critically important. This thesis explores the application of localized pattern-mining approaches, as exemplified by biclustering, for large-scale gene expression analysis. Biclustering methods are particularly attractive for the analysis of large compendia of gene expression data as they allow the extraction of relationships that occur only across subsets of genes and samples. Standard correlation methods, by contrast, assume that a single correlation relationship between two genes holds across all samples in the data. There are a number of existing biclustering methods, but as these did not prove suitable for large-scale analysis, a novel method named 'IslandCluster' was developed. This method provided a framework for investigating the results of different approaches to biclustering meta-analysis. The biclustering methods used in this work involve preprocessing of gene expression data into a unified scale in order to assess the significance of expression patterns. A novel discretisation approach is shown to identify distinct classes of genes' expression values more appropriately than approaches reported in the literature. A Gene Expression State Transformation ('GESTr') is introduced as the first reported modelling of the biological state of expression on a unified scale and is shown to facilitate effective meta-analysis. Localised co-dependency analysis is introduced, a paradigm for identifying transcriptional relationships from gene expression data.
Tools implementing this analysis were developed and used to analyse the specificity of transcriptional relationships, to distinguish related subsets within a set of transcription factor (TF) targets and to tease apart combinatorial regulation of a set of targets by multiple TFs. The state of pluripotency, from which a mammalian cell has the potential to differentiate into any cell from any of the three adult germ layers, is maintained by forced expression of Nanog and may be induced from a non-pluripotent state by the expression of Oct4, Sox2, Klf4 and cMyc. Analysis of cMyc regulatory targets shed light on a recent proposition that cMyc induces an 'embryonic stem cell like' transcriptional signature outside embryonic stem (ES) cells, revealing a cMyc-responsive subset of the signature and identifying ES cell expressed targets with evidence of broad cMyc-induction. Regulatory targets through which cMyc, Oct4, Sox2 and Nanog may maintain or induce pluripotency were identified, offering insight into transcriptional mechanisms involved in the control of pluripotency and demonstrating the utility of the novel analysis approaches presented in this work.

    Study of the molecular mechanisms of malignant melanoma chemoresistance to vinca alkaloids and kinase inhibitors using a transcriptomic approach

    Malignant melanoma (MM), one of the cancers most intrinsically resistant to anticancer agents and one with a strong ability to develop acquired resistance, remains a therapeutic challenge. A better understanding of the mechanisms involved in MM chemoresistance should provide therapeutic targets or guide therapeutic choice for improved efficacy. This thesis focused on the identification of new molecular determinants of MM acquired resistance to (i) vinca alkaloids (VAs, conventional chemotherapy), and to (ii) MAP kinase inhibitors (MAPKi, targeted therapy). In the first study, MM cell lines resistant to VAs (CAL1R-VAs) were established (continuous exposure, 12 months, of the CAL.1-wt parental line to VCR, VDS or VRB: CAL1R-VCR, CAL1R-VDS and CAL1R-VRB respectively). Comparison of expression patterns distinguished two groups of cell lines (CAL1R-VCR and CAL1R-VDS; CAL1R-VRB and CAL.1-wt), suggesting a differential resistance of MM to VAs: on the one hand to VCR and VDS, on the other hand to VRB only. Analysis of the transcriptome data with three methods applied in succession, RMA (Robust Multi-array Average), RDAM (Rank Difference Analysis of Microarrays) and MGSA (model-based gene set analysis), identified functions altered during the selection of the resistant cell lines, and therefore potentially involved in the resistance mechanisms of these cell lines. In vitro functional analyses confirmed the involvement of the lysosomes and of the response to endoplasmic reticulum (ER) stress (unfolded protein response, UPR) in the differential resistance of CAL1 cells to VAs. Thus, an under-expression of cathepsins B and L (bioinformatics) and a reduction of the acidic compartment volume (in vitro) were specifically observed in the first cell group (CAL1R-VCR and CAL1R-VDS), suggesting a reduced sensitivity of these lines to the lysosomal pathway of apoptosis.
Furthermore, UPR inhibition using tauroursodeoxycholic acid (TUDCA) induced a differential sensitization of all the CAL1 lines to VAs, suggesting the involvement of this pathway in the primary and acquired differential resistance to VAs. Moreover, TUDCA inhibition of the UPR sensitized another MM cell line, MDA-MB-435, to VCR and VDS but not to VRB. Thus, UPR up-regulation could be a significant mechanism of differential resistance of MM to VAs. This mechanism could involve autophagy, whose flux was significantly increased in the first group of lines. The same transcriptome analysis strategy was applied to study (ii) the molecular mechanisms of MM acquired resistance to MAPKi. MM cell lines resistant to the three major MAPKi were established by continuous exposure of the parental A375-wt line, carrying the activating mutation BRAF V600E, to vemurafenib (VMF, BRAF inhibitor), dabrafenib (DBF, BRAF inhibitor), or trametinib (TMT, MEK inhibitor): A375R-VMF, A375R-DBF and A375R-TMT, respectively. Comparison of the transcriptomic profiles showed separate expression profiles that did not group the resistant lines together, suggesting that the molecular mechanisms responsible for resistance to VMF, DBF or TMT are different. These mechanisms would therefore be common neither to the targeted pathway (MAPK) nor to the molecular target (BRAF or MEK). The identification of the altered cellular functions will provide a rationale for mechanistic studies of new determinants of MM resistance to MAPKi.

    Information Theoretical Prediction of Alternative Splicing with Application to Type-2 Diabetes Mellitus.

    For basic biomedical research it is of particular interest to determine the activity of genes in different tissues of an organism. Gene activity is defined here by the quantity of a gene's direct products, the transcripts. The abundance of transcripts is quantified by experimental technologies and is referred to as gene expression. A gene does not always produce just one transcript, however; it can produce several transcripts through parallel coding, known as alternative splicing. Such a mechanism is necessary to explain the large number of proteins relative to the comparatively small number of genes: 25,000 genes in humans versus 20,000 in the nematode Caenorhabditis elegans. Alternative splicing controls the expression of different transcript variants under different conditions. It is not surprising that even small errors in splicing can have pathological effects, that is, can cause disease. Because organisms such as humans possess about 25,000 different genes, it was necessary to develop high-throughput methods of data generation for the analysis of global gene expression. With alternative splicing, each of these genes is matched by several transcripts. Only recently has it become possible to generate the necessary amount of data, through technologies such as microarrays or next-generation sequencing. Data analysis methods must keep pace with this technical progress in order to address new research questions. This thesis presents a software pipeline for the analysis of alternative splicing and differential gene expression. It was developed and implemented in the programming language and statistical software R and BioConductor, and comprises the steps of quality control, preprocessing, statistical evaluation of expression changes and gene set analysis.
For the detection of alternative splicing, information theory is introduced into the field of gene expression. The proposed solution consists of an extension of Shannon entropy to the detection of altered transcript abundances and is called ARH (Alternative Splicing Robust Prediction by Entropy). The utility of the developed methods and implementations is demonstrated on data for type-2 diabetes mellitus. Marker genes are determined through data integration and meta-analysis of different data sources, with a focus on differential expression. Alternative splicing is then examined with a special focus on the marker genes and functional gene sets, that is, metabolic pathways.
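The abstract describes ARH as an extension of Shannon entropy to changes in transcript abundance, but does not give its formula. As background only, here is a minimal sketch of plain Shannon entropy computed over the relative abundances of a gene's transcripts. It is not the ARH statistic; the function name `shannon_entropy` and the example abundance values are assumptions for illustration.

```python
import math


def shannon_entropy(abundances):
    """Shannon entropy (in bits) of a gene's relative transcript abundances.

    abundances: non-negative abundance values, one per transcript.
    Zero entries contribute nothing, following the convention 0 * log(0) = 0.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("at least one abundance must be positive")
    probs = [a / total for a in abundances if a > 0]
    return -sum(p * math.log2(p) for p in probs)


if __name__ == "__main__":
    # Hypothetical abundances: one dominant transcript versus an even mix.
    print(shannon_entropy([90.0, 5.0, 5.0]))
    print(shannon_entropy([25.0, 25.0, 25.0, 25.0]))
```

Entropy is lowest when a single transcript dominates and highest when all transcripts are equally abundant, which is why a shift in entropy between conditions can signal a change in splicing pattern.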