1,845 research outputs found

    Comprehensive evaluation of RNA-seq quantification methods for linearity

    Figure S3. Concordant analysis between rank of estimated quantifications and rank of measured abundance value at gene level (a) and isoform level (b). The fitted value in the y-axis is estimated from model D ∼ m×A + n×B + ε. Ranks were normalized by the number of quantifications in each plot. (PDF 5950 kb)
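As a rough illustration of the model D ∼ m×A + n×B + ε named in the caption, the sketch below estimates m and n by ordinary least squares without an intercept. The function name `fit_mixture`, the synthetic gamma-distributed abundances, and the mixing values 0.25 and 0.75 are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np


def fit_mixture(a, b, d):
    """Least-squares estimates of m and n in d ≈ m*a + n*b (no intercept)."""
    design = np.column_stack([a, b])            # one column per pure sample
    coef, *_ = np.linalg.lstsq(design, d, rcond=None)
    fitted = design @ coef                      # fitted values for the mixed sample
    return coef[0], coef[1], fitted


# Synthetic data (assumption): 500 transcripts, mixed 0.25 : 0.75 plus noise.
rng = np.random.default_rng(0)
a = rng.gamma(shape=2.0, scale=50.0, size=500)
b = rng.gamma(shape=2.0, scale=50.0, size=500)
d = 0.25 * a + 0.75 * b + rng.normal(0.0, 1.0, size=500)

m, n, fitted = fit_mixture(a, b, d)
print(f"m = {m:.3f}, n = {n:.3f}")
```

Comparing the ranks of `fitted` with the ranks of the observed `d` would give a concordance plot of the kind the caption describes.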

    Deconvolving the contributions of cell-type heterogeneity on cortical gene expression

    Complexity of cell-type composition has created much skepticism surrounding the interpretation of bulk tissue transcriptomic studies. Recent studies have shown that deconvolution algorithms can be applied to computationally estimate cell-type proportions from gene expression data of bulk blood samples, but their performance when applied to brain tissue is unclear. Here, we have generated an immunohistochemistry (IHC) dataset for five major cell-types from brain tissue of 70 individuals, who also have bulk cortical gene expression data. With the IHC data as the benchmark, this resource enables quantitative assessment of deconvolution algorithms for brain tissue. We apply existing deconvolution algorithms to brain tissue by using marker sets derived from human brain single cell and cell-sorted RNA-seq data. We show that these algorithms can indeed produce informative estimates of constituent cell-type proportions. In fact, neuronal subpopulations can also be estimated from bulk brain tissue samples. Further, we show that including the cell-type proportion estimates as confounding factors is important for reducing false associations between Alzheimer's disease phenotypes and gene expression. Lastly, we demonstrate that using more accurate marker sets can substantially improve statistical power in detecting cell-type specific expression quantitative trait loci (eQTLs).
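The abstract does not say which deconvolution algorithms were benchmarked. As a generic sketch of the idea, the code below treats a bulk expression vector as a non-negative mixture of cell-type signature profiles and recovers the proportions by non-negative least squares, solved here with projected gradient descent. The function name `estimate_proportions`, the random signature matrix, and the "true" proportions are hypothetical; only the count of five cell types echoes the abstract.

```python
import numpy as np


def estimate_proportions(signature, bulk, n_iter=2000):
    """Estimate cell-type proportions p >= 0 with bulk ≈ signature @ p.

    Solves non-negative least squares by projected gradient descent and
    rescales the result so the proportions sum to one.
    """
    S = np.asarray(signature, dtype=float)      # genes x cell types
    y = np.asarray(bulk, dtype=float)           # genes
    step = 1.0 / np.linalg.norm(S, 2) ** 2      # 1 / Lipschitz constant of the gradient
    p = np.full(S.shape[1], 1.0 / S.shape[1])   # start from equal proportions
    for _ in range(n_iter):
        grad = S.T @ (S @ p - y)
        p = np.maximum(p - step * grad, 0.0)    # project onto p >= 0
    return p / p.sum()


# Synthetic example (assumption): 200 marker genes, 5 cell types.
rng = np.random.default_rng(1)
signature = rng.gamma(shape=2.0, scale=1.0, size=(200, 5))
true_p = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
bulk = signature @ true_p + rng.normal(0.0, 0.01, size=200)

est = estimate_proportions(signature, bulk)
print(np.round(est, 3))
```

In a benchmark like the one described, estimates of this kind would be compared against the IHC-measured proportions for the same individuals.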

    Clonal structures and cell interactions in cancer

    Despite sharing an identical genome, cells of higher order multicellular organisms display a large degree of phenotypic diversity. This diversity is maintained by a sophisticated regulatory machinery that integrates information from both intrinsic and extrinsic factors, ultimately coordinating the appropriate gene expression. Sequencing methods such as RNA and DNA sequencing have become indispensable tools in the pursuit to understand gene regulation. In recent years, the integration of single-cell sequencing techniques and CRISPR-based methods has ushered in a new era of genomic exploration, providing unprecedented opportunities to investigate the intricate interplay between genes, cellular processes, and disease progression. These cutting-edge advances have transformed the research landscape, enabling in-depth studies of gene regulation in single cells, and paving the way for future discoveries in both healthy and malignant tissues. While cancer has traditionally been studied as a genetic disease, it is now evident that mutations alone do not determine cancer initiation or progression. This notion is supported by two key observations: first, cancer-driving mutations do not always lead to malignancy; and second, identical mutations can yield different outcomes depending on the cell type in which they occur. Consequently, a deeper understanding of gene regulation and the various ways it is modulated is critical for deciphering the complex relationship between genetic changes and cancer initiation. In this thesis we aimed to develop novel single-cell methodologies applicable to studying complex biological systems. We have developed four techniques: CIM-seq, DNTR-seq, Smart3-ATAC, and ACTIseq, described in papers I-IV, respectively. The methods all capture additional modalities in combination with single-cell RNA-seq data, including spatial information, whole genome sequencing, accessible chromatin, and direct readout of guide RNAs. 
We applied these methods to investigate biological systems at the single-cell level, offering a more comprehensive understanding of cellular behavior in health and disease. Our approaches have allowed us to characterize stem cell niches and regeneration dynamics in the epithelial layer of the colon, and delve into the effects of gene dosage, quantifying how mutational changes impact transcriptional output. Furthermore, we have explored the complex landscape of gene regulation within pancreatic ductal adenocarcinomas, identifying mechanisms that enable cancer growth and proliferation. This body of work emphasizes the importance of multimodal and integrative approaches for unraveling the complexities of biological systems at a cellular level. The methods we've developed represent a significant step forward, promising to facilitate the discovery of molecular targets for cancer therapeutics.

    Decloud: an unsupervised deconvolution tool for building gene expression profiles

    Deconvolution is the process of decomposing a mixed signal into its originating elements. For my thesis I created a clustering application, named DeCloud, intended to replace the unsupervised clustering step in the deconvolution tool Deblender. Utilizing clustering packages in R such as optCluster, the application was built to allow for a range of new clustering algorithms. In this thesis the scope has been set to testing hierarchical clustering and two variations of PAM. A novel filtering function was created, providing a different approach to handling clusters. The novel approach has been implemented for use with the PAM clustering method, but could be applied to other algorithms as well. We have tested the resulting pipeline on the data sets used to benchmark Deblender and other tools. Comparing the results obtained by Deblender and by DeCloud shows that DeCloud obtains markedly better results on two of the three datasets used for testing. The last dataset is a complicated case that none of the applications is able to cluster and deconvolve effectively. The novel filter function applied to the PAM algorithm was the best performer on each of the two successfully deconvolved datasets. Master's Thesis in Informatics INF39

    Mendelian Randomization And Single Cell Deconvolution: Two Problems In Statistics Genetics

    Finding interpretable targets within the genome for diseases is a primary goal of biomedical research. This thesis focuses on developing statistical models and methods for analysis of high throughput genomic and transcriptomic sequencing data with the goal of finding actionable targets of two types, disease-associated genes and disease-implicated cell types. Traditional genome wide association studies (GWAS) focus on finding the association between genetic variants and diseases. However, GWAS results are often difficult to interpret, and they do not directly lead to an understanding of the true biological mechanism of diseases. Following GWAS findings, we can study the causal effect by Mendelian randomization (MR), which uses segregating genomic loci as instrumental variables to estimate the causal effect of a given exposure on disease outcome. In this thesis, we introduce the concept of "localizable exposures", which are exposures that can be localized, or mapped, to a specific region in the genome, such as the expression of a single gene or the methylation of a specific locus. With sequencing technology, allele specific reads are observable for localizable exposures, which allows their quantification in an allele-specific manner. In the first part of this thesis, we present a new model, ASMR, that uses allele-specific information for Mendelian randomization. This thesis also develops methods for finding cell types implicated in disease through the joint analysis of bulk and single cell RNA sequencing data. Bulk tissue sequencing is often used to probe genes that have tissue-level expression changes between biological cohorts. However, tissues are usually a mixture of multiple distinct cell types, and the tissue-level changes are due to shifts of cell type proportions as well as cell type specific expression changes. Single-cell RNA sequencing (scRNA-seq) allows the investigation of the roles of individual cell types during disease initiation and development. 
We present MuSiC, a method that utilizes cell-type specific gene expression from single-cell RNA sequencing (RNA-seq) data to characterize cell type compositions from bulk RNA-seq data in complex tissues. When applied to pancreatic islet and whole kidney expression data in humans, mice, and rats, MuSiC outperforms existing methods, especially for tissues with closely related cell types. With MuSiC-estimated cell type proportions, we propose a reverse estimation procedure that can detect cell type specific differential expression, allowing for the elucidation of the roles of genes and cell types, as well as their interactions, on disease phenotypes.

    Distance-based methods for the analysis of Next-Generation sequencing data

    The analysis of NGS data is a central aspect of modern genomic research. However, extracting data from the two most commonly used source organisms poses a variety of problems. The first chapter presents a novel approach that determines a distance between cancer cell line cultures based on their small genomic variants in order to identify the cultures. A whole-exome-sequenced culture is identified through pairwise comparisons with reference datasets, provided a measured distance is smaller than would be expected for unrelated cultures. The effectiveness of the method was verified, but a limitation remains because only the whole-exome sequencing format is supported. The second chapter therefore presents a published modification of the approach that adds support for the widely used bulk RNA sequencing and panel sequencing. However, broadening the technology base amplifies confounding effects, which lead to violations of the mathematical conditions of a distance metric. These violations are therefore first quantified by statistical methods and then successfully compensated for by dynamic threshold adjustments. The third chapter presents a novel data augmentation method that enables machine learning models to be trained in the absence of neoplastic training data. An abstract distance measure between neoplastic entities and entities of healthy origin is established by means of a transcriptomic deconvolution. The deconvolution output then allows effective prediction of clinical characteristics of rare yet biologically diverse cancer types, with predictive power on par with the established gold standard.
    The analysis of NGS data is a central aspect of modern Molecular Genetics and Oncology. 
The first scientific contribution is the development of a method which identifies whole-exome-sequenced cancer cell lines (CCL) via the quantification of a distance between their sets of small genomic variants. A distinguishing aspect of the method is that it was designed for the computer-based identification of NGS-sequenced CCL. An identification of an unknown CCL occurs when its abstract distance to a known CCL is smaller than is expected due to chance. The method performed favorably during benchmarks but only supported the whole-exome-sequencing technology. The second contribution therefore extended the identification method by additionally supporting the bulk mRNA-sequencing technology and the panel-sequencing format. However, the technological extension incurred predictive biases which detrimentally affected the quantification of abstract distances. Hence, statistical methods were introduced to quantify and compensate for confounding factors. The method revealed a heterogeneity-robust benchmark performance at the trade-off of a slightly reduced sensitivity compared to the whole-exome-sequencing method. The third contribution is a method which trains machine-learning models for rare and diverse cancer types. Machine-learning models are subsequently trained on these distances to predict clinically relevant characteristics. The performance of such-trained models was comparable to that of models trained on both the substituted neoplastic data and the gold-standard biomarker Ki-67. No proliferation rate-indicative features were utilized to predict clinical characteristics, which is why the method can complement the proliferation rate-oriented pathological assessment of biopsies. The thesis revealed that the quantification of an abstract distance can address sources of erroneous NGS data analysis.
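The abstract does not define the distance it uses. Purely as an illustration of identifying a sample by the distance between variant sets, the sketch below uses the Jaccard distance and a fixed cut-off. The Jaccard choice, the function names, the variant tuples, the reference names, and the threshold of 0.5 are all assumptions for this example, not the thesis's method; the thesis instead compares distances against what is expected by chance.

```python
def jaccard_distance(variants_a, variants_b):
    """Jaccard distance between two collections of variants (0 = identical sets)."""
    a, b = set(variants_a), set(variants_b)
    union = a | b
    if not union:                       # two empty sets are treated as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)


def identify(query, references, threshold=0.5):
    """Return (name, distance) of the closest reference, or None if none is close enough.

    `threshold` is a hypothetical fixed cut-off standing in for a
    chance-based significance criterion.
    """
    scored = [(name, jaccard_distance(query, variants))
              for name, variants in references.items()]
    if not scored:
        return None
    best = min(scored, key=lambda item: item[1])
    return best if best[1] < threshold else None


# Hypothetical variants encoded as (chromosome, position, ref allele, alt allele).
v1, v2, v3 = ("chr1", 1000, "A", "G"), ("chr2", 2000, "C", "T"), ("chr3", 3000, "G", "A")
v4, v5 = ("chr4", 4000, "T", "C"), ("chr5", 5000, "A", "T")
v6, v7 = ("chr6", 6000, "C", "G"), ("chr7", 7000, "G", "T")

query = {v1, v2, v3, v4}
references = {"CCL_A": {v1, v2, v3, v5}, "CCL_B": {v6, v7}}

print(identify(query, references))
```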

    Bioinformatics tools for cancer metabolomics

    It is well known that significant metabolic changes take place as cells are transformed from normal to malignant. This review focuses on the use of different bioinformatics tools in cancer metabolomics studies. The article begins by describing different metabolomics technologies and data generation techniques. An overview of the data pre-processing techniques is provided, and multivariate data analysis techniques are discussed and illustrated with case studies, including principal component analysis, clustering techniques, self-organizing maps, partial least squares, and discriminant function analysis. Also included is a discussion of available software packages.

    Glia Cell Morphology Analysis Using the Fiji GliaMorph Toolkit

    Glial cells are the support cells of the nervous system. Glial cells typically have elaborate morphologies that facilitate close contacts with neighboring neurons, synapses, and the vasculature. In the retina, Müller glia (MG) are the principal glial cell type that supports neuronal function by providing a myriad of supportive functions via intricate cell morphologies and precise contacts. Thus, complex glial morphology is critical for glial function, but remains challenging to resolve at a sub-cellular level or reproducibly quantify in complex tissues. To address this issue, we developed GliaMorph as a Fiji-based macro toolkit that allows 3D glial cell morphology analysis in the developing and mature retina. As GliaMorph is implemented in a modular fashion, here we present guides to (a) setup of GliaMorph, (b) data understanding in 3D, including z-axis intensity decay and signal-to-noise ratio, (c) pre-processing data to enhance image quality, (d) performing and examining image segmentation, and (e) 3D quantification of MG features, including apicobasal texture analysis. To allow easier application, GliaMorph tools are supported with graphical user interfaces where appropriate, and example data are publicly available to facilitate adoption. Further, GliaMorph can be modified to meet users' morphological analysis needs for other glial or neuronal shapes. Finally, this article provides users with an in-depth understanding of data requirements and the workflow of GliaMorph. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC. 
Basic Protocol 1: Download and installation of GliaMorph components including example data
Basic Protocol 2: Understanding data properties and quality 3D, essential for subsequent analysis and capturing data property issues early
Basic Protocol 3: Pre-processing AiryScan microscopy data for analysis
Alternate Protocol: Pre-processing confocal microscopy data for analysis
Basic Protocol 4: Segmentation of glial cells
Basic Protocol 5: 3D quantification of glial cell morphology

    Computational solutions for addressing heterogeneity in DNA methylation data

    DNA methylation, a reversible epigenetic modification, has been implicated in various biological processes including gene regulation. Due to the multitude of datasets available, it is a premier candidate for computational tool development, especially for investigating heterogeneity within and across samples. We differentiate between three levels of heterogeneity in DNA methylation data: between-group, between-sample, and within-sample heterogeneity. Here, we separately address these three levels and present new computational approaches to quantify and systematically investigate heterogeneity. Epigenome-wide association studies relate a DNA methylation aberration to a phenotype and therefore address between-group heterogeneity. To facilitate such studies, which necessarily include data processing, exploratory data analysis, and differential analysis of DNA methylation, we extended the R-package RnBeads. We implemented novel methods for calculating the epigenetic age of individuals, novel imputation methods, and differential variability analysis. A use-case of the new features is presented using samples from Ewing sarcoma patients. As an important driver of epigenetic differences between phenotypes, we systematically investigated associations between donor genotypes and DNA methylation states in methylation quantitative trait loci (methQTL). To that end, we developed a novel computational framework, MAGAR, for determining statistically significant associations between genetic and epigenetic variations. We applied the new pipeline to samples obtained from sorted blood cells and complex bowel tissues of healthy individuals and found that tissue-specific and common methQTLs have distinct genomic locations and biological properties. 
To investigate cell-type-specific DNA methylation profiles, which are the main drivers of within-group heterogeneity, computational deconvolution methods can be used to dissect DNA methylation patterns into latent methylation components. Deconvolution methods require profiles of high technical quality and the identified components need to be biologically interpreted. We developed a computational pipeline to perform deconvolution of complex DNA methylation data, which implements crucial data processing steps and facilitates result interpretation. We applied the protocol to lung adenocarcinoma samples and found indications of tumor infiltration by immune cells and associations of the detected components with patient survival. Within-sample heterogeneity (WSH), i.e., heterogeneous DNA methylation patterns at a genomic locus within a biological sample, is often neglected in epigenomic studies. We present the first systematic benchmark of scores quantifying WSH genome-wide using simulated and experimental data. Additionally, we created two novel scores that quantify DNA methylation heterogeneity at single CpG resolution with improved robustness toward technical biases. WSH scores describe different types of WSH in simulated data, quantify differential heterogeneity, and serve as a reliable estimator of tumor purity. Due to the broad availability of DNA methylation data, the levels of heterogeneity in DNA methylation data can be comprehensively investigated. We contribute novel computational frameworks for analyzing DNA methylation data with respect to different levels of heterogeneity. We envision that this toolbox will be indispensable for understanding the functional implications of DNA methylation patterns in health and disease.
DNA methylation is a reversible epigenetic modification that is associated with various biological processes, such as gene regulation. The large number of DNA methylation datasets provides an ideal basis for developing software applications, especially for describing heterogeneity within and between samples. We distinguish three levels of heterogeneity in DNA methylation data: between groups, between samples, and within a sample. Here, we consider these three levels independently and present new approaches to describe and quantify heterogeneity. Epigenome-wide association studies link a change in DNA methylation to a phenotype and thus describe heterogeneity between groups. To simplify such studies, which involve data processing as well as exploratory and differential data analysis, we extended the R-based software RnBeads. The extensions include new methods for predicting epigenetic age, new imputation methods for missing data points, and differential variability analysis. The analysis of data from Ewing sarcoma patients was chosen as a use case for the newly developed methods. We investigated associations between genotypes and the DNA methylation of individual CpGs to define so-called methylation quantitative trait loci (methQTL). These are an important factor inducing epigenetic differences between groups. To this end, we developed a new software package (MAGAR) for identifying statistically significant associations between genetic and epigenetic variation. We applied this pipeline to blood cell types and complex biopsies from healthy individuals and localized common and tissue-specific methQTLs in different regions of the genome that are linked to distinct biological properties. The main cause of heterogeneity within a group is cell-type-specific DNA methylation patterns.
To investigate these in more detail, deconvolution software can decompose the DNA methylation matrix into independent components of variation. Deconvolution methods based on DNA methylation require profiles of high technical quality, and the identified components must be interpreted biologically. In this work, we developed a computational pipeline for performing deconvolution experiments that covers data processing and interpretation of the results. We applied the protocol to lung adenocarcinomas and found indications of tumor infiltration by immune cells as well as associations with patient survival. Within-sample heterogeneity (WSH), i.e., heterogeneous methylation patterns at a genomic position within a sample, is mostly neglected in epigenomic studies. We present the first comparison of different genome-wide WSH scores on simulated and experimental data. In addition, we developed two new scores for computing WSH at individual CpGs that are more robust to technical factors. WSH scores describe different types of WSH, quantify differential heterogeneity, and predict tumor purity. Owing to the broad availability of DNA methylation data, the levels of heterogeneity can be described comprehensively. In this work, we present new software solutions for analyzing DNA methylation data with respect to the different levels of heterogeneity. We are convinced that the presented software tools will be indispensable for understanding DNA methylation in health and disease.