17 research outputs found

    Cell type identification, differential expression analysis and trajectory inference in single-cell transcriptomics

    Single-cell RNA-sequencing (scRNA-seq) is a cutting-edge technology that quantifies the transcriptome, the set of expressed RNA transcripts, of a group of cells at single-cell resolution. It represents a significant upgrade from bulk RNA-seq, which measures the combined signal of thousands of cells. Measuring gene expression by bulk RNA-seq is an invaluable tool for biomedical researchers who want to understand how cells alter their gene expression due to an illness, differentiation, external stimulus, or other events. Similarly, scRNA-seq has become an essential method for biomedical researchers, and it has brought several new applications previously unavailable with bulk RNA-seq. scRNA-seq supports the same applications as bulk RNA-seq. In addition, its single-cell resolution enables cell annotation based on gene markers of clusters, that is, cell populations that machine learning has identified as being, on average, dissimilar at the transcriptomic level. Researchers can use the cell clusters to detect cell-type-specific gene expression changes between conditions such as case and control groups. Clustering can sometimes even discover entirely new cell types. Besides the cluster-level representation, the single-cell resolution also makes it possible to model cells as a trajectory, which represents how the cells are related at the cell level and which dynamic differentiation process they undergo in a tissue. This thesis introduces new computational methods for cell type identification and trajectory inference from scRNA-seq data. A new cell type identification method (ILoReg) was proposed, which enables high-resolution clustering of cells into populations with subtle transcriptomic differences. In addition, two new trajectory inference methods were developed: scShaper, which is an accurate and robust method for inferring linear trajectories; and Totem, which is a user-friendly and flexible method for inferring tree-shaped trajectories. In addition, one of the works benchmarked methods for detecting cell-type-specific differential states from scRNA-seq data with multiple subjects per comparison group, a setting that requires tailored methods to control false discoveries. KEYWORDS: Single-cell RNA sequencing, transcriptome, cell type identification, trajectory inference, differential expression
    Finnish abstract, translated: Single-cell RNA sequencing is a cutting-edge technology that enables the computational determination of the transcriptome, that is, the expressed RNA transcripts, for a group of cells at single-cell resolution, and its development was a significant step forward from conventional bulk RNA sequencing, which measures the combined signal of thousands of cells. Bulk RNA sequencing is an important tool for biomedical researchers who want to understand how cells change their gene expression as a result of disease, differentiation, an external stimulus, or another event. Single-cell RNA sequencing has likewise become an important tool for researchers, and it has brought several new applications. It has the same applications as bulk RNA sequencing, but it additionally enables cells to be identified on the basis of gene markers. The gene markers are sought with statistical methods for cell populations that machine learning methods have identified as forming groups, or clusters, that differ from one another at the transcriptome level. Researchers can use the cell clusters to study gene expression differences within cell types, for example between diseased and healthy individuals, and sometimes clustering can even identify new cell types. Single-cell measurements also make it possible to model the cells as a trajectory that represents how the cells develop dynamically from one another during processes that require gene expression. This thesis presents new computational methods for identifying cell types and trajectories from single-cell RNA sequencing data. The thesis presents a new cell type identification method (ILoReg) that enables the identification of cell types with subtle gene expression differences. In addition, two new trajectory inference methods were developed in the thesis: scShaper, an accurate and robust method for identifying linear trajectories, and Totem, a user-friendly and flexible method for identifying tree-shaped trajectories. Finally, the thesis compared methods for identifying within-cell-type gene expression differences between groups that contain multiple subjects or other biological replicates, which requires dedicated methods to reduce false positive findings. KEYWORDS: single-cell RNA sequencing, clustering, trajectory inference, gene expression

    Comparing deep belief networks with support vector machines for classifying gene expression data from complex disorders

    Genomics data provide great opportunities for translational research and clinical practice, for example, for predicting disease stages. However, the classification of such data is a challenging task due to their high dimensionality, noise, and heterogeneity. In recent years, deep learning classifiers have generated much interest, but due to their complexity, little is known so far about the utility of this method for genomics. In this paper, we address this problem by studying a computational diagnostics task: the classification of breast cancer and inflammatory bowel disease patients based on high-dimensional gene expression data. We provide a comprehensive analysis of the classification performance of deep belief networks (DBNs) as a function of their multiple model parameters and in comparison with support vector machines (SVMs). Furthermore, we investigate combined classifiers that integrate DBNs with SVMs. Such a classifier uses a DBN as a representation learner whose output forms the input for an SVM. Overall, our results provide guidelines for the complex usage of DBNs for classifying gene expression data from complex diseases.

    Benchmarking methods for detecting differential states between conditions from multi-subject single-cell RNA-seq data

    Single-cell RNA-sequencing (scRNA-seq) enables researchers to quantify transcriptomes of thousands of cells simultaneously and study transcriptomic changes between cells. scRNA-seq datasets increasingly include multisubject, multicondition experiments to investigate cell-type-specific differential states (DS) between conditions. This can be performed by first identifying the cell types in all the subjects and then by performing a DS analysis between the conditions within each cell type. Naive single-cell DS analysis methods that treat cells as statistically independent are subject to false positives in the presence of variation between biological replicates, an issue known as the pseudoreplicate bias. While several methods have already been introduced to carry out the statistical testing in multisubject scRNA-seq analysis, comparisons that include all these methods are currently lacking. Here, we performed a comprehensive comparison of 18 methods for the identification of DS changes between conditions from multisubject scRNA-seq data. Our results suggest that the pseudobulk methods generally performed best. Both pseudobulks and mixed models that model the subjects as a random effect were superior to the naive single-cell methods that do not model the subjects in any way. While the naive models achieved higher sensitivity than the pseudobulk methods and the mixed models, they were subject to a high number of false positives. In addition, accounting for subjects through latent variable modeling did not improve the performance of the naive methods.
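
    The pseudobulk idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "pseudobulk" means summing raw counts over all cells of one subject within a single cell type, which is one common variant and not necessarily the aggregation used by each of the benchmarked methods. The function name `pseudobulk` and the toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def pseudobulk(counts, subject_ids):
    """Sum per-cell counts into one profile per subject.

    counts: (n_cells, n_genes) array of raw counts for ONE cell type.
    subject_ids: length-n_cells sequence of subject labels.
    Returns (subjects, aggregated), where aggregated has shape
    (n_subjects, n_genes) and row i belongs to subjects[i].
    """
    counts = np.asarray(counts)
    subject_ids = np.asarray(subject_ids)
    # Map every cell to the index of its subject (subjects come back sorted).
    subjects, inverse = np.unique(subject_ids, return_inverse=True)
    inverse = inverse.ravel()
    aggregated = np.zeros((len(subjects), counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add: each cell's counts are accumulated into its subject's row.
    np.add.at(aggregated, inverse, counts)
    return subjects, aggregated

# Toy example (hypothetical data): four cells, two genes, two subjects.
toy_counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])
toy_subjects = ["s1", "s1", "s2", "s2"]
subjects, aggregated = pseudobulk(toy_counts, toy_subjects)
print(subjects, aggregated)
```

    After aggregation, a between-condition test operates on one profile per subject, so the sample size equals the number of subjects and not the number of cells. This is why the approach avoids treating cells from the same subject as independent replicates.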

    scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

    Motivation: Computational models are needed to infer a representation of the cells, i.e. a trajectory, from single-cell RNA-sequencing data that models cell differentiation during a dynamic process. Although many trajectory inference methods exist, their performance varies greatly depending on the dataset, and hence there is a need to establish more accurate, more generalizable methods. Results: We introduce scShaper, a new trajectory inference method that enables accurate linear trajectory inference. The ensemble approach of scShaper generates a continuous smooth pseudotime based on a set of discrete pseudotimes. We demonstrate that scShaper is able to infer accurate trajectories for a variety of trigonometric trajectories, including many for which the commonly used principal curves method fails. A comprehensive benchmarking with state-of-the-art methods revealed that scShaper achieved superior accuracy of the cell ordering and, in particular, of the differentially expressed genes. Moreover, scShaper is a fast method with few hyperparameters, making it a promising alternative to the principal curves method for linear pseudotemporal ordering. Availability and implementation: scShaper is available as an R package at https://github.com/elolab/scshaper. The test data are available at https://doi.org/10.5281/zenodo.5734488.
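
    As a rough illustration of how a set of discrete pseudotimes can be combined into one continuous pseudotime, the sketch below rescales each discrete ordering, aligns its direction, and averages. This is not the scShaper algorithm: the abstract does not specify how the discrete pseudotimes are produced or combined, the smoothing step is omitted entirely, and the function name `ensemble_pseudotime` and the toy orderings are assumptions made for illustration only.

```python
import numpy as np

def ensemble_pseudotime(discrete_pseudotimes):
    """Average several discrete pseudotimes into one value per cell.

    discrete_pseudotimes: (n_orderings, n_cells) array; each row assigns
    every cell an integer position (e.g. a cluster rank) along the
    trajectory. Each row must contain at least two distinct values.
    """
    P = np.asarray(discrete_pseudotimes, dtype=float)
    # Rescale each ordering to [0, 1] so different resolutions are comparable.
    mins = P.min(axis=1, keepdims=True)
    spans = P.max(axis=1, keepdims=True) - mins
    P = (P - mins) / spans
    # A linear trajectory has no intrinsic direction, so flip every ordering
    # that runs opposite to the first one (negative correlation).
    ref = P[0].copy()
    for i in range(1, len(P)):
        if np.corrcoef(ref, P[i])[0, 1] < 0:
            P[i] = 1.0 - P[i]
    # The mean over orderings takes more distinct values than any single one.
    return P.mean(axis=0)

# Toy example (hypothetical): three orderings of six cells, one of them reversed.
orderings = [
    [0, 0, 1, 1, 2, 2],
    [2, 2, 1, 1, 0, 0],
    [0, 1, 2, 3, 4, 5],
]
pseudotime = ensemble_pseudotime(orderings)
print(pseudotime)
```

    Averaging is only one possible way to combine the orderings; it is used here because it shows in a few lines how several coarse orderings can yield a finer-grained one.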

    ILoReg: a tool for high-resolution cell population identification from single-cell RNA-seq data

    Motivation: Single-cell RNA-seq allows researchers to identify cell populations based on unsupervised clustering of the transcriptome. However, subpopulations can have only subtle transcriptomic differences and the high dimensionality of the data makes their identification challenging. Results: We introduce ILoReg, an R package implementing a new cell population identification method that improves identification of cell populations with subtle differences through a probabilistic feature extraction step that is applied before clustering and visualization. The feature extraction is performed using a novel machine learning algorithm, called iterative clustering projection (ICP), that uses logistic regression and clustering similarity comparison to iteratively cluster data. Remarkably, ICP also manages to integrate feature selection with the clustering through L1-regularization, enabling the identification of genes that are differentially expressed between cell populations. By combining solutions of multiple ICP runs into a single consensus solution, ILoReg creates a representation that enables investigating cell populations with a high resolution. In particular, we show that the visualization of ILoReg allows segregation of immune and pancreatic cell populations in a more pronounced manner compared with current state-of-the-art methods. Availability: ILoReg is available as an R package at https://bioconductor.org/packages/ILoReg. Supplementary information: Supplementary data are available at Supplementary Information and Supplementary Files 1 and 2.

    Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data

    Background: Detection of copy number variations (CNVs) from high-throughput next-generation whole-genome sequencing (WGS) data has become a widely used research method in recent years. However, little is known about the applicability of the developed algorithms to ultra-low-coverage (0.0005–0.8×) data that is used in various research and clinical applications, such as digital karyotyping and single-cell CNV detection. Results: Here, the performance of six popular read-depth based CNV detection algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) was studied using ultra-low-coverage WGS data. Real-world array- and karyotyping kit-based validation were used as a benchmark in the evaluation. Additionally, ultra-low-coverage WGS data was simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and the theoretical minimum coverage at which these tools can accurately function. Our results suggest that while all the methods were able to detect large CNVs, many methods were susceptible to producing false positives when smaller CNVs (< 2 Mbp) were detected. There was also significant variability in their ability to identify CNVs in the sex chromosomes. Overall, BIC-seq2 was found to be the best method in terms of statistical performance. However, its significant drawback was that it had by far the slowest runtime among the methods (> 3 h), compared with FREEC (~ 3 min), which we considered the second-best method. Conclusions: Our comparative analysis demonstrates that CNV detection from ultra-low-coverage WGS data can be a highly accurate method for the detection of large copy number variations when their length is in millions of base pairs. These findings facilitate applications that utilize ultra-low-coverage CNV detection.

    How does management affect soil C sequestration and greenhouse gas fluxes in boreal and temperate forests? : A review

    Acknowledgements: This review has been supported by the grant Holistic management practices, modelling and monitoring for European forest soils – HoliSoils (EU Horizon 2020 Grant Agreement No 101000289) and the Academy of Finland Fellow project (330136, B. Adamczyk). In addition to the HoliSoils consortium partners, Dr. Abramoff contributed to this study and her work was supported by the United States Department of Energy, Office of Science, Office of Biological and Environmental Research. Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for the United States Department of Energy under contract DE-AC05-00OR22725. Peer reviewed.

    Peripheral blood DNA methylation differences in twin pairs discordant for Alzheimer's disease

    Background: Alzheimer's disease results from a neurodegenerative process that starts well before the diagnosis can be made. New prognostic or diagnostic markers enabling early intervention into the disease process would be highly valuable. Environmental and lifestyle factors largely modulate the disease risk and may influence the pathogenesis through epigenetic mechanisms, such as DNA methylation. As environmental and lifestyle factors may affect multiple tissues of the body, we hypothesized that the disease-associated DNA methylation signatures are detectable in the peripheral blood of discordant twin pairs. Results: Comparison of 23 disease discordant Finnish twin pairs with reduced representation bisulfite sequencing revealed peripheral blood DNA methylation differences in 11 genomic regions with at least 15.0% median methylation difference and FDR adjusted p value. Conclusions: DNA methylation differences can be detected in the peripheral blood of twin pairs discordant for Alzheimer's disease. These DNA methylation signatures may have value as disease markers and provide insights into the molecular mechanisms of pathogenesis. We found no evidence that the DNA methylation marks would be associated with gene expression in blood. Further studies are needed to elucidate the potential importance of the associated genes in neuronal functions and to validate the prognostic or diagnostic value of the individual marks or marker panels.

    Peripheral blood DNA methylation differences in twin pairs discordant for Alzheimer's disease

    Background: Alzheimer's disease results from a neurodegenerative process that starts well before the diagnosis can be made. New prognostic or diagnostic markers enabling early intervention into the disease process would be highly valuable. Environmental and lifestyle factors largely modulate the disease risk and may influence the pathogenesis through epigenetic mechanisms, such as DNA methylation. As environmental and lifestyle factors may affect multiple tissues of the body, we hypothesized that the disease-associated DNA methylation signatures are detectable in the peripheral blood of discordant twin pairs. Results: Comparison of 23 disease discordant Finnish twin pairs with reduced representation bisulfite sequencing revealed peripheral blood DNA methylation differences in 11 genomic regions with at least 15.0% median methylation difference and FDR adjusted p value. Peer reviewed.