3,531 research outputs found

    Generative Models of Biological Variations in Bulk and Single-cell RNA-seq

    The explosive growth of next-generation sequencing data enhances our ability to understand biological processes at an unprecedented resolution. Meanwhile, organizing and utilizing this tremendous amount of data has become a major challenge. High-throughput technology provides a snapshot of all underlying biological activities, but such extremely high-dimensional data are hard to interpret. Due to the curse of dimensionality, the measurements are sparse and far from sufficient to capture the actual manifold in the high-dimensional space. In addition, the measurements may contain structured noise, such as technical or nuisance biological variation, which can interfere with downstream interpretation. Generative modeling is a powerful tool for making sense of the data and generating compact representations that summarize the embedded biological information. This thesis introduces three generative models that help amplify biological signals buried in noisy bulk and single-cell RNA-seq data. In Chapter 2, we propose a semi-supervised deconvolution framework called PLIER, which can identify regulation in cell-type proportions and in specific pathways that control gene expression. PLIER has inspired the development of MultiPLIER and has been used to infer context-specific genotype effects in the brain. In Chapter 3, we construct a supervised transformation named DataRemix to normalize bulk gene expression profiles so as to maximize the biological findings with respect to a variety of downstream tasks. By reweighting the contribution of hidden factors, we are able to reveal hidden biological signals without any external dataset-specific knowledge. We apply DataRemix to the ROSMAP dataset and report the first replicable trans-eQTL effect in human brain. In Chapter 4, we focus on scRNA-seq and introduce NIFA, an unsupervised decomposition framework that combines the desired properties of PCA, ICA and NMF. It simultaneously models uni- and multi-modal factors, isolating discrete cell-type identity and continuous pathway-level variations into separate components. The work presented in Chapter 2 has been published as a journal article. The work in Chapters 3 and 4 is under submission and is available as preprints on bioRxiv.
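
The idea of reweighting hidden factors can be illustrated with a minimal sketch. It assumes the hidden factors are the singular vectors of the expression matrix and that reweighting means raising the leading singular values to a power and scaling the residual. The function name `reweight_factors` and the parameters `k`, `p` and `mu` are illustrative assumptions, not taken from the abstract, and the thesis may define the transformation differently.

```python
import numpy as np

def reweight_factors(X, k, p, mu):
    """Rescale the top-k singular factors of X and damp the residual.

    Hypothetical parameterisation (an assumption of this sketch):
      k  - number of leading factors to reweight
      p  - exponent applied to the top-k singular values (p < 1 flattens them)
      mu - weight applied to the residual left after removing the top-k factors
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Top-k part with the original singular values (used to form the residual).
    top_original = (U[:, :k] * s[:k]) @ Vt[:k]
    # Top-k part with reweighted singular values.
    top_reweighted = (U[:, :k] * s[:k] ** p) @ Vt[:k]
    residual = X - top_original
    return top_reweighted + mu * residual

# Example on random data standing in for a genes-by-samples matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
X_remixed = reweight_factors(X, k=3, p=0.5, mu=0.8)
```

With `p = 1` and `mu = 1` the function returns the input unchanged, which gives a convenient sanity check. In a supervised setting such parameters would be tuned against a downstream objective.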

    Sparse multi-view matrix factorisation: a multivariate approach to multiple tissue comparisons

    Gene expression levels in a population vary extensively across tissues. Such heterogeneity is caused by genetic variability and environmental factors, and is expected to be linked to disease development. The abundance of experimental data now enables the identification of features of gene expression profiles that are shared across tissues, and those that are tissue-specific. While most current research is concerned with characterising differential expression by comparing mean expression profiles across tissues, it is also believed that a significant difference in the variance of a gene's expression across tissues may be associated with molecular mechanisms that are important for tissue development and function. We propose a sparse multi-view matrix factorisation (sMVMF) algorithm to jointly analyse gene expression measurements in multiple tissues, where each tissue provides a different "view" of the underlying organism. The proposed methodology can be interpreted as an extension of principal component analysis in that it provides the means to decompose the total sample variance in each tissue into the sum of two components: one capturing the variance that is shared across tissues, and one isolating the tissue-specific variances. sMVMF has been used to jointly model mRNA expression profiles in three tissues (adipose, skin and LCL), which are available for a large and well-phenotyped twins cohort, TwinsUK. Using sMVMF, we are able to prioritise genes based on whether their variation patterns are specific to each tissue. Furthermore, using the available DNA methylation profiles, we provide supporting evidence that adipose-specific gene expression patterns may be driven by epigenetic effects. Comment: in Bioinformatics 201
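
The split of each tissue's total variance into a shared and a tissue-specific part can be illustrated with a plain orthogonal projection. This is not the sMVMF algorithm: there is no sparsity, and the shared subspace is simply assumed to be spanned by the leading left singular vectors of the concatenated, centred views. The function names and the parameter `k` are hypothetical.

```python
import numpy as np

def split_shared_specific(views, k):
    """Split each view into a shared and a view-specific part.

    views: list of (n_samples, n_genes_m) arrays measured on the same samples.
    k:     assumed dimension of the shared sample-space subspace.
    Returns the centred views and a list of (shared, specific) pairs.
    """
    centred = [X - X.mean(axis=0) for X in views]
    # Shared basis: top-k left singular vectors of all views side by side.
    U, _, _ = np.linalg.svd(np.hstack(centred), full_matrices=False)
    Q = U[:, :k]
    parts = []
    for X in centred:
        shared = Q @ (Q.T @ X)   # projection onto the shared subspace
        specific = X - shared    # orthogonal remainder
        parts.append((shared, specific))
    return centred, parts

def total_variance(X):
    """Sum of squared entries divided by (n_samples - 1)."""
    return float(np.sum(X ** 2) / (X.shape[0] - 1))

# Two random "tissues" with the same 20 samples.
rng = np.random.default_rng(0)
views = [rng.normal(size=(20, 5)), rng.normal(size=(20, 8))]
centred, parts = split_shared_specific(views, k=2)
```

Because the two parts are orthogonal by construction, `total_variance(X)` equals the sum of the shared and specific variances for every view, which mirrors the additive decomposition described in the abstract.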

    A Multi-Omics Analysis of Transcription Control by BRD4

    RNA polymerase II (Pol II) regulation during early elongation has emerged as a regulatory hub in the gene expression of multicellular organisms. Prior research links the BRD4 protein to this control point, regulating the release of paused Pol II into productive elongation. However, the exact roles and mechanisms by which BRD4 influences this and potentially other post-initiation regulatory processes remain unknown. This study combines rapid BRD4 protein degradation and multi-omics approaches, including nascent elongating transcript sequencing (NET-seq), to uncover BRD4’s direct protein functions. Applying NET-seq in comparative studies required experimental adaptations. First, analyses with spiked-in mouse cells proved essential for reliable normalization. Second, the study identified a disproportional enrichment of a chromatin-associated RNA class as NET-seq’s major limitation. Incorporating an additional enrichment step solved this problem and significantly increased Pol II coverage. The resulting high-sensitivity NET-seq method confirmed BRD4’s proposed role in early elongation by revealing a global defect in Pol II pause release upon BRD4 degradation. Observations from proteomics and chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments suggest that the failed recruitment of Pol II-associated factors (PAF) causes an assembly defect of a competent elongation complex. Interestingly, the elongation defect also affected transcribed enhancers. Pol II occupancy increased in a region proximal to the enhancer center, strikingly similar to the impaired Pol II pause release at genes. An integrated multi-omics analysis that included genome-wide 3D genome information revealed reduced interactions between these enhancers and other regulatory regions. Another unexpected result was the widespread Pol II readthrough transcription quantified by the developed readthrough index, revealing an apparent transcriptional termination defect. The implementation of long-read nascent RNA-sequencing (nascONT-seq) combined with a 3’-RNA cleavage efficiency test detected impaired 3’-RNA processing. Notably, those 3’-RNA cleavage defects correlated with the observed termination defects. A potential explanation is the BRD4-dependent recruitment of general 3’-RNA processing factors to the 5’-control region. These observations start to establish regulatory links between 5’ and 3’ control that require further validation. Overall, the results indicate a general BRD4-dependent 5’ elongation control point required for 3’-RNA processing and termination.
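
The abstract does not define the readthrough index. As a rough sketch, one common way to quantify readthrough is the ratio of mean read coverage downstream of the annotated gene end to mean coverage in the gene body just upstream of it. The function name, the window sizes and that definition are assumptions made for illustration only.

```python
import numpy as np

def readthrough_index(coverage, gene_end, body_window=1000, downstream_window=1000):
    """Ratio of downstream to gene-body mean coverage (hypothetical definition).

    coverage:  1-D array of per-base read counts along the transcribed strand.
    gene_end:  index of the annotated gene end within `coverage`.
    Returns NaN when either window is empty or the gene body has no signal.
    """
    coverage = np.asarray(coverage, dtype=float)
    body = coverage[max(0, gene_end - body_window):gene_end]
    downstream = coverage[gene_end:gene_end + downstream_window]
    if body.size == 0 or downstream.size == 0 or body.mean() == 0:
        return float("nan")
    return float(downstream.mean() / body.mean())

# Toy profile: coverage 10 in the gene body, 2 past the gene end.
profile = np.concatenate([np.full(2000, 10.0), np.full(1000, 2.0)])
index = readthrough_index(profile, gene_end=2000)
```

Under this definition a higher value after BRD4 degradation than in the control would indicate more transcription continuing past the gene end.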

    Mucin and Splice Variant Profiles of Pancreatic Adenocarcinoma Predict Patient Survival and Subtyping

    Pancreatic ductal adenocarcinoma (PDAC) is an epithelial malignancy of the pancreas that demonstrates aggressive progression and bleak patient prognosis. Despite decades of research, the development of novel diagnostics and intervention modalities for PDAC has stagnated. This dissertation explores the characteristic aberrant and elevated expression of mucins in PDAC. Beginning with the hypothesis that mucins are associated with disease aggressiveness, analysis of PDAC patient survival in TCGA revealed no associations between single mucin expression and patient survival. This pointed to the underlying issue of PDAC tumor cellularity, since the proportion of cancer cells within the tumor varies in this disease. Tumor purity assessed with the ABSOLUTE computational algorithm is reported for all patient samples in the TCGA PDAC dataset. Using these purity scores, a mathematical correction of epithelial-specific mucin expression was devised. Again, no significant association between PDAC patient survival and mucin expression was found. Therefore, I investigated combinatorial expression of mucins by Spearman’s nonparametric PCA, which resulted in four groups of mutual expression: Group One = MUC7/12/17, Group Two = MUC1/3/13/19/20, Group Three = MUC6/15/22, and Group Four = MUC2/4/5AC/5B/16/21. These four groups were associated significantly with survival outcomes. To determine the biological implications of these four groups, PCA scores for all patients were correlated to whole transcriptomes. Significantly correlated genes were assessed for biological pathway upregulation. The four pathway composites revealed potential pathological signatures unrelated to previous PDAC classifications, representing novel PDAC subtypes. The role of mucin splice variants (SVs) was assessed and correlated to PDAC patient survival. Bioinformatic studies revealed 12 total mucin SVs significantly associated with survival. Better survival was correlated with expression of four MUC1, one MUC13, and one MUC20 SVs. High expression of two MUC4, one MUC15, one MUC16, one MUC21, and one MUC22 SVs was correlated with worse survival. The correlations of the MUC4-sv-215 and MUC13-sv-201 SVs with survival were validated by PCR in PDAC patient samples. These MUC4Δ6 prognostic findings contributed to in vitro studies and the development of a novel nanoparticle assay that detects MUC4-sv-215 in patient biofluids. The cumulative impact of the results described here may advance the clinical utility of mucins and associated SVs for improved diagnosis of PDAC.
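
The abstract does not give the purity correction itself. One simple possibility, shown here purely as an assumption, is to treat mucin expression as coming only from the cancer-cell fraction and divide the bulk value by the tumor purity score. The function name and the `min_purity` floor are hypothetical.

```python
import numpy as np

def purity_corrected_expression(expression, purity, min_purity=0.05):
    """Scale bulk expression by tumor purity (illustrative assumption only).

    expression: bulk expression values, one per sample.
    purity:     tumor purity per sample, as a fraction in (0, 1].
    min_purity: floor that keeps very impure samples from being inflated
                without bound (a choice made for this sketch).
    """
    expression = np.asarray(expression, dtype=float)
    purity = np.asarray(purity, dtype=float)
    if np.any((purity <= 0) | (purity > 1)):
        raise ValueError("purity must lie in (0, 1]")
    return expression / np.clip(purity, min_purity, 1.0)

# Two samples with equal bulk signal but different purity.
corrected = purity_corrected_expression([10.0, 10.0], [0.5, 1.0])
```

In this toy case the half-pure sample is scaled up twofold while the fully pure sample is unchanged.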