8 research outputs found
Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs
Single-cell RNA-seq datasets are growing in size and complexity, enabling the
study of cellular composition changes in various biological/clinical contexts.
Scalable dimensionality reduction techniques are needed to disentangle
biological variation in them, while accounting for technical and biological
confounders. In this work, we extend a popular approach for probabilistic
non-linear dimensionality reduction, the Gaussian process latent variable
model, to scale to massive single-cell datasets while explicitly accounting for
technical and biological confounders. The key idea is to use an augmented
kernel which preserves the factorisability of the lower bound allowing for fast
stochastic variational inference. We demonstrate its ability to reconstruct
latent signatures of innate immunity recovered in Kumasaka et al. (2021) with
9x lower training time. We further analyze a COVID dataset and demonstrate,
across a cohort of 130 individuals, that this framework enables data
integration while capturing interpretable signatures of infection.
Specifically, we explore COVID severity as a latent dimension to refine patient
stratification and capture disease-specific gene expression.
Comment: Machine Learning and Computational Biology Symposium (Oral), 202
Bridging the gaps in statistical models of protein alignment
SUMMARY: Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains common practice to treat substitutions and indels separately, inferring an approximate model for each, when quantifying sequence relationships. Although this approach is computationally convenient (which remains its primary motivation), there have been few attempts to model the two systematically and together. To bridge this gap, this article demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterized substitution matrix and a time-parameterized alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described here. This has not only allowed us to generate a unified statistical model for each of the nine widely used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS and PFASUM), but also resulted in a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content each model requires to explain losslessly any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM achieves the clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, respectively, amongst the top five. We further analyze the statistical properties of the MMLSUM model and contrast it with the others. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
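As a rough illustration of the information-theoretic comparison described in this abstract, the sketch below computes the Shannon information content (in bits) needed to state a set of aligned residue pairs under a joint probability table; a lower total means the model explains the data more concisely. The function name `message_length_bits`, the two-letter alphabet and all probability values are hypothetical, and the sketch leaves out the alignment state machine and the time parameterization that the full model in the paper includes.

```python
import math

def message_length_bits(pairs, joint_prob):
    """Total Shannon information content, in bits, of aligned residue pairs.

    pairs: iterable of (a, b) residue pairs read off a set of alignments.
    joint_prob: dict mapping (a, b) -> probability under a model
                (toy values here, not taken from any real matrix).
    """
    return sum(-math.log2(joint_prob[pair]) for pair in pairs)

# Two toy models over a two-letter alphabet; each table sums to 1.
model_a = {("A", "A"): 0.4, ("G", "G"): 0.4, ("A", "G"): 0.1, ("G", "A"): 0.1}
model_b = {("A", "A"): 0.25, ("G", "G"): 0.25, ("A", "G"): 0.25, ("G", "A"): 0.25}

# Toy collection of aligned pairs (mostly conserved positions).
pairs = [("A", "A"), ("G", "G"), ("A", "A"), ("A", "G")]

bits_a = message_length_bits(pairs, model_a)
bits_b = message_length_bits(pairs, model_b)
print(f"model_a: {bits_a:.3f} bits, model_b: {bits_b:.3f} bits")
```

On this toy data the model that favours conserved pairs needs fewer bits than the uniform one, which is the sense in which one model "explains" a collection of alignments better than another.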
On the reliability and the limits of inference of amino acid sequence alignments
MOTIVATION: Alignments are correspondences between sequences. How reliable are alignments of amino acid sequences of proteins, and what inferences about protein relationships can be drawn? Using techniques not previously applied to these questions, we weight every possible sequence alignment by its posterior probability to derive a formal mathematical expectation, and we develop an efficient algorithm for computing the distance between alternative alignments, allowing quantitative comparisons of sequence-based alignments with corresponding reference structure alignments. RESULTS: By analyzing the sequences and structures of 1 million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the ‘daylight’, ‘twilight’ and ‘midnight’ zones for interpreting residue–residue correspondences from sequence information alone. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
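The posterior-weighted expectation mentioned in this abstract can be sketched on a toy scale as follows. This is only an illustration under stated assumptions: alignments are represented as sets of matched residue index pairs, the distance is taken to be the size of the symmetric difference (the abstract does not specify the metric), and the candidate alignments and their posterior probabilities are invented. The paper's algorithm covers every possible alignment efficiently, whereas this sketch simply enumerates a short explicit list.

```python
def alignment_distance(a, b):
    """Count residue-residue correspondences present in exactly one alignment.

    This symmetric-difference distance is an assumed stand-in for the
    paper's metric, which the abstract does not define.
    """
    return len(a ^ b)

def expected_distance(candidates, reference):
    """Posterior-weighted mean distance from candidate alignments to a reference.

    candidates: list of (alignment, posterior_probability) whose
                probabilities sum to 1 (toy values below).
    reference: alignment treated as the structure-based reference.
    """
    return sum(p * alignment_distance(aln, reference) for aln, p in candidates)

# Each alignment is a frozenset of (i, j) pairs: residue i of one sequence
# matched to residue j of the other.
reference = frozenset({(0, 0), (1, 1), (2, 2)})
candidates = [
    (frozenset({(0, 0), (1, 1), (2, 2)}), 0.6),  # identical to the reference
    (frozenset({(0, 0), (1, 2)}), 0.3),          # differs in 3 correspondences
    (frozenset({(0, 1), (1, 2)}), 0.1),          # differs in 5 correspondences
]

expected = expected_distance(candidates, reference)
print(f"expected distance: {expected:.2f}")
```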
Human SARS-CoV-2 challenge uncovers local and systemic response dynamics
The COVID-19 pandemic is an ongoing global health threat, yet our understanding of the dynamics of early cellular responses to this disease remains limited. Here in our SARS-CoV-2 human challenge study, we used single-cell multi-omics profiling of nasopharyngeal swabs and blood to temporally resolve abortive, transient and sustained infections in seronegative individuals challenged with pre-Alpha SARS-CoV-2. Our analyses revealed rapid changes in cell-type proportions and dozens of highly dynamic cellular response states in epithelial and immune cells associated with specific time points and infection status. We observed that the interferon response in blood preceded the nasopharyngeal response. Moreover, nasopharyngeal immune infiltration occurred early in samples from individuals with only transient infection and later in samples from individuals with sustained infection. High expression of HLA-DQA2 before inoculation was associated with preventing sustained infection. Ciliated cells showed multiple immune responses and were most permissive for viral replication, whereas nasopharyngeal T cells and macrophages were infected non-productively. We resolved 54 T cell states, including acutely activated T cells that clonally expanded while carrying convergent SARS-CoV-2 motifs. Our new computational pipeline Cell2TCR identifies activated antigen-responding T cells based on a gene expression signature and clusters these into clonotype groups and motifs. Overall, our detailed time series data can serve as a Rosetta stone for epithelial and immune cell responses and reveal early dynamic responses associated with protection against infection.
Stratification of amyotrophic lateral sclerosis patients: a crowdsourcing approach
Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease in which substantial heterogeneity in clinical presentation urgently requires better stratification of patients for the development of drug trials and clinical care. In this study we explored stratification through a crowdsourcing approach, the DREAM Prize4Life ALS Stratification Challenge. Using data from >10,000 patients from ALS clinical trials and 1479 patients from community-based patient registers, more than 30 teams developed new approaches for machine learning and clustering, outperforming the best current predictions of disease outcome. We propose a new method to integrate and analyze patient clusters across methods, showing a clear pattern of consistent and clinically relevant sub-groups of patients that also enabled the reliable classification of new patients. Our analyses reveal novel insights into ALS and describe for the first time the potential of crowdsourcing to uncover hidden patient sub-populations and to accelerate disease understanding and therapeutic development.