Multicellular organisms require specialized cell types in order to function.
While a widely accepted definition does not exist, cell types are regarded
as groups of cells with similar properties, such as RNA expression, protein
abundance and epigenetic modification.
Single-cell RNA sequencing (scRNAseq) is a recent breakthrough for exploring
cell types, providing expression estimates for all genes in thousands of
individual cells. Using data-driven algorithms, such as unsupervised
clustering, scRNAseq studies have discovered new cell types and created large
reference data sets, among other exploratory achievements. More recently,
scRNAseq has been applied to patient cohorts that include different groups,
for example diseased and healthy individuals, or disease subtypes. These
multi-sample multi-condition data sets enable statistical inference between
groups, such as differential expression testing. In contrast to projects
exploring unknown tissues or species,
patient cohorts often study known cell types defined by specific marker genes.
Here, I present Pooled Count Poisson Classification (PCPC), a novel cell
type classification approach designed for inference with multi-sample
multi-condition scRNAseq data sets. PCPC implements a statistical model that
allows researchers to distinguish cells according to marker-based cell type
definitions, enabling reproducible and comparable analysis between data sets
and technologies (e.g. scRNAseq and flow cytometry). Specifically, PCPC
pools marker gene counts across related cells to overcome technical noise,
and compares them to a user-defined threshold using the Poisson model.
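
The pooling and threshold comparison described above can be illustrated with a minimal Python sketch. It is not the PCPC implementation: the neighbor lists, the threshold expressed as a fraction of total UMIs, the significance level `alpha`, and the three-way labels (`+`, `-`, `?`) are all assumptions made for illustration.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed in log space for stability."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0
    total = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k + 1)
    )
    return min(1.0, total)


def classify_pooled(marker_counts, total_counts, neighbors, threshold, alpha=0.05):
    """Label each cell '+', '-' or '?' for a single marker gene.

    marker_counts[i]: UMI count of the marker gene in cell i
    total_counts[i]:  total UMI count of cell i
    neighbors[i]:     indices of the cells pooled with cell i (including i);
                      how related cells are chosen is an assumption here
    threshold:        user-defined expression fraction (hypothetical units:
                      marker UMIs per total UMI)
    """
    labels = []
    for nbrs in neighbors:
        k = sum(marker_counts[j] for j in nbrs)  # pooled marker count
        n = sum(total_counts[j] for j in nbrs)   # pooled total count
        lam = threshold * n  # expected pooled count exactly at the threshold
        p_above = 1.0 - poisson_cdf(k - 1, lam)  # P(X >= k)
        p_below = poisson_cdf(k, lam)            # P(X <= k)
        if p_above < alpha:
            labels.append("+")  # significantly above the threshold
        elif p_below < alpha:
            labels.append("-")  # significantly below the threshold
        else:
            labels.append("?")  # not distinguishable from the threshold
    return labels


# Toy data: three groups of three related cells, 2000 UMIs per cell.
marker_counts = [5, 6, 4, 0, 0, 0, 2, 2, 2]
total_counts = [2000] * 9
neighbors = [[0, 1, 2]] * 3 + [[3, 4, 5]] * 3 + [[6, 7, 8]] * 3
labels = classify_pooled(marker_counts, total_counts, neighbors, threshold=1e-3)
print(labels)
```

In this toy example a single cell with zero marker counts would be uninformative on its own, whereas pooling three such cells gives enough total counts to call the group negative at the chosen threshold.
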
In this work, I apply PCPC to three different data sets to demonstrate its
utility. The first application shows it is able to annotate all lineages in data
from human cord blood mononuclear cells (CBMCs), with a single marker
gene per cell type.
The second application shows PCPC is able to discriminate fine cell type
subsets, using data from a human tumor of mucosa-associated lymphoid tissue
(MALT). Many cell types in the MALT tumor microenvironment, and T cell
subsets in particular, are transcriptionally related, making their
classification difficult. Despite this complexity, PCPC can even use lowly
expressed marker genes, such as FOXP3, which marks CD3E+ CD4+ FOXP3+
regulatory T (Treg) cells. Furthermore, I find that Treg cells isolated from
the MALT tumor can be further subdivided into CCR7+ and ICOS+ subsets,
indicating a mixture of naive-like and activated Treg cells. Compared to
unsupervised clustering and the marker-based tool Garnett, classification with
PCPC offers more flexibility and fewer misclassifications, respectively. Thus, PCPC
removes obstacles in studying complex tissues with scRNAseq, such as the
microenvironment in human tumors.
Furthermore, I demonstrate a multi-sample multi-condition comparison using
data from a patient cohort of aggressive and indolent lymphoma subtypes.
PCPC is applied to classify CD3E+ CD8B+ cytotoxic T cells, followed by
differential expression testing between the aggressive and indolent subtypes.
This uncovers significantly lower LGALS1 expression in indolent tumors,
further implicating this gene in tumor aggressiveness and T cell inhibition.
Currently, PCPC requires data generated with unique molecular identifiers
(UMIs), as well as substantial manual work. Nevertheless, because it can
resolve complex tissues with few marker genes, PCPC may bring clarity to
transcriptomic cell type definitions and prove useful for multi-sample
multi-condition comparisons in scRNAseq data.