Graphical models for zero-inflated single cell gene expression
Bulk gene expression experiments relied on aggregations of thousands of cells
to measure the average expression in an organism. Advances in microfluidic and
droplet sequencing now permit expression profiling in single cells. This study
of cell-to-cell variation reveals that individual cells lack detectable
expression of transcripts that appear abundant on a population level, giving
rise to zero-inflated expression patterns. To infer gene co-regulatory networks
from such data, we propose a multivariate Hurdle model. It comprises a
mixture of singular Gaussian distributions. We employ neighborhood selection
with the pseudo-likelihood and a group lasso penalty to select and fit
undirected graphical models that capture conditional independences between
genes. The proposed method is more sensitive than existing approaches in
simulations, even under departures from our Hurdle model. The method is applied
to data for T follicular helper cells, and a high-dimensional profile of mouse
dendritic cells. It infers network structure not revealed by other methods or
in bulk data sets. An R implementation is available at
https://github.com/amcdavid/HurdleNormal.
Data Exploration, Quality Control and Testing in Single-Cell qPCR-Based Gene Expression Experiments
Cell populations are never truly homogeneous; individual cells exist in
biochemical states that define functional differences between them. New
technology based on microfluidic arrays combined with multiplexed quantitative
polymerase chain reactions (qPCR) now enables high-throughput single-cell gene
expression measurement, allowing assessment of cellular heterogeneity. However,
very few analytic tools have been developed specifically for the statistical
and analytical challenges of single-cell qPCR data. We present a statistical
framework for the exploration, quality control, and analysis of single-cell
gene expression data from microfluidic arrays. We assess accuracy and
within-sample heterogeneity of single-cell expression and develop quality
control criteria to filter unreliable cell measurements. We propose a
statistical model accounting for the fact that genes at the single-cell level
can be on (in which case a continuous expression measure is recorded) or
dichotomously off (in which case the recorded expression is zero). Based on this model,
we derive a combined likelihood-ratio test for differential expression that
incorporates both the discrete and continuous components. Using an experiment
that examines treatment-specific changes in expression, we show that this
combined test is more powerful than either the continuous or dichotomous
component in isolation, or a t-test on the zero-inflated data. While developed
for measurements from a specific platform (Fluidigm), these tools are
generalizable to other multi-parametric measures over large numbers of events.
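The combined test described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the sketch assumes a two-group comparison, a Bernoulli model for whether a gene is detected, a normal model with common variance for the positive values, and a chi-square reference distribution with 2 degrees of freedom for the summed statistic. Dropping the continuous part when it cannot be estimated is likewise an assumption made here.

```python
import math


def _bernoulli_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k detections in n cells."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(k / n)
    if k < n:
        ll += (n - k) * math.log(1 - k / n)
    return ll


def _rss(values):
    """Residual sum of squares around the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def combined_lrt(group_a, group_b):
    """Return (statistic, p_value) for a two-part likelihood-ratio test.

    Zeros are treated as 'off'; positive values are the continuous part.
    """
    pos_a = [v for v in group_a if v > 0]
    pos_b = [v for v in group_b if v > 0]

    # Discrete part: does the detection rate differ between groups?
    ll_alt = (_bernoulli_loglik(len(pos_a), len(group_a))
              + _bernoulli_loglik(len(pos_b), len(group_b)))
    ll_null = _bernoulli_loglik(len(pos_a) + len(pos_b),
                                len(group_a) + len(group_b))
    lr_discrete = max(0.0, 2 * (ll_alt - ll_null))

    # Continuous part: does the mean of the positive values differ?
    # Assumption: skip this part if a group has no positives or no spread.
    lr_continuous = 0.0
    if pos_a and pos_b:
        rss_alt = _rss(pos_a) + _rss(pos_b)
        rss_null = _rss(pos_a + pos_b)
        if rss_alt > 0:
            n_pos = len(pos_a) + len(pos_b)
            lr_continuous = max(0.0, n_pos * math.log(rss_null / rss_alt))

    statistic = lr_discrete + lr_continuous
    # Chi-square survival function with 2 degrees of freedom is exp(-x / 2).
    p_value = min(1.0, math.exp(-statistic / 2))
    return statistic, p_value
```

Because the two likelihood-ratio statistics are added, a gene can be flagged when the detection rate shifts, when the expression level among detected cells shifts, or when both change moderately.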
Lateral gene transfer and ancient paralogy of operons containing redundant copies of tryptophan-pathway genes in Xylella species and in heterocystous cyanobacteria
BACKGROUND: Tryptophan-pathway genes that exist within an apparent operon-like organization were evaluated as examples of multi-genic genomic regions that contain phylogenetically incongruous genes and coexist with genes outside the operon that are congruous. A seven-gene cluster in Xylella fastidiosa includes genes encoding the two subunits of anthranilate synthase, an aryl-CoA synthetase, and trpR. A second gene block, present in the Anabaena/Nostoc lineage, but not in other cyanobacteria, contains a near-complete tryptophan operon nested within an apparent supraoperon containing other aromatic-pathway genes. RESULTS: The gene block in X. fastidiosa exhibits a sharply delineated low-GC content. This, as well as bias of codon usage and 3:1 dinucleotide analysis, strongly implicates lateral gene transfer (LGT). In contrast, parametric studies and protein tree phylogenies did not support the origination of the Anabaena/Nostoc gene block by LGT. CONCLUSIONS: Judging from the apparent minimal amelioration, the low-GC gene block in X. fastidiosa probably originated by LGT at a relatively recent time. The surprising inability to pinpoint a donor lineage still leaves room for alternative, albeit less likely, explanations other than LGT. On the other hand, the large Anabaena/Nostoc gene block does not seem to have arisen by LGT. We suggest that the contemporary Anabaena/Nostoc array of divergent paralogs represents an ancient ancestral state of paralog divergence, with extensive streamlining by gene loss occurring in the lineage of descent representing other (unicellular) cyanobacteria.
Sequential Dirichlet Process Mixtures of Multivariate Skew t-distributions for Model-based Clustering of Flow Cytometry Data
Flow cytometry is a high-throughput technology used to quantify multiple surface and intracellular markers at the level of a single cell. This makes it possible to identify cell subtypes and to determine their relative proportions. Improvements in this technology allow millions of individual cells from a blood sample to be described using multiple markers. This results in high-dimensional datasets, whose manual analysis is highly time-consuming and poorly reproducible. While several methods have been developed to perform automatic recognition of cell populations, most of them treat and analyze each sample independently. However, in practice, individual samples are rarely independent (e.g. longitudinal studies). Here, we propose to use a Bayesian nonparametric approach with a Dirichlet process mixture (DPM) of multivariate skew t-distributions to perform model-based clustering of flow cytometry data. DPM models directly estimate the number of cell populations from the data, avoiding model selection issues, and skew t-distributions provide robustness to outliers and to the non-elliptical shape of cell populations. To accommodate repeated measurements, we propose a sequential strategy relying on a parametric approximation of the posterior. We illustrate the good performance of our method on simulated data, on an experimental benchmark dataset, and on new longitudinal data from the DALIA-1 trial, which evaluates a therapeutic vaccine against HIV. On the benchmark dataset, the sequential strategy outperforms all other methods evaluated, and it similarly leads to improved performance on the DALIA-1 data. We have made the method available to the community in the R package NPflow.
Combining Mixture Components for Clustering
Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K; these clusterings can be compared on substantive grounds. We illustrate the method with simulated data and a flow cytometry dataset.
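The hierarchical combination step can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the authors' code: it assumes the fitted mixture is summarised by a matrix `tau` of posterior membership probabilities (one row per observation, one column per component), and that at each step the two clusters whose merge gives the largest drop in classification entropy are combined. The names `entropy` and `combine_components` are hypothetical.

```python
import math
from itertools import combinations


def _xlogx(x):
    """x * log(x), with the usual convention that 0 * log(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(tau):
    """Classification entropy of a soft clustering."""
    return -sum(_xlogx(t) for row in tau for t in row)


def combine_components(tau):
    """Greedily merge clusters, largest entropy decrease first.

    Returns a list of (groups, entropy) pairs, one per number of clusters
    from K down to 1. Each group is a tuple of original component indices.
    """
    groups = [[k] for k in range(len(tau[0]))]
    history = [([tuple(g) for g in groups], entropy(tau))]

    while len(groups) > 1:
        def entropy_drop(pair):
            a, b = pair
            return sum(_xlogx(row[a] + row[b]) - _xlogx(row[a]) - _xlogx(row[b])
                       for row in tau)

        a, b = max(combinations(range(len(groups)), 2), key=entropy_drop)

        # The merged cluster goes first; the remaining clusters keep their order.
        tau = [[row[a] + row[b]]
               + [t for j, t in enumerate(row) if j not in (a, b)]
               for row in tau]
        groups = ([groups[a] + groups[b]]
                  + [g for j, g in enumerate(groups) if j not in (a, b)])
        history.append(([tuple(g) for g in groups], entropy(tau)))

    return history
```

Each entry of the returned history corresponds to one soft clustering, so the solutions for every number of clusters up to K can be inspected side by side.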