574 research outputs found
Recommended from our members
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia, a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia.
A Haystack Heuristic for Autoimmune Disease Biomarker Discovery Using Next-Gen Immune Repertoire Sequencing Data.
Large-scale DNA sequencing of immunological repertoires offers an opportunity for the discovery of novel biomarkers for autoimmune disease. Available bioinformatics techniques, however, are not adequately suited for elucidating possible biomarker candidates from within large immunosequencing datasets due to unsatisfactory scalability and sensitivity. Here, we present the Haystack Heuristic, an algorithm customized to computationally extract disease-associated motifs from next-generation-sequenced repertoires by contrasting disease and healthy subjects. This technique employs a local-search graph-theory approach to discover novel motifs in patient data. We apply the Haystack Heuristic to nine million B-cell receptor sequences obtained from nearly 100 individuals in order to elucidate a new motif that is significantly associated with multiple sclerosis. Our results demonstrate the effectiveness of the Haystack Heuristic in computing possible biomarker candidates from high-throughput sequencing data and could be generalized to other datasets.
Evolutionary Inference via the Poisson Indel Process
We address the problem of the joint statistical inference of phylogenetic
trees and multiple sequence alignments from unaligned molecular sequences. This
problem is generally formulated in terms of string-valued evolutionary
processes along the branches of a phylogenetic tree. The classical evolutionary
process, the TKF91 model, is a continuous-time Markov chain model composed of
insertion, deletion and substitution events. Unfortunately, this model gives
rise to an intractable computational problem: the computation of the marginal
likelihood under the TKF91 model is exponential in the number of taxa. In this
work, we present a new stochastic process, the Poisson Indel Process (PIP), in
which the complexity of this computation is reduced to linear. The new model is
closely related to the TKF91 model, differing only in its treatment of
insertions, but the new model has a global characterization as a Poisson
process on the phylogeny. Standard results for Poisson processes allow key
computations to be decoupled, which yields the favorable computational profile
of inference under the PIP model. We present illustrative experiments in which
Bayesian inference under the PIP model is compared to separate inference of
phylogenies and alignments.
Comment: 33 pages, 6 figures
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly
Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested
that in a string graph or equivalently a unitig graph, any path spells a valid
assembly. As a string/unitig graph also encodes every valid assembly of reads,
such a graph, provided that it can be constructed correctly, is in fact a
lossless representation of reads. In principle, every analysis based on
whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion
(INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context
of resequencing, we developed a de novo assembler, fermi, that assembles
Illumina short reads into unitigs while preserving most of the information in the
input reads. SNPs and INDELs can be called by mapping the unitigs against a
reference genome. By applying the method to 35-fold human resequencing data, we
showed that in comparison to the standard pipeline, our approach yields similar
accuracy for SNP calling and better results for INDEL calling. It has higher
sensitivity than other de novo assembly based methods for variant calling. Our
work suggests that variant calling with de novo assembly can be a beneficial
complement to the standard variant calling pipeline for whole-genome
resequencing. On the methodological side, we proposed the FMD-index for
forward-backward extension of DNA sequences, a fast algorithm for finding all
super-maximal exact matches and one-pass construction of unitigs from an
FMD-index.
Availability: http://github.com/lh3/fermi
Contact: [email protected]
Comment: Rev2: submitted version with minor improvements; 7 pages