Coding over Sets for DNA Storage
In this paper, we study error-correcting codes for the storage of data in
synthetic deoxyribonucleic acid (DNA). We investigate a storage model where
data is represented by an unordered set of sequences, each of the same fixed length.
Errors within that model are losses of whole sequences and point errors inside
the sequences, such as substitutions, insertions and deletions. We propose code
constructions which can correct these errors with efficient encoders and
decoders. By deriving upper bounds on the cardinalities of these codes using
sphere-packing arguments, we show that many of our codes are close to optimal.
Comment: 5 pages
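As an illustration of the storage model described in this abstract, the following minimal Python sketch simulates one noisy read of an unordered set of sequences: some sequences are lost entirely, and some surviving sequences receive substitution errors. The function name `noisy_read`, its parameters, and the four-letter alphabet are assumptions made for this example; the sketch covers only losses and substitutions (not insertions or deletions) and is not the code construction proposed in the paper.

```python
import random
from typing import Set

BASES = "ACGT"  # assumed alphabet for this illustration


def noisy_read(stored: Set[str], losses: int, substitutions: int,
               rng: random.Random) -> Set[str]:
    """Return a noisy, unordered read of `stored` (hypothetical helper).

    `losses` whole sequences are dropped, then up to `substitutions`
    single-base substitutions are applied to the survivors.
    Sequences are assumed to be non-empty.
    """
    if losses < 0 or losses > len(stored):
        raise ValueError("cannot lose more sequences than are stored")
    # Sort first so that sampling is reproducible for a given seed.
    survivors = rng.sample(sorted(stored), len(stored) - losses)
    for _ in range(substitutions):
        if not survivors:
            break
        i = rng.randrange(len(survivors))
        seq = survivors[i]
        pos = rng.randrange(len(seq))
        # Replace the base at `pos` with a different base.
        new_base = rng.choice([b for b in BASES if b != seq[pos]])
        survivors[i] = seq[:pos] + new_base + seq[pos + 1:]
    # The result is a set: ordering information is not preserved.
    return set(survivors)
```

Because the output is a set, a decoder in this model cannot rely on the order in which sequences were written, which is what distinguishes it from a conventional sequence channel.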
Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors
DNA as a data storage medium has several advantages, including far greater
data density compared to electronic media. We propose that schemes for data
storage in the DNA of living organisms may benefit from studying the
reconstruction problem, which is applicable whenever multiple reads of noisy
data are available. This strategy is uniquely suited to the medium, which
inherently replicates stored data in multiple distinct ways, caused by
mutations. We consider noise introduced solely by uniform tandem-duplication,
and utilize the relation to constant-weight integer codes in the Manhattan
metric. By bounding the intersection of the cross-polytope with hyperplanes, we
prove the existence of reconstruction codes with greater capacity than known
error-correcting codes, which we can determine analytically for any set of
parameters.
Comment: 11 pages, 2 figures, LaTeX; version accepted for publication
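To make the noise model in this abstract concrete, here is a small Python sketch of a uniform tandem duplication, in which a window of fixed length k is copied and inserted immediately after itself. The function names `tandem_duplicate` and `all_single_duplications` are hypothetical and chosen for this example; the sketch shows only the error mechanism, not the reconstruction codes or the bounds derived in the paper.

```python
from typing import Set


def tandem_duplicate(seq: str, start: int, k: int) -> str:
    """Duplicate the length-k window seq[start:start+k] in place.

    The copy is inserted directly after the original window, so the
    result is k symbols longer than `seq`.
    """
    if k <= 0 or start < 0 or start + k > len(seq):
        raise ValueError("duplicated window must lie inside the sequence")
    window = seq[start:start + k]
    return seq[:start + k] + window + seq[start + k:]


def all_single_duplications(seq: str, k: int) -> Set[str]:
    """All distinct sequences reachable from `seq` by one duplication of length k."""
    return {tandem_duplicate(seq, i, k) for i in range(len(seq) - k + 1)}
```

For example, `tandem_duplicate("ACGT", 1, 2)` duplicates the window `CG` and yields `ACGCGT`. Different window positions can produce the same output (every length-1 duplication of `AAAA` gives `AAAAA`), which is why the second helper returns a set.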
The impact of different DNA extraction kits and laboratories upon the assessment of human gut microbiota composition by 16S rRNA gene sequencing
Peer reviewed
Publisher PDF
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly
Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested
that in a string graph or equivalently a unitig graph, any path spells a valid
assembly. As a string/unitig graph also encodes every valid assembly of reads,
such a graph, provided that it can be constructed correctly, is in fact a
lossless representation of reads. In principle, every analysis based on
whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion
(INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context
of resequencing, we developed a de novo assembler, fermi, that assembles
Illumina short reads into unitigs while preserving most of the information in the
input reads. SNPs and INDELs can be called by mapping the unitigs against a
reference genome. By applying the method to 35-fold human resequencing data, we
showed that in comparison to the standard pipeline, our approach yields similar
accuracy for SNP calling and better results for INDEL calling. It has higher
sensitivity than other de novo assembly based methods for variant calling. Our
work suggests that variant calling with de novo assembly can be a beneficial
complement to the standard variant calling pipeline for whole-genome
resequencing. On the methodological side, we proposed the FMD-index for
forward-backward extension of DNA sequences, a fast algorithm for finding all
super-maximal exact matches and one-pass construction of unitigs from an
FMD-index.
Availability: http://github.com/lh3/fermi
Contact: [email protected]
Comment: Rev2: submitted version with minor improvements; 7 pages