8 research outputs found
Construction of bio-constrained code for DNA data storage
With extremely high density and durable preservation, DNA data storage has become one of the most cutting-edge techniques for long-term data storage. Like traditional storage systems, which impose restrictions on the form of encoded data, DNA storage systems subject stored data to two biochemical constraints: a maximum homopolymer run limit and a balanced GC content limit. Previous studies satisfied these two constraints in successive steps, and as a result suffered from low efficiency and high complexity. In this paper, we propose a novel content-balanced run-length limited (C-RLL) code with an efficient code construction method, which generates short DNA sequences that satisfy both constraints simultaneously. In addition, we develop an encoding method that maps binary data into long DNA sequences for DNA data storage and ensures both local and global stability in terms of satisfying the biochemical constraints. The proposed encoding method has a high effective code rate of 1.917 bits per nucleotide and low coding complexity.
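The two biochemical constraints named in this abstract can be illustrated with a small checker. This is a minimal sketch, not the C-RLL construction itself: the function names and the thresholds (a maximum run of 3 and a GC fraction between 0.4 and 0.6) are assumptions chosen for illustration.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    longest = current = 0
    prev = None
    for base in seq:
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def gc_content(seq: str) -> float:
    """Fraction of nucleotides in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, max_run: int = 3,
                          gc_low: float = 0.4, gc_high: float = 0.6) -> bool:
    """Check both constraints at once (thresholds are illustrative assumptions)."""
    return (max_homopolymer_run(seq) <= max_run
            and gc_low <= gc_content(seq) <= gc_high)
```

For example, `satisfies_constraints("ACGTACGT")` holds, whereas `"AAAACGCG"` fails the run limit and `"ATATATAT"` fails the GC balance.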
On DNA Codes Over the Non-Chain Ring with
In this paper, we present a novel design strategy of DNA codes with length
over the non-chain ring
with elements and , where denotes the length of a code over
. We first study and analyze a distance conserving map defined over the ring
into the length- DNA sequences. Then, we derive some conditions on the
generator matrix of a linear code over , which leads to a DNA code with
reversible, reversible-complement, homopolymer -run-length, and
-GC-content constraints for integer ().
Finally, we propose a new construction of DNA codes using Reed-Muller type
generator matrices. This allows us to obtain DNA codes with reversible,
reversible-complement, homopolymer -run-length, and -GC-content
constraints.
Comment: This paper has been presented in IEEE Information Theory Workshop
(ITW) 2022, Mumbai, INDI
Thermodynamically Stable DNA Code Design using a Similarity Significance Model
DNA code design aims to generate a set of DNA sequences (codewords) with
minimum likelihood of undesired hybridizations among sequences and their
reverse-complement (RC) pairs (cross-hybridization). Inspired by the distinct
hybridization affinities (or stabilities) of perfect double helix constructed
by individual single-stranded DNA (ssDNA) and its RC pair, we propose a novel
similarity significance (SS) model to measure the similarity between DNA
sequences. Particularly, instead of directly measuring the similarity of two
sequences by a metric, the proposed SS model evaluates
how much more likely undesirable hybridizations are to occur than desirable
hybridizations in the presence of the two measured sequences and their RC
pairs. With this SS model, we construct thermodynamically stable DNA codes
subject to several combinatorial constraints using a sorting-based algorithm.
The proposed scheme results in DNA codes with larger code sizes and wider free
energy gaps (hence better cross-hybridization performance) compared to the
existing methods.
Comment: To appear in ISIT 202
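The reverse-complement (RC) pairing that this abstract relies on can be sketched as follows. This is only an illustration of the RC operation and of a crude Hamming-distance comparison; it is not the similarity significance (SS) model, and the helper names are assumptions.

```python
# Watson-Crick complement table: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def rc_distance(a: str, b: str) -> int:
    """Hamming distance between a and the reverse complement of b.

    A small value suggests a and b are nearly complementary and may
    hybridize with each other (a purely combinatorial proxy).
    """
    return hamming(a, reverse_complement(b))
```

For instance, `reverse_complement("AACG")` gives `"CGTT"`, so `rc_distance("CGTT", "AACG")` is 0.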
Iterative DNA Coding Scheme With GC Balance and Run-Length Constraints Using a Greedy Algorithm
In this paper, we propose a novel iterative encoding algorithm for DNA
storage to satisfy both the GC balance and run-length constraints using a
greedy algorithm. DNA strands with run-length more than three and the GC
balance ratio far from 50% are known to be prone to errors. The proposed
encoding algorithm stores data at high information density with high
flexibility of run-length at most and GC balance between for
arbitrary and . More importantly, we propose a novel mapping method
to reduce the average bit error compared to the randomly generated mapping
method, using a greedy algorithm. The proposed algorithm is implemented through
iterative encoding, consisting of three main steps: randomization, M-ary
mapping, and verification. It has an information density of 1.8616 bits/nt in
the case of , which approaches the theoretical upper bound of 1.98
bits/nt, while satisfying two constraints. Also, the average bit error caused
by the one nt error is 2.3455 bits, which is reduced by , compared to
the randomized mapping.
Comment: 19 page
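The three steps named in this abstract (randomization, mapping, verification) can be sketched as a retry loop. This is a toy illustration under stated assumptions, not the paper's algorithm: it uses a fixed 2-bits-per-nucleotide mapping instead of the M-ary mapping, a seeded XOR mask as the randomizer, and illustrative thresholds (run at most 3, GC fraction in [0.4, 0.6]). All names are hypothetical, and the chosen seed would have to be stored alongside the strand for decoding.

```python
import random

BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def max_run(seq):
    """Longest run of identical consecutive nucleotides."""
    longest = current = 0
    prev = None
    for base in seq:
        current = current + 1 if base == prev else 1
        longest = max(longest, current)
        prev = base
    return longest


def verify(seq, max_run_len=3, gc_low=0.4, gc_high=0.6):
    """Verification step: check run-length and GC balance (assumed thresholds)."""
    gc = sum(b in "GC" for b in seq) / len(seq)
    return max_run(seq) <= max_run_len and gc_low <= gc <= gc_high


def mask_bits(n, seed):
    """Pseudo-random bit mask, reproducible from the seed."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]


def encode(bits, max_attempts=256):
    """Try seeds in order until the mapped strand passes verification."""
    if not bits or len(bits) % 2:
        raise ValueError("need a non-empty, even number of bits")
    for seed in range(max_attempts):
        # Randomization: XOR the payload with a seed-dependent mask.
        scrambled = [b ^ m for b, m in zip(bits, mask_bits(len(bits), seed))]
        # Mapping: two bits per nucleotide.
        seq = "".join(BASES[2 * scrambled[i] + scrambled[i + 1]]
                      for i in range(0, len(scrambled), 2))
        if verify(seq):
            return seed, seq
    raise RuntimeError("no seed produced a valid strand")


def decode(seed, seq):
    """Invert the mapping, then remove the mask."""
    scrambled = []
    for base in seq:
        value = BASES.index(base)
        scrambled += [value >> 1, value & 1]
    return [b ^ m for b, m in zip(scrambled, mask_bits(len(scrambled), seed))]
```

An all-zero payload, which would map directly to a single long homopolymer, is a useful test case: `encode([0] * 40)` returns a seed and a strand that passes `verify`, and `decode` recovers the original bits.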
Achievable Rates of Concatenated Codes in DNA Storage under Substitution Errors
In this paper, we study achievable rates of concatenated coding schemes over
a deoxyribonucleic acid (DNA) storage channel. Our channel model incorporates
the main features of DNA-based data storage. First, information is stored on
many, short DNA strands. Second, the strands are stored in an unordered fashion
inside the storage medium and each strand is replicated many times. Third, the
data is accessed in an uncontrollable manner, i.e., random strands are drawn
from the medium and received, possibly with errors. As one of our results, we
show that there is a significant gap between the channel capacity and the
achievable rate of a standard concatenated code in which one strand corresponds
to an inner block. This is in fact surprising as for other channels, such as
-ary symmetric channels, concatenated codes are known to achieve the
capacity. We further propose a modified concatenated coding scheme that combines
several strands into one inner block, which narrows the gap and
achieves rates close to the capacity.
Comment: Extended version of a paper submitted to the International Symposium on
Information Theory and Its Applications (ISITA) 202
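The channel model described in this abstract (many short strands, stored unordered, read by drawing random strands that may contain substitution errors) can be simulated in a few lines. This is a minimal sketch under assumptions: uniform sampling with replacement stands in for replication plus random access, substitutions are independent per nucleotide with an assumed probability `sub_prob`, and the function name is hypothetical.

```python
import random


def read_channel(strands, num_reads, sub_prob, seed=0):
    """Draw num_reads strands uniformly at random and apply substitutions.

    The output carries no ordering information about the stored strands,
    mirroring the unordered nature of the storage medium.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        strand = rng.choice(strands)  # random, uncontrollable access
        noisy = "".join(
            # With probability sub_prob, replace the base by a different one.
            rng.choice([b for b in "ACGT" if b != base])
            if rng.random() < sub_prob else base
            for base in strand
        )
        reads.append(noisy)
    return reads
```

With `sub_prob=0.0` every read is an exact copy of some stored strand, although some strands may never be drawn at all, which is one reason an outer code across strands is needed.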
Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage
DNA strands serve as a storage medium for -ary data over the alphabet
. DNA data storage promises formidable information density,
long-term durability, and ease of replicability. However, information in this
intriguing storage technology might be corrupted. Experiments have revealed
that DNA sequences with long homopolymers and/or with low -content are
notably more subject to errors upon storage.
This paper investigates the utilization of the recently-introduced method for
designing lexicographically-ordered constrained (LOCO) codes in DNA data
storage. This paper introduces DNA LOCO (D-LOCO) codes, over the alphabet
with limited runs of identical symbols. These codes come with an
encoding-decoding rule we derive, which provides affordable encoding-decoding
algorithms. In terms of storage overhead, the proposed encoding-decoding
algorithms outperform those in the existing literature. Our algorithms are
readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows
us to achieve balancing over the entire DNA strand with minimal rate penalty.
Moreover, we propose four schemes to bridge consecutive codewords, three of
which guarantee single substitution error detection per codeword. We examine
the probability of undetected errors. We also show that D-LOCO codes are
capacity-achieving and that they offer remarkably high rates at moderate
lengths.
Comment: 14 pages (double column), 3 figures, submitted to the IEEE
Transactions on Molecular, Biological and Multi-scale Communications (TMBMC
Hidden Addressing Encoding for DNA Storage
DNA is a natural storage medium with the advantages of high storage density and long service life compared with traditional media. DNA storage can meet the current storage requirements for massive data. Owing to the limitations of DNA storage technology, data need to be converted into short DNA sequences for storage. However, indexing these short DNA sequences generates a large amount of physical redundancy. To reduce redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using an improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and a 10.1 MB file is then encoded with the hidden addressing. First, the Dottup dot plot generator and the Jaccard similarity coefficient are used to analyze the overall self-similarity of the encoding sequence index, and then the GC content of sequence fragments is used to verify the performance of this scheme. The final results show that the encoding scheme produces indexes with lower overall self-similarity and that the local thermodynamic properties of the sequences are better. The proposed hidden addressing encoding scheme can not only improve the utilization of bases but also ensure the correct rate of DNA storage during the sequencing and decoding processes.
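The Jaccard similarity coefficient mentioned in this abstract compares two sets by the ratio of their intersection to their union. The sketch below applies it to the k-mer sets of two sequences; the use of k-mers and the default `k=3` are assumptions made for illustration, since the abstract does not state how sequences are turned into sets.

```python
def kmers(seq: str, k: int) -> set:
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the k-mer sets of a and b."""
    set_a, set_b = kmers(a, k), kmers(b, k)
    union = set_a | set_b
    if not union:
        raise ValueError("sequences are shorter than k")
    return len(set_a & set_b) / len(union)
```

For example, with `k=2` the sequences `"ACGT"` and `"ACGA"` share the 2-mers `AC` and `CG` out of four distinct 2-mers in total, so `jaccard("ACGT", "ACGA", k=2)` is 0.5. Lower values indicate less similar index sequences.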
Optimized code design for constrained DNA data storage with asymmetric errors
With ultra-high density and preservation longevity, deoxyribonucleic acid (DNA)-based data storage is becoming an emerging storage technology. Limited by current biochemical techniques, data might be corrupted during the processes of DNA data storage. A hybrid coding architecture consisting of modified variable-length run-length limited (VL-RLL) codes and optimized protograph low-density parity-check (LDPC) codes is proposed in order to suppress error occurrence and correct asymmetric substitution errors. Based on analyses of different asymmetric DNA sequencer channel models, a series of protograph LDPC codes is optimized using a modified extrinsic information transfer (EXIT) algorithm. The simulation results show that the proposed protograph LDPC codes achieve better error performance than conventional good codes and the codes used in the existing DNA data storage system. In addition, the theoretical analysis shows that the proposed hybrid coding scheme stores ~1.98 bits per nucleotide (bits/nt), only a 1% gap from the upper bound (2 bits/nt).