8 research outputs found

    Construction of bio-constrained code for DNA data storage

    With extremely high density and durable preservation, DNA data storage has become one of the most cutting-edge techniques for long-term data storage. Similar to traditional storage systems, which impose restrictions on the form of encoded data, data stored in DNA storage systems are subject to two biochemical constraints: a maximum homopolymer run limit and a balanced GC content limit. Previous studies used successive processes to satisfy these two constraints, which results in low efficiency and high complexity. In this paper, we propose a novel content-balanced run-length limited (C-RLL) code with an efficient code construction method, which generates short DNA sequences that satisfy both constraints at once. In addition, we develop an encoding method to map binary data into long DNA sequences for DNA data storage, which ensures both local and global stability in terms of satisfying the biochemical constraints. The proposed encoding method has a high effective code rate of 1.917 bits per nucleotide and low coding complexity.
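The two biochemical constraints named in this abstract can be illustrated with a small checker. This is a minimal sketch, not the paper's C-RLL construction: the function names and the default thresholds (`max_run=3`, GC content between 0.4 and 0.6) are assumptions chosen for illustration only.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    longest = 0
    current = 0
    previous = None
    for nt in seq:
        current = current + 1 if nt == previous else 1
        longest = max(longest, current)
        previous = nt
    return longest


def gc_content(seq: str) -> float:
    """Fraction of nucleotides that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(nt in "GC" for nt in seq) / len(seq)


def satisfies_constraints(seq: str, max_run: int = 3,
                          gc_lo: float = 0.4, gc_hi: float = 0.6) -> bool:
    """Check both constraints at once; the thresholds are illustrative assumptions."""
    return (max_homopolymer_run(seq) <= max_run
            and gc_lo <= gc_content(seq) <= gc_hi)
```

A sequence such as `"ACGTACGT"` passes both checks under these assumed thresholds, while `"AAAACGCG"` fails because of its run of four `A`s.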

    On DNA Codes Over the Non-Chain Ring $\mathbb{Z}_4+u\mathbb{Z}_4+u^2\mathbb{Z}_4$ with $u^3=1$

    In this paper, we present a novel design strategy of DNA codes with length $3n$ over the non-chain ring $R=\mathbb{Z}_4+u\mathbb{Z}_4+u^2\mathbb{Z}_4$ with $64$ elements and $u^3=1$, where $n$ denotes the length of a code over $R$. We first study and analyze a distance conserving map defined over the ring $R$ into the length-$3$ DNA sequences. Then, we derive some conditions on the generator matrix of a linear code over $R$, which leads to a DNA code with reversible, reversible-complement, homopolymer $2$-run-length, and $\frac{w}{3n}$-GC-content constraints for integer $w$ ($0\leq w\leq 3n$). Finally, we propose a new construction of DNA codes using Reed-Muller type generator matrices. This allows us to obtain DNA codes with reversible, reversible-complement, homopolymer $2$-run-length, and $\frac{2}{3}$-GC-content constraints. Comment: This paper has been presented in IEEE Information Theory Workshop (ITW) 2022, Mumbai, INDI
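The reversible and reversible-complement constraints can be made concrete with a short closure check. This sketch assumes the usual definitions (a code is reversible if the reverse of every codeword is also a codeword, and reversible-complement if the reverse complement of every codeword is also a codeword); it does not reproduce the ring-based construction, and all function names are hypothetical.

```python
# Watson-Crick complement pairs.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse(seq: str) -> str:
    """Read the sequence backwards."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Complement every nucleotide, then read the result backwards."""
    return "".join(COMPLEMENT[nt] for nt in reversed(seq))


def is_reversible(code: set) -> bool:
    """True if the reverse of every codeword is also in the code."""
    return all(reverse(word) in code for word in code)


def is_reversible_complement(code: set) -> bool:
    """True if the reverse complement of every codeword is also in the code."""
    return all(reverse_complement(word) in code for word in code)
```

For example, the toy code `{"AACC", "GGTT"}` is closed under reverse complement but not under reversal, since `"CCAA"` is missing.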

    Thermodynamically Stable DNA Code Design using a Similarity Significance Model

    DNA code design aims to generate a set of DNA sequences (codewords) with minimum likelihood of undesired hybridizations among sequences and their reverse-complement (RC) pairs (cross-hybridization). Inspired by the distinct hybridization affinities (or stabilities) of the perfect double helix formed by an individual single-stranded DNA (ssDNA) and its RC pair, we propose a novel similarity significance (SS) model to measure the similarity between DNA sequences. In particular, instead of directly measuring the similarity of two sequences by a metric, the proposed SS model evaluates how much more likely undesirable hybridizations are to occur than desirable hybridizations in the presence of the two measured sequences and their RC pairs. With this SS model, we construct thermodynamically stable DNA codes subject to several combinatorial constraints using a sorting-based algorithm. The proposed scheme results in DNA codes with larger code sizes and wider free energy gaps (hence better cross-hybridization performance) compared to the existing methods. Comment: To appear in ISIT 202

    Iterative DNA Coding Scheme With GC Balance and Run-Length Constraints Using a Greedy Algorithm

    In this paper, we propose a novel iterative encoding algorithm for DNA storage to satisfy both the GC balance and run-length constraints using a greedy algorithm. DNA strands with a run-length greater than three and a GC balance ratio far from 50% are known to be prone to errors. The proposed encoding algorithm stores data at high information density with high flexibility, supporting a run-length of at most $m$ and a GC balance within $0.5\pm\alpha$ for arbitrary $m$ and $\alpha$. More importantly, we propose a novel mapping method, based on a greedy algorithm, that reduces the average bit error compared to a randomly generated mapping. The proposed algorithm is implemented through iterative encoding, consisting of three main steps: randomization, M-ary mapping, and verification. It has an information density of 1.8616 bits/nt in the case of $m=3$, which approaches the theoretical upper bound of 1.98 bits/nt, while satisfying the two constraints. Also, the average bit error caused by a single-nucleotide error is 2.3455 bits, which is 20.5% lower than with the randomized mapping. Comment: 19 page
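The randomize, map, verify loop described in this abstract can be sketched as follows. This is a heavily simplified stand-in, not the paper's algorithm: it uses a fixed 2-bit-per-nucleotide table instead of the greedy M-ary mapping, a seeded XOR mask for randomization, and retries with a new seed when verification fails. All names, the mapping table, and the default `alpha=0.1` are assumptions.

```python
import random

# Hypothetical fixed mapping; the paper's greedy M-ary mapping is not reproduced here.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT_TO_BITS = {nt: bits for bits, nt in BITS_TO_NT.items()}


def mask_bits(bits: str, seed: int) -> str:
    """Randomization step: XOR the bits with a seeded pseudo-random mask (self-inverse)."""
    rng = random.Random(seed)
    return "".join(str(int(b) ^ rng.getrandbits(1)) for b in bits)


def max_run(seq: str) -> int:
    """Longest run of identical nucleotides."""
    longest, current, previous = 0, 0, None
    for nt in seq:
        current = current + 1 if nt == previous else 1
        longest = max(longest, current)
        previous = nt
    return longest


def gc_ratio(seq: str) -> float:
    """Fraction of G and C nucleotides."""
    return sum(nt in "GC" for nt in seq) / len(seq)


def encode(bits: str, m: int = 3, alpha: float = 0.1, max_tries: int = 256):
    """Iterate over seeds until the mapped strand passes verification."""
    if not bits or len(bits) % 2:
        raise ValueError("expected a non-empty bit string of even length")
    for seed in range(max_tries):
        masked = mask_bits(bits, seed)
        strand = "".join(BITS_TO_NT[masked[i:i + 2]] for i in range(0, len(masked), 2))
        if max_run(strand) <= m and abs(gc_ratio(strand) - 0.5) <= alpha:
            return seed, strand
    raise RuntimeError("no seed produced a valid strand")


def decode(seed: int, strand: str) -> str:
    """Invert the mapping, then remove the mask with the same seed."""
    masked = "".join(NT_TO_BITS[nt] for nt in strand)
    return mask_bits(masked, seed)
```

In this sketch the seed is returned separately; a practical scheme would need to store it with the strand, which costs some information density.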

    Achievable Rates of Concatenated Codes in DNA Storage under Substitution Errors

    In this paper, we study achievable rates of concatenated coding schemes over a deoxyribonucleic acid (DNA) storage channel. Our channel model incorporates the main features of DNA-based data storage. First, information is stored on many short DNA strands. Second, the strands are stored in an unordered fashion inside the storage medium and each strand is replicated many times. Third, the data is accessed in an uncontrollable manner, i.e., random strands are drawn from the medium and received, possibly with errors. As one of our results, we show that there is a significant gap between the channel capacity and the achievable rate of a standard concatenated code in which one strand corresponds to an inner block. This is surprising because for other channels, such as $q$-ary symmetric channels, concatenated codes are known to achieve the capacity. We further propose a modified concatenated coding scheme that combines several strands into one inner block, which narrows the gap and achieves rates that are close to the capacity. Comment: Extended version of a paper submitted to International Symposium on Information Theory and Its Applications (ISITA) 202
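The three features of the channel model (many short strands, unordered storage with replication, random access with substitution errors) can be mimicked with a toy simulation. This is a sketch under stated assumptions: strands are drawn uniformly with replacement, each nucleotide is substituted independently with probability `p_sub`, and the function names and parameters are hypothetical rather than taken from the paper.

```python
import random

NUCLEOTIDES = "ACGT"


def substitute(strand: str, p_sub: float, rng: random.Random) -> str:
    """Replace each nucleotide with a different one with probability p_sub."""
    out = []
    for nt in strand:
        if rng.random() < p_sub:
            out.append(rng.choice([x for x in NUCLEOTIDES if x != nt]))
        else:
            out.append(nt)
    return "".join(out)


def draw_reads(stored: list, num_reads: int, p_sub: float, seed: int = 0) -> list:
    """Draw strands uniformly at random (with replacement), losing their order
    and possibly corrupting them with substitution errors."""
    rng = random.Random(seed)
    return [substitute(rng.choice(stored), p_sub, rng) for _ in range(num_reads)]
```

Because the reads carry no position information, a decoder has to recover both the content and the order of the strands, which is what makes this channel differ from a plain $q$-ary symmetric channel.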

    Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage

    DNA strands serve as a storage medium for $4$-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replicability. However, information in this intriguing storage technology might be corrupted. Experiments have revealed that DNA sequences with long homopolymers and/or with low GC-content are notably more subject to errors upon storage. This paper investigates the utilization of the recently-introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage. This paper introduces DNA LOCO (D-LOCO) codes over the alphabet $\{A,T,G,C\}$ with limited runs of identical symbols. These codes come with an encoding-decoding rule we derive, which provides affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balancing over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetected errors. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates at moderate lengths. Comment: 14 pages (double column), 3 figures, submitted to the IEEE Transactions on Molecular, Biological and Multi-scale Communications (TMBMC
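To give a feel for what "limited runs of identical symbols" costs in rate, the sketch below counts all length-$n$ sequences over a 4-letter alphabet whose runs never exceed a given length, and reports the resulting bits per nucleotide. It is a generic dynamic-programming count, not the D-LOCO encoding-decoding rule, and the function names are hypothetical.

```python
import math


def count_rll(n: int, max_run: int, q: int = 4) -> int:
    """Number of length-n sequences over a q-letter alphabet whose runs of
    identical symbols have length at most max_run."""
    if n == 0:
        return 1
    # counts[r] = number of sequences so far whose final run has length r + 1
    counts = [q] + [0] * (max_run - 1)
    for _ in range(n - 1):
        total = sum(counts)
        # Either start a new run with one of the q - 1 other symbols,
        # or extend the current run by one (dropping runs that would overflow).
        counts = [total * (q - 1)] + counts[:-1]
    return sum(counts)


def rate_bits_per_nt(n: int, max_run: int, q: int = 4) -> float:
    """Bits per nucleotide achievable by indexing every allowed sequence of length n."""
    return math.log2(count_rll(n, max_run, q)) / n
```

As $n$ grows, `rate_bits_per_nt(n, max_run)` settles toward the capacity of the run-length constraint alone, which is the benchmark a capacity-achieving constrained code is measured against.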

    Hidden Addressing Encoding for DNA Storage

    DNA is a natural storage medium with the advantages of high storage density and long service life compared with traditional media. DNA storage can meet the current storage requirements for massive data. Owing to the limitations of DNA storage technology, the data need to be converted into short DNA sequences for storage. However, indexing these short DNA sequences generates a large amount of physical redundancy. To reduce redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using an improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and a 10.1 MB file is then encoded with the hidden addressing. First, the Dottup dot plot generator and the Jaccard similarity coefficient are used to analyze the overall self-similarity of the encoding sequence index, and then the GC content of sequence fragments is used to verify the performance of this scheme. The final results show that the encoding scheme yields indexes with lower overall self-similarity and that the local thermodynamic properties of the sequences are better. The proposed hidden addressing encoding scheme can not only improve the utilization of bases but also ensure the correct rate of DNA storage during the sequencing and decoding processes.
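The Jaccard similarity coefficient mentioned in this abstract is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the sets of $k$-mers of two sequences; treating sequences as $k$-mer sets, the choice `k=3`, and the function names are assumptions, since the abstract does not say how the coefficient was computed.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Set of all length-k substrings of the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(seq_a: str, seq_b: str, k: int = 3) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of the two k-mer sets
    (defined here as 0.0 when both sets are empty)."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

Lower values indicate that two index sequences share fewer substrings, which is the sense in which an index set can have low self-similarity.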

    Optimized code design for constrained DNA data storage with asymmetric errors

    With ultra-high density and preservation longevity, deoxyribonucleic acid (DNA)-based data storage is becoming an emerging storage technology. Limited by the current biochemical techniques, data might be corrupted during the processes of DNA data storage. A hybrid coding architecture consisting of modified variable-length run-length limited (VL-RLL) codes and optimized protograph low-density parity-check (LDPC) codes is proposed in order to suppress error occurrence and correct asymmetric substitution errors. Based on analyses of different asymmetric DNA sequencer channel models, a series of protograph LDPC codes are optimized using a modified extrinsic information transfer (EXIT) algorithm. The simulation results show that the proposed protograph LDPC codes achieve better error performance than conventional good codes and the codes used in the existing DNA data storage system. In addition, the theoretical analysis shows that the proposed hybrid coding scheme stores ~1.98 bits per nucleotide (bits/nt), only a 1% gap from the upper bound (2 bits/nt).
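The 1% figure follows from simple arithmetic, assuming the 2 bits/nt bound is the unconstrained capacity of a four-letter alphabet. The symbols $C$ and $R$ are labels introduced here for illustration and do not come from the abstract.

```latex
\[
C = \log_2 4 = 2 \ \text{bits/nt}, \qquad
\frac{C - R}{C} = \frac{2 - 1.98}{2} = 0.01 = 1\%
\quad \text{for } R \approx 1.98 \ \text{bits/nt}.
\]
```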