8 research outputs found

Construction of bio-constrained code for DNA data storage

Author: Guan Yong Liang
Gunawan Erry
Noor-A-Rahim Md.
Poh Chueh Loo
Wang Yixin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/04/2019

With extremely high density and durable preservation, DNA data storage has become one of the most cutting-edge techniques for long-term data storage. Like traditional storage systems, which impose restrictions on the form of encoded data, DNA storage systems subject stored data to two biochemical constraints: a maximum homopolymer run limit and a balanced GC content limit. Previous studies satisfied these two constraints in successive steps, so the resulting process suffers from low efficiency and high complexity. In this paper, we propose a novel content-balanced run-length limited (C-RLL) code with an efficient code construction method, which generates short DNA sequences that satisfy both constraints at the same time. In addition, we develop an encoding method that maps binary data into long DNA sequences for DNA data storage and ensures both local and global stability in terms of satisfying the biochemical constraints. The proposed encoding method has a high effective code rate of 1.917 bits per nucleotide and low coding complexity.
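The two biochemical constraints named in this abstract can be illustrated with a minimal Python sketch. It is not the C-RLL construction itself; the function names and the thresholds (`max_run=3`, GC content between 0.4 and 0.6) are assumptions chosen for illustration, since the abstract does not state the limits used.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Return the length of the longest run of identical nucleotides."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)


def gc_content(seq: str) -> float:
    """Return the fraction of nucleotides that are G or C."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, max_run: int = 3,
                          gc_low: float = 0.4, gc_high: float = 0.6) -> bool:
    """Check both constraints at once (threshold values are assumed)."""
    return (max_homopolymer_run(seq) <= max_run
            and gc_low <= gc_content(seq) <= gc_high)
```

A sequence such as `"ACGTACGT"` passes both checks, while `"AAAAACGT"` fails the run-length check because of its run of five `A`s.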

Irish Universities

Cork Open Research Archive

On DNA Codes Over the Non-Chain Ring $\mathbb{Z}_4+u\mathbb{Z}_4+u^2\mathbb{Z}_4$ with $u^3=1$

Author: Banerjee Adrish
Benerjee Krishna Gopal
Das Shibsankar
Publication date: 25/11/2022

In this paper, we present a novel design strategy of DNA codes with length $3n$ over the non-chain ring $R=\mathbb{Z}_4+u\mathbb{Z}_4+u^2\mathbb{Z}_4$ with $64$ elements and $u^3=1$, where $n$ denotes the length of a code over $R$. We first study and analyze a distance conserving map defined over the ring $R$ into the length-$3$ DNA sequences. Then, we derive some conditions on the generator matrix of a linear code over $R$, which leads to a DNA code with reversible, reversible-complement, homopolymer $2$-run-length, and $\frac{w}{3n}$-GC-content constraints for integer $w$ ($0\leq w\leq 3n$). Finally, we propose a new construction of DNA codes using Reed-Muller type generator matrices. This allows us to obtain DNA codes with reversible, reversible-complement, homopolymer $2$-run-length, and $\frac{2}{3}$-GC-content constraints. Comment: This paper has been presented in IEEE Information Theory Workshop (ITW) 2022, Mumbai, INDIA
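The reversible and reversible-complement properties mentioned in this abstract can be made concrete with a small Python sketch. It assumes the closure form of the constraints (the reverse, or the reverse-complement, of every codeword is again a codeword) and the usual Watson-Crick pairing A-T, C-G; the helper names are hypothetical and the ring-based construction from the paper is not reproduced here.

```python
# Watson-Crick complement table: A<->T, C<->G
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(seq: str) -> str:
    """Return the sequence read backwards."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Return the reversed sequence with every base complemented."""
    return seq[::-1].translate(COMPLEMENT)


def is_reversible(code: set) -> bool:
    """True if the code is closed under reversal (assumed definition)."""
    return all(reverse(word) in code for word in code)


def is_reversible_complement(code: set) -> bool:
    """True if the code is closed under reverse-complement (assumed definition)."""
    return all(reverse_complement(word) in code for word in code)


def gc_weight(seq: str) -> int:
    """Number of G and C bases, i.e. the w in a w/(3n) GC-content constraint."""
    return sum(base in "GC" for base in seq)
```

For example, the two-word code `{"ACGT", "TGCA"}` is closed under both operations, and each codeword has GC weight 2 out of length 4.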

arXiv.org e-Print Archive

Thermodynamically Stable DNA Code Design using a Similarity Significance Model

Author: Guan Yong Liang
Gunawan Erry
Noor-A-Rahim Md
Poh Chueh Loo
Wang Yixin
Publication date: 14/05/2020

DNA code design aims to generate a set of DNA sequences (codewords) with minimum likelihood of undesired hybridizations among sequences and their reverse-complement (RC) pairs (cross-hybridization). Inspired by the distinct hybridization affinities (or stabilities) of the perfect double helix formed by an individual single-stranded DNA (ssDNA) and its RC pair, we propose a novel similarity significance (SS) model to measure the similarity between DNA sequences. In particular, instead of directly measuring the similarity of two sequences by some metric or approach, the proposed SS model evaluates how much more likely undesirable hybridizations are to occur than desirable hybridizations in the presence of the two measured sequences and their RC pairs. With this SS model, we construct thermodynamically stable DNA codes subject to several combinatorial constraints using a sorting-based algorithm. The proposed scheme results in DNA codes with larger code sizes and wider free energy gaps (hence better cross-hybridization performance) compared to the existing methods. Comment: To appear in ISIT 202

arXiv.org e-Print Archive

Crossref

Iterative DNA Coding Scheme With GC Balance and Run-Length Constraints Using a Greedy Algorithm

Author: Lee Yongwoo
No Jong-Seon
Park Seong-Joon
Publication date: 30/03/2021

In this paper, we propose a novel iterative encoding algorithm for DNA storage to satisfy both the GC balance and run-length constraints using a greedy algorithm. DNA strands with run-length more than three and a GC balance ratio far from 50% are known to be prone to errors. The proposed encoding algorithm stores data at high information density with high flexibility of run-length at most $m$ and GC balance within $0.5\pm\alpha$ for arbitrary $m$ and $\alpha$. More importantly, we propose a novel mapping method to reduce the average bit error compared to the randomly generated mapping method, using a greedy algorithm. The proposed algorithm is implemented through iterative encoding, consisting of three main steps: randomization, M-ary mapping, and verification. It has an information density of 1.8616 bits/nt in the case of $m=3$, which approaches the theoretical upper bound of 1.98 bits/nt, while satisfying the two constraints. Also, the average bit error caused by a one-nt error is 2.3455 bits, which is reduced by $20.5\%$ compared to the randomized mapping. Comment: 19 pages
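To give a feel for where an upper bound near 1.98 bits/nt can come from, the sketch below counts quaternary sequences whose homopolymer runs never exceed `max_run` and converts the count into a rate in bits per nucleotide. This is a generic counting argument for the run-length constraint alone; that the bound quoted in the abstract is derived this way is an assumption, and the function names are hypothetical.

```python
import math


def count_run_limited(n: int, max_run: int, alphabet_size: int = 4) -> int:
    """Count length-n sequences in which no symbol repeats more than max_run times in a row."""
    if n == 0:
        return 1
    # ending[r - 1] = number of valid sequences whose final run has length r
    ending = [0] * max_run
    ending[0] = alphabet_size
    for _ in range(n - 1):
        # Appending a different symbol starts a new run of length 1 ...
        switched = (alphabet_size - 1) * sum(ending)
        # ... while repeating the last symbol lengthens the final run by one.
        ending = [switched] + ending[:-1]
    return sum(ending)


def rate_bits_per_nt(n: int, max_run: int) -> float:
    """Achievable bits per nucleotide for length-n run-limited sequences."""
    return math.log2(count_run_limited(n, max_run)) / n
```

With `max_run=3`, only the four constant sequences are excluded at length 4 (252 of 256 remain), and at a length of a few hundred nucleotides the rate settles just above 1.98 bits/nt, consistent with the bound quoted in the abstract.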

arXiv.org e-Print Archive

Achievable Rates of Concatenated Codes in DNA Storage under Substitution Errors

Author: Lenz Andreas
Puchinger Sven
Welter Lorenz
Publication date: 30/04/2020

In this paper, we study achievable rates of concatenated coding schemes over a deoxyribonucleic acid (DNA) storage channel. Our channel model incorporates the main features of DNA-based data storage. First, information is stored on many short DNA strands. Second, the strands are stored in an unordered fashion inside the storage medium and each strand is replicated many times. Third, the data is accessed in an uncontrollable manner, i.e., random strands are drawn from the medium and received, possibly with errors. As one of our results, we show that there is a significant gap between the channel capacity and the achievable rate of a standard concatenated code in which one strand corresponds to an inner block. This is in fact surprising, as for other channels, such as $q$-ary symmetric channels, concatenated codes are known to achieve the capacity. We further propose a modified concatenated coding scheme that combines several strands into one inner block, which allows us to narrow the gap and achieve rates that are close to the capacity. Comment: Extended version of a paper submitted to International Symposium on Information Theory and Its Applications (ISITA) 202

arXiv.org e-Print Archive

Online Research Database In Technology

Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage

Author: Hareedy Ahmed
İrimağzı Canberk
Uslan Yusuf
Publication date: 14/11/2023

DNA strands serve as a storage medium for $4$-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replicability. However, information in this intriguing storage technology might be corrupted. Experiments have revealed that DNA sequences with long homopolymers and/or with low $GC$-content are notably more subject to errors upon storage. This paper investigates the utilization of the recently-introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage. This paper introduces DNA LOCO (D-LOCO) codes over the alphabet $\{A,T,G,C\}$ with limited runs of identical symbols. These codes come with an encoding-decoding rule we derive, which provides affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balancing over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetected errors. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates at moderate lengths. Comment: 14 pages (double column), 3 figures, submitted to the IEEE Transactions on Molecular, Biological and Multi-scale Communications (TMBMC)

arXiv.org e-Print Archive

Hidden Addressing Encoding for DNA Storage

Author: Bin Wang
Lijun Sun
Penghao Wang
Shuqing Si
Ziniu Mu
Publication venue: 'Frontiers Media SA'
Publication date: 01/07/2022

DNA is a natural storage medium with the advantages of high storage density and long service life compared with traditional media. DNA storage can meet the current storage requirements for massive data. Owing to the limitations of DNA storage technology, the data need to be converted into short DNA sequences for storage. However, in the process, a large amount of physical redundancy is generated to index the short DNA sequences. To reduce redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using an improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and a 10.1 MB file is then encoded with the hidden addressing. First, the Dottup dot plot generator and the Jaccard similarity coefficient are used to analyze the overall self-similarity of the encoding sequence index, and then the GC content of sequence fragments is used to verify the performance of this scheme. The final results show that the encoding scheme produces indexes with lower overall self-similarity and that the local thermodynamic properties of the sequence are better. The proposed hidden addressing encoding scheme can not only improve the utilization of bases but also ensure the accuracy of DNA storage during the sequencing and decoding processes.
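The Jaccard similarity coefficient named in this abstract is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the sets of length-`k` substrings (k-mers) of two sequences. The abstract does not say how the compared sets are built, so the k-mer decomposition and the function names are assumptions made for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)
```

For instance, `"ACGT"` and `"ACGA"` share two of their four distinct 2-mers, giving a coefficient of 0.5; lower values indicate less similar index sequences.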

Directory of Open Access Journals

Optimized code design for constrained DNA data storage with asymmetric errors

Author: Deng Li
Guan Yong Liang
Gunawan Erry
Noor-A-Rahim Md.
Poh Chueh Loo
Shi Zhiping
Wang Yixin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019

With ultra-high density and preservation longevity, deoxyribonucleic acid (DNA)-based data storage is emerging as a storage technology. Limited by current biochemical techniques, data might be corrupted during the processes of DNA data storage. A hybrid coding architecture consisting of modified variable-length run-length limited (VL-RLL) codes and optimized protograph low-density parity-check (LDPC) codes is proposed in order to suppress error occurrence and correct asymmetric substitution errors. Based on analyses of different asymmetric DNA sequencer channel models, a series of protograph LDPC codes is optimized using a modified extrinsic information transfer (EXIT) algorithm. The simulation results show that the proposed protograph LDPC codes achieve better error performance than the conventional good codes and the codes used in the existing DNA data storage system. In addition, the theoretical analysis shows that the proposed hybrid coding scheme stores ~1.98 bits per nucleotide (bits/nt), with only a 1% gap from the upper bound (2 bits/nt).
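As a quick check of the figures quoted in this abstract, the relative gap between the stated rate and the 2 bits/nt limit of a four-letter alphabet works out as follows (the symbol $R$ for the rate is introduced here only for illustration):

```latex
\[
  \log_2 4 = 2 \ \text{bits/nt}, \qquad
  \frac{2 - R}{2} = \frac{2 - 1.98}{2} = 0.01 = 1\% \quad \text{for } R \approx 1.98 \ \text{bits/nt}.
\]
```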

Irish Universities

Cork Open Research Archive

DR-NTU (Digital Repository of NTU)

ScholarBank@NUS