
    Capacity of DNA Data Embedding Under Substitution Mutations

    A number of methods have been proposed over the last decade for encoding information using deoxyribonucleic acid (DNA), giving rise to the emerging area of DNA data embedding. Since a DNA sequence is conceptually equivalent to a sequence of quaternary symbols (bases), DNA data embedding (diversely called DNA watermarking or DNA steganography) can be seen as a digital communications problem where channel errors are tantamount to mutations of DNA bases. Depending on the use of coding or noncoding DNA hosts, which, respectively, denote DNA segments that can or cannot be translated into proteins, DNA data embedding is essentially a problem of communications with or without side information at the encoder. In this paper the Shannon capacity of DNA data embedding is obtained for the case in which DNA sequences are subject to substitution mutations modelled using the Kimura model from molecular evolution studies. Inferences are also drawn with respect to the biological implications of some of the results presented. Comment: 22 pages, 13 figures; preliminary versions of this work were presented at the SPIE Media Forensics and Security XII conference (January 2010) and at the IEEE ICASSP conference (March 2010)
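
The channel view in this abstract can be illustrated with a small sketch. It assumes a single-step, Kimura-style substitution channel in which each base mutates to its transition partner with probability `transition_p` and to each of its two transversion partners with probability `transversion_p`. These parameter names and the single-step simplification are assumptions made for illustration, not the paper's notation or derivation. Because every row of such a matrix is a permutation of the same four probabilities, the channel is symmetric, and the standard result for symmetric channels gives a capacity of log2(4) minus the row entropy. This covers only the case without side information at the encoder.

```python
import math


def kimura_row(transition_p, transversion_p):
    """Probabilities of (no change, transition, transversion, transversion)."""
    stay = 1.0 - transition_p - 2.0 * transversion_p
    if min(stay, transition_p, transversion_p) < 0:
        raise ValueError("probabilities must be non-negative and sum to 1")
    return [stay, transition_p, transversion_p, transversion_p]


def entropy_bits(probs):
    """Shannon entropy in bits, treating 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def symmetric_capacity_bits(transition_p, transversion_p):
    """Capacity in bits per base of the assumed symmetric channel."""
    # For a symmetric channel over 4 symbols: C = log2(4) - H(row).
    return 2.0 - entropy_bits(kimura_row(transition_p, transversion_p))


if __name__ == "__main__":
    # Illustrative, made-up mutation probabilities.
    print(symmetric_capacity_bits(0.02, 0.005))
```

With no mutations the sketch returns the full 2 bits per base, and with a uniform row (all four outcomes equally likely) it returns 0.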

    Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

    DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters. Comment: 11 pages, 2 figures, Latex; version accepted for publication

    Extreme genetic fragility of the HIV-1 capsid

    Genetic robustness, or fragility, is defined as the ability, or lack thereof, of a biological entity to maintain function in the face of mutations. Viruses that replicate via RNA intermediates exhibit high mutation rates, and robustness should be particularly advantageous to them. The capsid (CA) domain of the HIV-1 Gag protein is under strong pressure to conserve functional roles in viral assembly, maturation, uncoating, and nuclear import. However, CA is also under strong immunological pressure to diversify. Therefore, it would be particularly advantageous for CA to evolve genetic robustness. To measure the genetic robustness of HIV-1 CA, we generated a library of single amino acid substitution mutants, encompassing almost half the residues in CA. Strikingly, we found HIV-1 CA to be the most genetically fragile protein that has been analyzed using such an approach, with 70% of mutations yielding replication-defective viruses. Although CA participates in several steps in HIV-1 replication, analysis of conditionally (temperature sensitive) and constitutively non-viable mutants revealed that the biological basis for its genetic fragility was primarily the need to coordinate the accurate and efficient assembly of mature virions. All mutations that exist in naturally occurring HIV-1 subtype B populations at a frequency >3%, and were also present in the mutant library, had fitness levels that were >40% of WT. However, a substantial fraction of mutations with high fitness did not occur in natural populations, suggesting another form of selection pressure limiting variation in vivo. Additionally, known protective CTL epitopes occurred preferentially in domains of the HIV-1 CA that were even more genetically fragile than HIV-1 CA as a whole. The extreme genetic fragility of HIV-1 CA may be one reason why cell-mediated immune responses to Gag correlate with better prognosis in HIV-1 infection, and suggests that CA is a good target for therapy and vaccination strategies

    Noise and Uncertainty in String-Duplication Systems

    Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into account the presence of other types of mutations. In this work, we consider the capacity of duplication mutations in the presence of point-mutation noise, and so quantify the generation power of these mutations. We show that if the number of point mutations is vanishingly small compared to the number of duplication mutations of a constant length, the generation capacity of these mutations is zero. However, if the number of point mutations increases to a constant fraction of the number of duplications, then the capacity is nonzero. Lower and upper bounds for this capacity are also presented. Another problem that we study is concerned with the mismatch between code design and channel in data storage in the DNA of living organisms with respect to duplication mutations. In this context, we consider the uncertainty of such a mismatched coding scheme measured as the maximum number of input codewords that can lead to the same output

    Data Hiding Based DNA Issues: A Review

    Information security is a key concern, particularly with the growth of internet usage. This growth has brought incidents of unauthorized access to transmitted data, which are countered by a variety of secure communication techniques, namely cryptography and data hiding. More recent trends use DNA as a carrier for cryptography and data hiding, exploiting its biomolecular properties. This paper reviews published DNA-based data hiding techniques, in which DNA safeguards critical data transmitted over an insecure channel, to identify their strengths and weaknesses. This will help future research in designing more efficient and secure DNA-based data hiding techniques

    Generative Language Models on Nucleotide Sequences of Human Genes

    Language models, primarily transformer-based ones, have achieved colossal success in NLP. In particular, studies like BERT in NLU and works such as GPT-3 for NLG have been crucial. DNA sequences are structurally close to natural language, and in the DNA-related bioinformatics domain discriminative models such as DNABert exist. Yet the generative side of the coin is, to the best of our knowledge, mainly unexplored. Consequently, we focused on developing an autoregressive generative language model like GPT-3 for DNA sequences. Because working with whole DNA sequences is challenging without substantial computational resources, we decided to carry out our study on a smaller scale, focusing on nucleotide sequences of human genes, unique parts of DNA with specific functionalities, instead of the whole DNA. This decision did not change the problem structure much, because both DNA and genes can be seen as 1D sequences of four different nucleotides without losing much information or oversimplifying. First, we systematically examined an almost entirely unexplored problem and observed that RNNs performed best, while simple techniques like N-grams were also promising. Another benefit was learning how to work with generative models on languages we do not understand, unlike natural language. We also observed how essential it is to use real-life tasks beyond classical metrics such as perplexity. Furthermore, we examined whether the data-hungry nature of these models can be changed by selecting a language with a minimal vocabulary size (four, owing to the four types of nucleotides). The motivation was that choosing such a language might make the problem easier. However, we observed that it did not change the amount of data needed by much

    Targeted KRAS Mutation Assessment on Patient Tumor Histologic Material in Real Time Diagnostics

    BACKGROUND: Testing for tumor specific mutations on routine formalin-fixed paraffin-embedded (FFPE) tissues may predict response to treatment in Medical Oncology and has already entered diagnostics, with KRAS mutation assessment as a paradigm. The highly sensitive real time PCR (Q-PCR) methods developed for this purpose are usually standardized under optimal template conditions. In routine diagnostics, however, suboptimal templates pose the challenge. Herein, we addressed the applicability of sequencing and two Q-PCR methods on prospectively assessed diagnostic cases for KRAS mutations. METHODOLOGY/PRINCIPAL FINDINGS: Tumor FFPE-DNA from 135 diagnostic and 75 low-quality control samples was obtained upon macrodissection, tested for fragmentation and assessed for KRAS mutations with dideoxy-sequencing and with two Q-PCR methods (Taqman-minor-groove-binder [TMGB] probes and DxS-KRAS-IVD). Samples with relatively well preserved DNA could be accurately analyzed with sequencing, while Q-PCR methods yielded informative results even in cases with very fragmented DNA (p < 0.0001) with 100% sensitivity and specificity vs each other. However, Q-PCR efficiency (Ct values) also depended on DNA-fragmentation (p < 0.0001). Q-PCR methods were sensitive to detect ≤1% mutant cells, provided that samples yielded cycle thresholds (Ct) < 29, but this condition was met in only 38.5% of diagnostic samples. In comparison, FFPE samples (>99%) could accurately be analyzed at a sensitivity level of 10% (external validation of TMGB results). DNA quality and tumor cell content were the main reasons for discrepant sequencing/Q-PCR results (1.5%). CONCLUSIONS/SIGNIFICANCE: Diagnostic targeted mutation assessment on FFPE-DNA is very efficient with Q-PCR methods in comparison to dideoxy-sequencing. However, DNA fragmentation/amplification capacity and tumor DNA content must be considered for the interpretation of Q-PCR results in order to provide accurate information for clinical decision making

    Duplication-Correcting Codes for Data Storage in the DNA of Living Organisms

    The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors arising from various mutations, such as point mutations, indels, and tandem duplication, which need to be corrected to maintain data integrity. In this paper, we provide error-correcting codes for errors caused by tandem duplications, which create a copy of a block of the sequence and insert it in a tandem manner, i.e., next to the original. In particular, we present two families of codes for correcting errors due to tandem-duplications of a fixed length; the first family can correct any number of errors while the second corrects a bounded number of errors. We also study codes for correcting tandem duplications of length up to a given constant k, where we are primarily focused on the cases of k = 2, 3
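
The tandem-duplication error described in this abstract (a block of the sequence is copied and inserted next to the original) can be sketched as follows. The function names `tandem_duplicate` and `collapse_tandem_repeats` are hypothetical and chosen for illustration. The second function greedily removes adjacent repeated blocks of a fixed length `k`, which is one simple way to undo such errors in a toy setting. It is not the code construction from the paper.

```python
def tandem_duplicate(seq, start, length):
    """Copy seq[start:start+length] and insert the copy right after it."""
    if length <= 0 or start < 0 or start + length > len(seq):
        raise ValueError("block must lie inside the sequence")
    end = start + length
    return seq[:end] + seq[start:end] + seq[end:]


def collapse_tandem_repeats(seq, k):
    """Repeatedly delete one copy of any adjacent repeated block of length k."""
    if k <= 0:
        raise ValueError("k must be positive")
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 2 * k + 1):
            if seq[i:i + k] == seq[i + k:i + 2 * k]:
                # Drop the second copy and rescan from the start.
                seq = seq[:i + k] + seq[i + 2 * k:]
                changed = True
                break
    return seq


if __name__ == "__main__":
    noisy = tandem_duplicate("ACGT", 1, 2)
    print(noisy)
    print(collapse_tandem_repeats(noisy, 2))
```

In this toy example, duplicating the block `CG` in `ACGT` gives `ACGCGT`, and collapsing repeats of length 2 recovers `ACGT`. Recovery works here only because the original sequence contained no tandem repeat of length 2 to begin with.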