10 research outputs found

    Client side decompression technique provides faster DNA sequence data delivery

    DNA sequences are generally very long chains of sequentially linked nucleotides. There are four different nucleotides, and combinations of these make up the sequence files held in data sources. When a user searches for a sequence of an organism, a compressed sequence file can be sent from the data source to the user and decompressed at the client end, reducing transmission time over the Internet. This paper proposes a compression algorithm that provides a moderately high compression rate with minimal decompression time. We also compare a number of compression techniques for efficient delivery from an intelligent genomic search agent over the Internet.
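
    Because there are only four nucleotides, each base fits in 2 bits. The sketch below is a hypothetical baseline, not the algorithm proposed in the paper: the names `pack` and `unpack` and the A/C/G/T bit assignment are assumptions chosen for illustration, and ambiguity codes such as N are not handled.

```python
# Hypothetical 2-bit baseline; NOT the compression algorithm from the paper.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed bit assignment
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over A/C/G/T into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so unpack can read fixed positions.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Client-side step: recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])
```

    Decoding is a table lookup per base, which is one way a client-side decompressor could stay cheap; the sequence length has to travel with the packed bytes because the last byte may be padded.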

    On the Representability of Complete Genomes by Multiple Competing Finite-Context (Markov) Models

    A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.
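
    The per-position measure and its average can be illustrated with a minimal sketch of a single adaptive order-k model. This is not the authors' implementation: it omits the competition between models and the inverted-repeat handling, and the function names and the additive estimator parameter `alpha` are assumptions made for the example.

```python
import math
from collections import defaultdict


def information_profile(seq, k, alpha=1.0, alphabet="ACGT"):
    """Return -log2 of the estimated probability of each symbol given
    the k preceding symbols, updating the counts as the sequence is read."""
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    profile = []
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - k):i]  # shorter than k near the start
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Additive (Laplace-like) estimate; alpha is an assumed parameter.
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(alphabet))
        profile.append(-math.log2(p))
        ctx_counts[symbol] += 1  # adapt only after "coding" the symbol
    return profile


def bits_per_base(seq, k, alpha=1.0):
    """Average of the information profile: the global performance measure."""
    profile = information_profile(seq, k, alpha)
    return sum(profile) / len(profile)
```

    With no counts yet, every symbol costs exactly 2 bits; on a strongly repetitive input the average falls well below that, which is the sense in which bits per base measures how well the model explains the sequence.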

    GReEn: a tool for efficient compression of genome resequencing data

    Research in the genomic sciences is confronted with a volume of sequencing and resequencing data that is growing faster than data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to store sequencing and resequencing data efficiently is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, it runs faster, and it achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/∼ap/codecs/GReEn1.tar.gz

    Optimal reference sequence selection for genome assembly using minimum description length principle

    Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads that align to each candidate reference sequence and chooses the one with the highest count. This article explores the use of the minimum description length (MDL) principle and two of its variants, two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. Comparing the proposed MDL-based scheme with the standard method, the article concludes that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. The proposed MDL scheme therefore includes the standard method of “counting the number of reads that align to the reference sequence” and additionally takes the model, the reference sequence itself, into account when identifying the optimal reference sequence. The proposed scheme not only serves as a sufficient criterion for identifying the optimal reference sequence but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.

    Entropy rate estimation and compression of biological sequences

    This master's thesis describes theoretical knowledge of biological sequences, the principles of entropy rate estimation, and the possibilities for compressing DNA sequences using substitution methods. The thesis includes a practical part that applies compression algorithms and estimates entropy in practice.

    AI Techniques for COVID-19

    © 2013 IEEE. The intent of Artificial Intelligence (AI) is to facilitate human limits. It is getting a standpoint on human administrations, fuelled by the growing availability of restorative clinical data and the quick progression of insightful strategies. Motivated by the need to highlight the role of AI in battling the COVID-19 crisis, this survey summarizes the current state of AI applications in clinical administrations while battling COVID-19. Furthermore, we highlight the application of Big Data in understanding this virus. We also give an overview of various intelligence techniques and methods that can be applied to various types of medical information-based pandemic. We classify the existing AI techniques in clinical data analysis, including neural systems, classical SVM, and edge significant learning. Emphasis is also placed on regions that utilize AI-oriented cloud computing in combating various viruses similar to COVID-19. This survey is an attempt to help medical practitioners and medical researchers overcome the difficulties they face in handling COVID-19 big data. The investigated techniques put forth advances in medical data analysis with an accuracy of up to 90%. We conclude with a detailed discussion of how AI implementation can be a huge advantage in combating various similar viruses.

    DNA Compression

    DNA compression is considered a challenging task. It is useful not only for saving disk space and network bandwidth when transferring genome files; it also helps in recognizing patterns in biological sequences and in measuring the evolutionary distance between organisms. This thesis starts with an insight into biology and basic knowledge of DNA sequence structure. It is followed by a chronologically ordered analysis of selected compression programs, describing their strategies, algorithms, and different solutions to common sub-problems. Based on the knowledge gained and on experience from testing, an original compression method was designed and implemented in the program DNAcod. The results achieved are compared with those of other compression tools. Department: 460 - Katedra informatiky (Department of Computer Science).

    Sinais simbólicos e aplicações em genómica

    Doctorate in Electrical Engineering. This dissertation addresses the problem of processing sequences of symbols, with the specific aim of contributing to the analysis and modeling of DNA sequences. The work was partly motivated by the problem of automatic gene location. Another motivation was the compression of genetic sequences, both to reduce the required storage and to obtain good DNA models. The main methodologies for spectral analysis of symbolic sequences are compared. The validity of applying spectral analysis methods to symbolic sequences is discussed, and a new method based on the symbolic autocorrelation function is presented. One feature often used in gene identification is the size of the Fourier coefficient that reflects periodicity of period three. A fast algorithm based on symbol counters was developed for calculating several Fourier coefficients, in particular the period-three coefficient. Some properties associated with the size of certain spectral coefficients and with spectral redundancy are stated and analyzed. Finally, a technique based on a three-state model was developed to compress genetic sequences. In protein-coding regions this technique generally leads to better results than state-of-the-art DNA compression techniques.
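
    The idea of obtaining the period-three coefficient from symbol counters can be sketched as follows. This is an illustration under assumptions, not the dissertation's algorithm: it uses one binary indicator sequence per base, and the function name `period3_power` is hypothetical. For each base, only the number of occurrences at positions congruent to 0, 1, and 2 modulo 3 is needed.

```python
import cmath


def period3_power(seq, alphabet="ACGT"):
    """Sum over bases of |X_s|^2, where X_s is the Fourier coefficient of
    the indicator sequence of base s at frequency 1/3 cycles per base."""
    w = cmath.exp(-2j * cmath.pi / 3)  # primitive cube root of unity
    power = 0.0
    for s in alphabet:
        # Three counters per base: occurrences at positions n % 3 == 0, 1, 2.
        c = [0, 0, 0]
        for n, base in enumerate(seq):
            if base == s:
                c[n % 3] += 1
        # exp(-2*pi*j*n/3) depends only on n mod 3, so the full sum over n
        # collapses to a weighted sum of the three counters.
        coeff = c[0] + c[1] * w + c[2] * w * w
        power += abs(coeff) ** 2
    return power
```

    When the sequence length is a multiple of three, this value matches the DFT coefficient at index N/3; a base spread evenly over the three phases contributes nothing, while a base locked to one phase contributes the square of its count.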

    Análise e compressão de sequências genómicas

    Doctoral thesis in Informatics. Genetics is nowadays probably the most inspiring source for the study and development of information and coding theory. Efficient compression algorithms are essential to optimise genomic data storage and communication. Genomic data compression is a particular case of data compression. The entropy of DNA sequences is high but variable. At the intra-genomic level, it is higher in coding regions and lower in non-coding regions. At the inter-genomic level, it is higher in prokaryotes and lower in eukaryotes. Entropy reduction is achieved by coding the repetitive regions of DNA more efficiently. Repetitive regions are mainly composed of inexact patterns, whose errors are caused by biological processes and DNA dynamics, including mutations, deletions, insertions, or gaps. Exact patterns are less relevant and generally appear as tandem repetitions. DNA redundancy also has statistical and probabilistic manifestations. The redundancies of DNA sequences are the main source of compression resources: the larger repetitions are suited to dictionary-based substitutional compression, whereas the statistical and probabilistic evidence allows the succession of symbols (bases) in the sequence to be modelled and partially predicted, with statistical compressors capitalizing on this potential. At maximum entropy, coding DNA costs 2 bits per base. On average, the best available compressors designed for the specificities of DNA reach 1.7 bits/base, which corresponds to a compression rate of only 15%, a value that demonstrates the inherent difficulty. The work developed is a framework for the analysis and compression of DNA sequences, of which DNALight is the main application.
DNALight is a hybrid solution for DNA compression based on the cooperative integration of complementary methodologies that absorb the different kinds of redundancy present in DNA sequences. In fact, compression is not possible without analysis: it is thorough analysis that yields the resources needed to reduce the entropy. Innovative algorithms were developed for exact pattern matching (GRASPm) and for approximate and exact pattern discovery (SimSearch), and their performance markedly surpasses the state of the art. These algorithms operate in the first phase of DNALight, which implements substitutional compression based on a dictionary of exact and approximate repeats. To maximize pattern capture, the search is exhaustive and multi-level, i.e., it is performed on the normal 5'-3' sequence, on the natural complementary 3'-5' sequence, and on the two remaining artificial complementary sequences. The second phase of DNALight aims to exploit the redundancies missed by the first phase: compact probabilistic language models are built from the less repetitive regions, which form the input to this complementary methodology. The models compete to generate predictions based on the probabilistic analysis of global and local language models. Accurate or approximate predictions allow more compact encodings because they skew the probabilistic coding model, which benefits the arithmetic coding stage that closes the process. Decompression is similar to compression, but in reverse. The experimental results place DNALight among the state of the art in DNA sequence compression, consistently surpassing its predecessors, though by a small margin. Programa de Desenvolvimento Educativo para Portugal (PRODEP)
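
    The 2 bits per base and 15% figures quoted in the abstract follow from a short calculation, sketched here on the assumption that "compression rate" means the fractional saving relative to the 2-bit baseline:

```latex
\[
H_{\max} = \log_2 4 = 2 \ \text{bits/base},
\qquad
\text{compression rate} = 1 - \frac{1.7}{2} = 0.15 = 15\%.
\]
```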