10 research outputs found
Client side decompression technique provides faster DNA sequence data delivery
DNA sequences are generally very long chains of sequentially linked nucleotides. There are four different nucleotides, and combinations of these make up the sequence files held in data sources. When a user searches for a sequence for an organism, a compressed sequence file can be sent from the data source to the user. The compressed file can then be decompressed at the client end, reducing transmission time over the Internet. This paper proposes a compression algorithm that provides a moderately high compression rate with minimal decompression time. We also compare a number of different compression techniques for achieving efficient delivery methods from an intelligent genomic search agent over the Internet.
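As a point of reference for the abstract above: with four nucleotides, a fixed 2-bit code per base is the simplest baseline that decompresses quickly on the client. The sketch below is not the algorithm proposed in the paper; the names `pack` and `unpack` and the 4-byte length header are assumptions made only for illustration, and it handles the plain A/C/G/T alphabet only.

```python
# Baseline sketch only: fixed 2-bit packing of A/C/G/T.
# pack/unpack and the 4-byte length header are hypothetical choices,
# not taken from the paper.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, preceded by its length."""
    out = bytearray(len(seq).to_bytes(4, "big"))  # length header
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial byte
        out.append(byte)
    return bytes(out)


def unpack(data: bytes) -> str:
    """Invert pack(): read the length header, then 4 bases per byte."""
    n = int.from_bytes(data[:4], "big")
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])  # drop padding from the last byte
```

Decoding is a table lookup per base, which is why this kind of scheme has negligible client-side cost; it cannot go below 2 bits per base, which is where the more elaborate methods listed here come in.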
On the Representability of Complete Genomes by Multiple Competing Finite-Context (Markov) Models
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.
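The performance measure described in this abstract (negative log-probability per position, averaged into bits per base) can be illustrated with a single adaptive finite-context model. This is a minimal sketch under stated assumptions: it uses one model of a fixed order rather than the multiple competing models of the paper, it ignores inverted repeats, and the additive estimator with parameter `alpha` is an assumed choice, not necessarily the estimator the authors use. The function names are hypothetical.

```python
import math
from collections import defaultdict


def information_profile(seq, k, alpha=1.0):
    """Per-base cost in bits under one adaptive order-k model.

    Assumption: additive (Laplace-style) estimator with parameter alpha.
    Contexts shorter than k at the start of the sequence are used as-is.
    """
    alphabet = "ACGT"
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    profile = []
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]       # the recent past, up to depth k
        c = counts[ctx]
        total = sum(c.values())
        p = (c[sym] + alpha) / (total + alpha * len(alphabet))
        profile.append(-math.log2(p))    # information content at position i
        c[sym] += 1                      # update only after coding the symbol
    return profile


def bits_per_base(seq, k, alpha=1.0):
    """Average of the information profile over the whole sequence."""
    profile = information_profile(seq, k, alpha)
    return sum(profile) / len(profile)
```

Because the counts are updated only after each symbol is scored, a decoder could reproduce the same probabilities, so the average is a legitimate estimate of the bits per base an arithmetic coder driven by this model would need.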
GReEn: a tool for efficient compression of genome resequencing data
Research in the genomic sciences is confronted with a volume of sequencing and resequencing data that is increasing at a higher pace than data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to store sequencing and resequencing data efficiently is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, it runs faster, and it achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz
Optimal reference sequence selection for genome assembly using minimum description length principle
Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads that align to each candidate reference and chooses the one with the highest count. This article explores the use of the minimum description length (MDL) principle and its two variants, two-part MDL and sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the proposed MDL-based scheme with the standard method and concludes that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. The proposed MDL scheme therefore incorporates the standard method of “counting the number of reads that align to the reference sequence” and additionally takes the model itself, the reference sequence, into account when identifying the optimal reference sequence. The proposed MDL-based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.
Entropy rate estimation and compression of biological sequences
This master's thesis describes theoretical background on biological sequences, the principles of entropy rate estimation, and the possibilities of compressing DNA sequences using substitution methods. The thesis includes a practical part in which compression algorithms are applied and entropy is estimated in practice.
AI Techniques for COVID-19
© 2013 IEEE. Artificial Intelligence (AI) intent is to facilitate human limits. It is getting a standpoint on human administrations, filled by the growing availability of restorative clinical data and quick progression of insightful strategies. Motivated by the need to highlight the need for employing AI in battling the COVID-19 Crisis, this survey summarizes the current state of AI applications in clinical administrations while battling COVID-19. Furthermore, we highlight the application of Big Data while understanding this virus. We also overview various intelligence techniques and methods that can be applied to various types of medical information-based pandemic. We classify the existing AI techniques in clinical data analysis, including neural systems, classical SVM, and edge significant learning. Also, an emphasis has been made on regions that utilize AI-oriented cloud computing in combating various similar viruses to COVID-19. This survey study is an attempt to benefit medical practitioners and medical researchers in overpowering their faced difficulties while handling COVID-19 big data. The investigated techniques put forth advances in medical data analysis with an exactness of up to 90%. We further end up with a detailed discussion about how AI implementation can be a huge advantage in combating various similar viruses
DNA Compression
DNA compression is considered a challenging task. It is useful not only for saving disk space and network bandwidth when transferring genome files; it also helps in recognizing patterns in biological sequences and in measuring the evolutionary distance between organisms. This thesis starts with an introduction to the relevant biology and the basic structure of DNA sequences. It continues with a chronologically ordered analysis of selected compression programs, describing their strategies, algorithms, and different solutions to common sub-problems. Based on the knowledge gained and on experience from testing, a new compression method was designed and implemented in the DNAcod program. The results achieved are compared with those of other compression tools. Department: 460 - Katedra informatiky. Grade: výborn
Sinais simbólicos e aplicações em genómica
Doutoramento em Engenharia Electrotécnica (doctorate in Electrical Engineering). This dissertation addresses the problem of processing sequences of symbols, with the specific aim of contributing to the analysis and modeling of DNA sequences. The work was partly motivated by the problem of automatic gene location. Another motivation was the compression of genetic sequences, both to reduce the required storage and to obtain good DNA models. The main methodologies for spectral analysis of symbolic sequences are compared. The validity of applying spectral analysis methods to symbolic sequences is discussed, and a new method based on the symbolic autocorrelation function is presented. One feature that is often used in gene identification is the size of the Fourier coefficient that reflects periodicity of period three. A fast algorithm, based on symbol counters, was developed for calculating several Fourier coefficients, in particular the period-three coefficient. Some properties associated with the size of some spectral coefficients and with spectral redundancy are stated and analyzed. Finally, a technique based on a three-state model was developed to compress genetic sequences. In protein-coding regions this technique generally leads to better results than state-of-the-art DNA compression techniques.
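The idea of computing the period-three coefficient from symbol counters can be sketched as follows. For each base, the Fourier sum at frequency 1/3 depends only on how many times the base occurs at positions congruent to 0, 1 and 2 modulo 3, so three counters per symbol are enough. This is an illustration of that identity, not the dissertation's algorithm; the function name `period3_power` and the use of binary indicator sequences per base are assumptions.

```python
import cmath


def period3_power(seq, alphabet="ACGT"):
    """Sum over bases of |X_b(1/3)|^2, computed from position counters.

    Assumption: each base b is mapped to a binary indicator sequence, and
    X_b(1/3) = sum_n u_b[n] * exp(-2j*pi*n/3), which reduces to
    sum_r counts_b[r] * omega**r with omega = exp(-2j*pi/3).
    """
    omega = cmath.exp(-2j * cmath.pi / 3)
    total = 0.0
    for base in alphabet:
        counts = [0, 0, 0]
        for n, sym in enumerate(seq):
            if sym == base:
                counts[n % 3] += 1   # one counter per codon position
        coeff = sum(c * omega ** r for r, c in enumerate(counts))
        total += abs(coeff) ** 2
    return total
```

A base spread evenly over the three codon positions contributes nothing, since 1 + omega + omega**2 = 0, while a base concentrated in one position contributes strongly. That imbalance is the signature usually associated with protein-coding regions.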
Análise e compressão de sequências genómicas
Tese de doutoramento em Informática (doctoral thesis in Informatics).
Genetics is nowadays, probably, the most inspiring source for coding theory study and
developments. Efficient compression algorithms are essential to optimise genomic data
storage and communication. Genomic data compression is a particular case of data
compression. The entropy of DNA sequences is high but variable. At the
intra-genomic level, it is higher in coding regions and lower in non-coding regions. At the
inter-genomic level, it is higher in prokaryotes and lower in eukaryotes. DNA
entropy is reduced by coding the repetitive regions of the DNA more efficiently.
Repetitive regions are mainly composed of inexact patterns, whose errors are
caused by biological processes and DNA dynamics, including mutations, deletions,
insertions or gaps. Exact patterns are less relevant and generally appear as tandem
repeats. DNA redundancies also have statistical and probabilistic manifestations.
The redundancies of DNA sequences are the most fruitful source of compression
resources: the larger repetitions are suited to dictionary-based substitutional
compression, whereas the statistical and probabilistic evidence makes it possible to
model and partially predict the succession of symbols (bases) in the sequence, using
statistical compression to capitalize on this compression potential. At maximum
entropy, coding DNA costs 2 bits per base. On average, the best available
compressors designed specifically for DNA reach 1.7 bits/base, which
corresponds to a compression rate of only 15%, a value that demonstrates the
inherent difficulty.
The work developed is a framework for the analysis and compression of
DNA sequences, with DNALight as its most representative application. DNALight is a
hybrid solution for DNA compression based on the cooperative integration of
complementary methodologies that absorb the different redundancies present in DNA
sequences. In fact, compression is not possible without analysis: the resources for
compression come mostly from analysis, and the recurrences it reveals make it possible
to reduce the entropy. Innovative algorithms were developed for exact pattern matching
(GRASPm) and for approximate and exact pattern discovery (SimSearch), and their
performance notably surpasses the state of the art. These algorithms play an
important role in the first phase of DNALight, which implements substitutional
compression based on a dictionary of exact and approximate repeats. To maximize
pattern capture, the search is exhaustive and multi-level, i.e., performed on the normal
5'-3' sequence, on the natural complementary 3'-5' sequence, and also on the two
remaining artificial complementary sequences. In the second phase of DNALight, which
aims to exploit the redundancies missed in the first phase, compact probabilistic
language models are built from the less repetitive regions, which constitute the input
of this complementary methodology. The models compete, generating predictions
supported by the probabilistic analysis of global and local language models. Accurate or
approximate predictions allow more compact encodings because they make the
probabilistic coding model more skewed, benefiting the arithmetic coding
that concludes the process. Decompression is similar to compression but runs in
reverse. The experimental results place DNALight as
a new constituent of the state of the art in DNA sequence compression, consistently
surpassing its predecessors, though by a small margin.
Programa de Desenvolvimento Educativo para Portugal (PRODEP
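The 15% figure quoted in this abstract follows directly from comparing 1.7 bits/base with the 2 bits/base needed at maximum entropy. The short sketch below reproduces that arithmetic. The function names are hypothetical, and "compression rate" is taken here to mean the fraction of space saved relative to the 2-bit baseline, which is the reading consistent with the numbers in the abstract.

```python
import math


def compression_rate(bits_per_base, baseline_bits=2.0):
    """Fraction of space saved relative to a fixed-width baseline code."""
    return 1.0 - bits_per_base / baseline_bits


def compressed_size_bytes(n_bases, bits_per_base):
    """Approximate payload size in bytes, ignoring any container overhead."""
    return math.ceil(n_bases * bits_per_base / 8)


# 1.7 bits/base against the 2 bits/base baseline gives roughly 0.15, i.e. 15%.
rate = compression_rate(1.7)
```

On this reading, a compressor would need to reach 1.0 bit/base to halve the size of a 2-bit packed sequence, which puts the small margins between DNA-specific compressors in context.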