649 research outputs found

    Algoritmos de compressão sem perdas para imagens de microarrays e alinhamento de genomas completos

    Get PDF
    Doutoramento em InformáticaNowadays, in the 21st century, the never-ending expansion of information is a major global concern. The pace at which storage and communication resources are evolving is not fast enough to compensate this tendency. In order to overcome this issue, sophisticated and efficient compression tools are required. The goal of compression is to represent information with as few bits as possible. There are two kinds of compression, lossy and lossless. In lossless compression, information loss is not tolerated so the decoded information is exactly the same as the encoded one. On the other hand, in lossy compression some loss is acceptable. In this work we focused on lossless methods. The goal of this thesis was to create lossless compression tools that can be used in two types of data. The first type is known in the literature as microarray images. These images have 16 bits per pixel and a high spatial resolution. The other data type is commonly called Whole Genome Alignments (WGA), in particularly applied to MAF files. Regarding the microarray images, we improved existing microarray-specific methods by using some pre-processing techniques (segmentation and bitplane reduction). Moreover, we also developed a compression method based on pixel values estimates and a mixture of finite-context models. Furthermore, an approach based on binary-tree decomposition was also considered. Two compression tools were developed to compress MAF files. The first one based on a mixture of finite-context models and arithmetic coding, where only the DNA bases and alignment gaps were considered. The second tool, designated as MAFCO, is a complete compression tool that can handle all the information that can be found in MAF files. MAFCO relies on several finite-context models and allows parallel compression/decompression of MAF files.Hoje em dia, no século XXI, a expansão interminável de informação é uma grande preocupação mundial. O ritmo ao qual os recursos de armazenamento e comunicação estão a evoluir não é suficientemente rápido para compensar esta tendência. De forma a ultrapassar esta situação, são necessárias ferramentas de compressão sofisticadas e eficientes. A compressão consiste em representar informação utilizando a menor quantidade de bits possível. Existem dois tipos de compressão, com e sem perdas. Na compressão sem perdas, a perda de informação não é tolerada, por isso a informação descodificada é exatamente a mesma que a informação que foi codificada. Por outro lado, na compressão com perdas alguma perda é aceitável. Neste trabalho, focámo-nos apenas em métodos de compressão sem perdas. O objetivo desta tese consistiu na criação de ferramentas de compressão sem perdas para dois tipos de dados. O primeiro tipo de dados é conhecido na literatura como imagens de microarrays. Estas imagens têm 16 bits por píxel e uma resolução espacial elevada. O outro tipo de dados é geralmente denominado como alinhamento de genomas completos, particularmente aplicado a ficheiros MAF. Relativamente às imagens de microarrays, melhorámos alguns métodos de compressão específicos utilizando algumas técnicas de pré-processamento (segmentação e redução de planos binários). Além disso, desenvolvemos também um método de compressão baseado em estimação dos valores dos pixéis e em misturas de modelos de contexto-finito. Foi também considerada, uma abordagem baseada em decomposição em árvore binária. Foram desenvolvidas duas ferramentas de compressão para ficheiros MAF. A primeira ferramenta, é baseada numa mistura de modelos de contexto-finito e codificação aritmética, onde apenas as bases de ADN e os símbolos de alinhamento foram considerados. A segunda, designada como MAFCO, é uma ferramenta de compressão completa que consegue lidar com todo o tipo de informação que pode ser encontrada nos ficheiros MAF. MAFCO baseia-se em vários modelos de contexto-finito e permite compressão/descompressão paralela de ficheiros MAF

    A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

    Get PDF
    The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.Peer reviewe

    Compressão e análise de dados genómicos

    Get PDF
    Doutoramento em InformáticaGenomic sequences are large codi ed messages describing most of the structure of all known living organisms. Since the presentation of the rst genomic sequence, a huge amount of genomics data have been generated, with diversi ed characteristics, rendering the data deluge phenomenon a serious problem in most genomics centers. As such, most of the data are discarded (when possible), while other are compressed using general purpose algorithms, often attaining modest data reduction results. Several speci c algorithms have been proposed for the compression of genomic data, but unfortunately only a few of them have been made available as usable and reliable compression tools. From those, most have been developed to some speci c purpose. In this thesis, we propose a compressor for genomic sequences of multiple natures, able to function in a reference or reference-free mode. Besides, it is very exible and can cope with diverse hardware speci cations. It uses a mixture of nite-context models (FCMs) and eXtended FCMs. The results show improvements over state-of-the-art compressors. Since the compressor can be seen as a unsupervised alignment-free method to estimate algorithmic complexity of genomic sequences, it is the ideal candidate to perform analysis of and between sequences. Accordingly, we de ne a way to approximate directly the Normalized Information Distance, aiming to identify evolutionary similarities in intra- and inter-species. Moreover, we introduce a new concept, the Normalized Relative Compression, that is able to quantify and infer new characteristics of the data, previously undetected by other methods. We also investigate local measures, being able to locate speci c events, using complexity pro les. Furthermore, we present and explore a method based on complexity pro les to detect and visualize genomic rearrangements between sequences, identifying several insights of the genomic evolution of humans. Finally, we introduce the concept of relative uniqueness and apply it to the Ebolavirus, identifying three regions that appear in all the virus sequences outbreak but nowhere in the human genome. In fact, we show that these sequences are su cient to classify di erent sub-species. Also, we identify regions in human chromosomes that are absent from close primates DNA, specifying novel traits in human uniqueness.As sequências genómicas podem ser vistas como grandes mensagens codificadas, descrevendo a maior parte da estrutura de todos os organismos vivos. Desde a apresentação da primeira sequência, um enorme número de dados genómicos tem sido gerado, com diversas características, originando um sério problema de excesso de dados nos principais centros de genómica. Por esta razão, a maioria dos dados é descartada (quando possível), enquanto outros são comprimidos usando algoritmos genéricos, quase sempre obtendo resultados de compressão modestos. Têm também sido propostos alguns algoritmos de compressão para sequências genómicas, mas infelizmente apenas alguns estão disponíveis como ferramentas eficientes e prontas para utilização. Destes, a maioria tem sido utilizada para propósitos específicos. Nesta tese, propomos um compressor para sequências genómicas de natureza múltipla, capaz de funcionar em modo referencial ou sem referência. Além disso, é bastante flexível e pode lidar com diversas especificações de hardware. O compressor usa uma mistura de modelos de contexto-finito (FCMs) e FCMs estendidos. Os resultados mostram melhorias relativamente a compressores estado-dearte. Uma vez que o compressor pode ser visto como um método não supervisionado, que não utiliza alinhamentos para estimar a complexidade algortímica das sequências genómicas, ele é o candidato ideal para realizar análise de e entre sequências. Em conformidade, definimos uma maneira de aproximar directamente a distância de informação normalizada (NID), visando a identificação evolucionária de similaridades em intra e interespécies. Além disso, introduzimos um novo conceito, a compressão relativa normalizada (NRC), que é capaz de quantificar e inferir novas características nos dados, anteriormente indetectados por outros métodos. Investigamos também medidas locais, localizando eventos específicos, usando perfis de complexidade. Propomos e exploramos um novo método baseado em perfis de complexidade para detectar e visualizar rearranjos genómicos entre sequências, identificando algumas características da evolução genómica humana. Por último, introduzimos um novo conceito de singularidade relativa e aplicamo-lo ao Ebolavirus, identificando três regiões presentes em todas as sequências do surto viral, mas ausentes do genoma humano. De facto, mostramos que as três sequências são suficientes para classificar diferentes sub-espécies. Também identificamos regiões nos cromossomas humanos que estão ausentes do ADN de primatas próximos, especificando novas características da singularidade humana

    Compression algorithms for biomedical signals and nanopore sequencing data

    Get PDF
    The massive generation of biological digital information creates various computing challenges such as its storage and transmission. For example, biomedical signals, such as electroencephalograms (EEG), are recorded by multiple sensors over long periods of time, resulting in large volumes of data. Another example is genome DNA sequencing data, where the amount of data generated globally is seeing explosive growth, leading to increasing needs for processing, storage, and transmission resources. In this thesis we investigate the use of data compression techniques for this problem, in two different scenarios where computational efficiency is crucial. First we study the compression of multi-channel biomedical signals. We present a new lossless data compressor for multi-channel signals, GSC, which achieves compression performance similar to the state of the art, while being more computationally efficient than other available alternatives. The compressor uses two novel integer-based implementations of the predictive coding and expert advice schemes for multi-channel signals. We also develop a version of GSC optimized for EEG data. This version manages to significantly lower compression times while attaining similar compression performance for that specic type of signal. In a second scenario we study the compression of DNA sequencing data produced by nanopore sequencing technologies. We present two novel lossless compression algorithms specifically tailored to nanopore FASTQ files. ENANO is a reference-free compressor, which mainly focuses on the compression of quality scores. It achieves state of the art compression performance, while being fast and with low memory consumption when compared to other popular FASTQ compression tools. On the other hand, RENANO is a reference-based compressor, which improves on ENANO, by providing a more efficient base call sequence compression component. For RENANO two algorithms are introduced, corresponding to the following scenarios: a reference genome is available without cost to both the compressor and the decompressor; and the reference genome is available only on the compressor side, and a compacted version of the reference is included in the compressed le. Both algorithms of RENANO significantly improve the compression performance of ENANO, with similar compression times, and higher memory requirements.La generación masiva de información digital biológica da lugar a múltiples desafíos informáticos, como su almacenamiento y transmisión. Por ejemplo, las señales biomédicas, como los electroencefalogramas (EEG), son generadas por múltiples sensores registrando medidas en simultaneo durante largos períodos de tiempo, generando grandes volúmenes de datos. Otro ejemplo son los datos de secuenciación de ADN, en donde la cantidad de datos a nivel mundial esta creciendo de forma explosiva, lo que da lugar a una gran necesidad de recursos de procesamiento, almacenamiento y transmisión. En esta tesis investigamos como aplicar técnicas de compresión de datos para atacar este problema, en dos escenarios diferentes donde la eficiencia computacional juega un rol importante. Primero estudiamos la compresión de señales biomédicas multicanal. Comenzamos presentando un nuevo compresor de datos sin perdida para señales multicanal, GSC, que logra obtener niveles de compresión en el estado del arte y que al mismo tiempo es mas eficiente computacionalmente que otras alternativas disponibles. El compresor utiliza dos nuevas implementaciones de los esquemas de codificación predictiva y de asesoramiento de expertos para señales multicanal, basadas en aritmética de enteros. También presentamos una versión de GSC optimizada para datos de EEG. Esta versión logra reducir significativamente los tiempos de compresión, sin deteriorar significativamente los niveles de compresión para datos de EEG. En un segundo escenario estudiamos la compresión de datos de secuenciación de ADN generados por tecnologías de secuenciación por nanoporos. En este sentido, presentamos dos nuevos algoritmos de compresión sin perdida, específicamente diseñados para archivos FASTQ generados por tecnología de nanoporos. ENANO es un compresor libre de referencia, enfocado principalmente en la compresión de los valores de calidad de las bases. ENANO alcanza niveles de compresión en el estado del arte, siendo a la vez mas eficiente computacionalmente que otras herramientas populares de compresión de archivos FASTQ. Por otro lado, RENANO es un compresor basado en la utilización de una referencia, que mejora el rendimiento de ENANO, a partir de un nuevo esquema de compresión de las secuencias de bases. Presentamos dos variantes de RENANO, correspondientes a los siguientes escenarios: (i) se tiene a disposición un genoma de referencia, tanto del lado del compresor como del descompresor, y (ii) se tiene un genoma de referencia disponible solo del lado del compresor, y se incluye una versión compacta de la referencia en el archivo comprimido. Ambas variantes de RENANO mejoran significativamente los niveles compresión de ENANO, alcanzando tiempos de compresión similares y un mayor consumo de memoria

    MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression

    Get PDF
    Background: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1–10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. Results: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes. Conclusions: We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline. Availability: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that to run the code one needs a minimum of 16 GB of RAM. In addition, virtual box is set up on a 4GB RAM machine for users to run a simple demonstration

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Get PDF
    Evolutionary processes can be found in almost any historical, i.e. evolving, system that erroneously copies from the past. Well studied examples do not only originate in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrowing down this gap to some extend. An evolving system can be described using characters that identify their changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts we concern ourselves with some theoretical as well data driven approaches. Having a well chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of same origin within a species usually located closely, such as the well known HOX cluster. These are formed by step- wise duplication of its members, often involving unequal crossing over forming hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. Expanding gene clusters that use unequal crossing over as proposed by Walter Gehring leads to distinctive patterns of genetic distances. We show that this special class of distances helps in extracting phylogenetic information from the data still. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. This observation can be used to detect ancient genomic rearrangements of gene clus- ters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the amount of available and large digital corpora increased so did the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics lead to many studies adapting bioinformatics tools to fit linguistics means. Here, we use jAlign to calculate bigram alignments, i.e. an alignment algorithm that operates with regard to adjacency of letters. Its performance is tested in different cognate recognition tasks. Using pairwise alignments one major obstacle is the systematic errors they make such as underestimation of gaps and their misplacement. Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact, dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are consid- ered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal frame- work that gives raise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems

    Sequence complexity for biological sequence analysis

    Get PDF

    Computational pan-genomics: status, promises and challenges

    Get PDF
    International audienceMany disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Get PDF
    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up
    corecore