10,868 research outputs found

    Universal Compressed Sensing

    Full text link
    In this paper, the problem of developing universal algorithms for compressed sensing of stochastic processes is studied. First, R\'enyi's notion of information dimension (ID) is generalized to analog stationary processes. This provides a measure of complexity for such processes and is connected to the number of measurements required for their accurate recovery. Then a minimum entropy pursuit (MEP) optimization approach is proposed, and it is proven that it can reliably recover any stationary process satisfying some mixing constraints from sufficient number of randomized linear measurements, without having any prior information about the distribution of the process. It is proved that a Lagrangian-type approximation of the MEP optimization problem, referred to as Lagrangian-MEP problem, is identical to a heuristic implementable algorithm proposed by Baron et al. It is shown that for the right choice of parameters the Lagrangian-MEP algorithm, in addition to having the same asymptotic performance as MEP optimization, is also robust to the measurement noise. For memoryless sources with a discrete-continuous mixture distribution, the fundamental limits of the minimum number of required measurements by a non-universal compressed sensing decoder is characterized by Wu et al. For such sources, it is proved that there is no loss in universal coding, and both the MEP and the Lagrangian-MEP asymptotically achieve the optimal performance

    Mixing Strategies in Data Compression

    Full text link
    We propose geometric weighting as a novel method to combine multiple models in data compression. Our results reveal the rationale behind PAQ-weighting and generalize it to a non-binary alphabet. Based on a similar technique we present a new, generic linear mixture technique. All novel mixture techniques rely on given weight vectors. We consider the problem of finding optimal weights and show that the weight optimization leads to a strictly convex (and thus, good-natured) optimization problem. Finally, an experimental evaluation compares the two presented mixture techniques for a binary alphabet. The results indicate that geometric weighting is superior to linear weighting.Comment: Data Compression Conference (DCC) 201

    Compressão eficiente de sequências biológicas usando uma rede neuronal

    Get PDF
    Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of biosequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for biosequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA and amino acids models. For this purpose, we created GeCo3 and AC2, two new biosequence compressors. Both use a neural network for mixing the opinions of multiple specific models. Findings: We benchmark GeCo3 as a reference-free DNA compressor in five datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, two compilations of archaeal and virus genomes, four whole genomes, and two collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2:4%, 7:1%, 6:1%, 5:8%, and 6:0%, respectively. As a reference-based DNA compressor, we benchmark GeCo3 in four datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression in 12:4%, 11:7%, 10:8% and 10:1% over the state-of-the-art. The cost of this compression improvement is some additional computational time (1:7_ to 3:0_ slower than GeCo2). The RAM is constant, and the tool scales efficiently, independently from the sequence size. Overall, these values outperform the state-of-the-art. For AC2 the improvements and costs over AC are similar, which allows the tool to also outperform the state-of-the-art. Conclusions: The GeCo3 and AC2 are biosequence compressors with a neural network mixing approach, that provides additional gains over top specific biocompressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 and AC2 are released under GPLv3 and are available for free download at https://github.com/cobilab/geco3 and https://github.com/cobilab/ac2.Contexto: O aumento da produção de dados genómicos levou a uma maior necessidade de modelos que possam lidar de forma eficiente com a compressão sem perdas de biosequências. Aplicações importantes incluem armazenamento de longo prazo e análise de dados baseada em compressão. Na literatura, apenas alguns artigos recentes propõem o uso de uma rede neuronal para compressão de biosequências. No entanto, os resultados ficam aquém quando comparados com ferramentas de compressão de ADN específicas, como o GeCo2. Essa limitação deve-se à ausência de modelos específicos para sequências de ADN. Neste trabalho, combinamos o poder de uma rede neuronal com modelos específicos de ADN e aminoácidos. Para isso, criámos o GeCo3 e o AC2, dois novos compressores de biosequências. Ambos usam uma rede neuronal para combinar as opiniões de vários modelos específicos. Resultados: Comparamos o GeCo3 como um compressor de ADN sem referência em cinco conjuntos de dados, incluindo um conjunto de dados balanceado de sequências de ADN, o cromossoma Y e o mitogenoma humano, duas compilações de genomas de arqueas e vírus, quatro genomas inteiros e duas coleções de dados FASTQ de um viroma humano e ADN antigo. O GeCo3 atinge uma melhoria sólida na compressão em relação à versão anterior (GeCo2) de 2,4%, 7,1%, 6,1%, 5,8% e 6,0%, respectivamente. Como um compressor de ADN baseado em referência, comparamos o GeCo3 em quatro conjuntos de dados constituídos pela compressão aos pares dos cromossomas dos genomas de vários primatas. O GeCo3 melhora a compressão em 12,4%, 11,7%, 10,8% e 10,1% em relação ao estado da arte. O custo desta melhoria de compressão é algum tempo computacional adicional (1,7 _ a 3,0 _ mais lento do que GeCo2). A RAM é constante e a ferramenta escala de forma eficiente, independentemente do tamanho da sequência. De forma geral, os rácios de compressão superam o estado da arte. Para o AC2, as melhorias e custos em relação ao AC são semelhantes, o que permite que a ferramenta também supere o estado da arte. Conclusões: O GeCo3 e o AC2 são compressores de sequências biológicas com uma abordagem de mistura baseada numa rede neuronal, que fornece ganhos adicionais em relação aos biocompressores específicos de topo. O método de mistura proposto é portátil, exigindo apenas as probabilidades dos modelos como entradas, proporcionando uma fácil adaptação a outros compressores de dados ou ferramentas de análise baseadas em compressão. O GeCo3 e o AC2 são distribuídos sob GPLv3 e estão disponíveis para download gratuito em https://github.com/ cobilab/geco3 e https://github.com/cobilab/ac2.Mestrado em Engenharia de Computadores e Telemátic

    Universal Codes from Switching Strategies

    Get PDF
    We discuss algorithms for combining sequential prediction strategies, a task which can be viewed as a natural generalisation of the concept of universal coding. We describe a graphical language based on Hidden Markov Models for defining prediction strategies, and we provide both existing and new models as examples. The models include efficient, parameterless models for switching between the input strategies over time, including a model for the case where switches tend to occur in clusters, and finally a new model for the scenario where the prediction strategies have a known relationship, and where jumps are typically between strongly related ones. This last model is relevant for coding time series data where parameter drift is expected. As theoretical ontributions we introduce an interpolation construction that is useful in the development and analysis of new algorithms, and we establish a new sophisticated lemma for analysing the individual sequence regret of parameterised models
    corecore