    GReEn: a tool for efficient compression of genome resequencing data

    Research in the genomic sciences is confronted with a volume of sequencing and resequencing data that grows faster than data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to store sequencing and resequencing data efficiently is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, it runs faster, and it achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/∼ap/codecs/GReEn1.tar.gz
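
The idea of compressing against a reference can be illustrated with a minimal sketch. This is not GReEn's algorithm, which the abstract does not describe; it is a deliberately simplified scheme that assumes the target and the reference have equal length and differ only by substitutions. The function names and the toy sequences are hypothetical.

```python
def encode_against_reference(reference: str, target: str) -> list[tuple[int, str]]:
    """Record only the positions where the target differs from the reference."""
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (r, t) in enumerate(zip(reference, target)) if r != t]


def decode_with_reference(reference: str, edits: list[tuple[int, str]]) -> str:
    """Rebuild the target by applying the recorded substitutions to the reference."""
    bases = list(reference)
    for position, base in edits:
        bases[position] = base
    return "".join(bases)


# Toy data (hypothetical): two substitutions relative to the reference.
reference = "ACGTACGTACGT"
target = "ACGTTCGTACGA"
edits = encode_against_reference(reference, target)
restored = decode_with_reference(reference, edits)
```

When the resequenced genome is close to the reference, the edit list is far shorter than the sequence itself, which is the source of the large gains that reference-based tools can reach. A practical tool would also have to handle insertions, deletions, and entropy coding of the edits.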

    Group Invariant Deep Representations for Image Instance Retrieval

    Most image instance retrieval pipelines are based on comparing vectors known as global image descriptors between a query image and the database images. Due to their success in large scale image classification, representations extracted from Convolutional Neural Networks (CNN) are quickly gaining ground on Fisher Vectors (FVs) as state-of-the-art global descriptors for image instance retrieval. While CNN-based descriptors are generally noted for good retrieval performance at lower bitrates, they nevertheless present a number of drawbacks, including a lack of robustness to common object transformations such as rotations when compared with their interest point based FV counterparts. In this paper, we propose a method for computing invariant global descriptors from CNNs. Our method implements a recently proposed mathematical theory for invariance in a sensory cortex modeled as a feedforward neural network. The resulting global descriptors can be made invariant to multiple arbitrary transformation groups while retaining good discriminativeness. Based on a thorough empirical evaluation using several publicly available datasets, we show that our method is able to significantly and consistently improve retrieval results every time a new type of invariance is incorporated. We also show that our method, which has few parameters, is not prone to overfitting: improvements generalize well across datasets with different properties with regard to invariances. Finally, we show that our descriptors compare favourably to other state-of-the-art compact descriptors at similar bitrates, exceeding the highest retrieval results reported in the literature on some datasets. A dedicated dimensionality reduction step (quantization or hashing) may be able to further improve the competitiveness of the descriptors.
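
One common way to obtain a descriptor that is invariant to a finite transformation group is to pool a base descriptor over all transformed copies of the input. The sketch below illustrates that general principle for the group of 90-degree rotations; it is not the paper's pipeline. `raw_descriptor` is a hypothetical stand-in for a CNN feature extractor, and mean pooling is an assumed choice of pooling function.

```python
from statistics import fmean

Image = list[list[float]]


def rotate90(image: Image) -> Image:
    """Rotate a square image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def raw_descriptor(image: Image) -> list[float]:
    """Hypothetical feature extractor: position-weighted sums, not rotation invariant."""
    n = len(image)
    row_feature = sum(i * image[i][j] for i in range(n) for j in range(n))
    col_feature = sum(j * image[i][j] for i in range(n) for j in range(n))
    return [float(row_feature), float(col_feature)]


def orbit(image: Image) -> list[Image]:
    """All four rotations of the image (the orbit under the rotation group)."""
    images = [image]
    for _ in range(3):
        images.append(rotate90(images[-1]))
    return images


def invariant_descriptor(image: Image) -> list[float]:
    """Average the raw descriptor over the orbit, component by component."""
    descriptors = [raw_descriptor(rotated) for rotated in orbit(image)]
    return [fmean(component) for component in zip(*descriptors)]


image = [[1.0, 2.0], [3.0, 4.0]]
rotated = rotate90(image)
```

Because a rotated image has the same orbit as the original, `invariant_descriptor(image)` and `invariant_descriptor(rotated)` agree even though the raw descriptors differ. Plain averaging discards a lot of information, so a real system would need a richer pooling scheme to stay discriminative.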

    Adaptive Distributed Source Coding Based on Bayesian Inference

    Distributed Source Coding (DSC) is an important topic in both information theory and communication. DSC utilizes the correlations among the sources to compress data, and it has the advantages of being simple and easy to carry out. In DSC, Slepian-Wolf (S-W) and Wyner-Ziv (W-Z) are two important problems, which can be classified as lossless compression and lossy compression, respectively. Although the lower bounds of the S-W and W-Z problems have been known to researchers for many decades, the code design to achieve the lower bounds is still an open problem. This dissertation focuses on three DSC problems: the adaptive Slepian-Wolf decoding for two binary sources (ASWDTBS) problem, the compression of correlated temperature data of sensor network (CCTDSN) problem and the streamlined genome sequence compression using distributed source coding (SGSCUDSC) problem. For the CCTDSN and SGSCUDSC problems, the sources are converted into binary form, like the sources in the ASWDTBS problem, before encoding. Bayesian inference is applied to all three problems. To solve these inference problems efficiently, a message passing algorithm is applied. For a discrete variable that takes a small number of values, the belief propagation (BP) algorithm is able to implement the message passing algorithm efficiently. However, the complexity of the BP algorithm increases exponentially with the number of values of the variable. Therefore, the BP algorithm can only deal with discrete variables that take a small number of values and with limited continuous variables. For more complex variables, deterministic approximation methods are used. These methods, such as the variational Bayes (VB) method and the expectation propagation (EP) method, can be efficiently incorporated into the message passing algorithm. 
A virtual binary asymmetric channel (BAC) was introduced to model the correlation between the source data and the side information (SI) in the ASWDTBS problem, in which two parameters must be learned. The two parameters are the crossover probabilities for the 0->1 and 1->0 transitions. Based on this model, a factor graph was established that includes the LDPC code, the source data, the SI and both crossover probabilities. Since the crossover probabilities are continuous variables, deterministic approximate inference methods are incorporated into the message passing algorithm. The proposed algorithm was applied to synthetic data, and the results showed that the VB-based algorithm performed much better than the EP-based algorithm and the standard BP algorithm. The poor performance of the EP-based algorithm was also analyzed. For the CCTDSN problem, the temperature data were collected by Crossbow sensors. Four sensors were placed in different locations of the laboratory and their readings were sent to a common destination. The data from one sensor were used as the SI, and the data from the other three sensors were compressed. The decoding algorithm considers both spatial and temporal correlations, which take the form of a Kalman filter in the factor graph. To deal with the mixture of discrete messages and continuous (Gaussian) messages in the Kalman filter region of the factor graph, the EP algorithm was implemented so that all of the messages were approximated by Gaussian distributions. The test results on the wireless network indicate that the proposed algorithm outperforms the prior algorithm. The SGSCUDSC problem consists of developing a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require a heavy client (encoder side) cannot be applied. 
To tackle this challenge, the DSC theory was carefully examined, and a customized reference-based genome compression protocol was developed to meet the low-complexity need at the client side. Based on the variation between the source and the SI, this protocol adaptively selects either syndrome coding or hash coding to compress code subsequences of variable length. The experimental results of the proposed method showed promising performance when compared with the state-of-the-art algorithm (GRS).
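
The role of the two BAC crossover probabilities can be made concrete by computing the Slepian-Wolf limit H(X|Y) for a binary source X observed through the channel as side information Y. This is a sketch under stated assumptions, not the dissertation's decoder: the source prior `q` and both crossover probabilities are assumed known here, whereas the dissertation learns the crossover probabilities during decoding. All function and parameter names are hypothetical.

```python
from math import log2


def entropy(probabilities: list[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def conditional_entropy_bac(q: float, p01: float, p10: float) -> float:
    """H(X|Y) in bits for X ~ Bernoulli(q) sent through a binary asymmetric channel.

    p01 is the probability that a 0 is received as 1,
    p10 is the probability that a 1 is received as 0.
    """
    joint = [
        (1 - q) * (1 - p01),  # X=0, Y=0
        (1 - q) * p01,        # X=0, Y=1
        q * p10,              # X=1, Y=0
        q * (1 - p10),        # X=1, Y=1
    ]
    p_y0 = joint[0] + joint[2]
    p_y1 = joint[1] + joint[3]
    # Chain rule: H(X|Y) = H(X, Y) - H(Y)
    return entropy(joint) - entropy([p_y0, p_y1])


# Example with assumed parameter values: an asymmetric channel.
limit_bits_per_symbol = conditional_entropy_bac(q=0.5, p01=0.05, p10=0.2)
```

The result is the minimum number of bits per source symbol that a lossless encoder needs when the decoder has the side information. With equal crossover probabilities and a uniform source, it reduces to the binary entropy of the crossover probability.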

    An Adaptive Coding Pass Scanning Algorithm for Optimal Rate Control in Biomedical Images

    High-efficiency, high-quality biomedical image compression is desirable, especially for telemedicine applications. This paper presents an adaptive coding pass scanning (ACPS) algorithm for optimal rate control. It identifies the significant portions of an image and discards the insignificant ones as early as possible, which avoids wasting computational power and memory space. We replace the benchmark algorithm known as postcompression rate distortion (PCRD) with ACPS. Experimental results show that ACPS is preferable to PCRD in terms of the rate distortion curve and computation time.

    Foundations of Information-Flow Control and Effects

    In programming language research, information-flow control (IFC) is a technique for enforcing a variety of security aspects, such as confidentiality of data, on programs. This Licentiate thesis makes novel contributions to the theory and foundations of IFC in the following ways: Chapter A presents a new proof method for showing the usual desired property of noninterference; Chapter B shows how to securely extend the concurrent IFC language MAC with asynchronous exceptions; and Chapter C presents a new and simpler language for IFC with effects, based on an explicit separation of pure and effectful computations.

    A visualization tool to explore alphabet orderings for the Burrows-Wheeler Transform

    The Burrows-Wheeler Transform (BWT) is an efficient invertible text transformation algorithm with the properties of tending to group identical characters together in a run, and enabling search of the text. This transformation has extensive uses, particularly in lossless compression algorithms, indexing, and within bioinformatics for sequence alignment tasks. There has been recent interest in minimizing the number of identical character runs (r) for a transform and in finding useful alphabet orderings for the sorting step of the matrix associated with the BWT construction. This motivates the inspection of many transforms while developing algorithms. However, the full Burrows-Wheeler matrix takes O(n^2) space and is therefore very difficult to display and inspect for large input sizes. In this paper we present a graphical user interface (GUI) for working with BWTs, which includes features for searching for matrix row prefixes, skipping over sections in the right-most column (the transform), and displaying BWTs while exploring alphabet orderings with the goal of minimizing the number of runs. Comment: 8 pages, 2 figures
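
The effect of the alphabet ordering on the number of runs r can be seen with a naive construction that builds and sorts every rotation, which is exactly the O(n^2) matrix the abstract mentions. This is a minimal sketch for small inputs, not the tool described in the paper. The `$` sentinel, which is ranked below every other symbol, and the function names are assumptions made for illustration.

```python
from typing import Optional


def bwt(text: str, alphabet_order: Optional[str] = None, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations under a custom alphabet ordering."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")
    if alphabet_order is None:
        alphabet_order = "".join(sorted(set(text)))
    # The sentinel always ranks lowest; other symbols follow alphabet_order.
    rank = {sentinel: 0}
    for position, symbol in enumerate(alphabet_order, start=1):
        rank[symbol] = position
    terminated = text + sentinel
    rotations = [terminated[i:] + terminated[:i] for i in range(len(terminated))]
    rotations.sort(key=lambda rotation: [rank[symbol] for symbol in rotation])
    # The transform is the right-most column of the sorted matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(transform: str) -> int:
    """Number of maximal runs of identical characters (the quantity r)."""
    if not transform:
        return 0
    return 1 + sum(1 for a, b in zip(transform, transform[1:]) if a != b)


default_transform = bwt("banana")            # "annb$aa", 5 runs
reordered_transform = bwt("banana", "nba")   # "aaa$nnb", 4 runs
```

For this toy input, changing the ordering from `abn` to `nba` lowers r from 5 to 4. Every symbol of the text must appear in `alphabet_order`, otherwise the rank lookup raises a `KeyError`.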

    Compression and analysis of genomic data

    Doctorate in Informatics. Genomic sequences are large codified messages describing most of the structure of all known living organisms. Since the presentation of the first genomic sequence, a huge amount of genomics data has been generated, with diversified characteristics, rendering the data deluge phenomenon a serious problem in most genomics centers. As such, most of the data are discarded (when possible), while others are compressed using general purpose algorithms, often attaining modest data reduction results. Several specific algorithms have been proposed for the compression of genomic data, but unfortunately only a few of them have been made available as usable and reliable compression tools. Of those, most have been developed for some specific purpose. In this thesis, we propose a compressor for genomic sequences of multiple natures, able to function in a reference or reference-free mode. Besides, it is very flexible and can cope with diverse hardware specifications. It uses a mixture of finite-context models (FCMs) and eXtended FCMs. The results show improvements over state-of-the-art compressors. Since the compressor can be seen as an unsupervised alignment-free method to estimate the algorithmic complexity of genomic sequences, it is the ideal candidate to perform analysis of and between sequences. Accordingly, we define a way to approximate directly the Normalized Information Distance (NID), aiming to identify evolutionary similarities within and between species. Moreover, we introduce a new concept, the Normalized Relative Compression (NRC), that is able to quantify and infer new characteristics of the data, previously undetected by other methods. We also investigate local measures, being able to locate specific events, using complexity profiles. Furthermore, we present and explore a method based on complexity profiles to detect and visualize genomic rearrangements between sequences, yielding several insights into the genomic evolution of humans. 
Finally, we introduce the concept of relative uniqueness and apply it to the Ebolavirus, identifying three regions that appear in all the virus sequences from the outbreak but nowhere in the human genome. In fact, we show that these sequences are sufficient to classify different sub-species. We also identify regions in human chromosomes that are absent from the DNA of closely related primates, specifying novel traits in human uniqueness.
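
Compression-based similarity of the kind discussed in the thesis can be illustrated with the Normalized Compression Distance (NCD), a standard practical approximation of the Normalized Information Distance. The sketch swaps in the general purpose `zlib` compressor for the thesis's own FCM-based compressor and does not implement the Normalized Relative Compression. The synthetic sequences and all names are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at the highest compression level."""
    return len(zlib.compress(data, 9))


def normalized_compression_distance(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic data (hypothetical): a sequence, a lightly mutated copy, an unrelated one.
rng = random.Random(0)
seq_a = "".join(rng.choice("ACGT") for _ in range(2000))
mutated = list(seq_a)
for position in range(0, len(mutated), 200):
    mutated[position] = "N"  # mark one substitution every 200 bases
seq_b = "".join(mutated)
seq_c = "".join(rng.choice("ACGT") for _ in range(2000))

distance_related = normalized_compression_distance(seq_a.encode(), seq_b.encode())
distance_unrelated = normalized_compression_distance(seq_a.encode(), seq_c.encode())
```

The mutated copy ends up much closer to the original than the unrelated sequence does, because the compressor can reuse the first sequence when encoding the second. A general purpose compressor with a bounded window is a weak model of DNA, which is the limitation that specialised compressors such as the one proposed in the thesis are designed to address.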