71 research outputs found
GReEn: a tool for efficient compression of genome resequencing data
Research in the genomic sciences is confronted with a volume of sequencing and resequencing data that is increasing at a higher pace than data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS: it can compress sequences that GRS cannot handle, runs faster, and achieves compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz
Group Invariant Deep Representations for Image Instance Retrieval
Most image instance retrieval pipelines are based on comparison of vectors
known as global image descriptors between a query image and the database
images. Due to their success in large scale image classification,
representations extracted from Convolutional Neural Networks (CNN) are quickly
gaining ground on Fisher Vectors (FVs) as state-of-the-art global descriptors
for image instance retrieval. While CNN-based descriptors are generally
noted for good retrieval performance at lower bitrates, they nevertheless
present a number of drawbacks, including a lack of robustness to common object
transformations such as rotations, compared with their interest-point-based FV
counterparts.
In this paper, we propose a method for computing invariant global descriptors
from CNNs. Our method implements a recently proposed mathematical theory for
invariance in a sensory cortex modeled as a feedforward neural network. The
resulting global descriptors can be made invariant to multiple arbitrary
transformation groups while retaining good discriminativeness.
Based on a thorough empirical evaluation using several publicly available
datasets, we show that our method is able to significantly and consistently
improve retrieval results every time a new type of invariance is incorporated.
We also show that our method, which has few parameters, is not prone to
overfitting: improvements generalize well across datasets with different
properties with regard to invariances. Finally, we show that our descriptors
compare favourably to other state-of-the-art compact descriptors in
similar bit ranges, exceeding the highest retrieval results reported in the
literature on some datasets. A dedicated dimensionality reduction step
(quantization or hashing) may further improve the competitiveness
of the descriptors.
Adaptive Distributed Source Coding Based on Bayesian Inference
Distributed Source Coding (DSC) is an important topic in both information theory and communication. DSC exploits the correlations among sources to compress data, and it has the advantage of being simple and easy to implement. In DSC, Slepian-Wolf (S-W) and Wyner-Ziv (W-Z) are two important problems, which can be classified as lossless compression and lossy compression, respectively. Although the lower bounds of the S-W and W-Z problems have been known for many decades, code design that achieves these bounds is still an open problem.
This dissertation focuses on three DSC problems: the adaptive Slepian-Wolf decoding for two binary sources (ASWDTBS) problem, the compression of correlated temperature data of sensor network (CCTDSN) problem and the streamlined genome sequence compression using distributed source coding (SGSCUDSC) problem. For the CCTDSN and SGSCUDSC problems, the sources are converted into binary form, as in the ASWDTBS problem, for encoding. Bayesian inference is applied to all three problems. To solve these inference problems efficiently, a message passing algorithm is applied. For a discrete variable that takes a small number of values, the belief propagation (BP) algorithm implements message passing efficiently. However, the complexity of the BP algorithm increases exponentially with the number of values of the variable. Therefore, the BP algorithm can only deal with discrete variables that take a small number of values and with a limited set of continuous variables. For more complex variables, deterministic approximation methods are used. These methods, such as the variational Bayes (VB) method and the expectation propagation (EP) method, can be efficiently incorporated into the message passing algorithm.
A virtual binary asymmetric channel (BAC) was introduced to model the correlation between the source data and the side information (SI) in the ASWDTBS problem, in which two parameters must be learned. The two parameters correspond to the crossover probabilities for 0->1 and 1->0. Based on this model, a factor graph was established that includes the LDPC code, the source data, the SI and both crossover probabilities. Since the crossover probabilities are continuous variables, deterministic approximate inference methods are incorporated into the message passing algorithm. The proposed algorithm was applied to synthetic data, and the results showed that the VB-based algorithm performed much better than the EP-based algorithm and the standard BP algorithm. The poor performance of the EP-based algorithm was also analyzed.
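The channel model described above can be illustrated with a minimal Python sketch. It only simulates a binary asymmetric channel with two crossover probabilities and estimates them by counting, which requires the source to be known. It is not the dissertation's VB, EP or BP decoder, which has to learn these parameters without seeing the source. The function names, the seed and the probabilities 0.05 and 0.20 are illustrative assumptions.

```python
import random

def bac_transmit(bits, p01, p10, rng):
    """Pass bits through a binary asymmetric channel.

    p01 is the probability that a 0 is flipped to 1,
    p10 is the probability that a 1 is flipped to 0.
    """
    out = []
    for b in bits:
        flip_prob = p01 if b == 0 else p10
        out.append(b ^ 1 if rng.random() < flip_prob else b)
    return out

def estimate_crossovers(source, side_info):
    """Empirical (p01, p10) from aligned source and side-information bits."""
    n0 = sum(1 for s in source if s == 0)
    n1 = len(source) - n0
    f01 = sum(1 for s, y in zip(source, side_info) if s == 0 and y == 1)
    f10 = sum(1 for s, y in zip(source, side_info) if s == 1 and y == 0)
    return (f01 / n0 if n0 else 0.0, f10 / n1 if n1 else 0.0)

# Hypothetical parameters chosen only for illustration.
rng = random.Random(0)
source = [rng.randint(0, 1) for _ in range(100_000)]
side_info = bac_transmit(source, p01=0.05, p10=0.20, rng=rng)
p01_hat, p10_hat = estimate_crossovers(source, side_info)
print(f"estimated p01={p01_hat:.3f}, p10={p10_hat:.3f}")
```

With enough samples the two estimates approach the true crossover probabilities, which shows why a single symmetric parameter would be a poor fit when the two flip directions differ.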
For the CCTDSN problem, the temperature data were collected by Crossbow sensors. Four sensors were placed at different locations in the laboratory and their readings were sent to a common destination. The data from one sensor were used as the SI, and the data from the other three sensors were compressed. The decoding algorithm considers both spatial and temporal correlations, which take the form of a Kalman filter in the factor graph. To deal with the mixture of discrete messages and continuous (Gaussian) messages in the Kalman filter region of the factor graph, the EP algorithm was implemented so that all messages were approximated by Gaussian distributions. Test results on the wireless network indicated that the proposed algorithm outperforms the prior algorithm.
The SGSCUDSC problem consists of developing a streamlined genome sequence compression algorithm to support alternative miniaturized sequencing devices, which have limited communication, storage, and computation power. Existing techniques that require a heavy client (encoder side) cannot be applied. To tackle this challenge, the DSC theory was carefully examined, and a customized reference-based genome compression protocol was developed to meet the low-complexity need at the client side. Based on the variation between the source and the SI, this protocol adaptively selects either syndrome coding or hash coding to compress code subsequences of variable length. The experimental results showed promising performance compared with the state-of-the-art algorithm (GRS).
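Syndrome coding, one of the two modes mentioned above, can be sketched with a toy example. The code below uses a Hamming(7,4) code in place of the LDPC codes named in the abstract, purely for brevity. It assumes the side information differs from the source block in at most one of seven bit positions. The function names are hypothetical.

```python
def syndrome(bits):
    """3-bit syndrome of a 7-bit word under the Hamming(7,4) parity-check
    matrix whose i-th column (1-indexed) is the binary representation of i."""
    s = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            s ^= position
    return s

def sw_encode(x):
    """Encoder: send only the syndrome (3 bits instead of 7)."""
    return syndrome(x)

def sw_decode(s_x, y):
    """Decoder: recover x from its syndrome and side information y,
    assuming x and y differ in at most one position."""
    diff = s_x ^ syndrome(y)   # syndrome of the error pattern x XOR y
    x_hat = list(y)
    if diff:                   # a nonzero value is the 1-indexed error position
        x_hat[diff - 1] ^= 1
    return x_hat

x = [1, 0, 1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 0, 0, 1]      # side information: one bit differs from x
assert sw_decode(sw_encode(x), y) == x
```

The encoder never sees the side information and does almost no work, which matches the low-complexity client requirement. The correlation is exploited entirely at the decoder.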
An Adaptive Coding Pass Scanning Algorithm for Optimal Rate Control in Biomedical Images
High-efficiency, high-quality biomedical image compression is desirable, especially for telemedicine applications. This paper presents an adaptive coding pass scanning (ACPS) algorithm for optimal rate control. It identifies the significant portions of an image and discards insignificant ones as early as possible. As a result, waste of computational power and memory space can be avoided. We replace the benchmark algorithm known as postcompression rate distortion (PCRD) with ACPS. Experimental results show that ACPS is preferable to PCRD in terms of the rate distortion curve and computation time.
Foundations of Information-Flow Control and Effects
In programming language research, information-flow control (IFC) is a technique for enforcing a variety of security aspects, such as confidentiality of data, on programs. This Licentiate thesis makes novel contributions to the theory and foundations of IFC in the following ways: Chapter A presents a new proof method for showing the usual desired property of noninterference; Chapter B shows how to securely extend the concurrent IFC language MAC with asynchronous exceptions; and Chapter C presents a new and simpler language for IFC with effects based on an explicit separation of pure and effectful computations.
A visualization tool to explore alphabet orderings for the Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) is an efficient invertible text
transformation algorithm that tends to group identical
characters together in runs and enables search of the text. This
transformation has extensive uses, particularly in lossless compression
algorithms, indexing, and within bioinformatics for sequence alignment tasks.
There has been recent interest in minimizing the number of identical character
runs (r) for a transform and in finding useful alphabet orderings for the
sorting step of the matrix associated with the BWT construction. This motivates
the inspection of many transforms while developing algorithms. However, the
full Burrows-Wheeler matrix is O(n^2) space and therefore very difficult to
display and inspect for large input sizes. In this paper we present a graphical
user interface (GUI) for working with BWTs, which includes features for
searching for matrix row prefixes, skipping over sections in the right-most
column (the transform), and displaying BWTs while exploring alphabet orderings
with the goal of minimizing the number of runs.
Comment: 8 pages, 2 figures
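The effect of the alphabet ordering on the number of runs can be seen in a small Python sketch. It builds the full rotation matrix, so it uses quadratic space and only suits short inputs. It is a brute-force illustration, not the GUI tool described in the abstract. The function names and the `$` sentinel are assumptions.

```python
from itertools import permutations

def bwt(text, alphabet_order=None, sentinel="$"):
    """BWT via sorted rotations, with a configurable alphabet ordering.

    alphabet_order lists the symbols of text from smallest to largest;
    the sentinel always sorts first. Every symbol of text must be listed.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    if alphabet_order is None:
        alphabet_order = sorted(set(text))
    rank = {sentinel: 0}
    for i, ch in enumerate(alphabet_order, start=1):
        rank[ch] = i
    s = text + sentinel
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=lambda rot: [rank[c] for c in rot])
    return "".join(rot[-1] for rot in rotations)  # right-most column

def count_runs(s):
    """Number of maximal runs of identical characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

def best_ordering(text):
    """Exhaustive search over orderings; feasible only for tiny alphabets."""
    return min(permutations(sorted(set(text))),
               key=lambda order: count_runs(bwt(text, order)))

print(bwt("banana"), count_runs(bwt("banana")))
print(best_ordering("banana"))
```

The exhaustive search grows factorially with the alphabet size, which is one reason interactive exploration of orderings is useful.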
Compressão e análise de dados genómicos (Compression and analysis of genomic data)
Doctorate in Informatics
Genomic sequences are large codified messages describing most of the structure
of all known living organisms. Since the presentation of the first genomic
sequence, a huge amount of genomic data has been generated,
with diversified characteristics, rendering the data deluge phenomenon a
serious problem in most genomics centers. As such, most of the data are
discarded (when possible), while others are compressed using general-purpose
algorithms, often attaining modest data reduction results.
Several specific algorithms have been proposed for the compression of genomic
data, but unfortunately only a few of them have been made available
as usable and reliable compression tools. Of those, most have been developed
for some specific purpose. In this thesis, we propose a compressor
for genomic sequences of multiple natures, able to function in a reference
or reference-free mode. Besides, it is very flexible and can cope with diverse
hardware specifications. It uses a mixture of finite-context models (FCMs)
and eXtended FCMs. The results show improvements over state-of-the-art
compressors.
Since the compressor can be seen as an unsupervised alignment-free method
to estimate the algorithmic complexity of genomic sequences, it is the ideal
candidate to perform analysis of and between sequences. Accordingly, we
define a way to approximate directly the Normalized Information Distance,
aiming to identify evolutionary similarities within and between species. Moreover,
we introduce a new concept, the Normalized Relative Compression,
that is able to quantify and infer new characteristics of the data, previously
undetected by other methods. We also investigate local measures, being
able to locate specific events, using complexity profiles. Furthermore, we
present and explore a method based on complexity profiles to detect and
visualize genomic rearrangements between sequences, identifying several insights
into the genomic evolution of humans.
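A rough idea of how a finite-context model yields a relative compression measure can be sketched in Python. The sketch uses a single order-k model with additive smoothing, trained on the reference only and then frozen. The thesis uses a mixture of FCMs and extended FCMs, which this does not reproduce. The normalization (bits divided by the number of coded symbols times log2 of the alphabet size) is our assumption about how the Normalized Relative Compression is defined. The function names and the parameters k and alpha are hypothetical.

```python
import math

def train_fcm(reference, k, alphabet="ACGT"):
    """Count which symbol follows each length-k context in the reference."""
    counts = {}
    for i in range(k, len(reference)):
        context, symbol = reference[i - k:i], reference[i]
        table = counts.setdefault(context, dict.fromkeys(alphabet, 0))
        table[symbol] += 1
    return counts

def relative_bits(target, counts, k, alphabet="ACGT", alpha=1.0):
    """Ideal code length (bits) of target under the frozen reference model."""
    total = 0.0
    for i in range(k, len(target)):
        table = counts.get(target[i - k:i], {})
        n_sym = table.get(target[i], 0)
        n_ctx = sum(table.values())
        p = (n_sym + alpha) / (n_ctx + alpha * len(alphabet))
        total -= math.log2(p)
    return total

def nrc(target, reference, k=3, alphabet="ACGT"):
    """Assumed normalization: bits per coded symbol over log2(alphabet size)."""
    coded = len(target) - k    # the first k symbols only serve as context
    if coded <= 0:
        raise ValueError("target must be longer than k")
    bits = relative_bits(target, train_fcm(reference, k, alphabet), k, alphabet)
    return bits / (coded * math.log2(len(alphabet)))

reference = "ACGT" * 50
print(nrc("ACGT" * 50, reference))  # target well described by the reference
print(nrc("AAAA" * 50, reference))  # target whose contexts the reference never saw
```

A value near 0 means the reference explains the target well. A value near 1 means the reference gives no help beyond a uniform code.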
Finally, we introduce the concept of relative uniqueness and apply it to the
Ebolavirus, identifying three regions that appear in all the virus sequences
from the outbreak but nowhere in the human genome. In fact, we show that these
sequences are sufficient to classify different sub-species. Also, we identify
regions in human chromosomes that are absent from the DNA of closely related primates,
specifying novel traits in human uniqueness.
Genomic sequences can be seen as large coded messages describing most of the
structure of all living organisms. Since the presentation of the first
sequence, an enormous amount of genomic data has been generated, with diverse
characteristics, creating a serious problem of data excess in the main
genomics centers. For this reason, most of the data are discarded (when
possible), while others are compressed using generic algorithms, almost always
obtaining modest compression results.
Some compression algorithms for genomic sequences have also been proposed, but
unfortunately only a few are available as efficient, ready-to-use tools. Of
these, most have been used for specific purposes. In this thesis, we propose a
compressor for genomic sequences of multiple natures, able to work in
reference or reference-free mode. In addition, it is quite flexible and can
handle diverse hardware specifications. The compressor uses a mixture of
finite-context models (FCMs) and extended FCMs. The results show improvements
over state-of-the-art compressors.
Since the compressor can be seen as an unsupervised method that does not use
alignments to estimate the algorithmic complexity of genomic sequences, it is
the ideal candidate for analysis of and between sequences. Accordingly, we
define a way to directly approximate the Normalized Information Distance
(NID), aiming to identify evolutionary similarities within and between
species. Moreover, we introduce a new concept, the Normalized Relative
Compression (NRC), which is able to quantify and infer new characteristics in
the data previously undetected by other methods. We also investigate local
measures, locating specific events using complexity profiles. We propose and
explore a new method based on complexity profiles to detect and visualize
genomic rearrangements between sequences, identifying some characteristics of
human genomic evolution.
Finally, we introduce a new concept of relative uniqueness and apply it to
the Ebolavirus, identifying three regions present in all sequences of the
viral outbreak but absent from the human genome. In fact, we show that the
three sequences are sufficient to classify different sub-species. We also
identify regions in human chromosomes that are absent from the DNA of closely
related primates, specifying new characteristics of human uniqueness.
- …