Compressing Proteomes: The Relevance of Medium Range Correlations
We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and medium range, specifically between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are better able to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.
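One way to see whether residues 10 or 100 positions apart carry exploitable statistical structure is to measure the empirical mutual information between symbols at a given lag; the abstract does not specify its statistical machinery, so the following is only an illustrative sketch with a hypothetical helper name:

```python
from collections import Counter
from math import log2

def lag_mutual_information(seq, lag):
    """Empirical mutual information (bits) between symbols `lag` positions
    apart. Higher values mean a model conditioning on that range has more
    redundancy to exploit for compression."""
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(c / n * log2((c / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), c in joint.items())
```

On a perfectly alternating two-symbol sequence the lag-1 mutual information is close to its 1-bit maximum, while on an i.i.d. random sequence it approaches zero.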
A joint motion & disparity motion estimation technique for 3D integral video compression using evolutionary strategy
3D imaging techniques have the potential to establish a future mass market in the fields of entertainment and communications. Integral imaging, which can capture true 3D color images with only one camera, has been seen as the right technology to offer stress-free viewing to audiences of more than one person. Just like any digital video, 3D video sequences must also be compressed in order to make them suitable for consumer-domain applications. However, ordinary compression techniques found in state-of-the-art video coding standards such as H.264, MPEG-4 and MPEG-2 are not capable of producing enough compression while preserving the 3D cues. Fortunately, a huge amount of redundancy can be found in an integral video sequence in terms of motion and disparity. This paper discusses a novel approach that uses both motion and disparity information to compress 3D integral video sequences. We propose to decompose the integral video sequence into viewpoint video sequences and jointly exploit motion and disparity redundancies to maximize the compression. We further propose an optimization technique based on evolutionary strategies to minimize the computational complexity of the joint motion-disparity estimation. Experimental results demonstrate that joint motion and disparity estimation can achieve over 1 dB objective quality gain over normal motion estimation. When combined with the evolutionary strategy, it can achieve up to 94% computational cost saving.
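The abstract does not detail the evolutionary strategy it uses, but the core idea of replacing an exhaustive block search with a mutate-and-select loop can be sketched as a generic (1+1)-ES over a displacement vector; `es_search` and the quadratic stand-in cost are hypothetical names, not the paper's method:

```python
import random

def es_search(cost, start, steps=200, sigma=2.0, seed=0):
    """(1+1) evolution strategy over integer 2D displacement vectors
    (standing in for a joint motion/disparity offset), minimising `cost`
    with far fewer evaluations than a full search window."""
    rng = random.Random(seed)
    best, best_c = start, cost(start)
    for _ in range(steps):
        # mutate the parent with Gaussian noise, rounded to pixel offsets
        cand = (best[0] + round(rng.gauss(0, sigma)),
                best[1] + round(rng.gauss(0, sigma)))
        c = cost(cand)
        if c <= best_c:          # keep the better of parent and child
            best, best_c = cand, c
    return best, best_c

# a quadratic stand-in for a real block-matching SAD cost surface
best, best_cost = es_search(lambda v: (v[0] - 3) ** 2 + (v[1] + 2) ** 2, (0, 0))
```

In a real codec the cost function would be the sum of absolute differences between the current block and the candidate block in the reference viewpoint or frame.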
Using semantic knowledge to improve compression on log files
With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produces log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, many other compression programs exist, each with its own advantages and disadvantages. These programs use different amounts of memory and take different compression and decompression times to achieve different compression ratios. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually use a similar format with a defined syntax. In the log files, not all ASCII characters are used, and the messages contain certain "phrases" which are often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying the results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters.
In this thesis, data compression is shown to be an effective method of data reduction, producing up to 98 percent reduction in file size on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size.
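The binary-equivalent preprocessing idea can be illustrated for IP addresses: a dotted-quad address occupies 7 to 15 text bytes but packs into 4. The sketch below (hypothetical names, not the thesis code) marks each packed address with an escape byte and compares the general-purpose compressor's output with and without the preprocessing:

```python
import gzip
import re
import socket

IP = re.compile(rb"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def pack_ips(line: bytes) -> bytes:
    """Replace each dotted-quad IP with an escape byte plus its 4-byte
    packed form, shrinking 7-15 text bytes to 5."""
    return IP.sub(lambda m: b"\x1b" + socket.inet_aton(m.group().decode()), line)

# a synthetic maillog-like corpus with a repeated message template
log = b"\n".join(
    b"Oct 11 22:14:%02d host sshd[412]: Failed password for root from 10.0.%d.%d"
    % (i % 60, i % 256, (i * 7) % 256)
    for i in range(500)
)
raw, pre = gzip.compress(log), gzip.compress(pack_ips(log))
print(len(log), len(raw), len(pre))
```

A production preprocessor would also need the inverse transform for lossless round-tripping, and a template-aware pass for timestamps and dictionary words.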
Adaptive edge-based prediction for lossless image compression
Many lossless image compression methods have been suggested, with established results that are hard to surpass. However, some aspects can still be considered to improve performance further. This research focuses on the two-phase prediction-encoding method, studying each phase separately and suggesting new techniques.
In the prediction module, the proposed Edge-Based Predictor (EBP) and Least-Squares Edge-Based Predictor (LS-EBP) emphasize image edges and make predictions accordingly. EBP is a gradient-based nonlinear adaptive predictor that switches between prediction rules based on a few threshold parameters, automatically determined by a pre-analysis procedure which makes a first pass. LS-EBP also uses these parameters, but optimizes the prediction for each edge location assigned during pre-analysis, thus applying the least-squares approach only at the edge points.
For the encoding module, a novel method inspired by the Burrows-Wheeler Transform (BWT) is suggested, which performs better than applying the BWT directly to the images. We also present a context-based adaptive error modeling and encoding scheme. When coupled with the above prediction schemes, the result is the best-known compression performance among compression schemes of the same time and space complexity.
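The abstract does not give EBP's exact switching rules, but the general idea of a predictor that chooses between neighbours based on local gradients is well illustrated by the median edge detector from JPEG-LS, a simpler relative of gradient-switched prediction:

```python
def med_predict(w, n, nw):
    """Median edge detector (as in JPEG-LS): predicts the current pixel
    from its west (w), north (n) and north-west (nw) neighbours,
    switching rules according to the implied local edge direction."""
    if nw >= max(w, n):
        return min(w, n)      # edge detected: take the neighbour across it
    if nw <= min(w, n):
        return max(w, n)
    return w + n - nw         # smooth region: planar (gradient) prediction
```

EBP differs in that its thresholds come from a data-driven pre-analysis pass, and LS-EBP further refits the predictor by least squares at the detected edge points; the sketch above only shows the shared gradient-switching principle.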
Distributed Source Coding Techniques for Lossless Compression of Hyperspectral Images
This paper deals with the application of distributed source coding (DSC) theory to remote sensing image compression. Although DSC exhibits significant potential in many application fields, the results obtained so far on real signals fall short of the theoretical bounds and often impose additional system-level constraints. The objective of this paper is to assess the potential of DSC for lossless image compression carried out onboard a remote platform. We first provide a brief overview of DSC of correlated information sources. We then focus on onboard lossless image compression, and apply DSC techniques in order to reduce the complexity of the onboard encoder, at the expense of the decoder's, by exploiting the correlation of different bands of a hyperspectral dataset. Specifically, we propose two different compression schemes, one based on powerful binary error-correcting codes employed as source codes, and one based on simpler multilevel coset codes. The performance of both schemes is evaluated on a few AVIRIS scenes and compared with other state-of-the-art 2D and 3D coders. Both schemes turn out to achieve competitive compression performance, and one of them also has reduced complexity. Based on these results, we highlight the main issues that still need to be solved to further improve the performance of DSC-based remote sensing systems.
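The multilevel coset idea can be shown in miniature: the encoder transmits only a coset index (a modulo residue) of each pixel value, and the decoder resolves the ambiguity using a correlated band as side information. This toy sketch (hypothetical function names, scalar values instead of the paper's actual codes) assumes the inter-band difference stays below half the coset spacing:

```python
def coset_encode(x, m):
    """Encoder sends only the coset index of x: log2(m) bits
    instead of a full-precision sample."""
    return x % m

def coset_decode(syndrome, side_info, m):
    """Decoder picks the member of the coset {syndrome + k*m} closest to
    its side information (a correlated band); decoding is exact whenever
    |x - side_info| < m / 2."""
    k = round((side_info - syndrome) / m)
    return syndrome + k * m
```

This is why the scheme moves complexity to the decoder: the encoder does almost no work, while the decoder must model the inter-band correlation well enough for the coset spacing to be safe.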
Compressão e análise de dados genómicos (Compression and Analysis of Genomic Data)
Doctoral thesis in Informatics. Genomic sequences are large coded messages describing most of the structure of all known living organisms. Since the presentation of the first genomic sequence, a huge amount of genomic data has been generated, with diversified characteristics, rendering the data-deluge phenomenon a serious problem in most genomics centers. As such, most of the data are discarded (when possible), while the rest are compressed using general-purpose algorithms, often attaining modest data-reduction results.
Several specific algorithms have been proposed for the compression of genomic data, but unfortunately only a few of them have been made available as usable and reliable compression tools, and most of those were developed for some specific purpose. In this thesis, we propose a compressor for genomic sequences of multiple natures, able to function in a reference or reference-free mode. Besides, it is very flexible and can cope with diverse hardware specifications. It uses a mixture of finite-context models (FCMs) and eXtended FCMs. The results show improvements over state-of-the-art compressors.
Since the compressor can be seen as an unsupervised, alignment-free method to estimate the algorithmic complexity of genomic sequences, it is the ideal candidate to perform analysis of and between sequences. Accordingly, we define a way to approximate directly the Normalized Information Distance, aiming to identify evolutionary similarities within and between species. Moreover, we introduce a new concept, the Normalized Relative Compression, that is able to quantify and infer new characteristics of the data, previously undetected by other methods. We also investigate local measures, able to locate specific events using complexity profiles. Furthermore, we present and explore a method based on complexity profiles to detect and visualize genomic rearrangements between sequences, yielding several insights into the genomic evolution of humans.
Finally, we introduce the concept of relative uniqueness and apply it to the Ebolavirus, identifying three regions that appear in all the outbreak virus sequences but nowhere in the human genome. In fact, we show that these sequences are sufficient to classify different sub-species. Also, we identify regions in human chromosomes that are absent from the DNA of close primates, specifying novel traits of human uniqueness.
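A minimal sketch of the finite-context-model idea underlying this line of work, assuming a simple order-k model with add-one smoothing (the thesis compressor mixes several such models and is far more elaborate):

```python
from collections import defaultdict
from math import log2

class FCM:
    """Order-k finite-context model. Summing per-symbol code lengths
    approximates the compressed size C(x), the quantity behind measures
    such as the Normalized Relative Compression,
    NRC(x||y) = C(x||y) / (|x| * log2 |alphabet|)."""

    def __init__(self, k, alphabet):
        self.k, self.a = k, len(alphabet)
        self.counts = defaultdict(lambda: defaultdict(int))

    def bits(self, seq, update=True):
        """Total code length of seq in bits. Pass update=False to measure
        seq against a model trained on a reference sequence, i.e. the
        relative-compression setting."""
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx, sym = seq[i - self.k:i], seq[i]
            c = self.counts[ctx]
            total += -log2((c[sym] + 1) / (sum(c.values()) + self.a))
            if update:
                c[sym] += 1
        return total
```

A highly repetitive DNA string costs far fewer bits per symbol than the 2-bit ceiling of a four-letter alphabet, which is exactly the signal the complexity profiles exploit.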
A new multistage lattice vector quantization with adaptive subband thresholding for image compression
Lattice vector quantization (LVQ) reduces coding complexity and computation due to its regular structure. A new multistage LVQ (MLVQ) using an adaptive subband thresholding technique is presented and applied to image compression. The technique concentrates on reducing the quantization error of the quantized vectors by "blowing out" the residual quantization errors with an LVQ scale factor. The significant coefficients of each subband are identified using an optimum adaptive thresholding scheme for each subband. A variable-length coding procedure using Golomb codes is used to compress the codebook index, which produces a very efficient and fast technique for entropy coding. Experimental results show the MLVQ to be significantly better than JPEG 2000 and recent VQ techniques for various test images.
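Golomb coding, used here for the codebook indices, is simple enough to sketch in full; this is a textbook encoder (unary quotient plus truncated-binary remainder), not the paper's implementation:

```python
def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m:
    a unary quotient terminated by '0', then a truncated-binary remainder.
    Short codes go to small n, which suits geometrically distributed
    codebook indices."""
    if m == 1:                        # degenerate case: pure unary
        return "1" * n + "0"
    q, r = divmod(n, m)
    out = "1" * q + "0"
    b = (m - 1).bit_length()          # ceil(log2 m)
    cutoff = (1 << b) - m
    if r < cutoff:
        out += format(r, "b").zfill(b - 1)
    else:
        out += format(r + cutoff, "b").zfill(b)
    return out
```

When m is a power of two this reduces to a Rice code, whose remainder is a plain fixed-width field and is particularly cheap to implement.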