28 research outputs found
Compressão e análise de dados genómicos
Doutoramento em InformáticaGenomic sequences are large codi ed messages describing most of the structure
of all known living organisms. Since the presentation of the rst genomic
sequence, a huge amount of genomics data have been generated,
with diversi ed characteristics, rendering the data deluge phenomenon a
serious problem in most genomics centers. As such, most of the data are
discarded (when possible), while other are compressed using general purpose
algorithms, often attaining modest data reduction results.
Several speci c algorithms have been proposed for the compression of genomic
data, but unfortunately only a few of them have been made available
as usable and reliable compression tools. From those, most have been developed
to some speci c purpose. In this thesis, we propose a compressor
for genomic sequences of multiple natures, able to function in a reference
or reference-free mode. Besides, it is very
exible and can cope with diverse
hardware speci cations. It uses a mixture of nite-context models (FCMs)
and eXtended FCMs. The results show improvements over state-of-the-art
compressors.
Since the compressor can be seen as a unsupervised alignment-free method
to estimate algorithmic complexity of genomic sequences, it is the ideal
candidate to perform analysis of and between sequences. Accordingly, we
de ne a way to approximate directly the Normalized Information Distance,
aiming to identify evolutionary similarities in intra- and inter-species. Moreover,
we introduce a new concept, the Normalized Relative Compression,
that is able to quantify and infer new characteristics of the data, previously
undetected by other methods. We also investigate local measures, being
able to locate speci c events, using complexity pro les. Furthermore, we
present and explore a method based on complexity pro les to detect and
visualize genomic rearrangements between sequences, identifying several insights
of the genomic evolution of humans.
Finally, we introduce the concept of relative uniqueness and apply it to the
Ebolavirus, identifying three regions that appear in all the virus sequences
outbreak but nowhere in the human genome. In fact, we show that these
sequences are su cient to classify di erent sub-species. Also, we identify
regions in human chromosomes that are absent from close primates DNA,
specifying novel traits in human uniqueness.As sequências genómicas podem ser vistas como grandes mensagens codificadas, descrevendo a maior parte da estrutura de todos os organismos
vivos. Desde a apresentação da primeira sequência, um enorme número de
dados genómicos tem sido gerado, com diversas características, originando
um sério problema de excesso de dados nos principais centros de genómica.
Por esta razão, a maioria dos dados é descartada (quando possível), enquanto
outros são comprimidos usando algoritmos genéricos, quase sempre
obtendo resultados de compressão modestos.
Têm também sido propostos alguns algoritmos de compressão para
sequências genómicas, mas infelizmente apenas alguns estão disponíveis
como ferramentas eficientes e prontas para utilização. Destes, a maioria
tem sido utilizada para propósitos específicos. Nesta tese, propomos
um compressor para sequências genómicas de natureza múltipla, capaz de
funcionar em modo referencial ou sem referência. Além disso, é bastante
flexível e pode lidar com diversas especificações de hardware. O compressor
usa uma mistura de modelos de contexto-finito (FCMs) e FCMs estendidos.
Os resultados mostram melhorias relativamente a compressores estado-dearte.
Uma vez que o compressor pode ser visto como um método não supervisionado,
que não utiliza alinhamentos para estimar a complexidade
algortímica das sequências genómicas, ele é o candidato ideal para realizar
análise de e entre sequências. Em conformidade, definimos uma maneira
de aproximar directamente a distância de informação normalizada (NID),
visando a identificação evolucionária de similaridades em intra e interespécies. Além disso, introduzimos um novo conceito, a compressão relativa
normalizada (NRC), que é capaz de quantificar e inferir novas características
nos dados, anteriormente indetectados por outros métodos. Investigamos
também medidas locais, localizando eventos específicos, usando perfis de
complexidade. Propomos e exploramos um novo método baseado em perfis de complexidade para detectar e visualizar rearranjos genómicos entre
sequências, identificando algumas características da evolução genómica humana.
Por último, introduzimos um novo conceito de singularidade relativa e
aplicamo-lo ao Ebolavirus, identificando três regiões presentes em todas
as sequências do surto viral, mas ausentes do genoma humano. De facto,
mostramos que as três sequências são suficientes para classificar diferentes
sub-espécies. Também identificamos regiões nos cromossomas humanos que
estão ausentes do ADN de primatas próximos, especificando novas características da singularidade humana
Representation and Exploitation of Event Sequences
Programa Oficial de Doutoramento en Computación . 5009V01[Abstract]
The Ten Commandments, the thirty best smartphones in the market and
the five most wanted people by the FBI. Our life is ruled by sequences:
thought sequences, number sequences, event sequences. . . a history book
is nothing more than a compilation of events and our favorite film is
just a sequence of scenes. All of them have something in common, it
is possible to acquire relevant information from them. Frequently, by
accumulating some data from the elements of each sequence we may
access hidden information (e.g. the passengers transported by a bus
on a journey is the sum of the passengers who got on in the sequence
of stops made); other times, reordering the elements by any of their
characteristics facilitates the access to the elements of interest (e.g. the
publication of books in 2019 can be ordered chronologically, by author,
by literary genre or even by a combination of characteristics); but it
will always be sought to store them in the smallest space possible.
Thus, this thesis proposes technological solutions for the storage
and subsequent processing of events, focusing specifically on three
fundamental aspects that can be found in any application that needs
to manage them: compressed and dynamic storage, aggregation
or accumulation of elements of the sequence and element sequence
reordering by their different characteristics or dimensions.
The first contribution of this work is a compact structure for the
dynamic compression of event sequences. This structure allows any
sequence to be compressed in a single pass, that is, it is capable of
compressing in real time as elements arrive. This contribution is
a milestone in the world of compression since, to date, this is the
first proposal for a variable-to-variable dynamic compressor for general purpose.
Regarding aggregation, a data warehouse-like proposal is presented
capable of storing information on any characteristic of the events in a
sequence in an aggregated, compact and accessible way. Following the
philosophy of current data warehouses, we avoid repeating cumulative
operations and speed up aggregate queries by preprocessing the
information and keeping it in this separate structure.
Finally, this thesis addresses the problem of indexing event sequences
considering their different characteristics and possible reorderings. A new
approach for simultaneously keeping the elements of a sequence ordered
by different characteristics is presented through compact structures.
Thus, it is possible to consult the information and perform operations
on the elements of the sequence using any possible rearrangement in a
simple and efficient way.[Resumen]
Los diez mandamientos, los treinta mejores móviles del mercado y las
cinco personas más buscadas por el FBI. Nuestra vida está gobernada
por secuencias: secuencias de pensamientos, secuencias de números,
secuencias de eventos. . . un libro de historia no es más que una sucesión
de eventos y nuestra película favorita no es sino una secuencia de
escenas. Todas ellas tienen algo en común, de todas podemos extraer
información relevante. A veces, al acumular algún dato de los elementos
de cada secuencia accedemos a información oculta (p. ej. los viajeros
transportados por un autobús en un trayecto es la suma de los pasajeros
que se subieron en la secuencia de paradas realizadas); otras veces, la
reordenación de los elementos por alguna de sus características facilita
el acceso a los elementos de interés (p. ej. la publicación de obras
literarias en 2019 puede ordenarse cronológicamente, por autor, por
género literario o incluso por una combinación de características); pero
siempre se buscará almacenarlas en el espacio más reducido posible sin
renunciar a su contenido.
Por ello, esta tesis propone soluciones tecnológicas para el almacenamiento
y posterior procesamiento de secuencias, centrándose
concretamente en tres aspectos fundamentales que se pueden encontrar
en cualquier aplicación que precise gestionarlas: el almacenamiento
comprimido y dinámico, la agregación o acumulación de algún dato
sobre los elementos de la secuencia y la reordenación de los elementos
de la secuencia por sus diferentes características o dimensiones.
La primera contribución de este trabajo es una estructura compacta
para la compresión dinámica de secuencias. Esta estructura permite
comprimir cualquier secuencia en una sola pasada, es decir, es capaz de comprimir en tiempo real a medida que llegan los elementos de la
secuencia. Esta aportación es un hito en el mundo de la compresión ya
que, hasta la fecha, es la primera propuesta de un compresor dinámico
“variable to variable” de carácter general.
En cuanto a la agregación, se presenta una propuesta de almacén
de datos capaz de guardar la información acumulada sobre alguna
característica de los eventos de la secuencia de modo compacto y
fácilmente accesible. Siguiendo la filosofía de los actuales almacenes de
datos, el objetivo es evitar repetir operaciones de acumulación y agilizar
las consultas agregadas mediante el preprocesado de la información
manteniéndola en esta estructura.
Por último, esta tesis aborda el problema de la indexación de
secuencias de eventos considerando sus diferentes características y
posibles reordenaciones. Se presenta una nueva forma de mantener
simultáneamente ordenados los elementos de una secuencia por diferentes
características a través de estructuras compactas. Así se permite
consultar la información y realizar operaciones sobre los elementos
de la secuencia usando cualquier posible ordenación de una manera
sencilla y eficiente
Sinais simbólicos e aplicações em genómica
Doutoramento em Engenharia ElectrotécnicaEsta dissertação surge no contexto do processamento de sinais simbólicos com o objectivo específico de contribuir para o conhecimento da estrutura das sequências de DNA. A localização automática de genes foi um dos problemas biológicos que motivou o desenvolvimento deste trabalho. A compressão de sequências genéticas, quer para reduzir o espaço de armazenamento quer para obtenção de modelos das mesmas, foi outra das motivações. Com o objectivo de contribuir para melhorar uma das técnicas frequentemente usadas na localização automática de genes são comparadas metodologias de análise espectral para sequências simbólicas. Também se discute a validade de aplicação de metodologias de análise espectral às sequências simbólicas e apresenta-se um novo método baseada na função de autocorrelação simbólica. Uma característica que usualmente é tomada para identificação de genes é o tamanho da risca espectral que reflecte a periodicidade de período três. Apresenta-se um algoritmo rápido baseado em contadores de símbolos para cálculo de várias riscas espectrais, e em particular da risca de período três. São também enunciadas e analisadas propriedades associadas ao tamanho de algumas riscas e à redundância espectral. Por último, desenvolve-se uma técnica para compressão de sequências genéticas baseada num modelo de três estados. Em regiões codificantes do DNA esta técnica leva em geral a melhores resultados do que as actuais técnicas de compressão.This dissertation addresses the problem of processing sequences of symbols, and has the specific aim of contributing to the analysis and modeling of DNA sequences. This work was partly motivated by the problem of automatic gene location. Another motivation was the compression of genetic sequences, both for the purpose of reducing the required storage and for determining good DNA models. The main methodologies of spectral analysis of symbolic sequences are compared. The application of spectral analysis methods to the symbolic sequences is discussed and a new method based on the symbolic autocorrelation function is presented. One feature that is often used in gene identification is the size of the Fourier coefficient that reflects periodicity of period three. A fast algorithm for the calculation of Fourier coefficients, based on symbol counters, was developed. Some properties associated with the size of some spectral coefficients and spectral redundancy are discussed. Finally, a technique based on a model with three states was developed to compress genetic sequences. In protein-coding regions this technique leads in general to better results than the state-of-the-art DNA compression techniques
A family of stereoscopic image compression algorithms using wavelet transforms
With the standardization of JPEG-2000, wavelet-based image and video
compression technologies are gradually replacing the popular DCT-based methods. In
parallel to this, recent developments in autostereoscopic display technology is now
threatening to revolutionize the way in which consumers are used to enjoying the
traditional 2-D display based electronic media such as television, computer and
movies. However, due to the two-fold bandwidth/storage space requirement of
stereoscopic imaging, an essential requirement of a stereo imaging system is efficient
data compression.
In this thesis, seven wavelet-based stereo image compression algorithms are
proposed, to take advantage of the higher data compaction capability and better
flexibility of wavelets. [Continues.
A family of stereoscopic image compression algorithms using wavelet transforms
With the standardization of JPEG-2000, wavelet-based image and video
compression technologies are gradually replacing the popular DCT-based methods. In
parallel to this, recent developments in autostereoscopic display technology is now
threatening to revolutionize the way in which consumers are used to enjoying the
traditional 2D display based electronic media such as television, computer and
movies. However, due to the two-fold bandwidth/storage space requirement of
stereoscopic imaging, an essential requirement of a stereo imaging system is efficient
data compression.
In this thesis, seven wavelet-based stereo image compression algorithms are
proposed, to take advantage of the higher data compaction capability and better
flexibility of wavelets. In the proposed CODEC I, block-based disparity
estimation/compensation (DE/DC) is performed in pixel domain. However, this
results in an inefficiency when DWT is applied on the whole predictive error image
that results from the DE process. This is because of the existence of artificial block
boundaries between error blocks in the predictive error image. To overcome this
problem, in the remaining proposed CODECs, DE/DC is performed in the wavelet
domain. Due to the multiresolution nature of the wavelet domain, two methods of
disparity estimation and compensation have been proposed. The first method is
performing DEJDC in each subband of the lowest/coarsest resolution level and then
propagating the disparity vectors obtained to the corresponding subbands of
higher/finer resolution. Note that DE is not performed in every subband due to the
high overhead bits that could be required for the coding of disparity vectors of all
subbands. This method is being used in CODEC II. In the second method, DEJDC is
performed m the wavelet-block domain. This enables disparity estimation to be
performed m all subbands simultaneously without increasing the overhead bits
required for the coding disparity vectors. This method is used by CODEC III.
However, performing disparity estimation/compensation in all subbands would result
in a significant improvement of CODEC III. To further improve the performance of
CODEC ill, pioneering wavelet-block search technique is implemented in CODEC
IV. The pioneering wavelet-block search technique enables the right/predicted image
to be reconstructed at the decoder end without the need of transmitting the disparity
vectors. In proposed CODEC V, pioneering block search is performed in all subbands
of DWT decomposition which results in an improvement of its performance. Further,
the CODEC IV and V are able to perform at very low bit rates(< 0.15 bpp). In
CODEC VI and CODEC VII, Overlapped Block Disparity Compensation (OBDC) is
used with & without the need of coding disparity vector. Our experiment results
showed that no significant coding gains could be obtained for these CODECs over
CODEC IV & V.
All proposed CODECs m this thesis are wavelet-based stereo image coding
algorithms that maximise the flexibility and benefits offered by wavelet transform
technology when applied to stereo imaging. In addition the use of a baseline-JPEG
coding architecture would enable the easy adaptation of the proposed algorithms
within systems originally built for DCT-based coding. This is an important feature
that would be useful during an era where DCT-based technology is only slowly being
phased out to give way for DWT based compression technology.
In addition, this thesis proposed a stereo image coding algorithm that uses JPEG-2000
technology as the basic compression engine. The proposed CODEC, named RASTER
is a rate scalable stereo image CODEC that has a unique ability to preserve the image
quality at binocular depth boundaries, which is an important requirement in the design
of stereo image CODEC. The experimental results have shown that the proposed
CODEC is able to achieve PSNR gains of up to 3.7 dB as compared to directly
transmitting the right frame using JPEG-2000
Succinct and Self-Indexed Data Structures for the Exploitation and Representation of Moving Objects
Programa Oficial de Doutoramento en Computación . 5009V01[Abstract]
This thesis deals with the efficient representation and exploitation of trajectories of
objects that move in space without any type of restriction (airplanes, birds, boats,
etc.). Currently, this is a very relevant problem due to the proliferation of GPS
devices, which makes it possible to collect a large number of trajectories. However,
until now there is no efficient way to properly store and exploit them.
In this thesis, we propose eight structures that meet two fundamental objectives.
First, they are capable of storing space-time data, describing the trajectories, in a
reduced space, so that their exploitation takes advantage of the memory hierarchy.
Second, those structures allow exploiting the information by object queries, given
an object, they retrieve the position or trajectory of that object along that time; or
space-time range queries, given a region of space and a time interval, the objects
that are within the region at that time are obtained. It should be noted that
state-of-the-art solutions are only capable of efficiently answering one of the two
types of queries.
All of these data structures have a common nexus, they all use two elements:
snapshots and logs. Each snapshot works as a spatial index that periodically indexes
the absolute position of each object or the Minimum Bounding Rectangle (MBR) of
its trajectory. They serve to speed up the spatio-temporal range queries. We have
implemented two types of snapshots: based on k2-trees or R-trees.
With respect to the log, it represents the trajectory (sequence of movements) of
each object. It is the main element of the structures, and facilitates the resolution
of object and spatio-temporal range queries. Four strategies have been implemented
to represent the log in a compressed form: ScdcCT, GraCT, ContaCT and RCT.
With the combination of these two elements we build eight different structures for
the representation of trajectories. All of them have been implemented and evaluated
experimentally, showing that they reduce the space required by traditional methods
by up to two orders of magnitude. Furthermore, they are all competitive in solving
object queries as well as spatial-temporal ones.[Resumen]
Esta tesis aborda la representación y explotación eficiente de trayectorias de objetos
que se mueven en el espacio sin ningún tipo de restricción (aviones, pájaros, barcos,
etc.). En la actualidad, este es un problema muy relevante debido a la proliferación
de dispositivos GPS, lo que permite coleccionar una gran cantidad de trayectorias.
Sin embargo, hasta ahora no existe un modo eficiente para almacenarlas y explotarlas
adecuadamente.
Esta tesis propone ocho estructuras que cumplen con dos objetivos fundamentales.
En primer lugar, son capaces de almacenar en espacio reducido los datos espaciotemporales,
que describen las trayectorias, de modo que su explotación saque partido
a la jerarquía de memoria.
En segundo lugar, las estructuras permiten explotar la información realizando
consultas sobre objetos, dado el objeto se calcula su posición o trayectoria durante
un intervalo de tiempo; o consultas de rango espacio-temporal, dada una región del
espacio y un intervalo de tiempo se obtienen los objetos que estaban dentro de la
región en ese tiempo. Hay que destacar que las soluciones del estado del arte solo
son capaces de responder eficientemente uno de los dos tipos de consultas.
Todas estas estructuras de datos tienen un nexo común, todas ellas usan dos
elementos: snapshots y logs. Cada snapshot funciona como un índice espacial que
periódicamente indexa la posición absoluta de cada objeto o el Minimum Bounding
Rectangle (MBR) de su trayectoria. Sirven para agilizar las consultas de rango
espacio-temporal. Hemos implementado dos tipos de snapshot: basadas en k2-trees
o en R-trees.
Con respecto al log, éste representa la trayectoria (secuencia de movimientos) de
cada objeto. Es el principal elemento de nuestras estructuras, y facilita la resolución
de consultas de objeto y de rango espacio-temporal. Se han implementado cuatro
estrategias para representar el log de forma comprimida: ScdcCT, GraCT, ContaCT
y RCT.
Con la combinación de estos dos elementos construimos ocho estructuras diferentes
para la representación de trayectorias. Todas ellas han sido implementadas y
evaluadas experimentalmente, donde reducen hasta dos órdenes de magnitud el
espacio que requieren los métodos tradicionales. Además, todas ellas son competitivas resolviendo tanto consultas de objeto como de rango espacio-temporal.[Resumo]
Esta tese trata sobre a representación e explotación eficiente de traxectorias de
obxectos que se moven no espazo sen ningún tipo de restrición (avións, paxaros,
buques, etc.). Na actualidade, este é un problema moi relevante debido á proliferación
de dispositivos GPS, o que fai posible a recollida dun gran número de traxectorias.
Non obstante, ata o de agora non existe un xeito eficiente de almacenalos e explotalos.
Esta tese propón oito estruturas que cumpren dous obxectivos fundamentais. En
primeiro lugar, son capaces de almacenar datos espazo-temporais, que describen
as traxectorias, nun espazo reducido, de xeito que a súa explotación aproveita a
xerarquía da memoria.
En segundo lugar, as estruturas permiten explotar a información realizando
consultas de obxectos, dado o obxecto calcúlase a súa posición ou traxectoria nun
período de tempo; ou consultas de rango espazo-temporal, dada unha rexión de
espazo e un intervalo de tempo, obtéñense os obxectos que estaban dentro da rexión
nese momento. Cómpre salientar que as solucións do estado do arte só son capaces
de responder eficientemente a un dos dous tipos de consultas.
Todas estas estruturas de datos teñen unha ligazón común, empregan dous
elementos: snapshots e logs. Cada snapshot funciona como un índice espacial que
indexa periodicamente a posición absoluta de cada obxecto ou o Minimum Bounding
Rectangle (MBR) da súa traxectoria. Serven para acelerar as consultas de rango
espazo-temporal. Implementamos dous tipos de snapshot: baseadas en k2-trees ou
en R-trees.
Con respecto ao log, este representa a traxectoria (secuencia de movementos) de
cada obxecto. É o principal elemento das nosas estruturas, e facilita a resolución
de consultas sobre obxectos e de rango espacio-temporal. Implementáronse catro
estratexias para representar o log nunha forma comprimida: ScdcCT, GraCT,
ContaCT e RCT.
Coa combinación destes dous elementos construímos oito estruturas diferentes
para a representación de traxectorias. Todas elas foron implementadas e avaliadas
experimentalmente, onde reducen ata dúas ordes de magnitude o espazo requirido
polos métodos tradicionais. Ademais, todas elas son competitivas para resolver tanto
consultas de obxectos como espazo-temporais