A conditional compression distance that unveils insights of the genomic evolution
We describe a compression-based distance for genomic sequences. Instead of
using the usual conjoint information content, as in the classical Normalized
Compression Distance (NCD), it uses the conditional information content. To
compute this Normalized Conditional Compression Distance (NCCD), we need a
normal conditional compressor, that we built using a mixture of static and
dynamic finite-context models. Using this approach, we measured chromosomal
distances between Hominidae primates and also between Muroidea (rat and mouse),
observing several insights of evolution that so far have not been reported in
the literature.
Comment: Full version of the DCC 2014 paper "A conditional compression distance that unveils insights of the genomic evolution".
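As a rough illustration of both distances, a general-purpose compressor can stand in for the paper's finite-context mixture; the proxy C(x|y) ≈ C(yx) − C(y) used below is an assumption, since zlib is not the normal conditional compressor built in the paper:

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for a real genomic compressor."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Classical NCD, based on the conjoint information content C(xy)."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

def nccd(x: bytes, y: bytes) -> float:
    """NCCD sketch: the conditional content C(x|y) is approximated here by
    C(yx) - C(y), an assumption in place of a normal conditional compressor."""
    cx, cy = c(x), c(y)
    return max(c(y + x) - cy, c(x + y) - cx) / max(cx, cy)

a = b"ACGT" * 200
b2 = b"TTGACCAGTCAGGC" * 60
print(ncd(a, a) < ncd(a, b2), nccd(a, a) < nccd(a, b2))
```

Either distance should be small for a sequence compared with itself and larger for unrelated sequences; the real compressor in the paper replaces zlib with a mixture of static and dynamic finite-context models.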
Compressão e análise de dados genómicos [Compression and analysis of genomic data]
Doutoramento em Informática [Doctoral programme in Informatics]
Genomic sequences are large codified messages describing most of the structure
of all known living organisms. Since the presentation of the first genomic
sequence, a huge amount of genomic data has been generated,
with diversified characteristics, rendering the data deluge phenomenon a
serious problem in most genomics centers. As such, most of the data are
discarded (when possible), while the rest are compressed using general-purpose
algorithms, often attaining modest data reduction results.
Several specific algorithms have been proposed for the compression of genomic
data, but unfortunately only a few of them have been made available
as usable and reliable compression tools. Of those, most have been developed
for some specific purpose. In this thesis, we propose a compressor
for genomic sequences of multiple natures, able to function in a reference
or reference-free mode. Besides, it is very flexible and can cope with diverse
hardware specifications. It uses a mixture of finite-context models (FCMs)
and eXtended FCMs. The results show improvements over state-of-the-art
compressors.
Since the compressor can be seen as an unsupervised alignment-free method
to estimate the algorithmic complexity of genomic sequences, it is the ideal
candidate to perform analysis of and between sequences. Accordingly, we
define a way to approximate directly the Normalized Information Distance,
aiming to identify evolutionary similarities within and between species. Moreover,
we introduce a new concept, the Normalized Relative Compression,
which is able to quantify and infer new characteristics of the data, previously
undetected by other methods. We also investigate local measures, which are
able to locate specific events using complexity profiles. Furthermore, we
present and explore a method based on complexity profiles to detect and
visualize genomic rearrangements between sequences, identifying several insights
into the genomic evolution of humans.
Finally, we introduce the concept of relative uniqueness and apply it to the
Ebolavirus, identifying three regions that appear in all the virus sequences
of the outbreak but nowhere in the human genome. In fact, we show that these
sequences are sufficient to classify different sub-species. Also, we identify
regions in human chromosomes that are absent from the DNA of close primates,
specifying novel traits in human uniqueness.
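A minimal sketch of the Normalized Relative Compression (NRC) idea, using zlib's preset-dictionary mode as a stand-in for a true relative compressor (a known simplification: zlib can still exploit repeats inside x itself, which a strict C(x||y) forbids):

```python
import math
import zlib

def c_relative(x: bytes, ref: bytes) -> int:
    """Proxy for C(x||ref): compress x with ref preloaded as a zlib dictionary."""
    co = zlib.compressobj(level=9, zdict=ref)
    return len(co.compress(x) + co.flush())

def nrc(x: bytes, ref: bytes, alphabet_size: int = 4) -> float:
    """NRC sketch: C(x||ref) in bits over the maximum |x| * log2(alphabet size)."""
    return (8 * c_relative(x, ref)) / (len(x) * math.log2(alphabet_size))

# Illustrative sequences, not real genomes.
ref = b"ACGTACGTTTGACCA" * 50
query = b"GGGCCCAATTACGTAG" * 40
print(nrc(ref, ref) < nrc(query, ref))   # the reference describes itself best
```

Values near 0 mean the reference models the target almost completely; values near 1 mean the reference contributes essentially no information about the target.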
Information profiles for DNA pattern discovery
Finite-context modeling is a powerful tool for compressing and hence for
representing DNA sequences. We describe an algorithm to detect genomic
regularities, within a blind discovery strategy. The algorithm uses information
profiles built using suitable combinations of finite-context models. We used
the genome of the fission yeast Schizosaccharomyces pombe strain 972 h- for
illustration, unveiling locations of low information content, which are
usually associated with DNA regions of potential biological interest.
Comment: Full version of the DCC 2014 paper "Information profiles for DNA pattern discovery".
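A crude sketch of such an information profile, with zlib's per-window compressed size standing in for the finite-context-model estimates used in the paper; the sequence below is synthetic, not S. pombe:

```python
import zlib

def info_profile(seq: bytes, window: int = 64, step: int = 16):
    """Per-window information content in bits per symbol; zlib's compressed
    size is a crude stand-in for finite-context-model estimates."""
    profile = []
    for i in range(0, len(seq) - window + 1, step):
        bits = 8 * len(zlib.compress(seq[i:i + window], 9))
        profile.append((i, bits / window))
    return profile

# Hypothetical sequence: a low-information poly-A run between two repeats.
seq = b"ACGTGGTCAT" * 20 + b"A" * 120 + b"CTTAGGACGT" * 20
low = min(info_profile(seq), key=lambda p: p[1])
print(low[0])   # start of the lowest-information window
```

Minima of the profile mark the low-information regions that a blind discovery strategy would flag for inspection.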
The complexity landscape of viral genomes
Background: Viruses are among the shortest yet most abundant species, harboring minimal instructions to infect cells, adapt, multiply, and exist. However, despite the current substantial availability of viral genome sequences, the scientific repertoire lacks a complexity landscape that automatically highlights the organization, relations, and fundamental characteristics of viral genomes.
Results: This work provides a comprehensive landscape of viral genome complexity (or quantity of information), identifying the most redundant and the most complex groups with respect to their genome sequence, while providing their distribution and characteristics at large and local scales. Moreover, we identify and quantify the abundance of inverted repeats in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genome database, we show that double-stranded DNA viruses are, on average, the most redundant viruses, while single-stranded DNA viruses are the least. Conversely, double-stranded RNA viruses show lower redundancy than single-stranded RNA viruses. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing the first direct complexity analysis of human herpesviruses. We also conceive a feature-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine-learning classifiers.
Conclusions: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying the organization of viral genomes while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches. Peer reviewed.
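The feature-based classification idea can be sketched as follows; the zlib-based complexity estimate and the 1-nearest-neighbour rule stand in for the compression measures and machine-learning classifiers of the paper, and the training pairs are illustrative assumptions, not real viral genomes:

```python
import zlib

def features(seq: bytes):
    """Alignment-free features: crude normalized-compression estimate,
    GC fraction, and scaled sequence length."""
    nc = len(zlib.compress(seq, 9)) / len(seq)
    gc = (seq.count(b"G") + seq.count(b"C")) / len(seq)
    return (nc, gc, len(seq) / 1e4)

def classify(seq: bytes, labelled):
    """1-nearest-neighbour over the feature triples (a stand-in for the
    machine-learning classifiers used in the paper)."""
    fx = features(seq)
    sqdist = lambda f: sum((a - b) ** 2 for a, b in zip(fx, f))
    return min(labelled, key=lambda item: sqdist(features(item[0])))[1]

# Illustrative labelled examples.
train = [(b"ATATATAT" * 300, "low-GC"),
         (b"GCGGCCGCGTCGACGG" * 150, "high-GC")]
print(classify(b"ATATATTA" * 280, train))   # prints low-GC
```

The point of the design is that no sequence is ever aligned or directly compared with another: classification operates entirely in the small feature space.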
Statistical Complexity Analysis of Turing Machine tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model
Sources that generate symbolic sequences with algorithmic nature may differ in statistical complexity because they create structures that follow algorithmic schemes, rather than generating symbols from a probabilistic function assuming independence. In the case of Turing machines, this means that machines with the same algorithmic complexity can create tapes with different statistical complexity. In this paper, we use a compression-based approach to measure the global and local statistical complexity of specific Turing machine tapes with the same number of states and alphabet size. Both measures are estimated using the best-order Markov model. For the global measure, we use the Normalized Compression (NC), while, for the local measures, we define and use normal and dynamic complexity profiles to quantify and localize regions of lower and higher statistical complexity. We assessed the validity of our methodology on synthetic and real genomic data, showing that it is tolerant to increasing rates of edits and block permutations. Regarding the analysis of the tapes, we localize patterns of higher statistical complexity in two regions, for different numbers of machine states. We show that these patterns are generated by a decrease of the tape's amplitude, given the setting of small rule cycles. Additionally, we performed a comparison with a measure that uses both algorithmic and statistical approaches (BDM) for the analysis of the tapes. Naturally, BDM is efficient given the algorithmic nature of the tapes. However, for a higher number of states, BDM is progressively approximated by our methodology. Finally, we provide a simple algorithm to increase the statistical complexity of a Turing machine tape while retaining the same algorithmic complexity. We supply a publicly available implementation of the algorithm in C++ under the GPLv3 license. All results can be reproduced in full with scripts provided at the repository.
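A minimal sketch of the best-order Markov model estimate of the Normalized Compression, assuming Laplace-smoothed adaptive models and a small maximum order; it illustrates the definition NC = (code length in bits) / (|x| log2 |A|), not the paper's implementation:

```python
import math
import random
from collections import Counter

def markov_bits(seq: str, k: int) -> float:
    """Ideal code length (bits) of seq under an adaptive order-k Markov
    model with Laplace smoothing."""
    alphabet = sorted(set(seq))
    counts, totals = Counter(), Counter()
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]
        bits -= math.log2((counts[(ctx, sym)] + 1) / (totals[ctx] + len(alphabet)))
        counts[(ctx, sym)] += 1
        totals[ctx] += 1
    return bits

def nc_best_order(seq: str, max_k: int = 8) -> float:
    """NC with the best-order Markov model: the order minimizing the cost."""
    best = min(markov_bits(seq, k) for k in range(max_k + 1))
    return best / (len(seq) * math.log2(len(set(seq))))

random.seed(0)
rand = "".join(random.choice("01") for _ in range(400))
print(nc_best_order("0110" * 100), nc_best_order(rand))
```

A periodic tape scores near 0 because some order captures its rule almost perfectly, while a statistically complex tape stays near 1 whatever the order.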
Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements
Background: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data has accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts the information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present the Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was approximately 1 GB, which makes Smash++ feasible to run on present-day standard computers.
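The core idea (locating regions of a target that a reference describes poorly) can be sketched with zlib's preset-dictionary mode standing in for Smash++'s compression models; the sequences and the `relative_profile` helper below are illustrative assumptions:

```python
import zlib

def relative_profile(target: bytes, ref: bytes, window: int = 64, step: int = 32):
    """Per-window size of target compressed with ref as a preset dictionary;
    windows that the reference describes poorly get larger values."""
    prof = []
    for i in range(0, len(target) - window + 1, step):
        co = zlib.compressobj(level=9, zdict=ref)
        prof.append((i, len(co.compress(target[i:i + window]) + co.flush())))
    return prof

# Synthetic example: target equals the reference plus one inserted segment.
ref = b"ACGTTTGACCAGTTCA" * 40
target = ref[:320] + b"GGCCGGTTAACCGGCC" * 10 + ref[320:]
peak = max(relative_profile(target, ref), key=lambda p: p[1])
print(peak[0])   # a window overlapping the inserted segment
```

Peaks of this relative-complexity profile correspond to candidate rearranged or novel regions; Smash++ refines such candidates and renders them in its SVG output.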
Indirect assessment of railway infrastructure anomalies based on passenger comfort criteria
Railways are among the most efficient and widely used mass transportation systems for mid-range distances. To enhance the attractiveness of this type of transport, it is necessary to improve the level of comfort, which is strongly influenced by the vibration derived from train motion and wheel-track interaction; thus, the condition and maintenance of railway track infrastructure are a major concern. Based on discomfort levels, a methodology capable of detecting railway track infrastructure failures is proposed. During regular passenger service, acceleration and GPS measurements were taken on Alfa Pendular and Intercity trains between Porto (Campanhã) and Lisbon (Oriente) stations. The ISO 2631 methodology was used to calculate instantaneous floor discomfort levels. By matching the results for both trains, using GPS coordinates, 12 track section locations were found to require preventive maintenance actions. The methodology was validated by comparing these results with those obtained by the EM 120 track inspection vehicle, for which similar locations were found. The developed system is a complementary condition-based maintenance tool that presents the advantage of being low-cost while not disturbing regular train operations.
This research was funded by Fundação para a Ciência e Tecnologia, grant number PD/BD/143161/2019. The authors also acknowledge the financial support from the Base Funding UIDB/04708/2020 and Programmatic Funding UIDP/04708/2020 of CONSTRUCT - Instituto de Estruturas e Construções, funded by national funds through the FCT/MCTES (PIDDAC). This work is a result of the project "FERROVIA 4.0", reference POCI-01-0247-FEDER-046111, co-funded by the European Regional Development Fund (ERDF) through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020) and the Lisbon Regional Operational Programme (LISBOA 2020) under the PORTUGAL 2020 Partnership Agreement.
The first author thanks Fundação para a Ciência e Tecnologia (FCT) for a PhD scholarship under the project iRail (PD/BD/143161/2019). The authors would like to acknowledge the support of the projects FCT LAETA UIDB/50022/2020, UIDP/50022/2020, and UIDB/04077/2020. No potential competing interest was reported by the authors.
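The screening step can be sketched as follows. Note that ISO 2631 prescribes frequency-weighted RMS acceleration, so the plain running RMS and the 0.8 m/s² threshold below are simplifying assumptions, and `flag_sections` is a hypothetical helper, not the authors' implementation:

```python
import math

def flag_sections(accel, positions, window=50, threshold=0.8):
    """Flag positions whose windowed RMS acceleration (m/s^2) exceeds the
    threshold; inputs are assumed to be already frequency-weighted."""
    flagged = []
    for i in range(0, len(accel) - window + 1, window):
        rms = math.sqrt(sum(a * a for a in accel[i:i + window]) / window)
        if rms > threshold:
            flagged.append(positions[i])
    return flagged

# Synthetic ride: smooth track with one rough patch starting at chainage 100.
accel = [0.1] * 100 + [1.5] * 50 + [0.1] * 50
positions = list(range(200))   # stand-in for GPS-derived chainage
print(flag_sections(accel, positions))   # prints [100]
```

Matching flags from repeated runs (or from two different trains, as in the paper) at the same GPS coordinates filters out vehicle-specific vibration and leaves genuine track anomalies.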
Requalificação do espaço público em Lisboa - Avaliação do projeto: “Uma praça em cada Bairro” [Requalification of public space in Lisbon - Evaluation of the project “Uma Praça em cada Bairro”]
Public spaces have always played a key role, contributing to the transformation of cities
over time. Considering the importance of public spaces for the population, it is necessary that
the public authorities create new public spaces or requalify the existing ones, in order to satisfy
the needs of the population. These interventions must consider not only the physical and
material dimension, but also other dimensions, such as social, environmental, as well as the
cultural identity of the public spaces involved, in order to promote their use by the population.
This work aims to study the impacts of public space requalification projects. During an
internship held at the Lisbon City Council, we analyzed the program ‘Uma Praça em cada
Bairro,’ developed by this institution. After updating an evaluation matrix for this institution, we
used the projects developed at Rua Actriz Palmira Bastos and Alameda Manuel Ricardo
Espírito Santo as case studies.
Considering the research conducted, it is possible to conclude that the urban
requalification interventions conducted in the selected public spaces achieved part of the
initially defined objectives. It is also possible to conclude that the update of the evaluation
matrix proved to be correct, due to its scope, for including more dimensions of analysis than
the initial matrix. Due to its scope, this updated matrix can be used in the evaluation of future
urban requalification projects with an impact on public space.
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models
The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.
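The weighted-mixture idea behind such compressors can be sketched as follows; this toy mixes two Laplace-smoothed finite-context models with exponentially discounted performance weights and reports the ideal arithmetic-coding cost, a simplification of the competitive prediction, substitutional-tolerant, and repeat models described above:

```python
import math
from collections import Counter

class FCM:
    """Adaptive order-k finite-context model with Laplace smoothing over ACGT."""
    def __init__(self, k: int, alphabet: str = "ACGT"):
        self.k, self.alphabet = k, alphabet
        self.counts, self.totals = Counter(), Counter()
    def prob(self, ctx: str, sym: str) -> float:
        ctx = ctx[-self.k:] if self.k else ""
        return (self.counts[(ctx, sym)] + 1) / (self.totals[ctx] + len(self.alphabet))
    def update(self, ctx: str, sym: str) -> None:
        ctx = ctx[-self.k:] if self.k else ""
        self.counts[(ctx, sym)] += 1
        self.totals[ctx] += 1

def mixed_bits(seq: str, orders=(1, 4), gamma=0.95) -> float:
    """Ideal arithmetic-coding cost (bits) of seq under a mixture of FCMs
    whose weights are exponentially discounted by past performance."""
    models = [FCM(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - max(orders)):i]
        probs = [m.prob(ctx, sym) for m in models]
        bits -= math.log2(sum(w * p for w, p in zip(weights, probs)))
        weights = [(w ** gamma) * p for w, p in zip(weights, probs)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for m in models:
            m.update(ctx, sym)
    return bits

seq = "ACGTACGT" * 100
print(mixed_bits(seq))   # well below the 2-bit-per-base baseline of 1600
```

The discount factor gamma lets the mixture shift weight to whichever model has been predicting best recently, which is what makes the prediction "competitive" between model classes.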
Detection of Low-Copy Human Virus DNA upon Prolonged Formalin Fixation
Formalin fixation, albeit an outstanding method for morphological and molecular preservation, induces DNA damage and cross-linking, which can hinder nucleic acid screening. This is of particular concern in the detection of low-abundance targets, such as persistent DNA viruses. In the present study, we evaluated the analytical sensitivity of viral detection in lung, liver, and kidney specimens from four deceased individuals. The samples were either frozen or incubated in formalin (±paraffin embedding) for up to 10 days. We tested two DNA extraction protocols for the control of efficient yields and viral detections. We used short-amplicon qPCRs (63–159 nucleotides) to detect 11 DNA viruses, as well as hybridization capture of these plus 27 additional ones, followed by deep sequencing. We observed marginally higher ratios of amplifiable DNA and scantly higher viral genoprevalences in the samples extracted with the FFPE dedicated protocol. Based on the findings in the frozen samples, most viruses were detected regardless of the extended fixation times. False-negative calls, particularly by qPCR, correlated with low levels of viral DNA (150 base pairs). Our data suggest that low-copy viral DNAs can be satisfactorily investigated from FFPE specimens, and encourage further examination of historical materials.