
    A conditional compression distance that unveils insights of the genomic evolution

    We describe a compression-based distance for genomic sequences. Instead of using the usual conjoint information content, as in the classical Normalized Compression Distance (NCD), it uses the conditional information content. To compute this Normalized Conditional Compression Distance (NCCD), we need a normal conditional compressor, which we built using a mixture of static and dynamic finite-context models. Using this approach, we measured chromosomal distances between Hominidae primates and also between Muroidea (rat and mouse), observing several evolutionary traits that have not previously been reported in the literature. Comment: Full version of the DCC 2014 paper "A conditional compression distance that unveils insights of the genomic evolution".
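    The two distances can be sketched with an off-the-shelf compressor. The paper builds a dedicated conditional compressor from finite-context models; the zlib stand-in below only illustrates the formulas, approximating the conditional information C(x|y) by C(yx) − C(y) (a common heuristic, not the paper's method).

```python
import zlib

def c(data):
    """Compressed size in bytes, with zlib as the stand-in compressor."""
    return len(zlib.compress(data, 9))

def ncd(x, y):
    """Normalized Compression Distance (conjoint information)."""
    return (c(x + y) - min(c(x), c(y))) / max(c(x), c(y))

def nccd(x, y):
    """Normalized Conditional Compression Distance, approximating the
    conditional information C(x|y) by C(yx) - C(y)."""
    cx_given_y = c(y + x) - c(y)
    cy_given_x = c(x + y) - c(x)
    return max(cx_given_y, cy_given_x) / max(c(x), c(y))
```

    Similar sequences yield values near 0, while unrelated sequences approach 1.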

    Compression and analysis of genomic data (Compressão e análise de dados genómicos)

    PhD in Informatics (Doutoramento em Informática). Genomic sequences are large codified messages describing most of the structure of all known living organisms. Since the presentation of the first genomic sequence, a huge amount of genomics data has been generated, with diversified characteristics, rendering the data-deluge phenomenon a serious problem in most genomics centers. As such, most of the data are discarded (when possible), while the rest are compressed using general-purpose algorithms, often attaining modest data-reduction results. Several specific algorithms have been proposed for the compression of genomic data, but unfortunately only a few of them have been made available as usable and reliable compression tools, and most of those were developed for some specific purpose. In this thesis, we propose a compressor for genomic sequences of multiple natures, able to function in a reference or reference-free mode. Besides, it is very flexible and can cope with diverse hardware specifications. It uses a mixture of finite-context models (FCMs) and eXtended FCMs. The results show improvements over state-of-the-art compressors. Since the compressor can be seen as an unsupervised, alignment-free method to estimate the algorithmic complexity of genomic sequences, it is the ideal candidate to perform analysis of and between sequences. Accordingly, we define a way to approximate the Normalized Information Distance directly, aiming to identify evolutionary similarities within and between species. Moreover, we introduce a new concept, the Normalized Relative Compression, which is able to quantify and infer new characteristics of the data, previously undetected by other methods. We also investigate local measures, being able to locate specific events using complexity profiles. Furthermore, we present and explore a method based on complexity profiles to detect and visualize genomic rearrangements between sequences, identifying several traits of human genomic evolution.
    Finally, we introduce the concept of relative uniqueness and apply it to the Ebolavirus, identifying three regions that appear in all the outbreak virus sequences but nowhere in the human genome. In fact, we show that these sequences are sufficient to classify different sub-species. Also, we identify regions in human chromosomes that are absent from the DNA of close primates, specifying novel traits of human uniqueness.
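    The Normalized Relative Compression idea, describing x using only a model of y, can be approximated with deflate's preset-dictionary mode. This is a rough proxy for illustration only (the dictionary is limited to 32 KiB of y); the thesis builds the relative compressor from finite-context models instead.

```python
import zlib
from math import log2

def nrc(x, y, alphabet_size=4):
    """Rough proxy for the Normalized Relative Compression NRC(x||y):
    bits needed to describe x using only a model of y, divided by the
    maximal information content |x| * log2(alphabet size). Deflate's
    preset-dictionary mode stands in for a finite-context relative
    compressor."""
    co = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                          zlib.Z_DEFAULT_STRATEGY, y)
    bits = 8 * len(co.compress(x) + co.flush())
    return bits / (len(x) * log2(alphabet_size))
```

    Values near 0 mean x is almost fully described by y; values near 1 mean y carries essentially no information about x.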

    Information profiles for DNA pattern discovery

    Finite-context modeling is a powerful tool for compressing, and hence for representing, DNA sequences. We describe an algorithm to detect genomic regularities within a blind discovery strategy. The algorithm uses information profiles built using suitable combinations of finite-context models. We used the genome of the fission yeast Schizosaccharomyces pombe strain 972 h- for illustration, unveiling locations of low information content, which are usually associated with DNA regions of potential biological interest. Comment: Full version of the DCC 2014 paper "Information profiles for DNA pattern discovery".
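    An information profile of the kind described above can be sketched with a single adaptive order-k finite-context model (the paper combines several models; this minimal version uses one). Each position is charged the ideal code length of its symbol given the preceding context, so regular regions show up as dips in the profile.

```python
from collections import defaultdict
from math import log2

def information_profile(seq, k=3, alpha=1.0):
    """Per-symbol code length (bits) under an adaptively updated
    order-k finite-context model over the DNA alphabet {A,C,G,T}.
    Low values mark regular (low-information) regions."""
    counts = defaultdict(lambda: defaultdict(float))
    profile = []
    for i, s in enumerate(seq):
        ctx = seq[max(0, i - k):i]
        c = counts[ctx]
        p = (c[s] + alpha) / (sum(c.values()) + 4 * alpha)
        profile.append(-log2(p))
        c[s] += 1
    return profile
```

    On a highly repetitive input the profile quickly drops toward zero once the model has seen each context.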

    The complexity landscape of viral genomes

    Background: Viruses are among the shortest yet most abundant species, harboring minimal instructions to infect cells, adapt, multiply, and exist. However, despite the current substantial availability of viral genome sequences, the scientific repertoire lacks a complexity landscape that automatically highlights viral genomes' organization, relations, and fundamental characteristics. Results: This work provides a comprehensive landscape of the complexity (or quantity of information) of viral genomes, identifying the most redundant and the most complex groups with respect to their genome sequence, while providing their distribution and characteristics at large and local scales. Moreover, we identify and quantify the abundance of inverted repeats in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genome database, we show that double-stranded DNA viruses are, on average, the most redundant viruses, while single-stranded DNA viruses are the least. Conversely, double-stranded RNA viruses show lower redundancy than single-stranded RNA viruses. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, providing, for the first time, a direct complexity analysis of human herpesviruses. We also conceive a feature-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures, such as GC-content percentage and sequence length, followed by machine-learning classifiers.
    Conclusions: This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying the organization of viral genomes while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches. Peer reviewed.
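    The feature extraction described above can be sketched in a few lines. The paper uses a state-of-the-art genomic compressor and then machine-learning classifiers; here zlib stands in for the compressor and only the feature vector (Normalized Compression, GC-content, length) is built.

```python
import zlib
from math import log2

def normalized_compression(seq, alphabet_size=4):
    """NC: compressed size in bits divided by the maximal information
    content |x| * log2(alphabet size); zlib stands in for the genomic
    compressor used in the paper."""
    return 8 * len(zlib.compress(seq, 9)) / (len(seq) * log2(alphabet_size))

def feature_vector(seq):
    """[NC, GC-content, length]: a compact, alignment-free feature set
    that can be handed to a machine-learning classifier."""
    gc = (seq.count(b"G") + seq.count(b"C")) / len(seq)
    return [normalized_compression(seq), gc, float(len(seq))]
```

    Redundant genomes give NC values close to 0, while high-complexity genomes approach 1.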

    Statistical Complexity Analysis of Turing Machine tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model

    Sources that generate symbolic sequences of an algorithmic nature may differ in statistical complexity because they create structures that follow algorithmic schemes, rather than generating symbols from a probabilistic function assuming independence. In the case of Turing machines, this means that machines with the same algorithmic complexity can create tapes with different statistical complexity. In this paper, we use a compression-based approach to measure the global and local statistical complexity of specific Turing machine tapes with the same number of states and the same alphabet. Both measures are estimated using the best-order Markov model. For the global measure, we use the Normalized Compression (NC), while, for the local measures, we define and use normal and dynamic complexity profiles to quantify and localize regions of lower and higher statistical complexity. We assessed the validity of our methodology on synthetic and real genomic data, showing that it is tolerant to increasing rates of edits and block permutations. Regarding the analysis of the tapes, we localize patterns of higher statistical complexity in two regions, for different numbers of machine states. We show that these patterns are generated by a decrease of the tape's amplitude, given the setting of small rule cycles. Additionally, we compared our methodology with a measure that combines algorithmic and statistical approaches (BDM) for the analysis of the tapes. Naturally, BDM is efficient given the algorithmic nature of the tapes; however, for higher numbers of states, BDM is progressively approximated by our methodology. Finally, we provide a simple algorithm to increase the statistical complexity of a Turing machine tape while retaining the same algorithmic complexity. We supply a publicly available implementation of the algorithm in C++ under the GPLv3 license. All results can be reproduced in full with scripts provided in the repository. Peer reviewed.
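    The kind of tape analyzed above can be produced by a minimal simulator. This is purely illustrative (not the paper's machine enumeration or rule encoding): a machine with a one-line rule table has trivial algorithmic description length, yet the tape it writes can then be fed to statistical measures such as NC.

```python
def run_tm(rules, steps=200):
    """Run a tiny one-tape Turing machine and return the visited
    portion of its tape. rules maps (state, symbol) to
    (write, move, next_state); the machine halts on a missing rule."""
    tape, pos, state = {}, 0, 0
    for _ in range(steps):
        sym = tape.get(pos, 0)
        if (state, sym) not in rules:
            break
        write, move, state = rules[(state, sym)]
        tape[pos] = write
        pos += move
    if not tape:
        return []
    lo, hi = min(tape), max(tape)
    return [tape.get(i, 0) for i in range(lo, hi + 1)]
```

    For example, the single rule (0, 0) -> (1, +1, 0) writes an all-ones tape: algorithmically trivial and statistically minimal.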

    Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements

    Background: The development of high-throughput sequencing technologies and, as a result, the production of huge volumes of genomic data have accelerated biological and medical research and discovery. The study of genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. Results: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between two DNA sequences. This computational solution extracts the information content of the two sequences, exploiting a data-compression technique to find rearrangements. We also present the Smash++ visualizer, a tool that allows the visualization of the detected rearrangements, along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. Conclusions: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was approximately 1 GB, which makes Smash++ feasible to run on present-day standard computers. Peer reviewed.
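    The underlying idea, flagging regions of one sequence that a model of the other cannot describe cheaply, can be caricatured with k-mer novelty instead of compression. This sketch is not the Smash++ algorithm; it only shows the relative-complexity intuition: windows of the target whose k-mers are absent from the reference are rearrangement candidates.

```python
def rearrangement_candidates(ref, tgt, k=11, win=50, thr=0.5):
    """Flag windows of tgt that are poorly 'explained' by ref, using
    k-mer novelty as a crude stand-in for compression-based relative
    complexity. Returns (start, end) positions of flagged windows."""
    ref_kmers = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    novelty = [0 if tgt[i:i + k] in ref_kmers else 1
               for i in range(len(tgt) - k + 1)]
    hits = []
    for s in range(0, len(novelty) - win + 1, win):
        if sum(novelty[s:s + win]) / win > thr:
            hits.append((s, s + win))
    return hits
```

    An identical pair yields no candidates, while an inserted foreign segment is flagged where its windows exceed the novelty threshold.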

    Indirect assessment of railway infrastructure anomalies based on passenger comfort criteria

    Railways are among the most efficient and widely used mass transportation systems for mid-range distances. To enhance the attractiveness of this type of transport, it is necessary to improve the level of comfort, which is greatly influenced by the vibration derived from train motion and wheel-track interaction; thus, the condition and maintenance of the railway track infrastructure are a major concern. Based on discomfort levels, a methodology capable of detecting railway track infrastructure failures is proposed. During regular passenger service, acceleration and GPS measurements were taken on Alfa Pendular and Intercity trains between Porto (Campanhã) and Lisbon (Oriente) stations. The ISO 2631 methodology was used to calculate instantaneous floor discomfort levels. By matching the results for both trains using GPS coordinates, 12 track section locations were found to require preventive maintenance actions. The methodology was validated by comparing these results with those obtained by the EM 120 track inspection vehicle, for which similar locations were found. The developed system is a complementary condition-based maintenance tool that has the advantage of being low-cost while not disturbing regular train operations. This research was funded by Fundação para a Ciência e Tecnologia, grant number PD/BD/143161/2019. The authors also acknowledge the financial support of the Base Funding UIDB/04708/2020 and the Programmatic Funding UIDP/04708/2020 of CONSTRUCT—Instituto de Estruturas e Construções, funded by national funds through the FCT/MCTES (PIDDAC). This work is a result of the project "FERROVIA 4.0", reference POCI-01-0247-FEDER-046111, co-funded by the European Regional Development Fund (ERDF) through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020) and the Lisbon Regional Operational Programme (LISBOA 2020) under the PORTUGAL 2020 Partnership Agreement.
The first author thanks Fundação para a Ciência e Tecnologia (FCT) for a PhD scholarship under the project iRail (PD/BD/143161/2019). The authors would like to acknowledge the support of the projects FCT LAETA UIDB/50022/2020, UIDP/50022/2020, and UIDB/04077/2020. No potential competing interests were reported by the authors.
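    The core of such an assessment can be sketched as running-RMS thresholding of the weighted floor acceleration. This sketch omits the ISO 2631-1 frequency-weighting filters and GPS matching that the study applies; it assumes the input is already frequency-weighted, and uses 0.315 m/s², the ISO 2631-1 boundary below which vibration is rated "not uncomfortable", as the flagging threshold.

```python
from math import sqrt

def discomfort_sections(accel, fs=100, window_s=5.0, threshold=0.315):
    """Running RMS of (already frequency-weighted) floor acceleration
    in m/s^2, sampled at fs Hz; windows above the threshold are
    flagged. Returns (start_s, end_s, rms) tuples."""
    n = int(window_s * fs)
    flagged = []
    for s in range(0, len(accel) - n + 1, n):
        w = accel[s:s + n]
        rms = sqrt(sum(a * a for a in w) / n)
        if rms > threshold:
            flagged.append((s / fs, (s + n) / fs, rms))
    return flagged
```

    Flagged time windows would then be mapped to track sections via the GPS trace, as done in the study.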

    Requalification of public space in Lisbon - evaluation of the project "Uma Praça em cada Bairro" (Requalificação do espaço público em Lisboa - Avaliação do projeto “Uma Praça em cada Bairro”)

    Public spaces have always played a key role, contributing to the transformation of cities over time. Considering the importance of public spaces for the population, it is necessary that public authorities create new public spaces or requalify existing ones in order to satisfy the needs of the population. These interventions must consider not only the physical and material dimension but also other dimensions, such as the social and environmental ones, as well as the cultural identity of the public spaces involved, in order to promote their use by the population. This work aims to study the impacts of public space requalification projects. During an internship at the Lisbon City Council (Câmara Municipal de Lisboa), we analyzed the programme 'Uma Praça em cada Bairro' developed by this institution. After updating one of the institution's evaluation matrices, we used the projects developed at Rua Actriz Palmira Bastos and at Alameda Manuel Ricardo Espírito Santo as case studies. Considering the research conducted, it is possible to conclude that the urban requalification interventions carried out in the selected public spaces achieved part of the initially defined objectives. It is also possible to conclude that the update of the evaluation matrix proved appropriate, owing to its scope, for including more dimensions of analysis than the initial matrix. Because of this scope, the updated matrix can be used in the evaluation of future urban requalification projects focused on public space.

    A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models

    The development of efficient data compressors for DNA sequences is crucial not only for reducing storage and transmission bandwidth, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitution-tolerant context models) and weighted stochastic repeat models. Both classes use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available under the GPLv3 license. Peer reviewed.
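    The competitive-selection idea can be shown with a toy: two predictors compete, and each symbol is charged the ideal arithmetic-coding cost of whichever model has performed best so far. This is only a two-model caricature of the selection stage; the paper's compressor uses many weighted context models, substitution-tolerant models, and repeat models.

```python
from collections import defaultdict
from math import log2

def competitive_code_length(seq, alpha=1.0):
    """Toy competitive prediction over a 4-symbol DNA alphabet: an
    order-0 and an order-2 context model compete; each symbol is
    charged to whichever model has the lower cumulative code length so
    far (ideal arithmetic-coding cost, in bits)."""
    c0 = defaultdict(float)                       # order-0 counts
    c2 = defaultdict(lambda: defaultdict(float))  # order-2 counts
    loss = [0.0, 0.0]                             # cumulative bits per model
    total_bits = 0.0
    for i, s in enumerate(seq):
        ctx = seq[max(0, i - 2):i]
        p0 = (c0[s] + alpha) / (sum(c0.values()) + 4 * alpha)
        p2 = (c2[ctx][s] + alpha) / (sum(c2[ctx].values()) + 4 * alpha)
        total_bits += -log2(p0 if loss[0] <= loss[1] else p2)
        loss[0] -= log2(p0)
        loss[1] -= log2(p2)
        c0[s] += 1
        c2[ctx][s] += 1
    return total_bits
```

    On a context-predictable sequence the order-2 model soon wins the competition, so the total cost falls well below the 2 bits per symbol of a uniform coder.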

    Detection of Low-Copy Human Virus DNA upon Prolonged Formalin Fixation

    Formalin fixation, albeit an outstanding method for morphological and molecular preservation, induces DNA damage and cross-linking, which can hinder nucleic acid screening. This is of particular concern in the detection of low-abundance targets, such as persistent DNA viruses. In the present study, we evaluated the analytical sensitivity of viral detection in lung, liver, and kidney specimens from four deceased individuals. The samples were either frozen or incubated in formalin (with or without paraffin embedding) for up to 10 days. We tested two DNA extraction protocols for control of efficient yields and viral detection. We used short-amplicon qPCRs (63–159 nucleotides) to detect 11 DNA viruses, as well as hybridization capture of these plus 27 additional ones, followed by deep sequencing. We observed marginally higher ratios of amplifiable DNA and slightly higher viral genoprevalences in the samples extracted with the FFPE-dedicated protocol. Based on the findings in the frozen samples, most viruses were detected regardless of the extended fixation times. False-negative calls, particularly by qPCR, correlated with low levels of viral DNA (150 base pairs). Our data suggest that low-copy viral DNAs can be satisfactorily investigated from FFPE specimens, and encourage further examination of historical materials.