A conditional compression distance that unveils insights of the genomic evolution
We describe a compression-based distance for genomic sequences. Instead of
using the usual conjoint information content, as in the classical Normalized
Compression Distance (NCD), it uses the conditional information content. To
compute this Normalized Conditional Compression Distance (NCCD), we need a
normal conditional compressor, which we built using a mixture of static and
dynamic finite-context models. Using this approach, we measured chromosomal
distances between Hominidae primates and also between Muroidea (rat and mouse),
revealing several insights into evolution that have not previously been reported in the literature.
Comment: Full version of the DCC 2014 paper "A conditional compression distance that unveils insights of the genomic evolution".
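The difference between the two measures can be illustrated with a general-purpose compressor standing in for the normal conditional compressor of the paper (the mixture of static and dynamic finite-context models is not reproduced here); the use of `zlib` and the approximation C(x|y) ≈ C(yx) − C(y) are assumptions of this sketch, not the paper's method:

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes (DEFLATE as a stand-in compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Classical NCD: based on the conjoint information content C(xy)."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def nccd(x: bytes, y: bytes) -> float:
    """NCCD sketch: approximates the conditional content C(x|y)
    by C(yx) - C(y), and symmetrically C(y|x) by C(xy) - C(x)."""
    cx, cy = c(x), c(y)
    cx_given_y = c(y + x) - cy
    cy_given_x = c(x + y) - cx
    return max(cx_given_y, cy_given_x) / max(cx, cy)
```

With a good conditional compressor, similar sequences yield values near 0 and unrelated ones values near 1.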
Compression models and tools for omics data
The rapid development of high-throughput sequencing technologies, and
the huge volumes of data they generate as a consequence,
have revolutionized biological research and discovery. Motivated by this, we
investigate in this thesis methods capable of providing an
efficient representation of omics data in compressed or encrypted form,
and then employ them to analyze omics data.
First and foremost, we describe a number of measures for quantifying
the information within and between omics sequences. Then, we
present finite-context models (FCMs), substitution-tolerant Markov models
(STMMs) and a combination of the two, specialized in modeling
biological data, for the purposes of data compression and analysis.
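A minimal order-k finite-context model can be sketched as follows; this is an illustrative single-model version (the substitution-tolerant models and the mixing of several models used in the thesis are omitted):

```python
from collections import defaultdict
from math import log2

class FiniteContextModel:
    """Order-k finite-context model over a fixed alphabet, with additive
    smoothing; estimates the number of bits needed to encode a sequence."""
    def __init__(self, k: int, alphabet: str = "ACGT", alpha: float = 1.0):
        self.k, self.alphabet, self.alpha = k, alphabet, alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq: str) -> None:
        # Count symbol occurrences after each length-k context.
        for i in range(self.k, len(seq)):
            self.counts[seq[i - self.k:i]][seq[i]] += 1

    def bits(self, seq: str) -> float:
        """Estimated code length of seq under the trained model."""
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx, sym = seq[i - self.k:i], seq[i]
            num = self.counts[ctx][sym] + self.alpha
            den = sum(self.counts[ctx].values()) + self.alpha * len(self.alphabet)
            total -= log2(num / den)
        return total
```

Sequences that follow the trained statistics cost few bits per symbol; unseen contexts fall back to the uniform distribution, which is the basis for using code length as an information measure.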
To ease the storage of the aforementioned data deluge, we design two lossless
data compressors for genomic and one for proteomic data. The methods
work on the basis of (a) a combination of FCMs and STMMs or (b) the mentioned
combination along with repeat models and a competitive prediction
model. Tests on various synthetic and real data showed that they outperform
previously proposed methods in terms of compression ratio.
The privacy of genomic data is a topic that has recently come into focus with
developments in the field of personalized medicine. We propose a tool that is
able to represent genomic data in a securely encrypted fashion, and at the
same time, is able to compact FASTA and FASTQ sequences by a factor
of three. It employs AES encryption accompanied by a shuffling mechanism
for improving the data security. The results show it is faster than
general-purpose and special-purpose algorithms.
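Much of the factor-of-three compaction for nucleotide sequences comes from packing bases more densely than one byte each. A hypothetical sketch of a 2-bit packing stage, with the AES encryption and shuffling steps of the actual tool deliberately omitted, might look like:

```python
def pack_dna(seq: str) -> bytes:
    """Pack an ACGT-only sequence at 2 bits per base (4 bases per byte).
    Headers, N symbols and quality scores, which the real tool handles
    separately, are out of scope for this sketch."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for ch in group:
            byte = (byte << 2) | code[ch]
        # Pad the final byte's low bits with zeros if len(seq) % 4 != 0.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack_dna(data: bytes, n: int) -> str:
    """Inverse of pack_dna; n is the original sequence length."""
    bases = "ACGT"
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(bases[(byte >> shift) & 3])
    return "".join(seq[:n])
```

In the real pipeline the packed stream would then be shuffled and AES-encrypted before being written out.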
Compression techniques can be employed for analysis of omics data. Having
this in mind, we investigate the identification of unique regions in a species
with respect to close species, that can give us an insight into evolutionary
traits. For this purpose, we design two alignment-free tools that can accurately
find and visualize distinct regions between two collections of DNA or
protein sequences. Testing modern humans against Neanderthals,
we found a number of regions absent in Neanderthals that may express new
functionalities associated with the evolution of modern humans.
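One simple alignment-free way to flag such regions, sketched here under the assumption that a k-mer absent from the reference marks a candidate unique region (the actual tools are more elaborate, e.g., tolerant to substitutions):

```python
def absent_regions(target: str, reference: str, k: int = 11):
    """Report [start, end) intervals of `target` covered by k-mers that
    never occur in `reference` -- a rough, alignment-free notion of
    regions unique to `target`."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    regions, start, end = [], None, None
    for i in range(len(target) - k + 1):
        if target[i:i + k] not in ref_kmers:
            if start is None:
                start = i
            end = i + k
        elif start is not None:
            regions.append((start, end))
            start = None
    if start is not None:
        regions.append((start, end))
    return regions
```

On genome-scale inputs a hash-based or compression-based profile replaces the plain Python set, but the principle is the same.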
Finally, we investigate the identification of genomic rearrangements, that
have important roles in genetic disorders and cancer, by employing a compression
technique. For this purpose, we design a tool that is able to accurately
localize and visualize small- and large-scale rearrangements between
two genomic sequences. The results of applying the proposed tool to several
synthetic and real datasets were consistent with those partially reported by
wet-laboratory approaches, e.g., FISH analysis.
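A toy version of k-mer-based rearrangement detection can be sketched as follows; the `kmer_map` and `breakpoints` helpers are illustrative names for this sketch, not the actual tool's API:

```python
def kmer_map(a: str, b: str, k: int = 8):
    """For each position in `a`, record where its k-mer first occurs in `b`.
    Monotone stretches of mapped positions suggest conserved blocks; jumps
    hint at rearrangement breakpoints."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], j)
    return [(i, index.get(a[i:i + k])) for i in range(len(a) - k + 1)]

def breakpoints(pairs, gap: int = 20):
    """Positions in `a` where the mapped coordinate in `b` jumps by more
    than `gap`, i.e., candidate rearrangement breakpoints."""
    bps, prev = [], None
    for i, j in pairs:
        if j is None:
            continue  # k-mer absent from b (e.g., spans a junction)
        if prev is not None and abs(j - prev - 1) > gap:
            bps.append(i)
        prev = j
    return bps
```

Swapping two blocks between the sequences produces exactly one large jump in the mapping, which the detector reports.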
Interpreting Deep Learning for cell differentiation. Supervised and Unsupervised models viewed through the lens of information and perturbation theory.
"Predicting the future isn't magic, it's artificial intelligence" Dave Waters.
In the last decades there has been an unprecedented growth in the field of machine learning, and particularly within deep learning models. The combination of big data and computational power has nurtured the evolution of a variety of new methods to predict and interpret future scenarios. These data centric models can achieve exceptional performances on specific tasks, with their prediction boundaries continuously expanding towards new and more complex challenges.
However, the model complexity often translates into a lack of interpretability from a scientific perspective: it is not trivial to identify the factors involved in final outcomes.
Explainability may not always be a requirement for some machine learning tasks, especially when it comes at the expense of predictive performance. But for some applications, such as biological discovery or medical diagnostics, understanding the output and determining the factors that influence decisions is essential.
In this thesis we develop both a supervised and an unsupervised approach to map from genotype to phenotype. We emphasise the importance of interpretability and feature extraction from the models, by identifying relevant genes for cell differentiation. We then continue to explore the rules and mechanisms behind the models from a theoretical perspective, using information theory to explain the learning process and applying perturbation theory to transform the results into a generalisable representation.
We start by building a supervised approach to mapping cell profiles from genotype to phenotype, using single cell RNA-Seq data. We leverage non-linearities among gene expressions to identify cellular levels of differentiation. The ambiguity and even absence of labels in most biological studies instigated the development of novel unsupervised techniques, leading to a new general and biologically interpretable framework based on Variational Autoencoders.
The application and validation of the methods has proven successful, but questions regarding the learning process and the generative nature of the results remained unanswered. We use information theory to define a new approach to interpreting training and the converged solutions of our models.
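For concreteness, the information-theoretic quantity at the core of the VAE objective is the KL divergence between the encoder's posterior and the prior. A minimal numpy sketch of the diagonal-Gaussian case (illustrative only, not the thesis code) is:

```python
import numpy as np

def gaussian_kl(mu: np.ndarray, logvar: np.ndarray) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): the regulariser in the
    VAE objective, measuring in nats how far the encoder's posterior
    strays from the standard-normal prior."""
    return float(0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

def elbo(x, x_hat, mu, logvar, sigma=1.0):
    """Evidence lower bound with a fixed-scale Gaussian likelihood
    (up to an additive constant). Training maximises this quantity."""
    recon = -0.5 * np.sum((x - x_hat) ** 2) / sigma**2
    return recon - gaussian_kl(mu, logvar)
```

Tracking the two ELBO terms separately over training is one label-independent way to interpret what the model has learned.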
The variational and generative nature of Autoencoders provides a platform to develop general models. Their results should extrapolate and allow generalisation beyond the boundaries of the observed data. To this end, we introduce for the first time a new interpretation of the embedded generative functions through Perturbation Theory. The embedding multiplicity is addressed by transforming the distributions into a new set of generalisable functions, while characterising their energy spectrum under a particular energy landscape.
We outline the combination of theoretical and machine learning based methods for moving towards interpretable and generalisable models. Developing a theoretical framework to map from genotype to phenotype, we provide both supervised and unsupervised tools to operate over single-cell RNA-Seq data. We have generated a pipeline to identify relevant genes and cell types through Variational Autoencoders (VAEs),
validating reconstructed gene expressions to prove the generative performance of the embeddings. The new interpretation of the information learned and extracted by the models defines a label-independent evaluation, particularly useful for unsupervised
learning. Lastly, we introduce a novel transformation of the generative embeddings based on quantum and perturbation theory.
Our contributions can and have been extended to new datasets, according to the nature of the tasks being explored. For instance, the combination of unsupervised learning and information theory can be applied to a variety of biological or medical data. We have trained several VAE models with additional cancer and metabolic data, proving to extract meaningful representations of the data. The perturbation theory transformation of the embedding can also lead to future research on the generative potential of Variational Autoencoders through a physics perspective, combining statistical and quantum mechanics.
We believe that machine learning will only continue its fast expansion and growth through the development of more generalisable and more interpretable models.
"Prediction is very difficult, especially if it's about the future" Niels Boh
Compression-based pattern recognition: an example of biometrics using ECG
The amount of data being collected by sensors and smart devices that
people use on their daily lives has been increasing at higher rates than
ever before. This enables the use of biomedical signals in
several applications, with the aid of pattern recognition algorithms.
In this thesis we investigate the use of compression-based
methods to perform classification of one-dimensional signals. To
test those methods, we use electrocardiographic (ECG) signals and
the task of biometric identification as a testbed example.
First and foremost, we introduce the notion of Kolmogorov complexity
and how it relates to compression methods. Then, we explain how
these methods can be useful for pattern recognition, by exploring different
compression-based measures, namely the Normalized Relative Compression (NRC),
a measure based on the relative similarity between strings. For this purpose,
we present finite-context models and explain the theory behind a generalized
version of those models, the extended-alphabet finite-context models (xaFCMs),
a novel contribution.
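A rough sketch of the NRC, with DEFLATE's preset-dictionary mechanism standing in for the finite-context-model relative compressor (an assumption of this sketch, with the caveats noted in the comments):

```python
import zlib
from math import log2

def relative_bits(x: bytes, y: bytes) -> int:
    """Bits needed to represent x given y, approximated with DEFLATE using
    (the last 32 KB of) y as a preset dictionary. Unlike a true relative
    compressor, DEFLATE can also exploit repetitions inside x itself."""
    comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                            zlib.Z_DEFAULT_STRATEGY, y[-32768:])
    return 8 * len(comp.compress(x) + comp.flush())

def nrc(x: bytes, y: bytes, alphabet_size: int = 4) -> float:
    """Normalized Relative Compression: C(x || y) / (|x| * log2 |alphabet|).
    Values near 0 mean x is almost fully explained by y; with a byte-oriented
    stand-in compressor the value can exceed 1."""
    return relative_bits(x, y) / (len(x) * log2(alphabet_size))
```

In the biometric setting, y would be a quantized reference ECG string for a known subject and x a query segment; the subject minimizing the NRC is the predicted identity.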
Since the testbed application for the methods presented in the thesis is
based on ECG signals, we explain what constitutes such a signal and the
methods, such as filtering and quantization, that should be applied to
the data before compression.
Finally, we explore the application of biometric identification using the ECG
signal in more depth, performing tests on the acquisition of
signals and benchmarking different proposals based on compression methods,
namely non-fiducial ones. We also highlight the advantages of such an
alternative approach to machine learning methods, namely low computational
cost and no need for any kind of feature extraction, making this
approach easily transferable to different applications and signals.
Complement-mediated cooperation between immunocytes in the compound ascidian Botryllus schlosseri
Two main kinds of innate immune responses are present in ascidians: phagocytosis and cytotoxicity. They are mediated by two different types of circulating immunocytes: phagocytes and cytotoxic morula cells (MCs). MCs, once activated by non-self-recognition, can stimulate phagocytosis by the release of soluble factors able to act as opsonins. BsC3, the complement C3 homologue, like mammalian C3, contains the thioester bond required to split the molecule into BsC3a and BsC3b. BsC3b likely represents the MC opsonin, as it can enhance phagocytosis. This tenet is supported by the observed reduction in phagocytosing cells after exposure of hemocytes to compstatin, a drug preventing C3 activation, or after bsc3 knockdown by iRNA injection. In addition, the transcript for BsCR1, homologous to mammalian CR1, is present in Botryllus phagocytes, and its transcription is modulated during the blastogenetic cycle. MCs also release cytokines (chemokines) able to recruit immunocytes to the infection site. This activity is inhibited by antibodies raised against human TNFa. Since no genes for TNFa are present in the Botryllus genome, the observed activity is probably related to a TNF-domain-containing protein, a member of the Botryllus complement system. Conversely, activated phagocytes release a rhamnose-binding lectin able to interact with microbial surfaces and act as an opsonin. It can also activate MCs by inducing the release of the reported cytokine and stimulating their degranulation. Overall, the results obtained so far indicate the presence of a well-defined cross-talk between the two types of immunocytes during the immune responses of B. schlosseri.
The Reasonable Effectiveness of Randomness in Scalable and Integrative Gene Regulatory Network Inference and Beyond
Gene regulation is orchestrated by a vast number of molecules, including transcription factors and co-factors, chromatin regulators, as well as epigenetic mechanisms, and it has been shown that transcriptional misregulation, e.g., caused by mutations in regulatory sequences, is responsible for a plethora of diseases, including cancer, developmental or neurological disorders. As a consequence, decoding the architecture of gene regulatory networks has become one of the most important tasks in modern (computational) biology. However, to advance our understanding of the mechanisms involved in the transcriptional apparatus, we need scalable approaches that can deal with the increasing number of large-scale, high-resolution, biological datasets. In particular, such approaches need to be capable of efficiently integrating and exploiting the biological and technological heterogeneity of such datasets in order to best infer the underlying, highly dynamic regulatory networks, often in the absence of sufficient ground truth data for model training or testing. With respect to scalability, randomized approaches have proven to be a promising alternative to deterministic methods in computational biology. As an example, one of the top-performing algorithms in a community challenge on gene regulatory network inference from transcriptomic data is based on a random forest regression model. In this concise survey, we aim to highlight how randomized methods may serve as a highly valuable tool, in particular with increasing amounts of large-scale biological experiments and datasets being collected. Given the complexity and interdisciplinary nature of the gene regulatory network inference problem, we hope our survey may be helpful to both computational and biological scientists.
It is our aim to provide a starting point for a dialogue about the concepts, benefits, and caveats of the toolbox of randomized methods, since unravelling the intricate web of highly dynamic regulatory events will be one fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases.
Statistical Population Genomics
This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions
Genetics and genomics of moso bamboo (Phyllostachys edulis): Current status, future challenges, and biotechnological opportunities toward a sustainable bamboo industry
Sustainable goals for the contemporary world seek viable solutions for interconnected challenges, particularly in the fields of food and energy security and climate change. We present bamboo, one of the most versatile plant species on earth, as an ideal candidate for the bioeconomy, capable of meeting some of these challenges. With its potential realized, particularly in the industrial sector, countries such as China are investing extensively in bamboo development and cultivation to support a myriad of industrial uses. These include the timber, fiber, biofuel, paper, food, and medicinal industries. Bamboo is an ecologically viable choice, having better adaptation to wider environments than other grasses, and can help to restore degraded lands and mitigate climate change. Bamboo, as a crop species, has not become amenable to genetic improvement, due to its long breeding cycle, perennial nature, and monocarpic behavior. One of the commonly used species, moso bamboo (Phyllostachys edulis), is a potential candidate that qualifies as industrial bamboo. With its whole-genome information released, genetic manipulation of moso bamboo offers tremendous potential to meet industrial expectations in either quality or quantity. Further, bamboo cultivation can expect several natural hindrances in the form of biotic and abiotic stresses, which need viable solutions such as genetic resistance. Taking a pragmatic view of these future requirements, we have compiled the present status of bamboo physiology, genetics, genomics, and biotechnology, particularly of moso bamboo, to drive various implications in meeting industrial and cultivation requirements. We also discuss challenges underway, caveats, and contextual opportunities concerning sustainable development. Peer reviewed.
Computational Investigations of Biomolecular Mechanisms in Genomic Replication, Repair and Transcription
High-fidelity maintenance of the genome is imperative to ensuring the stability and proliferation of cells. The genetic material (DNA) of a cell faces a constant barrage of metabolic and environmental assaults throughout its lifetime, ultimately leading to DNA damage. Left unchecked, DNA damage can result in genomic instability, inviting a cascade of mutations that initiate cancer and other aging disorders. Thus, a large area of focus has been dedicated to understanding how DNA is damaged, repaired, expressed and replicated. At the heart of these processes lie complex macromolecular dynamics coupled with intricate protein-DNA interactions. Through advanced computational techniques it has become possible to probe these mechanisms at the atomic level, providing a physical basis to describe biomolecular phenomena. To this end, we have performed studies aimed at elucidating the dynamics and interactions intrinsic to the functionality of biomolecules critical to maintaining genomic integrity: modeling the DNA editing mechanism of DNA polymerase III, uncovering the DNA damage recognition/repair mechanism of thymine DNA glycosylase and linking genetic disease to the functional dynamics of the pre-initiation complex transcription machinery. Collectively, our results elucidate the dynamic interplay between proteins and DNA, further broadening our understanding of these complex processes involved in genomic maintenance.