A conditional compression distance that unveils insights of the genomic evolution
We describe a compression-based distance for genomic sequences. Instead of
using the usual conjoint information content, as in the classical Normalized
Compression Distance (NCD), it uses the conditional information content. To
compute this Normalized Conditional Compression Distance (NCCD), we need a
normal conditional compressor, which we built using a mixture of static and
dynamic finite-context models. Using this approach, we measured chromosomal
distances between Hominidae primates and also between Muroidea (rat and mouse),
revealing several insights into evolution that have not previously been reported in the literature.
Comment: Full version of the DCC 2014 paper "A conditional compression distance that unveils insights of the genomic evolution".
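The difference between the two measures can be illustrated with a general-purpose compressor standing in for the normal conditional compressor of the paper (the mixture of static and dynamic finite-context models is not reproduced here); the use of `zlib` and the approximation C(x|y) ≈ C(yx) − C(y) are assumptions of this sketch, not the paper's method:

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes (DEFLATE as a stand-in compressor)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Classical NCD: based on the conjoint information content C(xy)."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def nccd(x: bytes, y: bytes) -> float:
    """NCCD sketch: approximates the conditional content C(x|y)
    by C(yx) - C(y), and symmetrically C(y|x) by C(xy) - C(x)."""
    cx, cy = c(x), c(y)
    cx_given_y = c(y + x) - cy
    cy_given_x = c(x + y) - cx
    return max(cx_given_y, cy_given_x) / max(cx, cy)
```

With a good conditional compressor, similar sequences yield values near 0 and unrelated ones values near 1.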
Compression models and tools for omics data
The rapid development of high-throughput sequencing technologies, and
the huge volumes of data they generate as a consequence,
have revolutionized biological research and discovery. Motivated by this, we
investigate in this thesis methods capable of providing an
efficient representation of omics data in compressed or encrypted form,
and then employ them to analyze omics data.
First and foremost, we describe a number of measures for quantifying
the information within and between omics sequences. Then, we
present finite-context models (FCMs), substitution-tolerant Markov models
(STMMs) and a combination of the two, specialized in modeling
biological data, for the purposes of data compression and analysis.
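A minimal order-k finite-context model can be sketched as follows; this is an illustrative single-model version (the substitution-tolerant models and the mixing of several models used in the thesis are omitted):

```python
from collections import defaultdict
from math import log2

class FiniteContextModel:
    """Order-k finite-context model over a fixed alphabet, with additive
    smoothing; estimates the number of bits needed to encode a sequence."""
    def __init__(self, k: int, alphabet: str = "ACGT", alpha: float = 1.0):
        self.k, self.alphabet, self.alpha = k, alphabet, alpha
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq: str) -> None:
        # Count symbol occurrences after each length-k context.
        for i in range(self.k, len(seq)):
            self.counts[seq[i - self.k:i]][seq[i]] += 1

    def bits(self, seq: str) -> float:
        """Estimated code length of seq under the trained model."""
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx, sym = seq[i - self.k:i], seq[i]
            num = self.counts[ctx][sym] + self.alpha
            den = sum(self.counts[ctx].values()) + self.alpha * len(self.alphabet)
            total -= log2(num / den)
        return total
```

Sequences that follow the trained statistics cost few bits per symbol; unseen contexts fall back to the uniform distribution, which is the basis for using code length as an information measure.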
To ease the storage of the aforementioned data deluge, we design two lossless
data compressors for genomic and one for proteomic data. The methods
work on the basis of (a) a combination of FCMs and STMMs or (b) the mentioned
combination along with repeat models and a competitive prediction
model. Tests on various synthetic and real data showed that they outperform
previously proposed methods in terms of compression ratio.
The privacy of genomic data is a topic that has recently come into focus with
developments in the field of personalized medicine. We propose a tool that is
able to represent genomic data in a securely encrypted fashion, and at the
same time, is able to compact FASTA and FASTQ sequences by a factor
of three. It employs AES encryption accompanied by a shuffling mechanism
for improving the data security. The results show it is faster than
general-purpose and special-purpose algorithms.
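Much of the factor-of-three compaction for nucleotide sequences comes from packing bases more densely than one byte each. A hypothetical sketch of a 2-bit packing stage, with the AES encryption and shuffling steps of the actual tool deliberately omitted, might look like:

```python
def pack_dna(seq: str) -> bytes:
    """Pack an ACGT-only sequence at 2 bits per base (4 bases per byte).
    Headers, N symbols and quality scores, which the real tool handles
    separately, are out of scope for this sketch."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for ch in group:
            byte = (byte << 2) | code[ch]
        # Pad the final byte's low bits with zeros if len(seq) % 4 != 0.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack_dna(data: bytes, n: int) -> str:
    """Inverse of pack_dna; n is the original sequence length."""
    bases = "ACGT"
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(bases[(byte >> shift) & 3])
    return "".join(seq[:n])
```

In the real pipeline the packed stream would then be shuffled and AES-encrypted before being written out.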
Compression techniques can be employed for analysis of omics data. Having
this in mind, we investigate the identification of unique regions in a species
with respect to close species, that can give us an insight into evolutionary
traits. For this purpose, we design two alignment-free tools that can accurately
find and visualize distinct regions between two collections of DNA or
protein sequences. Testing modern humans against Neanderthals,
we found a number of regions absent in Neanderthals that may express new
functionalities associated with the evolution of modern humans.
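One simple alignment-free way to flag such regions, sketched here under the assumption that a k-mer absent from the reference marks a candidate unique region (the actual tools are more elaborate, e.g., tolerant to substitutions):

```python
def absent_regions(target: str, reference: str, k: int = 11):
    """Report [start, end) intervals of `target` covered by k-mers that
    never occur in `reference` -- a rough, alignment-free notion of
    regions unique to `target`."""
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    regions, start, end = [], None, None
    for i in range(len(target) - k + 1):
        if target[i:i + k] not in ref_kmers:
            if start is None:
                start = i
            end = i + k
        elif start is not None:
            regions.append((start, end))
            start = None
    if start is not None:
        regions.append((start, end))
    return regions
```

On genome-scale inputs a hash-based or compression-based profile replaces the plain Python set, but the principle is the same.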
Finally, we investigate the identification of genomic rearrangements, that
have important roles in genetic disorders and cancer, by employing a compression
technique. For this purpose, we design a tool that is able to accurately
localize and visualize small- and large-scale rearrangements between
two genomic sequences. The results of applying the proposed tool to several
synthetic and real datasets were consistent with those partially reported by
wet-laboratory approaches, e.g., FISH analysis.
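A toy version of k-mer-based rearrangement detection can be sketched as follows; the `kmer_map` and `breakpoints` helpers are illustrative names for this sketch, not the actual tool's API:

```python
def kmer_map(a: str, b: str, k: int = 8):
    """For each position in `a`, record where its k-mer first occurs in `b`.
    Monotone stretches of mapped positions suggest conserved blocks; jumps
    hint at rearrangement breakpoints."""
    index = {}
    for j in range(len(b) - k + 1):
        index.setdefault(b[j:j + k], j)
    return [(i, index.get(a[i:i + k])) for i in range(len(a) - k + 1)]

def breakpoints(pairs, gap: int = 20):
    """Positions in `a` where the mapped coordinate in `b` jumps by more
    than `gap`, i.e., candidate rearrangement breakpoints."""
    bps, prev = [], None
    for i, j in pairs:
        if j is None:
            continue  # k-mer absent from b (e.g., spans a junction)
        if prev is not None and abs(j - prev - 1) > gap:
            bps.append(i)
        prev = j
    return bps
```

Swapping two blocks between the sequences produces exactly one large jump in the mapping, which the detector reports.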
Interpreting Deep Learning for cell differentiation. Supervised and Unsupervised models viewed through the lens of information and perturbation theory.
"Predicting the future isn't magic, it's artificial intelligence" Dave Waters.
In the last decades there has been an unprecedented growth in the field of machine learning, and particularly within deep learning models. The combination of big data and computational power has nurtured the evolution of a variety of new methods to predict and interpret future scenarios. These data centric models can achieve exceptional performances on specific tasks, with their prediction boundaries continuously expanding towards new and more complex challenges.
However, the model complexity often translates into a lack of interpretability from a scientific perspective: it is not trivial to identify the factors involved in final outcomes.
Explainability may not always be a requirement for some machine learning tasks, especially when it comes at the expense of predictive performance. But for some applications, such as biological discovery or medical diagnostics, understanding the output and determining the factors that influence decisions is essential.
In this thesis we develop both a supervised and an unsupervised approach to map from genotype to phenotype. We emphasise the importance of interpretability and feature extraction from the models, by identifying relevant genes for cell differentiation. We then continue to explore the rules and mechanisms behind the models from a theoretical perspective, using information theory to explain the learning process and applying perturbation theory to transform the results into a generalisable representation.
We start by building a supervised approach to mapping cell profiles from genotype to phenotype, using single cell RNA-Seq data. We leverage non-linearities among gene expressions to identify cellular levels of differentiation. The ambiguity and even absence of labels in most biological studies instigated the development of novel unsupervised techniques, leading to a new general and biologically interpretable framework based on Variational Autoencoders.
The application and validation of the methods has proven successful, but questions regarding the learning process and the generative nature of the results remained unanswered. We use information theory to define a new approach to interpreting training and the converged solutions of our models.
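For concreteness, the information-theoretic quantity at the core of the VAE objective is the KL divergence between the encoder's posterior and the prior. A minimal numpy sketch of the diagonal-Gaussian case (illustrative only, not the thesis code) is:

```python
import numpy as np

def gaussian_kl(mu: np.ndarray, logvar: np.ndarray) -> float:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ): the regulariser in the
    VAE objective, measuring in nats how far the encoder's posterior
    strays from the standard-normal prior."""
    return float(0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

def elbo(x, x_hat, mu, logvar, sigma=1.0):
    """Evidence lower bound with a fixed-scale Gaussian likelihood
    (up to an additive constant). Training maximises this quantity."""
    recon = -0.5 * np.sum((x - x_hat) ** 2) / sigma**2
    return recon - gaussian_kl(mu, logvar)
```

Tracking the two ELBO terms separately over training is one label-independent way to interpret what the model has learned.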
The variational and generative nature of Autoencoders provides a platform to develop general models. Their results should extrapolate and allow generalisation beyond the boundaries of the observed data. To this end, we introduce for the first time a new interpretation of the embedded generative functions through Perturbation Theory. The embedding multiplicity is addressed by transforming the distributions into a new set of generalisable functions, while characterising their energy spectrum under a particular energy landscape.
We outline the combination of theoretical and machine learning based methods for moving towards interpretable and generalisable models. Developing a theoretical framework to map from genotype to phenotype, we provide both supervised and unsupervised tools to operate over single-cell RNA-Seq data. We have generated a pipeline to identify relevant genes and cell types through Variational Autoencoders (VAEs),
validating reconstructed gene expressions to prove the generative performance of the embeddings. The new interpretation of the information learned and extracted by the models defines a label-independent evaluation, particularly useful for unsupervised
learning. Lastly, we introduce a novel transformation of the generative embeddings based on quantum and perturbation theory.
Our contributions can and have been extended to new datasets, according to the nature of the tasks being explored. For instance, the combination of unsupervised learning and information theory can be applied to a variety of biological or medical data. We have trained several VAE models with additional cancer and metabolic data, proving to extract meaningful representations of the data. The perturbation theory transformation of the embedding can also lead to future research on the generative potential of Variational Autoencoders through a physics perspective, combining statistical and quantum mechanics.
We believe that machine learning will only continue its fast expansion and growth through the development of more generalisable and more interpretable models.
"Prediction is very difficult, especially if it's about the future" Niels Boh
Compression-based pattern recognition: an example of biometrics using ECG
The amount of data being collected by sensors and smart devices that
people use on their daily lives has been increasing at higher rates than
ever before. This enables the use of biomedical signals in
several applications, with the aid of pattern recognition algorithms.
In this thesis we investigate the use of compression-based
methods to perform classification of one-dimensional signals. To
test those methods, we use electrocardiographic (ECG) signals and
the task of biometric identification as a testbed example.
First and foremost, we introduce the notion of Kolmogorov complexity
and how it relates to compression methods. Then, we explain how
these methods can be useful for pattern recognition, by exploring different
compression-based measures, namely the Normalized Relative Compression (NRC),
a measure based on the relative similarity between strings. For this purpose,
we present finite-context models and explain the theory behind a generalized
version of those models, the extended-alphabet finite-context models (xaFCMs),
a novel contribution.
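A rough sketch of the NRC, with DEFLATE's preset-dictionary mechanism standing in for the finite-context-model relative compressor (an assumption of this sketch, with the caveats noted in the comments):

```python
import zlib
from math import log2

def relative_bits(x: bytes, y: bytes) -> int:
    """Bits needed to represent x given y, approximated with DEFLATE using
    (the last 32 KB of) y as a preset dictionary. Unlike a true relative
    compressor, DEFLATE can also exploit repetitions inside x itself."""
    comp = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                            zlib.Z_DEFAULT_STRATEGY, y[-32768:])
    return 8 * len(comp.compress(x) + comp.flush())

def nrc(x: bytes, y: bytes, alphabet_size: int = 4) -> float:
    """Normalized Relative Compression: C(x || y) / (|x| * log2 |alphabet|).
    Values near 0 mean x is almost fully explained by y; with a byte-oriented
    stand-in compressor the value can exceed 1."""
    return relative_bits(x, y) / (len(x) * log2(alphabet_size))
```

In the biometric setting, y would be a quantized reference ECG string for a known subject and x a query segment; the subject minimizing the NRC is the predicted identity.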
Since the testbed application for the methods presented in the thesis is
based on ECG signals, we explain what constitutes such a signal and the
methods, such as filtering and quantization, that should be applied to
the data before compression.
Finally, we explore the application of biometric identification using the ECG
signal in more depth, performing tests on the acquisition of
signals and benchmarking different proposals based on compression methods,
namely non-fiducial ones. We also highlight the advantages of such an
alternative approach to machine learning methods, namely low computational
cost and no need for any kind of feature extraction, making this
approach easily transferable to different applications and signals.
Complement-mediated cooperation between immunocytes in the compound ascidian Botryllus schlosseri
Two main kinds of innate immune responses are present in ascidians: phagocytosis and cytotoxicity. They are mediated by two different types of circulating immunocytes: phagocytes and cytotoxic morula cells (MCs). MCs, once activated by non-self-recognition, can stimulate phagocytosis by the release of soluble factors able to act as opsonins. BsC3, the complement C3 homologue, like mammalian C3, contains the thioester bond required to split the molecule into BsC3a and BsC3b. BsC3b likely represents the MC opsonin, as it can enhance phagocytosis. This tenet is supported by the observed reduction in phagocytosing cells after exposure of hemocytes to compstatin, a drug preventing C3 activation, or after bsc3 knockdown by iRNA injection. In addition, the transcript for BsCR1, homologous to mammalian CR1, is present in Botryllus phagocytes, and its transcription is modulated during the blastogenetic cycle. MCs also release cytokines (chemokines) able to recruit immunocytes to the infection site. This activity is inhibited by antibodies raised against human TNFa. Since no genes for TNFa are present in the Botryllus genome, the observed activity is probably related to a TNF-domain-containing protein, a member of the Botryllus complement system. Conversely, activated phagocytes release a rhamnose-binding lectin able to interact with microbial surfaces and act as an opsonin. It can also activate MCs by inducing the release of the reported cytokine and stimulating their degranulation. Overall, the results obtained so far indicate the presence of a well-defined cross-talk between the two types of immunocytes during the immune responses of B. schlosseri.
The Reasonable Effectiveness of Randomness in Scalable and Integrative Gene Regulatory Network Inference and Beyond
Gene regulation is orchestrated by a vast number of molecules, including transcription factors and co-factors, chromatin regulators, as well as epigenetic mechanisms, and it has been shown that transcriptional misregulation, e.g., caused by mutations in regulatory sequences, is responsible for a plethora of diseases, including cancer, developmental or neurological disorders. As a consequence, decoding the architecture of gene regulatory networks has become one of the most important tasks in modern (computational) biology. However, to advance our understanding of the mechanisms involved in the transcriptional apparatus, we need scalable approaches that can deal with the increasing number of large-scale, high-resolution, biological datasets. In particular, such approaches need to be capable of efficiently integrating and exploiting the biological and technological heterogeneity of such datasets in order to best infer the underlying, highly dynamic regulatory networks, often in the absence of sufficient ground truth data for model training or testing. With respect to scalability, randomized approaches have proven to be a promising alternative to deterministic methods in computational biology. As an example, one of the top-performing algorithms in a community challenge on gene regulatory network inference from transcriptomic data is based on a random forest regression model. In this concise survey, we aim to highlight how randomized methods may serve as a highly valuable tool, in particular with increasing amounts of large-scale biological experiments and datasets being collected. Given the complexity and interdisciplinary nature of the gene regulatory network inference problem, we hope our survey may be helpful to both computational and biological scientists.
It is our aim to provide a starting point for a dialogue about the concepts, benefits, and caveats of the toolbox of randomized methods, since unravelling the intricate web of highly dynamic regulatory events will be one fundamental step in understanding the mechanisms of life and eventually developing efficient therapies to treat and cure diseases.
Statistical Population Genomics
This open access volume presents state-of-the-art inference methods in population genomics, focusing on data analysis based on rigorous statistical techniques. After introducing general concepts related to the biology of genomes and their evolution, the book covers state-of-the-art methods for the analysis of genomes in populations, including demography inference, population structure analysis and detection of selection, using both model-based inference and simulation procedures. Last but not least, it offers an overview of the current knowledge acquired by applying such methods to a large variety of eukaryotic organisms. Written in the highly successful Methods in Molecular Biology series format, chapters include introductions to their respective topics, pointers to the relevant literature, step-by-step, readily reproducible laboratory protocols, and tips on troubleshooting and avoiding known pitfalls. Authoritative and cutting-edge, Statistical Population Genomics aims to promote and ensure successful applications of population genomic methods to an increasing number of model systems and biological questions
Genetics and genomics of moso bamboo (Phyllostachys edulis): Current status, future challenges, and biotechnological opportunities toward a sustainable bamboo industry
Sustainable goals for the contemporary world seek viable solutions for interconnected challenges, particularly in the fields of food and energy security and climate change. We present bamboo, one of the most versatile plant species on earth, as an ideal candidate for the bioeconomy, capable of meeting some of these challenges. With its potential realized, particularly in the industrial sector, countries such as China are investing extensively in bamboo development and cultivation to support a myriad of industrial uses. These include the timber, fiber, biofuel, paper, food, and medicinal industries. Bamboo is an ecologically viable choice, having better adaptation to wider environments than other grasses, and can help to restore degraded lands and mitigate climate change. Bamboo, as a crop species, has not become amenable to genetic improvement, due to its long breeding cycle, perennial nature, and monocarpic behavior. One of the commonly used species, moso bamboo (Phyllostachys edulis), is a potential candidate that qualifies as industrial bamboo. With its whole-genome information released, genetic manipulation of moso bamboo offers tremendous potential to meet industrial expectations in either quality or quantity. Further, bamboo cultivation can expect several natural hindrances in the form of biotic and abiotic stresses, which need viable solutions such as genetic resistance. Taking a pragmatic view of these future requirements, we have compiled the present status of bamboo physiology, genetics, genomics, and biotechnology, particularly of moso bamboo, to drive various implications in meeting industrial and cultivation requirements. We also discuss challenges underway, caveats, and contextual opportunities concerning sustainable development. Peer reviewed.
Computational Investigations of Biomolecular Mechanisms in Genomic Replication, Repair and Transcription
High-fidelity maintenance of the genome is imperative to ensuring the stability and proliferation of cells. The genetic material (DNA) of a cell faces a constant barrage of metabolic and environmental assaults throughout its lifetime, ultimately leading to DNA damage. Left unchecked, DNA damage can result in genomic instability, inviting a cascade of mutations that initiate cancer and other aging disorders. Thus, a large area of focus has been dedicated to understanding how DNA is damaged, repaired, expressed and replicated. At the heart of these processes lie complex macromolecular dynamics coupled with intricate protein-DNA interactions. Through advanced computational techniques it has become possible to probe these mechanisms at the atomic level, providing a physical basis to describe biomolecular phenomena. To this end, we have performed studies aimed at elucidating the dynamics and interactions intrinsic to the functionality of biomolecules critical to maintaining genomic integrity: modeling the DNA editing mechanism of DNA polymerase III, uncovering the DNA damage recognition/repair mechanism of thymine DNA glycosylase and linking genetic disease to the functional dynamics of the pre-initiation complex transcription machinery. Collectively, our results elucidate the dynamic interplay between proteins and DNA, further broadening our understanding of these complex processes involved in genomic maintenance.