Spectral Learning of Binomial HMMs for DNA Methylation Data
We consider learning parameters of Binomial Hidden Markov Models, which may
be used to model DNA methylation data. The standard algorithm for the problem
is EM, which is computationally expensive for sequences of the scale of the
mammalian genome. Recently developed spectral algorithms can learn parameters
of latent variable models via tensor decomposition, and are highly efficient
for large data. However, these methods have only been applied to categorical
HMMs, and the main challenge is how to extend them to Binomial HMMs while still
retaining computational efficiency. We address this challenge by introducing a
new feature-map based approach that exploits specific properties of Binomial
HMMs. We provide theoretical performance guarantees for our algorithm and
evaluate it on real DNA methylation data.
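To make the model in the abstract above concrete, here is a minimal sketch of a Binomial HMM likelihood computed with the standard scaled forward recursion. It is not the spectral algorithm the authors propose. The function names, the (methylated reads, total reads) encoding of each observation, and the parameter layout are assumptions made for illustration only.

```python
import math


def binom_pmf(k, n, p):
    """Probability of k methylated reads out of n reads at methylation rate p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def forward_loglik(obs, init, trans, rates):
    """Log-likelihood of obs = [(k, n), ...] under a Binomial HMM.

    init[s]     : probability of starting in hidden state s (assumed layout)
    trans[r][s] : probability of moving from state r to state s
    rates[s]    : binomial success probability emitted by state s
    Assumes every observation has nonzero probability under the model.
    """
    n_states = len(init)
    k0, n0 = obs[0]
    alpha = [init[s] * binom_pmf(k0, n0, rates[s]) for s in range(n_states)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            k, n = obs[t]
            # Propagate one step through the chain, then weight by the emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * binom_pmf(k, n, rates[s])
                for s in range(n_states)
            ]
        # Rescale to avoid underflow on long sequences; accumulate the log scale.
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik
```

Each step costs time quadratic in the number of hidden states, and EM repeats such passes over the whole sequence at every iteration, which is the expense the abstract refers to for genome-scale data.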
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European
Community’s Horizon 2020 Program (project reference:
654021 - OpenMinted). M.K. additionally acknowledges the
Encomienda MINETAD-CNIO as part of the Plan for the
Advancement of Language Technology. O.R. and J.O. thank
the Foundation for Applied Medical Research (FIMA),
University of Navarra (Pamplona, Spain). This work was
partially funded by Consellería
de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic
funding of UID/BIO/04469/2013 unit and COMPETE 2020
(POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi
for useful feedback and discussions during the preparation of
the manuscript.
Systems Analytics and Integration of Big Omics Data
A "genotype" is essentially an organism's full hereditary information which is obtained from its parents. A "phenotype" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This "Big Data" is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.
Statistical learning based inference and analysis of epigenetic regulatory network topologies in T-helper cells
The reliable statistical inference of epigenetic regulatory networks that govern mammalian cell fates is very challenging. In this thesis we study this question for the differentiation decisions of T-helper (Th) cells, which have recently been shown to adopt a continuum of differentiated states in response to cytokine signals. To infer the underlying regulatory networks we introduce a novel framework for the inference of epigenetic regulatory network topologies based on statistical learning.
First, we infer, via a Hidden Markov Model, chromatin states based on histone modification patterns in naïve Th cells and differentiated Th1, Th2 and mixed Th1/2 states; these states are controlled by external cytokine stimuli and the gene dose of the Th1 master transcription factor Tbet (Tbx21). We then introduce a linear multivariate correlation measure for mapping enhancers to their target genes, which is parametrized on a training set of known enhancers. This analysis is refined further by
the application of partial correlations to distinguish direct from indirect effects. Applying this approach to our data, we recover known enhancers and obtain a genome-wide enhancer-gene mapping. We also extend this to the correlation of repressive regulatory elements with gene expression.
Next, we focus on the enhancers that regulate differentially expressed Th1 and Th2 specific transcripts. Building machine learning based predictors, we identify Th1 and Th2 specific enhancer and repressive state classes characterized by their response patterns to cytokine stimuli and Tbet dose. In turn, we use chromatin immunoprecipitation data of transcription factors to define the transcriptional regulatory logic governing the activities of the enhancer classes.
Finally, we combine enhancer-target gene maps and enhancer regulatory logic as well as inhibitory elements to infer a bipartite epigenetic network. The network architecture builds on enhancer and repressive state classes as well as on genes and transcription factors leading to a weighted multidigraph. The network topology reveals distinct community structures related to Th1, Th2 and hybrid functionality. We furthermore analyse multiplex networks resulting in condition-specific topologies. From these analyses we obtain unique contributions of distinct network nodes. Utilizing random walks on multidigraphs we extract metastable processes underlying the observed system.
In conclusion, we present a robust quantitative framework for mapping chromatin states to gene activity, and, by factoring in transcription factor regulation of enhancers, inferring epigenetic regulatory networks. This methodology is applicable to a wide range of systems.
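The abstract above mentions partial correlations as a way to separate direct from indirect effects. A minimal sketch of the first-order case follows: it correlates two signals after removing the linear influence of a third. The function names and the toy data are assumptions for illustration; the thesis's actual multivariate measure and its parametrization are not reproduced here.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Toy data (hypothetical): x and y both follow z, plus unrelated perturbations.
z = [0, 1, 2, 3, 4, 5, 6, 7]
x = [zi + e for zi, e in zip(z, [1, -1, -1, 1, 1, -1, -1, 1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, -1, -1, 1, 1])]
raw = pearson(x, y)
direct = partial_corr(x, y, z)
```

In this toy case x and y are strongly correlated in the raw measure only because both follow z, and the partial correlation given z is close to zero, which is the sense in which an indirect association is filtered out.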
Computational Methods for Comparative Genomic and Epigenomic Annotations across Multiple Species
In recent years Genome Wide Association Studies (GWAS) and large-scale whole genome sequencing case-control studies have led to the identification of a wealth of phenotype-associated and rare genetic variants. Interpreting the biological significance of these variants has been a significant challenge, especially since a large majority of their genomic locations fall within non-protein coding genomic regions. Here we present a computational method, ConsHMM, for annotating the genome at single-nucleotide resolution into a set of conservation states learned from the combinatorial and spatial patterns of species aligning and matching a reference genome in a multiple-sequence alignment. Conservation states have specific enrichments for orthogonal biological annotations and can be used for interpreting genetic variants. We provide here a comprehensive resource of conservation state annotations, the ConsHMM atlas, comprised of models and annotations for eight different organisms based on several multiple-sequence alignments. At the epigenomic level, modifications such as DNA methylation have emerged as useful biomarkers for several phenotypes, but a large majority of these phenotypes have been studied predominantly in human samples. Leveraging sequence conservation among genomes, we have designed a methylation array that can query DNA methylation of many different mammals, and therefore facilitate cross species epigenetic studies. The array has been produced and used to profile 8730 samples from 145 different mammals. In summary, this work takes a comparative genomics based approach to expanding the available genomic and epigenomic annotations of multiple species.
Engineering a Mastoparan Peptide Concatemer Prodrug From CircRNA for Cancer Therapy
CircRNAs are covalently closed loops of RNA formed as products of RNA backsplicing in mammalian cells. Engineered circRNAs containing a desired coding sequence have been produced using self-splicing introns. Translatable circRNAs require an internal ribosomal entry site or m6A methylation site for translation initiation. CircRNAs with a nucleotide length a multiple of three, a start codon, and no stop codon in the same frame have an infinite open reading frame. This project aimed to produce a mastoparan peptide concatemer prodrug from circRNA for treatment in cancer therapeutics. Anabaena group I self-splicing introns were used to circularise a mastoparan prodrug containing a metalloproteinase cleavage site for activation (construct named Anabaena Mastoparan). RNA circularisation was achieved in vitro but not in mammalian cells, indicating that group I Anabaena introns do not have the catalytic ability to splice in mammalian cells. Mastoparan peptides were detected in vitro and in vivo after adding a Flag tag to the Anabaena Mastoparan construct. However, only peptides produced from unspliced RNA translation were detected. Mastoparan peptides extracted from Anabaena Mastoparan transfected cells caused cytotoxicity when added to the culture medium of MDA-MB-231 and MCF-7 cells. Anabaena Mastoparan transfection did not directly lead to cytotoxicity, demonstrating the effectiveness of mastoparan as a prodrug, only being activated by metalloproteinase cleavage in the extracellular environment.
This project aimed to identify endogenous circRNAs that have the coding potential to produce a peptide with a different biological function to their parent gene. Using a bioinformatics approach, circRNAs containing an ORF through the circular junction were identified. Their ORF through junction peptides were investigated for differences in predicted function to their parent gene using InterProScan and Protein Homology/analogY Recognition (Phyre2). Using this approach, four candidate circRNAs were identified that encode a predicted peptide with a different biological function to their parent gene. The four candidate circRNAs contain either a predicted m6A or an internal ribosomal entry site for translation initiation, and have a codon adaptation index score (CAI) between 0.781 and 0.821, comparable to the 75th percentile of ORFs through the circular junction (0.79), and the mean CAI score of coding sequence mRNA. This project demonstrates that the circular junction of circRNAs can provide the coding potential to produce unique peptides with a different function to their parent gene.
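The infinite open reading frame condition stated in the abstract above (length a multiple of three, a start codon, and no in-frame stop codon) can be checked mechanically. The sketch below is an illustration under assumptions, not the project's pipeline: the function name is hypothetical, only AUG is treated as a start codon, and the sequence is read around the circle so that codons spanning the junction are included.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def has_infinite_orf(circ_seq):
    """Return True if a circular RNA sequence has an ORF with no end.

    The sequence is treated as circular: position len-1 is followed by
    position 0. DNA input is accepted and converted to RNA letters.
    """
    seq = circ_seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0 or n % 3 != 0:
        # With a length that is not a multiple of three, each lap around the
        # circle shifts the reading frame, so this simple check does not apply.
        return False
    doubled = seq + seq  # lets us read codons across the junction
    for start in range(n):
        # One full lap in the frame that begins at `start`.
        codons = [doubled[start + i : start + i + 3] for i in range(0, n, 3)]
        if codons[0] == "AUG" and not STOP_CODONS.intersection(codons):
            return True
    return False
```

Because the length is a multiple of three, one lap returns the ribosome to the start codon in the same frame, so a single stop-free lap implies that translation can in principle continue indefinitely and yield a peptide concatemer.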
Application of multivariate statistics and machine learning to phenotypic imaging and chemical high-content data
Image-based high-content screens (HCS) hold tremendous promise for cell-based
phenotypic screens. Challenges related to HCS include not only storage and
management of data, but critical analysis of the complex image-based data. I
implemented a data storage and screen management framework and developed
approaches for data analysis of a number of high-content microscopy screen formats.
I visualized and analysed pilot screens to develop a robust multi-parametric assay
for the identification of genes involved in DNA damage repair in HeLa cells.
Further, I developed and implemented new approaches for image processing and
screen data normalization. My analyses revealed that the ubiquitin ligase RNF8
plays a central role in DNA-damage response and that a related ubiquitin ligase
RNF168 causes the cellular and developmental phenotypes characteristic of the
RIDDLE syndrome. My approaches also uncovered a role for the MMS22L-TONSL
complex in DSB repair and its role in the recombination-dependent repair
of stalled or collapsed replication forks.
The discovery of novel bioactive molecules is a challenge because the fraction of active
candidate molecules is usually small and confounded by noise in experimental
readouts. Cheminformatics can improve robustness of chemical high-throughput
screens and functional genomics data sets by taking structure-activity relationships
into account. I applied statistics, machine learning and cheminformatics
to different data sets to discern novel bioactive compounds. I showed that phenothiazines
and apomorphines are regulators for cell differentiation in murine
embryonic stem cells. Further, I pioneered computational methods for the identification of structural features that influence the degradation and retention of
compounds in the nematode C. elegans. I used chemoinformatics to assemble a
comprehensive screening library of previously approved drugs for redeployment
in new bioassays. A combination of chemical genetic interactions, cheminformatics
and machine learning allowed me to predict novel synergistic antifungal small
molecule combinations from sensitized screens with the drug library. In another
study on the biological effects of commonly prescribed psychoactive compounds,
I discovered a strong link between lipophilicity and bioactivity of compounds in
yeast and unexpected off-target effects that could account for unwanted side effects
in humans. I also investigated structure-activity relationships and assessed
the chemical diversity of a compound collection that was used to probe chemical-genetic
interactions in yeast. Finally, I have made these methods and tools available
to the scientific community, including an open source software package called
MolClass that allows researchers to make predictions about bioactivity of small
molecules based on their chemical structure.