
217 research outputs found

Gene function finding through cross-organism ensemble learning

Author: Masseroli M.
Moro G.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

Background: Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and are sometimes incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism that is evolutionarily related and better studied. Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed up annotation curation, focusing it on the prioritized new annotations predicted, and to complement the known annotations available.
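The perturbation step described in this abstract can be illustrated with a small sketch. It is not the GeFF implementation: the function name `perturb_annotations`, the toy 0/1 gene-by-term matrix, and the 25% deletion fraction are all assumptions made for illustration. It only shows how a reduced annotation set could be produced so that a supervised learner can be trained to rebuild the original one.

```python
import random

def perturb_annotations(matrix, drop_fraction, seed=0):
    """Return a copy of a 0/1 gene-by-term matrix with a fraction of its 1s deleted.

    Hypothetical helper: the abstract only says that a fraction of known
    annotations is randomly deleted; every other detail here is assumed.
    """
    rng = random.Random(seed)
    # Positions (gene index, term index) of all known annotations.
    known = [(g, t) for g, row in enumerate(matrix)
             for t, value in enumerate(row) if value == 1]
    n_drop = int(len(known) * drop_fraction)
    dropped = set(rng.sample(known, n_drop))
    reduced = [[0 if (g, t) in dropped else value for t, value in enumerate(row)]
               for g, row in enumerate(matrix)]
    return reduced, dropped

# Toy annotation matrix: rows are genes, columns are GO terms (assumed data).
annotations = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
reduced, dropped = perturb_annotations(annotations, drop_fraction=0.25)

# Training pairs: input is the reduced row, target is the original row.
training_pairs = list(zip(reduced, annotations))
```

Under these assumptions, a model trained on `training_pairs` learns to restore deleted annotations. Terms it predicts for a gene that are absent from the original matrix would be the candidate new annotations.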

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

Author: Domeniconi Giacomo <1986>
Publication venue: Alma Mater Studiorum - Università di Bologna
Publication date: 12/05/2016

In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities, and so on. Nowadays, plenty of Data Mining and Machine Learning techniques are proposed in the literature, which have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. However, some challenging issues remain when tackling a new problem: how to represent the problem? Which approach is best among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there different domains from which to borrow knowledge? This dissertation proposes some possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using an instance related to each object (a document, or a gene, or a social post, etc.) to be classified, it is proposed to use a pair of objects or an object-class pair, using the relationship between them as the label. The application of this approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and also to classify the biomedical literature based on the genomic features treated.
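The object-class pair representation mentioned in this abstract can be sketched as follows. This is an illustration only, not the dissertation's method: the function name `to_pair_instances`, the toy documents, and the binary "related / unrelated" label are assumptions.

```python
def to_pair_instances(documents, labels, classes):
    """Turn a multi-class dataset into (object, class) pairs with a binary label.

    Hypothetical helper: each pair is labelled 1 if the object belongs to
    the class and 0 otherwise, so a single binary model covers all classes.
    """
    pairs = []
    for doc, label in zip(documents, labels):
        for cls in classes:
            pairs.append(((doc, cls), int(cls == label)))
    return pairs

# Toy data (assumed): three short documents and their categories.
documents = ["gene expression in yeast", "stock market report", "protein folding"]
labels = ["biology", "finance", "biology"]
classes = ["biology", "finance"]

pair_instances = to_pair_instances(documents, labels, classes)

# Adding a new category only requires generating new pairs for it.
extended = to_pair_instances(documents, labels, classes + ["sports"])
```

Because the class is part of the input rather than the output, a new category such as the assumed `"sports"` above adds pairs without changing the label space, which is one plausible reading of the efficient addition of new categories that the abstract mentions.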

AMS Tesi di Dottorato


Computational Toxinology

Author: Romano Joseph Daniel
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2019

Venoms are complex mixtures of biological macromolecules and other compounds that are used for predatory and defensive purposes by hundreds of thousands of known species worldwide. Throughout human history, venoms and venom components have been used to treat a vast array of illnesses, causing them to be of great clinical, economic, and academic interest to the drug discovery and toxinology communities. In spite of major computational advances that facilitate data-driven drug discovery, most therapeutic venom effects are still discovered via tedious trial-and-error, or simply by accident. In this dissertation, I describe a body of work that aims to establish a new subdiscipline of translational bioinformatics, which I name “computational toxinology”. To accomplish this goal, I present three integrated components that span a wide range of informatics techniques: (1) VenomKB, (2) VenomSeq, and (3) VenomKB’s Semantic API. To provide a platform for structuring, representing, retrieving, and integrating venom data relevant to drug discovery, VenomKB provides a database-backed web application and knowledge base for computational toxinology. VenomKB is structured according to a fully-featured ontology of venoms, and provides data aggregated from many popular web resources. VenomSeq is a biotechnology workflow that is designed to generate new high-throughput sequencing data for incorporation into VenomKB. Specifically, we expose human cells to controlled doses of crude venoms, conduct RNA-Sequencing, and build profiles of differential gene expression, which we then compare to publicly-available differential expression data for known diseases and drugs with known effects, and use those comparisons to hypothesize ways that the venoms could act in a therapeutic manner, as well. These data are then integrated into VenomKB, where they can be effectively retrieved and evaluated using existing data and known therapeutic associations. VenomKB’s Semantic API further develops this functionality by providing an intelligent, powerful, and user-friendly interface for querying the complex underlying data in VenomKB in a way that reflects the intuitive, human-understandable meaning of those data. The Semantic API is designed to cater to the needs of advanced users as well as laypersons and bench scientists without previous expertise in computational biology and semantic data analysis. In each chapter of the dissertation, I describe how we evaluated these three components through various approaches. We demonstrate the utility of VenomKB and the Semantic API by testing a number of practical use-cases for each, designed to highlight their ability to rediscover existing knowledge as well as to suggest potential areas for future exploration. We use statistics and data science techniques to evaluate VenomSeq on 25 diverse species of venomous animals, and propose biologically feasible explanations for significant findings. In evaluating the Semantic API, I show how observations on VenomSeq data can be interpreted and placed into the context of past research by members of the larger toxinology community. Computational toxinology is a toolbox designed to be used by multiple stakeholders (toxinologists, computational biologists, and systems pharmacologists, among others) to improve the return rate of clinically-significant findings from manual experimentation. It aims to achieve this goal by enabling access to data, providing means for easy validation of results, and suggesting specific hypotheses that are preliminarily supported by rigorous inferential statistics. All components of the research I describe are open-access and publicly available, to improve reproducibility and encourage widespread adoption.

Columbia University Academic Commons

Discovering Domain-Domain Interactions toward Genome-Wide Protein Interaction and Function Predictions

Author: Liu Mei
Publication venue: 'Paleontological Institute at The University of Kansas'
Publication date: 01/01/2009

To fully understand the underlying mechanisms of living cells, it is essential to delineate the intricate interactions between the cell proteins at a genome scale. Insights into protein functions will enrich our understanding of human diseases and contribute to future drug development. My dissertation focuses on the development and optimization of machine learning algorithms to study protein-protein interactions and protein function annotations through the discovery of domain-domain interactions. First, I developed a novel domain-based random decision forest framework (RDFF) that explored all possible domain module pairs in mediating protein interactions. RDFF achieved higher sensitivity (79.78%) and specificity (64.38%) in interaction predictions of S. cerevisiae proteins compared to the popular Maximum Likelihood Estimation (MLE) approach. RDFF can also infer interactions for both single-domain pairs and domain module pairs. Secondly, I proposed the cross-species interacting domain patterns (CSIDOP) approach, which not only increased the fidelity of existing functional annotations, but also proposed novel annotations for unknown proteins. CSIDOP accurately determined functions for 95.42% of proteins in H. sapiens using 2,972 GO 'molecular function' terms. In contrast, most existing methods can only achieve accuracies of 50% to 75% using a much smaller number of categories. Additionally, we were able to assign novel annotations to 181 unknown H. sapiens proteins. Finally, I implemented a web-based system, called PINFUN, which enables users to make online protein-protein interaction and protein function predictions based on a large-scale collection of known and putative domain interactions.

KU ScholarWorks

Associative Pattern Recognition for Biological Regulation Data

Author: Xiao Yiou
Publication venue: SURFACE at Syracuse University
Publication date: 22/12/2017

In the last decade, bioinformatics data has been accumulated at an unprecedented rate, thanks to the advancement in sequencing technologies. Such rapid development poses both challenges and promising research topics. In this dissertation, we propose a series of associative pattern recognition algorithms in biological regulation studies. In particular, we emphasize efficiently recognizing associative patterns between genes, transcription factors, histone modifications and functional labels using heterogeneous data sources (numeric, sequences, time series data and textual labels). In protein-DNA associative pattern recognition, we introduce an efficient algorithm for affinity testing by searching for over-represented DNA sequences using a hash function and modulo addition calculation. This substantially improves the efficiency of next generation sequencing data analysis. In gene regulatory network inference, we propose a framework for refining weak networks based on transcription factor binding sites, thereby improving the precision of predicted edges by up to 52%. In histone modification code analysis, we propose an approach to genome-wide combinatorial pattern recognition for histone code to function associative pattern recognition, and achieved an improvement of up to 38.1%. We also propose a novel shape-based modification pattern analysis approach, using it to successfully predict sub-classes of genes in the flowering-time category. We also propose a combination-to-combination associative pattern recognition, and achieved better performance compared with multi-label classification and bidirectional associative memory methods. Our proposed approaches recognize associative patterns from different types of data efficiently, and provide a useful toolbox for biological regulation analysis. This dissertation presents a road-map to associative pattern recognition at the genome-wide level.
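The abstract mentions searching for over-represented DNA sequences with a hash function and modulo arithmetic but gives no details. The sketch below shows one common way to do this kind of counting, a rolling base-4 hash over k-mers. It is a swapped-in standard technique, not the dissertation's algorithm, and the names `count_kmers`, `decode`, and `BASE_CODE` are assumptions.

```python
from collections import Counter

# Assumed 2-bit encoding of the four DNA bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def count_kmers(sequence, k):
    """Count k-mers using a rolling base-4 hash updated in O(1) per window."""
    counts = Counter()
    if len(sequence) < k:
        return counts
    modulus = 4 ** k  # keeps exactly k base-4 digits, so distinct k-mers never collide
    h = 0
    for base in sequence[:k]:
        h = h * 4 + BASE_CODE[base]
    counts[h] += 1
    for base in sequence[k:]:
        # Shift in the new base; the modulo drops the oldest base.
        h = (h * 4 + BASE_CODE[base]) % modulus
        counts[h] += 1
    return counts

def decode(h, k):
    """Turn a hash value back into its k-mer string."""
    letters = "ACGT"
    out = []
    for _ in range(k):
        out.append(letters[h % 4])
        h //= 4
    return "".join(reversed(out))

counts = count_kmers("ACGTACGTAC", 3)
by_kmer = {decode(h, 3): c for h, c in counts.items()}
```

In a real analysis, the counts would be compared against a background model to decide which k-mers are over-represented. That statistical step is not shown here.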

Syracuse University Research Facility and Collaborative Environment

Application of knowledge discovery and data mining methods in livestock genomics for hypothesis generation and identification of biomarker candidates influencing meat quality traits in pigs

Author: Sahadevan Sudeep
Publication venue: Universitäts- und Landesbibliothek Bonn
Publication date

Recent advancements in genomics and genome profiling technologies have led to an increase in the amount of data available in livestock genomics. Yet, most of the studies done in livestock genomics have followed a reductionist approach, and very few studies have either followed data mining or knowledge discovery concepts or made use of the wealth of information available in the public domain to gain new knowledge. The goals of this thesis were: (i) the adoption of existing analysis strategies or the development of novel approaches in livestock genomics for integrative data analysis following the principles of data mining and knowledge discovery and (ii) demonstrating the application of such approaches in livestock genomics for hypothesis generation and biomarker discovery. A pig meat quality trait termed androstenone measurement in backfat was selected as the target phenotype for the experiments. Two experiments were performed as a part of this thesis. The first one followed a knowledge driven approach merging high-throughput expression data with a metabolic interaction network. Based on the results from this experiment, several novel biomarker candidates and a hypothesis regarding different mechanisms regulating androstenone synthesis in porcine testis samples with divergent androstenone measurements in back fat were proposed. The model proposed that the elevated levels of androstenone synthesis in the sample population could be due to the combined effect of cAMP/PKA signaling, elevated levels of fatty acid metabolism and anti lipid peroxidation activity of members of the glutathione metabolic pathway. The second experiment followed a data driven approach and integrated gene expression data from multiple porcine populations to identify similarities in gene expression patterns related to hepatic androstenone metabolism. The results indicated that one of the low androstenone phenotype specific co-expression clusters was functionally enriched in pathways related to androgen and androstenone metabolism and that the members of this cluster exhibited weak co-expression in the high androstenone phenotype. Based on the results from this experiment, this co-expression cluster was proposed as a signature cluster for hepatic androstenone metabolism in boars with low androstenone content in back fat. The results from these experiments indicate that integrative analysis approaches following data mining and knowledge discovery concepts can be used for the generation of new knowledge from existing data in livestock genomics. However, limited data availability in livestock genomics is a hindrance to the extensive use of such analysis methods in the livestock genomics field for gaining new knowledge. In conclusion, this study was aimed at demonstrating the capabilities of data mining and knowledge discovery methods and integrative analysis approaches to generate new knowledge in livestock genomics using existing datasets. The results from the experiments hint at the possibility of further exploring such methods for knowledge generation in this field. Although the application of such methods is limited in livestock genomics due to data availability issues at present, the increase in data availability due to evolving high throughput technologies and the decrease in data generation costs would aid in the widespread use of such methods in livestock genomics in the future.

bonndoc – Der Publikationsserver der Universität Bonn

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

Systems Analytics and Integration of Big Omics Data

Author: Hardiman Gary
Publication venue: 'MDPI AG'
Publication date: 01/01/2020

A “genotype” is essentially an organism's full hereditary information which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

Directory of Open Access Books (DOAB)

Graph - Based Methods for Protein Function Prediction

Author: CHUA HON NIAN
Publication venue
Publication date: 03/04/2008

Ph.D, DOCTOR OF PHILOSOPHY

ScholarBank@NUS