217 research outputs found

    Gene function finding through cross-organism ensemble learning

    Background: Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism that is evolutionarily related and better studied. Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum.
The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed up annotation curation, focusing it on the prioritized new annotations predicted, and to complement the known annotations available.
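
The perturb-and-rebuild idea described in this abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy annotation matrices, the function names (`perturb`, `fit_cooccurrence`, `predict_new`) and, in particular, the simple co-occurrence scorer, which stands in for the supervised learners and ensembles the authors actually used and is not their method.

```python
import random


def perturb(annotations, fraction, rng):
    """Copy a 0/1 gene-by-term matrix, deleting a random fraction of its 1s."""
    ones = [(g, t) for g, row in enumerate(annotations)
            for t, v in enumerate(row) if v]
    dropped = set(rng.sample(ones, int(len(ones) * fraction)))
    return [[0 if (g, t) in dropped else v for t, v in enumerate(row)]
            for g, row in enumerate(annotations)]


def fit_cooccurrence(reduced, original):
    """Estimate P(term t in original | term s in reduced).

    Hypothetical stand-in learner: trained on the reduced set to rebuild
    the original set, as the abstract describes, but far simpler.
    """
    n_terms = len(original[0])
    seen = [0] * n_terms
    both = [[0] * n_terms for _ in range(n_terms)]
    for red_row, orig_row in zip(reduced, original):
        for s, present in enumerate(red_row):
            if present:
                seen[s] += 1
                for t, target in enumerate(orig_row):
                    both[s][t] += target
    return [[both[s][t] / seen[s] if seen[s] else 0.0 for t in range(n_terms)]
            for s in range(n_terms)]


def predict_new(model, annotations, threshold):
    """Return (gene, term, score) for absent annotations scoring >= threshold."""
    ranked = []
    for g, row in enumerate(annotations):
        present = [s for s, v in enumerate(row) if v]
        if not present:
            continue
        for t in range(len(model)):
            if row[t]:
                continue  # already annotated, not a new prediction
            score = sum(model[s][t] for s in present) / len(present)
            if score >= threshold:
                ranked.append((g, t, score))
    return sorted(ranked, key=lambda item: -item[2])


# Toy source organism: terms 0, 1 and 2 always occur together.
source = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
# Toy target organism: gene 0 lacks term 2, so it is a plausible candidate.
target = [[1, 1, 0, 0], [0, 0, 0, 1]]

reduced = perturb(source, 0.2, random.Random(0))
model = fit_cooccurrence(reduced, source)
candidates = predict_new(model, target, threshold=0.5)
print(candidates)
```

Raising `threshold` in this sketch yields fewer, higher-scoring candidates, which mirrors the trade-off between the number of predicted annotations and their precision mentioned in the abstract.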

    Data and Text Mining Techniques for In-Domain and Cross-Domain Applications

    In the big data era, a vast amount of data has been generated in different domains, from social media to news feeds, from health care to genomic functionalities. When addressing a problem, we usually need to harness multiple disparate datasets. Data from different domains may follow different modalities, each of which has a different representation, distribution, scale and density. For example, text is usually represented as discrete sparse word count vectors, whereas an image is represented by pixel intensities, and so on. Many Data Mining and Machine Learning techniques have been proposed in the literature, and they have already achieved significant success in many knowledge engineering areas, including classification, regression and clustering. However, some challenging issues remain when tackling a new problem: how should the problem be represented? Which approach is best among the huge number of possibilities? What information should be used in the Machine Learning task, and how should it be represented? Are there other domains from which knowledge can be borrowed? This dissertation proposes some possible representation approaches for problems in different domains, from text mining to genomic analysis. In particular, one of the major contributions is a different way to represent a classical classification problem: instead of using an instance related to each object (a document, a gene, a social post, etc.) to be classified, it is proposed to use a pair of objects or an object-class pair, using the relationship between them as the label. The application of this approach is tested on both flat and hierarchical text categorization datasets, where it potentially allows the efficient addition of new categories during classification. Furthermore, the same idea is used to extract conversational threads from an unregulated pool of messages and also to classify the biomedical literature based on the genomic features treated.
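
The object-class pair representation described in this abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy documents, the bag-of-words category profiles, the single cosine-similarity feature and the midpoint threshold rule are all hypothetical choices, not the features or classifiers used in the dissertation.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def make_pairs(docs, labels, profiles):
    """One instance per (document, category) pair; label 1 means 'related'."""
    return [(cosine(bag(d), profiles[c]), int(c == y))
            for d, y in zip(docs, labels) for c in profiles]


def fit_threshold(pairs):
    """Midpoint between the mean similarity of related and unrelated pairs."""
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def related(doc, profile, threshold):
    """Decide whether a document and a category profile are related."""
    return cosine(bag(doc), profile) >= threshold


docs = ["gene protein expression", "protein gene sequence",
        "stock market price", "market price index"]
labels = ["biology", "biology", "finance", "finance"]

# Category profiles are summed word counts of the training documents.
profiles = {}
for d, y in zip(docs, labels):
    profiles.setdefault(y, Counter()).update(bag(d))

threshold = fit_threshold(make_pairs(docs, labels, profiles))

# A category unseen during training only needs a profile, not retraining.
sports = bag("football match goal team")
print(related("team goal match", sports, threshold))
print(related("team goal match", profiles["biology"], threshold))
```

Because the model learns the relation between a document and a category rather than the categories themselves, the new `sports` category is handled by the threshold fitted on biology and finance pairs, which is the property the abstract refers to as adding categories during classification.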

    Discovering Domain-Domain Interactions toward Genome-Wide Protein Interaction and Function Predictions

    To fully understand the underlying mechanisms of living cells, it is essential to delineate the intricate interactions between the cell proteins at a genome scale. Insights into protein functions will enrich our understanding of human diseases and contribute to future drug development. My dissertation focuses on the development and optimization of machine learning algorithms to study protein-protein interactions and protein function annotations through the discovery of domain-domain interactions. First, I developed a novel domain-based random decision forest framework (RDFF) that explored all possible domain module pairs in mediating protein interactions. RDFF achieved higher sensitivity (79.78%) and specificity (64.38%) in interaction predictions of S. cerevisiae proteins compared to the popular Maximum Likelihood Estimation (MLE) approach. RDFF can also infer interactions for both single-domain pairs and domain module pairs. Secondly, I proposed the cross-species interacting domain patterns (CSIDOP) approach, which not only increased the fidelity of existing functional annotations, but also proposed novel annotations for unknown proteins. CSIDOP accurately determined functions for 95.42% of proteins in H. sapiens using 2,972 GO 'molecular function' terms. In contrast, most existing methods can only achieve accuracies of 50% to 75% using a much smaller number of categories. Additionally, we were able to assign novel annotations to 181 unknown H. sapiens proteins. Finally, I implemented a web-based system, called PINFUN, which enables users to make online protein-protein interaction and protein function predictions based on a large-scale collection of known and putative domain interactions.

    Associative Pattern Recognition for Biological Regulation Data

    In the last decade, bioinformatics data has been accumulated at an unprecedented rate, thanks to the advancement in sequencing technologies. Such rapid development poses both challenges and promising research topics. In this dissertation, we propose a series of associative pattern recognition algorithms in biological regulation studies. In particular, we emphasize efficiently recognizing associative patterns between genes, transcription factors, histone modifications and functional labels using heterogeneous data sources (numeric, sequences, time series data and textual labels). In protein-DNA associative pattern recognition, we introduce an efficient algorithm for affinity testing that searches for over-represented DNA sequences using a hash function and modulo addition calculation. This substantially improves the efficiency of next generation sequencing data analysis. In gene regulatory network inference, we propose a framework for refining weak networks based on transcription factor binding sites, thus improving the precision of predicted edges by up to 52%. In histone modification code analysis, we propose an approach to genome-wide combinatorial pattern recognition for histone code to function associative pattern recognition, and achieve an improvement of up to 38.1%. We also propose a novel shape-based modification pattern analysis approach, and use it to successfully predict sub-classes of genes in the flowering-time category. We also propose a combination-to-combination associative pattern recognition approach, which achieves better performance than multi-label classification and bidirectional associative memory methods. Our proposed approaches recognize associative patterns from different types of data efficiently, and provide a useful toolbox for biological regulation analysis. This dissertation presents a road map to associative pattern recognition at the genome-wide level.
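
The abstract above does not spell out its hash-based search for over-represented DNA sequences, so the following Python sketch only shows one generic way such a search could work: a rolling base-4 hash over k-mers, updated with modular arithmetic, followed by a comparison against background frequencies. The function names, the add-one smoothing and the `min_ratio` cut-off are assumptions for illustration and are not taken from the dissertation.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_counts(seq, k):
    """Count k-mers by their rolling base-4 hash."""
    mod = 4 ** k
    counts = Counter()
    h = 0
    for i, base in enumerate(seq):
        # Shift in the new base; the modulo drops the base leaving the window.
        h = (h * 4 + CODE[base]) % mod
        if i >= k - 1:
            counts[h] += 1
    return counts


def decode(h, k):
    """Turn a hash value back into its k-mer string."""
    out = []
    for _ in range(k):
        out.append("ACGT"[h % 4])
        h //= 4
    return "".join(reversed(out))


def over_represented(foreground, background, k, min_ratio):
    """Return k-mers whose foreground frequency exceeds the smoothed
    background frequency by at least min_ratio (hypothetical criterion)."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    hits = {}
    for h, count in fg.items():
        fg_freq = count / fg_total
        bg_freq = (bg.get(h, 0) + 1) / (bg_total + 4 ** k)  # add-one smoothing
        if fg_freq / bg_freq >= min_ratio:
            hits[decode(h, k)] = count
    return hits


counts = kmer_counts("ACGTACGT", 4)
hits = over_represented("ACGTACGTACGT", "AAAAAAAAAAAA", k=4, min_ratio=70)
print(hits)
```

Each k-mer is hashed in constant time from the previous one, so a sequence of length n is scanned in O(n) regardless of k, which is the kind of efficiency gain a hash-based search aims for.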

    Application of knowledge discovery and data mining methods in livestock genomics for hypothesis generation and identification of biomarker candidates influencing meat quality traits in pigs

    Recent advancements in genomics and genome profiling technologies have led to an increase in the amount of data available in livestock genomics. Yet, most of the studies done in livestock genomics have followed a reductionist approach, and very few studies have either followed data mining or knowledge discovery concepts or made use of the wealth of information available in the public domain to gain new knowledge. The goals of this thesis were: (i) the adoption of existing analysis strategies or the development of novel approaches in livestock genomics for integrative data analysis following the principles of data mining and knowledge discovery and (ii) demonstrating the application of such approaches in livestock genomics for hypothesis generation and biomarker discovery. A pig meat quality trait termed androstenone measurement in backfat was selected as the target phenotype for the experiments. Two experiments were performed as a part of this thesis. The first one followed a knowledge-driven approach, merging high-throughput expression data with a metabolic interaction network. Based on the results from this experiment, several novel biomarker candidates and a hypothesis regarding different mechanisms regulating androstenone synthesis in porcine testis samples with divergent androstenone measurements in backfat were proposed. The model proposed that the elevated levels of androstenone synthesis in the sample population could be due to the combined effect of cAMP/PKA signaling, elevated levels of fatty acid metabolism and anti-lipid-peroxidation activity of members of the glutathione metabolic pathway. The second experiment followed a data-driven approach and integrated gene expression data from multiple porcine populations to identify similarities in gene expression patterns related to hepatic androstenone metabolism.
The results indicated that one of the low androstenone phenotype-specific co-expression clusters was functionally enriched in pathways related to androgen and androstenone metabolism and that the members of this cluster exhibited weak co-expression in the high androstenone phenotype. Based on the results from this experiment, this co-expression cluster was proposed as a signature cluster for hepatic androstenone metabolism in boars with low androstenone content in backfat. The results from these experiments indicate that integrative analysis approaches following data mining and knowledge discovery concepts can be used for the generation of new knowledge from existing data in livestock genomics. However, limited data availability in livestock genomics is a hindrance to the extensive use of such analysis methods in the field for gaining new knowledge. In conclusion, this study was aimed at demonstrating the capabilities of data mining and knowledge discovery methods and integrative analysis approaches to generate new knowledge in livestock genomics using existing datasets. The results from the experiments hint at the possibility of further exploring such methods for knowledge generation in this field. Although the application of such methods in livestock genomics is currently limited by data availability issues, the increase in data availability due to evolving high-throughput technologies and the decrease in data generation costs should aid the widespread use of such methods in livestock genomics in the future.

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Systems Analytics and Integration of Big Omics Data

    A "genotype" is essentially an organism's full hereditary information which is obtained from its parents. A "phenotype" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This "Big Data" is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Graph - Based Methods for Protein Function Prediction

    Ph.D., Doctor of Philosophy