864 research outputs found
New Probabilistic Graphical Models and Meta-Learning Approaches for Hierarchical Classification, with Applications in Bioinformatics and Ageing
This interdisciplinary work proposes new hierarchical classification algorithms and evaluates them on biological datasets, and specifically on ageing-related datasets. Hierarchical classification is a type of classification task where the classes to be predicted are organized into a hierarchical structure. The focus on ageing is justified by the increasing impact that ageing-related diseases have on the human population and by the increasing amount of freely available ageing-related data.
The main contributions of this thesis are as follows. First, we improve the running time of a previously proposed hierarchical classification algorithm based on an extension of the well-known Naive Bayes classification algorithm. We show that our modification greatly improves the runtime of the hierarchical classification algorithm, maintaining its predictive performance.
We also propose four new hierarchical classification algorithms. The focus on hierarchical classification algorithms and their evaluation on biological data is justified as the class labels of biological data are commonly organized into class hierarchies. Two of our four new hierarchical classification algorithms - the "Hierarchical Dependence Network" (HDN) and the "Hierarchical Dependence Network algorithm based on finding non-Hierarchically related Predictive Classes'' (HDN-nHPC) - are based on Dependence Networks, a relatively new type of probabilistic graphical model that has not yet received a lot of attention from the classification community. The other two hierarchical classification algorithms we proposed are hybrid algorithms that use the hierarchical classification models produced by the Predictive Clustering Tree (PCT) algorithm. One of the hybrids combines the models produced by the PCT algorithm and a Local Hierarchical Classification (LHC) algorithm (which basically induces a local model for each class in the hierarchy). The other hybrid combines the models produced by the PCT and HDN algorithms.
We have tested our four proposed algorithms and four other commonly used hierarchical classification algorithms on 42 hierarchical classification datasets. 20 of these datasets were created by us and are freely available for researchers. We have concluded that, for one out of the three hierarchical predictive accuracy measures used in our experiments, one of our four new algorithms (the HDN-nHPC algorithm) outperforms all other seven algorithms in terms of average rank across the 42 hierarchical classification datasets.
We have also proposed the first meta-learning approach for hierarchical classification problems. In meta-learning, each meta-instance represents a dataset, meta-features represent dataset properties, and meta-classes represent the best classification algorithm for the corresponding dataset (meta-instance). Hence, meta-learning techniques for classification use the predictive performance of some candidate classification algorithms in previously tested datasets, and dataset descriptors (the meta-features), to infer the performance of those candidate classification algorithms in new datasets, given the meta-features of those new datasets.
The predictions of our meta-learning system can be used as a guide to choose which hierarchical classification algorithm (out of a set of candidate ones) to use on a new dataset, without the need for time-consuming trial and error experiments with those candidate algorithms. This is particularly important for hierarchical classification problems, as the training time of hierarchical classification algorithms tends to be much greater than the training time of 'flat' classification algorithms. This increased training time is mainly due to the typically much greater number of class labels that annotate the instances of hierarchical classification problems.
We have tested the predictive power of our meta-learning system and interpreted some generated meta-models. We have concluded that our meta-learning system had good predictive performance when compared to other baseline meta-learning approaches. We have also concluded that the meta-rules generated by our meta-learning system were useful to identify dataset characteristics to assist the choice of hierarchical classification algorithm.
Finally, we have reviewed the current practice of applying supervised machine learning (classification and regression) algorithms to study the biology of ageing. This review discusses the main findings of such algorithms, in the context of the ageing biology literature. We have also interpreted some of the hierarchical classification models generated in our experiments. Both the above literature review and the interpretation of some models were performed in collaboration with an ageing expert, in order to extract relevant information for ageing research
Archives of Data Science, Series A. Vol. 1,1: Special Issue: Selected Papers of the 3rd German-Polish Symposium on Data Analysis and Applications
The first volume of Archives of Data Science, Series A is a special issue of a selection of contributions which have been originally presented at the {\em 3rd Bilateral German-Polish Symposium on Data Analysis and Its Applications} (GPSDAA 2013). All selected papers fit into the emerging field of data science consisting of the mathematical sciences (computer science, mathematics, operations research, and statistics) and an application domain (e.g. marketing, biology, economics, engineering)
Recommended from our members
Building trajectories through clinical data to model disease progression
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Clinical trials are typically conducted over a population within a defined time period
in order to illuminate certain characteristics of a health issue or disease process. These cross-sectional studies provide a snapshot of these disease processes over a large number of people but do not allow us to model the temporal nature of disease, which is essential for modeling detailed prognostic predictions. Longitudinal studies, on the other hand, are used to explore how these processes develop over time in a number of people but can be expensive and time-consuming, and many studies only cover a relatively small window within the disease process. This thesis describes the application of intelligent data analysis techniques for extracting information from time series generated by different diseases. The aim of this thesis is to identify intermediate stages
in a disease process and sub-categories of the disease exhibiting subtly different symptoms. It explores the use of a bootstrap technique that fits trajectories through the data generating āpseudo time-seriesā. It addresses issues including: how clinical variables interact as a disease progresses along the trajectories in the data; and how to automatically identify different disease states along these trajectories, as well as the transitions between them. The thesis documents how reliable time-series models can be created from large amounts of historical cross-sectional data and a novel relabling/latent variable approach has enabled the exploration of the temporal nature of disease progression. The proposed algorithms are tested extensively on simulated data and on three real clinical datasets. Finally, a study is carried out to explore whether we can ācalibrateā pseudo time-series models with real longitudinal data in order to improve them. Plausible directions for future research are discussed at the end of the thesis
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Machine learning in the social and health sciences
The uptake of machine learning (ML) approaches in the social and health
sciences has been rather slow, and research using ML for social and health
research questions remains fragmented. This may be due to the separate
development of research in the computational/data versus social and health
sciences as well as a lack of accessible overviews and adequate training in ML
techniques for non data science researchers. This paper provides a meta-mapping
of research questions in the social and health sciences to appropriate ML
approaches, by incorporating the necessary requirements to statistical analysis
in these disciplines. We map the established classification into description,
prediction, and causal inference to common research goals, such as estimating
prevalence of adverse health or social outcomes, predicting the risk of an
event, and identifying risk factors or causes of adverse outcomes. This
meta-mapping aims at overcoming disciplinary barriers and starting a fluid
dialogue between researchers from the social and health sciences and
methodologically trained researchers. Such mapping may also help to fully
exploit the benefits of ML while considering domain-specific aspects relevant
to the social and health sciences, and hopefully contribute to the acceleration
of the uptake of ML applications to advance both basic and applied social and
health sciences research
Approaches to Integrating Metabolomics and Multi-Omics Data: A Primer
Metabolomics deals with multiple and complex chemical reactions within living organisms and how these are influenced by external or internal perturbations. It lies at the heart of omics profiling technologies not only as the underlying biochemical layer that reflects information expressed by the genome, the transcriptome and the proteome, but also as the closest layer to the phenome. The combination of metabolomics data with the information available from genomics, transcriptomics, and proteomics offers unprecedented possibilities to enhance current understanding of biological functions, elucidate their underlying mechanisms and uncover hidden associations between omics variables. As a result, a vast array of computational tools have been developed to assist with integrative analysis of metabolomics data with different omics. Here, we review and propose five criteriaāhypothesis, data types, strategies, study design and study focusā to classify statistical multi-omics data integration approaches into state-of-the-art classes under which all existing statistical methods fall. The purpose of this review is to look at various aspects that lead the choice of the statistical integrative analysis pipeline in terms of the different classes. We will draw particular attention to metabolomics and genomics data to assist those new to this field in the choice of the integrative analysis pipeline
The era of big data: Genome-scale modelling meets machine learning
With omics data being generated at an unprecedented rate, genome-scale modelling has become pivotal in its organisation and analysis. However, machine learning methods have been gaining ground in cases where knowledge is insufficient to represent the mechanisms underlying such data or as a means for data curation prior to attempting mechanistic modelling. We discuss the latest advances in genome-scale modelling and the development of optimisation algorithms for network and error reduction, intracellular constraining and applications to strain design. We further review applications of supervised and unsupervised machine learning methods to omics datasets from microbial and mammalian cell systems and present efforts to harness the potential of both modelling approaches through hybrid modelling
Semantically aware hierarchical Bayesian network model for knowledge discovery in data : an ontology-based framework
Several mining algorithms have been invented over the course of recent decades. However, many of the invented algorithms are confined to generating frequent patterns and do not illustrate how to act upon them. Hence, many researchers have argued that existing mining algorithms have some limitations with respect to performance and workability. Quantity and quality are the main limitations of the existing mining algorithms. While quantity states that the generated patterns are abundant, quality indicates that they cannot be integrated into the business domain seamlessly. Consequently, recent research has suggested that the limitations of the existing mining algorithms are the result of treating the mining process as an isolated and autonomous data-driven trial-and-error process and ignoring the domain knowledge. Accordingly, the integration of domain knowledge into the mining process has become the goal of recent data mining algorithms. Domain knowledge can be represented using various techniques. However, recent research has stated that ontology is the natural way to represent knowledge for data mining use. The structural nature of ontology makes it a very strong candidate for integrating domain knowledge with data mining algorithms. It has been claimed that ontology can play the following roles in the data mining process: ā¢Bridging the semantic gap. ā¢Providing prior knowledge and constraints. ā¢Formally representing the DM results. Despite the fact that a variety of research has used ontology to enrich different tasks in the data mining process, recent research has revealed that the process of developing a framework that systematically consolidates ontology and the mining algorithms in an intelligent mining environment has not been realised. Hence, this thesis proposes an automatic, systematic and flexible framework that integrates the Hierarchical Bayesian Network (HBN) and domain ontology. The ultimate aim of this thesis is to propose a data mining framework that implicitly caters for the underpinning domain knowledge and eventually leads to a more intelligent and accurate mining process. To a certain extent the proposed mining model will simulate the cognitive system in the human being. The similarity between ontology, the Bayesian Network (BN) and bioinformatics applications establishes a strong connection between these research disciplines. This similarity can be summarised in the following points: ā¢Both ontology and BN have a graphical-based structure. ā¢Biomedical applications are known for their uncertainty. Likewise, BN is a powerful tool for reasoning under uncertainty. ā¢The medical data involved in biomedical applications is comprehensive and ontology is the right model for representing comprehensive data. Hence, the proposed ontology-based Semantically Aware Hierarchical Bayesian Network (SAHBN) is applied to eight biomedical data sets in the field of predicting the effect of the DNA repair gene in the human ageing process and the identification of hub protein. Consequently, the performance of SAHBN was compared with existing Bayesian-based classification algorithms. Overall, SAHBN demonstrated a very competitive performance. The contribution of this thesis can be summarised in the following points. ā¢Proposed an automatic, systematic and flexible framework to integrate ontology and the HBN. Based on the literature review, and to the best of our knowledge, no such framework has been proposed previously. ā¢The complexity of learning HBN structure from observed data is significant. Hence, the proposed SAHBN model utilized the domain knowledge in the form of ontology to overcome this challenge. ā¢The proposed SAHBN model preserves the advantages of both ontology and Bayesian theory. It integrates the concept of Bayesian uncertainty with the deterministic nature of ontology without extending ontology structure and adding probability-specific properties that violate the ontology standard structure. ā¢The proposed SAHBN utilized the domain knowledge in the form of ontology to define the semantic relationships between the attributes involved in the mining process, guides the HBN structure construction procedure, checks the consistency of the training data set and facilitates the calculation of the associated conditional probability tables (CPTs). ā¢The proposed SAHBN model lay out a solid foundation to integrate other semantic relations such as equivalent, disjoint, intersection and union
- ā¦