6 research outputs found

    A hierarchical multi-label classification ant colony algorithm for protein function prediction

    This paper proposes a novel ant colony optimisation (ACO) algorithm tailored for the hierarchical multi-label classification problem of protein function prediction. This problem is a very active research field, given the large increase in the number of uncharacterised proteins available for analysis and the importance of determining their functions in order to improve the current biological knowledge. Since it is known that a protein can perform more than one function and many protein functional-definition schemes are organised in a hierarchical structure, the classification problem in this case is an instance of a hierarchical multi-label problem. In this type of problem, each example may belong to multiple class labels and class labels are organised in a hierarchical structure, either a tree or a directed acyclic graph. This is a more complex problem than conventional flat classification, given that the classification algorithm has to take into account hierarchical relationships between class labels and be able to predict multiple class labels for the same example. The proposed ACO algorithm discovers an ordered list of hierarchical multi-label classification rules. It is evaluated on sixteen challenging bioinformatics data sets involving hundreds or thousands of class labels to be predicted and compared against state-of-the-art decision tree induction algorithms for hierarchical multi-label classification.
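As a rough illustration of what "hierarchical multi-label" means in the abstract above, the sketch below closes a set of labels under the ancestor relation of a small directed acyclic graph. The hierarchy, the label names, and the helper functions are hypothetical and chosen only for illustration; this is not the ACO algorithm proposed in the paper.

```python
# Hypothetical toy class hierarchy (a DAG): each label maps to its parent labels.
# The labels are illustrative and not taken from any real functional scheme.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "ion transport": ["transport"],
    "energy-coupled ion transport": ["ion transport", "metabolism"],
}


def with_ancestors(labels, parents=PARENTS):
    """Return the label set closed under the ancestor relation."""
    closed = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label not in closed:
            closed.add(label)
            # A DAG node may have several parents, so follow all of them.
            stack.extend(parents[label])
    return closed


def is_consistent(labels, parents=PARENTS):
    """A label set is hierarchically consistent if it contains all its ancestors."""
    return set(labels) == with_ancestors(labels, parents)
```

In this toy setting, a prediction of "energy-coupled ion transport" implies every label above it in the graph, which is the kind of constraint a flat classifier does not have to respect.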

    Investigating the use of multi-label classification methods for the purpose of classifying electromyographic signals

    The type of pattern recognition methods used for controlling modern prosthetics, referred to here as single-label classification methods, restricts users to a small number of movements. One prominent reason for this is that the accuracy of these classification methods decreases as the number of allowed movements is increased. In this work a possible solution to this problem is presented by looking into the use of multi-label classification for classifying electromyographic signals. This was accomplished by going through the process of recording, processing, and classifying electromyographic data. In order to compare the performance of multi-label methods to that of single-label methods, four classification methods from each category were selected. Both categories were then tested on their ability to classify finger flexion movements. The most commonly tested set of movements consisted of the thumb, index, long, and ring finger movements together with all possible combinations of these four fingers. The two categories were also tested on their ability to learn finger combination movements when only individual finger movements were used as training data. The results show that the tested single- and multi-label methods obtain similar classification accuracy when the training data consists of both individual finger movements and finger combination movements. The results also show that none of the tested single-label methods and only one of the tested multi-label methods, multi-label RBF neural networks, manage to learn finger combination movements when trained on only individual finger movements. Using multi-label classification methods to classify finger movements for hand prosthesis control: Losing a limb is a traumatic experience that greatly impacts a person’s quality of life. To help people who have suffered limb loss, prosthetic devices were invented. The purpose of a prosthetic device is to mimic the function of the missing limb.
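The difference between the two categories of methods can be sketched by how they encode the four-finger movement set described above. This is a minimal sketch under stated assumptions: it assumes a single-label method treats every finger combination as its own class, while a multi-label method predicts one binary label per finger. The function names are hypothetical and the thesis may encode its targets differently.

```python
from itertools import combinations

FINGERS = ["thumb", "index", "long", "ring"]


def all_movements(fingers=FINGERS):
    """Enumerate individual finger movements and every combination of fingers."""
    moves = []
    for size in range(1, len(fingers) + 1):
        moves.extend(combinations(fingers, size))
    return moves


def single_label_classes(fingers=FINGERS):
    """Assumed single-label encoding: one class index per movement."""
    return {move: idx for idx, move in enumerate(all_movements(fingers))}


def multi_label_vector(move, fingers=FINGERS):
    """Assumed multi-label encoding: one binary label per finger."""
    return [1 if finger in move else 0 for finger in fingers]
```

Under these assumptions the single-label encoding needs one class per movement, and a combination never seen in training has no class of its own, whereas the multi-label encoding can in principle represent an unseen combination by switching on several finger labels at once.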

    Predicting controlled vocabulary based on text and citations: Case studies in medical subject headings in MEDLINE and patents

    This dissertation makes three contributions in the area of controlled vocabulary prediction of Medical Subject Headings. The first contribution is a new partial matching measure based on distributional semantics. The second contribution is a probabilistic model based on text similarity and citations. The third contribution is a case study of cross-domain vocabulary prediction in US Patents. Medical subject headings (MeSH) are an important life sciences controlled vocabulary. They are an ideal ground to study controlled vocabulary prediction due to their complexity, hierarchical nature, and practical significance. The dissertation begins with an updated analysis of human indexing consistency in MEDLINE. This study demonstrates the need for partial matching measures to account for indexing variability. Here, I develop four measures combining the MeSH hierarchy and contextual similarity. These measures provide several new tools for evaluating and diagnosing controlled vocabulary models. Next, a generalized predictive model is introduced. This model uses citations and abstract similarity as inputs to a hybrid KNN classifier. Citations and abstracts are found to be complementary in that they reliably produce unique and relevant candidate terms. Finally, the predictive model is applied to a corpus of approximately 65,000 biomedical US patents. This case study explores differences in the vocabulary of MEDLINE and patents, as well as the prospect for MeSH prediction to open new scholarly opportunities in economics and health policy research.

    A data mining-based approach for investigating the relationship between DNA repair genes and ageing

    There is a clear motivation for ageing research, since ageing is the greatest risk factor for many diseases, including most types of cancer. Arguably, another strong motivation for ageing research is that, despite the large progress in this area in the last two decades, ageing is still to a large extent a poorly understood process, especially in humans. The vast majority of biogerontology research is still based on “wet lab” experiments done with simpler organisms, due to the problems associated with performing ageing-related experiments with humans. In contrast, this thesis proposes a data mining approach, based on classification algorithms, for analysing data about human DNA repair genes and their relationship to ageing. The classification algorithms – more precisely, decision tree induction and Naive Bayes algorithms – were applied to datasets prepared specifically for this research, by adapting and integrating data from several bioinformatics resources, namely: (a) the GenAge database of ageing-related genes; (b) a web site with a comprehensive list of human DNA repair genes; (c) Uniprot, a centralized repository of richly-annotated data about proteins; (d) the HPRD (Human Protein Reference Database); and (e) the Gene Ontology – a controlled vocabulary for describing gene or protein functions. Some experiments also used a separate dataset including gene expression data. Applying classification algorithms to such datasets aimed at producing classification models that identify which gene properties are most effective in discriminating ageing-related DNA repair genes from other types of genes – mainly non-ageing-related DNA repair genes, but in some experiments the other types of genes also included genes whose protein products interact with DNA repair genes.
A related goal of this research was to analyse the automatically-built classification models from two perspectives, namely: (a) measuring the predictive accuracy (or “generalization ability”) of those models from a data mining perspective; and (b) interpreting the meaning of the main gene properties relevant for classification in those models, in the light of biological knowledge about DNA repair genes and the process of ageing. In summary, the main gene properties that were found effective in discriminating ageing-related DNA repair genes from other types of genes (mainly non-ageing-related DNA repair genes) in the datasets created in this research are as follows: ageing-related DNA repair genes’ protein products tend to interact with a considerably larger number of proteins; their protein products are much more likely to interact with WRN (a protein whose defect causes Werner’s progeroid syndrome) and XRCC5 (KU80, a key protein in the initiation of DNA double-strand repair by the error-prone non-homologous end joining DNA repair pathway); they are more likely to be involved in response to chemical stimulus and, to a lesser extent, in response to endogenous stimulus or oxidative stress; and they are more likely to have high expression in T lymphocytes.

    Enhancing Reaction-based de novo Design using Machine Learning

    De novo design is a branch of chemoinformatics that is concerned with the rational design of molecular structures with desired properties, which specifically aims at achieving suitable pharmacological and safety profiles when applied to drug design. Scoring, construction, and search methods are the main components that are exploited by de novo design programs to explore the chemical space to encourage the cost-effective design of new chemical entities. In particular, construction methods are concerned with providing strategies for compound generation to address issues such as drug-likeness and synthetic accessibility. Reaction-based de novo design consists of combining building blocks according to transformation rules that are extracted from collections of known reactions, intending to restrict the enumerated chemical space into a manageable number of synthetically accessible structures. The reaction vector is an example of a representation that encodes topological changes occurring in reactions, which has been integrated within a structure generation algorithm to increase the chances of generating molecules that are synthesisable. The general aim of this study was to enhance reaction-based de novo design by developing machine learning approaches that exploit publicly available data on reactions. A series of algorithms for reaction standardisation, fingerprinting, and reaction vector database validation were introduced and applied to generate new data on which the entirety of this work relies. First, these collections were applied to the validation of a new ligand-based design tool. The tool was then used in a case study to design compounds which were eventually synthesised using very similar procedures to those suggested by the structure generator. A reaction classification model and a novel hierarchical labelling system were then developed to introduce the possibility of applying transformations by class. 
The model was augmented with an algorithm for confidence estimation, and was used to classify two datasets from industry and the literature. Results from the classification suggest that the model can be used effectively to gain insights into the nature of reaction collections. Classified reactions were further processed to build a reaction class recommendation model capable of suggesting appropriate reaction classes to apply to molecules according to their fingerprints. The model was validated, then integrated within the reaction vector-based design framework, which was assessed on its performance against the baseline algorithm. Results from the de novo design experiments indicate that the use of the recommendation model leads to higher synthetic accessibility and more efficient management of computational resources.