41 research outputs found

    Machine learning and soft computing approaches to microarray differential expression analysis and feature selection.

    Differential expression analysis and feature selection are central to gene expression microarray data analysis. Standard approaches are flawed by the arbitrary assignment of cut-off parameters and the inability to adapt to the particular data set under analysis. Presented in this thesis are three novel approaches to microarray data feature selection and differential expression analysis based on various machine learning and soft computing paradigms. The first approach uses a Separability Index to select ranked genes, making gene selection less arbitrary and more data intrinsic. The second approach is a novel gene ranking system, the Fuzzy Gene Filter, which provides a more holistic and adaptive approach to ranking genes. The third approach is based on a Stochastic Search paradigm and uses the Population Based Incremental Learning algorithm to identify an optimal gene set with maximum inter-class distinction. All three approaches were implemented and tested on a number of data sets and the results compared to those of standard approaches. The Separability Index approach attained a K-Nearest Neighbour classification accuracy of 92%, outperforming the standard approach, which attained an accuracy of 89.6%. The gene list identified also displayed significant functional enrichment. The Fuzzy Gene Filter also outperformed standard approaches, attaining significantly higher accuracies for all of the classifiers tested, on both data sets (p < 0.0231 for the prostate data set and p < 0.1888 for the lymphoma data set). Population Based Incremental Learning outperformed a Genetic Algorithm, identifying a maximum Separability Index of 97.04% (as opposed to 96.39%). Future developments include incorporating biological knowledge when ranking genes using the Fuzzy Gene Filter as well as incorporating a functional enrichment assessment in the fitness function of the Population Based Incremental Learning algorithm.
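
The third approach in the abstract above relies on Population Based Incremental Learning (PBIL). As a rough illustration only, and not the thesis implementation, the following Python sketch shows the generic PBIL loop: a probability vector over binary feature masks is sampled, and then nudged toward the best mask of each generation. The function names, parameter values and the `toy_fitness` function are all assumptions made for this example; the thesis scores gene sets by Separability Index, which is not reproduced here.

```python
import random


def pbil(fitness, n_bits, pop_size=50, generations=100, learning_rate=0.1, seed=0):
    """Generic PBIL loop over binary masks (illustrative sketch, assumed parameters)."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # start with no preference for any bit
    best, best_score = None, float("-inf")
    for _ in range(generations):
        # Sample a population of masks from the current probability vector.
        population = [
            [1 if rng.random() < p else 0 for p in probs]
            for _ in range(pop_size)
        ]
        winner = max(population, key=fitness)
        score = fitness(winner)
        if score > best_score:
            best, best_score = winner, score
        # Shift each probability toward the corresponding bit of the winner.
        probs = [
            (1 - learning_rate) * p + learning_rate * bit
            for p, bit in zip(probs, winner)
        ]
    return best, best_score, probs


def toy_fitness(mask):
    """Hypothetical stand-in fitness: reward features 0-2, penalise all others."""
    informative = {0, 1, 2}
    hits = sum(mask[i] for i in informative)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask, best_score, probs = pbil(toy_fitness, n_bits=10)
print(best_mask, best_score)
```

In a feature-selection setting, each bit would mark whether a gene is included, and `toy_fitness` would be replaced by a measure of inter-class distinction computed on the selected genes.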

    Computational Intelligence Based Classifier Fusion Models for Biomedical Classification Applications

    The generalization abilities of machine learning algorithms often depend on the algorithms’ initialization, parameter settings, training sets, or feature selections. For instance, SVM classifier performance largely relies on whether the selected kernel functions are suitable for real application data. To enhance the performance of individual classifiers, this dissertation proposes classifier fusion models using computational intelligence knowledge to combine different classifiers. The first fusion model, called T1FFSVM, combines multiple SVM classifiers by constructing a fuzzy logic system. T1FFSVM can be improved by tuning the fuzzy membership functions (MFs) of linguistic variables using genetic algorithms. The improved model is called GFFSVM. To better handle uncertainties existing in fuzzy MFs and in classification data, T1FFSVM can also be improved by applying type-2 fuzzy logic to construct a type-2 fuzzy classifier fusion model (T2FFSVM). T1FFSVM, GFFSVM, and T2FFSVM use accuracy as a classifier performance measure. AUC (the area under an ROC curve) has been shown to be a better classifier performance metric. As a comparison study, AUC-based classifier fusion models are also proposed in the dissertation. The experiments on biomedical datasets demonstrate promising performance of the proposed classifier fusion models compared with the individual composing classifiers. The proposed classifier fusion models also demonstrate better performance than many existing classifier fusion methods. The dissertation also studies one interesting phenomenon in the biology domain using machine learning and classifier fusion methods, namely how protein structures and sequences are related to each other. The experiments show that protein segments with similar structures also share similar sequences, which adds new insights to the existing knowledge on the relation between protein sequences and structures: similar sequences share high structure similarity, but similar structures may not share high sequence similarity.

    Transdiagnostic dimensions of psychosis in the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP)

    The validity of the classification of non-affective and affective psychoses as distinct entities has been disputed, but, despite calls for alternative approaches to defining psychosis syndromes, there is a dearth of empirical efforts to identify transdiagnostic phenotypes of psychosis. We aimed to investigate the validity and utility of general and specific symptom dimensions of psychosis cutting across schizophrenia, schizoaffective disorder and bipolar I disorder with psychosis. Multidimensional item-response modeling was conducted on symptom ratings of the Positive and Negative Syndrome Scale, Young Mania Rating Scale, and Montgomery-Åsberg Depression Rating Scale in the multicentre Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP) consortium, which included 933 patients with a diagnosis of schizophrenia (N=397), schizoaffective disorder (N=224), or bipolar I disorder with psychosis (N=312). A bifactor model with one general symptom dimension, two distinct dimensions of non-affective and affective psychosis, and five specific symptom dimensions of positive, negative, disorganized, manic and depressive symptoms provided the best model fit. There was further evidence on the utility of symptom dimensions for predicting B-SNIP psychosis biotypes with greater accuracy than categorical DSM diagnoses. General, positive, negative and disorganized symptom dimension scores were higher in African American vs. Caucasian patients. Symptom dimensions accurately classified patients into categorical DSM diagnoses. This study provides evidence on the validity and utility of transdiagnostic symptom dimensions of psychosis that transcend traditional diagnostic boundaries of psychotic disorders. Findings further show promising avenues for research at the interface of dimensional psychopathological phenotypes and basic neurobiological dimensions of psychopathology.

    Novel methods to elucidate core classes in multi-dimensional biomedical data

    Breast cancer, which is the most common cancer in women, is a complex disease characterised by multiple molecular alterations. Current routine clinical management relies on the availability of robust clinical and pathologic prognostic and predictive factors, like the Nottingham Prognostic Index, to support decision making. Recent advances in high-throughput molecular technologies have supported the evidence of a biologic heterogeneity of breast cancer. This thesis is a multi-disciplinary work involving both computer scientists and molecular pathologists. It focuses on the development of advanced computational models for the classification of breast cancer into sub-types of the disease based on protein expression levels of selected markers. A previous study conducted at the University of Nottingham suggested that immunohistochemical analysis may be used to identify distinct biological classes of breast cancer. The objectives of this work were related to both clinical and technical aspects. From a clinical point of view, the aim was to encourage a multiple-technique approach when dealing with classification and clustering. From a technical point of view, one of the goals was to verify the stability of groups obtained from different unsupervised clustering algorithms, applied to the same data, and to compare and combine the different solutions with the ones available from the previous study. These aims and objectives were considered in the attempt to fill a number of gaps in the body of knowledge. Several research questions were raised, including how to combine the results obtained by a multi-technique approach for clustering and whether the medical decision making process could be moved in the direction of personalised healthcare. An original framework to identify core representative classes in a dataset was developed and is described in this thesis. Using different clustering algorithms and several validity indices to explore the best number of groups into which to split the data, a set of classes may be defined by considering those points that remain stable across different clustering techniques. This set of representative classes may then be characterised using standard statistical techniques and validated using supervised learning. Each step of this framework has been studied separately, resulting in different chapters of this thesis. The whole approach has been successfully applied to a novel set of histone markers for breast cancer provided by the School of Pharmacy at the University of Nottingham. Although further tests are needed to validate and improve the proposed framework, these results make it a good candidate for being transferred to the real world of medical decision making. Other contributions to knowledge may be extracted from this work. Firstly, six breast cancer subtypes have been identified, using consensus clustering, and characterised in terms of clinical outcome. Two of these classes were new in the literature. The second contribution is related to supervised learning. A novel method, based on the naive Bayes classifier, was developed to cope with the non-normality of covariates in many real world problems. This algorithm was validated over known data sets and compared with traditional approaches, obtaining better results in two examples. All these contributions, and especially the novel framework, may also have a clinical impact, as overall medical care is gradually moving in the direction of personalised care. By training a small number of doctors it may be possible for them to use the framework directly and find different sub-types of the disease they are investigating.
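
The idea of keeping only those points that remain stable across clustering techniques can be illustrated with a small Python sketch. This is a simplified reading, not the framework from the thesis: it assumes the cluster labels produced by the different algorithms have already been aligned to a common numbering, and it uses strict unanimity as the stability rule. The function name `core_classes` and the example labelings are hypothetical.

```python
def core_classes(labelings):
    """Return {point_index: label} for points on which every labeling agrees.

    `labelings` is a list of label lists, one per clustering algorithm,
    assumed to be aligned so that the same label denotes the same class.
    """
    n_points = len(labelings[0])
    core = {}
    for i in range(n_points):
        votes = {labels[i] for labels in labelings}
        if len(votes) == 1:  # unanimous assignment, so the point is stable
            core[i] = votes.pop()
    return core


# Three hypothetical clusterings of five points.
labelings = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 2],
]
core = core_classes(labelings)
print(core)  # point 3 is dropped because the second clustering disagrees
```

Points left out of `core` are the ones whose class membership depends on the choice of algorithm; the stable remainder could then be characterised statistically and used to train a supervised classifier.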

    Evaluation of the validity and utility of a transdiagnostic psychosis dimension encompassing schizophrenia and bipolar disorder

    Background: In recent years, the Kraepelinian dichotomy has been challenged in light of evidence on shared genetic and environmental factors for schizophrenia and bipolar disorder, but empirical efforts to identify a transdiagnostic phenotype of psychosis remain remarkably limited. Aims: To investigate whether schizophrenia spectrum and bipolar disorder lie on a transdiagnostic spectrum with overlapping non-affective and affective psychotic symptoms. Method: Multidimensional item-response modelling was conducted on symptom ratings of the OPerational CRITeria (OPCRIT) system in 1168 patients with schizophrenia spectrum and bipolar disorder. Results: A bifactor model with one general, transdiagnostic psychosis dimension underlying affective and non-affective psychotic symptoms and five specific dimensions of positive, negative, disorganised, manic and depressive symptoms provided the best model fit and diagnostic utility for categorical classification. Conclusions: Our findings provide support for including dimensional approaches into classification systems and a directly measurable clinical phenotype for cross-disorder investigations into shared genetic and environmental factors of psychosis.

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: Rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models are optimized in addition to the predictive accuracy and robustness. The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results: (1) the identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining); (2) the prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, collaboration with the Spanish National Cancer Centre); (3) the prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs); (4) the discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, collaboration with the Spanish National Cancer Centre). In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single easy-to-use software package and has provided new biological insights in a wide variety of practical settings.

    Complexity Reduction in Image-Based Breast Cancer Care

    The diversity of malignancies of the breast requires personalized diagnostic and therapeutic decision making in a complex situation. This thesis contributes in three clinical areas: (1) For clinical diagnostic image evaluation, computer-aided detection and diagnosis of mass and non-mass lesions in breast MRI is developed. 4D texture features characterize mass lesions. For non-mass lesions, a combined detection/characterisation method utilizes the bilateral symmetry of the breast's contrast agent uptake. (2) To improve clinical workflows, a breast MRI reading paradigm is proposed, exemplified by a breast MRI reading workstation prototype. Instead of mouse and keyboard, it is operated using multi-touch gestures. The concept is extended to mammography screening, introducing efficient navigation aids. (3) Contributions to finite element modeling of breast tissue deformations tackle two clinical problems: surgery planning and the prediction of the breast deformation in an MRI biopsy device.

    Data Clustering and Partial Supervision with Some Parallel Developments

    Data Clustering and Partial Supervision with Some Parallel Developments, by Sameh A. Salem. Clustering is an important and irreplaceable step in the search for structures in data. Many different clustering algorithms have been proposed. Yet the sources of variability in most clustering algorithms affect the reliability of their results. Moreover, the majority tend to require the number of clusters as one of the input parameters. Unfortunately, there are many scenarios where this knowledge may not be available. In addition, clustering algorithms are very computationally intensive, which makes scaling up to large datasets a major challenge. This thesis gives possible solutions for such problems. First, new measures, called clustering performance measures (CPMs), for assessing the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate: 1) clustering algorithms that have a structure bias towards certain types of data distribution as well as those that have no such biases, 2) clustering algorithms that have initialisation dependency as well as clustering algorithms that have a unique solution for a given set of parameter values with no initialisation dependency. Then, a novel clustering algorithm, the RAdius based Clustering ALgorithm (RACAL), is proposed. RACAL uses a distance based principle to map the distributions of the data, assuming that clusters are determined by a distance parameter, without having to specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to choose the best clustering result, i.e. the result with compact clusters and wide cluster separations, for a given input parameter. Comparisons with other clustering algorithms indicate the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive partial supervision strategy is proposed for use in conjunction with RACAL to make it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is proposed. The parallel evaluations of P-RACAL indicate that P-RACAL is scalable in terms of speedup and scaleup, which gives the ability to handle large datasets of high dimensions in a reasonable time. Next, a novel clustering algorithm, which achieves clustering without any control of cluster sizes, is introduced. This algorithm, which is called the Nearest Neighbour Clustering Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier, with the advantage that the algorithm needs no training set and is completely unsupervised. Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to act as a classifier. Comparisons with other methods indicate the robustness of the proposed method in classification. Additionally, experiments in a parallel environment indicate the suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets. Further investigations on more challenging data are carried out. In this context, microarray data is considered. In such data, the number of clusters is not clearly defined. This points directly towards clustering algorithms that do not require knowledge of the number of clusters. Therefore, the efficacy of one of these algorithms is examined. Finally, a novel integrated clustering performance measure (ICPM) is proposed as a guideline for choosing the proper clustering algorithm that has the ability to extract useful biological information in a particular dataset. Supplied by The British Library - 'The world's knowledge'

    Automatic discovery of drug mode of action and drug repositioning from gene expression data

    2009 - 2010. The identification of the molecular pathway that is targeted by a compound, combined with the dissection of the following reactions in the cellular environment, i.e. the drug mode of action, is a key challenge in biomedicine. Elucidation of drug mode of action has been attempted, in the past, with different approaches. Methods based only on transcriptional responses are those requiring the least amount of information and can be quickly applied to new compounds. On the other hand, they have met with limited success and, at present, a general, robust and efficient gene-expression based method to study drugs in mammalian systems is still missing. We developed an efficient analysis framework to investigate the mode of action of drugs by using gene expression data only. In particular, by using a large compendium of gene expression profiles following treatments with more than 1,000 compounds on different human cell lines, we were able to extract a synthetic consensual transcriptional response for each of the tested compounds. This was obtained by developing an original rank merging procedure. Then, we designed a novel similarity measure among the transcriptional responses to each drug, ending up with a “drug similarity network”, where each drug is a node and edges represent significant similarities between drugs. By means of a novel hierarchical clustering algorithm, we then provided this network with a modular topology, containing groups of highly interconnected nodes (i.e. network communities) whose exemplars form second-level modules (i.e. network rich-clubs), and so on. We showed that these topological modules are enriched for a given mode of action and that the hierarchy of the resulting final network reflects the different levels of similarity among the modes of action of the composing compounds. Most importantly, by integrating a novel drug X into this network (which can be done very quickly) the unknown mode of action can be inferred by studying the topology of the subnetwork surrounding X. Moreover, novel potential therapeutic applications can be assigned to safe and approved drugs that are already present in the network by studying their neighborhood (i.e. drug repositioning), hence in a very cheap, easy and fast way, without the need for additional experiments. By using this approach, we were able to correctly classify novel anti-cancer compounds; to predict and experimentally validate an unexpected similarity in the mode of action of CDK2 inhibitors and topoisomerase inhibitors; and to predict that Fasudil, a known and FDA-approved cardiotonic agent, could be repositioned as a novel enhancer of cellular autophagy. Due to the extremely safe profile of this drug and its potential ability to traverse the blood-brain barrier, this could have strong implications in the treatment of several human neurodegenerative disorders, such as Huntington's and Parkinson's diseases. [edited by author] IX n.s.