    Optimisation approaches for data mining in biological systems

    The advances in data acquisition technologies have generated massive amounts of data that present considerable challenges for analysis. How to efficiently and automatically mine the data and extract maximum value by identifying hidden patterns is an active research area, called data mining. This thesis tackles several problems in data mining, including data classification, regression analysis and community detection in complex networks, with applications in various biological systems. First, the problem of data classification is investigated. An existing classifier has been adopted from the literature and two novel solution procedures have been proposed, which are shown to improve the predictive accuracy of the original method and significantly reduce the computational time. Disease classification using high-throughput genomic data is also addressed. To tackle the problem of analysing a large number of genes against a small number of samples, a new approach has been proposed that incorporates extra biological knowledge and constructs higher-level composite features for classification. A novel model has been introduced to optimise the construction of composite features. Subsequently, regression analysis is considered, and two piece-wise linear regression methods have been presented. The first method partitions one feature into multiple complementary intervals and fits each with a distinct linear function. The second method is a more general variant of the first and performs recursive binary partitioning, which permits partitioning of multiple features. Lastly, community detection in complex networks is investigated, and a new optimisation framework is introduced to identify the modular structure hidden in directed networks via optimisation of modularity. A non-linear model is first proposed, followed by its linearised variant. The optimisation framework consists of two major steps: solving the non-linear model to identify a coarse initial partition, and then repeatedly solving the linearised models to refine the network partition.
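    As a concrete illustration of the first piece-wise linear regression method summarised above, the sketch below fits a distinct least-squares line on each of two complementary intervals of a single feature. It is a minimal sketch with a hand-picked breakpoint and synthetic data; the thesis itself selects the partition through an optimisation procedure, which is not reproduced here.

```python
# Minimal sketch: one feature, complementary intervals, one line per interval.
# The breakpoint (5.0) and data are illustrative, not the thesis's method.
import numpy as np

def piecewise_linear_fit(x, y, breakpoints):
    """Fit a separate least-squares line on each interval of x."""
    edges = [-np.inf, *breakpoints, np.inf]
    models = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (x >= lo) & (x < hi)
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        models.append((lo, hi, slope, intercept))
    return models

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, 2 * x, 10 - x) + rng.normal(0, 0.3, 200)
for lo, hi, m, b in piecewise_linear_fit(x, y, breakpoints=[5.0]):
    print(f"[{lo}, {hi}): y ~ {m:.2f}x + {b:.2f}")
```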

    Molecular Signature as Optima of Multi-Objective Function with Applications to Prediction in Oncogenomics

    This thesis provides a theoretical introduction to, and a practical treatment of, the topic Molecular Signature as Optima of a Multi-Objective Function with Applications to Prediction in Oncogenomics. The opening chapters address cancer, in particular breast cancer and its subtype, triple-negative breast cancer. A literature review of optimisation methods follows, focusing on meta-heuristic methods for multi-objective optimisation and on machine learning. Further sections cover oncogenomics and the principles of microarrays, as well as statistical methods, with emphasis on the calculation of the p-value and the Bimodality Index. The practical part describes the course of the research and the conclusions reached, which point towards the next steps of the research. The selected methods were implemented in Matlab and R, with additional use of the programming languages Java and Python.
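    The abstract highlights the Bimodality Index among the statistical tools used. The sketch below shows the standard definition due to Wang et al. (2009), BI = sqrt(p(1-p)) * |mu1 - mu2| / sigma, estimated from a two-component, equal-variance Gaussian mixture; the scikit-learn fitting and the synthetic data are illustrative assumptions, not the thesis's implementation.

```python
# Bimodality Index from a two-component, tied-variance Gaussian mixture.
# Synthetic data stand in for gene-expression values.
import numpy as np
from sklearn.mixture import GaussianMixture

def bimodality_index(values):
    gmm = GaussianMixture(n_components=2, covariance_type="tied",
                          random_state=0).fit(values.reshape(-1, 1))
    mu1, mu2 = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_.ravel()[0])  # common within-component SD
    p = gmm.weights_[0]                           # mixing proportion
    delta = abs(mu1 - mu2) / sigma                # standardised distance
    return np.sqrt(p * (1 - p)) * delta

rng = np.random.default_rng(1)
bimodal = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)])
print(f"BI ~ {bimodality_index(bimodal):.2f}")  # clearly > 1 for bimodal data
```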

    Automated classification of cancer tissues using multispectral imagery

    Automated classification of medical images for colorectal and prostate cancer diagnosis is a crucial tool for improving routine diagnostic decisions. Therefore, over the last few decades there has been increasing interest in refining and adapting machine learning algorithms to classify microscopic images of tumour biopsies. Recently, multispectral imagery has received significant interest from the research community, owing to the fast-growing availability of high-performance computers. This thesis investigates novel algorithms for automatic classification of colorectal and prostate cancer using multispectral imagery, with the aim of proposing a system that outperforms the state-of-the-art techniques in the field. To achieve this objective, several feature extraction methods based on image texture have been investigated, analysed and evaluated. A novel texture feature for multispectral images is also constructed, adapting the local binary pattern texture feature to multispectral images by expanding the pixel neighbourhood to the spectral dimension. It has the advantage of capturing the multispectral information with a limited feature vector size, and it demonstrated improved classification results when compared against traditional texture features. To further enhance the system's performance, advanced classification schemes such as bag-of-features (to better capture local information) and stacked generalisation (to select the most discriminative texture features) are explored and evaluated. Finally, recent years have seen an accelerated, exponential rise of deep learning, boosted by advances in hardware, and more specifically in graphics processing units. Such models have demonstrated excellent results for supervised learning in multiple applications, which motivated the employment in this thesis of deep neural network architectures, namely convolutional neural networks. Experiments were also carried out to compare the performance obtained with features extracted by convolutional neural networks with random initialisation against features extracted with models pre-trained on the ImageNet dataset. The analysis of the classification accuracy achieved with the deep learning models reveals that they outperform the previously proposed texture extraction methods. In this thesis, the algorithms are assessed using two separate multiclass datasets: the first consists of prostate tumour multispectral images, and the second contains multispectral images of colorectal tumours. The colorectal dataset was acquired over a wide domain of the light spectrum, ranging from visible to infrared wavelengths, and was used to demonstrate the improved results produced using infrared as well as visible light.
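    The abstract's key technical contribution is an LBP variant whose neighbourhood extends into the spectral dimension. The exact construction is not given in the abstract, so the sketch below shows one plausible reading: each pixel is compared against its 8 spatial neighbours in the same band plus the same pixel in the two adjacent bands, producing a 10-bit code per band whose histogram serves as a compact feature vector.

```python
# One plausible spectral extension of LBP (illustrative, not the thesis's
# exact definition): 8 spatial comparison bits plus 2 spectral bits.
import numpy as np

def multispectral_lbp(cube):
    """cube: (bands, H, W) multispectral image; returns per-band 10-bit codes."""
    bands, h, w = cube.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    codes = np.zeros((bands, h - 2, w - 2), dtype=np.uint16)
    for b in range(bands):
        centre = cube[b, 1:-1, 1:-1]
        bit = 0
        for dy, dx in offsets:                      # spatial neighbours
            nb = cube[b, 1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
            codes[b] |= (nb >= centre).astype(np.uint16) << bit
            bit += 1
        for db in (-1, 1):                          # spectral neighbours
            if 0 <= b + db < bands:
                nb = cube[b + db, 1:-1, 1:-1]
                codes[b] |= (nb >= centre).astype(np.uint16) << bit
            bit += 1
    return codes

cube = np.random.default_rng(2).integers(0, 256, (4, 64, 64))
hist = np.bincount(multispectral_lbp(cube).ravel(), minlength=1024)
print(hist.shape)   # (1024,): a fixed-size histogram feature vector
```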

    Simulated Annealing

    The book contains 15 chapters presenting recent contributions of top researchers working with Simulated Annealing (SA). Although it represents a small sample of the research activity on SA, the book will certainly serve as a valuable tool for researchers interested in getting involved in this multidisciplinary field. In fact, one of its salient features is that it is highly multidisciplinary in terms of application areas, assembling experts from the fields of Biology, Telecommunications, Geology, Electronics and Medicine.
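    The abstract describes the book rather than the algorithm itself; for readers new to the field, the following is a minimal, generic sketch of the canonical SA loop (not taken from any chapter of the book): propose a random perturbation, always accept improvements, accept worse moves with probability exp(-delta/T), and cool the temperature geometrically.

```python
# Canonical simulated annealing loop (generic textbook form).
import math
import random

def simulated_annealing(cost, neighbour, x0, t0=1.0, alpha=0.995, steps=10_000):
    x, fx = x0, cost(x0)
    best, fbest, t = x, fx, t0
    for _ in range(steps):
        y = neighbour(x)
        fy = cost(y)
        # Accept improvements always; worse moves with Metropolis probability.
        if fy < fx or random.random() < math.exp(-(fy - fx) / t):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        t *= alpha                     # geometric cooling schedule
    return best, fbest

# Toy usage: minimise a 1-D multimodal function.
f = lambda x: x * x + 10 * math.sin(3 * x)
x, fx = simulated_annealing(f, lambda x: x + random.gauss(0, 0.5), x0=5.0)
print(f"x ~ {x:.3f}, f(x) ~ {fx:.3f}")
```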

    Machine Learning for Prostate Histopathology Assessment

    Pathology reporting on radical prostatectomy (RP) specimens is essential to post-surgery patient care. However, current pathology interpretation of RP sections is typically qualitative and subject to intra- and inter-observer variability, which challenges quantitative and repeatable reporting of lesion grade, size, location, and spread. We therefore developed and validated a software platform that can automatically detect and grade cancerous regions on whole slide images (WSIs) of whole-mount RP sections to support quantitative and visual reporting. Our study used haematoxylin- and eosin-stained WSIs from 299 whole-mount RP sections from 71 patients, comprising 1.2 million 480 μm × 480 μm regions-of-interest (ROIs) covering benign and cancerous tissues that contain all clinically relevant grade groups. Each cancerous region was annotated and graded by an expert genitourinary pathologist. We used a machine learning approach with 7 different classifiers (3 non-deep learning and 4 deep learning) to classify: 1) each ROI as cancerous vs. non-cancerous, and 2) each cancerous ROI as high- vs. low-grade. Since recent studies found some subtypes beyond Gleason grade to have independent prognostic value, we also used one deep learning method to classify each cancerous ROI from 87 RP sections of 25 patients into each of eight subtypes, to support further clinical pathology research on this topic. We cross-validated each system against the expert annotations. To compensate for the staining variability across WSIs from different patients, we computed a tissue component map (TCM) using our proposed adaptive thresholding algorithm to label nucleus pixels, global thresholding to label lumen pixels, and assigning the rest as stroma/other. Fine-tuning AlexNet with ROIs of the TCM yielded the best results for prostate cancer (PCa) detection and grading, with areas under the receiver operating characteristic curve (AUCs) of 0.98 and 0.93, respectively, followed by fine-tuned AlexNet with ROIs of the raw image. For subtype grading, fine-tuning AlexNet with ROIs of the raw image yielded AUCs ≥ 0.7 for seven of the eight subtypes. In conclusion, deep learning approaches outperformed non-deep learning approaches for PCa detection and grading, the TCMs provided the primary cues for PCa detection and grading, and machine learning can be used for subtype grading beyond the Gleason grading system.
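    The tissue component map described above labels each pixel as nucleus, lumen, or stroma/other. The sketch below is a simplified stand-in using OpenCV's built-in adaptive thresholding; the block size, offset, and lumen threshold are illustrative assumptions, and the thesis's own adaptive-thresholding algorithm is not reproduced here.

```python
# Simplified three-class TCM: adaptive threshold for dark nuclei, global
# threshold for bright lumen, remainder labelled stroma/other.
import cv2
import numpy as np

def tissue_component_map(gray, lumen_thresh=220):
    """gray: uint8 greyscale ROI; returns 0 = stroma/other, 1 = nucleus, 2 = lumen."""
    nuclei = cv2.adaptiveThreshold(gray, 1, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, blockSize=51, C=10)
    tcm = np.zeros_like(gray, dtype=np.uint8)
    tcm[nuclei == 1] = 1                 # dark relative to local mean: nuclei
    tcm[gray >= lumen_thresh] = 2        # globally bright: lumen
    return tcm

roi = np.random.default_rng(3).integers(0, 256, (480, 480), dtype=np.uint8)
print(np.bincount(tissue_component_map(roi).ravel(), minlength=3))
```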

    Explainable clinical decision support system: opening black-box meta-learner algorithm expert's based

    Mathematical optimisation methods are the basic mathematical tools of artificial intelligence theory. In machine learning and deep learning, the examples from which algorithms learn (training data) are used by sophisticated cost functions, whose solutions can be obtained in closed form or through approximations. The interpretability of the models used, and their relative transparency as opposed to the opacity of black boxes, is related to how the algorithm learns, which occurs through the optimisation and minimisation of the errors the machine makes in the learning process. In particular, this work introduces a new method for determining the weights in an ensemble model, supervised or unsupervised, based on the well-known Analytic Hierarchy Process (AHP). The method rests on the idea that behind the choice among the different algorithms that could be applied to a machine learning problem there is an expert who controls the decision-making process. The expert assigns each algorithm a complexity score (based on the complexity-interpretability trade-off), from which the weight with which each model contributes to the training and prediction phases is determined. In addition, different methods are presented to evaluate the performance of these algorithms and to explain how each feature in the model contributes to the prediction of the outputs. The interpretability techniques used in machine learning are also combined with the AHP-based method in the context of clinical decision support systems, in order to make the (black-box) algorithms and their results interpretable and explainable, so that clinical decision-makers can take controlled decisions, in line with the "right to explanation" introduced by the legislator, since decision-makers bear civil and legal responsibility for choices made in the clinical field on the basis of systems that use artificial intelligence. Equally central is the interaction between the expert who controls the algorithm-construction process and the domain expert, in this case the clinician. Three applications on real data are implemented with methods known from the literature and with those proposed in this work: one concerns cervical cancer, another diabetes, and the last a specific pathology developed by HIV-infected individuals. All applications are supported by plots, tables and explanations of the results, implemented through Python libraries. The main case study of this thesis, regarding HIV-infected individuals, concerns an unsupervised ensemble problem in which a series of clustering algorithms is applied to a set of features; their outputs are used in turn as meta-features to provide a set of labels for each cluster. The meta-features and the labels obtained by choosing the best algorithm are used to train a logistic-regression meta-learner, which is then used through explainability methods to quantify the contribution each algorithm made in the training phase. The use of logistic regression as the meta-learner classifier is motivated by the fact that it provides good results and that its estimated coefficients are easily explained.
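    The AHP step described above turns expert judgements into ensemble weights. The sketch below shows the textbook AHP computation, taking the normalised principal eigenvector of a reciprocal pairwise-comparison matrix on Saaty's 1-9 scale; the matrix entries are hypothetical expert judgements, not values from the thesis.

```python
# Textbook AHP: ensemble weights from the principal eigenvector of a
# reciprocal pairwise-comparison matrix.
import numpy as np

def ahp_weights(pairwise):
    """Normalised principal-eigenvector weights of a reciprocal matrix."""
    vals, vecs = np.linalg.eig(pairwise)
    principal = np.real(vecs[:, np.argmax(np.real(vals))])
    return principal / principal.sum()

# Hypothetical expert judgements comparing three candidate algorithms,
# e.g. reflecting a complexity-interpretability trade-off.
A = np.array([[1.0, 3.0, 5.0],
              [1/3, 1.0, 2.0],
              [1/5, 1/2, 1.0]])
w = ahp_weights(A)
print(np.round(w, 3))   # weights each model gets in the ensemble
```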

    A Hierarchical, Fuzzy Inference Approach to Data Filtration and Feature Prioritization in the Connected Manufacturing Enterprise

    The current big data landscape is one in which the technology and capability to capture and store data have preceded and outpaced the corresponding capability to analyse and interpret them. This has led naturally to the development of elegant and powerful algorithms for data mining, machine learning, and artificial intelligence to harness the potential of the big data environment. A competing reality, however, is that there are limits to how, and to what extent, human beings can process complex information. The convergence of these realities creates a tension between the technical sophistication or elegance of a solution and its transparency or interpretability for the human data scientist or decision maker. This dissertation, contextualised in the connected manufacturing enterprise, presents an original Fuzzy Approach to Feature Reduction and Prioritization (FAFRAP) designed to assist the data scientist in filtering and prioritising data for inclusion in supervised machine learning models. A set of sequential filters reduces the initial set of independent variables, and a fuzzy inference system outputs a crisp numeric value associated with each feature, used to rank-order and prioritise features for inclusion in model training. Additionally, the fuzzy inference system outputs a descriptive label to assist in interpreting each feature's usefulness with respect to the problem of interest. Model testing is performed using three publicly available datasets from an online machine learning data repository and later applied to a case study in electronic assembly manufacture. Consistency of model results is experimentally verified using Fisher's Exact Test, and results of filtered models are compared to those obtained with the unfiltered sets of features using a proposed novel metric, the performance-size ratio (PSR).
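    The core of FAFRAP's prioritisation is a fuzzy inference system that outputs a crisp priority score and a descriptive label per feature. The sketch below is a minimal, hand-rolled stand-in: a hypothetical 0-1 relevance statistic is mapped through triangular memberships for "low", "medium" and "high", then defuzzified by a centroid rule; the membership functions and rule base are illustrative, not the dissertation's.

```python
# Minimal fuzzy prioritiser: triangular memberships + centroid defuzzification.
# The relevance input and all membership parameters are illustrative.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def prioritise(relevance):
    # Degree of membership in each linguistic term.
    memberships = {"low": tri(relevance, -0.5, 0.0, 0.5),
                   "medium": tri(relevance, 0.0, 0.5, 1.0),
                   "high": tri(relevance, 0.5, 1.0, 1.5)}
    centroids = {"low": 0.0, "medium": 0.5, "high": 1.0}
    total = sum(memberships.values())
    crisp = sum(memberships[t] * centroids[t] for t in memberships) / total
    label = max(memberships, key=memberships.get)
    return crisp, label

for r in (0.15, 0.55, 0.9):
    print(r, prioritise(r))   # crisp score for ranking + label for the analyst
```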