
    Data-Adaptive Kernel Support Vector Machine

    In this thesis, we propose the data-adaptive kernel Support Vector Machine (SVM), a new method with a data-driven scaling kernel function based on real data sets. This two-stage approach to kernel function scaling can enhance the accuracy of a support vector machine, especially when the data are imbalanced. Following the standard SVM procedure in the first stage, the proposed method locally adapts the kernel function to data locations based on the skewness of the class outcomes. In the second stage, the decision rule is constructed with the data-adaptive kernel function and is used as the classifier. This process enlarges the magnification effect directly on the Riemannian manifold within the feature space rather than the input space. The proposed data-adaptive kernel SVM technique is applied to binary classification and is extended to multi-class situations where imbalance is a main concern. We conduct extensive simulation studies to assess the performance of the proposed methods, and a prostate cancer image study is employed as an illustration. The data-adaptive kernel is further applied in the feature selection process. We propose the data-adaptive kernel-penalized SVM, a new method for simultaneous feature selection and classification that penalizes data-adaptive kernels in SVMs. Instead of penalizing the standard cost function of SVMs in the usual way, the penalty is added directly to the dual objective function that contains the data-adaptive kernel, so that classification results with sparse selected features are obtained simultaneously. Different penalty terms in the data-adaptive kernel-penalized SVM are compared, and the oracle property of the estimator is examined. We conduct extensive simulation studies to assess the performance of all the proposed methods, and employ the method on a breast cancer data set as an illustration.
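    The abstract does not spell out the scaling function, so the sketch below illustrates only the general conformal kernel-rescaling idea (in the spirit of Amari and Wu's kernel modification) that such two-stage methods build on; the factor c(x) = exp(-kappa f(x)^2), the function name, and the scikit-learn usage are illustrative assumptions, and the thesis's skewness-based, imbalance-aware adaptation is not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def conformal_scaled_svm(X, y, gamma=1.0, kappa=1.0):
    # Stage 1: fit a standard RBF-kernel SVM (binary labels assumed).
    base = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # Conformal factor c(x): largest where |f(x)| is small, i.e. near the
    # estimated decision boundary, so the rescaled kernel magnifies the
    # induced Riemannian metric there; kappa sets how localized this is.
    c = np.exp(-kappa * base.decision_function(X) ** 2)
    # Stage 2: data-adaptive kernel K'(x, z) = c(x) c(z) K(x, z),
    # supplied to the second SVM as a precomputed Gram matrix.
    K = c[:, None] * rbf_kernel(X, X, gamma=gamma) * c[None, :]
    adapted = SVC(kernel="precomputed").fit(K, y)
    return base, adapted, c
```

    To classify new points, the same rescaling is applied to the test-versus-training kernel matrix before calling the second-stage model's predict.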

    Characterization of Mammographic Dense Tissue Sub-Types Using Classification Algorithms on Metric Space Technique Output Functions

    In recent years, breast cancer has become the leading cause of global cancer incidence. One of the most common forms of screening is digital x-ray screening mammography. Risk assessment models, which help predict a patient's risk of developing breast cancer, rely mainly on patient history and qualitative breast density assessment from screening. The 2D wavelet transform modulus maxima (2D WTMM) method uses a sliding-window approach to quantify the spatial organization of the underlying mammographic tissue according to Hurst-exponent (H) ranges as fatty (H ≤ 0.45), healthy dense (H ≥ 0.55) and risky dense (0.45 < H < 0.55), resulting in grey-scale maps composed of H pixel values in the shape of mammograms. The metric space technique (MST) is a method for quantifying 2D maps as 1D output functions, where characteristics of an image are measured across threshold values and plotted. The MST was run on 89 tumorous patients (71 cancer, 18 benign) from the Perm data set, which comprises H-value maps of mediolateral oblique (MLO) and craniocaudal (CC) views. Of thirty possible metrics, six were found to have statistically significant differences between the cancer and benign categories. These six metrics were used to train univariate and multivariate general linear models (GLM) and k-nearest neighbor (KNN) models. The univariate KNN models outperformed the univariate GLM models, yielding acceptable discrimination and high specificity but low sensitivity and balanced accuracy. The multivariate KNN achieved the highest area under the receiver operating characteristic curve (ROC AUC) of 0.71, indicating acceptable discriminatory capacity. A modification to the MST is suggested which would address a dilution effect introduced through the examination of fatty, risky dense, and healthy dense AUC as discrete regions. Further work on feature selection and work with a larger, more balanced data set are necessary to validate these results.
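    As a hedged illustration of the MST idea described above (the function name and threshold grid are assumptions, and the real technique records thirty metrics rather than the single one shown):

```python
import numpy as np

def mst_output_function(h_map, thresholds=np.linspace(0.3, 0.7, 41)):
    # Sweep a threshold across a 2D Hurst-exponent map and record one
    # image characteristic at each value (here, the fraction of pixels
    # at or above the threshold), yielding a 1D output function per image.
    h_map = np.asarray(h_map, dtype=float)
    return np.array([(h_map >= t).mean() for t in thresholds])

# Tissue sub-types by H range, as defined in the abstract:
# fatty (H <= 0.45), risky dense (0.45 < H < 0.55), healthy dense (H >= 0.55).
```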

    Novel chemometric approaches towards handling biospectroscopy datasets

    Chemometrics allows one to identify chemical patterns using spectrochemical information from biological materials, such as tissues and biofluids. This is of fundamental importance for overcoming limitations of traditional bioanalytical analysis, such as the need for laborious and extremely invasive procedures, high consumption of reagents, and expensive instrumentation. In biospectroscopy, a beam of light, usually in the infrared region, is projected onto the surface of a biological sample and, as a result, a chemical signature is generated containing the vibrational information of most of the molecules in that material. This can be performed in a single-spectrum or hyperspectral imaging fashion, where a spectrum is generated for each position (pixel) on the surface of a biological material segment, hence allowing extraction of both spatial and spectrochemical information simultaneously. As advantages, these methodologies are non-destructive, have a relatively low cost, and require minimal sample preparation. However, in biospectroscopy, large datasets containing complex spectrochemical signatures are generated. These datasets are processed by computational tools in order to unravel their signal complexity and provide useful information for decision making, such as the identification of clustering patterns distinguishing diseased samples from healthy controls; differentiation of tumour grades; prediction of the categories of unknown samples; or identification of key molecular fragments (biomarkers) associated with the appearance of certain diseases, such as cancer. In this PhD thesis, new computational tools are developed to improve the processing of bio-spectrochemical data, providing better clinical outcomes for both spectral and hyperspectral datasets.
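    The hyperspectral case described above is usually handled by unfolding the image cube so that standard chemometric algorithms can be applied per pixel; a minimal sketch, with all names and the PCA choice assumed for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

def unfold_hypercube(cube):
    # Reshape a hyperspectral image (rows, cols, wavenumbers) into the
    # (pixels, wavenumbers) matrix that chemometric methods expect.
    rows, cols, bands = cube.shape
    return cube.reshape(rows * cols, bands)

# Exploratory example: project every pixel spectrum onto its first three
# principal components and view the scores as a pseudo-colour map.
# cube = ...  # one hyperspectral tissue image, shape (rows, cols, bands)
# scores = PCA(n_components=3).fit_transform(unfold_hypercube(cube))
# rgb = scores.reshape(cube.shape[0], cube.shape[1], 3)
```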

    Tutorial: Multivariate Classification for Vibrational Spectroscopy in Biological Samples

    Vibrational spectroscopy techniques, such as Fourier-transform infrared (FTIR) and Raman spectroscopy, have been successful methods for studying the interaction of light with biological materials and facilitating novel cell biology analysis. Spectrochemical analysis is very attractive for disease screening and diagnosis, microbiological studies, and forensic and environmental investigations because of its low cost, minimal sample preparation, non-destructive nature and substantially accurate results. However, there is now an urgent need for multivariate classification protocols that allow one to analyze biologically derived spectrochemical data and obtain accurate and reliable results. Multivariate classification comprises discriminant analysis and class-modeling techniques in which multiple spectral variables are analyzed in conjunction to distinguish and assign unknown samples to pre-defined groups. The requirement for such protocols is demonstrated by the fact that applications of deep-learning algorithms to complex datasets are being increasingly recognized as critical for extracting important information and visualizing it in a readily interpretable form. Here, we provide a tutorial for multivariate classification analysis of vibrational spectroscopy data (FTIR, Raman and near-IR), highlighting a series of critical steps, such as preprocessing, data selection, feature extraction, classification and model validation. This is an essential aspect of the construction of a practical spectrochemical analysis model for biological analysis in real-world applications, where fast, accurate and reliable classification models are fundamental.
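    To make the listed steps concrete, here is a hedged sketch of one common pipeline for such data, PCA-LDA with Savitzky-Golay preprocessing; the window length, component count and scipy/scikit-learn settings are illustrative assumptions, not the tutorial's prescribed values:

```python
import numpy as np
from scipy.signal import savgol_filter
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Preprocessing: Savitzky-Golay first derivative, a routine way to
# suppress baseline offsets in FTIR/Raman spectra.
preprocess = FunctionTransformer(
    lambda S: savgol_filter(S, window_length=9, polyorder=2, deriv=1, axis=1))

# Feature extraction (PCA) followed by classification (LDA): PCA-LDA.
model = make_pipeline(preprocess, PCA(n_components=10),
                      LinearDiscriminantAnalysis())

# X: (n_samples, n_wavenumbers) spectra; y: class labels.
# print(cross_val_score(model, X, y, cv=5).mean())  # model validation
```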

    Virtual patient-specific treatment verification using machine learning methods to assist the dose deliverability evaluation of radiotherapy prostate plans

    Machine learning (ML) methods represent a potential tool to support and optimize virtual patient-specific plan verification within radiotherapy workflows. However, previously reported applications did not consider the actual physical implications for predictor quality and model performance, nor did they report on the pertinence or limitations of their implementations. Therefore, the main goal of this thesis was to predict dose deliverability using different ML models and input predictor features, analysing the physical aspects involved in the predictions, in order to propose a reliable decision-support tool for virtual patient-specific plan verification protocols. Among the principal predictors explored in this thesis, numerical and high-dimensional features based on modulation complexity, treatment-unit parameters, and dosimetric plan parameters were all implemented by designing random forest (RF), extreme gradient boosting (XGBoost), neural network (NN), and convolutional neural network (CNN) models to predict gamma passing rates (GPR) for prostate treatments. Accordingly, this research highlights three principal findings. (1) The heterogeneity of the dataset composition directly impacts the quality of the predictor features and, subsequently, the model performance. (2) Models based on automatically extracted features (CNN models) from multi-leaf-collimator modulation maps (MM) presented more independent and transferable prediction performance. (3) ML algorithms incorporated into radiotherapy workflows for virtual plan verification should retrieve the treatment plan parameters associated with each prediction to support the model's reliability and stability. Finally, this thesis presents how the most relevant automatically extracted features from the activation maps were used to suggest an alternative decision-support tool that comprehensively evaluates the causes of the predicted dose deliverability.
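    As a hedged sketch of the plan-parameter-based variant described above (the feature list, the 95% action limit, and the scikit-learn usage are assumptions for illustration; the thesis's CNN models on modulation maps are not reproduced):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# X: one row per plan, holding modulation-complexity scores,
# treatment-unit parameters and dosimetric plan parameters.
# y: measured gamma passing rates (%) from patient-specific QA.
rf = RandomForestRegressor(n_estimators=500, random_state=0)
# y_pred = cross_val_predict(rf, X, y, cv=5)
# flagged = y_pred < 95.0  # plans routed to physical measurement
#                          # instead of virtual sign-off
```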

    Cascade of classifier ensembles for reliable medical image classification

    Medical image analysis and recognition is one of the most important tools in modern medicine. Different types of imaging technologies, such as X-ray, ultrasonography, biopsy, computed tomography and optical coherence tomography, have been widely used in the clinical diagnosis of various kinds of diseases. However, in clinical applications it is usually time-consuming to examine an image manually, and there is always a subjective element to the pathological examination of an image, which carries the risk of a doctor making a wrong decision. An automated technique can therefore provide valuable assistance to physicians. By utilizing techniques from machine learning and image analysis, this thesis aims to construct reliable diagnostic models for medical image data so as to reduce the problems faced by medical experts in image examination. Through supervised learning on the image data, the diagnostic model can be constructed automatically. The process of image examination by human experts is very difficult to simulate, as the knowledge of medical experts is often fuzzy and not easy to quantify; therefore, the problem of automatic diagnosis based on images is usually converted into an image classification problem. For image classification tasks, a single classifier often struggles to capture all aspects of the image data distribution. Therefore, in this thesis, a classifier ensemble based on the random subspace method is proposed to classify microscopic images, with multi-layer perceptrons as the base classifiers. Three types of feature extraction methods are selected for microscopic image description. The proposed method was evaluated on two microscopic image sets and showed promising results compared with the state of the art. In order to address classification reliability in biomedical image classification problems, a novel cascade classification system is designed, in which two random subspace-based classifier ensembles are connected in series. In the first stage of the cascade system, an ensemble of support vector machines is used as the base classifier set; the second stage consists of a neural network classifier ensemble. Using the reject option, images whose classification results cannot achieve the predefined rejection threshold at the current stage are passed to the next stage for further consideration. The proposed cascade system was evaluated on a breast cancer biopsy image set and two UCI machine learning datasets; the experimental results showed that the proposed method can achieve high classification reliability and accuracy with a small rejection rate. Many computer-aided diagnosis systems face the problem of imbalanced data: the datasets used for diagnosis are often imbalanced because the number of normal cases is usually larger than the number of disease cases. Classifiers that generalize over the data are not the most appropriate choice in such an imbalanced situation. To tackle this problem, a novel one-class classifier ensemble is proposed. Kernel Principal Component classifiers are selected as the base classifiers in the ensemble; the base classifiers are trained on different types of image features and then combined using a product combining rule. The proposed one-class classifier ensemble is also embedded into the cascade scheme to improve classification reliability and accuracy. The proposed method was evaluated on two medical image sets, and favorable results were obtained compared with the state of the art.
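    A minimal sketch of the reject-option cascade described above (function and variable names are assumptions; the real system uses random subspace SVM and neural network ensembles as its stages):

```python
import numpy as np

def cascade_predict(stages, thresholds, X):
    # stages: fitted classifier ensembles exposing predict_proba;
    # thresholds: one rejection threshold per stage. A sample is labelled
    # at the first stage whose top class probability reaches the
    # threshold; otherwise it is passed on. Samples rejected by every
    # stage are deferred to a human expert (label -1).
    y = np.full(len(X), -1, dtype=object)
    pending = np.arange(len(X))
    for clf, t in zip(stages, thresholds):
        if pending.size == 0:
            break
        proba = clf.predict_proba(X[pending])
        accept = proba.max(axis=1) >= t
        y[pending[accept]] = clf.classes_[proba[accept].argmax(axis=1)]
        pending = pending[~accept]
    return y
```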

    Learning from Structured Data with High Dimensional Structured Input and Output Domain

    Structured data is accumulating rapidly in many applications, e.g. bioinformatics, cheminformatics, social network analysis, natural language processing and text mining. Designing and analyzing algorithms for handling these large collections of structured data has received significant interest in the data mining and machine learning communities, in both the input and output domains. However, it is nontrivial to adapt traditional machine learning algorithms, e.g. SVMs or linear regression, to structured data. For one thing, the structural information in the input and output domains is ignored if standard algorithms are applied to structured data. For another, the major challenge in learning from high-dimensional structured data is that the input/output domain can contain tens of thousands or more features and labels. With a high-dimensional structured input space and/or structured output space, learning a low-dimensional and consistent structured predictive function is important for both the robustness and the interpretability of the model. In this dissertation, we present several machine learning models that learn from data with structured input features and structured output tasks. For learning from data with structured input features, I have developed structured sparse boosting for graph classification and structured joint sparse PCA for anomaly detection and localization. Besides learning from structured input, I also investigated the interplay between structured input and output in the context of multi-task learning. In particular, I designed a multi-task learning algorithm that performs structured feature selection and task relationship inference. We demonstrate the applications of these structured models on subgraph-based graph classification, networked data stream anomaly detection/localization, multiple cancer type prediction, neuron activity prediction and social behavior prediction. Finally, drawing on my internship work at IBM T.J. Watson Research, I demonstrate how to leverage structural information from mobile data (e.g. call detail records and GPS data) to derive important places in people's daily lives for transit optimization and urban planning.
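    As a hedged, simplified stand-in for the structured feature selection across related tasks discussed above (the dissertation's own algorithms are more elaborate; the scikit-learn estimator and parameters here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

# X: (n_samples, n_features); Y: (n_samples, n_tasks). The group
# (l2/l1) penalty of MultiTaskLasso zeroes out entire feature columns
# jointly across tasks, so all tasks share one sparse support, a
# simple form of structured sparsity.
mtl = MultiTaskLasso(alpha=0.1)
# mtl.fit(X, Y)
# shared = np.flatnonzero(np.any(mtl.coef_ != 0, axis=0))  # selected features
```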

    Identifying disease-associated genes based on artificial intelligence

    Identifying disease-gene associations can help improve the understanding of disease mechanisms, which has a variety of applications, such as early diagnosis and drug development. Although experimental techniques, such as linkage analysis and genome-wide association studies (GWAS), have identified a large number of associations, identifying disease genes remains challenging because experimental methods are usually time-consuming and expensive. To address these issues, computational methods have been proposed to predict disease-gene associations. Based on the characteristics of the existing computational algorithms in the literature, we can roughly divide them into three categories: network-based methods, machine learning-based methods, and other methods. No matter what models are used to predict disease genes, the proper integration of multi-level biological data is the key to improving prediction accuracy. This thesis addresses some limitations of the existing computational algorithms and integrates multi-level data via artificial intelligence techniques. The thesis starts with a comprehensive review of the computational methods, databases, and evaluation methods used in predicting disease-gene associations, followed by one network-based method and four machine learning-based methods. The first chapter introduces the background information, the objectives of the studies and the structure of the thesis. After that, a comprehensive review is provided in the second chapter to discuss the existing algorithms as well as the databases and evaluation methods used in existing studies. With these objectives and future directions established, the thesis then presents five computational methods for predicting disease-gene associations. The first method, proposed in Chapter 3, considers the issue of non-disease gene selection. A shortest-path-based strategy is used to select reliable non-disease genes from a disease gene network and a differential network. The selected genes are then used by a network-energy model to improve its performance. The second method, proposed in Chapter 4, constructs sample-based networks for case samples and uses them to predict disease genes. This strategy improves the quality of protein-protein interaction (PPI) networks, which further improves the prediction accuracy. Chapter 5 presents a generic model which applies multimodal deep belief nets (DBN) to fuse different types of data. Network embeddings extracted from PPI networks and gene ontology (GO) data are fused with the multimodal DBN to obtain cross-modality representations. Chapter 6 presents another deep learning model which uses a convolutional neural network (CNN) to integrate gene similarities with other types of data. Finally, the fifth method, proposed in Chapter 7, is a nonnegative matrix factorization (NMF)-based method. This method maps diseases and genes onto a lower-dimensional manifold, and the geodesic distance between diseases and genes is used to predict their associations. The method can predict disease genes even if the disease under consideration has no known associated genes. In summary, this thesis proposes several artificial intelligence-based computational algorithms to address typical issues in existing approaches. Experimental results have shown that the proposed methods can improve the accuracy of disease-gene prediction.
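    To ground the matrix-factorization idea behind the Chapter 7 method, here is a hedged sketch of plain NMF-based association scoring; the component count and initialization are assumptions, and the thesis's manifold mapping and geodesic distances are not reproduced:

```python
import numpy as np
from sklearn.decomposition import NMF

# A: binary (n_diseases, n_genes) matrix of known associations.
nmf = NMF(n_components=20, init="nndsvda", max_iter=500)
# W = nmf.fit_transform(A)   # low-dimensional disease factors
# H = nmf.components_        # low-dimensional gene factors
# scores = W @ H             # high scores at zero entries of A point to
#                            # candidate disease-gene associations
```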

    Interpretability-oriented data-driven modelling of bladder cancer via computational intelligence


    Kernel Methods and Measures for Classification with Transparency, Interpretability and Accuracy in Health Care

    Support vector machines (SVMs) are a popular method in machine learning. They learn from data about a subject, for example, lung tumors in a set of patients, to classify new data, such as a new patient's tumor. The new tumor is classified as either cancerous or benign, depending on how similar it is to the tumors of other patients in those two classes, where similarity is judged by a kernel. The adoption and use of support vector machines in health care, however, is inhibited by a perceived and actual lack of rationale, understanding and transparency regarding how they work and how to interpret information and results from them. For example, a user must select the kernel, or similarity function, to be used, and there are many kernels to choose from but little to no useful guidance on choosing one. The primary goal of this thesis is to create accurate, transparent and interpretable kernels, with a rationale for selecting them, for classification in health care using SVMs, and to do so within a theoretical framework that advances rationale, understanding and transparency for kernel/model selection with atomic data types. The kernels and the framework necessarily co-exist. The secondary goal of this thesis is to quantitatively measure model interpretability for kernel/model selection and to identify the types of interpretable information available from different models. Testing my framework and transparent kernels on empirical data, I achieve classification accuracy that is better than or equivalent to that of Gaussian RBF kernels. I also validate some of the model interpretability measures I propose.
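    For contrast with the guidance the thesis calls for, the usual fallback for kernel selection is a brute-force cross-validated comparison; a minimal sketch of that baseline (the dataset names are assumed):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X, y: a labelled health-care dataset, e.g. tumor features and
# cancerous/benign labels.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel)
    # print(kernel, cross_val_score(clf, X, y, cv=5).mean())
```

    Such a comparison says nothing about why a kernel suits the data, which is the gap the thesis's transparent kernels and selection rationale aim to close.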