
    Granular Support Vector Machines Based on Granular Computing, Soft Computing and Statistical Learning

    With the emergence of biomedical informatics, Web intelligence, and E-business, new challenges are arising for knowledge discovery and data mining. This dissertation proposes a framework named Granular Support Vector Machines (GSVM) that systematically and formally combines statistical learning theory, granular computing theory and soft computing theory to address challenging predictive data modeling problems effectively and/or efficiently, with a specific focus on binary classification. In general, GSVM works in three steps. Step 1 is granulation: build a sequence of information granules from the original dataset or from the original feature space. Step 2 is modeling: train Support Vector Machines (SVM) in some of these information granules where necessary. Step 3 is aggregation: consolidate the information in these granules at a suitable level of abstraction. A good granulation method for finding suitable granules is crucial to modeling a good GSVM. Under this framework, several granulation algorithms are designed for binary classification problems with different characteristics: the GSVM-CMW (cumulative margin width) algorithm, the GSVM-AR (association rule mining) algorithm, a family of GSVM-RFE (recursive feature elimination) algorithms, the GSVM-DC (data cleaning) algorithm and the GSVM-RU (repetitive undersampling) algorithm. Empirical studies in the biomedical domain and many other application domains demonstrate that the framework is promising. As a preliminary step, this dissertation work will be extended in the future into a Granular Computing based Predictive Data Modeling framework (GrC-PDM) for building hybrid adaptive intelligent data mining systems for high-quality prediction.
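    The three GSVM steps described above can be sketched as follows. This is a minimal illustration only: k-means clustering stands in for the granulation step, since the dissertation's own granulation algorithms (GSVM-CMW, GSVM-RU, etc.) are not specified in the abstract, and nearest-granule routing stands in for the aggregation step.

    ```python
    # Hedged sketch of the GSVM granulate -> model -> aggregate pipeline,
    # with k-means as a stand-in granulation method (an assumption, not
    # the dissertation's algorithm).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def gsvm_fit_predict(X_train, y_train, X_test, n_granules=3, seed=0):
        # Step 1: granulation -- partition the training data into granules.
        km = KMeans(n_clusters=n_granules, n_init=10, random_state=seed).fit(X_train)
        granules = km.labels_

        # Step 2: modeling -- train one SVM per granule that contains both classes.
        models = {}
        for g in range(n_granules):
            Xg, yg = X_train[granules == g], y_train[granules == g]
            if len(np.unique(yg)) == 2:          # SVM needs both classes present
                models[g] = SVC(kernel="rbf").fit(Xg, yg)

        # Step 3: aggregation -- route each test point to its nearest granule's
        # model; fall back to the majority class for single-class granules.
        majority = int(np.bincount(y_train).argmax())
        assign = km.predict(X_test)
        return np.array([
            int(models[g].predict(x.reshape(1, -1))[0]) if g in models else majority
            for g, x in zip(assign, X_test)
        ])
    ```

    The point of the structure is that each local SVM sees only the data in its own granule, which can make training both faster and better matched to local structure than a single global SVM.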

    A rare event classification in the advanced manufacturing system: focused on imbalanced datasets

    In many industrial applications, classification tasks are associated with imbalanced class labels in training datasets. Imbalanced datasets can severely degrade the accuracy of class predictions, so they need to be handled by appropriate data processing before analysis, since most machine learning techniques assume that the input data is balanced. When this imbalance problem is combined with a high-dimensional feature space, feature extraction can be applied. In Chapter 2, we present two feature extraction techniques for time series data, CL-LNN and RD-LNN, based on the nearest neighbor combined with machine learning algorithms, to detect a failure of paper manufacturing machinery before it occurs from multi-stream system monitoring data. The nearest neighbor is applied to each feature separately instead of the whole set of 61 features, to address the curse of dimensionality. The skewness between class labels can also be addressed by either oversampling the minority class or downsampling the majority class. In Chapter 3, we seek a better way of downsampling by selecting the most informative samples in a given imbalanced dataset through an active learning strategy, to mitigate the effect of imbalanced class labels. The data selection for downsampling is performed by a criterion used in optimal experimental design, which minimizes the generalization error of the trained model in a sequential manner, with penalized logistic regression as the classification model. We also show that performance is significantly improved, especially on highly imbalanced datasets (e.g., imbalance ratio greater than ten), when hyper-parameter tuning and a cost-weight method are applied to the active downsampling technique. The research is further extended to cover nonlinearity using nonparametric logistic regression, and performance-based active learning (PBAL) is proposed to enhance performance compared to existing criteria such as D-optimality and A-optimality.
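    The design-based downsampling idea can be sketched as a greedy D-optimality loop: keep all minority samples, then sequentially add the majority sample that most increases the log-determinant of the logistic Fisher information. This is a generic variant under stated assumptions; the thesis's exact criterion, tuning and the PBAL method are not reproduced here.

    ```python
    # Hedged sketch of D-optimality-guided majority-class downsampling with
    # penalized (L2) logistic regression. The greedy log-det criterion is a
    # textbook stand-in for the chapter's actual selection rule.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def d_optimal_downsample(X, y, n_keep_majority, seed=0):
        rng = np.random.default_rng(seed)
        minority = np.flatnonzero(y == 1)
        majority = list(np.flatnonzero(y == 0))
        chosen = list(rng.choice(majority, size=2, replace=False))  # seed set
        for m in chosen:
            majority.remove(m)

        while len(chosen) < n_keep_majority:
            idx = np.concatenate([minority, chosen]).astype(int)
            clf = LogisticRegression(penalty="l2", C=1.0, max_iter=200)
            clf.fit(X[idx], y[idx])
            p = clf.predict_proba(X)[:, 1]
            w = p * (1 - p)                          # logistic IRLS weights
            M = (X[idx] * w[idx, None]).T @ X[idx]   # current Fisher information
            M += 1e-6 * np.eye(X.shape[1])           # ridge for stability
            # Greedily add the majority sample that most increases log det(M).
            gains = [np.linalg.slogdet(M + w[j] * np.outer(X[j], X[j]))[1]
                     for j in majority]
            chosen.append(majority.pop(int(np.argmax(gains))))
        return np.concatenate([minority, chosen]).astype(int)
    ```

    The returned index set contains every minority sample plus only the `n_keep_majority` most informative majority samples, so the classifier is refit on a balanced, information-rich subset rather than a random one.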

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In the early parts of this research, this thesis presents research by proposing a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to outweigh predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performances in line with experts' success criteria than the traditional imbalanced data resampling techniques. 
Finally, the use of the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework and produce new reliable classifiers; we also offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods were developed within the framework, namely the MMI imputation process, the MPR resampling approach and the XDistance metric. Finally, the newly developed predictive models help detect adverse health events, namely visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.
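    The hybrid resampling idea referred to above (combining minority oversampling with majority undersampling) can be sketched as follows. Since the abstract does not specify MPR's reconstruction procedure, plain random resampling toward the geometric mean of the two class sizes stands in for it here; the function and the meeting-in-the-middle target are illustrative assumptions.

    ```python
    # Hedged sketch of hybrid resampling: oversample the minority class and
    # undersample the majority class so both meet at the geometric mean of
    # their original sizes (an assumed, illustrative target).
    import numpy as np

    def hybrid_resample(y, seed=0):
        """Return row indices giving a 1:1 class balance."""
        rng = np.random.default_rng(seed)
        mino = np.flatnonzero(y == 1)
        majo = np.flatnonzero(y == 0)
        target = int(np.sqrt(len(mino) * len(majo)))          # meet in the middle
        up = rng.choice(mino, size=target, replace=True)      # oversample minority
        down = rng.choice(majo, size=target, replace=False)   # undersample majority
        return np.concatenate([up, down])
    ```

    Meeting at an intermediate size avoids the two failure modes of pure strategies: discarding most of the majority class, or duplicating the minority class so heavily that the model overfits its few distinct samples.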