
    Automatic classification of respiratory patterns involving missing data imputation techniques

    A comparative study of the respiratory pattern classification task, involving five missing data imputation techniques and several machine learning algorithms, is presented in this paper. The main goal was to find the classifier that achieves the best accuracy using an imputation method more scalable than the one used in the authors' previous work. The results show that, in general, the Self-Organising Map imputation method allows non-tree-based classifiers to improve over the other imputation methods in classification accuracy, and that the feedforward neural network and Random Forest classifiers offer the best performance regardless of the imputation method used. The accuracy improvements over the authors' previous work are limited, but the feedforward neural network model achieves promising results. Funding: Ministerio de Economía y Competitividad (TIN 2013-40686-P); Xunta de Galicia (GRC2014/35).
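    A minimal sketch of the kind of imputer-versus-classifier comparison the paper describes, written with scikit-learn. Self-Organising Map imputation is not available there, so mean and KNN imputers stand in for it, and the synthetic dataset, missingness rate, and model parameters are all placeholder assumptions rather than the paper's setup.

```python
# Hedged sketch (assumed setup, not the paper's): compare imputers x classifiers
# on synthetic data with injected missingness. SOM imputation is not in
# scikit-learn, so mean and KNN imputers stand in for it here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values

imputers = {"mean": SimpleImputer(strategy="mean"),
            "knn": KNNImputer(n_neighbors=5)}
classifiers = {"feedforward_nn": MLPClassifier(hidden_layer_sizes=(32,),
                                               max_iter=2000, random_state=0),
               "random_forest": RandomForestClassifier(random_state=0)}

for imp_name, imputer in imputers.items():
    for clf_name, clf in classifiers.items():
        pipe = make_pipeline(imputer, StandardScaler(), clf)
        acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
        print(f"{imp_name} + {clf_name}: accuracy = {acc:.3f}")
```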

    Systematic Review on Missing Data Imputation Techniques with Machine Learning Algorithms for Healthcare

    Missing data is one of the most common issues encountered in the data-cleaning process, especially when dealing with medical datasets. A real collected dataset is prone to be incomplete, inconsistent, noisy, and redundant due to reasons such as human error, instrument failure, and adverse events. Therefore, to deal accurately with incomplete data, sophisticated algorithms are proposed to impute the missing values. Many machine learning algorithms have been applied to impute missing data with plausible values. Among them, the KNN algorithm has been widely adopted for missing data imputation due to its robustness and simplicity, and it frequently outperforms other machine learning methods. This paper provides a comprehensive review of the different imputation techniques used to replace missing data. The goal of the review is to bring attention to potential improvements to existing methods and to provide readers with a better grasp of trends in imputation techniques.
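    A small illustration of the KNN imputation the review focuses on, using scikit-learn's KNNImputer; the toy matrix is a stand-in for an incomplete medical dataset.

```python
# Minimal KNN-imputation example with scikit-learn's KNNImputer; the toy
# matrix stands in for an incomplete medical dataset.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the distance-weighted mean of that
# feature over the k nearest rows, measured on the observed coordinates.
imputer = KNNImputer(n_neighbors=2, weights="distance")
print(imputer.fit_transform(X))
```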

    Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

    This thesis addresses three major issues in data mining: feature subset selection in high-dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting of univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms, and the high computational efficiency of Generalized Regression Neural Networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble: a committee of GRNNs trained on different subsets of features generated by SAGA, whose predictions are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features that make it stand out among ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNNs are used both as base classifiers and as the top-level combiner. Because of the GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches, which rely on simple voting or static weighting. The basic idea of the dynamic weighting procedure is to give higher reliability weights to training scenarios that are similar to the new one. Simulation results demonstrate the validity of the proposed ensemble model.
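    For context, a GRNN reduces to Nadaraya-Watson kernel regression: each prediction is a distance-weighted average of the stored training targets, which is exactly why the ensemble above acts as a dynamic weighting scheme. Below is a hedged NumPy sketch of a standalone GRNN; the bandwidth value and the toy regression task are assumptions, not the thesis's configuration.

```python
# Hedged NumPy sketch of a standalone GRNN (Nadaraya-Watson kernel regression);
# sigma and the toy task are assumptions, not the thesis's configuration.
import numpy as np

class GRNN:
    def __init__(self, sigma=0.3):
        self.sigma = sigma  # kernel bandwidth, the model's only free parameter

    def fit(self, X, y):
        # "Training" just stores the patterns and their targets.
        self.X_, self.y_ = np.asarray(X, float), np.asarray(y, float)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        # Squared Euclidean distances between queries and stored patterns.
        d2 = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-d2 / (2 * self.sigma ** 2))  # pattern-layer activations
        return (w @ self.y_) / w.sum(axis=1)     # distance-weighted target mean

# Toy usage: noisy sine regression.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
model = GRNN(sigma=0.3).fit(X, y)
print(model.predict(np.array([[1.0], [3.0]])))  # ~ sin(1), sin(3)
```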

    SQL Injection Vulnerability Detection Using Deep Learning: A Feature-based Approach

    SQL injection (SQLi), a well-known exploitation technique, is a serious risk factor for the database-driven web applications used to manage the core business functions of organizations. SQLi enables an unauthorized user to gain access to sensitive information in the database and, subsequently, to the application's administrative privileges. The detection of SQLi is therefore crucial for businesses to prevent financial losses. Various rule-based and learning-based detection solutions exist, and pattern recognition with support vector machines (SVMs) and random forests (RF) has recently become popular for detecting SQLi; however, these classifiers reach only 97.33% accuracy on our dataset. In this paper, we propose a deep learning-based solution for detecting SQLi in web applications. The solution employs both correlation and chi-squared methods to rank the features of the dataset, and a feed-forward network is applied not only in feature selection but also in the detection process. Our solution achieves 98.04% accuracy on a dataset of 1,850+ records, demonstrating its efficiency relative to other existing machine learning solutions.
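    An illustrative sketch of the pipeline shape described, with chi-squared feature ranking feeding a feed-forward classifier, written with scikit-learn; the synthetic data, layer sizes, and number of selected features are placeholders for the paper's SQLi feature dataset and architecture.

```python
# Illustrative sketch (assumed data and architecture): chi-squared feature
# ranking feeding a feed-forward classifier, as in the pipeline described.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=1850, n_features=30,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = make_pipeline(
    MinMaxScaler(),           # chi2 requires non-negative feature values
    SelectKBest(chi2, k=10),  # keep the 10 highest-ranked features
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
pipe.fit(X_train, y_train)
print(f"held-out accuracy: {pipe.score(X_test, y_test):.4f}")
```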

    A Novel computer assisted genomic test method to detect breast cancer in reduced cost and time using ensemble technique

    Breast cancer is the leading cause of death among women around the world. It is a primary malignancy for which genetic markers have proved useful for clinical decision-making. It is a genetic disease that arises from gene mutations, but the cost of a genetic test is relatively high for many patients in developing nations such as India, and the results can take several weeks to determine cancer. This delay affects prognosis, since some patients suffer from a high rate of malignant cell proliferation. Therefore, a computer-assisted genetic test method (CAGT) is proposed to detect breast cancer. The method predicts gene expressions, converts these expressions into mutation states (under-expression (-1), transition (0), over-expression (1)), and then performs classification into benign and malignant classes in reduced time and at reduced cost. In this work, machine learning techniques are applied to identify the genes most responsive to breast cancer on the basis of a patient's clinical report, and a CAGT is generated. A hard-voting ensemble is applied to detect breast cancer from the most responsive genes identified by CAGT, improving cancer classification accuracy by 3.5%.
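    A hedged sketch of the two CAGT steps named above: thresholding expression values into mutation states {-1, 0, 1}, then classifying the discretized features with a hard-voting ensemble. The thresholds, base models, and synthetic data are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of the two CAGT steps named above; thresholds, base models,
# and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def to_mutation_state(expr, low=-1.0, high=1.0):
    """Map expression values to under-expression (-1), transition (0),
    or over-expression (1) using assumed thresholds."""
    return np.where(expr < low, -1, np.where(expr > high, 1, 0))

X, y = make_classification(n_samples=400, n_features=25, random_state=0)
X_states = to_mutation_state(X)  # discretized "mutation state" features

# Hard voting: the ensemble outputs the majority class label of its members.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC())],
    voting="hard",
)
print(cross_val_score(ensemble, X_states, y, cv=5).mean())
```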

    Missing Data Imputation Using Machine Learning and Natural Language Processing for Clinical Diagnostic Codes

    Imputation of missing data is a common task in supervised classification problems, where the feature matrix of the training dataset has various degrees of missingness. Most former studies do not take into account the presence of the class label in classification problems with missing data. A widely used solution is missing data imputation based on the lazy learning technique k-Nearest Neighbor (KNN). We work on a variant of this imputation algorithm using Gray's distance and Mutual Information (MI), called the Class-weighted Gray's k-Nearest Neighbor (CGKNN) approach. Gray's distance works well with heterogeneous mixed-type data with missing instances, and we weight the distance with mutual information (MI), a measure of feature relevance, between the features and the class label. This method performs better than traditional methods for classification problems with mixed data, as shown in simulations and in applications to University of California, Irvine (UCI) Machine Learning datasets (http://archive.ics.uci.edu/ml/index.php).

    Data lost to follow-up is a common problem in longitudinal studies, especially those involving multiple visits over a long period of time. If the outcome of interest is present at each time point despite covariates missing due to loss to follow-up (for example, an outcome ascertained through phone calls), then random forest imputation is a good technique for the missing covariates. The missingness involves complicated interactions over time, since most of the covariates and the outcome have repeated measurements. Random forests are a non-parametric learning technique that captures complex interactions between mixed-type data. We propose a proximity imputation and a missForest-type covariate imputation with random splits while building the forest, and compare the performance of these imputation techniques to existing ones in various simulation settings.

    The Atherosclerosis Risk in Communities (ARIC) Study is a longitudinal cohort study started in 1987-1989 to collect data on participants across 4 states in the USA, aimed at studying the factors behind heart disease. We consider patients at the 5th visit (which occurred in 2013) who were enrolled in continuous Medicare Fee-For-Service (FFS) insurance in the 6 months prior to their visit, so that their hospitalization diagnostic (ICD) codes are available. Our aim is to characterize the hospitalization of patients with cognitive status ascertained at the 5th visit (classified as dementia, mild cognitive disorder, or no cognitive disorder). Diagnostic codes for inpatient and outpatient visits, identified from CMS (Centers for Medicare & Medicaid Services) Medicare FFS data linked with ARIC participant data, are stored as International Classification of Diseases and related health problems (ICD) codes. We treat these codes as a bag-of-words model and apply text-mining techniques to obtain meaningful clusters of ICD codes.
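    A rough sketch of the idea behind CGKNN: weight each feature's contribution to the neighbour distance by its mutual information with the class label, so class-relevant features dominate the imputation. Gray's distance for mixed-type data is simplified here to a weighted Euclidean distance on numeric features, so this approximates, rather than reproduces, the dissertation's algorithm.

```python
# Rough sketch of the CGKNN idea: weight each feature's contribution to the
# neighbour distance by its mutual information with the class label. Gray's
# distance for mixed-type data is simplified to weighted Euclidean distance
# on numeric features, so this approximates, rather than reproduces, CGKNN.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_weighted_knn_impute(X, y, k=3):
    X = np.asarray(X, float).copy()
    complete = ~np.isnan(X).any(axis=1)  # donor rows with no missingness
    mi = mutual_info_classif(X[complete], y[complete], random_state=0)
    w = mi / mi.sum() if mi.sum() > 0 else np.full(len(mi), 1 / len(mi))
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        # MI-weighted distance computed over the observed coordinates only.
        d = np.sqrt((w[~miss] * (X[complete][:, ~miss] - X[i, ~miss]) ** 2)
                    .sum(axis=1))
        donors = X[complete][np.argsort(d)[:k]]  # k nearest complete rows
        X[i, miss] = donors[:, miss].mean(axis=0)
    return X

# Toy usage with ~10% missingness injected into numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan
print(mi_weighted_knn_impute(X, y)[:3])
```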

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, available datasets are incomplete, noisy, or affected by artifacts. In supervised scenarios, label information may be of low quality, which includes unbalanced training sets, noisy labels, and other problems. Moreover, in practice it is very common that the available data samples are not enough to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas for solving this challenging problem and to provide clear examples of application in real scenarios.

    Interpretable Machine Learning Model for Clinical Decision Making

    Despite machine learning models being increasingly used in medical decision-making and meeting classification accuracy standards, they remain untrusted black boxes because decision-makers lack insight into their complex logic. It is therefore necessary to develop interpretable machine learning models that will engender trust in the knowledge they generate and contribute to clinical decision-makers' intention to adopt them in the field. The goal of this dissertation was to systematically investigate the applicability of interpretable model-agnostic methods to explain the predictions of black-box machine learning models for medical decision-making. As a proof of concept, this study addressed the problem of predicting the risk of emergency readmission within 30 days of discharge for heart failure patients. Using a benchmark dataset, supervised classification models of differing complexity were trained to perform the prediction task. More specifically, Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), and Gradient Boosting Machine (GBM) models were constructed using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD). Precision, recall, and area under the ROC curve were used to measure each model's predictive accuracy. Local Interpretable Model-Agnostic Explanations (LIME) was used to generate explanations from the underlying trained models, and the explanations were empirically evaluated using explanation stability and local fit (R²). The results demonstrated that local explanations generated by LIME produced better estimates for Decision Tree (DT) classifiers.
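    A hedged sketch of generating a LIME explanation for a single prediction, in the spirit of the evaluation above; the decision tree, synthetic features, and class names are placeholders rather than the HCUP NRD setup. It uses the lime package's LimeTabularExplainer.

```python
# Hedged sketch of generating one LIME explanation; the decision tree and
# synthetic features are placeholders for the HCUP NRD readmission models.
# Requires the `lime` package (pip install lime).
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X,
    feature_names=[f"f{i}" for i in range(X.shape[1])],
    class_names=["not readmitted", "readmitted"],  # placeholder labels
    mode="classification",
)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())  # (feature, weight) pairs from the local surrogate model
print(exp.score)      # local fit (R^2) of the surrogate, as evaluated above
```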