
    Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification

    We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. Comment: 30 pages, 2 figures
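    As a rough illustration of the FANS recipe (not the authors' implementation), the sketch below estimates each feature's marginal class-conditional densities with kernel density estimates, replaces the feature by its estimated log density ratio, and then fits an L1-penalized logistic regression on the augmented features; the bandwidth and penalty strength are placeholder values.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.linear_model import LogisticRegression

def fans_transform(X, kdes):
    """Replace each feature by its estimated marginal log density ratio."""
    cols = [kde1.score_samples(X[:, j:j + 1]) - kde0.score_samples(X[:, j:j + 1])
            for j, (kde0, kde1) in enumerate(kdes)]
    return np.column_stack(cols)

def fans_fit(X, y, bandwidth=0.5, C=1.0):
    """Fit per-feature density-ratio transforms and a sparse logistic model."""
    kdes = []
    for j in range(X.shape[1]):
        kde0 = KernelDensity(bandwidth=bandwidth).fit(X[y == 0, j:j + 1])
        kde1 = KernelDensity(bandwidth=bandwidth).fit(X[y == 1, j:j + 1])
        kdes.append((kde0, kde1))
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(fans_transform(X, kdes), y)
    return kdes, clf

# Prediction on new data: clf.predict(fans_transform(X_new, kdes))
```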

    Submodularity in Batch Active Learning and Survey Problems on Gaussian Random Fields

    Many real-world datasets can be represented in the form of a graph whose edge weights designate similarities between instances. A discrete Gaussian random field (GRF) model is a finite-dimensional Gaussian process (GP) whose prior covariance is the inverse of a graph Laplacian. Minimizing the trace of the predictive covariance Sigma (V-optimality) on GRFs has proven successful in batch active learning classification problems with budget constraints. However, its worst-case bound has been missing. We show that the V-optimality on GRFs as a function of the batch query set is submodular, and hence its greedy selection algorithm guarantees a (1 - 1/e) approximation ratio. Moreover, GRF models satisfy the absence-of-suppressor (AofS) condition. For active survey problems, we propose a similar survey criterion which minimizes 1'(Sigma)1. In practice, the V-optimality criterion performs better than GPs with mutual information gain criteria and allows nonuniform costs for different nodes.
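    A hypothetical sketch of what greedy V-optimal selection on a GRF could look like: the prior covariance is taken as a slightly regularized inverse graph Laplacian, and each greedy step picks the node whose observation most reduces the trace of the predictive covariance. The ridge term, noise-free conditioning, and function names are assumptions rather than the paper's code; the commented line shows how the survey criterion 1'(Sigma)1 would change the score.

```python
import numpy as np

def greedy_v_optimal(L, budget, ridge=1e-6):
    """Greedy batch selection under the V-optimality (trace) criterion."""
    n = L.shape[0]
    Sigma = np.linalg.inv(L + ridge * np.eye(n))   # GRF prior covariance (regularized inverse Laplacian)
    selected = np.zeros(n, dtype=bool)
    batch = []
    for _ in range(budget):
        # Trace reduction from conditioning on node v: sum_j Sigma[j, v]^2 / Sigma[v, v].
        gains = (Sigma ** 2).sum(axis=0) / np.diag(Sigma)
        # Survey variant (minimizing 1' Sigma 1) would instead use:
        # gains = Sigma.sum(axis=0) ** 2 / np.diag(Sigma)
        gains[selected] = -np.inf                  # never re-pick a queried node
        v = int(np.argmax(gains))
        batch.append(v)
        selected[v] = True
        # Rank-one update of the predictive covariance after observing node v.
        Sigma = Sigma - np.outer(Sigma[:, v], Sigma[v, :]) / Sigma[v, v]
    return batch
```

    Submodularity of this trace reduction is what gives the greedy batch its (1 - 1/e) guarantee relative to the best batch of the same size.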

    Unsupervised Stream-Weights Computation in Classification and Recognition Tasks

    In this paper, we provide theoretical results on the problem of optimal stream weight selection for the multi-stream classification problem. It is shown that, in the presence of estimation or modeling errors, using stream weights can decrease the total classification error. Stream weight estimates are computed for various conditions. Then we turn our attention to the problem of unsupervised stream weights computation. Based on the theoretical results, we propose to use models and “anti-models” (class-specific background models) to estimate stream weights. A non-linear function of the ratio of the inter- to intra-class distance is used for stream weight estimation. The proposed unsupervised stream weight estimation algorithm is evaluated on both artificial data and on the problem of audio-visual speech classification. Finally, the proposed algorithm is extended to the problem of audio-visual speech recognition. It is shown that the proposed algorithms achieve results comparable to the supervised minimum-error training approach under most testing conditions.
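    The exact nonlinearity is not specified in the abstract, so the snippet below is only a loose sketch of the idea: per stream, compare an inter-class distance (between the class models and anti-models) with an intra-class spread, squash the ratio with a sigmoid (an assumed choice), and normalize the weights across streams.

```python
import numpy as np

def stream_weights(inter_dist, intra_dist, alpha=1.0):
    """inter_dist, intra_dist: per-stream distance estimates (arrays of equal length)."""
    ratio = np.asarray(inter_dist) / np.maximum(np.asarray(intra_dist), 1e-12)
    raw = 1.0 / (1.0 + np.exp(-alpha * (ratio - 1.0)))   # assumed nonlinear mapping of the ratio
    return raw / raw.sum()                               # weights sum to 1 across streams

# e.g. stream_weights([2.0, 0.8], [1.0, 1.0]) favours the stream whose classes are better separated.
```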

    Simulation of External-Internal and Through Trips for Small Urban Areas

    The objective of this study was to develop models which would simulate internal-external trips and external-external (through) trips. Regression analysis and cross-classification of data were tested in an attempt to predict the number of internal-external trips and the percentage of through trips. Regression analysis was used in the development of a through-trip distribution model. Grouping data for analysis created some problems; however, trial-and-error evaluation enabled selection of variables which produced reasonable results. Variables found to be most significant in the development of internal-external models were population and employment. For through-trip models, variables used were population, functional classification, AADT at the external station, and percent trucks. In developing through-trip distribution models, variables of significance were AADT at the destination station, percent trucks at the destination station, percent through trips at the destination station, and the ratio of destination AADT to total AADTs at all stations (value squared). Overall, the models developed in this study appear to be adequate for planning purposes when ease of application and accuracy of the models are considered.
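    Purely as an illustration of the kind of regression described (not the study's fitted model or coefficients), one could assemble the named predictors, including the squared ratio of destination AADT to the total AADT over all external stations, and fit a linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_through_trip_distribution(aadt_dest, pct_trucks, pct_through, share_observed):
    """All arguments are arrays over the destination (external) stations."""
    aadt_dest = np.asarray(aadt_dest, dtype=float)
    aadt_ratio_sq = (aadt_dest / aadt_dest.sum()) ** 2   # squared share of total external-station AADT
    X = np.column_stack([aadt_dest, pct_trucks, pct_through, aadt_ratio_sq])
    return LinearRegression().fit(X, share_observed)
```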

    Simulation of Travel Patterns for Small Urban Areas

    The objective of this study was to develop models which would simulate internal-external trips and external-external (through) trips. Regression analysis and cross-classification of data were tested in an attempt to predict the number of internal-external trips and the percentage of through trips. Regression analysis was used in the development of a through-trip distribution model. Grouping data for analysis created some problems; however, trial-and-error evaluation enabled selection of variables which produced reasonable results. Variables found to be most significant in the development of internal-external models were population and employment. For through-trip models, variables used were population, functional classification, AADT at the external station, and percent trucks. In developing through-trip distribution models, variables of significance were AADT at the destination station, percent trucks at the destination station, percent through trips at the destination station, and the ratio of destination AADT to total AADTs at all stations (value squared). Overall, the models developed in this study appear to be adequate for planning purposes when ease of application and accuracy of the models are considered.

    Risk Assessment for Venous Thromboembolism in Chemotherapy-Treated Ambulatory Cancer Patients: A Machine Learning Approach

    OBJECTIVE: To design a precision medicine approach aimed at exploiting significant patterns in data, in order to produce venous thromboembolism (VTE) risk predictors for cancer outpatients that might be of advantage over the currently recommended model (Khorana score). DESIGN: Multiple kernel learning (MKL) based on support vector machines and random optimization (RO) models were used to produce VTE risk predictors (referred to as machine learning [ML]-RO) yielding the best classification performance over a training (3-fold cross-validation) and testing set. RESULTS: Attributes of the patient data set (n = 1179) were clustered into 9 groups according to clinical significance. Our analysis produced 6 ML-RO models in the training set, which yielded better likelihood ratios (LRs) than baseline models. Of interest, the most significant LRs were observed in 2 ML-RO approaches not including the Khorana score (ML-RO-2: positive likelihood ratio [+LR] = 1.68, negative likelihood ratio [-LR] = 0.24; ML-RO-3: +LR = 1.64, -LR = 0.37). The enhanced performance of the ML-RO approaches over the Khorana score was further confirmed by analysis of the area under the precision-recall curve (AUCPR), which was higher for the ML-RO approaches (best performances: ML-RO-2: AUCPR = 0.212; ML-RO-3-K: AUCPR = 0.146) than for the Khorana score (AUCPR = 0.096). Of interest, the best-fitting model was ML-RO-2, in which blood lipids and body mass index/performance status retained the strongest weights, with a weaker association with tumor site/stage and drugs. CONCLUSIONS: Although the monocentric validation of the presented predictors might represent a limitation, these results demonstrate that a model based on MKL and RO may represent a novel methodological approach to derive VTE risk classifiers. Moreover, this study highlights the advantages of optimizing the relative importance of groups of clinical attributes in the selection of VTE risk predictors.
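    For readers unfamiliar with the reported metrics, here is a small sketch of how the likelihood ratios and AUCPR quoted above are typically computed; this covers only the evaluation side, not the MKL/RO training itself, and the function names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, average_precision_score

def likelihood_ratios(y_true, y_pred):
    """Positive and negative likelihood ratios from binary predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    pos_lr = sensitivity / (1 - specificity)   # +LR: how much a positive call raises the odds of VTE
    neg_lr = (1 - sensitivity) / specificity   # -LR: how much a negative call lowers them
    return pos_lr, neg_lr

def aucpr(y_true, risk_scores):
    # Average precision is a standard estimator of the area under the precision-recall curve.
    return average_precision_score(y_true, risk_scores)
```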

    Improvement of EEG based brain computer interface by application of tripolar electrodes and independent component analysis

    For persons with severe disabilities, a brain computer interface (BCI) may be a viable means of communication, with scalp-recorded electroencephalogram (EEG) being the most common signal employed in the operation of a BCI. Various electrode configurations can be used for EEG recording, one of which is a set of concentric rings referred to as a Laplacian electrode. It has been shown that Laplacian EEG can improve classification in EEG recognition, but the complete advantages of this configuration have not been established. This project included two parts. First, a modeling study was performed using independent component analysis (ICA) to show that tripolar electrodes could provide better EEG signals for BCI. Next, human experiments were performed to study the application of tripolar electrodes in a BCI model and to show that the application of tripolar electrodes and data-segment-related parameter selection can improve the EEG classification ratio for BCI. In the first part of the work, an improved four-layer anisotropic concentric spherical head computer model was programmed; then four configurations of time-varying dipole signals were used to generate the scalp surface signals that would be obtained with tripolar and disc electrodes. Four important EEG artifacts were tested: eye blinking, cheek movements, jaw movements, and talking. Finally, a fast fixed-point algorithm was used for ICA of the signals. The results showed that signals from tripolar electrodes yielded better ICA separation than signals from disc electrodes, suggesting that tripolar electrodes could provide better EEG signals for BCI. The human experiments were divided into three parts: improvement of the data acquisition system by application of tripolar concentric electrodes and the related circuitry; development of a pre-feature selection algorithm to improve BCI EEG signal classification; and an autoregressive (AR) model and Mahalanobis distance-based linear classifier for BCI classification. In this work, tripolar electrodes and the corresponding data acquisition system were developed. Two sets of left/right hand motor imagery EEG signals were acquired. Then the effectiveness of signals from tripolar concentric electrodes and disc electrodes was compared for use as a BCI. The pre-feature selection methods were developed and applied to four data-segment-related parameters: the length of the data segment in each trial (LDS), its starting position (SPD), the number of trials (NT), and the AR model order (AR Order). The study showed that, compared to the classification ratio (CR) without parameter selection, the CR increased significantly, by 20% to 30%, with proper selection of these data-segment-related parameter values, and that the optimum parameter values were subject-dependent, which suggests that the data-segment-related parameters should be individualized when building models for BCI. The experiments also showed that tripolar concentric electrodes generated significantly higher classification accuracy than disc electrodes.
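    A hypothetical sketch of the classification stage described above, with AR coefficients (estimated here via the Yule-Walker equations) as per-segment features and a Mahalanobis-distance classifier built on class means and a pooled covariance; the AR order and all names are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def ar_features(segment, order=6):
    """Yule-Walker AR coefficients of a 1-D EEG data segment."""
    x = segment - segment.mean()
    acov = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)
    R = np.array([[acov[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, acov[1:order + 1])

class MahalanobisClassifier:
    """Linear classifier assigning each sample to the class with the smallest Mahalanobis distance."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                     for c in self.classes_) / (len(y) - len(self.classes_))
        self.precision_ = np.linalg.inv(pooled)
        return self

    def predict(self, X):
        d = np.array([[(x - self.means_[c]) @ self.precision_ @ (x - self.means_[c])
                       for c in self.classes_] for x in X])
        return self.classes_[d.argmin(axis=1)]
```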

    Enhancing Accuracy Of Credit Scoring Classification With Imbalance Data Using Synthetic Minority Oversampling Technique-Support Vector Machine (SMOTE-SVM) Model

    Credit is a business model that can provide significant growth. With the growth of new credit applicants and financial markets, the possibility of credit problems also becomes higher. Thus, it becomes important for a financial institution to conduct a preliminary selection of credit applicants, and credit scoring is one of the models used by financial institutions to perform this preliminary selection of potential customers. One of the most common techniques used to develop a credit scoring model is the data mining classification task. However, this technique has difficulty classifying data with an imbalanced distribution, because the imbalance may lead the classifier to assign all of the data to the majority class and to perform poorly on the minority class. In the case of credit scoring, credit data also have an imbalanced distribution. Therefore, classifying credit data with an imbalanced distribution using an inappropriate technique may lead the classification to give a financial institution a wrong decision. In this study, several methods for handling the imbalanced data problem are identified. Moreover, an improvement of the credit scoring model for imbalanced data in a financial institution using the SMOTE-SVM model is also proposed. The study is conducted in several phases: data collection, data pre-processing, feature selection, classification, validation, and evaluation. The experiments with the SMOTE-SVM model are conducted by considering different data ratios and numbers of nearest neighbours used in SMOTE. The experimental results show that accuracy and overall performance improve as the data are balanced using the SMOTE-SVM model. Performance measurement using 10-fold cross-validation and a confusion matrix shows that the SMOTE-SVM model can correctly classify most of the data in each class, with good accuracy, class precision, and class recall. Based on these results, the SMOTE-SVM model is believed to be effective in handling imbalanced data for credit scoring classification.
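    A minimal sketch of such a pipeline using the imbalanced-learn and scikit-learn libraries; the kernel, SMOTE sampling ratio, and neighbour count below are placeholders to be tuned, as the study does, over different ratios and neighbour settings. Putting SMOTE inside the pipeline keeps the oversampling restricted to the training folds of the cross-validation.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Oversample the minority class with SMOTE, then train an RBF-kernel SVM.
pipeline = Pipeline([
    ("smote", SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=0)),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])

def evaluate(X, y):
    """Accuracy under stratified 10-fold cross-validation."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
```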

    Validation of probabilistic classifiers

    Non-parametric probabilistic classification models are increasingly being investigated as an alternative to Discrete Choice Models (DCMs), e.g. for predicting mode choice. There exist many strategies within the literature for model selection between DCMs, either through the testing of a null hypothesis, e.g. likelihood ratio, Wald, and Lagrange Multiplier tests, or through the comparison of information criteria, e.g. the Bayesian and Akaike information criteria. However, these tests are only valid for parametric models and cannot be applied to non-parametric classifiers. Typically, the performance of Machine Learning classifiers is validated by computing a performance metric on out-of-sample test data, either through cross-validation or hold-out testing. Whilst bootstrapping can be used to investigate whether differences between test scores are stable under resampling, there are few studies within the literature investigating whether these differences are significant for non-parametric models. To address this, in this paper we introduce three statistical tests which can be applied to both parametric and non-parametric probabilistic classification models. The first test considers the analytical distribution of the expected likelihood of a model given the true model. The second test uses similar analysis to determine the distribution of the Kullback-Leibler divergence between two models. The final test considers the convex combination of two classifiers under comparison. These tests allow ML classifiers to be compared directly, including with DCMs.
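    The paper's contribution lies in the analytical sampling distributions of these quantities; the sketch below only computes the empirical point estimates they build on, namely the mean held-out log-likelihood of a probabilistic classifier and the empirical Kullback-Leibler divergence between two classifiers' predictive distributions. The (n_samples, n_classes) probability-array shapes are assumptions.

```python
import numpy as np

def mean_log_likelihood(proba, y_true, eps=1e-12):
    """Average log probability the model assigns to the observed classes."""
    return np.mean(np.log(proba[np.arange(len(y_true)), y_true] + eps))

def empirical_kl(proba_a, proba_b, eps=1e-12):
    """Mean KL(a || b) between two models' predictive distributions on the same samples."""
    pa = np.clip(proba_a, eps, 1.0)
    pb = np.clip(proba_b, eps, 1.0)
    return np.mean(np.sum(pa * (np.log(pa) - np.log(pb)), axis=1))
```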