3 research outputs found

    The best texture image for Gaussian Naïve Bayes with nearest neighbor interpolation

    One of the factors affecting the performance of the Gaussian naïve Bayes classifier (GNBC) in texture image classification is the image size (dimensions). Alongside pixel values, image size is an important criterion for texture images. This study proposes a method to obtain the best texture image size for GNBC through nearest neighbor (NN) interpolation optimization. At the best image size, the interpolated pixel values allow GNBC to distinguish the texture images of each class with the highest performance. The proposed method first generates candidate training image sizes as combinations of row and column dimensions in the optimization process. Each original texture image is then resized to a candidate size using NN interpolation, a GNBC is built on the resized images, and its classification accuracy is determined. Finally, the best texture image size is selected using the largest classification accuracy as the first criterion and the image size as the second criterion. The method was evaluated on texture image data from the CVonline public dataset across several test scenarios and interpolation methods. The results show that in scenarios involving five classes of texture images, GNBC with NN interpolation achieves its smallest classification accuracy of 89% and its largest of 100% at best image sizes of 14 × 32 and 47 × 42, respectively. In scenarios ranging from small to large numbers of classes, GNBC with NN interpolation yields classification accuracy of 81.6%–95%. From these results, GNBC with NN optimization outperforms other nonadaptive interpolation methods (bilinear, bicubic, and Lanczos) and principal component analysis (PCA).
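
    As a rough illustration of the size-search procedure described in the abstract, the sketch below resizes grayscale texture images with nearest-neighbor interpolation, trains a Gaussian naïve Bayes classifier on the flattened pixels, and keeps the candidate size with the highest cross-validated accuracy (preferring the smaller size on ties). This is not the authors' code; the helper names, the use of Pillow and scikit-learn, and the 5-fold cross-validation setup are assumptions.

    # Minimal sketch of best-image-size search for GNBC with NN interpolation.
    # Assumes grayscale uint8 images and at least 5 samples per class for CV.
    import numpy as np
    from PIL import Image
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def resize_nn(img_array, rows, cols):
        """Resize one grayscale texture image with nearest-neighbor interpolation."""
        img = Image.fromarray(img_array)
        return np.asarray(img.resize((cols, rows), resample=Image.NEAREST))

    def best_size_for_gnbc(images, labels, row_range, col_range):
        """Grid-search candidate (rows, cols) sizes; keep the size with the highest
        cross-validated GNB accuracy, preferring fewer pixels on ties."""
        best_acc, best_size = 0.0, None
        for r in row_range:
            for c in col_range:
                X = np.stack([resize_nn(im, r, c).ravel() for im in images])
                acc = cross_val_score(GaussianNB(), X, labels, cv=5).mean()
                # first criterion: accuracy; second criterion: smaller image
                if acc > best_acc or (acc == best_acc and best_size is not None
                                      and r * c < best_size[0] * best_size[1]):
                    best_acc, best_size = acc, (r, c)
        return best_acc, best_size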

    Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

    In many pattern recognition/classification problems, the true class-conditional model and class probabilities are approximated to reduce complexity and/or to ease statistical estimation. The approximated classifier is expected to perform worse, here measured by the probability of correct classification. We present an analysis that is valid in general, together with easily computable formulas for estimating the degradation in probability of correct classification relative to the optimal classifier. The Naïve Bayes classifier is one example of such an approximation. We show that the performance of Naïve Bayes depends on the degree of functional dependence between the features and labels, and we provide a sufficient condition for zero loss of performance.
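
    The toy calculation below (not the paper's bounds) illustrates the kind of degradation being quantified: it compares the probability of correct classification (PCC) of the optimal classifier, which uses an invented true joint distribution of two binary features per class, with a Naïve Bayes approximation that multiplies the feature marginals. Because the features are strongly dependent within each class, the approximation loses accuracy; all numbers in the tables are made up for illustration.

    # Compare PCC of the optimal (joint-based) classifier vs. Naive Bayes
    # on a small invented two-feature, two-class example.
    import numpy as np
    from itertools import product

    priors = {0: 0.5, 1: 0.5}
    # True class-conditional joints over two binary features (rows: x1, cols: x2).
    joint = {0: np.array([[0.40, 0.10],
                          [0.10, 0.40]]),   # features strongly dependent given y=0
             1: np.array([[0.10, 0.40],
                          [0.40, 0.10]])}   # and given y=1

    def marginals(p):
        return p.sum(axis=1), p.sum(axis=0)      # P(x1|y), P(x2|y)

    def pcc(score):
        """PCC = probability mass of the points the scorer classifies correctly."""
        total = 0.0
        for x1, x2 in product((0, 1), repeat=2):
            true_mass = {y: priors[y] * joint[y][x1, x2] for y in (0, 1)}
            y_hat = max((0, 1), key=lambda y: score(y, x1, x2))
            total += true_mass[y_hat]
        return total

    optimal = pcc(lambda y, x1, x2: priors[y] * joint[y][x1, x2])
    m = {y: marginals(joint[y]) for y in (0, 1)}
    naive = pcc(lambda y, x1, x2: priors[y] * m[y][0][x1] * m[y][1][x2])
    print(f"PCC optimal: {optimal:.3f}, PCC Naive Bayes: {naive:.3f}, "
          f"loss: {optimal - naive:.3f}")   # here 0.800 vs 0.500, loss 0.300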

    Variable selection for classification in complex ophthalmic data: a multivariate statistical framework

    Variable selection is an essential part of the process of model-building for classification or prediction. Some of the challenges of variable selection are heterogeneous variance-covariance matrices, differing scales of variables, non-normally distributed data and missing data. Statistical methods exist for variable selection; however, these are often univariate, make restrictive assumptions about the distribution of the data, or are computationally expensive. In this thesis I focus on filter methods of variable selection that are computationally fast and propose a metric of discrimination. The main objectives of this thesis are (1) to propose a novel Signal-to-Noise Ratio (SNR) discrimination metric accommodating heterogeneous variance-covariance matrices, (2) to develop a multiple forward selection (MFS) algorithm employing the novel SNR metric, (3) to assess the performance of the MFS-SNR algorithm compared to alternative methods of variable selection, (4) to investigate the ability of the MFS-SNR algorithm to carry out variable selection when data are not normally distributed and (5) to apply the MFS-SNR algorithm to the task of variable selection from real datasets. The MFS-SNR algorithm was implemented in the R programming environment. It calculates the SNR for subsets of variables, identifying the optimal variable during each round of selection as the one that causes the largest increase in SNR. A dataset was simulated comprising 10 variables: 2 discriminating variables, 7 non-discriminating variables and one non-discriminating variable which enhanced the discriminatory performance of other variables. In simulations the frequency of each variable's selection was recorded. The probability of correct classification (PCC) and area under the curve (AUC) were calculated for sets of selected variables. I assessed the ability of the MFS-SNR algorithm to select variables when data are not normally distributed using simulated data. I compared the MFS-SNR algorithm to filter methods utilising information gain, chi-square statistics and the Relief-F algorithm, as well as support vector machines and an embedded method using random forests. A version of the MFS algorithm utilising Hotelling's T2 statistic (MFS-T2) was included in this comparison. The MFS-SNR algorithm selected all 3 variables relevant to discrimination with frequencies higher than or equivalent to competing methods in all scenarios. Following non-normal variable transformation, the MFS-SNR algorithm still selected the variables known to be relevant to discrimination in the simulated scenarios. Finally, I studied both the MFS-SNR and MFS-T2 algorithms' ability to carry out variable selection for disease classification using several clinical datasets from ophthalmology. These datasets represented a spectrum of quality issues such as missingness, imbalanced group sizes, heterogeneous variance-covariance matrices and differing variable scales. In 3 out of 4 datasets the MFS-SNR algorithm outperformed the MFS-T2 algorithm; in the fourth, both MFS-T2 and MFS-SNR produced the same variable selection results. In conclusion, I have demonstrated that the novel SNR is an extension of Hotelling's T2 statistic accommodating heterogeneity of variance-covariance matrices. The MFS-SNR algorithm is capable of selecting the relevant variables whether data are normally distributed or not. In the simulated scenarios the MFS-SNR algorithm performs at least as well as competing methods, and it outperforms the MFS-T2 algorithm when selecting variables from real clinical datasets.
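
    The thesis implements MFS-SNR in R with its own SNR metric; purely as an illustrative sketch, the snippet below shows the greedy forward-selection loop with a stand-in separation score (a Mahalanobis-type distance using unpooled per-group covariances, so heterogeneous variance-covariance matrices are not pooled away). The function names and the exact score are assumptions, not the thesis's definitions.

    # Illustrative multiple forward selection with a stand-in separation score.
    # Not the thesis's SNR: the score below simply avoids pooling the two
    # groups' covariance matrices, mirroring the stated motivation.
    import numpy as np

    def separation_score(X, y, cols):
        """Two-group separation on the selected columns (stand-in for the SNR)."""
        A, B = X[y == 0][:, cols], X[y == 1][:, cols]
        diff = A.mean(axis=0) - B.mean(axis=0)
        # sum of per-group covariances keeps each group's own variance structure
        S = np.cov(A, rowvar=False) + np.cov(B, rowvar=False)
        S = np.atleast_2d(S) + 1e-6 * np.eye(len(cols))   # regularize for stability
        return float(diff @ np.linalg.solve(S, diff))

    def forward_select(X, y, n_select):
        """Greedy forward selection: add the variable giving the largest score gain."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < n_select:
            scores = {j: separation_score(X, y, selected + [j]) for j in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected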