34 research outputs found

    Clustering-Based Undersampling for Class-imbalanced Data

    No full text
    Abstract: Class imbalance is often a problem in various real-world data sets, where one class (i.e. the minority class) contains a small number of data points and the other (i.e. the majority class) contains a large number of data points. It is notably difficult to develop an effective model using current data mining and machine learning algorithms without data preprocessing to balance the imbalanced data sets. Random undersampling and oversampling have been used in numerous studies to ensure that the different classes contain the same number of data points. A classifier ensemble (i.e. a structure containing several classifiers) can be trained on several different balanced data sets for later classification purposes. In this paper, we introduce two undersampling strategies in which a clustering technique is used during the data preprocessing step. Specifically, the number of clusters in the majority class is set equal to the number of data points in the minority class. The first strategy uses the cluster centers to represent the majority class, whereas the second strategy uses the nearest neighbors of the cluster centers. A further study examined the effect on performance of adding or deleting 5 to 10 cluster centers in the majority class. The experimental results obtained using 44 small-scale and 2 large-scale data sets revealed that the clustering-based undersampling approach with the second strategy outperformed five state-of-the-art approaches. Specifically, this approach combined with a single multilayer perceptron classifier and with C4.5 decision tree classifier ensembles delivered the best performance over both small- and large-scale data sets.
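
    The following is a minimal sketch of the clustering-based undersampling idea described in the abstract, using scikit-learn's KMeans. The function name cluster_undersample and its parameters are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (assumed helper, not the paper's code): reduce the majority
    # class to as many representatives as there are minority-class samples.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    def cluster_undersample(X_maj, X_min, use_nearest_neighbors=True, random_state=0):
        k = len(X_min)  # number of clusters = number of minority-class data points
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_maj)
        if use_nearest_neighbors:
            # Second strategy: keep the real majority sample closest to each cluster center
            idx = pairwise_distances_argmin(km.cluster_centers_, X_maj)
            X_maj_reduced = X_maj[idx]
        else:
            # First strategy: use the synthetic cluster centers themselves
            X_maj_reduced = km.cluster_centers_
        # Return a balanced data set (majority labeled 0, minority labeled 1)
        X = np.vstack([X_maj_reduced, X_min])
        y = np.hstack([np.zeros(len(X_maj_reduced)), np.ones(len(X_min))])
        return X, y

    A balanced set produced this way can be fed to a single classifier or, by repeating the sampling, used to train the classifier ensemble described above.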

    Keypoint Selection for Efficient Bag-of-Words Feature Generation and Effective Image Classification

    No full text
    Abstract: One of the most popular image representations for image classification is based on bag-of-words (BoW) features. However, the number of keypoints that must be detected in images to generate the BoW features is usually very large, which causes two problems. First, the computational cost of the vector quantization step is high. Second, some of the detected keypoints are not helpful for recognition. To resolve these limitations, we introduce a framework, called iterative keypoint selection (IKS), for selecting representative keypoints that accelerates generation of the BoW features and leads to a more discriminative feature representation. Each iteration in IKS comprises two steps. In the first step, some representative keypoint(s) are identified from each image. Then, keypoints are filtered out if their distances to the identified representative keypoint(s) are less than a pre-defined distance. The iteration process continues until no unrepresentative keypoints can be found. Two specific approaches are proposed to perform the first step of IKS: IKS1 randomly selects one representative keypoint, whereas IKS2 is based on a clustering algorithm in which the representative keypoints are the points closest to their cluster centers. Experiments carried out on the Caltech 101, Caltech 256, and PASCAL 2007 datasets demonstrate that performing keypoint selection with IKS1 and IKS2 to generate both the BoW and spatial-based BoW features allows the support vector machine (SVM) classifier to provide better classification accuracy than the baseline features without keypoint selection. However, the computational cost of IKS1 is found to be larger than that of the baseline methods. On the other hand, IKS2 not only efficiently generates the BoW and spatial-based features, reducing the computational time for vector quantization over these datasets, but also provides better classification results than IKS1 over the PASCAL 2007 and Caltech 256 datasets.
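
    A rough sketch of the IKS loop described above (the IKS1 variant) follows; the distance threshold d and the function name are assumptions for illustration, not the authors' code. IKS2 would instead take, as each round's representatives, the keypoints closest to cluster centers.

    # Illustrative sketch of iterative keypoint selection (IKS1 variant):
    # repeatedly pick a random keypoint as a representative and drop all
    # keypoints within a pre-defined distance d of it.
    import numpy as np

    def iks1(keypoints, d, seed=0):
        rng = np.random.default_rng(seed)
        remaining = np.asarray(keypoints, dtype=float)
        selected = []
        while len(remaining) > 0:
            rep = remaining[rng.integers(len(remaining))]   # step 1: pick a representative
            selected.append(rep)
            dist = np.linalg.norm(remaining - rep, axis=1)  # step 2: filter nearby keypoints
            remaining = remaining[dist >= d]
        return np.array(selected)

    The surviving keypoints are then vector-quantized as usual to build the BoW (or spatial-based BoW) histogram, so the quantization step processes far fewer descriptors.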

    Sex, Menopause, Metabolic Syndrome, and All-Cause and Cause-Specific Mortality-Cohort Analysis from the Third National Health and Nutrition Examination Survey

    No full text
    Objective: This study assessed the mortality risk associated with metabolic syndrome (MetS) for participants from the Third National Health and Nutrition Examination Survey. Design, Setting, and Patients: The study analyzed mortality data from 1364 men and 1321 women aged 40 yr and older based on their MetS status as defined by the National Cholesterol Education Program Adult Treatment Panel III. Subjects initially using insulin, oral hypoglycemic, antihypertensive, or lipid-lowering medications were excluded. Main Outcome Measures: All-cause, cardiovascular, cardiac, and noncardiovascular mortality were obtained from the Third National Health and Nutrition Examination Survey-linked mortality follow-up file through December 31, 2000. Results: The prevalence of MetS was 33% and 29% for men and women, respectively. In the male subjects, there was no significant association between MetS and mortality. In the women, MetS was an independent risk factor for all-cause mortality [hazard ratio (HR) 1.84, 95% confidence interval (CI) 1.29-2.64, P = 0.001], cardiovascular mortality (HR 1.96, 95% CI 1.21-3.17, P = 0.007), cardiac mortality (HR 1.88, 95% CI 1.15-3.09, P = 0.01), and noncardiovascular mortality (HR 1.80, 95% CI 1.13-2.87, P = 0.01). The HR was stronger when postmenopausal women were analyzed separately and became nonsignificant in the premenopausal cohort. The sex-specific HR remained unchanged regardless of the MetS criteria used or the inclusion of actively treated subjects. Conclusions: MetS poses a significant increase in mortality risk over an observation period of up to 12 yr, primarily in postmenopausal women; this risk is not apparent in men or premenopausal women. Sex is an important effect modifier of all-cause and cause-specific death.