
    Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains

    There has been increased interest in devising learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most published work on semi-supervised learning assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets differ. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer questions such as the effect of independence or relevance among features, the effect of the sizes of the labeled and unlabeled sets, and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique specifically designed to correct for such bias.
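
    As a concrete illustration of the kind of technique surveyed here, below is a minimal self-training sketch in Python: pseudo-label the unlabeled points the current model is most confident about, fold them into the labeled set, and retrain. The base classifier, confidence threshold, and iteration cap are our assumptions, not the paper's experimental setup.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
            """Self-training: iteratively pseudo-label confident unlabeled points."""
            for _ in range(max_iter):
                clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
                if len(X_unlab) == 0:
                    break
                proba = clf.predict_proba(X_unlab)
                sure = proba.max(axis=1) >= threshold  # keep confident predictions only
                if not sure.any():
                    break
                X_lab = np.vstack([X_lab, X_unlab[sure]])
                y_lab = np.concatenate([y_lab, clf.classes_[proba[sure].argmax(axis=1)]])
                X_unlab = X_unlab[~sure]
            return clf

    Note that if the labeled and unlabeled sets come from different distributions (the sample-selection bias discussed above), the pseudo-labels inherit that bias, which is precisely why a correction such as the bivariate probit technique matters.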

    Box Drawings for Learning with Imbalanced Data

    The vast majority of real-world classification problems are imbalanced, meaning there are far fewer data from the class of interest (the positive class) than from other classes. We propose two machine learning algorithms to handle highly imbalanced classification problems. The classifiers constructed by both methods are created as unions of axis-parallel rectangles around the positive examples, and thus have the benefit of being interpretable. The first algorithm uses mixed integer programming to optimize a weighted balance between positive and negative class accuracies. Regularization is introduced to improve generalization performance. The second method uses an approximation to improve scalability. Specifically, it follows a "characterize then discriminate" approach, where the positive class is first characterized by boxes, and then each box boundary becomes a separate discriminative classifier. This method has the computational advantages that it can be easily parallelized and considers only the relevant regions of feature space.
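
    A hypothetical sketch of the "characterize then discriminate" idea follows; the clustering used to group positives into boxes and the margin parameter are our assumptions, not the paper's mixed integer formulation. Positive examples are wrapped in axis-parallel boxes, and a point is predicted positive if it falls inside the union.

        import numpy as np
        from sklearn.cluster import KMeans

        def fit_boxes(X_pos, n_boxes=3, margin=0.0):
            """Characterize the positive class as a union of axis-parallel boxes."""
            labels = KMeans(n_clusters=n_boxes, n_init=10).fit_predict(X_pos)
            boxes = []
            for k in range(n_boxes):
                pts = X_pos[labels == k]
                if len(pts):
                    boxes.append((pts.min(axis=0) - margin, pts.max(axis=0) + margin))
            return boxes

        def predict(X, boxes):
            """Label a point 1 (positive) if it lies inside any box, else 0."""
            inside = np.zeros(len(X), dtype=bool)
            for lo, hi in boxes:
                inside |= np.all((X >= lo) & (X <= hi), axis=1)
            return inside.astype(int)

    In the discriminate step, each box boundary would then be refined by its own classifier, which is what makes the approach easy to parallelize.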

    Class Balanced Similarity-Based Instance Transfer Learning for Botnet Family Classification

    The use of transfer learning algorithms for enhancing the performance of machine learning algorithms has gained attention over the last decade. In this paper we introduce an extension and evaluation of our novel approach, Similarity-Based Instance Transfer Learning (SBIT). The extended version is denoted Class-Balanced SBIT (or CB-SBIT for short) because it ensures that the dataset resulting from instance transfer does not contain class imbalance. We compare the performance of CB-SBIT against the original SBIT algorithm. In addition, we compare its performance against that of the classical Synthetic Minority Over-sampling Technique (SMOTE) using network traffic data. We also compare the performance of CB-SBIT against the performance of the open source transfer learning algorithm TransferBoost using text data. Our results show that CB-SBIT outperforms the original SBIT and SMOTE on varying sizes of network traffic data but falls short when compared to TransferBoost on text data.
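
    A rough Python sketch of class-balanced, similarity-based instance transfer is shown below; the cosine scoring and the fixed per-class cap are our assumptions rather than the actual CB-SBIT selection rule. Each source instance is scored by its best similarity to the target data, and the top-scoring instances are transferred in equal numbers per class.

        import numpy as np
        from sklearn.metrics.pairwise import cosine_similarity

        def cb_transfer(X_src, y_src, X_tgt, y_tgt, per_class=50):
            """Augment the target set with similar source instances, kept class-balanced."""
            sims = cosine_similarity(X_src, X_tgt).max(axis=1)  # best match in target
            X_out, y_out = [X_tgt], [y_tgt]
            for c in np.unique(y_tgt):
                idx = np.where(y_src == c)[0]
                top = idx[np.argsort(sims[idx])[::-1][:per_class]]  # most similar first
                X_out.append(X_src[top])
                y_out.append(y_src[top])
            return np.vstack(X_out), np.concatenate(y_out)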

    SMOTE: Synthetic Minority Over-sampling Technique

    An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. It also shows that this combination can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
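
    The core of the method reduces to a short sketch (parameter names and random-number handling are ours): pick a minority point, pick one of its k nearest minority-class neighbors, and create a synthetic example a random fraction of the way between them.

        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def smote(X_min, n_synthetic, k=5, seed=0):
            """Generate synthetic minority examples by neighbor interpolation."""
            rng = np.random.default_rng(seed)
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: nearest is self
            _, nbrs = nn.kneighbors(X_min)
            synthetic = []
            for _ in range(n_synthetic):
                i = rng.integers(len(X_min))
                j = nbrs[i][rng.integers(1, k + 1)]  # skip column 0 (the point itself)
                gap = rng.random()                   # uniform fraction in [0, 1)
                synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
            return np.asarray(synthetic)

    In the paper's experiments this over-sampling is combined with random under-sampling of the majority class.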

A review on quantification learning

    The task of quantification consists in providing an aggregate estimation (e.g. the class distribution in a classification problem) for unseen test sets, applying a model that is trained on a training set with a different data distribution. Several real-world applications demand methods of this kind, which do not require predictions for individual examples and instead focus on obtaining accurate estimates at an aggregate level. During the past few years, several quantification methods have been proposed from different perspectives and with different goals. This paper presents a unified review of the main approaches with the aim of serving as an introductory tutorial for newcomers in the field.
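
    As one concrete example of the methods such a review covers, below is a minimal sketch of Adjusted Classify & Count, a classic quantification approach; the function and argument names are ours. The raw fraction of test items the classifier labels positive is corrected using the true and false positive rates estimated on held-out training data.

        def adjusted_classify_and_count(cc_rate, tpr, fpr):
            """Estimate positive-class prevalence p from a raw classify-and-count rate.

            Inverts cc_rate = p * tpr + (1 - p) * fpr for p.
            """
            if tpr == fpr:                     # degenerate classifier: no correction possible
                return cc_rate
            p = (cc_rate - fpr) / (tpr - fpr)
            return min(1.0, max(0.0, p))       # clip to a valid prevalence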

    Effect of plant growth regulators on flowering behavior of cashew cv. Vengurla-4 grown in the hilly tracts of South Gujarat

    A trial was conducted at the Subhir and Chikhalda locations in Dang district of South Gujarat, India to assess the effect of Ethrel, NAA and GA3 on the flowering behavior of cashew cultivar Vengurla-4 during 2013-14. Three concentrations each of GA3 (50, 75, 100 ppm), Ethrel (10, 30, 50 ppm) and NAA (50, 75, 100 ppm) were applied as foliar sprays 20 days before blossoming and 20 days after full bloom in twenty-year-old trees of cashew cultivar Vengurla-4. Trees sprayed with 50 ppm Ethrel recorded the significantly highest number of flowering panicles per square meter (13.09), number of perfect flowers per panicle (87.11) and sex ratio (0.24) across locations and in the pooled data. However, this was at par with 10 ppm Ethrel, which emerged as the second best treatment of the trial. This study demonstrated the potential of Ethrel in improving various flowering parameters of cashew, which are important determinants of increased nut production.

    Identification of microsatellite markers on chromosomes of bread wheat showing an association with Karnal bunt resistance

    A set of 104 wheat recombinant inbred lines developed from a cross between parents resistant (HD 29) and susceptible (WH 542) to Karnal bunt (caused by Neovossia indica) were screened and used to identify SSR markers linked with resistance to Karnal bunt, as these would allow indirect marker-assisted selection of Karnal bunt resistant genotypes. The two parents were analysed with 46 SSR primer pairs. Of these, 15 (32%) were found polymorphic between the two parental genotypes. Using these primer pairs, we carried out bulked segregant analysis on two bulked DNAs, one obtained by pooling DNA from 10 Karnal bunt resistant recombinant inbred lines and the other similarly derived by pooling DNA from 10 Karnal bunt susceptible recombinant inbred lines. Two molecular markers, Xgwm 337-1D and Xgwm 637-4A, showed apparent linkage with resistance to Karnal bunt. This was confirmed following selective genotyping of the individual recombinant inbred lines included in the bulks. These markers may be useful in marker-assisted selection for Karnal bunt resistance in wheat.