
    Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

    For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a budget-sensitive progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (near-optimal) classification performance.
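    The question posed in this abstract (how to fill a fixed budget of n training examples across classes) can be illustrated with a small sketch. The code below only draws a fixed-size sample with a chosen minority-class fraction; it is not the budget-sensitive progressive sampling algorithm from the article, and the function name, parameters, and toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_with_class_distribution(examples, labels, n, minority_label,
                                   minority_fraction, seed=0):
    """Draw n examples so that round(n * minority_fraction) carry minority_label.

    Hypothetical helper for illustration; not taken from the article.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    n_min = round(n * minority_fraction)
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("pool too small for the requested class distribution")

    # Sample without replacement inside each class, then mix the two parts.
    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]


# Toy pool (assumed data): 10% positive, 90% negative.
labels = [1] * 100 + [0] * 900
examples = list(range(1000))

# Same budget n = 100, two candidate class distributions.
X_nat, y_nat = sample_with_class_distribution(examples, labels, 100, 1, 0.10)
X_bal, y_bal = sample_with_class_distribution(examples, labels, 100, 1, 0.50)
print(Counter(y_nat), Counter(y_bal))
```

    Under the abstract's findings, the first sample (natural distribution) would be the usual starting point when error rate matters, and the second (balanced) when area under the ROC curve matters; neither is guaranteed to be best for a given data set.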

    Collaborative decision making by ensemble rule based classification systems

    Online Deception Detection Refueled by Real World Data Collection

    The lack of large realistic datasets presents a bottleneck in online deception detection studies. In this paper, we apply a data collection method based on social network analysis to quickly identify high-quality deceptive and truthful online reviews from Amazon. The dataset contains more than 10,000 deceptive reviews and is diverse in product domains and reviewers. Using this dataset, we explore effective general features for online deception detection that perform well across domains. We demonstrate that with generalized features - advertising speak and writing complexity scores - deception detection performance can be further improved by adding deceptive reviews from assorted domains to the training data. Finally, reviewer-level evaluation gives an interesting insight into different deceptive reviewers' writing styles. Comment: 10 pages, Accepted to Recent Advances in Natural Language Processing (RANLP) 201

    Unbalanced load flow with hybrid wavelet transform and support vector machine based Error-Correcting Output Codes for power quality disturbances classification including wind energy

    Purpose. The most common approach to designing a multiclass classifier is to train a set of binary classifiers and combine them. In this paper, a support vector machine with Error-Correcting Output Codes (ECOC-SVM) classifier is proposed to classify and characterize power quality disturbances, such as harmonic distortion, voltage sag, and voltage swell, in power transmission systems that include wind farm generators. First, a three-phase unbalanced load flow analysis is executed to calculate different electrical network characteristics: voltage levels and active and reactive power. Next, the discrete wavelet transform is combined with the probabilistic ECOC-SVM model to construct the classifier. Finally, the ECOC-SVM classifies and identifies the disturbance type according to the energy deviation of the discrete wavelet transform. The proposed method achieves a satisfactory accuracy of 99.2% compared with well-known methods and shows that each power quality disturbance has specific deviations from the pure sinusoidal waveform, which helps to recognize and specify the type of disturbance generated by the wind power generator.
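    The idea of combining binary classifiers through error-correcting output codes can be sketched as follows. Each class is assigned a binary codeword, one binary classifier is trained per codeword position, and a prediction is decoded to the class whose codeword is nearest in Hamming distance. The codebook below is a standard exhaustive code for four classes, not the one used in the paper; the fourth class ("pure sinusoid") and the hard-coded classifier outputs are assumptions for illustration, and no SVM or wavelet features are involved.

```python
# Hypothetical 7-bit codebook (exhaustive code for 4 classes).
# Every pair of codewords differs in 4 positions, so one wrong bit is corrected.
CODEBOOK = {
    "pure sinusoid":       (1, 1, 1, 1, 1, 1, 1),  # assumed extra class
    "harmonic distortion": (0, 0, 0, 0, 1, 1, 1),
    "voltage sag":         (0, 0, 1, 1, 0, 0, 1),
    "voltage swell":       (0, 1, 0, 1, 0, 1, 0),
}


def hamming(a, b):
    """Number of positions in which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits, codebook=CODEBOOK):
    """Return the class whose codeword is closest to the classifier outputs."""
    return min(codebook, key=lambda label: hamming(codebook[label], bits))


# Outputs of the 7 binary classifiers for one signal (assumed values):
# the "voltage sag" codeword with its third bit flipped by a classifier error.
outputs = (0, 0, 0, 1, 0, 0, 1)
print(ecoc_decode(outputs))
```

    Because the minimum distance between codewords here is 4, any single binary classifier can be wrong without changing the decoded class, which is the error-correcting property that motivates the approach.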

    Impact Of Content Features For Automatic Online Abuse Detection

    Online communities have gained considerable importance in recent years due to the increasing number of people connected to the Internet. Moderating user content in online communities is mainly performed manually, and reducing the workload through automatic methods is of great financial interest for community maintainers. Often, the industry uses basic approaches such as bad-word filtering and regular expression matching to assist the moderators. In this article, we consider the task of automatically determining if a message is abusive. This task is complex since messages are written in a non-standardized way, including spelling errors, abbreviations, and community-specific codes, among others. First, we evaluate the system that we propose using standard features of online messages. Then, we evaluate the impact of adding pre-processing strategies, as well as original specific features developed for the community of an online in-browser strategy game. Finally, we propose to analyze the usefulness of this wide range of features using feature selection. This work can lead to two possible applications: 1) automatically flagging potentially abusive messages to draw the moderator's attention to a narrow subset of messages; and 2) fully automating the moderation process by deciding whether a message is abusive without any human intervention.

    A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

    Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite; unfortunately, b-bit MinHash and Odd Sketch fail to deal with streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (each register consists of fewer than 7 bits) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and experimental results show that our method MaxLogHash is about 5 times more memory efficient than MinHash with the same accuracy and computational cost for estimating high similarities.
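    For readers unfamiliar with the baseline, the sketch below shows plain MinHash on streaming sets: each of k registers keeps the minimum hash value seen so far, and the fraction of registers that agree between two sketches estimates the Jaccard similarity. This is the classical MinHash that the abstract compares against, not MaxLogHash, whose register layout and estimator are not described here. The class name, the choice of BLAKE2b as the hash, and k = 256 are assumptions for illustration.

```python
import hashlib


def _hash(item, seed):
    """64-bit hash of item under the seed-th hash function (assumed construction)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class MinHashSketch:
    """Classical MinHash with k registers, updated one element at a time."""

    def __init__(self, k=256):
        self.k = k
        self.registers = [None] * k  # None means "no element seen yet"

    def add(self, item):
        # Streaming update: each register keeps the smallest hash seen so far.
        for j in range(self.k):
            h = _hash(item, j)
            if self.registers[j] is None or h < self.registers[j]:
                self.registers[j] = h

    def jaccard(self, other):
        # Two registers collide with probability equal to the Jaccard similarity.
        if self.k != other.k:
            raise ValueError("sketches must use the same number of registers")
        matches = sum(a == b for a, b in zip(self.registers, other.registers))
        return matches / self.k


# Two overlapping streams: |A ∩ B| = 900, |A ∪ B| = 1100, so J = 900/1100.
a, b = MinHashSketch(), MinHashSketch()
for x in range(0, 1000):
    a.add(x)
for x in range(100, 1100):
    b.add(x)
estimate = a.jaccard(b)
print(f"estimated Jaccard: {estimate:.3f}, exact: {900 / 1100:.3f}")
```

    Each register in this sketch stores a full 64-bit hash value; the abstract's claim is that MaxLogHash reaches comparable accuracy for high similarities with registers of fewer than 7 bits each.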