
    Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization

    Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are offset by only small increases in computational requirements.
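
    To make the idea concrete, here is a minimal Python sketch of signature-based detection with multiple randomized lexicons. It assumes an I-Match signature is simply a hash of the document terms retained by a lexicon; the lexicon itself (in I-Match, typically mid-IDF collection terms) is taken as given, and names such as randomized_lexicons, k, and drop_fraction are illustrative rather than from the paper.

        import hashlib
        import random

        def imatch_signature(doc_terms, lexicon):
            # I-Match: the signature is a hash of the sorted, de-duplicated
            # document terms that survive the lexicon filter.
            kept = sorted(set(doc_terms) & lexicon)
            return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()

        def randomized_lexicons(base_lexicon, k=4, drop_fraction=0.3, seed=42):
            # Derive k additional lexicons by randomly dropping a fraction of
            # the base lexicon's terms; the base plus all perturbed copies are used.
            rng = random.Random(seed)
            copies = [set(base_lexicon)]
            for _ in range(k):
                copies.append({t for t in base_lexicon if rng.random() >= drop_fraction})
            return copies

        def near_replicas(terms_a, terms_b, lexicons):
            # Two documents are flagged as near-replicas if any of their
            # per-lexicon signatures collide.
            sigs_a = {imatch_signature(terms_a, lex) for lex in lexicons}
            sigs_b = {imatch_signature(terms_b, lex) for lex in lexicons}
            return bool(sigs_a & sigs_b)

    A single word change that perturbs one signature is unlikely to perturb all k+1 of them, which is the intuition behind the robustness gain.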

    Microsoft Live Labs, One Microsoft Way

    Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which is shown to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
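
    The link to logarithmic opinion pooling can be sketched in a few lines: pooling the per-term log-opinions with weights that sum to one is the same as dividing the summed Naive Bayes log-odds by document length. The sketch below assumes precomputed per-term log-odds; the exact weighting and normalization studied in the paper may differ.

        from collections import Counter

        def nb_log_odds(doc_terms, term_log_odds, prior_log_odds=0.0, pool=True):
            # Multinomial Naive Bayes score: a sum of per-term log-odds.
            # With pool=True the per-term "expert opinions" are averaged
            # (weights summing to one), i.e. length-normalized.
            counts = Counter(doc_terms)
            length = sum(counts.values())
            score = sum(tf * term_log_odds.get(t, 0.0) for t, tf in counts.items())
            if pool and length:
                score /= length
            return prior_log_odds + score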

    Local sparsity control for Naive Bayes with extreme misclassification costs

    In applications of data mining characterized by highly skewed misclassification costs, certain types of errors become virtually unacceptable. This limits the utility of a classifier to a range in which such constraints can be met. Naive Bayes, which has proven to be very useful in text mining applications due to high scalability, can be particularly affected. Although its 0/1 loss tends to be small, its misclassifications are often made with apparently high confidence. Aside from efforts to better calibrate Naive Bayes scores, it has been shown that its accuracy depends on document sparsity, and feature selection can lead to marked improvement in classification performance. Traditionally, sparsity is controlled globally, and the result for any particular document may vary. In this work we examine the merits of local sparsity control for Naive Bayes in the context of highly asymmetric misclassification costs. In experiments with three benchmark document collections we demonstrate clear advantages of document-level feature selection. In the extreme cost setting, multinomial Naive Bayes with local sparsity control is able to outperform even some of the recently proposed effective improvements to the Naive Bayes classifier. There are also indications that local feature selection may be preferable in different cost settings.
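
    A hedged sketch of what document-level (local) sparsity control can look like: instead of selecting one global vocabulary, each document is scored using only its own most informative terms. The |log-odds| ranking criterion and the n_keep parameter are illustrative assumptions, not necessarily the paper's choices.

        def nb_local_sparsity_score(doc_terms, term_log_odds, n_keep=16):
            # Per-document feature selection: keep only the n_keep terms of
            # this document with the largest absolute log-odds, then score
            # with multinomial Naive Bayes as usual.
            weights = sorted((term_log_odds.get(t, 0.0) for t in set(doc_terms)),
                             key=abs, reverse=True)
            return sum(weights[:n_keep])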

    Microsoft

    Near-duplicate detection is not only an important pre- and post-processing task in Information Retrieval but also an effective spam-detection technique. Among different approaches to near-replica detection, methods based on document signatures are particularly attractive due to their scalability to massive document collections and their ability to handle high throughput rates. Their weakness lies in the potential brittleness of signatures to small changes in content, which makes them vulnerable to various types of noise. In the important spam-filtering application, this vulnerability can also be exploited by dedicated attackers aiming to maximally fragment signatures corresponding to the same email campaign. We focus on the I-Match algorithm and present a method of strengthening it by considering the usage context when deciding which portions of a document should affect signature generation. This substantially (almost 100-fold in some cases) increases the difficulty of dedicated attacks and provides effective protection against document noise in non-adversarial settings. Our analysis is supported by experiments using a real email collection.
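
    The abstract does not spell out the context criterion, so the following is only one plausible reading: a lexicon term contributes to the I-Match signature only when its local usage context looks legitimate, so that isolated injected words cannot fragment the signature. The window/min_neighbors test is a hypothetical stand-in for the paper's actual context model.

        import hashlib

        def context_kept_terms(tokens, lexicon, window=2, min_neighbors=1):
            # Keep a lexicon term for signature generation only if enough of
            # its surrounding tokens are lexicon terms as well.
            kept = set()
            for i, tok in enumerate(tokens):
                if tok not in lexicon:
                    continue
                neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                if sum(n in lexicon for n in neighbors) >= min_neighbors:
                    kept.add(tok)
            return sorted(kept)

        def contextual_imatch(tokens, lexicon):
            kept = context_kept_terms(tokens, lexicon)
            return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()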

    Spam filter evaluation with imprecise ground truth

    When trained and evaluated on accurately labeled datasets, online email spam filters are remarkably effective, achieving error rates an order of magnitude better than classifiers in similar applications. But labels acquired from user feedback or third-party adjudication exhibit higher error rates than the best filters – even filters trained using the same source of labels. It is appropriate to use naturally occurring labels – including errors – as training data in evaluating spam filters. Erroneous labels are problematic, however, when used as ground truth to measure filter effectiveness. Any measurement of the filter’s error rate will be augmented and perhaps masked by the label error rate. Using two natural sources of labels, we demonstrate automatic and semi-automatic methods that reduce the influence of labeling errors on evaluation, yielding substantially more precise measurements of true filter error rates.
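
    A simple illustration of why label noise masks filter quality, assuming (unlike the paper's more careful methods) that label errors are independent of the filter's errors: if gold labels are wrong with probability e, the observed filter/label disagreement m satisfies m = t(1 - e) + (1 - t)e for true filter error t, which can be inverted.

        def true_error_estimate(observed_disagreement, label_error_rate):
            # Invert m = t*(1 - e) + (1 - t)*e for the filter's true error t.
            # Valid only under the independence assumption stated above.
            m, e = observed_disagreement, label_error_rate
            if not 0.0 <= e < 0.5:
                raise ValueError("label errors must be better than chance")
            return (m - e) / (1.0 - 2.0 * e)

    For example, a filter measured at 1.2% disagreement against labels that are themselves wrong 1% of the time has an estimated true error of only about 0.2%.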

    SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs

    We address the problem of separating legitimate emails from unsolicited ones in the context of a large-scale operation, where the diversity of user accounts is very high, while misclassification costs are content-dependent and highly asymmetric. A category-specific cost model is proposed and several effective methods of training a cost-sensitive filter are studied, using a Support Vector Machine (SVM) as the base classifier. Clear benefits of explicitly accounting for varied misclassification costs, either during training or as a form of post-processing, are shown.
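
    One common way to fold content-dependent costs into SVM training is through per-example weights; the sketch below uses scikit-learn's LinearSVC for concreteness, and the category names and cost values are invented for illustration (the paper's cost model and training methods are its own).

        from sklearn.svm import LinearSVC

        # Hypothetical content-dependent costs: losing personal mail to the
        # spam folder is assumed far costlier than losing a newsletter.
        CATEGORY_COST = {"personal": 25.0, "business": 15.0, "newsletter": 5.0, "spam": 1.0}

        def train_cost_sensitive_filter(X, y, categories):
            # Encode per-category misclassification costs as example weights
            # during training; thresholding scores after training is the
            # post-processing alternative the abstract mentions.
            weights = [CATEGORY_COST.get(c, 1.0) for c in categories]
            clf = LinearSVC()
            clf.fit(X, y, sample_weight=weights)
            return clf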

    Avoidance of Model Re-Induction in SVM-based Feature Selection for Text Categorization

    Searching the feature space for a subset yielding optimum performance tends to be expensive, especially in applications where the cardinality of the feature space is high (e.g., text categorization). This is particularly true for massive datasets and learning algorithms with worse than linear scaling factors. Linear Support Vector Machines (SVMs) are among the top performers in the text classification domain and often work best with very rich feature representations. Even they, however, benefit from reducing the number of features, sometimes to a large extent. In this work we propose alternatives to exact re-induction of SVM models during the search for the optimum feature subset. The approximations offer substantial benefits in terms of computational efficiency. We are able to demonstrate that no significant compromises in terms of model quality are made and, moreover, in some cases gains in accuracy can be achieved.
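
    One natural approximation in this spirit, sketched under assumptions (the paper's exact alternatives are not given in the abstract): rather than re-inducing the SVM after pruning features, zero out the corresponding components of the already-trained weight vector and re-score.

        import numpy as np

        def scores_without_reinduction(w, b, X, keep_mask):
            # Approximate the retrained model on a candidate feature subset
            # by masking pruned weights instead of re-training the SVM.
            w_masked = np.where(keep_mask, w, 0.0)
            return X @ w_masked + b

        def weight_based_ranking(w):
            # Rank features by |w_j|, a standard SVM-derived importance score.
            return np.argsort(-np.abs(w))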

    Asymmetric Missing-Data Problems: Overcoming the Lack of Negative Data in Preference Ranking

    In certain classification problems there is a strong asymmetry between the number of labeled examples available for each of the classes involved. In an extreme case, there may be a complete lack of labeled data for one of the classes while, at the same time, there are adequate labeled examples for the others, accompanied by a large body of unlabeled data. Since most classification algorithms require some information about all classes involved, label estimation for the unrepresented class is desired. An important representative of this group of problems is that of user interest/preference modeling, where there may be a large number of examples of what the user likes with essentially no counterexamples. Recently, there has been much interest in applying the EM algorithm to incomplete data problems in the area of text retrieval and categorization. We adapt this approach to the asymmetric case of modeling user interests in news articles, where only labeled positive training data are available, with access to a large corpus of unlabeled documents. User modeling is here equivalent to user-specific document ranking. EM is used in conjunction with the Naive Bayes model, while its output is also utilized by a Support Vector Machine and Rocchio’s technique. Our findings demonstrate that the EM algorithm can be quite effective in modeling the negative class under a number of different initialization schemes. Although primarily just the negative training examples are needed, a natural question is whether using all of the estimated labels (i.e., positive and negative) would be more (or less) beneficial. This is important considering that, in this context, the initialization of the negative class for EM is likely not to be very accurate. Experimental results suggest that EM output should be limited to negative label estimates only.
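
    A hard-EM sketch of the setup, with assumptions made explicit: the negative class is seeded with all unlabeled documents (one of several possible initialization schemes), scikit-learn's MultinomialNB stands in for the Naive Bayes model, and dense term-count matrices are used for simplicity.

        import numpy as np
        from sklearn.naive_bayes import MultinomialNB

        def estimate_negative_labels(X_pos, X_unl, n_iter=10, threshold=0.5):
            # Seed the missing negative class with the unlabeled pool, then
            # alternate fitting Naive Bayes (M-step) and re-labeling the
            # unlabeled documents (E-step, hard assignment).
            n_pos = X_pos.shape[0]
            X = np.vstack([X_pos, X_unl])
            y = np.concatenate([np.ones(n_pos), np.zeros(X_unl.shape[0])])
            for _ in range(n_iter):
                nb = MultinomialNB().fit(X, y)                    # M-step
                pos_col = list(nb.classes_).index(1.0)
                p_pos = nb.predict_proba(X_unl)[:, pos_col]
                y[n_pos:] = (p_pos >= threshold).astype(float)    # E-step
            # Per the paper's finding, return only the negative estimates.
            return np.where(y[n_pos:] == 0.0)[0]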