4,747 research outputs found
Impact Of Content Features For Automatic Online Abuse Detection
Online communities have gained considerable importance in recent years due to
the increasing number of people connected to the Internet. Moderating user
content in online communities is mainly performed manually, and reducing the
workload through automatic methods is of great financial interest for community
maintainers. Often, the industry uses basic approaches such as bad words
filtering and regular expression matching to assist the moderators. In this
article, we consider the task of automatically determining if a message is
abusive. This task is complex since messages are written in a non-standardized
way, including spelling errors, abbreviations, community-specific codes, and so on.
First, we evaluate the system that we propose using standard features of online
messages. Then, we evaluate the impact of the addition of pre-processing
strategies, as well as original specific features developed for the community
of an online in-browser strategy game. We finally propose to analyze the
usefulness of this wide range of features using feature selection. This work
can lead to two possible applications: 1) automatically flagging potentially
abusive messages to draw the moderators' attention to a narrow subset of
messages; and 2) fully automating the moderation process by deciding whether a
message is abusive without any human intervention.
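The basic industry approaches the abstract mentions (bad-word filtering and regular-expression matching) can be sketched in a few lines. This is an illustrative baseline, not the system proposed in the paper; the word list and function name are invented for the example:

```python
import re

# Hypothetical word list for illustration; real deployments use curated lexicons.
BAD_WORDS = ["idiot", "moron", "stupid"]

# One compiled pattern with word boundaries, case-insensitive.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in BAD_WORDS) + r")\b",
    re.IGNORECASE,
)

def flag_message(text: str) -> bool:
    """Return True if the message matches the bad-word pattern."""
    return pattern.search(text) is not None

print(flag_message("You are an IDIOT"))       # True
print(flag_message("Nice move, well played"))  # False
```

Such filters miss exactly the non-standardized spellings and community codes the abstract highlights, which is the motivation for the richer content features studied in the paper.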
A Method for Avoiding Bias from Feature Selection with Application to Naive Bayes Classification Models
For many classification and regression problems, a large number of features
are available for possible use - this is typical of DNA microarray data on gene
expression, for example. Often, for computational or other reasons, only a
small subset of these features are selected for use in a model, based on some
simple measure such as correlation with the response variable. This procedure
may introduce an optimistic bias, however, in which the response variable
appears to be more predictable than it actually is, because the high
correlation of the selected features with the response may be partly or wholly
due to chance. We show how this bias can be avoided when using a Bayesian model
for the joint distribution of features and response. The crucial insight is
that even if we forget the exact values of the unselected features, we should
retain, and condition on, the knowledge that their correlation with the
response was too small for them to be selected. In this paper we describe how
this idea can be implemented for "naive Bayes" models of binary data.
Experiments with simulated data confirm that this method avoids bias due to
feature selection. We also apply the naive Bayes model to subsets of data
relating gene expression to colon cancer, and find that correcting for bias
from feature selection does improve predictive performance.
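The optimistic bias described above is easy to reproduce in simulation. The sketch below (our own illustration, not the paper's Bayesian correction) selects features by correlation on pure-noise data and shows that the apparently strong correlations collapse on an independent sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 1000, 10          # samples, candidate features, features kept

# Pure-noise binary data: no feature is truly related to the response.
X = rng.integers(0, 2, size=(n, p))
y = rng.integers(0, 2, size=n)

# Select the k features most correlated with y on the SAME data.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
selected = np.argsort(corr)[-k:]

# The winning correlations look impressive...
print("apparent |corr| of selected features:", corr[selected].round(2))

# ...but on an independent sample the same columns show no signal.
X2 = rng.integers(0, 2, size=(n, p))
y2 = rng.integers(0, 2, size=n)
corr2 = np.array([abs(np.corrcoef(X2[:, j], y2)[0, 1]) for j in selected])
print("fresh-sample |corr| of the same features:", corr2.round(2))
```

The paper's insight is to condition on the fact of selection itself (the unselected features' correlations were too small), rather than discarding that information as this naive selection does.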
Cache Hierarchy Inspired Compression: a Novel Architecture for Data Streams
We present an architecture for data streams based on structures typically found in web cache hierarchies. The main idea is to build a meta-level analyser from a number of levels constructed over time from a data stream. We present the general architecture for such a system and an application to classification. This architecture is an instance of the general wrapper idea, allowing us to reuse standard batch learning algorithms in an inherently incremental learning environment. By artificially generating data sources, we demonstrate that a hierarchy containing a mixture of models is able to adapt over time to the source of the data. In these experiments the hierarchies use an elementary performance-based replacement policy and unweighted voting for making classification decisions.
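The wrapper idea above can be sketched minimally: a fixed-capacity set of cached models, evicted by recent performance, combined by unweighted voting. All class and method names here are our own invention for illustration, not the paper's implementation:

```python
from collections import deque

class HierarchyLevel:
    """One cached model plus its recent accuracy, used for replacement."""
    def __init__(self, model):
        self.model = model
        self.hits = deque(maxlen=50)   # rolling record of correct predictions
    def score(self):
        return sum(self.hits) / len(self.hits) if self.hits else 0.0

class CacheHierarchy:
    """Unweighted-vote ensemble with performance-based replacement."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.levels = []
    def predict(self, x):
        votes = [lvl.model(x) for lvl in self.levels]
        return max(set(votes), key=votes.count) if votes else None
    def observe(self, x, y):
        # Record, for each level, whether it got this labelled example right.
        for lvl in self.levels:
            lvl.hits.append(lvl.model(x) == y)
    def add(self, model):
        if len(self.levels) >= self.capacity:
            # Evict the level with the worst recent performance.
            self.levels.remove(min(self.levels, key=HierarchyLevel.score))
        self.levels.append(HierarchyLevel(model))

h = CacheHierarchy()
h.add(lambda x: 0)      # stand-ins for batch-trained classifiers
h.add(lambda x: 1)
h.add(lambda x: 1)
print(h.predict(None))  # unweighted majority vote -> 1
```

In the real system each level would hold a model batch-trained on a window of the stream, which is what lets standard batch learners operate incrementally.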
Asymptotic Analysis of Generative Semi-Supervised Learning
Semi-supervised learning has emerged as a popular framework for improving
modeling accuracy while controlling labeling cost. Based on an extension of
stochastic composite likelihood we quantify the asymptotic accuracy of
generative semi-supervised learning. In doing so, we complement
distribution-free analysis by providing an alternative framework to measure the
value associated with different labeling policies and resolve the fundamental
question of how much data to label and in what manner. We demonstrate our
approach with both simulation studies and real world experiments using naive
Bayes for text classification and MRFs and CRFs for structured prediction in
NLP.
Comment: 12 pages, 9 figures
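As a concrete instance of generative semi-supervised learning (a standard EM construction, not the paper's stochastic-composite-likelihood analysis), a Bernoulli naive Bayes model can be fit to a small labelled set plus a large unlabelled pool. All data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes with different per-feature Bernoulli rates.
theta_true = np.array([[0.8, 0.2, 0.7, 0.1],
                       [0.2, 0.8, 0.1, 0.9]])

def sample(c, n):
    return (rng.random((n, 4)) < theta_true[c]).astype(float)

X_lab = np.vstack([sample(0, 10), sample(1, 10)])   # 20 labelled examples
y_lab = np.array([0] * 10 + [1] * 10)
X_unl = np.vstack([sample(0, 200), sample(1, 200)])  # 400 unlabelled

# Initialise from the labelled data only (add-one smoothed MLE).
prior = np.array([0.5, 0.5])
theta = np.vstack([(X_lab[y_lab == c].sum(0) + 1) / (10 + 2) for c in (0, 1)])

for _ in range(20):                                  # EM over the unlabelled pool
    # E-step: posterior class responsibilities for unlabelled rows.
    log_lik = X_unl @ np.log(theta).T + (1 - X_unl) @ np.log(1 - theta).T
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)
    # M-step: refit on labelled counts plus soft unlabelled counts.
    R = np.vstack([np.eye(2)[y_lab], post])
    Xall = np.vstack([X_lab, X_unl])
    prior = R.sum(0) / R.sum()
    theta = (R.T @ Xall + 1) / (R.sum(0)[:, None] + 2)

print(theta.round(2))   # should move toward theta_true
```

How much such unlabelled data actually buys, as a function of the labelling policy, is exactly the question the paper's asymptotic analysis answers.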
Statistical Function Tagging and Grammatical Relations of Myanmar Sentences
This paper describes context free grammar (CFG) based grammatical relations
for Myanmar sentences, combined with a corpus-based function tagging system. Part
of the challenge of statistical function tagging for Myanmar sentences comes
from the fact that Myanmar has free-phrase-order and a complex morphological
system. Function tagging is a pre-processing step to show grammatical relations
of Myanmar sentences. In the task of function tagging, which tags the function
of Myanmar sentences with correct segmentation, POS (part-of-speech) tagging
and chunking information, we use Naive Bayesian theory to disambiguate the
possible function tags of a word. We apply context free grammar (CFG) to find
out the grammatical relations of the function tags. We also create a functional
annotated tagged corpus for Myanmar and propose the grammar rules for Myanmar
sentences. Experiments show that our analysis achieves a good result with
simple sentences and complex sentences.
Comment: 16 pages, 7 figures, 8 tables, AIAA-2011 (India). arXiv admin note:
text overlap with arXiv:0912.1820 by another author
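The naive Bayes disambiguation step described above, choosing the most probable function tag for a word, can be sketched as follows. The words and tags are invented placeholders, not real Myanmar data:

```python
from collections import Counter, defaultdict

# Toy annotated corpus of (word, function_tag) pairs.
corpus = [("ka", "SUBJ"), ("ka", "SUBJ"), ("ka", "OBJ"),
          ("thi", "OBJ"), ("thi", "OBJ"), ("hma", "ADV")]

tag_counts = Counter(tag for _, tag in corpus)
word_given_tag = defaultdict(Counter)
for word, tag in corpus:
    word_given_tag[tag][word] += 1

vocab = {w for w, _ in corpus}

def best_tag(word):
    """Naive Bayes disambiguation: argmax_t P(t) * P(word | t), add-one smoothed."""
    def score(tag):
        p_tag = tag_counts[tag] / len(corpus)
        p_word = (word_given_tag[tag][word] + 1) / (tag_counts[tag] + len(vocab))
        return p_tag * p_word
    return max(tag_counts, key=score)

print(best_tag("ka"))  # SUBJ
```

In the paper this step additionally conditions on segmentation, POS and chunk information, and the resulting tags feed the CFG that derives the grammatical relations.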
Stochastic Discriminative EM
Stochastic discriminative EM (sdEM) is an online-EM-type algorithm for
discriminative training of probabilistic generative models belonging to the
exponential family. In this work, we introduce and justify this algorithm as a
stochastic natural gradient descent method, i.e. a method which accounts for
the information geometry in the parameter space of the statistical model. We
show how this learning algorithm can be used to train probabilistic generative
models by minimizing different discriminative loss functions, such as the
negative conditional log-likelihood and the hinge loss. The resulting models
trained by sdEM are always generative (i.e. they define a joint probability
distribution) and can therefore deal with missing data and latent
variables in a principled way, both during learning and when making
predictions. The performance of this method is illustrated by several text
classification problems for which a multinomial naive Bayes and a latent
Dirichlet allocation based classifier are learned using different
discriminative loss functions.
Comment: UAI 2014 paper + Supplementary Material. In Proceedings of the
Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014),
edited by Nevin L. Zhang and Jian Tian. AUAI Press
A Bayesian Approach to Identify Bitcoin Users
Bitcoin is a digital currency and electronic payment system operating over a
peer-to-peer network on the Internet. One of its most important properties is
the high level of anonymity it provides for its users. The users are identified
by their Bitcoin addresses, which are random strings in the public records of
transactions, the blockchain. When a user initiates a Bitcoin-transaction, his
Bitcoin client program relays messages to other clients through the Bitcoin
network. Monitoring the propagation of these messages and analyzing them
carefully can reveal hidden relations. In this paper, we develop a mathematical
model using a probabilistic approach to link Bitcoin addresses and transactions
to the originator IP address. To utilize our model, we carried out experiments
by installing more than a hundred modified Bitcoin clients distributed in the
network to observe as many messages as possible. During a two month observation
period we were able to identify several thousand Bitcoin clients and bind their
transactions to geographical locations.
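The kind of probabilistic linking described above can be caricatured with a first-relayer heuristic: the monitored peer from which a transaction is first heard is weak evidence for its origin, and evidence accumulates across an address's transactions. This is our simplified sketch with toy data, not the paper's model:

```python
from collections import Counter, defaultdict

# Toy relay log: (transaction_id, originating_address, first_relaying_ip),
# as a network of monitoring clients might record it.
relay_log = [
    ("tx1", "addrA", "10.0.0.5"),
    ("tx2", "addrA", "10.0.0.5"),
    ("tx3", "addrA", "10.0.0.9"),
    ("tx4", "addrB", "10.0.0.7"),
]

evidence = defaultdict(Counter)
for _tx, addr, ip in relay_log:
    evidence[addr][ip] += 1

def posterior(addr):
    """Normalised first-relay counts as a crude posterior over candidate IPs."""
    counts = evidence[addr]
    total = sum(counts.values())
    return {ip: c / total for ip, c in counts.items()}

print(posterior("addrA"))  # 10.0.0.5 carries 2/3 of the mass
```

With many monitoring clients and a two-month log, per-address posteriors like this sharpen considerably, which is what lets the authors bind transactions to originating clients.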