Tractability of Theory Patching
In this paper we consider the problem of `theory patching', in which we are
given a domain theory, some of whose components are indicated to be possibly
flawed, and a set of labeled training examples for the domain concept. The
theory patching problem is to revise only the indicated components of the
theory, such that the resulting theory correctly classifies all the training
examples. Theory patching is thus a type of theory revision in which revisions
are made to individual components of the theory. Our concern in this paper is
to determine for which classes of logical domain theories the theory patching
problem is tractable. We consider both propositional and first-order domain
theories, and show that the theory patching problem is equivalent to that of
determining what information contained in a theory is `stable' regardless of
what revisions might be performed to the theory. We show that determining
stability is tractable if the input theory satisfies two conditions: that
revisions to each theory component have monotonic effects on the classification
of examples, and that theory components act independently in the classification
of examples in the theory. We also show how the concepts introduced can be used
to determine the soundness and completeness of particular theory patching
algorithms.
Comment: See http://www.jair.org/ for any accompanying file
Committee-Based Sample Selection for Probabilistic Classifiers
In many real-world learning tasks, it is expensive to acquire a sufficient
number of labeled examples for training. This paper investigates methods for
reducing annotation cost by `sample selection'. In this approach, during
training the learning program examines many unlabeled examples and selects for
labeling only those that are most informative at each stage. This avoids
redundantly labeling examples that contribute little new information. Our work
follows on previous research on Query By Committee, extending the
committee-based paradigm to the context of probabilistic classification. We
describe a family of empirical methods for committee-based sample selection in
probabilistic classification models, which evaluate the informativeness of an
example by measuring the degree of disagreement between several model variants.
These variants (the committee) are drawn randomly from a probability
distribution conditioned by the training set labeled so far. The method was
applied to the real-world natural language processing task of stochastic
part-of-speech tagging. We find that all variants of the method achieve a
significant reduction in annotation cost, although their computational
efficiency differs. In particular, the simplest variant, a two member committee
with no parameters to tune, gives excellent results. We also show that sample
selection yields a significant reduction in the size of the model used by the
tagger.
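The disagreement-based selection described above can be illustrated with a minimal sketch. It assumes each committee member has already assigned a label to each unlabeled example, and it uses normalized vote entropy as the disagreement measure; the function names, the threshold, and the choice of normalization are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Normalized entropy of the committee's label votes.

    Returns 0.0 when all members agree and 1.0 when every member
    votes for a different label. (Hypothetical helper for illustration.)
    """
    k = len(votes)
    counts = Counter(votes)
    if k < 2 or len(counts) == 1:
        return 0.0
    entropy = -sum((c / k) * math.log(c / k) for c in counts.values())
    return entropy / math.log(k)


def select_for_labeling(committee_votes, threshold=0.5):
    """Return indices of examples whose disagreement exceeds the threshold.

    committee_votes: one list of labels per example, one label per member.
    The threshold is an assumed tuning parameter.
    """
    return [
        i
        for i, votes in enumerate(committee_votes)
        if vote_entropy(votes) > threshold
    ]


# Two-member committee tagging three tokens; only the second is disputed.
votes = [["NN", "NN"], ["NN", "VB"], ["DT", "DT"]]
selected = select_for_labeling(votes)
```

With a two-member committee this reduces to selecting exactly those examples on which the two members disagree, so the threshold plays no real role in that case.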
Metaphor Identification in Large Texts Corpora
Identifying metaphorical language use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of the scope of metaphorical phrases and the size of the annotated corpora. The algorithms' performance in identifying linguistic phrases as metaphorical or literal is compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.
United States. Intelligence Advanced Research Projects Activity (IARPA); United States. Dept. of Defense (U.S. Army Research Laboratory Contract W911NF-12-C-0021)
All Who Wander: On the Prevalence and Characteristics of Multi-community Engagement
Although analyzing user behavior within individual communities is an active
and rich research domain, people usually interact with multiple communities
both on- and off-line. How do users act in such multi-community environments?
Although there are a host of intriguing aspects to this question, it has
received much less attention in the research community in comparison to the
intra-community case. In this paper, we examine three aspects of
multi-community engagement: the sequence of communities that users post to, the
language that users employ in those communities, and the feedback that users
receive, using longitudinal posting behavior on Reddit as our main data source,
and DBLP for auxiliary experiments. We also demonstrate the effectiveness of
features drawn from these aspects in predicting users' future level of
activity.
One might expect that a user's trajectory mimics the "settling-down" process
in real life: an initial exploration of sub-communities before settling down
into a few niches. However, we find that the users in our data continually post
in new communities; moreover, as time goes on, they post increasingly evenly
among a more diverse set of smaller communities. Interestingly, it seems that
users that eventually leave the community are "destined" to do so from the very
beginning, in the sense of showing significantly different "wandering" patterns
very early on in their trajectories; this finding has potentially important
design implications for community maintainers. Our multi-community perspective
also allows us to investigate the "situation vs. personality" debate from
language usage across different communities.
Comment: 11 pages, data available at
https://chenhaot.com/pages/multi-community.html, Proceedings of WWW 2015
(updated references)
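One simple way to quantify how evenly a user spreads posts across communities is the Shannon entropy of the per-community post counts. The sketch below is an illustrative assumption about how such evenness could be measured; the abstract does not state which measure the authors use, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def posting_entropy(communities):
    """Shannon entropy (in bits) of a user's posts across communities.

    communities: one community name per post.
    Higher values mean posts are spread more evenly over more communities.
    """
    counts = Counter(communities)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A user who settles into one niche versus one who keeps wandering.
settled = ["python"] * 4
wandering = ["python", "aww", "news", "askscience"]
settled_score = posting_entropy(settled)
wandering_score = posting_entropy(wandering)
```

Computing this score over successive windows of a user's posting history would show whether the spread grows or shrinks over time.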
An Army of Me: Sockpuppets in Online Discussion Communities
In online discussion communities, users can interact and share information
and opinions on a wide variety of topics. However, some users may create
multiple identities, or sockpuppets, and engage in undesired behavior by
deceiving others or manipulating discussions. In this work, we study
sockpuppetry across nine discussion communities, and show that sockpuppets
differ from ordinary users in terms of their posting behavior, linguistic
traits, and social network structure. Sockpuppets tend to start fewer
discussions, write shorter posts, use more personal pronouns such as "I", and
have more clustered ego-networks. Further, pairs of sockpuppets controlled by
the same individual are more likely to interact on the same discussion at the
same time than pairs of ordinary users. Our analysis suggests a taxonomy of
deceptive behavior in discussion communities. Pairs of sockpuppets can vary in
their deceptiveness, i.e., whether they pretend to be different users, or their
supportiveness, i.e., if they support arguments of other sockpuppets controlled
by the same user. We apply these findings to a series of prediction tasks,
notably, to identify whether a pair of accounts belongs to the same underlying
user or not. Altogether, this work presents a data-driven view of deception in
online discussion communities and paves the way towards the automatic detection
of sockpuppets.
Comment: 26th International World Wide Web conference 2017 (WWW 2017)
On the Impact of Emotions on Author Profiling
This is the author’s version of a work that was accepted for publication in Information Processing and Management. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information Processing and Management 52 (2016) 73–92. DOI 10.1016/j.ipm.2015.06.003.
[EN] In this paper, we investigate the impact of emotions on author profiling, specifically on identifying
age and gender. First, we propose the EmoGraph method for modelling the way people use
language to express themselves on the basis of an emotion-labelled graph. We apply this
representation model to identify gender and age in the Spanish partition of the PAN-AP-13
corpus, obtaining results comparable to the best-performing systems of the PAN Lab of CLEF.
© 2015 Elsevier B.V. All rights reserved.
The work of the first author was partially funded by Autoritas Consulting SA and by Spanish Ministry of Economics under grant ECOPORTUNITY IPT-2012-1220-430000. The work of the second author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP 7 Marie Curie, the DIANA APPLICATIONS: Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. A special mention to Maria Dolores Rangel Pardo for her linguistic contribution to this investigation.
Rangel-Pardo, FM.; Rosso, P. (2016). On the Impact of Emotions on Author Profiling. Information Processing and Management. 52(1):73-92. https://doi.org/10.1016/j.ipm.2015.06.003