From news to comment: Resources and benchmarks for parsing the language of web 2.0
We investigate the problem of parsing the noisy language of social media. We evaluate four Wall-Street-Journal-trained statistical parsers (Berkeley, Brown, Malt and MST) on a new dataset containing 1,000 phrase structure trees for sentences from microblogs (tweets) and discussion forum posts. We compare the four parsers on their ability to produce Stanford dependencies for these Web 2.0 sentences. We find that the parsers have a particular problem with tweets and that a substantial part of this problem is related to POS tagging accuracy. We attempt three retraining experiments involving Malt, Brown and an in-house Berkeley-style parser and obtain a statistically significant improvement for all three parsers.
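
As a concrete illustration of how such parser comparisons are typically scored, the sketch below computes unlabeled and labeled attachment scores (UAS/LAS) over gold and predicted dependency trees. It is a minimal, hypothetical example (the helper name and toy trees are assumptions), not the evaluation script used in the paper.

    # Hypothetical attachment-score evaluation of the kind used when comparing
    # parser output against gold Stanford dependencies; not the paper's script.
    def attachment_scores(gold, predicted):
        """gold, predicted: lists of sentences, each a list of (head, label) per token."""
        total = correct_heads = correct_labeled = 0
        for gold_sent, pred_sent in zip(gold, predicted):
            for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
                total += 1
                if g_head == p_head:
                    correct_heads += 1
                    if g_label == p_label:
                        correct_labeled += 1
        return correct_heads / total, correct_labeled / total  # UAS, LAS

    if __name__ == "__main__":
        gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]   # toy gold tree
        pred = [[(2, "nsubj"), (0, "root"), (2, "iobj")]]  # one wrong label
        uas, las = attachment_scores(gold, pred)
        print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=1.00 LAS=0.67
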
Learning to Embed Words in Context for Syntactic Tasks
We present models for embedding words in the context of surrounding words.
Such models, which we refer to as token embeddings, represent the
characteristics of a word that are specific to a given context, such as word
sense, syntactic category, and semantic role. We explore simple, efficient
token embedding models based on standard neural network architectures. We learn
token embeddings on a large amount of unannotated text and evaluate them as
features for part-of-speech taggers and dependency parsers trained on much
smaller amounts of annotated data. We find that predictors endowed with token
embeddings consistently outperform baseline predictors across a range of
context window and training set sizes. Comment: Accepted by the ACL 2017 Repl4NLP workshop.
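
To make the idea of a token embedding concrete, the sketch below builds a context-dependent feature vector for a word by concatenating pretrained type-level vectors over a small window around it. This is an illustrative simplification under assumed inputs, not the neural token-embedding models trained in the paper.

    import numpy as np

    # Illustrative only: a context-window token representation built from
    # pretrained type-level vectors, showing the "word in context -> feature
    # vector" idea behind token embeddings.
    def token_embedding(tokens, index, vectors, dim=50, window=2):
        """Concatenate the vectors of the target word and its +/- window neighbours."""
        parts = []
        for i in range(index - window, index + window + 1):
            if 0 <= i < len(tokens):
                parts.append(vectors.get(tokens[i], np.zeros(dim)))
            else:
                parts.append(np.zeros(dim))  # pad outside the sentence
        return np.concatenate(parts)  # feature vector for a downstream tagger/parser

    vectors = {"the": np.random.rand(50), "cat": np.random.rand(50)}  # toy vectors
    print(token_embedding(["the", "cat", "sat"], 1, vectors).shape)   # (250,)
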
Tweet Acts: A Speech Act Classifier for Twitter
Speech acts are a way to conceptualize speech as action. This holds true for
communication on any platform, including social media platforms such as
Twitter. In this paper, we explored speech act recognition on Twitter by
treating it as a multi-class classification problem. We created a taxonomy of
six speech acts for Twitter and proposed a set of semantic and syntactic
features. We trained and tested a logistic regression classifier using a data
set of manually labelled tweets. Our method achieved state-of-the-art
performance with an average F1 score of more than . We also explored
classifiers at three different granularities (Twitter-wide, type-specific and
topic-specific) in order to find the right balance between generalization and
overfitting for our task. Comment: ICWSM'16, May 17-20, Cologne, Germany. In Proceedings of the 10th
AAAI Conference on Weblogs and Social Media (ICWSM 2016), Cologne, Germany.
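
The classification setup can be sketched as a standard multi-class logistic regression over tweet features. The tweets, labels and TF-IDF features below are placeholders for illustration, not the paper's annotated data or its semantic and syntactic feature set.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Minimal multi-class logistic-regression sketch; data and labels are toy.
    train_tweets = ["is anyone going to the game tonight?",
                    "just released our new album, check it out",
                    "you should really read this paper"]
    train_acts = ["question", "statement", "recommendation"]  # hypothetical speech-act labels

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_tweets, train_acts)
    print(clf.predict(["anyone know when the talk starts?"]))
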
Towards Syntactic Iberian Polarity Classification
Lexicon-based methods using syntactic rules for polarity classification rely
on parsers that are dependent on the language and on treebank guidelines. Thus,
rules are also dependent and require adaptation, especially in multilingual
scenarios. We tackle this challenge in the context of the Iberian Peninsula,
releasing the first symbolic syntax-based Iberian system with rules shared
across five official languages: Basque, Catalan, Galician, Portuguese and
Spanish. The model is made available. Comment: 7 pages, 5 tables. Contribution to the 8th Workshop on Computational
Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA-2017)
at EMNLP 2017.
Universal, Unsupervised (Rule-Based), Uncovered Sentiment Analysis
We present a novel unsupervised approach for multilingual sentiment analysis
driven by compositional syntax-based rules. On the one hand, we exploit some of
the main advantages of unsupervised algorithms: (1) the interpretability of
their output, in contrast with most supervised models, which behave as a black
box and (2) their robustness across different corpora and domains. On the other
hand, by introducing the concept of compositional operations and exploiting
syntactic information in the form of universal dependencies, we tackle one of
their main drawbacks: their rigidity on data that are structured differently
depending on the language concerned. Experiments show an improvement both over
existing unsupervised methods, and over state-of-the-art supervised models when
evaluating outside their corpus of origin. Experiments also show how the same
compositional operations can be shared across languages. The system is
available at http://www.grupolys.org/software/UUUSA/ Comment: 19 pages, 5 Tables, 6 Figures. This is the authors' version of a work
that was accepted for publication in Knowledge-Based Systems.
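
The following sketch illustrates what compositional operations over a dependency tree can look like: negators flip, and intensifiers scale, the polarity propagated to a head word. The lexicon, operation triggers and toy tree are assumptions for illustration, not the UUUSA rule set itself.

    # Sketch of compositional, rule-based polarity over a dependency tree.
    LEXICON = {"good": 1.0, "bad": -1.0}       # assumed polarity lexicon
    INTENSIFIERS = {"very": 1.5}               # assumed intensification weights
    NEGATORS = {"not"}

    def node_polarity(node):
        """node: dict with 'form' and 'children' (list of nodes), universal-dependency style."""
        score = LEXICON.get(node["form"], 0.0)
        weight = 1.0
        for child in node.get("children", []):
            form = child["form"]
            if form in NEGATORS:
                weight *= -1.0                     # negation flips the head's polarity
            elif form in INTENSIFIERS:
                weight *= INTENSIFIERS[form]       # intensifiers scale it
            else:
                score += node_polarity(child)      # otherwise, accumulate child polarity
        return weight * score

    # "not very good" with 'good' as the syntactic head
    tree = {"form": "good", "children": [{"form": "not"}, {"form": "very"}]}
    print(node_polarity(tree))  # -1.5
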
A Deep Network Model for Paraphrase Detection in Short Text Messages
This paper is concerned with paraphrase detection. The ability to detect
similar sentences written in natural language is crucial for several
applications, such as text mining, text summarization, plagiarism detection,
authorship authentication and question answering. Given two sentences, the
objective is to detect whether they are semantically identical. An important
insight from this work is that existing paraphrase systems perform well when
applied on clean texts, but they do not necessarily deliver good performance
against noisy texts. Challenges with paraphrase detection on user generated
short texts, such as Twitter, include language irregularity and noise. To cope
with these challenges, we propose a novel deep neural network-based approach
that relies on coarse-grained sentence modeling using a convolutional neural
network and a long short-term memory model, combined with a specific
fine-grained word-level similarity matching model. Our experimental results
show that the proposed approach outperforms existing state-of-the-art
approaches on user-generated noisy social media data, such as Twitter texts,
and achieves highly competitive performance on a cleaner corpus.
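
A schematic PyTorch version of such a combined model is sketched below: a CNN + LSTM encoder produces coarse-grained sentence vectors, and a word-level dot-product similarity supplies a fine-grained matching signal. All dimensions, the vocabulary and the final classifier are placeholder assumptions rather than the authors' architecture.

    import torch
    import torch.nn as nn

    # Schematic paraphrase model: coarse-grained sentence encoding (CNN + LSTM)
    # plus a fine-grained word-level similarity feature; sizes are toy values.
    class ParaphraseNet(nn.Module):
        def __init__(self, vocab_size=1000, emb=64, hidden=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.classify = nn.Linear(2 * hidden + 1, 2)  # two sentence vectors + similarity feature

        def encode(self, ids):
            x = self.embed(ids)                           # (batch, len, emb)
            x = torch.relu(self.conv(x.transpose(1, 2)))  # (batch, hidden, len)
            _, (h, _) = self.lstm(x.transpose(1, 2))      # final hidden state as sentence vector
            return self.embed(ids), h[-1]                 # word embeddings + sentence vector

        def forward(self, ids_a, ids_b):
            words_a, sent_a = self.encode(ids_a)
            words_b, sent_b = self.encode(ids_b)
            # fine-grained signal: max dot-product similarity between word pairs
            sim = torch.einsum("bik,bjk->bij", words_a, words_b)
            sim = sim.max(dim=2).values.mean(dim=1, keepdim=True)
            return self.classify(torch.cat([sent_a, sent_b, sim], dim=1))

    model = ParaphraseNet()
    a = torch.randint(0, 1000, (2, 7))   # two toy sentence pairs, length 7
    b = torch.randint(0, 1000, (2, 7))
    print(model(a, b).shape)             # torch.Size([2, 2]) -> paraphrase / not
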