Simple open stance classification for rumour analysis
Stance classification determines the attitude, or stance, expressed in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification on Twitter, for rumour and veracity classification. The approach profits from a novel set of automatically identifiable, problem-specific features, which significantly boost classifier accuracy and achieve results above the state of the art on recent benchmark datasets. This calls into question the value of using sophisticated, complex models for stance classification without first doing informed feature extraction.
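As an illustration only, the sketch below shows how a simple feature-based stance classifier of this general kind might be assembled with scikit-learn. The surface features (punctuation, negation cues, URL presence) and the toy data are invented for the example and are not the feature set or classifier described in the paper.

```python
# Hypothetical sketch of a feature-based stance classifier; not the paper's feature set.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

NEGATION_CUES = {"not", "no", "never", "fake", "false", "hoax"}

def tweet_features(text):
    """Extract a few cheap surface features from a tweet."""
    tokens = text.lower().split()
    return {
        "has_question_mark": "?" in text,
        "has_exclamation": "!" in text,
        "has_url": bool(re.search(r"https?://", text)),
        "negation_count": sum(t.strip(".,!?") in NEGATION_CUES for t in tokens),
        "length": len(tokens),
    }

# Toy training data: tweets paired with stance labels (support / deny / query / comment).
tweets = ["This is definitely true, confirmed by police",
          "That is a hoax, never happened",
          "Is this real? Any source?",
          "Interesting times we live in"]
labels = ["support", "deny", "query", "comment"]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit([tweet_features(t) for t in tweets], labels)
print(model.predict([tweet_features("Never happened, total hoax")]))
```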
Helping crisis responders find the informative needle in the tweet haystack
Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis situation in order to design an effective response. However, with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability).
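The following is a minimal sketch of the two-stage filter-then-tag idea described above: a binary informativeness filter followed by an actionability tagger. The classifiers, features, training examples and category names are placeholders invented for the example; they are not the Emina implementation or its full eight-category scheme.

```python
# Illustrative two-stage pipeline: informativeness filter, then actionability tagging.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Stage 1: binary informativeness filter (toy examples).
informative_train = [("Bridge on 5th Ave collapsed, people trapped", 1),
                     ("Thoughts and prayers to everyone", 0),
                     ("Water levels rising near the school", 1),
                     ("Can't believe this weather lol", 0)]
# Stage 2: actionability tagging (placeholder categories echoing the abstract's examples).
actionable_train = [("Bridge on 5th Ave collapsed, people trapped", "people_in_need"),
                    ("Downed power lines blocking rescue crews", "threat_to_rescue"),
                    ("Flood water now reaching Main Street", "environment_change")]

def train(pairs):
    texts, ys = zip(*pairs)
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    return clf.fit(texts, ys)

informativeness = train(informative_train)
actionability = train(actionable_train)

def process(message):
    """Return an actionability tag for informative messages, or None if filtered out."""
    if informativeness.predict([message])[0] == 0:
        return None
    return actionability.predict([message])[0]

print(process("People trapped under rubble near the station"))
```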
A HMM POS Tagger for Micro-blogging Type Texts
The high volume of communication via micro-blogging type messages has created an increased demand for text processing tools customised to the unstructured text genre. Available text processing tools, developed on structured texts, have been shown to deteriorate significantly when used on unstructured, micro-blogging type texts. In this paper, we present the results of testing an HMM-based POS (part-of-speech) tagging model customised for unstructured texts. We also evaluated the tagger against published CRF-based state-of-the-art POS tagging models customised for Tweet messages, using three publicly available Tweet corpora. Finally, we did cross-validation tests with both taggers by training them on one Tweet corpus and testing them on another.
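For orientation, here is a generic supervised HMM POS tagger built from NLTK components, trained on one corpus and scored on held-out sentences. It uses the standard newswire treebank sample rather than a Tweet corpus, and it is not the customised model evaluated in the paper; it only illustrates the train-on-one-corpus, test-on-another workflow.

```python
# Generic HMM POS tagger sketch with NLTK (not the paper's customised tagger).
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

nltk.download("treebank", quiet=True)

tagged_sents = list(treebank.tagged_sents())
train_sents, test_sents = tagged_sents[:3000], tagged_sents[3000:3200]

# Lidstone smoothing so unseen words at test time still get non-zero emission probability.
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(
    train_sents, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)
)

print(tagger.tag("This tagger was trained on newswire , not tweets".split()))

# Token-level accuracy on the held-out slice.
correct = total = 0
for sent in test_sents:
    words = [w for w, _ in sent]
    for (_, gold), (_, pred) in zip(sent, tagger.tag(words)):
        correct += gold == pred
        total += 1
print("held-out accuracy:", correct / total)
```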
Efficient named entity annotation through pre-empting
Linguistic annotation is time-consuming and expensive. One common annotation task is to mark entities - such as names of people, places and organisations - in text. In a document, many segments of text contain no entities at all. We show that these segments are worth skipping, and demonstrate a technique for reducing the amount of entity-less text examined by annotators, which we call "pre-empting". This technique is evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for a corpus of the same size.
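The general idea can be sketched as a cheap pre-scorer that estimates how likely a segment is to contain entities, with low-scoring segments skipped before annotation. The capitalisation heuristic and threshold below are placeholders for illustration only, not the scoring method used in the paper.

```python
# Illustrative pre-empting filter: send only segments likely to contain entities to annotators.
import re

def entity_likelihood(sentence):
    """Crude proxy score: fraction of non-initial tokens that look like proper nouns."""
    tokens = sentence.split()
    if len(tokens) < 2:
        return 0.0
    capitalised = sum(1 for t in tokens[1:] if re.match(r"[A-Z][a-z]+", t))
    return capitalised / (len(tokens) - 1)

def select_for_annotation(sentences, threshold=0.1):
    """Keep sentences worth sending to annotators; skip likely entity-less ones."""
    return [s for s in sentences if entity_likelihood(s) >= threshold]

docs = ["the meeting was rescheduled to next week",
        "Angela Merkel met officials in Berlin on Tuesday",
        "nothing much happened today",
        "Apple opened a new office in Leeds"]
print(select_for_annotation(docs))
```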
TweetLID : a benchmark for tweet language identification
Language identification, as the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues still remain unresolved: (1) distinction of similar languages, (2) detection of multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the participants of the shared task helped us shed some light on the shortcomings of state-of-the-art language identification systems, and gave insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study the aforementioned issues in language identification within a common setting that enables results to be compared with one another.
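To give a concrete sense of the kind of per-tweet evaluation such a benchmark enables, here is a toy sketch using the off-the-shelf langid.py identifier on short texts. The example tweets, gold labels and language set are invented for illustration and are not the TweetLID data or evaluation framework.

```python
# Toy short-text language identification evaluation with langid.py (pip install langid).
import langid

# Restrict the identifier to a TweetLID-style language set (an assumption for this sketch).
langid.set_languages(["en", "es", "pt", "ca", "eu", "gl"])

gold = [("que bueno verte por aqui", "es"),
        ("see you at the match tonight", "en"),
        ("bom dia a todos", "pt")]

correct = 0
for text, label in gold:
    predicted, score = langid.classify(text)
    print(f"{text!r} -> {predicted} (gold: {label})")
    correct += predicted == label

print("accuracy:", correct / len(gold))
```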
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
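Once a clustering has been produced with a chosen corpus and number of classes (the hyper-parameters the paper examines), the bit-string cluster paths are typically consumed as features at several prefix lengths. The sketch below assumes the common tab-separated "paths" output format (bit-string, word, frequency) used by widely available Brown clustering tools; the format assumption and the prefix lengths are illustrative, not prescribed by the paper.

```python
# Sketch of consuming Brown clustering output as sequence-labelling features.
# Assumes a tab-separated "paths" file: bit-string<TAB>word<TAB>frequency per line.
from pathlib import Path

def load_brown_paths(path):
    """Map each word to its full bit-string cluster path."""
    word_to_path = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        bits, word, _freq = line.split("\t")
        word_to_path[word] = bits
    return word_to_path

def cluster_features(word, word_to_path, prefix_lengths=(4, 6, 10, 20)):
    """Prefix features at several granularities, a common way to use Brown clusters;
    the prefix lengths here are illustrative defaults, not tuned values."""
    bits = word_to_path.get(word)
    if bits is None:
        return {"brown:unk": True}
    return {f"brown:{k}={bits[:k]}": True for k in prefix_lengths}
```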
The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy
Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowdsourcing interfaces for NLP classification and selection tasks. The entire workflow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment.
Mental health-related conversations on social media and crisis episodes: a time-series regression analysis
We aimed to investigate whether daily fluctuations in mental health-relevant Twitter posts are associated with daily fluctuations in mental health crisis episodes. We conducted a primary and replicated time-series analysis of retrospectively collected data from Twitter and two London mental healthcare providers. Daily numbers of ‘crisis episodes’ were defined as incident inpatient, home treatment team and crisis house referrals between 2010 and 2014. Higher volumes of depression and schizophrenia tweets were associated with higher numbers of same-day crisis episodes for both sites. After adjusting for temporal trends, seven-day lagged analyses showed significant positive associations on day 1, changing to negative associations by day 4 and reverting to positive associations by day 7. There was a 15% increase in crisis episodes on days with above-median schizophrenia-related Twitter posts. A temporal association was thus found between Twitter-wide mental health-related social media content and crisis episodes in mental healthcare, replicated across two services. Seven-day associations are consistent with both precipitating and longer-term risk associations. Sizes of effects were large enough to have potential local and national relevance, and further research is needed to evaluate how services might better anticipate times of higher risk and identify the most vulnerable groups.
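A lagged count regression of this general kind can be sketched as below, regressing daily crisis episode counts on same-day and one- to seven-day lagged tweet volumes with a Poisson GLM from statsmodels. The data are synthetic and the specification (Poisson family, lag structure, no trend or seasonal covariates) is an assumption for illustration, not the authors' model.

```python
# Synthetic sketch of a lagged time-series count regression (not the study's actual model).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
days = pd.date_range("2010-01-01", periods=400, freq="D")
df = pd.DataFrame({
    "tweets": rng.poisson(200, size=len(days)),    # daily mental health-related tweet counts
    "episodes": rng.poisson(12, size=len(days)),   # daily crisis episode counts
}, index=days)

# Same-day plus 1..7-day lagged tweet volumes as predictors.
for lag in range(1, 8):
    df[f"tweets_lag{lag}"] = df["tweets"].shift(lag)
df = df.dropna()

X = sm.add_constant(df[["tweets"] + [f"tweets_lag{lag}" for lag in range(1, 8)]])
model = sm.GLM(df["episodes"], X, family=sm.families.Poisson()).fit()
print(model.summary())
```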
UFPRSheffield: Contrasting Rule-based and Support Vector Machine Approaches to Time Expression Identification in Clinical TempEval
We present two approaches to time expression identification, as entered into SemEval-2015 Task 6, Clinical TempEval. The first is a comprehensive rule-based approach that favoured recall, and which achieved the best recall for time expression identification in Clinical TempEval. The second is an SVM-based system built using readily available components, which was able to achieve a competitive F1 in a short development time. We discuss how the two approaches perform relative to each other, and how characteristics of the corpus affect the suitability of different approaches and their outcomes.
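As a flavour of the rule-based, recall-oriented style of system contrasted above, here is a minimal regex-based time expression spotter. The patterns are placeholders far simpler than either submitted system; the SVM alternative would be a standard token classifier over readily available components such as scikit-learn.

```python
# Minimal recall-oriented rule-based time expression spotter (illustrative patterns only).
import re

TIME_PATTERNS = [
    r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b",            # e.g. 03/04/2015, 3-4-15
    r"\b\d{4}\b",                                     # bare years, favouring recall
    r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\b",
    r"\b(?:today|yesterday|tomorrow|tonight|morning|afternoon|evening|overnight)\b",
    r"\b\d+\s+(?:minutes?|hours?|days?|weeks?|months?|years?)\b",
]
TIME_RE = re.compile("|".join(TIME_PATTERNS), re.IGNORECASE)

def find_time_expressions(text):
    """Return (start, end, matched_text) spans for candidate time expressions."""
    return [(m.start(), m.end(), m.group()) for m in TIME_RE.finditer(text)]

note = "Patient seen yesterday morning; follow-up on 03/04/2015, repeat bloods in 2 weeks."
print(find_time_expressions(note))
```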
- …
