Simple open stance classification for rumour analysis
Stance classification determines the attitude, or stance, in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification in Twitter, for rumour and veracity classification. The approach profits from a novel set of automatically identifiable problem-specific features, which significantly boost classifier accuracy and achieve above state-of-the-art results on recent benchmark datasets. This calls into question the value of using complex sophisticated models for stance classification without first doing informed feature extraction
Helping crisis responders find the informative needle in the tweet haystack
Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis in order to design an effective response. However, with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability)
Efficient named entity annotation through pre-empting
Linguistic annotation is time-consuming and expensive. One common annotation task is to mark entities - such as names of people, places and organisations - in text. In a document, many segments of text often contain no entities at all. We show that these segments are worth skipping, and demonstrate a technique for reducing the amount of entity-less text examined by annotators, which we call "preempting". This technique is evaluated in a crowdsourcing scenario, where it provides downstream performance improvements for the same size corpus
A HMM POS Tagger for Micro-blogging Type Texts
The high volume of communication via micro-blogging type messages has created an increased demand for text processing tools customised for the unstructured text genre. The performance of available text processing tools developed on structured texts has been shown to deteriorate significantly when used on unstructured, micro-blogging type texts. In this paper, we present the results of testing an HMM-based POS (Part-Of-Speech) tagging model customized for unstructured texts. We also evaluated the tagger against published CRF-based state-of-the-art POS tagging models customized for Tweet messages using three publicly available Tweet corpora. Finally, we did cross-validation tests with both the taggers by training them on one Tweet corpus and testing them on another one
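As a rough illustration of how an HMM tagger assigns tags, the sketch below decodes a token sequence with the standard Viterbi algorithm over a bigram HMM. It is not the tagger described in the abstract: the function name `viterbi`, the tag set, the probability tables and the `unk` fallback for unseen words are all assumptions made up for this example.

```python
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Most probable tag sequence for `tokens` under a bigram HMM.

    `unk` is an assumed fallback emission probability for unseen words,
    which are common in micro-blog text.
    """
    if not tokens:
        return []
    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{}]
    for t in tags:
        emit = emit_p[t].get(tokens[0], unk)
        best[0][t] = (math.log(start_p[t]) + math.log(emit), None)
    for i in range(1, len(tokens)):
        best.append({})
        for t in tags:
            emit = math.log(emit_p[t].get(tokens[i], unk))
            best[i][t] = max(
                (best[i - 1][p][0] + math.log(trans_p[p][t]) + emit, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

# Toy parameters, invented for illustration only.
TAGS = ["PRON", "VERB", "NOUN"]
START = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}
TRANS = {
    "PRON": {"PRON": 0.05, "VERB": 0.8, "NOUN": 0.15},
    "VERB": {"PRON": 0.2, "VERB": 0.1, "NOUN": 0.7},
    "NOUN": {"PRON": 0.1, "VERB": 0.5, "NOUN": 0.4},
}
EMIT = {
    "PRON": {"i": 0.9},
    "VERB": {"luv": 0.3, "love": 0.5},
    "NOUN": {"pizza": 0.6, "luv": 0.05},
}

print(viterbi(["i", "luv", "pizza"], TAGS, START, TRANS, EMIT))
```

In a real tagger the tables would be estimated from an annotated Tweet corpus, and the handling of unknown words would need more care than a single constant.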
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal
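To make "ngram mutual information" concrete, the sketch below computes the average mutual information between adjacent word classes, the quantity that Brown clustering seeks to keep high as it merges classes. It does not implement the clustering itself or the paper's utility model; the function name `average_mutual_information` and the toy corpus and class assignments are assumptions for illustration.

```python
import math
from collections import Counter

def average_mutual_information(tokens, cluster_of):
    """Average mutual information (in bits) between adjacent word classes."""
    classes = [cluster_of[t] for t in tokens]
    bigrams = Counter(zip(classes, classes[1:]))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    # Marginal counts of each class as the left and right element of a bigram.
    left, right = Counter(), Counter()
    for (a, b), n in bigrams.items():
        left[a] += n
        right[b] += n
    ami = 0.0
    for (a, b), n in bigrams.items():
        p_ab = n / total
        ami += p_ab * math.log2(p_ab / ((left[a] / total) * (right[b] / total)))
    return ami

# Toy corpus and two hypothetical class assignments.
corpus = "the cat sat the dog sat".split()
grouped = {"the": 0, "cat": 1, "dog": 1, "sat": 2}    # cat and dog share a class
scrambled = {"the": 0, "cat": 1, "dog": 2, "sat": 0}  # the and sat share a class

print(average_mutual_information(corpus, grouped))
print(average_mutual_information(corpus, scrambled))
```

On this toy corpus the assignment that groups "cat" with "dog" scores higher than the scrambled one, which is the kind of signal the clustering relies on. How the chosen number of classes interacts with corpus size is the tuning question the abstract raises.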
The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy
Crowdsourcing is an increasingly popular, collaborative approach for acquiring annotated corpora. Despite this, reuse of corpus conversion tools and user interfaces between projects is still problematic, since these are not generally made available. This demonstration will introduce the new, open-source GATE Crowdsourcing plugin, which offers infrastructural support for mapping documents to crowdsourcing units and back, as well as automatically generating reusable crowdsourcing interfaces for NLP classification and selection tasks. The entire workflow will be demonstrated on: annotating named entities; disambiguating words and named entities with respect to DBpedia URIs; annotation of opinion holders and targets; and sentiment
Novel psychoactive substances: An investigation of temporal trends in social media and electronic health records
Background: Public health monitoring is commonly undertaken in social media but has never been combined with data analysis from electronic health records. This study aimed to investigate the relationship between the emergence of novel psychoactive substances (NPS) in social media and their appearance in a large mental health database.
Methods: Insufficient numbers of mentions of other NPS in case records meant that the study focused on mephedrone. Data were extracted on the number of mephedrone (i) references in the clinical record at the South London and Maudsley NHS Trust, London, UK, (ii) mentions in Twitter, (iii) related searches in Google and (iv) visits in Wikipedia. The characteristics of current mephedrone users in the clinical record were also established.
Results: Increased activity related to mephedrone searches in Google and visits in Wikipedia preceded a peak in mephedrone-related references in the clinical record followed by a spike in the other 3 data sources in early 2010, when mephedrone was assigned a ‘class B’ status. Features of current mephedrone users widely matched those from community studies.
Conclusions: Combined analysis of information from social media and data from mental health records may assist public health and clinical surveillance for certain substance-related events of interest. There exists potential for early warning systems for health-care practitioners
UFPRSheffield: Contrasting Rule-based and Support Vector Machine Approaches to Time Expression Identification in Clinical TempEval
We present two approaches to time expression identification, as entered into SemEval2015 Task 6, Clinical TempEval. The first is a comprehensive rule-based approach that favoured recall, and which achieved the best recall for time expression identification in Clinical TempEval. The second is an SVM-based system built using readily available components, which was able to achieve a competitive F1 in a short development time. We discuss how the two approaches perform relative to each other, and how characteristics of the corpus affect the suitability of different approaches and their outcomes
Analysis of Temporal Expressions Annotated in Clinical Notes
Annotating the semantics of time in language is important. THYME is a recent temporal annotation standard for clinical texts. This paper examines temporal expressions in the first major corpus released under this standard. It investigates where the standard has proven difficult to apply, and gives a series of recommendations regarding temporal annotation in this important domain