Search CORE

1,339 research outputs found

Cross-Lingual Classification of Crisis Data

Author: A Tonon
Grégoire Burel
H Gao
J Rogstadius
N Cristianini
Prashant Khare
R Navigli
R Power
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 18/09/2018
Field of study

Many citizens nowadays flock to social media during crises to share or acquire the latest information about the event. Due to the sheer volume of data typically circulated during such events, it is necessary to be able to efficiently filter out irrelevant posts, thus focusing attention on the posts that are truly relevant to the crisis. Current methods for classifying the relevance of posts to a crisis or set of crises typically struggle to deal with posts in different languages, and it is not viable during rapidly evolving crisis situations to train new models for each language. In this paper we test statistical and semantic classification approaches on cross-lingual datasets from 30 crisis events, consisting of posts written mainly in English, Spanish, and Italian. We experiment with scenarios where the model is trained on one language and tested on another, and where the data is translated to a single language. We show that the addition of semantic features extracted from external knowledge bases improve accuracy over a purely statistical model

Crossref

Open Research Online (The Open University)

White Rose Research Online

Comparing Supervised Learning Methods for Classifying Spanish Tweets Comparación de Métodos de Aprendizaje Supervisado para la Clasificación de Tweets en Español

Author: Ernesto Cuadros
Javier Tejada
Jorge Valverde
Publication venue
Publication date: 03/04/2020
Field of study

Resumen: El presente paper presenta un conjunto de experimentos para abordar la tarea de clasificación global de polaridad de tweets en español del TASS 2015. En este trabajo se hace una comparación entre los principales algoritmos de clasificación supervisados para el Análisis de Sentimientos: Support Vector Machines, Naive Bayes, Entropía Máxima y Árboles de Decisión. Se propone también mejorar el rendimiento de estos clasificadores utilizando una técnica de reducción de clases y luego un algoritmo de votación llamado Naive Voting. Los resultados muestran que nuestra propuesta supera los otros métodos de aprendizaje de máquina propuestos en este trabajo. Palabras clave: Análisis de Sentimientos, Métodos Supervisados, Tweets Españoles Abstract: This paper presents a set of experiments to address the global polarity classification task of Spanish Tweets of TASS 2015. In this work, we compare the main supervised classification algorithms for Sentiment Analysis: Support Vector Machines, Naive Bayes, Maximum Entropy and Decision Trees. We propose to improve the performance of these classifiers using a class reduction technique and then a voting algorithm called Naive Voting. Results show that our proposal outperforms the other machine learning methods proposed in this work

CiteSeerX

Recommended from our members

Crisis Event Extraction Service (CREES) - Automatic Detection and Classification of Crisis-related Content on Social Media

Author: Alani Harith
Burel Gregoire
Publication venue
Publication date: 18/05/2018
Field of study

Social media posts tend to provide valuable reports during crises. However, this information can be hidden in large amounts of unrelated documents. Providing tools that automatically identify relevant posts, event types (e.g., hurricane, floods, etc.) and information categories (e.g., reports on affected individuals, donations and volunteering, etc.) in social media posts is vital for their efficient handling and consumption. We introduce the Crisis Event Extraction Service (CREES), an open-source web API that automatically classifies posts during crisis situations. The API provides annotations for crisis-related documents, event types and information categories through an easily deployable and accessible web API that can be integrated into multiple platform and tools. The annotation service is backed by Convolutional Neural Networks (CNNs) and validated against traditional machine learning models. Results show that the CNN-based API results can be relied upon when dealing with specific crises with the benefits associated with the usage word embeddings

Open Research Online (The Open University)

Co-training for Demographic Classification Using Deep Learning from Label Proportions

Author: Ardehaly Ehsan Mohammady
Culotta Aron
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/09/2017
Field of study

Deep learning algorithms have recently produced state-of-the-art accuracy in many classification tasks, but this success is typically dependent on access to many annotated training examples. For domains without such data, an attractive alternative is to train models with light, or distant supervision. In this paper, we introduce a deep neural network for the Learning from Label Proportion (LLP) setting, in which the training data consist of bags of unlabeled instances with associated label distributions for each bag. We introduce a new regularization layer, Batch Averager, that can be appended to the last layer of any deep neural network to convert it from supervised learning to LLP. This layer can be implemented readily with existing deep learning packages. To further support domains in which the data consist of two conditionally independent feature views (e.g. image and text), we propose a co-training algorithm that iteratively generates pseudo bags and refits the deep LLP model to improve classification accuracy. We demonstrate our models on demographic attribute classification (gender and race/ethnicity), which has many applications in social media analysis, public health, and marketing. We conduct experiments to predict demographics of Twitter users based on their tweets and profile image, without requiring any user-level annotations for training. We find that the deep LLP approach outperforms baselines for both text and image features separately. Additionally, we find that co-training algorithm improves image and text classification by 4% and 8% absolute F1, respectively. Finally, an ensemble of text and image classifiers further improves the absolute F1 measure by 4% on average

arXiv.org e-Print Archive

Crossref

Multilingual Twitter Sentiment Classification: The Role of Human Annotators

Author: Grcar Miha
Mozetic Igor
Smailovic Jasmina
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 23/02/2016
Field of study

What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered

arXiv.org e-Print Archive

Common Language Resources and Technology Infrastructure - Slovenia

Directory of Open Access Journals

PubMed Central

Digital repository of Slovenian research organizations

Data Innovation for International Development: An overview of natural language processing for qualitative data analysis

Author: Broniecki Philipp
Hanchar Anna
Mikhaylov Slava J.
Publication venue
Publication date: 16/09/2017
Field of study

Availability, collection and access to quantitative data, as well as its limitations, often make qualitative data the resource upon which development programs heavily rely. Both traditional interview data and social media analysis can provide rich contextual information and are essential for research, appraisal, monitoring and evaluation. These data may be difficult to process and analyze both systematically and at scale. This, in turn, limits the ability of timely data driven decision-making which is essential in fast evolving complex social systems. In this paper, we discuss the potential of using natural language processing to systematize analysis of qualitative data, and to inform quick decision-making in the development context. We illustrate this with interview data generated in a format of micro-narratives for the UNDP Fragments of Impact project

arXiv.org e-Print Archive

University of Birmingham Research Portal

Recommended from our members

Identifying and Processing Crisis Information from Social Media

Author: Khare Prashant
Publication venue
Publication date: 26/02/2020
Field of study

Social media platforms play a crucial role in how people communicate, particularly during crisis situations such as natural disasters. People share and disseminate information on social media platforms that relates to updates, alerts, rescue and relief requests among other crisis relevant information. Hurricane Harvey and Hurricane Sandy saw over tens of millions of posts getting generated, on Twitter, in a short span of time. The ambit of such posts spreads across a wide range such as personal and official communications, and citizen sensing, to mention a few. This makes social media platforms a source of vital information to different stakeholders in crisis situations such as impacted communities, relief agencies, and civic authorities. However, the overwhelming volume of data generated during such times, makes it impossible to manually identify information relevant to crisis. Additionally, a large portion of posts in voluminous streams is not relevant or bears minimal relevance to crisis situations. This has steered much research towards exploring methods that can automatically identify crisis relevant information from voluminous streams of data during such scenarios. However, the problem of identifying crisis relevant information from social media platforms, such as Twitter, is not trivial given the nature of unstructured text such as short text length and syntactic variations among other challenges. A key objective, while creating automatic crisis relevancy classification systems, is to make them adaptable to a wide range of crisis types and languages. Many related approaches rely on statistical features which are quantifiable properties and linguistic properties of the text. A general approach is to train the classification model on labelled data acquired from crisis events and evaluate on other crisis events. A key aspect missing from explored literature is the validity of crisis relevancy classification models when applied to data from unseen types of crisis events and languages. For instance, how would the accuracy of a crisis relevancy classification model, trained on earthquake type of events, change when applied to flood type of events. Or, how would a model perform when trained on crisis data in English but applied to data in Italian. This thesis investigates these problems from a semantics perspective, where the challenges posed by diverse types of crisis and language variations are seen as the problems that can be tackled by enriching the data semantically. The use of knowledge bases such as DBpedia, BabelNet, and Wikipedia, for semantic enrichment of data in text classification problems has often been studied. Semantic enrichment of data through entity linking and expansion of context via knowledge bases can take advantage of connections between different concepts and thus enhance contextual coherency across crisis types and languages. Several previous works have focused on similar problems and proposed approaches using statistical features and/or non-semantic features. The use of semantics extracted through knowledge graphs has remained unexplored in building crisis relevancy classifiers that are adaptive to varying crisis types and multilingual data. Experiments conducted in this thesis consider data from Twitter, a micro-blogging social media platform, and analyse multiple aspects of crisis data classification. The results obtained through various analyses in this thesis demonstrate the value of semantic enrichment of text through knowledge graphs in improving the adaptability of crisis relevancy classifiers across crisis types and languages, in comparison to statistical features as often used in much of the related work

Open Research Online (The Open University)