296 research outputs found
Multilingual Text Classification from Twitter during Emergencies
Social media such as Twitter are a valuable source of information due to their diffusion among citizens and to their speed in sharing data worldwide. However, it is challenging to automatically extract information from such data, given the huge amount of useless content. We propose a multilingual tool that automatically categorizes tweets according to their information content. To achieve real-time classification while supporting any language, we apply a deep learning classifier, using multilingual word embeddings. This allows our solution to be trained on one language and to apply it to any other language via zero-shot inference achieving acceptable performance loss
Recommended from our members
Identifying and Processing Crisis Information from Social Media
Social media platforms play a crucial role in how people communicate, particularly during crisis situations such as natural disasters. People share and disseminate information on social media platforms that relates to updates, alerts, rescue and relief requests among other crisis relevant information. Hurricane Harvey and Hurricane Sandy saw over tens of millions of posts getting generated, on Twitter, in a short span of time. The ambit of such posts spreads across a wide range such as personal and official communications, and citizen sensing, to mention a few. This makes social media platforms a source of vital information to different stakeholders in crisis situations such as impacted communities, relief agencies, and civic authorities. However, the overwhelming volume of data generated during such times, makes it impossible to manually identify information relevant to crisis. Additionally, a large portion of posts in voluminous streams is not relevant or bears minimal relevance to crisis situations.
This has steered much research towards exploring methods that can automatically identify crisis relevant information from voluminous streams of data during such scenarios. However, the problem of identifying crisis relevant information from social media platforms, such as Twitter, is not trivial given the nature of unstructured text such as short text length and syntactic variations among other challenges. A key objective, while creating automatic crisis relevancy classification systems, is to make them adaptable to a wide range of crisis types and languages. Many related approaches rely on statistical features which are quantifiable properties and linguistic properties of the text. A general approach is to train the classification model on labelled data acquired from crisis events and evaluate on other crisis events. A key aspect missing from explored literature is the validity of crisis relevancy classification models when applied to data from unseen types of crisis events and languages. For instance, how would the accuracy of a crisis relevancy classification model, trained on earthquake type of events, change when applied to flood type of events. Or, how would a model perform when trained on crisis data in English but applied to data in Italian.
This thesis investigates these problems from a semantics perspective, where the challenges posed by diverse types of crisis and language variations are seen as the problems that can be tackled by enriching the data semantically. The use of knowledge bases such as DBpedia, BabelNet, and Wikipedia, for semantic enrichment of data in text classification problems has often been studied. Semantic enrichment of data through entity linking and expansion of context via knowledge bases can take advantage of connections between different concepts and thus enhance contextual coherency across crisis types and languages. Several previous works have focused on similar problems and proposed approaches using statistical features and/or non-semantic features. The use of semantics extracted through knowledge graphs has remained unexplored in building crisis relevancy classifiers that are adaptive to varying crisis types and multilingual data. Experiments conducted in this thesis consider data from Twitter, a micro-blogging social media platform, and analyse multiple aspects of crisis data classification. The results obtained through various analyses in this thesis demonstrate the value of semantic enrichment of text through knowledge graphs in improving the adaptability of crisis relevancy classifiers across crisis types and languages, in comparison to statistical features as often used in much of the related work
Helping crisis responders find the informative needle in the tweet haystack
Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis situation in order to design an effective response. However with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this problem. Messages are filtered for informativeness based on a definition of the concept drawn from prior research and crisis response experts. Informative messages are tagged for actionable data -- for example, people in need, threats to rescue efforts, changes in environment, and so on. In all, eight categories of actionability are identified. The two components -- informativeness and actionability classification -- are packaged together as an openly-available tool called Emina (Emergent Informativeness and Actionability)
Coping with low data availability for social media crisis message categorisation
During crisis situations, social media allows people to quickly share
information, including messages requesting help. This can be valuable to
emergency responders, who need to categorise and prioritise these messages
based on the type of assistance being requested. However, the high volume of
messages makes it difficult to filter and prioritise them without the use of
computational techniques. Fully supervised filtering techniques for crisis
message categorisation typically require a large amount of annotated training
data, but this can be difficult to obtain during an ongoing crisis and is
expensive in terms of time and labour to create.
This thesis focuses on addressing the challenge of low data availability when
categorising crisis messages for emergency response. It first presents domain
adaptation as a solution for this problem, which involves learning a
categorisation model from annotated data from past crisis events (source
domain) and adapting it to categorise messages from an ongoing crisis event
(target domain). In many-to-many adaptation, where the model is trained on
multiple past events and adapted to multiple ongoing events, a multi-task
learning approach is proposed using pre-trained language models. This approach
outperforms baselines and an ensemble approach further improves performance..
Information Refinement Technologies for Crisis Informatics: User Expectations and Design Implications for Social Media and Mobile Apps in Crises
In the past 20 years, mobile technologies and social media have not only been established in everyday life, but also in crises, disasters, and emergencies. Especially large-scale events, such as 2012 Hurricane Sandy or the 2013 European Floods, showed that citizens are not passive victims but active participants utilizing mobile and social information and communication technologies (ICT) for crisis response (Reuter, Hughes, et al., 2018). Accordingly, the research field of crisis informatics emerged as a multidisciplinary field which combines computing and social science knowledge of disasters and is rooted in disciplines such as human-computer interaction (HCI), computer science (CS), computer supported cooperative work (CSCW), and information systems (IS). While citizens use personal ICT to respond to a disaster to cope with uncertainty, emergency services such as fire and police departments started using available online data to increase situational awareness and improve decision making for a better crisis response (Palen & Anderson, 2016). When looking at even larger crises, such as the ongoing COVID-19 pandemic, it becomes apparent the challenges of crisis informatics are amplified (Xie et al., 2020). Notably, information is often not available in perfect shape to assist crisis response: the dissemination of high-volume, heterogeneous and highly semantic data by citizens, often referred to as big social data (Olshannikova et al., 2017), poses challenges for emergency services in terms of access, quality and quantity of information. In order to achieve situational awareness or even actionable information, meaning the right information for the right person at the right time (Zade et al., 2018), information must be refined according to event-based factors, organizational requirements, societal boundary conditions and technical feasibility. In order to research the topic of information refinement, this dissertation combines the methodological framework of design case studies (Wulf et al., 2011) with principles of design science research (Hevner et al., 2004). These extended design case studies consist of four phases, each contributing to research with distinct results. This thesis first reviews existing research on use, role, and perception patterns in crisis informatics, emphasizing the increasing potentials of public participation in crisis response using social media. Then, empirical studies conducted with the German population reveal positive attitudes and increasing use of mobile and social technologies during crises, but also highlight barriers of use and expectations towards emergency services to monitor and interact in media. The findings led to the design of innovative ICT artefacts, including visual guidelines for citizens’ use of social media in emergencies (SMG), an emergency service web interface for aggregating mobile and social data (ESI), an efficient algorithm for detecting relevant information in social media (SMO), and a mobile app for bidirectional communication between emergency services and citizens (112.social). The evaluation of artefacts involved the participation of end-users in the application field of crisis management, pointing out potentials for future improvements and research potentials. The thesis concludes with a framework on information refinement for crisis informatics, integrating event-based, organizational, societal, and technological perspectives
Analyzing Granger causality in climate data with time series classification methods
Attribution studies in climate science aim for scientifically ascertaining the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the methods that were tested
Classification algorithms for Big Data with applications in the urban security domain
A classification algorithm is a versatile tool, that can serve as a predictor for the
future or as an analytical tool to understand the past. Several obstacles prevent
classification from scaling to a large Volume, Velocity, Variety or Value. The aim
of this thesis is to scale distributed classification algorithms beyond current limits,
assess the state-of-practice of Big Data machine learning frameworks and validate
the effectiveness of a data science process in improving urban safety.
We found in massive datasets with a number of large-domain categorical features
a difficult challenge for existing classification algorithms. We propose associative
classification as a possible answer, and develop several novel techniques to distribute
the training of an associative classifier among parallel workers and improve the final
quality of the model. The experiments, run on a real large-scale dataset with more
than 4 billion records, confirmed the quality of the approach.
To assess the state-of-practice of Big Data machine learning frameworks and
streamline the process of integration and fine-tuning of the building blocks, we
developed a generic, self-tuning tool to extract knowledge from network traffic
measurements. The result is a system that offers human-readable models of the data
with minimal user intervention, validated by experiments on large collections of
real-world passive network measurements.
A good portion of this dissertation is dedicated to the study of a data science
process to improve urban safety. First, we shed some light on the feasibility of a
system to monitor social messages from a city for emergency relief. We then propose
a methodology to mine temporal patterns in social issues, like crimes. Finally,
we propose a system to integrate the findings of Data Science on the citizenry’s
perception of safety and communicate its results to decision makers in a timely
manner. We applied and tested the system in a real Smart City scenario, set in Turin,
Italy
Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers
In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets
Data-Driven Techniques For Vulnerability Assessments
Security vulnerabilities have been puzzling researchers and practitioners for decades.As highlighted by the recent WannaCry and NotPetya ransomware campaigns, which resulted in billions of dollars of losses, weaponized exploits against vulnerabilities remain one of the main tools for cybercrime.
The upward trend in the number of vulnerabilities reported annually and technical challenges in the way of remediation lead to large exposure windows for the vulnerable populations.
On the other hand, due to sustained efforts in application and operating system security, few vulnerabilities are exploited in real-world attacks.
Existing metrics for severity assessments err on the side of caution and overestimate the risk posed by vulnerabilities, further affecting remediation efforts that rely on prioritization.
In this dissertation we show that severity assessments can be improved by taking into account public information about vulnerabilities and exploits.The disclosure of vulnerabilities is followed by artifacts such as social media discussions, write-ups and proof-of-concepts, containing technical information related to the vulnerabilities and their exploitation.
These artifacts can be mined to detect active exploits or predict their development.
However, we first need to understand: What features are required for different tasks? What biases are present in public data and how are data-driven systems affected? What security threats do these systems face when deployed operationally?
We explore the questions by first collecting vulnerability-related posts on social media and analyzing the community and the content of their discussions.This analysis reveals that victims of attacks often share their experience online, and we leverage this finding to build an early detector of exploits active in the wild.
Our detector significantly improves on the precision of existing severity metrics and can detect active exploits a median of 5 days earlier than a commercial intrusion prevention product.
Next, we investigate the utility of various artifacts in predicting the development of functional exploits. We engineer features causally linked to the ease of exploitation, highlight trade-offs between timeliness and predictive utility of various artifacts, and characterize the biases that affect the ground truth for exploit prediction tasks.
Using these insights, we propose a machine learning-based system that continuously collects artifacts and predicts the likelihood of exploits being developed against these vulnerabilities.
We demonstrate our system's practical utility through its ability to highlight critical vulnerabilities and predict imminent exploits.
Lastly, we explore the adversarial threats faced by data-driven security systems that rely on inputs of unknown provenance.We propose a framework for defining algorithmic threat models and for exploring adversaries with various degrees of knowledge and capabilities.
Using this framework, we model realistic adversaries that could target our systems, design data poisoning attacks to measure their robustness, and highlight promising directions for future defenses against such attacks
- …