23 research outputs found
Extracting locations from sport and exercise-related social media messages using a neural network-based bilingual toponym recognition model
Sport and exercise contribute to health and well-being in cities. While previous research has mainly focused on activities at specific locations such as sport facilities, "informal sport" that occurs at arbitrary locations across the city has been largely neglected. Such activities are more challenging to observe, but this challenge may be addressed using data collected from social media platforms, because social media users regularly generate content related to sports and exercise at given locations. This allows studying all sport, including "informal sport" at arbitrary locations, to better understand sports and exercise-related activities in cities. However, user-generated geographical information available on social media platforms is becoming scarcer and coarser. This places increased emphasis on extracting location information from free-form text content on social media, which is complicated by multilingualism and informal language. To support this effort, this article presents an end-to-end deep learning-based bilingual toponym recognition model for extracting location information from social media content related to sports and exercise. We show that our approach outperforms five state-of-the-art deep learning and machine learning models. We further demonstrate how our model can be deployed in a geoparsing framework to support city planners in promoting healthy and active lifestyles.
MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities
In this paper, we introduce the MLM (Multiple Languages and Modalities)
dataset - a new resource to train and evaluate multitask systems on samples in
multiple modalities and three languages. The generation process and inclusion
of semantic data provide a resource that further tests the ability for
multitask systems to learn relationships between entities. The dataset is
designed for researchers and developers who build applications that perform
multiple tasks on data encountered on the web and in digital archives. A second
version of MLM provides a geo-representative subset of the data with weighted
samples for countries of the European Union. We demonstrate the value of the
resource in developing novel applications in the digital humanities with a
motivating use case and specify a benchmark set of tasks to retrieve modalities
and locate entities in the dataset. Evaluation of baseline multitask and single
task systems on the full and geo-representative versions of MLM demonstrates the
challenges of generalising on diverse data. In addition to the digital
humanities, we expect the resource to contribute to research in multimodal
representation learning, location estimation, and scene understanding.
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy. We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions in order to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field. We then show how these geoparsing advances coupled with our proposed Intra-Document Analysis can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. 
The data was made available to us by the Joint Research Centre (JRC), which operates one such system called MediSys that processes incoming news articles in order to monitor threats to public health and make these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work. In summary, the thesis aims are twofold: (1) Generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals. (2) Demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. European Commission’s MediSys. I was generously funded by DREAM CDT, which was funded by NERC of UKRI.
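The two geoparsing stages described in this abstract, geotagging and toponym resolution, can be illustrated with a minimal Python sketch on the example sentence. It is not the thesis's method: the thesis uses neural taggers and classifiers, whereas this toy uses a hand-labelled lexicon. The names LEXICON, COORDS, geotag and resolve_event_locations are hypothetical, and the coordinates are approximate illustrative values.

```python
import re

# Hypothetical hand-labelled lexicon: surface form -> (place it refers to, usage).
# The usage labels follow the abstract: Spain and Malaga are literal event
# locations, while "American" and "Mexican" are nationalities.
LEXICON = {
    "Spanish": ("Spain", "literal"),
    "Malaga": ("Malaga", "literal"),
    "American": ("United States", "nationality"),
    "Mexican": ("Mexico", "nationality"),
}

# Hypothetical toy gazetteer with approximate (latitude, longitude) values.
COORDS = {
    "Spain": (40.4, -3.7),
    "Malaga": (36.72, -4.42),
}

SENTENCE = ("The victims of the Spanish earthquake off the coast of Malaga "
            "were of American and Mexican origin.")


def geotag(text):
    """Geotagging: return every token that the lexicon lists as a toponym."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in LEXICON]


def resolve_event_locations(text):
    """Toponym resolution restricted to literal uses of a place name."""
    resolved = {}
    for mention in geotag(text):
        place, usage = LEXICON[mention]
        if usage == "literal" and place in COORDS:
            resolved[place] = COORDS[place]
    return resolved


if __name__ == "__main__":
    print(geotag(SENTENCE))
    print(resolve_event_locations(SENTENCE))
```

Geotagging returns all four toponyms, while resolution keeps only the two literal event locations. A real system would replace the lexicon lookup with a trained tagger and a metonymy classifier.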
Location Reference Recognition from Texts: A Survey and Comparison
A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, there is a lack of a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching–based, statistical learning–based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs.
Twitter as an Indicator of Sports Activities in the Helsinki Metropolitan Area
Being physically active is one of the key aspects of health. Thus, equal opportunities for exercising in different places are an important factor in environmental justice and the prevention of segregation. Currently, there are no openly available scientific studies about actual physical activities in different parts of the Helsinki Metropolitan Area other than local sports barometers. In the absence of comprehensive official data sources, user-generated data, like social media, may be used as a proxy for measuring the levels and geographical distribution of sports activities. In this thesis, I aim to assess 1) how tweets could be used as an indicator of sports activities, 2) how the sports tweets are distributed spatially and 3) which socio-economic factors can predict the number of sports tweets.
To recognize the tweets related to sports, out of 38.5 million tweets, I used Named Entity Matching with a list of sports-related keywords in Finnish, English and Estonian. Due to the spatial nature of my study, I needed tweets that contain a geotag, meaning that the tweet is attached to coordinates that indicate a location. However, only about 1% of tweets contain a geotag, and since 2019 Twitter has, with some exceptions, no longer supported precise geotagging. Therefore, I implemented geoparsing methods to search for location names in the text and transform them to coordinates if the mentioned place was within the study area. After that, I aggregated the posts to postal code areas and used statistical and spatial methods to measure spatial autocorrelation and correlation with different socio-economic variables to examine the spatial patterns and socio-economic factors that affect tweeting about sports.
My results show that the sports tweets are concentrated mainly in the center of Helsinki, where the population is also concentrated. Besides the largest cluster in the center of Helsinki, the distribution of the sports tweets exhibits local clusters in Tapiola, Leppävaara, Tikkurila and Pasila. Mapping the tweets by sport reveals that, for example, racket sport and skiing tweets are heavily concentrated around the corresponding facilities. Statistical analyses indicate that the number of tweets per inhabitant does not correlate with the education level or the average income in the postal code area. The factors that predict the number of tweets per inhabitant are the number of sports facilities per inhabitant, employment, and the percentage of children (0-14 years old) in the postal code area. Keys to a successful study when analyzing Twitter data are geoparsing, having enough data, and a good language model to process it. Despite the promising results of this study, Twitter as an indicator of physical activity should be studied more to better understand the kind of bias it inherently has before basing real-life decisions on Twitter research.
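The processing chain described in this abstract (keyword filtering, geoparsing to the study area, aggregation to postal code areas, tweets per inhabitant) might be sketched in Python as below. This is a simplified illustration, not the thesis's implementation. The keyword list, the place-to-postal-code mapping and the population figures are made-up toy values, and all function names are hypothetical. Exact token matching also ignores Finnish inflection (for example "Tapiolassa"), which a real pipeline would need to handle.

```python
from collections import Counter

# Toy values for illustration only; none of these come from the thesis.
SPORT_KEYWORDS = {"tennis", "hiihto", "jooksmine"}  # English, Finnish, Estonian
PLACE_TO_POSTCODE = {"tapiola": "02100", "pasila": "00520"}
POPULATION = {"02100": 10000, "00520": 5000}  # invented inhabitant counts


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each token."""
    return [tok.strip(".,!?#@:;") for tok in text.lower().split()]


def is_sport_tweet(text):
    """Keyword filter: true if any token is a sports keyword."""
    return any(tok in SPORT_KEYWORDS for tok in tokenize(text))


def geoparse(text):
    """Return the postal code of the first known place name, or None."""
    for tok in tokenize(text):
        if tok in PLACE_TO_POSTCODE:
            return PLACE_TO_POSTCODE[tok]
    return None


def tweets_per_inhabitant(tweets):
    """Count located sports tweets per postal code and normalise by population."""
    counts = Counter()
    for text in tweets:
        if not is_sport_tweet(text):
            continue
        postcode = geoparse(text)
        if postcode is not None:  # tweets without a known place are dropped
            counts[postcode] += 1
    rates = {pc: n / POPULATION[pc] for pc, n in counts.items()}
    return counts, rates


TWEETS = [
    "Great tennis match in Tapiola!",
    "Hiihto Pasila today",
    "Coffee in Pasila",
    "tennis at home",
]

if __name__ == "__main__":
    print(tweets_per_inhabitant(TWEETS))
```

Only the first two toy tweets are both sports-related and locatable, so each postal code area receives one tweet. The resulting per-inhabitant rates are the kind of variable that could then be correlated with socio-economic indicators.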
Real-Time Event Analysis and Spatial Information Extraction From Text Using Social Media Data
Since the advent of websites that enable users to participate and interact with each other by sharing content in different forms, a plethora of possibly relevant information is at scientists' fingertips. Consequently, this thesis elaborates on two distinct approaches to extract valuable information from social media data and sketches out the potential joint use case in the domain of natural disasters.
LOCATION MENTION PREDICTION FROM DISASTER TWEETS
While utilizing Twitter data for crisis management is of interest to different response authorities, a critical challenge that hinders the utilization of such data is the scarcity of automated tools that extract and resolve geolocation information. This dissertation focuses on the Location Mention Prediction (LMP) problem that consists of Location Mention Recognition (LMR) and Location Mention Disambiguation (LMD) tasks. Our work contributes to studying two main factors that influence the robustness of LMP systems: (i) the dataset used to train the model, and (ii) the learning model. As for the training dataset, we study the best training and evaluation strategies to exploit existing datasets and tools at the onset of disaster events. We emphasize that the size of training data matters and recommend considering the data domain, the disaster domain, and geographical proximity when training LMR models. We further construct the public IDRISI datasets, the largest English datasets to date and the first Arabic datasets for the LMP tasks. Rigorous analysis and experiments show that the IDRISI datasets are diverse, and domain and geographically generalizable, compared to existing datasets. As for the learning models, the LMP tasks are understudied in the disaster management domain. To address this, we reformulate the LMR and LMD modeling and evaluation to better suit the requirements of the response authorities. Moreover, we introduce competitive and state-of-the-art LMR and LMD models that are compared against a representative set of baselines for both Arabic and English languages.
Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino
On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.