23 research outputs found

    Extracting locations from sport and exercise-related social media messages using a neural network-based bilingual toponym recognition model

    Sport and exercise contribute to health and well-being in cities. While previous research has mainly focused on activities at specific locations such as sport facilities, "informal sport" that occurs at arbitrary locations across the city has been largely neglected. Such activities are more challenging to observe, but this challenge may be addressed using data collected from social media platforms, because social media users regularly generate content related to sports and exercise at given locations. This allows studying all sport, including "informal sport" at arbitrary locations, to better understand sports and exercise-related activities in cities. However, user-generated geographical information available on social media platforms is becoming scarcer and coarser. This places increased emphasis on extracting location information from free-form text content on social media, which is complicated by multilingualism and informal language. To support this effort, this article presents an end-to-end deep learning-based bilingual toponym recognition model for extracting location information from social media content related to sports and exercise. We show that our approach outperforms five state-of-the-art deep learning and machine learning models. We further demonstrate how our model can be deployed in a geoparsing framework to support city planners in promoting healthy and active lifestyles. Peer reviewed.

    MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages and Modalities

    In this paper, we introduce the MLM (Multiple Languages and Modalities) dataset, a new resource to train and evaluate multitask systems on samples in multiple modalities and three languages. The generation process and inclusion of semantic data provide a resource that further tests the ability of multitask systems to learn relationships between entities. The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives. A second version of MLM provides a geo-representative subset of the data with weighted samples for countries of the European Union. We demonstrate the value of the resource in developing novel applications in the digital humanities with a motivating use case and specify a benchmark set of tasks to retrieve modalities and locate entities in the dataset. Evaluation of baseline multitask and single-task systems on the full and geo-representative versions of MLM demonstrates the challenges of generalising on diverse data. In addition to the digital humanities, we expect the resource to contribute to research in multimodal representation learning, location estimation, and scene understanding.

    Location Reference Recognition from Texts: A Survey and Comparison

    A vast amount of location information exists in unstructured texts, such as social media posts, news stories, scientific articles, web pages, travel blogs, and historical archives. Geoparsing refers to recognizing location references from texts and identifying their geospatial representations. While geoparsing can benefit many domains, a summary of its specific applications is still missing. Further, a comprehensive review and comparison of existing approaches for location reference recognition, which is the first and core step of geoparsing, is lacking. To fill these research gaps, this review first summarizes seven typical application domains of geoparsing: geographic information retrieval, disaster management, disease surveillance, traffic management, spatial humanities, tourism management, and crime management. We then review existing approaches for location reference recognition by categorizing these approaches into four groups based on their underlying functional principle: rule-based, gazetteer matching–based, statistical learning–based, and hybrid approaches. Next, we thoroughly evaluate the correctness and computational efficiency of the 27 most widely used approaches for location reference recognition based on 26 public datasets with different types of texts (e.g., social media posts and news stories) containing 39,736 location references worldwide. Results from this thorough evaluation can help inform future methodological developments and can help guide the selection of proper approaches based on application needs.

    Twitter as an Indicator of Sports Activities in the Helsinki Metropolitan Area

    Being physically active is one of the key aspects of health. Thus, equal opportunity to exercise in different places is an important factor in environmental justice and segregation prevention. Currently, there are no openly available scientific studies about actual physical activities in different parts of the Helsinki Metropolitan Area other than sports barometers. In the absence of comprehensive official data sources, user-generated data, such as social media data, may be used as a proxy for measuring the levels and geographical distribution of sports activities. In this thesis, I aim to assess 1) how tweets could be used as an indicator of sports activities, 2) how sports tweets are distributed spatially and 3) which socio-economic factors can predict the number of sports tweets. To recognize the tweets related to sports, out of 38.5 million tweets, I used Named Entity Matching with a list of sports-related keywords in Finnish, English and Estonian. Due to the spatial nature of my study, I needed tweets that contain a geotag, meaning that the tweet is attached to coordinates that indicate a location. However, only about 1% of tweets contain a geotag, and since 2019 Twitter has, with some exceptions, no longer supported precise geotagging. Therefore, I implemented geoparsing methods to search for location names in the text and transform them to coordinates if the mentioned place was within the study area. After that, I aggregated the posts to postal code areas and used statistical and spatial methods to measure spatial autocorrelation and correlation with different socio-economic variables, in order to examine the spatial patterns and socio-economic factors that affect tweeting about sports. My results show that sports tweets are concentrated mainly in the center of Helsinki, where the population is also concentrated. Besides the largest cluster in the center of Helsinki, the distribution of sports tweets exhibits local clusters in Tapiola, Leppävaara, Tikkurila and Pasila. Mapping the tweets by sport reveals that, for example, racket sport and skiing tweets are heavily concentrated around the corresponding facilities. Statistical analyses indicate that the number of tweets per inhabitant does not correlate with the education level or the average income in the postal code area. The factors that predict the number of tweets per inhabitant are the number of sports facilities per inhabitant, employment, and the percentage of children (0-14 years old) in the postal code area. Keys to a successful study when analyzing Twitter data are geoparsing, having enough data, and a good language model to process it. Despite the promising results of this study, Twitter as an indicator of physical activity should be studied further to better understand its inherent biases before real-life decisions are based on Twitter research.
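    The keyword filtering and geoparsing steps described in this abstract can be pictured with a minimal sketch. It is not the thesis's implementation: the gazetteer entries, the rounded coordinates, the bounding box, the keyword list and all function names below are hypothetical, chosen only for illustration, and simple exact token matching stands in for the Named Entity Matching and language model the author mentions.

```python
import re

# Hypothetical mini-gazetteer: lowercase place name -> rough (lat, lon).
# Coordinates are rounded, illustrative values, not authoritative data.
GAZETTEER = {
    "tapiola": (60.176, 24.805),
    "pasila": (60.199, 24.934),
    "tikkurila": (60.292, 25.044),
    "tampere": (61.498, 23.761),  # deliberately outside the study area
}

# Hypothetical bounding box for the study area:
# (min_lat, min_lon, max_lat, max_lon).
STUDY_AREA = (60.10, 24.50, 60.40, 25.30)

# Hypothetical multilingual keyword list (English and Finnish examples).
SPORT_KEYWORDS = {"running", "tennis", "skiing", "juoksu", "hiihto"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def is_sports_tweet(text):
    """Return True if any token exactly matches a sports keyword."""
    return any(token in SPORT_KEYWORDS for token in tokenize(text))


def in_study_area(lat, lon):
    """Return True if the point lies inside the bounding box."""
    min_lat, min_lon, max_lat, max_lon = STUDY_AREA
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def geoparse(text):
    """Return (name, lat, lon) for gazetteer places inside the study area."""
    hits = []
    for token in tokenize(text):
        if token in GAZETTEER:
            lat, lon = GAZETTEER[token]
            if in_study_area(lat, lon):
                hits.append((token, lat, lon))
    return hits


if __name__ == "__main__":
    for tweet in ["Morning tennis in Tapiola", "Skiing near Tampere"]:
        if is_sports_tweet(tweet):
            print(tweet, "->", geoparse(tweet))
```

    Exact token matching misses inflected forms such as the Finnish "Tapiolassa", which is one reason a real pipeline would need lemmatization or a trained toponym recognizer rather than a plain lookup.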

    Real-Time Event Analysis and Spatial Information Extraction From Text Using Social Media Data

    Since the advent of websites that enable users to participate and interact with each other by sharing content in different forms, a plethora of possibly relevant information is at scientists' fingertips. Consequently, this thesis elaborates on two distinct approaches to extract valuable information from social media data and sketches out the potential joint use case in the domain of natural disasters.

    LOCATION MENTION PREDICTION FROM DISASTER TWEETS

    While utilizing Twitter data for crisis management is of interest to different response authorities, a critical challenge that hinders the utilization of such data is the scarcity of automated tools that extract and resolve geolocation information. This dissertation focuses on the Location Mention Prediction (LMP) problem, which consists of the Location Mention Recognition (LMR) and Location Mention Disambiguation (LMD) tasks. Our work contributes to studying two main factors that influence the robustness of LMP systems: (i) the dataset used to train the model, and (ii) the learning model. As for the training dataset, we study the best training and evaluation strategies to exploit existing datasets and tools at the onset of disaster events. We emphasize that the size of training data matters and recommend considering the data domain, the disaster domain, and geographical proximity when training LMR models. We further construct the public IDRISI datasets, the largest English datasets to date and the first Arabic datasets for the LMP tasks. Rigorous analysis and experiments show that the IDRISI datasets are diverse and generalize across domains and geographical areas, compared to existing datasets. As for the learning models, the LMP tasks are understudied in the disaster management domain. To address this, we reformulate the LMR and LMD modeling and evaluation to better suit the requirements of the response authorities. Moreover, we introduce competitive and state-of-the-art LMR and LMD models that are compared against a representative set of baselines for both Arabic and English.

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.