15 research outputs found

    GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring

    Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explainability methods, we show that BERT-like models fine-tuned on crisis corpora give too much importance to spatial information when making their predictions. This overfitting to spatial information limits their ability to generalize, especially when an event occurring in a place has evolved since the training dataset was built. To reduce this bias, we propose GeoNLPlify, a novel data augmentation technique that leverages spatial information to generate new labeled data for crisis-related text classification. Our approach addresses overfitting without requiring modifications to the underlying model architecture, distinguishing it from other prevalent methods used to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of spatial information for data augmentation in crisis-related text classification tasks. To evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.
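The abstract does not publish the exact augmentation algorithm, but the core idea of varying the spatial information of labeled samples can be sketched as follows. This is a minimal hypothetical illustration assuming place-name substitution against a toy gazetteer; a real implementation would detect place names with an NER model and draw replacements from a geocoder such as GeoNames.

```python
import random

# Toy gazetteer; purely illustrative, not the tool's actual resource.
GAZETTEER = ["Lyon", "Montpellier", "Toulouse", "Nantes"]

def spatial_augment(text, known_places, n_variants=2, seed=0):
    """Generate new labeled samples by swapping place names.

    Sketch of a GeoNLPlify-style augmentation: each place name found
    in `text` is replaced by another location, while the crisis-related
    class label of the sample stays unchanged.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        new_text = text
        for place in known_places:
            if place in new_text:
                replacement = rng.choice([p for p in GAZETTEER if p != place])
                new_text = new_text.replace(place, replacement)
        variants.append(new_text)
    return variants

samples = spatial_augment("Two swans found dead in Lyon", ["Lyon"])
```

Training on such variants discourages the classifier from tying a label to one specific location.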

    GeoNLPlify: A spatial augmentation of crisis-related corpora for classification tasks

    Does the article "Two swans found dead at the Parc de la Tête d'Or in Lyon" refer to the avian influenza epidemic? Our work proposes using spatial information to generate artificial labeled data in order to improve BERT-based text classification. After highlighting, through explainability methods, the importance of spatial information in crisis-related corpora, we propose several data augmentation strategies that exploit this observation. Our method, GeoNLPlify, is evaluated on public datasets (PADI-web and CrisisNLP) and compared with classical data augmentations.

    SNEToolkit: Spatial named entities disambiguation toolkit

    ‘‘Can you tell me where San Jose is located?’’ ‘‘Uh! Do you know that there are more than 1700 locations named San Jose in the world?’’ The official name of a location is often not the name with which we are familiar. Spatial named entity (SNE) disambiguation is the process of identifying and assigning precise coordinates to a place name identified in a text. This task is not always straightforward, especially when the place name in question is ambiguous. In this context, we are interested in the disambiguation, at country level, of spatial named entities identified in a textual document. The proposed solution is based on a set of techniques that disambiguate a spatial entity by considering the context in which it is mentioned, together with a number of characteristics specific to it. The solution takes a textual document as input and extracts the named entities identified therein, associating each with the correct coordinates. SNE disambiguation is designed to support fast exploration in spatiotemporal data analysis, most often for event tracking. The proposed approach was tested on 1360 SNEs extracted from the GeoVirus dataset. The results show that SNEToolkit outperformed the baseline, the standard GeoNames geocoder, with a recall of 0.911 against 0.871 for the baseline. A flexible Python package is provided for end users.
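The general shape of context-based toponym disambiguation can be sketched in a few lines. This is a hypothetical simplification, not SNEToolkit's actual scoring: candidate entries (as a GeoNames lookup would return them) are preferred when their country is mentioned in the surrounding text, with population as a tie-breaker.

```python
def disambiguate(place, context, candidates):
    """Pick coordinates for an ambiguous place name.

    Illustrative sketch: score each candidate by (country mentioned in
    context, population) and keep the best. Real toolkits combine many
    more features of the entity and its context.
    """
    def score(candidate):
        return (candidate["country"] in context, candidate["population"])
    return max(candidates, key=score)

candidates = [
    {"name": "San Jose", "country": "United States",
     "population": 1_030_000, "coords": (37.34, -121.89)},
    {"name": "San Jose", "country": "Costa Rica",
     "population": 342_000, "coords": (9.93, -84.08)},
]
best = disambiguate("San Jose", "The outbreak was reported in Costa Rica.", candidates)
```

Without the country cue, the larger Californian city would win; the context flips the decision.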

    Animal disease surveillance: How to represent textual data for classifying epidemiological information

    The value of informal sources in increasing the timeliness of disease outbreak detection and providing detailed epidemiological information in the early-warning and preparedness context is well recognized. This study evaluates machine learning methods for classifying information from animal disease-related news at a fine-grained level (i.e., by epidemiological topic). We compare two textual representations: the bag-of-words method and a distributional approach, word embeddings. Both representations performed well for binary relevance classification (F-measures of 0.839 and 0.871, respectively). The bag-of-words representation was outperformed by the word embedding representation for classifying sentences into fine-grained epidemiological topics (F-measure of 0.745). Our results suggest that the word embedding approach is of interest for low-frequency classes in a specialized domain. However, this representation did not bring significant performance improvements for binary relevance classification, indicating that the textual representation should be adapted to each classification task.
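The two representations being compared can be illustrated side by side. This is a generic toy sketch (vocabulary and embedding values are invented, and real systems use pretrained vectors such as word2vec), showing why embeddings can help rare classes: unseen-but-related words still land near known ones in the dense space.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Sparse count vector over a fixed vocabulary (bag-of-words)."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def embed(sentence, embeddings):
    """Dense representation: average of per-word embedding vectors."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Invented toy vocabulary and 2-d embeddings, for illustration only.
vocab = ["outbreak", "swine", "fever", "market"]
emb = {"outbreak": [0.9, 0.1], "fever": [0.8, 0.3], "market": [0.1, 0.9]}

bow = bag_of_words("Swine fever outbreak near the market", vocab)
dense = embed("Swine fever outbreak near the market", emb)
```

Either vector can then feed a standard classifier; the study's point is that the best choice depends on the task.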

    H-TFIDF: What makes areas specific over time in the massive flow of tweets related to the covid pandemic?

    Data produced by social networks may contain weak signals of possible epidemic outbreaks. In this paper, we focus on Twitter data during the waiting period before the appearance of the first COVID-19 cases outside China. Among the huge flow of tweets reflecting a growing concern in all countries, we propose to analyze such data with an adaptation of the TF-IDF measure. It allows users to extract the discriminant vocabulary used across time and space. The results are then discussed to show how the specific spatio-temporal anchoring of the extracted terms makes it possible to follow the crisis dynamics at different scales of time and space.
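One plausible reading of such a spatio-temporal TF-IDF adaptation can be sketched as follows. This is an assumption about the general mechanism, not the paper's published formula: tweets are merged into one pseudo-document per (region, week) cell, and standard TF-IDF over these cells surfaces the terms specific to a place and time.

```python
import math
from collections import Counter, defaultdict

def spatiotemporal_tfidf(tweets):
    """Score terms per (region, week) cell, TF-IDF style.

    Hypothetical sketch: each spatio-temporal cell is a pseudo-document;
    a term scores high in a cell if frequent there but rare in others.
    """
    cells = defaultdict(Counter)
    for region, week, text in tweets:
        cells[(region, week)].update(text.lower().split())
    n_cells = len(cells)
    doc_freq = Counter()
    for counts in cells.values():
        doc_freq.update(counts.keys())
    scores = {}
    for cell, counts in cells.items():
        total = sum(counts.values())
        scores[cell] = {
            term: (count / total) * math.log(n_cells / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

tweets = [
    ("Lombardy", "2020-W08", "lockdown hospital lockdown"),
    ("Lombardy", "2020-W08", "hospital masks"),
    ("Occitanie", "2020-W08", "masks flights"),
]
scores = spatiotemporal_tfidf(tweets)
```

Terms shared by all cells (here "masks") score zero, while cell-specific terms ("lockdown") stand out, which is what makes an area specific at a given time.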

    One-Year analysis of rain and rain erosivity in a tropical volcanic island from UHF wind profiler measurements

    Conference communication on a one-year analysis of rain and rain erosivity in a tropical volcanic island from UHF wind profiler measurements.

    Modelling territorial dynamics: metadata and data lakes dedicated to spatial information

    Data lake management requires an efficient metadata management system. Several works have already addressed this aspect, describing the datasets recorded and ensuring their proper use. However, little work has been done on data lakes dedicated to spatial information, even though the geographical dimension is fundamental when exploring the different trajectories of development projects within a territory. In this article, we are particularly interested in the implementation of a data lake for the Montpellier metropolis. The proposed conceptual solution is based on the ISO 19115 standard, extended in the context of data lakes, to describe spatial metadata. An implementation based on HDFS and GeoNetwork is presented and discussed. The source code is also made available to the community.
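The kind of catalogue entry such a system manages can be illustrated with a minimal record. The field names below are illustrative simplifications inspired by ISO 19115 core elements (title, abstract, geographic extent, reference system, keywords), not the normative XML schema that GeoNetwork actually stores.

```python
def make_record(title, abstract, bbox, crs="EPSG:4326", keywords=()):
    """Build a minimal spatial metadata entry for a data-lake catalogue.

    `bbox` is (west, south, east, north) in degrees; the check keeps
    obviously inverted extents out of the catalogue.
    """
    west, south, east, north = bbox
    if not (west <= east and south <= north):
        raise ValueError("invalid bounding box")
    return {
        "title": title,
        "abstract": abstract,
        "extent": {"bbox": bbox, "crs": crs},
        "keywords": list(keywords),
    }

# Hypothetical dataset description, for illustration only.
record = make_record(
    "Sentinel-2 tile over Montpellier",
    "Cloud-free composite, summer 2021",
    (3.7, 43.5, 4.1, 43.7),
    keywords=["remote sensing", "urban"],
)
```

Keeping the spatial extent in every record is what lets the lake answer "which datasets cover this territory?" uniformly across heterogeneous sources.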

    Spatial Data Lake for Smart Cities: From Design to Implementation

    In this paper, we propose a methodology for designing a data lake dedicated to spatial data, together with an implementation of this specific framework. Inspired by previous proposals on general data lake design and based on the Geographic information-Metadata standard (ISO 19115), the contribution presented in this paper integrates, with the same philosophy, the spatial and thematic dimensions of heterogeneous data (remote sensing images, textual documents, sensor data, etc.). To support our proposal, the process has been implemented in a real data project in collaboration with Montpellier Métropole Méditerranée (3M), a metropolis in the south of France. This framework offers uniform management of the spatial and thematic information embedded in the elements of the data lake.