
    Weak Labeling for Cropland Mapping in Africa

    Full text link
    Cropland mapping can play a vital role in addressing environmental, agricultural, and food security challenges. However, in the context of Africa, practical applications are often hindered by the limited availability of high-resolution cropland maps. Such maps typically require extensive human labeling, creating a scalability bottleneck. To address this, we propose an approach that uses unsupervised object clustering to refine existing weak labels, such as those obtained from global cropland maps. The refined labels, in conjunction with sparse human annotations, serve as training data for a semantic segmentation network that identifies cropland areas. We conduct experiments to demonstrate the benefits of the improved weak labels generated by our method. In a scenario where we train our model with only 33 human-annotated labels, the F1 score for the cropland category increases from 0.53 to 0.84 when we add the mined negative labels.
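
    A minimal sketch of the label-refinement idea described above (not the authors' released pipeline): cluster per-object features, keep a weak label only where a cluster votes consistently, and treat confidently non-cropland clusters as the mined negative labels. The use of KMeans, the feature representation, and the purity threshold are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def refine_weak_labels(object_features, weak_labels, n_clusters=50, purity=0.9):
    """Refine weak cropland labels by majority vote inside unsupervised clusters.

    object_features: 2-D NumPy array, one feature row per image object.
    weak_labels:     1-D NumPy array of {0, 1} weak labels (1 = cropland).
    Returns an array where -1 marks objects left unlabeled.
    Hypothetical helper: clustering choice and thresholds are assumed.
    """
    clusters = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(object_features)
    refined = np.full(len(weak_labels), -1)
    for c in range(n_clusters):
        members = clusters == c
        share = weak_labels[members].mean()  # fraction weakly labeled cropland
        if share >= purity:
            refined[members] = 1             # confident positives
        elif share <= 1 - purity:
            refined[members] = 0             # mined negative labels
    return refined
```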

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    Get PDF
    African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
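
    To make the XOR QA setting concrete, here is a hedged sketch of the translate-then-retrieve pipeline such systems typically follow; `translate`, `retrieve`, and `read` are caller-supplied placeholders, not components of the AfriQA release.

```python
from dataclasses import dataclass

@dataclass
class XorQaQuery:
    question: str           # question in the African source language
    source_lang: str        # e.g. "yor" for Yoruba
    pivot_lang: str = "en"  # language with high-coverage answer content

def answer(q, translate, retrieve, read, k=20):
    """Translate the question into the pivot language, retrieve k passages
    there, extract an answer span, and translate it back. All three
    components are injected by the caller in this sketch."""
    pivot_q = translate(q.question, src=q.source_lang, tgt=q.pivot_lang)
    passages = retrieve(pivot_q, lang=q.pivot_lang, k=k)
    span = read(pivot_q, passages)
    return translate(span, src=q.pivot_lang, tgt=q.source_lang)
```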

    Market Analysis with Online Data: Combining Machine Learning and Econometrics

    No full text
    Machine learning (ML) methods are becoming common across the sciences for modeling massive data. Growing internet adoption is driving exponential growth in the data generated online, and understanding the underlying economic markets requires new techniques and tools. Application programming interfaces (APIs) and web scraping are now necessary for online data collection, while ML provides a wide range of tools for extracting insights from these massive and complex data. However, the adoption of ML in economics is still limited, mainly because of its lack of interpretability. In this thesis, we present practical applications of massive online data to economic problems while using and proposing interpretable methods. In Chapter 1, we use over 300,000 job postings from Pôle Emploi to understand why employers choose to negotiate wages with job seekers. In Chapter 2, we develop a method to extract sentiment indicators from news articles and decompose them into several dimensions called aspects; we apply it to nearly 600,000 news articles to construct early economic indicators as additional decision-making tools. In Chapter 3, we propose a new class of econometric models that rival ML models on standard datasets while being much more interpretable.
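
    As an illustration of the Chapter 2 idea, the snippet below aggregates article-level sentiment scores into a monthly indicator per aspect. The column names, aspect labels, and scores are invented for the example; the thesis's actual scoring method is not reproduced here.

```python
import pandas as pd

# Toy data standing in for scored news articles; columns are assumptions.
articles = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-03", "2020-01-17", "2020-02-05"]),
    "aspect": ["employment", "prices", "employment"],
    "sentiment": [0.4, -0.2, 0.1],  # article-level scores in [-1, 1]
})

# One early indicator per aspect: mean sentiment per calendar month.
indicator = (articles
             .assign(month=articles["date"].dt.to_period("M"))
             .groupby(["month", "aspect"])["sentiment"]
             .mean()
             .unstack("aspect"))
print(indicator)
```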

    Interpretable Machine Learning Using Partial Linear Models*

    No full text
    Despite their high predictive performance, random forests and gradient boosting are often considered black boxes, which has raised concerns among practitioners and regulators. As an alternative, we suggest using partial linear models, which are inherently interpretable. Specifically, we propose combining parametric and non-parametric functions to accurately capture the linearities and non-linearities prevailing between the dependent and explanatory variables, together with a variable selection procedure that controls for overfitting. Estimation relies on a two-step procedure building on the double residual method. We illustrate the predictive performance and interpretability of our approach on a regression problem.
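
    A compact sketch of the two-step double residual (Robinson-style) idea behind a partial linear model y = X beta + g(Z) + eps: partial the nonparametric part out of y and each linear regressor, then run OLS on the residuals. The random forest first stage is an illustrative choice, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def double_residual_fit(X, Z, y):
    """Two-step estimate of beta in y = X @ beta + g(Z) + eps.

    Step 1: estimate E[y|Z] and E[X_j|Z] nonparametrically and residualize.
    Step 2: OLS of residualized y on residualized X recovers beta.
    First-stage learner is an assumption made for this sketch.
    """
    y_res = y - RandomForestRegressor(random_state=0).fit(Z, y).predict(Z)
    X_res = np.column_stack([
        x - RandomForestRegressor(random_state=0).fit(Z, x).predict(Z)
        for x in X.T
    ])
    return LinearRegression(fit_intercept=False).fit(X_res, y_res)  # .coef_ ~ beta
```

    The residual-on-residual regression removes the confounding influence of g(Z), which is what keeps the linear coefficients interpretable while the nonparametric part absorbs the non-linearities.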

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    No full text
    African languages are spoken by over a billion people but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets and a lack of understanding of the settings in which current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.
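
    The source-language selection the paper studies can be framed as a simple search over candidate languages. In this hedged sketch, `fine_tune` and `evaluate_f1` stand in for any NER training and evaluation routine; they are not MasakhaNER APIs.

```python
def best_transfer_language(sources, target_dev, fine_tune, evaluate_f1):
    """Pick the source language whose fine-tuned model scores the highest
    zero-shot F1 on the target language's dev set.

    sources:    dict mapping language code -> annotated NER training data.
    target_dev: held-out annotated data in the target language.
    Both callables are caller-supplied placeholders for this sketch.
    """
    scores = {
        lang: evaluate_f1(fine_tune(data), target_dev)
        for lang, data in sources.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```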
