5 research outputs found

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
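The XOR QA pipeline the abstract describes can be sketched as a toy example: a query in the user's language is mapped to a higher-coverage language (here English), the best-matching passage is retrieved, and the answer content is served back. The tiny corpus, the translation table, and the overlap scoring below are all illustrative placeholders, not AfriQA components or the paper's retrieval method.

```python
# Toy XOR QA sketch: retrieve English answer content for a non-English query.
# All data here is illustrative; real systems use MT or multilingual retrievers.

CORPUS_EN = [
    "Mount Kilimanjaro is the highest mountain in Africa.",
    "The Nile is the longest river in Africa.",
]

# Placeholder query "translation" table (a stand-in for a real MT step).
QUERY_TRANSLATIONS = {
    "mlima mrefu zaidi afrika": "highest mountain africa",  # Swahili-like query
}

def retrieve(query_native: str) -> str:
    """Translate the query, then return the passage with the most word overlap."""
    query_en = QUERY_TRANSLATIONS.get(query_native, query_native)
    terms = set(query_en.lower().split())
    def overlap(passage: str) -> int:
        return len(terms & set(passage.lower().rstrip(".").split()))
    return max(CORPUS_EN, key=overlap)

print(retrieve("mlima mrefu zaidi afrika"))
# -> "Mount Kilimanjaro is the highest mountain in Africa."
```

The abstract's point is that for many African languages this cross-lingual route is not an augmentation but the only high-coverage source of answers, which makes both steps (translation and retrieval) single points of failure.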

    Market Analysis with Online Data: Combining Machine Learning and Econometrics

    Machine learning (ML) methods are becoming common across the sciences for modeling massive data. The growing adoption of the internet is driving exponential growth in data generated online, and understanding the underlying economic markets requires new techniques and tools. Application Programming Interfaces (APIs) and web scraping techniques are now required for online data collection, while ML provides a wide range of tools for extracting insights from these massive and complex data. However, the adoption of ML methods in economics is still limited, mainly due to their lack of interpretability. In this thesis, we present practical applications of massive online data to economic problems while using and proposing interpretable methods. In Chapter 1, we use over 300,000 job offers from Pôle Emploi to understand why employers choose to negotiate wages with job seekers. In Chapter 2, we develop a method to extract sentiment indicators from news articles and decompose them into several dimensions called aspects; we apply it to nearly 600,000 news articles to construct early economic indicators as additional decision-making tools. In Chapter 3, we propose a new class of econometric models that rival ML models on standard datasets while being much more interpretable.
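Chapter 2's idea of decomposing news sentiment into aspect-level indicators can be illustrated with a minimal lexicon-based sketch. The aspect names and word lists below are invented for illustration; the thesis's actual lexicons and extraction method are not reproduced here.

```python
# Hypothetical sketch of aspect-level sentiment scoring: score each article
# against small sentiment lexicons grouped by economic aspect, then average.
# Lexicons and aspect labels are illustrative placeholders.

ASPECT_LEXICONS = {
    "employment": {"hiring": 1, "layoffs": -1, "jobs": 1, "unemployment": -1},
    "prices":     {"inflation": -1, "discount": 1, "shortage": -1, "stable": 1},
}

def aspect_sentiment(text: str) -> dict:
    """Return an average sentiment score per aspect for one article."""
    tokens = text.lower().split()
    scores = {}
    for aspect, lexicon in ASPECT_LEXICONS.items():
        hits = [lexicon[t] for t in tokens if t in lexicon]
        scores[aspect] = sum(hits) / len(hits) if hits else 0.0
    return scores

article = "Firms announced hiring plans despite inflation and supply shortage"
print(aspect_sentiment(article))
# -> {'employment': 1.0, 'prices': -1.0}
```

Aggregating such per-article scores over time, one aspect at a time, is what turns raw news text into the "early economic indicators" the abstract mentions.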

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.
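The F1 scores behind the reported 14-point zero-shot gains are entity-level: a prediction counts only if both the span boundaries and the entity type match exactly. A minimal sketch of that metric, as an illustration rather than the paper's evaluation code:

```python
# Entity-level F1 over exact-match (type, start, end) tuples.
# Illustrative sketch; not the paper's evaluation implementation.

def entity_f1(gold: set, pred: set) -> float:
    """F1 where an entity matches only on exact type and span boundaries."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)                  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("PER", 0, 2), ("LOC", 5, 6)}
pred = {("PER", 0, 2), ("ORG", 5, 6)}      # right span, wrong type -> no match
print(entity_f1(gold, pred))               # -> 0.5
```

Because partial credit is never given, span- or type-level errors from a poorly chosen transfer language translate directly into the large F1 gaps the abstract reports.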
