13 research outputs found

    Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: at least 15 corpora have no usable text, and a significant fraction contain fewer than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard or ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. Comment: Accepted at TACL; pre-MIT Press publication version.

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity recognition (NER). We create the largest human-annotated NER dataset for 20 African languages, and we study the behavior of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, demonstrating that the choice of source language significantly affects performance. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points across 20 languages compared to using English. Our results highlight the need for benchmark datasets and models that cover typologically diverse African languages.

    AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

    African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems, those that retrieve answer content from other languages while serving people in their native language, offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.

    Implementing electronic health records (EHRs): health care provider perceptions before and after transition from a local basic EHR to a commercial comprehensive EHR

    We assessed changes in the percentage of providers with positive perceptions of electronic health record (EHR) benefit before and after transition from a local basic to a commercial comprehensive EHR. Changes in the percentage of providers with positive perceptions of EHR benefit were captured via a survey of academic health care providers before (baseline) and at 6-12 months (short term) and 12-24 months (long term) after the transition. We analyzed 32 items for the overall group and by practice setting, provider age, and specialty using separate multivariable-adjusted random effects logistic regression models. A total of 223 providers completed all 3 surveys (30% response rate): 85.6% had outpatient practices, 56.5% were >45 years old, and 23.8% were primary care providers. The percentage of providers with positive perceptions significantly increased from baseline to long-term follow-up for patient communication, hospital transitions - access to clinical information, preventive care delivery, preventive care prompt, preventive lab prompt, satisfaction with system reliability, and sharing medical information (P