9 research outputs found

    GEMv2 : Multilingual NLG benchmarking in a single line of code

    Get PDF
    Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark. Peer reviewed.
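    The "single line of code" claim rests on a modular registry where datasets and metrics plug into one evaluation entry point. The sketch below illustrates that registry pattern in miniature; all names (`register_metric`, `evaluate`, the `exact_match` metric) are hypothetical stand-ins, not GEMv2's actual API.

```python
# Hypothetical sketch of a modular metric registry in the spirit of GEMv2:
# metrics register themselves by name, and evaluating predictions against
# references then takes a single call.

METRICS = {}

def register_metric(name):
    """Decorator that adds a metric function to the shared registry."""
    def wrap(fn):
        METRICS[name] = fn
        return fn
    return wrap

@register_metric("exact_match")
def exact_match(predictions, references):
    # Fraction of predictions that match their reference exactly.
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

def evaluate(predictions, references, metrics=("exact_match",)):
    # The "single line" a model developer would call.
    return {m: METRICS[m](predictions, references) for m in metrics}

scores = evaluate(["a cat", "a dog"], ["a cat", "a bird"])
print(scores)  # {'exact_match': 0.5}
```

    New metrics become available to every dataset as soon as they are registered, which is the core of the modularity argument in the abstract.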

    Contribution à la génération de langage naturel : systÚmes et évaluation

    No full text
    In recent years, the Natural Language Generation (NLG) field has changed drastically. This shift, which can be partially attributed to notable advances in hardware, has led recent efforts in NLG to focus on data-driven methods leveraging large pretrained Neural Networks (NNs). However, this progress gave rise to new challenges related to computational requirements, accessibility, and evaluation strategies, to name a few. In this dissertation, we are primarily concerned with contributing to the efforts to mitigate these challenges. To address the lack of monolingual generative models for some languages, we start by introducing BARThez and AraBART, the first large-scale pretrained seq2seq models for French and Arabic, respectively. Being based on BART, these models are particularly well suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez's corpus, and show that our resulting model, mBARThez, significantly boosts BARThez's generative performance. In addition, we show that AraBART achieves the best performance on multiple abstractive summarization datasets, outperforming strong baselines. Finally, we focus on NLG system evaluation by proposing DATScore and FrugalScore. DATScore uses data augmentation techniques to improve the evaluation of machine translation and other NLG tasks. Our main finding is that introducing data-augmented translations of the source and reference texts is greatly helpful in evaluating the quality of the generated translation. We also propose two novel score averaging and term weighting strategies to improve the original score computation process of BARTScore. Experimental results on WMT show that DATScore correlates better with human meta-evaluations than other recent state-of-the-art metrics, especially for low-resource languages. FrugalScore, in turn, is an approach to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance. Experiments with BERTScore and MoverScore on summarization and translation show that FrugalScore is on par with the original metrics (and sometimes better), while having several orders of magnitude fewer parameters and running several times faster. On average over all learned metrics, tasks, and variants, FrugalScore retains 96.8% of the performance, runs 24 times faster, and has 35 times fewer parameters than the original metrics.
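    The FrugalScore idea, distilling an expensive metric into a small, fast student, can be sketched in a deliberately toy form. Here the "teacher" is a stand-in function and the student is a one-feature linear regression; the actual method instead trains a miniature pretrained language model on teacher scores. Everything below (the `token_overlap` feature, the teacher's formula) is an illustrative assumption, not the paper's setup.

```python
# Toy sketch of metric distillation in the spirit of FrugalScore: fit a
# cheap "student" to reproduce the scores of an expensive "teacher" metric,
# so the teacher never needs to be called at evaluation time.

def token_overlap(candidate, reference):
    # Jaccard overlap of token sets: a cheap feature of the pair.
    c, r = set(candidate.split()), set(reference.split())
    return len(c & r) / max(len(c | r), 1)

def expensive_teacher(candidate, reference):
    # Stand-in for a costly learned metric (e.g. BERTScore); here it is
    # linear in the feature so the toy student can fit it exactly.
    return 0.2 + 0.8 * token_overlap(candidate, reference)

pairs = [("the cat sat", "the cat sat"),
         ("a dog ran", "the cat sat"),
         ("the cat ran", "the cat sat")]

xs = [token_overlap(c, r) for c, r in pairs]
ys = [expensive_teacher(c, r) for c, r in pairs]   # teacher labels

# Closed-form least squares for the student y = a*x + b.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def student(candidate, reference):
    # Approximates the teacher using only the cheap feature.
    return a * token_overlap(candidate, reference) + b
```

    The real method's trade-off is the same shape as this toy's: the student loses a little fidelity (96.8% retained, per the abstract) in exchange for far fewer parameters and much faster inference.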

    DATScore: Evaluating Translation with Data Augmented Translations

    No full text
    International audience.

    Word sense induction with agglomerative clustering and mutual information maximization

    No full text
    Word sense induction (WSI) is a challenging problem in natural language processing that involves the unsupervised automatic detection of a word's senses (i.e., meanings). Recent work achieves significant results on the WSI task by pre-training a language model that can exclusively disambiguate word senses. In contrast, others employ off-the-shelf pre-trained language models with additional strategies to induce senses. This paper proposes a novel unsupervised method based on hierarchical clustering and invariant information clustering (IIC). The IIC loss is used to train a small model to optimize the mutual information between two vector representations of a target word occurring in a pair of synthetic paraphrases. This model is later used in inference mode to extract a higher-quality vector representation to be used in the hierarchical clustering. We evaluate our method on two WSI tasks and in two distinct clustering configurations (fixed and dynamic number of clusters). We empirically show that our approach is at least on par with the state-of-the-art baselines, outperforming them in several configurations. The code and data to reproduce this work are publicly available at https://github.com/hadi-abdine/wsi-mim.
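    The final clustering step described above can be illustrated with a minimal, self-contained average-linkage agglomerative clustering over toy occurrence vectors. The IIC-trained encoder that would produce those vectors is out of scope here; the 2-D points below are hypothetical stand-ins for its representations.

```python
# Minimal average-linkage agglomerative clustering, as used to group the
# contextual vectors of a target word's occurrences into induced senses.

def dist(u, v):
    # Euclidean distance between two vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def agglomerative(vectors, n_clusters):
    # Start with one cluster per occurrence (indices into `vectors`).
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average
        # inter-point distance, then merge them.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

# Two toy "senses" of a word: occurrences near (0, 0) and near (5, 5).
vecs = [(0, 0), (0.1, 0.2), (5, 5), (5.1, 4.9), (0.2, 0.1)]
senses = agglomerative(vecs, n_clusters=2)
print(sorted(sorted(c) for c in senses))  # [[0, 1, 4], [2, 3]]
```

    The paper's "dynamic number of clusters" configuration would replace the fixed `n_clusters` stopping rule with a data-driven one, for example a distance threshold for merging.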

    Evaluation of a quality improvement intervention to reduce anastomotic leak following right colectomy (EAGLE): pragmatic, batched stepped-wedge, cluster-randomized trial in 64 countries

    Get PDF
    Background Anastomotic leak affects 8 per cent of patients after right colectomy, with a 10-fold increased risk of postoperative death. The EAGLE study aimed to develop and test whether an international, standardized quality improvement intervention could reduce anastomotic leaks. Methods The internationally intended protocol, iteratively co-developed by a multistage Delphi process, comprised an online educational module introducing risk stratification, an intraoperative checklist, and harmonized surgical techniques. Clusters (hospital teams) were randomized to one of three arms with varied sequences of intervention/data collection by a derived stepped-wedge batch design (at least 18 hospital teams per batch). Patients were blinded to the study allocation. Low- and middle-income country enrolment was encouraged. The primary outcome (assessed by intention to treat) was anastomotic leak rate, and subgroup analyses by module completion (at least 80 per cent of surgeons, high engagement; less than 50 per cent, low engagement) were preplanned. Results A total of 355 hospital teams registered, with 332 from 64 countries (39.2 per cent low and middle income) included in the final analysis. The online modules were completed by half of the surgeons (2143 of 4411). The primary analysis included 3039 of the 3268 patients recruited (206 patients had no anastomosis and 23 were lost to follow-up), with anastomotic leaks arising before and after the intervention in 10.1 and 9.6 per cent respectively (adjusted OR 0.87, 95 per cent c.i. 0.59 to 1.30; P = 0.498).
    The proportion of surgeons completing the educational modules was influential: the leak rate decreased from 12.2 per cent (61 of 500) before intervention to 5.1 per cent (24 of 473) after intervention in high-engagement centres (adjusted OR 0.36, 0.20 to 0.64; P < 0.001), but this was not observed in low-engagement hospitals (8.3 per cent (59 of 714) and 13.8 per cent (61 of 443) respectively; adjusted OR 2.09, 1.31 to 3.31). Conclusion Completion of globally available digital training by engaged teams can alter anastomotic leak rates. Registration number: NCT04270721 (http://www.clinicaltrials.gov)