28 research outputs found
Unsupervised Data Augmentation for Less-Resourced Languages with no Standardized Spelling
Non-standardized languages are a challenge for the construction of representative linguistic resources and for the development of efficient natural language processing tools: when spelling is not determined by a consensual norm, a given word can appear in many alternative written forms, which produces a large proportion of out-of-vocabulary words. To embrace this diversity, we propose a methodology based on crowdsourcing alternative spellings from which variation rules are automatically extracted. The rules are then used to match out-of-vocabulary words with one of their spelling variants. This virtuous process enables the unsupervised augmentation of multi-variant lexicons without requiring manual rule definition by experts. We apply this multilingual methodology to Alsatian, a French regional language, and provide (i) an intrinsic evaluation of the correctness of the obtained variant pairs and (ii) an extrinsic evaluation on a downstream task, part-of-speech tagging. We show that in a low-resource scenario, collecting spelling variants for only 145 words can lead to (i) the generation of 876 additional variant pairs and (ii) a reduction in out-of-vocabulary words that improves tagging performance by 1 to 4%.
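The rule-extraction and matching steps described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a "variation rule" is a character-level substitution found by aligning two variants with Python's `difflib.SequenceMatcher`, and that an out-of-vocabulary word is matched by applying a single rule once. The function names `extract_rules` and `match_oov` and the word strings are hypothetical toy examples, not data from the paper.

```python
from difflib import SequenceMatcher


def extract_rules(pairs):
    """Collect (old, new) substring substitutions observed between variant pairs."""
    rules = set()
    for a, b in pairs:
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
            if tag == "replace":
                # Assumption: neither spelling is canonical, so keep both directions.
                rules.add((a[i1:i2], b[j1:j2]))
                rules.add((b[j1:j2], a[i1:i2]))
    return rules


def match_oov(word, lexicon, rules):
    """Return lexicon entries reachable from `word` by applying one rule once."""
    matches = set()
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            candidate = word[:start] + new + word[start + len(old):]
            if candidate in lexicon:
                matches.add(candidate)
            start = word.find(old, start + 1)
    return matches


# Toy usage: one crowdsourced variant pair yields a rule that links
# an unseen spelling to an entry already present in the lexicon.
rules = extract_rules([("güet", "guet")])
print(match_oov("brueder", {"brüeder"}, rules))
```

Only substitutions are handled here; insertions and deletions (for example a doubled vowel) would need extra rule types, and a real system would presumably also score or filter rules rather than trust each one equally.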
Sciences participatives et diversité linguistique Retours d'expériences
Some languages suffer from a lack of resources in the broad sense, whether human, linguistic or financial, in particular for producing the automatic processing tools needed for their digital integration. For these so-called "less-resourced" languages, crowdsourcing appears to be a promising way to take advantage of the growing presence of speakers on the Internet.
Katana and Grand Guru: a Game of the Lost Words (DEMO)
We present here a prototype of a role-playing game that makes it possible both to (i) crowdsource lexical units (including idioms) for a language and (ii) help players improve their knowledge of the language. Our implementation of the game focuses on non-standardized languages, for which intergenerational transmission is not as efficient as it used to be. To address this, we incentivize the participation of a "Grand Guru", from whom the player needs help to fulfill their mission.
Water Condensation Zones around Main Sequence Stars
Understanding the set of conditions that allow rocky planets to have liquid
water on their surface -- in the form of lakes, seas or oceans -- is a major
scientific step to determine the fraction of planets potentially suitable for
the emergence and development of life as we know it on Earth. This effort is
also necessary to define and refine the so-called "Habitable Zone" (HZ) in
order to guide the search for exoplanets likely to harbor remotely detectable
life forms. Until now, most numerical climate studies on this topic have
focused on the conditions necessary to maintain oceans, but not to form them in
the first place. Here we use the three-dimensional Generic Planetary Climate
Model (PCM), historically known as the LMD Generic Global Climate Model (GCM),
to simulate water-dominated planetary atmospheres around different types of
Main-Sequence stars. The simulations are designed to reproduce the conditions
of early ocean formation on rocky planets due to the condensation of the
primordial water reservoir at the end of the magma ocean phase. We show that
the incoming stellar radiation (ISR) required to form oceans by condensation is
always drastically lower than that required to vaporize oceans. We introduce a
Water Condensation Limit, which lies at significantly lower ISR than the inner
edge of the HZ calculated with three-dimensional numerical climate simulations.
This difference is due to a behavior change of water clouds, from low-altitude
dayside convective clouds to high-altitude nightside stratospheric clouds.
Finally, we calculated transit spectra, emission spectra and thermal phase
curves of TRAPPIST-1b, c and d with H2O-rich atmospheres, and compared them to
CO2 atmospheres and bare rock simulations. We show using these observables that
JWST has the capability to probe steam atmospheres on low-mass planets, and
could possibly test the existence of nightside water clouds.
Comment: Accepted for publication in Astronomy & Astrophysics
Creating expert knowledge by relying on language learners : a generic approach for mass-producing language resources by combining implicit crowdsourcing and language learning
We introduce in this paper a generic approach that combines implicit crowdsourcing and language learning in order to mass-produce language resources (LRs) for any language for which a crowd of language learners can be involved. We present the approach by explaining its core paradigm, which consists in pairing specific types of LRs with specific exercises, by detailing both its strengths and challenges, and by discussing how far these challenges have been addressed at present. Accordingly, we also report on ongoing proof-of-concept efforts to develop the first prototypical implementation of the approach in order to correct and extend an LR called ConceptNet based on input crowdsourced from language learners. We then present an international network called the European Network for Combining Language Learning with Crowdsourcing Techniques (enetCollect), which provides the context to accelerate the implementation of the generic approach. Finally, we exemplify how it can be used in several language learning scenarios to produce a multitude of NLP resources and how it can therefore alleviate the long-standing NLP issue of the lack of LRs.
Peer-reviewed.
Crowdsourcing linguistic resources for natural non-standardised languages processing
Citizen science, in particular voluntary crowdsourcing, is a little-explored way to produce language resources for languages that remain less-resourced despite the presence of enough speakers online. We present in this work the experiments we conducted to enable the crowdsourcing of linguistic resources for the development of automatic part-of-speech annotation tools. We applied the methodology to three non-standardised languages, namely Alsatian, Guadeloupean Creole and Mauritian Creole. For different historical reasons, multiple (ortho)graphic practices coexist for these three languages. The difficulties raised by this variation led us to propose various crowdsourcing tasks that allow the collection of raw corpora, part-of-speech annotations, and graphic variants. The intrinsic and extrinsic analysis of these resources, used for the development of automatic annotation tools, shows the value of crowdsourcing in a non-standardized linguistic setting: the participants are not seen here as a uniform set of contributors whose cumulative efforts allow the completion of a particular task, but rather as a set of holders of complementary knowledge. The resources they collectively produce make it possible to develop tools that embrace the variation. The platforms developed, the language resources, and the trained tagger models are freely available.
Myriadisation de ressources linguistiques pour le traitement automatique de langues non standardisées
Citizen science, in particular voluntary crowdsourcing, is a little-explored way to produce language resources for languages that remain less-resourced despite the presence of enough speakers online. We present in this work the experiments we conducted to enable the crowdsourcing of linguistic resources for the development of automatic part-of-speech annotation tools. We applied the methodology to three non-standardised languages, namely Alsatian, Guadeloupean Creole and Mauritian Creole. For different historical reasons, multiple (ortho)graphic practices coexist for these three languages. The difficulties raised by this variation led us to propose various crowdsourcing tasks that allow the collection of raw corpora, part-of-speech annotations, and graphic variants. The intrinsic and extrinsic analysis of these resources, used for the development of automatic annotation tools, shows the value of crowdsourcing in a non-standardized linguistic setting: the participants are not seen here as a uniform set of contributors whose cumulative efforts allow the completion of a particular task, but rather as a set of holders of complementary knowledge. The resources they collectively produce make it possible to develop tools that embrace the variation. The platforms developed, the language resources, and the trained tagger models are freely available.
Getting to Know the Speakers: a Survey of a Non-Standardized Language Digital Use
This paper presents the results of an online survey on the Internet use of a less-resourced, non-standardized language: Alsatian. The survey, entitled “Alsatian, the Internet, and You”, received 1,224 answers over a two-month period starting in January 2019. The purpose of this survey is twofold. First, we collect generic information on how Alsatian-speaking Internet users use their language. Second, based on our own experience of crowdsourcing linguistic resources for Alsatian, we use this survey to gather insights into the needs, abilities and expectations of the speakers in order to make the most of their participation.
De la bienveillance aux apprentissages
The term "benevolence" regularly appears in official texts from the Ministry of National Education. The goal is to promote students' well-being in order to improve the quality of their learning. We examined the physiological mechanisms that make this possible, as well as the role of the teacher. Research in education, scientific research, and affective and social neuroscience explain how a child's well-being optimises the development of their brain, and thus of their cognitive abilities. The analysis of questionnaires addressed to students and teachers allowed us to select and experiment with everyday classroom practices that contribute to student well-being in line with the official curricula. We were able to establish that adopting benevolent conduct, expressing emotions, particularly through various subject areas, and practising cardiac coherence improve student well-being while drawing on the skills set out in the current curricula.