24 research outputs found

    Mark my words! On the automated prediction of lexical difficulty for foreign language readers

    No full text
    The goal of this doctoral research is to automatically predict difficult words in a text for non-native speakers. This prediction is crucial because good text comprehension is strongly determined by vocabulary knowledge. If a text contains too high a percentage of unknown words, the reader is likely to struggle to understand it. In order to provide good support to the non-native reader, we must first be able to predict the number of difficult words. Usually, this is done manually, based on expertise or prior vocabulary tests. However, such methods are not practical when reading takes place in a computer-based environment such as a tablet or an online learning platform. In these cases, the predictions need to be automated. The thesis is divided into three parts. The first part contains a systematic review of the relevant scientific literature. The synthesis covers 50 years of research and 140 peer-reviewed publications on the statistical prediction of lexical competence in non-native readers. Among other things, the analyses show that the field is divided into two strands of research that have little connection with each other. On the one hand, there is a long tradition of experimental research in second language acquisition (SLA) and computer-assisted language learning (CALL). These experimental studies mainly test the effect of certain factors (e.g., repeating difficult words or adding electronic glosses) on the learning of unfamiliar words during reading. On the other hand, recent studies in natural language processing (NLP) rely on artificial intelligence to automatically predict difficult words. The literature review also points out several limitations that were studied further in this doctoral research. The first limitation is the lack of contextualized measures and predictions: although research has shown that the context in which a word occurs is an important factor, predictions are often made on the basis of isolated vocabulary tests, among other things. The second limitation is the lack of personalized measures and predictions: although research on second language acquisition has shown that there are many differences among non-native readers, recent studies in artificial intelligence make predictions based on aggregated data. The final limitation is that the majority of studies (74%) focus on English as a foreign language. Consequently, this doctoral research aims for a contextualized and personalized approach and focuses on Dutch and French as foreign languages.
    The second part examines two measures of lexical difficulty for non-native readers. On the one hand, it investigates how words are introduced in didactic reading materials labeled with CEFR levels. This study introduces a new graded lexical database for Dutch, namely NT2Lex (Tack et al., 2018). The innovative feature of this database is that the frequency per difficulty level was calculated for each word sense, disambiguated on the basis of the sentence context. However, the results show important inconsistencies in how etymologically related translations appear in the Dutch and French databases, so this difficulty measure does not yet seem valid as the basis for an automated system. On the other hand, it investigates how non-native speakers themselves perceive difficult words during reading. The perception of difficulty is important to predict because the learner's attention is a determining factor in the learning process (Schmidt, 2001). This study introduces new data for readers of French. An important goal of these data is to make correct predictions for all words in a text, which contrasts with studies in second language acquisition that focus on a limited number (Mdn = 22) of target words. Moreover, the analyses show that the data can be used to develop a personalized and contextualized system.
    The third and final part examines two types of predictive models developed on the aforementioned data, namely mixed-effects models and artificial neural networks. The results validate the idea that perceptions of lexical difficulty can be predicted primarily on the basis of "word surprisal", a central concept in information theory. Furthermore, the analyses show that commonly used performance statistics (such as accuracy and F-score) are sensitive to individual differences in difficulty rates. Because these statistics are therefore not appropriate for comparing predictions across learners, the D and Phi coefficients are used instead. Moreover, the results clearly show that a personalized model makes significantly better predictions than a non-personalized model. In addition, the results show that a contextualized model can better discriminate difficulty, although these improvements are not always significant for every learner.
    (LALE - Langues et lettres) -- UCL, 202
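    Since word surprisal is singled out above as the main predictor of perceived difficulty, a minimal sketch of how per-token surprisal can be computed may be helpful. It uses an off-the-shelf GPT-2 model via the transformers library purely for illustration; the model choice and helper function are assumptions and not the thesis's own models or data.

```python
# Illustrative only: per-token surprisal -log2 P(token | preceding context),
# computed with an off-the-shelf GPT-2 model. Dividing by log(2) expresses
# the value in bits, the information-theoretic unit referred to above.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(text: str):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # the prediction at position i-1 scores the token observed at position i
    picked = log_probs[0, :-1].gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    surprisals = (-picked / math.log(2)).tolist()
    return list(zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisals))

print(token_surprisals("The reader struggled with the abstruse terminology."))
```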

    CEFR-based Short Answer Grading

    No full text
    The project through which the corpus was collected is concerned with the task of automatically assessing the written proficiency level of non-native (L2) learners of English. Drawing on previous research on automated L2 writing assessment following the Common European Framework of Reference for Languages (CEFR), we investigate the possibilities and difficulties of deriving the CEFR level from short answers to open-ended questions, a setting that has received little attention to date. The objective of our study is twofold: to examine the difficulties involved in both human and automated CEFR-based grading of short answers. First, we compiled a learner corpus of short answers graded with CEFR levels by three certified Cambridge examiners. Next, we used the corpus to develop a soft-voting system for the automated CEFR-based grading of short answers.
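    As a rough illustration of soft voting (not the system described above; the features, estimators, and toy answers below are assumptions), class probabilities from several classifiers can be averaged before a CEFR label is chosen, for example with scikit-learn:

```python
# Minimal soft-voting sketch: predicted probabilities from several classifiers
# are averaged before picking the CEFR label. Features, estimators, and the
# toy answers/labels are illustrative assumptions only.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

answers = [
    "I like go to the park with my friend.",
    "My family is four people and we live in a small house.",
    "Although I had never travelled abroad, I decided to apply for the exchange.",
    "The lecture addressed the economic consequences of remote work.",
]
levels = ["A2", "A2", "B2", "B2"]  # CEFR labels assigned by human examiners

grader = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ],
        voting="soft",  # average predicted probabilities across estimators
    ),
)
grader.fit(answers, levels)
print(grader.predict(["We goes to school every day and play football."]))
```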

    The BEA 2023 Shared Task on Generating AI Teacher Responses in Educational Dialogues

    No full text
    This paper describes the results of the first shared task on the generation of teacher responses in educational dialogues. The goal of the task was to benchmark the ability of generative language models to act as AI teachers, replying to a student in a teacher-student dialogue. Eight teams participated in the competition hosted on CodaLab. They experimented with a wide variety of state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2, Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions were automatically scored using BERTScore and DialogRPT metrics, and the top three among them were further manually evaluated in terms of pedagogical ability based on Tack and Piech (2022). The NAISTeacher system, which ranked first in both automated and human evaluation, generated responses with GPT-3.5 using an ensemble of prompts and a DialogRPT-based ranking of responses for given dialogue contexts. Despite the promising achievements of the participating teams, the results also highlight the need for evaluation metrics better suited to educational contexts.
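    For illustration, a minimal sketch of reference-based scoring with the bert-score package is given below; the candidate and reference strings are invented, and the DialogRPT ranking used in the shared-task evaluation is not shown.

```python
# Hypothetical example of scoring a generated teacher response against a
# reference response with BERTScore; DialogRPT is omitted from this sketch.
from bert_score import score

candidates = ["Good start! Now try factoring the left-hand side first."]
references = ["Let's begin by factoring the left-hand side of the equation."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```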

    The Role of Cognate Vocabulary in CEFR-based Word-level Readability Assessment

    No full text
    Cognate vocabulary is known to have a facilitating effect on foreign language (L2) lexical development (de Groot and Keijzer, 2000; Elgort, 2013). Because of their cross-lingual semiotic transparency, cognates are known to be easier to comprehend and learn. As a result, cognate status has been considered an important feature when modeling L2 vocabulary learning (Willis and Ohashi, 2012) or when assessing L2 lexical readability (Beinborn et al., 2014). Although the latter readability-focused user study has shown a positive effect of cognates on decontextualized word comprehension, few studies seem to have focused on how cognate vocabulary is distributed in reading texts of different L2 levels, such as reading materials found in textbooks graded along the CEFR (Common European Framework of Reference) scale (Council of Europe, 2001). Our aim is therefore to examine whether the presupposed increasing difficulty of the lexical stock attested in such texts is somehow related to cognate density. To this end, we will focus on French and Dutch L2 and will use two lexical databases, viz. FLELex (François et al., 2014) and NT2Lex (Tack et al., 2018), respectively. These resources have been compiled from a corpus of L2 reading materials targeted towards a specific CEFR level, including expert-written texts found in textbooks or readers. The lexicons thus describe word frequency distributions observed along the CEFR scale and therefore inform us about the lexical stock that should be understood a priori at a given level. In these CEFR-graded word distributions, cognate vocabulary in Dutch and French will be automatically identified, drawing on recent machine translation methods (Beinborn et al., 2013; Mitkov et al., 2007). As a parallel reference dataset, we will use the Dutch-French alignments of the Dutch Parallel Corpus (Paulussen et al., 2006).
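    The study itself draws on machine-translation-based methods for cognate identification; as a much simpler stand-in for illustration only, a character-similarity heuristic over aligned translation pairs might look as follows (the threshold and word pairs are assumptions):

```python
# Illustrative cognate-candidate detection over Dutch-French translation pairs
# using a normalized character-overlap ratio. The real study uses MT-based
# methods; this heuristic and its threshold are assumptions for illustration.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized character-overlap ratio between two word forms."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_cognate_candidate(src: str, tgt: str, threshold: float = 0.6) -> bool:
    """Flag a translation pair as a likely cognate if the forms are similar enough."""
    return similarity(src, tgt) >= threshold

# hypothetical Dutch-French translation pairs from a parallel corpus
pairs = [("universiteit", "université"), ("appel", "pomme"), ("muziek", "musique")]
for nl, fr in pairs:
    print(nl, fr, is_cognate_candidate(nl, fr))
```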

    SVALex. A Second Language Word List with CEFR Levels

    No full text
    When planning to develop a language course in a second or foreign language (L2), one faces the challenge of defining the vocabulary that learners need to acquire. Research in second language acquisition suggests that a reader needs to know 95-98% of the running words in a text in order to understand it (Laufer & Ravenhorst-Kalovski 2010). Such studies are useful for estimating the size of the vocabulary needed to grasp the content of a text, but they provide no further methodological guidance for those who want to develop level-structured teaching materials or courses for second language instruction. This is particularly evident in CALL, Computer-Assisted Language Learning, where teaching materials (e.g., exercises) are generated automatically and require electronic resources as a source of knowledge. One can instead approach the problem from the other direction. Given a collection of level-classified texts for second language learners, one can use them to build word lists in which each word is placed on a proficiency scale. If the presupposed proficiency level of the reader is known, one can simply assume that the text level at which a word first appears also indicates the word's level of difficulty. SVALex is a lexicon built according to this principle. The resource is intended for learners and teachers of Swedish as a second language, but also for lexicographers, course developers and test designers, as well as for those who, like ourselves, work on language-technology-based computer support for language learning and language testing. SVALex constitutes a further development in relation to previous lexical resources for Swedish as a second language (see Section 2), in that it consistently relates its 15,681 lexical entries to a widely used proficiency scale for second and foreign language learning, the Council of Europe's Common European Framework of Reference for Languages (hereafter CEFR) (Council of Europe 2001; Skolverket 2009). The level classification of the lexical entries in SVALex is based on their distribution in COCTAILL, a corpus of textbook texts for Swedish as a second language in which teachers have assigned each text to one of the CEFR levels (Volodina et al. 2014).
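    A minimal sketch of the "first level of occurrence" principle described above is given below; the corpus format and level ordering are simplifying assumptions and do not reflect the actual COCTAILL or SVALex data model.

```python
# Assign each lemma the easiest CEFR level at which it is attested in a
# collection of level-classified texts. Data layout is an assumption.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def first_occurrence_levels(levelled_texts):
    """levelled_texts: iterable of (cefr_level, list_of_lemmas) pairs."""
    levels = {}
    for level, lemmas in levelled_texts:
        for lemma in lemmas:
            # keep the easiest (earliest) level at which the lemma appears
            if lemma not in levels or CEFR_ORDER.index(level) < CEFR_ORDER.index(levels[lemma]):
                levels[lemma] = level
    return levels

corpus = [("A1", ["hund", "bok"]), ("B1", ["bok", "forskning"])]
print(first_occurrence_levels(corpus))  # {'hund': 'A1', 'bok': 'A1', 'forskning': 'B1'}
```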

    Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers

    No full text
    In this paper, we present a new parallel corpus addressed to researchers, teachers, and speech therapists interested in text simplification as a means of alleviating difficulties in children learning to read. The corpus is composed of excerpts drawn from 79 authentic literary (tales, stories) and scientific (documentary) texts commonly used in French schools for children aged between 7 and 9 years old. The excerpts were manually simplified at the lexical, morpho-syntactic, and discourse levels in order to propose a parallel corpus for reading tests and for the development of automatic text simplification tools. A sample of 21 poor-reading and dyslexic children with an average reading delay of 2.5 years read a portion of the corpus. The transcripts of reading errors were integrated into the corpus with the goal of identifying lexical difficulty in the target population. By means of statistical testing, we provide evidence that the manual simplifications significantly reduced reading errors, highlighting that the words targeted for simplification were not only well-chosen but also substituted with substantially easier alternatives. The entire corpus is available for consultation through a web interface and available on demand for research purposes.
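    As an illustration of the kind of paired comparison that could support such a claim (the specific test and all numbers below are assumptions, not the paper's data or analysis):

```python
# Hypothetical paired test: reading errors on original vs. simplified versions
# of the same excerpts. Numbers are invented for illustration.
from scipy.stats import wilcoxon

errors_original = [12, 9, 15, 7, 11, 14, 8, 10]
errors_simplified = [8, 6, 10, 7, 7, 9, 5, 8]

stat, p = wilcoxon(errors_original, errors_simplified)
print(f"Wilcoxon W = {stat}, p = {p:.4f}")
```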
