555 research outputs found

Which words do English non-native speakers know? New supernational levels based on yes/no decision

Author: Brysbaert Marc
Keuleers Emmanuel
Mandera Pawel
Publication venue: 'SAGE Publications'
Publication date: 19/06/2020

To have more information about the English words known by second language (L2) speakers, we ran a large-scale crowdsourcing vocabulary test, which yielded 17 million useful responses. It provided us with a list of 445 words known to nearly all participants. The list was compared to various existing lists of words recommended for inclusion in the first stages of English L2 teaching. The data also provided us with a ranking of 61,000 words in terms of degree and speed of word recognition in English L2 speakers, which correlated r = .85 with a similar ranking based on native English speakers. The L2 speakers in our study were relatively better at academic words (which are often cognates in their mother tongue) and words related to experiences English L2 students are likely to have. They were worse at words related to childhood and family life. Finally, a new list of 20 levels of 1,000 word families is presented, which will be of use to English L2 teachers, as the levels represent the order in which English vocabulary seems to be acquired by L2 learners across the world.

Ghent University Academic Bibliography

Tilburg University Repository

Interactive Rewriting for Non-Native English Speakers (非英語母語話者のためのインタラクティブな書き換え)

Author: 伊藤拓海
Publication date: 24/03/2023

Tohoku University, doctoral thesis (Information Sciences)

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

Rethinking First Language–Second Language Similarities and Differences in English Proficiency: Insights From the ENglish Reading Online (ENRO) Project

Author: Gattei Carolina A.
Shalom Diego E.
et al.
Publication venue: Wiley Periodicals LLC
Publication date: 01/01/2023

This article presents the ENglish Reading Online (ENRO) project that offers data on English reading and listening comprehension from 7,338 university-level advanced learners and native speakers of English representing 19 countries. The database also includes estimates of reading rate and seven component skills of English, including vocabulary, spelling, and grammar, as well as rich demographic and language background data. We first demonstrate high reliability for ENRO tests and their convergent validity with existing meta-analyses. We then provide a bird’s-eye view of first (L1) and second (L2) language comparisons and examine the relative role of various predictors of reading and listening comprehension and reading speed. Across analyses, we found substantially more overlap than differences between L1 and L2 speakers, suggesting that English reading proficiency is best considered across a continuum of skill, ability, and experiences spanning L1 and L2 speakers alike. We end by providing pointers for how researchers can mine ENRO data for future studies. This article is published in Language Learning, 73(1).

Repositorio Digital Universidad Torcuato Di Tella

Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning

Author: Buolamwini Joy
Duggan Maeve
Grimm Tracy B.
Horton Valerie
Koehn Philipp
MacNeil Heather
Magazine Life
SAA.
Tripathi Aditya
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 22/12/2019

A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature, however, data collection remains an overlooked part of the machine learning (ML) pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest standing communal effort to gather human information, and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural ML. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection and draw from interdisciplinary expertise. Comment: To be published in Conference on Fairness, Accountability, and Transparency FAT* '20, January 27-30, 2020, Barcelona, Spain. ACM, New York, NY, USA, 11 pages.

arXiv.org e-Print Archive

Crossref

Analyzing Authentic Texts for Language Learning: Web-based Technology for Input Enrichment and Question Generation

Author: Chinkina Maria
Publication venue: Universität Tübingen
Publication date: 01/01/2018

Acquisition of a language largely depends on the learner's exposure to and interaction with it. Our research goal is to explore and implement automatic techniques that help create a richer grammatical intake from a given text input and engage learners in making form-meaning connections during reading. A starting point for addressing this issue is the automatic input enrichment method, which aims to ensure that a target structure is richly represented in a given text. We demonstrate the high performance of our rule-based algorithm, which is able to detect 87 linguistic forms contained in an official curriculum for the English language. Showcasing the algorithm's capability to differentiate between the various functions of the same linguistic form, we establish the task of tense sense disambiguation, which we approach by leveraging machine learning and rule-based methods. Using the aforementioned technology, we develop an online information retrieval system FLAIR that prioritizes texts with a rich representation of selected linguistic forms. It is implemented as a web search engine for language teachers and learners and provides effective input enrichment in a real-life teaching setting. It can also serve as a foundation for empirical research on input enrichment and input enhancement. The input enrichment component of the FLAIR system is evaluated in a web-based study that demonstrates that English teachers prefer automatic input enrichment to standard web search when selecting reading material for class. We then explore automatic question generation for facilitating and testing reading comprehension as well as linguistic knowledge. We give an overview of the types of questions that are usually asked and can be automatically generated from text in the language learning context. 
We argue that questions can facilitate the acquisition of different linguistic forms by providing functionally driven input enhancement, i.e., by ensuring that the learner notices and processes the form. The generation of well-established and novel types of questions is discussed and examples are provided; moreover, the results from a crowdsourcing study show that automatically generated questions are comparable to human-written ones.

Publikationsserver der Universität Tübingen

Text Simplification without Simplified Corpora (平易なコーパスを用いないテキスト平易化)

Author: Kajiwara Tomoyuki
Publication date: 25/03/2018

Tokyo Metropolitan University, 2018-03-25, Doctor of Engineering, Tokyo Metropolitan University

Tokyo Metropolitan University Institutional Repository Miyako-Dori / 首都大学東京機関リポジトリ

HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition

Author: Chriqui Avihay
Yahav Inbal
Publication date: 25/02/2021

This paper introduces HeBERT and HebEMO. HeBERT is a Transformer-based model for modern Hebrew text, which relies on a BERT (Bidirectional Encoder Representations from Transformers) architecture. BERT has been shown to outperform alternative architectures in sentiment analysis, and is suggested to be particularly appropriate for morphologically rich languages (MRLs). Analyzing multiple BERT specifications, we find that while model complexity correlates with high performance on language tasks that aim to understand terms in a sentence, a more parsimonious model better captures the sentiment of an entire sentence. Either way, our BERT-based language model outperforms all existing Hebrew alternatives on all common language tasks. HebEMO is a tool that uses HeBERT to detect polarity and extract emotions from Hebrew user-generated content (UGC). HebEMO is trained on a unique Covid-19-related UGC dataset that we collected and annotated for this study. Data collection and annotation followed an active learning procedure that aimed to maximize predictability. We show that HebEMO yields a high F1-score of 0.96 for polarity classification. Emotion detection reaches F1-scores of 0.78-0.97 for various target emotions, with the exception of surprise, which the model failed to capture (F1 = 0.41). These results are better than the best-reported performance, even among English-language models of emotion detection.

arXiv.org e-Print Archive

A Comprehensive Overview of Large Language Models

Author: Anwar Saeed
Barnes Nick
Khan Asad Ullah
Mian Ajmal
Naveed Humza
Qiu Shi
Saqib Muhammad
Usman Muhammad
Publication date: 12/07/2023

Large Language Models (LLMs) have shown excellent generalization capabilities that have led to the development of numerous models. These models propose various new architectures, tweaking existing architectures with refined training strategies, increasing context length, using high-quality training data, and increasing training time to outperform baselines. Analyzing new developments is crucial for identifying changes that enhance training stability and improve generalization in LLMs. This survey paper comprehensively analyzes LLM architectures and their categorization, training strategies, training datasets, and performance evaluations, and discusses future research directions. Moreover, the paper also discusses the basic building blocks and concepts behind LLMs, followed by a complete overview of LLMs, including their important features and functions. Finally, the paper summarizes significant findings from LLM research and consolidates essential architectural and training strategies for developing advanced LLMs. Given the continuous advancements in LLMs, we intend to regularly update this paper by incorporating new sections and featuring the latest LLMs.

arXiv.org e-Print Archive

2nd Conference on Language, Data and Knowledge (LDK 2019), May 20–23, 2019, Leipzig, Germany

Author: Buitelaar Paul
Chiarcos Christian
de Melo Gerard
Dojchinovski Milan
Eskevich Maria
Fäth Christian
Klimek Bettina
McCrae John P.
Publication date: 27/04/2023

OPUS Augsburg