Which words do English non-native speakers know? New supernational levels based on yes/no decision
To have more information about the English words known by second language (L2) speakers, we ran a large-scale crowdsourcing vocabulary test, which yielded 17 million useful responses. It provided us with a list of 445 words known to nearly all participants. The list was compared to various existing lists of words recommended for inclusion in the first stages of English L2 teaching. The data also provided us with a ranking of 61,000 words in terms of degree and speed of word recognition in English L2 speakers, which correlated r = .85 with a similar ranking based on native English speakers. The L2 speakers in our study were relatively better at academic words (which are often cognates in their mother tongue) and words related to experiences English L2 students are likely to have. They were worse at words related to childhood and family life. Finally, a new list of 20 levels of 1,000 word families is presented, which will be of use to English L2 teachers, as the levels represent the order in which English vocabulary seems to be acquired by L2 learners across the world.
Interactive Rewriting for Non-Native English Speakers
Tohoku University, Doctor (Information Sciences) thesis
Rethinking First Language–Second Language Similarities and Differences in English Proficiency: Insights From the ENglish Reading Online (ENRO) Project
This article presents the ENglish Reading Online (ENRO) project that offers
data on English reading and listening comprehension from 7,338 university-level advanced
learners and native speakers of English representing 19 countries. The database
also includes estimates of reading rate and seven component skills of English, including
vocabulary, spelling, and grammar, as well as rich demographic and language background
data. We first demonstrate high reliability for ENRO tests and their convergent
validity with existing meta-analyses. We then provide a bird’s-eye view of first (L1) and
second (L2) language comparisons and examine the relative role of various predictors of reading and listening comprehension and reading speed. Across analyses, we found
substantially more overlap than differences between L1 and L2 speakers, suggesting
that English reading proficiency is best considered across a continuum of skill, ability,
and experiences spanning L1 and L2 speakers alike. We end by providing pointers for
how researchers can mine ENRO data for future studies. This article is published in Language Learning, 73(1).
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning
A growing body of work shows that many problems in fairness, accountability,
transparency, and ethics in machine learning systems are rooted in decisions
surrounding the data collection and annotation process. In spite of its
fundamental nature, however, data collection remains an overlooked part of the
machine learning (ML) pipeline. In this paper, we argue that a new
specialization should be formed within ML that is focused on methodologies for
data collection and annotation: efforts that require institutional frameworks
and procedures. Specifically for sociocultural data, parallels can be drawn
from archives and libraries. Archives are the longest standing communal effort
to gather human information and archive scholars have already developed the
language and procedures to address and discuss many challenges pertaining to
data collection such as consent, power, inclusivity, transparency, and ethics &
privacy. We discuss these five key approaches in document collection practices
in archives that can inform data collection in sociocultural ML. By showing
data collection practices from another field, we encourage ML research to be
more cognizant and systematic in data collection and draw from
interdisciplinary expertise. Comment: To be published in Conference on Fairness, Accountability, and
Transparency FAT* '20, January 27-30, 2020, Barcelona, Spain. ACM, New York,
NY, USA, 11 pages.
Analyzing Authentic Texts for Language Learning: Web-based Technology for Input Enrichment and Question Generation
Acquisition of a language largely depends on the learner's exposure to and interaction with it. Our research goal is to explore and implement automatic techniques that help create a richer grammatical intake from a given text input and engage learners in making form-meaning connections during reading.
A starting point for addressing this issue is the automatic input enrichment method, which aims to ensure that a target structure is richly represented in a given text.
We demonstrate the high performance of our rule-based algorithm, which is able to detect 87 linguistic forms contained in an official curriculum for the English language. Showcasing the algorithm's capability to differentiate between the various functions of the same linguistic form, we establish the task of tense sense disambiguation, which we approach by leveraging machine learning and rule-based methods.
Using the aforementioned technology, we develop an online information retrieval system FLAIR that prioritizes texts with a rich representation of selected linguistic forms. It is implemented as a web search engine for language teachers and learners and provides effective input enrichment in a real-life teaching setting. It can also serve as a foundation for empirical research on input enrichment and input enhancement.
The input enrichment component of the FLAIR system is evaluated in a web-based study that demonstrates that English teachers prefer automatic input enrichment to standard web search when selecting reading material for class.
We then explore automatic question generation for facilitating and testing reading comprehension as well as linguistic knowledge.
We give an overview of the types of questions that are usually asked and can be automatically generated from text in the language learning context. We argue that questions can facilitate the acquisition of different linguistic forms by providing functionally driven input enhancement, i.e., by ensuring that the learner notices and processes the form.
The generation of well-established and novel types of questions is discussed and examples are provided; moreover, the results from a crowdsourcing study show that automatically generated questions are comparable to human-written ones.
Text Simplification without Simplified Corpora
Tokyo Metropolitan University, 2018-03-25, Doctor (Engineering), Tokyo Metropolitan University
HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition
This paper introduces HeBERT and HebEMO. HeBERT is a Transformer-based model
for modern Hebrew text, which relies on a BERT (Bidirectional Encoder
Representations from Transformers) architecture. BERT has been shown to
outperform alternative architectures in sentiment analysis, and is suggested to
be particularly appropriate for morphologically rich languages (MRLs). Analyzing multiple BERT specifications,
we find that while model complexity correlates with high performance on
language tasks that aim to understand terms in a sentence, a more-parsimonious
model better captures the sentiment of an entire sentence. Either way, our
BERT-based language model outperforms all existing Hebrew alternatives on all
common language tasks. HebEMO is a tool that uses HeBERT to detect polarity and
extract emotions from Hebrew user-generated content (UGC). HebEMO is trained on a unique
Covid-19-related UGC dataset that we collected and annotated for this study.
Data collection and annotation followed an active learning procedure that aimed
to maximize predictability. We show that HebEMO yields a high F1-score of 0.96
for polarity classification. Emotion detection reaches F1-scores of 0.78-0.97
for various target emotions, with the exception of surprise, which the model
failed to capture (F1 = 0.41). These results are better than the best-reported
performance, even among English-language models of emotion detection.
A Comprehensive Overview of Large Language Models
Large Language Models (LLMs) have shown excellent generalization capabilities
that have led to the development of numerous models. These models propose
various new architectures, tweaking existing architectures with refined
training strategies, increasing context length, using high-quality training
data, and increasing training time to outperform baselines. Analyzing new
developments is crucial for identifying changes that enhance training stability
and improve generalization in LLMs. This survey paper comprehensively analyses
LLM architectures and their categorization, training strategies, training
datasets, and performance evaluations and discusses future research directions.
Moreover, the paper also discusses the basic building blocks and concepts
behind LLMs, followed by a complete overview of LLMs, including their important
features and functions. Finally, the paper summarizes significant findings from
LLM research and consolidates essential architectural and training strategies
for developing advanced LLMs. Given the continuous advancements in LLMs, we
intend to regularly update this paper by incorporating new sections and
featuring the latest LLMs.