The Development of Collocations as Constructions in L2 Writing
Cross-sectional and longitudinal learner corpus studies utilizing phraseological, frequency, and association strength approaches to phraseological unit identification have shown how the use of phraseological units varies across proficiency levels and develops over time. However, these methods suffer from several limitations, such as a reliance on native speaker intuition, a limited focus on contiguous word sequences, and a neglect of part of speech information in association strength calculation. This study seeks to address these limitations by defining lexical collocations as constructions (henceforth “collconstructions”) within the framework of Construction Grammar and investigating their cross-sectional variation and longitudinal development in two corpora of L2 writing. The cross-sectional corpus consisted of beginner and intermediate EFL learner texts assessed for overall writing proficiency, while the longitudinal corpus contained freewrites produced by ESL learners over the course of one year. Contiguous and non-contiguous adjective-noun, verb-noun, and adverb-adjective collconstruction tokens were extracted from each learner text in the two learner corpora. Each learner text was assessed for multiple constructional and collostructional indices of collconstruction production. Constructional indices included type frequencies, token frequencies, and normalized entropy scores for each collconstruction category. Collostructional indices consisted of proportion scores for different categories of adjective-noun, adverb-adjective, and verb-noun collconstruction types and tokens based on covarying collexeme scores calculated using frequency information from an academic reference corpus. Variation across proficiency levels was evaluated both qualitatively and quantitatively. The qualitative analysis consisted of examining variation in the production of specific functional collconstruction subcategories from a Usage-based Second Language Acquisition perspective. 
The quantitative analysis consisted of an ordinal logistic regression calculated to determine whether any indices of collconstruction production were predictive of L2 writing quality. Longitudinal development at the group level was investigated using linear mixed-effects models. Development for individual learners was examined from a Dynamic Systems Theory perspective, which focuses on the role of variability in language development as well as on interconnected development across multiple indices of collconstruction production. This study has important implications for future research on L2 phraseology and second language acquisition, as well as for phraseology pedagogy.
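As a rough illustration of one of the constructional indices above, a normalized entropy score over the collconstruction tokens of a learner text might be computed as follows (a minimal sketch; the token list is invented and the study's exact normalization may differ):

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    """Shannon entropy of the type distribution over tokens, normalized
    to [0, 1] by the maximum possible entropy, log2(number of types)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    h = -sum(p * math.log2(p) for p in probs)
    max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

# Hypothetical adjective-noun collconstruction tokens from one learner text
tokens = ["good_idea", "good_idea", "hard_work", "big_city"]
score = normalized_entropy(tokens)
```

A score near 1 indicates tokens spread evenly across many types (diverse collconstruction use); a score near 0 indicates heavy reliance on a few types.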
Masked Language Model Scoring
Pretrained masked language models (MLMs) require finetuning for most NLP
tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood
scores (PLLs), which are computed by masking tokens one by one. We show that
PLLs outperform scores from autoregressive language models like GPT-2 in a
variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an
end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on
state-of-the-art baselines for low-resource translation pairs, with further
gains from domain adaptation. We attribute this success to PLL's unsupervised
expression of linguistic acceptability without a left-to-right bias, greatly
improving on scores from GPT-2 (+10 points on island effects, NPI licensing in
BLiMP). One can finetune MLMs to give scores without masking, enabling
computation in a single inference pass. In all, PLLs and their associated
pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of
pretrained MLMs; e.g., we use a single cross-lingual model to rescore
translations in multiple languages. We release our library for language model
scoring at https://github.com/awslabs/mlm-scoring.
Comment: ACL 2020 camera-ready (presented July 2020).
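The PLL idea described above can be sketched in a few lines: mask one position at a time, sum the conditional log-probabilities of the held-out tokens, and exponentiate the per-token negative PLL to obtain a pseudo-perplexity. The `masked_logprob` scorer below is a stand-in; a real setup would query a pretrained masked LM such as RoBERTa (e.g., via the released mlm-scoring library) rather than the toy uniform scorer used here:

```python
import math

def pseudo_log_likelihood(tokens, masked_logprob):
    """PLL: sum of each token's conditional log-probability given the rest
    of the sentence, computed by masking one position at a time."""
    pll = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        pll += masked_logprob(tok, context, i)
    return pll

def pseudo_perplexity(tokens, masked_logprob):
    """PPPL: exponentiated negative PLL per token."""
    return math.exp(-pseudo_log_likelihood(tokens, masked_logprob) / len(tokens))

# Toy stand-in scorer assigning probability 0.5 to every token;
# a real scorer would return the masked LM's log-probability at position i.
uniform = lambda token, context, position: math.log(0.5)
pppl = pseudo_perplexity(["the", "cat", "sat"], uniform)  # 2.0 for this toy scorer
```

For rescoring, the PLL (or an interpolation of PLLs with first-pass scores) replaces the usual left-to-right log-probability when ranking ASR or NMT hypotheses.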
Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora
This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) manual annotation of a set of documents; (2) bootstrapping, i.e., active machine learning for the purpose of selecting which document to annotate next; (3) marking up the remaining unannotated documents of the original corpus using pre-tagging with revision.
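Phase two's active selection step can be illustrated with a minimal query-by-committee sketch, where the document on which a committee of learners disagrees most is chosen for annotation next (the function names and the vote-entropy disagreement measure are illustrative; the thesis's actual selection metric may differ):

```python
import math
from collections import Counter

def vote_entropy(committee_labels):
    """Disagreement over one token: entropy of the committee's label votes."""
    counts = Counter(committee_labels)
    n = len(committee_labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def select_next_document(doc_ids, committee_predictions):
    """Pick the document with the highest mean per-token vote entropy,
    i.e., the one the committee is least certain how to label."""
    def doc_disagreement(doc_id):
        per_token_votes = committee_predictions[doc_id]  # one vote list per token
        return sum(vote_entropy(v) for v in per_token_votes) / len(per_token_votes)
    return max(doc_ids, key=doc_disagreement)

# Hypothetical committee votes for the tokens of two unannotated documents
predictions = {
    "doc1": [["PER", "PER", "PER"], ["O", "O", "O"]],      # unanimous committee
    "doc2": [["PER", "ORG", "LOC"]],                        # maximal disagreement
}
chosen = select_next_document(["doc1", "doc2"], predictions)
```

The intrinsic stopping criterion mentioned under issue (4) would then monitor such disagreement scores and halt annotation once the committee converges.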
Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task and, as such, require the context of a practical setting in order to be properly addressed. The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three.
The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.
Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish
In morphologically complex languages, many high-level tasks in natural language
processing rely on accurate morphosyntactic analyses of the input. However, in
light of the risk of error propagation in present-day pipeline architectures for basic
linguistic pre-processing, the state of the art for morphosyntactic tagging is still
not satisfactory. The main obstacle here is data sparsity inherent to natural lan-
guage in general and highly inflected languages in particular.
In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase the cross-domain performance of taggers and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
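A minimal sketch of the feature-augmentation idea, assuming Brown-style hierarchical cluster bit strings induced from unlabelled text (the cluster table, feature names, and prefix lengths below are invented for illustration; the paper's actual clustering and feature set may differ):

```python
def cluster_features(word, clusters, prefix_lengths=(4, 6, 10)):
    """Features for a supervised tagger: the word form itself plus prefixes
    of its hierarchical cluster bit string. Prefixes of different lengths
    give features at several granularities, so an out-of-vocabulary word
    still receives informative features if it appeared in the unlabelled
    clustering data."""
    feats = [f"w={word.lower()}"]
    path = clusters.get(word.lower())
    if path is not None:
        for k in prefix_lengths:
            feats.append(f"cl{k}={path[:k]}")
    return feats

# Hypothetical cluster bit strings induced from unlabelled Polish text:
# morphologically similar forms tend to share long cluster prefixes.
clusters = {"kotem": "0110100110", "psem": "0110100111"}
feats = cluster_features("Kotem", clusters)
```

The tagger itself is unchanged; it simply consumes the extra `cl*` features alongside its usual word and affix features, which is what makes the technique easy to plug into an existing probabilistic tagger.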
Statistical Measures for Usage‐Based Linguistics
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/111781/1/lang12119.pd