
    The Development of Collocations as Constructions in L2 Writing

    Cross-sectional and longitudinal learner corpus studies utilizing phraseological, frequency, and association strength approaches to phraseological unit identification have shown how the use of phraseological units varies across proficiency levels and develops over time. However, these methods suffer from several limitations, such as a reliance on native speaker intuition, a limited focus on contiguous word sequences, and a neglect of part-of-speech information in association strength calculation. This study addresses these limitations by defining lexical collocations as constructions (henceforth “collconstructions”) within the framework of Construction Grammar and investigating their cross-sectional variation and longitudinal development in two corpora of L2 writing. The cross-sectional corpus consisted of beginner and intermediate EFL learner texts assessed for overall writing proficiency, while the longitudinal corpus contained freewrites produced by ESL learners over the course of one year. Contiguous and non-contiguous adjective-noun, verb-noun, and adverb-adjective collconstruction tokens were extracted from each learner text in the two corpora. Each text was assessed for multiple constructional and collostructional indices of collconstruction production. Constructional indices included type frequencies, token frequencies, and normalized entropy scores for each collconstruction category. Collostructional indices consisted of proportion scores for different categories of adjective-noun, adverb-adjective, and verb-noun collconstruction types and tokens, based on covarying collexeme scores calculated using frequency information from an academic reference corpus. Variation across proficiency levels was evaluated both qualitatively and quantitatively: the qualitative analysis examined variation in the production of specific functional collconstruction subcategories from a Usage-based Second Language Acquisition perspective, while the quantitative analysis used an ordinal logistic regression to determine whether any indices of collconstruction production were predictive of L2 writing quality. Longitudinal development at the group level was investigated with linear mixed-effects models. Development for individual learners was examined from a Dynamic Systems Theory perspective, focusing on the role of variability in language development as well as on interconnected development across multiple indices of collconstruction production. This study has important implications for future research on L2 phraseology and second language acquisition, as well as for phraseology pedagogy.
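    To make one of the constructional indices concrete, the sketch below computes a normalized entropy score over the type distribution of the collconstruction tokens extracted from a single text: 1 indicates that types are used maximally evenly, 0 that a single type dominates. The token representation and the example data are hypothetical illustrations, not the study's actual extraction output.

```python
import math
from collections import Counter

def normalized_entropy(tokens):
    """Shannon entropy of the type distribution of a token list,
    normalized by the maximum possible entropy, log(number of types)."""
    counts = Counter(tokens)
    if len(counts) < 2:
        return 0.0  # with a single type there is no dispersion to measure
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))

# Hypothetical verb-noun collconstruction tokens from one learner text.
vn_tokens = ["make_decision", "take_time", "make_decision", "do_homework"]
print(normalized_entropy(vn_tokens))  # ~0.95: fairly even use of three types
```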

    Masked Language Model Scoring

    Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rescore translations in multiple languages. We release our library for language model scoring at https://github.com/awslabs/mlm-scoring. Comment: ACL 2020 camera-ready (presented July 2020).
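    The released mlm-scoring library implements this efficiently; the sketch below shows only the core PLL idea, using the Hugging Face transformers API (an assumption for illustration, not the paper's own code): mask each token in turn and accumulate the model's log-probability of the true token.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pll(sentence: str) -> float:
    """Pseudo-log-likelihood: sum of log P(token | rest) with each
    token masked in turn (one forward pass per token)."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the <s> and </s> specials
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# A less negative PLL indicates a more acceptable sentence; for rescoring,
# n-best hypotheses would be ranked by this score (possibly interpolated
# with the base ASR/NMT model's score).
print(pll("The cat sat on the mat."))
print(pll("The cat sat on of mat."))
```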

    Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora

    This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named entity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) manual annotation of a set of documents; (2) bootstrapping, in which active machine learning selects which document to annotate next; (3) marking up the remaining unannotated documents of the original corpus using pre-tagging with revision. Five emerging issues are identified, described, and empirically investigated in the thesis. Their common denominator is that they all depend on the realization of the named entity recognition task, and as such require the context of a practical setting in order to be properly addressed. The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging issues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task.
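    The heart of phase two is committee-based active selection: the next document to annotate is the one the current committee of recognizers disagrees on most. The generic query-by-committee sketch below uses vote entropy as the disagreement measure; the committee interface is a hypothetical assumption, and the thesis's actual base learners, selection metric, and intrinsic stopping criterion are specified there, not here.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Disagreement over one document: entropy of the distribution of
    the committee members' predicted labelings (higher = more disagreement)."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def select_next_document(unannotated_docs, committee):
    """Phase-two selection step: pick the document with maximal committee
    disagreement for the human annotator to mark up next. Each committee
    member is assumed to expose predict(doc), returning a label sequence
    (a hypothetical interface)."""
    def disagreement(doc):
        votes = [tuple(member.predict(doc)) for member in committee]
        return vote_entropy(votes)
    return max(unannotated_docs, key=disagreement)
```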

    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity, inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful for increasing the cross-domain performance of taggers and for reducing the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages.
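    A minimal sketch of the cluster-feature idea, assuming Brown-style hierarchical clusters induced from unlabelled text and stored as tab-separated "bit-string, word, frequency" lines (a common output format); the paper's actual clustering method and feature templates may differ. Because cluster identities come from unlabelled data, an out-of-vocabulary inflected form can still share features with in-vocabulary words from the same cluster.

```python
def load_clusters(path):
    """Read word -> cluster bit-string from a tab-separated file with
    lines of the form: <bit-string>\t<word>\t<frequency>."""
    clusters = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, _freq = line.rstrip("\n").split("\t")
            clusters[word] = bits
    return clusters

def cluster_features(word, clusters, prefix_lengths=(4, 8, 12)):
    """Coarse-to-fine cluster features for a tagger's feature extractor:
    prefixes of the hierarchical bit-string act as cluster identities of
    increasing granularity."""
    bits = clusters.get(word)
    if bits is None:
        return ["cluster=UNK"]
    return [f"cluster{p}={bits[:p]}" for p in prefix_lengths]

# Hypothetical usage inside a tagger's feature extraction:
# features = base_features(word) + cluster_features(word, clusters)
```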