
    DagoBERT: Generating Derivational Morphology with a Pretrained Language Model

    Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT's derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.
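
    As an illustration of the segmentation issue the abstract points to, here is a minimal sketch (assuming the HuggingFace transformers library and bert-base-uncased; the example words are illustrative, not the paper's data) showing how BERT's default WordPiece vocabulary splits derivationally complex words:

        from transformers import AutoTokenizer

        # BERT's standard WordPiece tokenizer (assumption: bert-base-uncased).
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

        # Derivationally complex words; the resulting pieces typically do not
        # align with the linguistic stem + affix boundary, which is the kind of
        # segmentation effect the abstract refers to.
        for word in ["detrumpify", "antitrumpism", "unhappiness"]:
            print(word, "->", tokenizer.tokenize(word))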

    Predicting the Growth of Morphological Families from Social and Linguistic Factors

    We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as “trump”, “antitrumpism”, and “detrumpify”, in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. Our experiments demonstrate very good performance on MFEP.
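
    A minimal sketch of the quantity behind MFEP (not the authors' pipeline; the corpus format and the substring-based notion of "morphologically related" are illustrative simplifications): the size of a morphological family per year, whose growth the task asks a model to predict.

        import re
        from collections import defaultdict

        # Toy (year, text) pairs; the paper uses a ten-year Reddit corpus.
        corpus = [
            (2015, "trump said ..."),
            (2016, "the antitrumpism movement grew ..."),
            (2017, "time to detrumpify the feed ..."),
        ]

        def family_sizes(base, docs):
            """Distinct word types containing `base`, grouped by year (a crude
            proxy for membership in the morphological family of `base`)."""
            types_by_year = defaultdict(set)
            pattern = re.compile(rf"\b\w*{re.escape(base)}\w*\b")
            for year, text in docs:
                for word in pattern.findall(text.lower()):
                    types_by_year[year].add(word)
            return {year: len(words) for year, words in sorted(types_by_year.items())}

        print(family_sizes("trump", corpus))
        # MFEP then amounts to predicting the growth of these counts from social
        # and linguistic features of earlier time slices.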

    A Graph Auto-encoder Model of Derivational Morphology

    There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics (Bauer, 2019). We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon.
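
    A minimal sketch of the general graph auto-encoder idea (a one-layer GCN-style encoder with an inner-product decoder, in PyTorch) applied to a toy stem-affix graph; the architecture, features, and data are illustrative assumptions and not the paper's model:

        import torch
        import torch.nn as nn

        class GraphAutoEncoder(nn.Module):
            """One-layer graph encoder + inner-product decoder (toy version)."""
            def __init__(self, n_features, n_hidden):
                super().__init__()
                self.linear = nn.Linear(n_features, n_hidden)

            def forward(self, adj_norm, features):
                z = torch.relu(self.linear(adj_norm @ features))  # node embeddings
                return torch.sigmoid(z @ z.t()), z                # reconstructed edges

        # Toy graph over 4 stems and 2 affixes; an edge means the derivative is
        # attested (illustrative assumption).
        adj = torch.tensor([[0, 0, 0, 0, 1, 1],
                            [0, 0, 0, 0, 1, 0],
                            [0, 0, 0, 0, 0, 1],
                            [0, 0, 0, 0, 0, 0],
                            [1, 1, 0, 0, 0, 0],
                            [1, 0, 1, 0, 0, 0]], dtype=torch.float)
        adj_norm = adj + torch.eye(6)   # crude self-loops instead of full normalisation
        features = torch.eye(6)         # identity node features for the toy example

        model = GraphAutoEncoder(n_features=6, n_hidden=8)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
        loss_fn = nn.BCELoss()

        for _ in range(200):            # learn to reconstruct the adjacency matrix
            optimizer.zero_grad()
            recon, z = model(adj_norm, features)
            loss = loss_fn(recon, adj)
            loss.backward()
            optimizer.step()
        # Inner products of the learned embeddings z now score stem-affix
        # compatibility, a crude stand-in for morphological well-formedness.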

    The Reddit Politosphere: A Large-Scale Text and Network Resource of Online Political Discourse

    We introduce the Reddit Politosphere, a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. It is, to the best of our knowledge, the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data, allowing for methodologically diverse analyses. We describe in detail how we create the Reddit Politosphere, present descriptive statistics, and sketch potential directions for future research based on the resource.

    The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative

    Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.
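
    A minimal sketch of what such a syntactic probe can look like (assuming HuggingFace transformers and scikit-learn; the example sentences and the use of the frozen [CLS] vector are illustrative choices, not the paper's exact setup): a lightweight classifier over frozen PLM representations that decides whether a sentence instantiates the comparative correlative.

        import torch
        from transformers import AutoTokenizer, AutoModel
        from sklearn.linear_model import LogisticRegression

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased")
        model.eval()

        def embed(sentence):
            """Frozen [CLS] representation of a sentence (no finetuning)."""
            inputs = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs)
            return outputs.last_hidden_state[0, 0].numpy()

        # Toy probe data: 1 = comparative correlative, 0 = other.
        sentences = [
            ("The more you practice, the better you get.", 1),
            ("The longer the meeting ran, the worse the mood became.", 1),
            ("The meeting ran for a very long time.", 0),
            ("You get better when you practice more.", 0),
        ]
        X = [embed(s) for s, _ in sentences]
        y = [label for _, label in sentences]

        probe = LogisticRegression(max_iter=1000).fit(X, y)
        print(probe.predict([embed("The higher they climb, the harder they fall.")]))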

    Bunk8s: Enabling Easy Integration Testing of Microservices in Kubernetes

    Microservice architecture is the common choice for cloud applications these days since each individual microservice can be independently modified, replaced, and scaled. However, the complexity of microservice applications requires automated testing with a focus on the interactions between the services. While this is achievable with end-to-end tests, they are error-prone, brittle, expensive to write, time-consuming to run, and require the entire application to be deployed. Integration tests are an alternative to end-to-end tests since they have a smaller test scope and require the deployment of significantly fewer services. The de facto standard for deploying microservice applications in the cloud is containers, with Kubernetes being the most widely used container orchestration platform. To support the integration testing of microservices in Kubernetes, several tools such as Octopus, Istio, and Jenkins exist. However, each of these tools either lacks crucial functionality or leads to a substantial increase in the complexity and growth of the tool landscape when introduced into a project. To this end, we present Bunk8s, a tool for integration testing of microservice applications in Kubernetes that overcomes the limitations of these existing tools. Bunk8s is independent of the test framework used for writing integration tests, independent of the CI/CD infrastructure used, and supports test result publishing. A video demonstrating the functioning of our tool is available at https://www.youtube.com/watch?v=e8wbS25O4Bo. Comment: 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).
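
    For context, a minimal sketch of the kind of framework-independent integration test script such a tool could execute inside the cluster and whose exit code it would report (the service name, port, and endpoint are illustrative assumptions; this is not Bunk8s's API):

        import sys
        import urllib.request

        # In-cluster DNS name of the service under test (illustrative assumption).
        BASE_URL = "http://orders-service.default.svc.cluster.local:8080"

        def test_create_order():
            """Exercise the interaction between services through their HTTP API."""
            body = b'{"item": "book", "quantity": 1}'
            req = urllib.request.Request(BASE_URL + "/orders", data=body,
                                         headers={"Content-Type": "application/json"},
                                         method="POST")
            with urllib.request.urlopen(req) as resp:
                assert resp.status == 201

        if __name__ == "__main__":
            try:
                test_create_order()
            except AssertionError:
                sys.exit(1)   # a non-zero exit code signals a failed run to the test runner
            print("integration test passed")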

    Explaining pretrained language models' understanding of linguistic structures using construction grammar

    Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasizing the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step toward assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models' behavior in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs, as well as OPT, are able to recognize the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.