
    DagoBERT: Generating Derivational Morphology with a Pretrained Language Model

    Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT's derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.
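
    As an illustration of the segmentation issue the abstract points to, here is a minimal sketch (assuming the HuggingFace transformers library and bert-base-uncased; the example words are illustrative, not the paper's data) showing how BERT's default WordPiece vocabulary splits derivationally complex words:

        from transformers import AutoTokenizer

        # BERT's standard WordPiece tokenizer (assumption: bert-base-uncased).
        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

        # Derivationally complex words; the resulting pieces typically do not
        # align with the linguistic stem + affix boundary, which is the kind of
        # segmentation effect the abstract refers to.
        for word in ["detrumpify", "antitrumpism", "unhappiness"]:
            print(word, "->", tokenizer.tokenize(word))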

    Predicting the Growth of Morphological Families from Social and Linguistic Factors

    We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as “trump”, “antitrumpism”, and “detrumpify”, in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. Our experiments demonstrate very good performance on MFEP.
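
    A minimal sketch of the quantity behind MFEP (not the authors' pipeline; the corpus format and the substring-based notion of "morphologically related" are illustrative simplifications): the size of a morphological family per year, whose growth the task asks a model to predict.

        import re
        from collections import defaultdict

        # Toy (year, text) pairs; the paper uses a ten-year Reddit corpus.
        corpus = [
            (2015, "trump said ..."),
            (2016, "the antitrumpism movement grew ..."),
            (2017, "time to detrumpify the feed ..."),
        ]

        def family_sizes(base, docs):
            """Distinct word types containing `base`, grouped by year (a crude
            proxy for membership in the morphological family of `base`)."""
            types_by_year = defaultdict(set)
            pattern = re.compile(rf"\b\w*{re.escape(base)}\w*\b")
            for year, text in docs:
                for word in pattern.findall(text.lower()):
                    types_by_year[year].add(word)
            return {year: len(words) for year, words in sorted(types_by_year.items())}

        print(family_sizes("trump", corpus))
        # MFEP then amounts to predicting the growth of these counts from social
        # and linguistic features of earlier time slices.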

    A Graph Auto-encoder Model of Derivational Morphology

    There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics (Bauer, 2019). We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon.
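
    A minimal sketch of the general graph auto-encoder idea (a one-layer GCN-style encoder with an inner-product decoder, in PyTorch) applied to a toy stem-affix graph; the architecture, features, and data are illustrative assumptions and not the paper's model:

        import torch
        import torch.nn as nn

        class GraphAutoEncoder(nn.Module):
            """One-layer graph encoder + inner-product decoder (toy version)."""
            def __init__(self, n_features, n_hidden):
                super().__init__()
                self.linear = nn.Linear(n_features, n_hidden)

            def forward(self, adj_norm, features):
                z = torch.relu(self.linear(adj_norm @ features))  # node embeddings
                return torch.sigmoid(z @ z.t()), z                # reconstructed edges

        # Toy graph over 4 stems and 2 affixes; an edge means the derivative is
        # attested (illustrative assumption).
        adj = torch.tensor([[0, 0, 0, 0, 1, 1],
                            [0, 0, 0, 0, 1, 0],
                            [0, 0, 0, 0, 0, 1],
                            [0, 0, 0, 0, 0, 0],
                            [1, 1, 0, 0, 0, 0],
                            [1, 0, 1, 0, 0, 0]], dtype=torch.float)
        adj_norm = adj + torch.eye(6)   # crude self-loops instead of full normalisation
        features = torch.eye(6)         # identity node features for the toy example

        model = GraphAutoEncoder(n_features=6, n_hidden=8)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
        loss_fn = nn.BCELoss()

        for _ in range(200):            # learn to reconstruct the adjacency matrix
            optimizer.zero_grad()
            recon, z = model(adj_norm, features)
            loss = loss_fn(recon, adj)
            loss.backward()
            optimizer.step()
        # Inner products of the learned embeddings z now score stem-affix
        # compatibility, a crude stand-in for morphological well-formedness.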

    The Reddit Politosphere: A Large-Scale Text and Network Resource of Online Political Discourse

    We introduce the Reddit Politosphere, a large-scale resource of online political discourse covering more than 600 political discussion groups over a period of 12 years. It is, to the best of our knowledge, the largest and ideologically most comprehensive dataset of its type now available. One key feature of the Reddit Politosphere is that it consists of both text and network data, allowing for methodologically diverse analyses. We describe in detail how we create the Reddit Politosphere, present descriptive statistics, and sketch potential directions for future research based on the resource.

    The Better Your Syntax, the Better Your Semantics? Probing Pretrained Language Models for the English Comparative Correlative

    Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.
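
    A minimal sketch of what such a syntactic probe can look like (assuming HuggingFace transformers and scikit-learn; the example sentences and the use of the frozen [CLS] vector are illustrative choices, not the paper's exact setup): a lightweight classifier over frozen PLM representations that decides whether a sentence instantiates the comparative correlative.

        import torch
        from transformers import AutoTokenizer, AutoModel
        from sklearn.linear_model import LogisticRegression

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased")
        model.eval()

        def embed(sentence):
            """Frozen [CLS] representation of a sentence (no finetuning)."""
            inputs = tokenizer(sentence, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs)
            return outputs.last_hidden_state[0, 0].numpy()

        # Toy probe data: 1 = comparative correlative, 0 = other.
        sentences = [
            ("The more you practice, the better you get.", 1),
            ("The longer the meeting ran, the worse the mood became.", 1),
            ("The meeting ran for a very long time.", 0),
            ("You get better when you practice more.", 0),
        ]
        X = [embed(s) for s, _ in sentences]
        y = [label for _, label in sentences]

        probe = LogisticRegression(max_iter=1000).fit(X, y)
        print(probe.predict([embed("The higher they climb, the harder they fall.")]))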

    Bunk8s: Enabling Easy Integration Testing of Microservices in Kubernetes

    Microservice architecture is the common choice for cloud applications these days since each individual microservice can be independently modified, replaced, and scaled. However, the complexity of microservice applications requires automated testing with a focus on the interactions between the services. While this is achievable with end-to-end tests, they are error-prone, brittle, expensive to write, time-consuming to run, and require the entire application to be deployed. Integration tests are an alternative to end-to-end tests since they have a smaller test scope and require the deployment of significantly fewer services. The de facto standard for deploying microservice applications in the cloud is containers, with Kubernetes being the most widely used container orchestration platform. To support the integration testing of microservices in Kubernetes, several tools such as Octopus, Istio, and Jenkins exist. However, each of these tools either lacks crucial functionality or leads to a substantial increase in the complexity and growth of the tool landscape when introduced into a project. To this end, we present Bunk8s, a tool for integration testing of microservice applications in Kubernetes that overcomes the limitations of these existing tools. Bunk8s is independent of the test framework used for writing integration tests, independent of the CI/CD infrastructure used, and supports test result publishing. A video demonstrating the functioning of our tool is available at https://www.youtube.com/watch?v=e8wbS25O4Bo. Comment: 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).
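
    For context, a minimal sketch of the kind of framework-independent integration test script such a tool could execute inside the cluster and whose exit code it would report (the service name, port, and endpoint are illustrative assumptions; this is not Bunk8s's API):

        import sys
        import urllib.request

        # In-cluster DNS name of the service under test (illustrative assumption).
        BASE_URL = "http://orders-service.default.svc.cluster.local:8080"

        def test_create_order():
            """Exercise the interaction between services through their HTTP API."""
            body = b'{"item": "book", "quantity": 1}'
            req = urllib.request.Request(BASE_URL + "/orders", data=body,
                                         headers={"Content-Type": "application/json"},
                                         method="POST")
            with urllib.request.urlopen(req) as resp:
                assert resp.status == 201

        if __name__ == "__main__":
            try:
                test_create_order()
            except AssertionError:
                sys.exit(1)   # a non-zero exit code signals a failed run to the test runner
            print("integration test passed")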

    Explaining pretrained language models' understanding of linguistic structures using construction grammar

    Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasizing the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step toward assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models' behavior in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs, as well as OPT, are able to recognize the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.