Journal for Language Technology and Computational Linguistics (JLCL)
    251 research outputs found

    Do LLMs fail in bridging generation?

    In this work, we investigate whether large language models (LLMs) ‘understand’ bridging relations and can use this knowledge effectively. We present the results obtained from two tasks: generation of texts containing bridging and filling in missing bridging spans. We show that in most cases LLMs fail to generate bridging reliably.
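
    As an illustration of how the second task could be posed, here is a minimal cloze-style sketch; `query_llm` and the example passage are placeholders for illustration, not the paper's actual setup.
```python
# Illustrative sketch: fill a missing bridging span via a cloze-style prompt.
# `query_llm` is a placeholder for whichever chat/completion backend is used.
from typing import Callable

def fill_bridging_span(passage_with_gap: str, query_llm: Callable[[str], str]) -> str:
    prompt = (
        "Fill the [GAP] with a noun phrase that refers to an entity only "
        "indirectly introduced by the preceding context (a bridging anaphor). "
        f"Return just that phrase.\n\n{passage_with_gap}"
    )
    return query_llm(prompt).strip()

# Example (invented): "We entered the house. [GAP] was creaking."  ->  "The floor"
```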

    Exploring the Limits of LLMs for German Text Classification: Prompting and Fine-tuning Strategies Across Small and Medium-sized Datasets

    Large Language Models (LLMs) are highly capable, state-of-the-art technologies that are widely used as text classifiers for various NLP tasks, including sentiment analysis, topic classification, and legal document analysis. In this paper, we present a systematic analysis of the performance of LLMs as text classifiers using five German datasets from social media across 13 different tasks. We investigate zero-shot (ZSC) and few-shot (FSC) classification approaches with multiple LLMs and provide a comparative analysis with fine-tuned models based on Llama-3.2, EuroLLM, Teuken and BübleLM. We concentrate on investigating the limits of LLMs and on accurately describing our findings and overall challenges.
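
    A minimal sketch of the zero- vs. few-shot prompting setup described above; `query_llm`, the label set and the example format are illustrative assumptions, not the paper's datasets or prompts.
```python
# Sketch of zero-/few-shot classification via prompting. `query_llm` is a
# placeholder for the model backend; the labels are an invented sentiment task.
from typing import Callable, Optional

LABELS = ["positive", "negative", "neutral"]  # illustrative label set

def build_prompt(text: str, shots: Optional[list[tuple[str, str]]] = None) -> str:
    """Zero-shot prompt, or few-shot if labelled examples are supplied."""
    parts = [f"Classify the following German text as one of: {', '.join(LABELS)}."]
    for ex_text, ex_label in (shots or []):
        parts.append(f"Text: {ex_text}\nLabel: {ex_label}")
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)

def classify(text: str, query_llm: Callable[[str], str],
             shots: Optional[list[tuple[str, str]]] = None) -> str:
    answer = query_llm(build_prompt(text, shots)).strip().lower()
    return answer if answer in LABELS else "neutral"  # fallback for off-label replies
```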

    GPT makes a poor AMR parser

    This paper evaluates GPT models as out-of-the-box Abstract Meaning Representation (AMR) parsers using prompt-based strategies, including 0-shot, few-shot, Chain-of-Thought (CoT), and a two-step approach in which core arguments and non-core roles are handled separately. Our results show that GPT-3.5 and GPT-4o fall well short of state-of-the-art parsers, with a maximum Smatch score of 60 using GPT-4o in a 5-shot setting. While CoT prompting provides some interpretability, it does not improve performance. We further conduct fine-grained evaluations, revealing GPT’s limited ability to handle AMR-specific linguistic structures and complex semantic roles. Our findings suggest that, despite recent advances, GPT models are not yet suitable as standalone AMR parsers.
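
    A hedged sketch of the two-step idea mentioned above (core arguments first, non-core roles second); `query_llm` and the prompt wording are assumptions for illustration, not the paper's prompts.
```python
# Two-step AMR prompting sketch: ask for a core-argument skeleton, then ask
# the model to extend it with non-core roles. `query_llm` is a placeholder.
from typing import Callable

def parse_amr_two_step(sentence: str, query_llm: Callable[[str], str]) -> str:
    skeleton = query_llm(
        "Return an AMR graph in PENMAN notation containing only the predicates "
        f"and their core arguments (:ARG0, :ARG1, ...).\nSentence: {sentence}"
    )
    full = query_llm(
        "Extend this partial AMR with non-core roles such as :time, :location "
        "and :manner, keeping the existing structure unchanged.\n"
        f"Sentence: {sentence}\nPartial AMR:\n{skeleton}"
    )
    return full  # candidate parse, to be scored against gold graphs with Smatch
```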

    Editorial

    A Study of Errors in the Output of Large Language Models for Domain-Specific Few-Shot Named Entity Recognition

    This paper proposes an error classification framework for a comprehensive analysis of the output that large language models (LLMs) generate in a few-shot named entity recognition (NER) task in a specialised domain. The framework should be seen as an exploratory analysis complementary to established performance metrics for NER classifiers, such as F1 score, as it accounts for outcomes possible in a few-shot, LLM-based NER task. By categorising and assessing incorrect named entity predictions quantitatively, the paper shows how the proposed error classification could support a deeper cross-model and cross-prompt performance comparison, alongside a roadmap for a guided qualitative error analysis.
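
    For illustration, a sketch of how incorrect NER predictions could be bucketed into error categories; the category names here are generic placeholders, not the paper's taxonomy.
```python
# Illustrative categorisation of NER prediction errors by comparing predicted
# and gold spans. Categories (wrong type, wrong boundary, spurious, missing)
# are generic, not the framework's exact labels.
Span = tuple[int, int, str]  # (start, end, label) over a shared tokenisation

def categorise(pred: set[Span], gold: set[Span]) -> dict[str, int]:
    counts = {"correct": 0, "wrong_type": 0, "wrong_boundary": 0,
              "spurious": 0, "missing": 0}
    gold_by_span = {(s, e): lab for s, e, lab in gold}
    for s, e, lab in pred:
        if (s, e) in gold_by_span:
            counts["correct" if gold_by_span[(s, e)] == lab else "wrong_type"] += 1
        elif any(s < ge and gs < e for gs, ge, _ in gold):  # partial overlap
            counts["wrong_boundary"] += 1
        else:
            counts["spurious"] += 1
    # Gold spans with no exact-boundary predicted match count as missing.
    pred_spans = {(s, e) for s, e, _ in pred}
    counts["missing"] = sum((s, e) not in pred_spans for s, e, _ in gold)
    return counts
```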

    The Struggles of Large Language Models with Zero- and Few-Shot (Extended) Metaphor Detection

    Extended metaphor is the use of multiple metaphoric words that express the same domain mapping. Although it would provide valuable insight for computational metaphor processing, detecting extended metaphor has been rather neglected. We fill this gap by providing a series of zero- and few-shot experiments on the detection of all linguistic metaphors and specifically on extended metaphors with LLaMa and GPT models. We find that no model was able to achieve satisfactory performance on either task, and that LLaMa in particular showed problematic overgeneralization tendencies. Moreover, our error analysis showed that LLaMa is not sufficiently able to construct the domain mappings relevant for metaphor understanding.

    Political Bias in LLMs: Unaligned Moral Values in Agent-centric Simulations

    Contemporary research in social sciences increasingly utilizes state-of-the-art generative language models to annotate or generate content. While these models achieve benchmark-leading performance on common language tasks, their application to novel out-of-domain tasks remains insufficiently explored. To address this gap, we investigate how personalized language models align with human responses on the Moral Foundation Theory Questionnaire. We adapt open-source generative language models to different political personas and repeatedly survey these models to generate synthetic data sets where model-persona combinations define our sub-populations. Our analysis reveals that models produce inconsistent results across multiple repetitions, yielding high response variance. Furthermore, the alignment between synthetic data and corresponding human data from psychological studies shows a weak correlation, with conservative persona-prompted models particularly failing to align with actual conservative populations. These results suggest that language models struggle to coherently represent ideologies through in-context prompting due to their alignment process. Thus, using language models to simulate social interactions requires measurable improvements in in-context optimization or parameter manipulation to align properly with psychological and sociological stereotypes.
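
    A rough sketch of the repeated persona-prompted surveying described above; the two-argument `query_llm` interface, the persona and the questionnaire-style item are all illustrative assumptions rather than the study's materials.
```python
# Repeatedly survey a persona-prompted model on one questionnaire-style item
# and report the mean and variance of the parsed scores. `query_llm` is an
# assumed (persona, message) -> reply interface; the item is invented.
import re
import statistics
from typing import Callable

PERSONA = "You are answering as a politically conservative adult in the United States."
ITEM = ("On a scale from 0 (not at all relevant) to 5 (extremely relevant): "
        "whether or not someone showed a lack of respect for authority. "
        "Answer with a single digit.")

def survey(query_llm: Callable[[str, str], str], repeats: int = 20) -> tuple[float, float]:
    """Ask the same item repeatedly; return mean and variance of parsed scores."""
    scores = []
    for _ in range(repeats):
        reply = query_llm(PERSONA, ITEM)
        match = re.search(r"[0-5]", reply)
        if match:
            scores.append(int(match.group()))
    return statistics.mean(scores), statistics.variance(scores)
```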

    Large language models for terminology work: A question of the right prompt?

    Text-generative large language models (LLMs) offer promising possibilities for terminology work, including term extraction, definition creation and assessment of concept relations. This study examines the performance of ChatGPT, Perplexity and Microsoft Copilot for conducting terminology work in the field of the Austrian and British higher education systems using strategic prompting frameworks. Despite efforts to refine prompts by specifying language variety and system context, the LLM outputs failed to reliably differentiate between the Austrian and German systems and fabricated terms. Factors such as the distribution of German-language training data, potential pivot translation via English and the lack of transparency in LLM training further complicated evaluation. Additionally, output variability across identical prompts highlights the unpredictability of LLM-generated terminology. The study underscores the importance of human expertise in evaluating LLM outputs, as inconsistencies may undermine the reliability of terminology derived from such models. Without domain-specific knowledge (encompassing both subject-matter expertise and familiarity with terminology principles) as well as LLM literacy, users are unable to critically assess the quality of LLM outputs in terminological contexts. Rather than indiscriminately applying LLMs to all aspects of terminology work, it is crucial to assess their suitability for specific tasks.
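
    A minimal sketch of the kind of strategic prompt discussed above, with the language variety and system context made explicit; the wording is an assumption for illustration, not a prompt taken from the study.
```python
# Illustrative term-extraction prompt that pins down the language variety
# (Austrian German) and the system context (Austrian higher education).
def term_extraction_prompt(text: str) -> str:
    return (
        "You are a terminologist working on the Austrian higher education "
        "system. Extract domain-specific terms from the text below, give a "
        "concise definition for each, and use Austrian German rather than the "
        "terminology of the German system. Do not invent terms that are "
        "absent from the text.\n\n" + text
    )
```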

    Pictorial constituents & the metalinguistic performance of LLMs

    In this paper I show that, although ChatGPT (GPT-4o) can provide accurate linguistic acceptability judgments for many types of sentences (Cai, Duan, Haslett, Wang, & Pickering, 2024; Collins, 2024a, 2024b; Ortega-Martín et al., 2023; Wang et al., 2023), it does not give accurate grammaticality judgments for sentences that contain pro-text emojis, which are emojis that appear in a written utterance as morphosyntactic constituents (Cohn, Engelen, & Schilperoord, 2019; Pierini, 2021; Storment, 2024; Tieu, Qiu, Puvipalan, & Pasternak, 2025, a.o.). I demonstrate this with three distinct experiments performed on GPT-4o using both English and Spanish data. This work builds on prior research showing that the combinatorics of pro-text emojis are sensitive to the morphosyntactic constraints of the language in which the emojis appear, and it connects the poor performance of GPT-4o in this respect to two factors: (i) the fact that, while LLMs are able to make some generalizations of syntactic structural dependencies, their mechanisms for making such generalizations are not derived in the same way that human syntactic structures are (Contreras Kallens, Kristensen-McLachlan, & Christiansen, 2023; Hale & Stanojević, 2024; Kennedy, 2025; Linzen & Baroni, 2021; Manova, 2024a, 2024b; Zhong, Ding, Liu, Du, & Tao, 2023, a.o.), and (ii) the fact that LLMs lack the means of directly processing iconic and pictorial content in the same way that human cognition allows for. I also consider the possibility that the relevant data are poorly attested in the model's training parameters. This paper establishes a precedent for research at the intersection of generative AI and utterances that contain pictorial elements as morphosyntactic constituents.
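
    A small sketch of how such an acceptability judgment could be elicited for a sentence containing a pro-text emoji; `query_llm` and the example sentence are placeholders, not the paper's stimuli.
```python
# Elicit a binary acceptability judgment from the model and parse the reply.
# `query_llm` is a placeholder for the chat/completion backend.
from typing import Callable

def judge_acceptability(sentence: str, query_llm: Callable[[str], str]) -> bool:
    prompt = ("Is the following sentence grammatically acceptable in English? "
              f"Answer only 'yes' or 'no'.\nSentence: {sentence}")
    return query_llm(prompt).strip().lower().startswith("yes")

# Example call: the emoji stands in for a noun-phrase constituent.
# judge_acceptability("I really want a 🍕 for dinner.", query_llm)
```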

    Measuring the Contributions of Vision and Text Modalities

    This dissertation investigates multimodal transformers that process image and text modalities together to generate outputs for various tasks (such as answering questions about images). Specifically, methods are developed to assess the effectiveness of vision and language models in combining, understanding, utilizing, and explaining information from these two modalities. The dissertation contributes to the advancement of the field in three ways: (i) by measuring specific and task-independent capabilities of vision and language models, (ii) by interpreting these models to quantify the extent to which they use and integrate information from both modalities, and (iii) by evaluating their ability to provide self-consistent explanations of their outputs to users.
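
    One hedged sketch of measuring the vision modality's contribution by ablation (real image vs. a blank stand-in); `model` and `blank_like` are assumed interfaces, not the dissertation's actual method or a specific library API.
```python
# Quantify reliance on the image modality as the accuracy drop when the image
# is replaced by a neutral (blank) input. All interfaces here are assumptions.
from typing import Callable, Sequence

def vision_contribution(model: Callable[[object, str], str],
                        blank_like: Callable[[object], object],
                        examples: Sequence[tuple[object, str, str]]) -> float:
    """Return accuracy(real image) - accuracy(blank image) over the examples."""
    def accuracy(use_blank: bool) -> float:
        hits = 0
        for image, question, answer in examples:
            img = blank_like(image) if use_blank else image
            hits += model(img, question).strip().lower() == answer.lower()
        return hits / len(examples)
    return accuracy(use_blank=False) - accuracy(use_blank=True)
```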

    230 full texts · 251 metadata records (updated in the last 30 days)
    Journal for Language Technology and Computational Linguistics (JLCL) is based in Germany.