11 research outputs found
Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus
In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts - SMS, chat - the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller - of commissioned size - more privacy compliant SoNaR, IPR-cleared for commercial purposes as well
An exploratory study into automated précis grading
Automated writing evaluation is a popular research field, but the main focus has been on evaluating argumentative essays. In this paper, we consider a different genre, namely précis texts. A précis is a written text that provides a coherent summary of main points of a spoken or written text. We present a corpus of English précis texts which all received a grade assigned by a highly-experienced English language teacher and were subsequently annotated following an exhaustive error typology. With this corpus we trained a machine learning model which relies on a number of linguistic, automatic summarization and AWE features. Our results reveal that this model is able to predict the grade of précis texts with only a moderate error margin
Benchmarking zero-shot text classification for Dutch
The advent and popularisation of Large Language Models (LLMs) have given rise to prompt-based Natural Language Processing (NLP) techniques which eliminate the need for large manually annotated corpora and computationally expensive supervised training or fine-tuning processes. Zero-shot learning in particular presents itself as an attractive alternative to the classical train-development-test paradigm for many downstream tasks as it provides a quick and inexpensive way of directly leveraging the implicitly encoded knowledge in LLMs. Despite the large interest in zero-shot applications within the domain of NLP as a whole, there is often no consensus on the methodology, analysis and evaluation of zero-shot pipelines. As a tentative step towards finding such a consensus, this work provides a detailed overview of available methods, resources, and caveats for zero-shot prompting within the Dutch language domain. At the same time, we present centralised zero-shot benchmark results on a large variety of Dutch NLP tasks using a series of standardised datasets. These tasks vary in subjectivity and domain, ranging from more social information extraction tasks (sentiment, emotion and irony detection for social media) to factual tasks (news topic classification and event coreference resolution). To ensure that the benchmark results are representative, we investigated a selection of zero-shot methodologies for a variety of state-of-the-art Dutch Natural Language Inference (NLI) models, Masked Language Models (MLM), and autoregressive language models. The output on each test set was compared to the best performance achieved using supervised methods. Our findings indicate that task-specific fine-tuning delivers superior performance in all but one task (emotion detection). In the zero-shot settings, large generative models used through prompting seem to outperform NLI models, which in turn perform better than the MLM approach.
Finally, we note several caveats and challenges tied to using zero-shot learning in application settings. These include, but are not limited to, properly streamlining evaluation of zero-shot output, parameter efficiency compared to standard finetuned models and prompt optimization
