49 research outputs found

    Collecting a corpus of Dutch SMS

    In this paper we present the first freely available corpus of Dutch text messages containing data originating from the Netherlands and Flanders. This corpus has been collected in the framework of the SoNaR project and constitutes a viable part of this 500-million-word corpus. About 53,000 text messages were collected on a large scale, based on voluntary donations. These messages will be distributed as such. In this paper we focus on the data collection processes involved and, after studying the effect of media coverage, we show that free publicity in newspapers and on social media networks in particular results in more contributions. All SMS are provided with metadata information. Looking at the composition of the corpus, it becomes visible that a small number of people have contributed a large amount of data; in total, 272 people contributed to the corpus over three months. The number of women contributing to the corpus is larger than the number of men, but male contributors submitted larger amounts of data. This corpus will be of paramount importance for sociolinguistic research and normalisation studies.

    Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource

    Word embeddings have recently seen a sharp increase in interest as a result of strong performance gains on a variety of tasks. However, most of this research also underlined the importance of benchmark datasets, and the difficulty of constructing these for a variety of language-specific tasks. Still, many of the datasets used in these tasks could prove to be fruitful linguistic resources, allowing for unique observations into language use and variability. In this paper we demonstrate the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation and dialect identification. For the latter, we compare unsupervised methods with a traditional, hand-crafted dictionary. With this research, we provide the embeddings themselves and the relation evaluation task benchmark for use in further research, and demonstrate how the benchmarked embeddings prove a useful unsupervised linguistic resource, effectively used in a downstream task. Comment: in LREC 201

    Evaluation of automatic hypernym extraction from technical corpora in English and Dutch

    In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three different hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the different approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domains. The experimental results show that the morpho-syntactic approach in particular obtains good results for automatic hypernym extraction from technical and domain-specific texts.

    A Data-Oriented Model of Literary Language

    We consider the task of predicting how literary a text is, with a gold standard from human ratings. Aside from a standard bigram baseline, we apply rich syntactic tree fragments, mined from the training set, and a series of hand-picked features. Our model is the first to distinguish degrees of highly and less literary novels using a variety of lexical and syntactic features, and explains 76.0% of the variation in literary ratings. Comment: To be published in EACL 2017, 11 page

    Is redundancy useful in language? Agent-recipient disambiguation in English and Dutch.

    peer reviewed
    robustness in language processing and learning (MacWhinney et al. 2014), both from a typological and diachronic perspective. Specifically, we assess the potential benefits or costs of redundancy in morphosyntactic marking of participant roles, comparing and testing two opposing hypotheses. On the one hand, following the most crucial tenet in usage-based linguistics that language use affects – or even determines – grammar (Bybee 2010), we assume that language is organised in a way that facilitates efficient usage (e.g. Gibson et al. 2019). On this account, redundant marking should be dispreferred. Well-known typological ‘trade-off’ distributions and diachronic trajectories between word order and morphological case marking seem to support this point (Fedzechkina et al. 2017). Furthermore, prepositional marking is often only applied in contexts where it comes with some added processing benefit (cf. Pijpops et al. 2018 on the impact of complexity on Dutch transitive object marking, or Tal et al. 2020, Levshina 2021 on ambiguity/atypicality in differential object marking). On the other hand, however, we pursue Van de Velde's (2014) argument that a certain amount of redundancy – or rather, ‘degenerate’ marking (involving many-to-many relationships) – is in fact beneficial from a usage perspective: redundancy constitutes an indispensable component of any degenerative Complex Adaptive System, and thus also of language (Steels 2000; Beckner et al. 2009). Such redundancy/degeneracy comes with two important advantages, viz. robustness and evolvability. Most importantly for the present paper, the former entails that redundant marking offers protection against information loss in the noisy language channel, even though it may be less efficient. Redundancy is furthermore assumed to increase learnability, particularly in more complex situations (e.g. Tal et al. 2021).
    Our case study to assess the plausibility of what we call the ‘strict-efficiency’ versus the ‘robustness’ account is participant role marking in ditransitive clauses in Present Day Dutch and English, for a comparative perspective, as well as historical English for a diachronic view. More precisely, we investigate the interaction between strategies used to distinguish agents and recipients in transfer events, e.g. with verbs of giving as in (1) and (2).
    (1) They[AGENT] give a book to the student[RECIPIENT].
    (2) Ze[AGENT] geven een boek aan de student[RECIPIENT].
    Since both agents and recipients in ditransitive clauses are prototypically animate (sentient) and volitional (e.g. Newman 1998; Naess 2007; Haspelmath 2015), disambiguating these roles based on semantic-pragmatic information is usually difficult if not impossible. Morpho-syntactic cues are hence indispensable in determining ‘who gave what to whom’. Among the strategies language users have at their disposal are (i) constituent order (e.g. SVO in Present Day English), (ii) case marking/formal differentiation (e.g. subject vs object pronoun forms in PDE), (iii) subject-verb agreement, and (iv) prepositional marking. Employing multiple strategies at the same time constitutes redundant marking; for example, in (1) all four disambiguation strategies are given. Meanwhile in (3), none are used, resulting in an ambiguous sentence.
    (3) Mijn baas kan je niet zomaar een uitbrander geven. ‘You can’t just give my boss a telling-off’ or ‘My boss can’t just give you a telling-off.’
    In our study, we make use of the SoNaR Corpus of Written Dutch (Oostdijk et al. 2013), a pre-compiled dataset of ditransitives from the ICE-GB (Röthlisberger 2018) and the Penn Parsed Corpus of Middle English (PPCME2; Kroch et al. 2000). Instances of ditransitive clauses with give are extracted from the corpora, and coded for the strategies instantiated by them.
    Following the ‘strict-efficiency’ account, we then expect language users to prefer employing a single strategy for each instance. By contrast, based on the degeneracy/robustness account, we anticipate sentences that simultaneously instantiate multiple strategies to be most common, and cases where only one strategy is at work to be rare. Our results indicate that even though the precise strategies and their disambiguation power differ between Dutch and English, both languages show substantial redundancy to be the default. Still, redundancy seems to operate within limits, with four-fold strategy use being rare, and two simultaneous strategies being most common. Our diachronic results are in line with this conclusion: we find that English appears to have moved towards more redundant marking over time, but that after a short period of ‘exuberant’ redundancy, double redundancy is settled on as the norm. In a final step, we assess the question of whether redundant marking is particularly frequent in complex environments, here measured as sentence length in words (excluding the subject and object arguments of the respective ditransitive patterns). Our findings are again mixed: for Dutch and historical English, complexity emerges as an influential predictor; in Present Day English, however, no significant effect can be observed. We interpret this outcome of our study in light of the differing degrees of variability of strategies in the languages/stages.