40 research outputs found

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Querying large treebanks : benchmarking GrETEL indexing

    The amount of data available for research grows rapidly, yet the technology to efficiently interpret and excavate these data lags behind. For instance, when using large treebanks for linguistic research, the speed of a query leaves much to be desired. GrETEL Indexing, or GrInding, tackles this issue. The idea behind GrInding is to make the search space as small as possible before actually starting the treebank search, by pre-processing the treebank at hand. We recursively divide the treebank into smaller parts, called subtree-banks, which are then converted into database files. All subtree-banks are organized according to their linguistic dependency pattern, and labeled as such. Additionally, general patterns are linked to more specific ones. By doing so, we create millions of databases, and given a linguistic structure we know in which databases that structure can occur, leading to a significant efficiency boost. We present the results of a benchmark experiment, testing the effect of the GrInding procedure on the SoNaR-500 treebank.
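
    The indexing idea in this abstract can be illustrated with a toy sketch. It is not the GrInding implementation: the in-memory dictionary stands in for the database files, the treebank contents are made up, and the function names (`pattern_of`, `build_index`, `candidate_sentences`) are hypothetical. Each node is reduced to the sorted dependency relations of its children, sentences are bucketed by that pattern, and a query only inspects buckets whose pattern is at least as specific as the query.

```python
from collections import Counter, defaultdict


def pattern_of(relations):
    """Canonical key for a dependency pattern: the child relations, sorted."""
    return tuple(sorted(relations))


def is_generalization(general, specific):
    """True if every relation in `general` also occurs in `specific` (with multiplicity)."""
    return not (Counter(general) - Counter(specific))


def build_index(treebank):
    """Map each dependency pattern to the ids of the sentences that contain it."""
    index = defaultdict(set)
    for sentence_id, nodes in treebank.items():
        for relations in nodes:
            index[pattern_of(relations)].add(sentence_id)
    return index


def candidate_sentences(index, query_relations):
    """Collect sentences from every bucket whose pattern is the query or more specific."""
    query = pattern_of(query_relations)
    hits = set()
    for pattern, sentence_ids in index.items():
        if is_generalization(query, pattern):
            hits |= sentence_ids
    return hits


# Made-up toy treebank: sentence id -> child-relation lists of its phrasal nodes.
toy_treebank = {
    "s1": [["su", "hd", "obj1"], ["det", "hd"]],
    "s2": [["su", "hd"], ["det", "hd", "mod"]],
    "s3": [["hd", "pc"]],
}

index = build_index(toy_treebank)
print(sorted(candidate_sentences(index, ["su", "hd"])))
```

    In this sketch the lookup iterates over pattern buckets rather than over sentences, so only the sentences in matching buckets would need to be passed on to the full treebank query.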

    BERTje: A Dutch BERT Model

    The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks (part-of-speech tagging, named-entity recognition, semantic role labeling, and sentiment analysis). Our pre-trained Dutch BERT model is made available at https://github.com/wietsedv/bertje

    CLARIN’s Support for Research into the Acquisition of Lexical Properties

    Odijk (2011) sketched a research question on the acquisition of lexical properties of words, and illustrated it with some concrete examples, in particular with respect to the lexical properties of the Dutch synonyms heel, erg, and zeer (all meaning ‘very’). This work also indicated what the CLARIN infrastructure should offer to make it possible to address this research question. In this contribution I sketch to what extent the CLARIN infrastructure has met these requirements and desiderata. The resulting picture is mixed: (1) some have been implemented; (2) some have not been implemented and are still highly desirable; (3) some have not been implemented but turned out to be less urgent; (4) new requirements and desiderata have arisen in the last 10 years, only some of which have been implemented. In this way, I evaluate the development of the CLARIN infrastructure (mainly its Netherlands part) over the past 10 years, and sketch the requirements and desiderata for the CLARIN infrastructure to address this research question for the next 10 years.

    Finding Dutch multiword expressions

    We present MWE-Finder, which enables a user to search for occurrences of multiword expressions (MWEs) in large Dutch text corpora. Components of many MWEs in Dutch can occur in multiple forms, need not be adjacent, and can occur in multiple orders (such MWEs are called flexible). Searching for occurrences of such flexible MWEs is difficult and cannot be done reliably with most search applications. What is needed is a search engine that takes into account the grammatical configuration of the MWE. MWE-Finder is therefore embedded in GrETEL, a treebank search application for Dutch. A user can enter an example of a MWE in a specific canonical form, after which the system searches for sentences in which the MWE occurs, using queries generated automatically from the canonical form. The MWE can also be selected from a list of more than 11k canonical forms for Dutch MWEs that MWE-Finder offers. We will show that MWE-Finder also offers facilities to find examples with unexpected modifiers or determiners on components of the MW