Transferring PoS-tagging and lemmatization tools from spoken to written Dutch corpus development
Querying large treebanks: Benchmarking GrETEL indexing
The amount of data available for research grows rapidly, yet the technology to interpret and mine these data efficiently lags behind. For instance, when large treebanks are used for linguistic research, query speed leaves much to be desired. GrETEL Indexing, or GrInding, tackles this issue. The idea behind GrInding is to make the search space as small as possible before the treebank search actually starts, by pre-processing the treebank at hand. We recursively divide the treebank into smaller parts, called subtree-banks, which are then converted into database files. All subtree-banks are organized according to their linguistic dependency pattern and labeled as such. Additionally, general patterns are linked to more specific ones. In this way we create millions of databases, and for any given linguistic structure we know in which databases that structure can occur, which yields a significant efficiency gain. We present the results of a benchmark experiment testing the effect of the GrInding procedure on the SoNaR-500 treebank.
Status: published
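The indexing idea in this abstract can be illustrated with a minimal sketch. It is not the GrInding implementation: an in-memory dictionary stands in for the database files, patterns are limited to one parent node and its direct children, and the tree encoding, the labels, and the names `node_patterns`, `build_index` and `candidates` are all hypothetical. A general pattern (fewer children) is made to cover every sentence that matches a more specific one, so a query only has to inspect one small bucket.

```python
from collections import defaultdict
from itertools import combinations

# Toy tree encoding (an assumption): a node is (label, [child nodes]).


def node_patterns(label, children):
    """Yield every pattern (label, sorted subset of child labels) for one node."""
    child_labels = sorted(child[0] for child in children)
    for size in range(1, len(child_labels) + 1):
        # set() removes duplicates that arise when child labels repeat
        for combo in set(combinations(child_labels, size)):
            yield (label, combo)


def build_index(treebank):
    """Map each pattern to the ids of the sentences in which it occurs."""
    index = defaultdict(set)

    def visit(node, sent_id):
        label, children = node
        if children:
            for pattern in node_patterns(label, children):
                index[pattern].add(sent_id)
        for child in children:
            visit(child, sent_id)

    for sent_id, tree in treebank.items():
        visit(tree, sent_id)
    return index


def candidates(index, label, child_labels):
    """Return the sentence ids that can contain the queried pattern."""
    return set(index.get((label, tuple(sorted(child_labels))), set()))


# Hypothetical three-sentence treebank
treebank = {
    "s1": ("smain", [("su", []), ("hd", []), ("obj1", [])]),
    "s2": ("smain", [("su", []), ("hd", [])]),
    "s3": ("np", [("det", []), ("hd", [])]),
}
index = build_index(treebank)
print(candidates(index, "smain", ["su", "hd"]))
print(candidates(index, "smain", ["su", "hd", "obj1"]))
```

In this toy setup the general query (subject and head) narrows the search to two sentences and the more specific query (subject, head and direct object) to one, without traversing any tree at query time.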
From D-Coi to SoNaR: A reference corpus for Dutch
Contains fulltext: 67981.pdf (publisher's version) (Open Access)
Discovery of association rules between syntactic variables. Data mining the Syntactic Atlas of the Dutch dialects.
This research applies an association rule mining technique to purely syntactic dialect data. The paper addresses the research question of how relevant associations between syntactic variables can be discovered. The method calculates the proportional overlap between the geographical distributions of syntactic microvariables and incorporates rule quality factors such as accuracy, coverage and completeness to measure the interestingness of the variable associations. The exploratory review of the results discusses several highly ranked association rules and also examines an implicational chain of syntactic variables.
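The quality factors named in this abstract can be illustrated with a small sketch. The abstract does not give the formulas, so the definitions below are assumptions based on common usage in rule mining: for a rule A → B over sets of locations, accuracy is |A ∩ B| / |A|, coverage is |A| / N, completeness is |A ∩ B| / |B|, and proportional overlap is taken here as |A ∩ B| / |A ∪ B|. The function name `rule_quality` and the location names are hypothetical.

```python
def rule_quality(antecedent, consequent, n_locations):
    """Score the rule antecedent -> consequent over sets of locations.

    All four formulas are assumed definitions, not taken from the paper.
    """
    if not antecedent or not consequent:
        raise ValueError("both variables must occur in at least one location")
    both = antecedent & consequent
    return {
        "overlap": len(both) / len(antecedent | consequent),
        "accuracy": len(both) / len(antecedent),
        "coverage": len(antecedent) / n_locations,
        "completeness": len(both) / len(consequent),
    }


# Hypothetical geographical distributions of two syntactic variables
var_a = {"loc1", "loc2", "loc3", "loc4"}
var_b = {"loc3", "loc4", "loc5", "loc6"}
scores = rule_quality(var_a, var_b, n_locations=10)
print(scores)
```

Rules could then be ranked by a combination of these scores; how the factors are weighted is left open here because the abstract does not specify it.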