Building linguistic corpora from Wikipedia articles and discussions

Lüngen, Harald; Margaretha, Eliza

research

Authors: Harald Lüngen, Eliza Margaretha
Publication date: 16 December 2014

Abstract

Wikipedia is a valuable resource, useful as a linguistic corpus or as a dataset for many kinds of research. We built corpora from Wikipedia articles and talk pages in the I5 format, a TEI customisation used in the German Reference Corpus (Deutsches Referenzkorpus, DeReKo). Our approach is a two-stage conversion that combines parsing with the Sweble parser and transformation with XSLT stylesheets. The conversion generates rich and valid corpora regardless of language. We also introduce a method for segmenting user contributions on talk pages into postings.

Available Versions

Publikationsserver des Instituts für Deutsche Sprache

oai:ids-pub.bsz-bw.de:3330

Last time updated on 18/06/2017