Building a DDC-annotated Corpus from OAI

By Mathias Lösch, Ulli Waltinger, Wolfram Horstmann and Er Mehler


Abstract. Document servers complying with the standards of the Open Archives Initiative (OAI) are a rich, yet seldom exploited, source of textual primary data for research fields in text mining, natural language processing, or computational linguistics. We present a bilingual (English and German) text corpus consisting of bibliographic OAI records and the associated full texts. A particular added value is that we annotated each record with at least one Dewey Decimal Classification (DDC) number, inducing a subject-based categorization of the corpus. It can therefore be used as training data for machine learning-based text categorization tasks in digital libraries, but also as a primary data source for linguistic research on academic language use related to specific disciplines. We describe the construction of the corpus using data from the Bielefeld Academic Search Engine (BASE), as well as its characteristics.

Topics: Digital libraries, text mining, corpora, Dewey Decimal Classification
Year: 2013
OAI identifier: oai:CiteSeerX.psu:
Provided by: CiteSeerX