
Creating open language resources for Hungarian

Abstract

The paper provides an overview of the open-source Hungarian language resources that the SzóSzablya ‘WordSword’ project is creating. An extensive crawl of the .hu domain yielded a raw dataset of over 18 million web pages. We discuss the methods used to detect and remove duplicates and low-quality, foreign, and mixed-language documents, and describe the resulting gigaword corpus and the various frequency counts and dictionaries based on it.
