A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus

Kulick, Seth; Ryant, Neville; Santorini, Beatrice; Wallenberg, Joel

Authors: Seth Kulick, Neville Ryant, Beatrice Santorini, Joel Wallenberg
Publication date: 3 April 2022
Abstract

We describe the construction and evaluation of a part-of-speech tagger for Yiddish (the first one, to the best of our knowledge). This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word embeddings on the YBC corpus, and these embeddings are used with a tagger model trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We evaluate the tagger performance on a 10-fold cross-validation split, with and without the embeddings, showing that the embeddings improve tagger performance. However, a great deal of work remains to be done, and we conclude by discussing some next steps, including the need for additional annotated training and test data.
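As an illustration of the evaluation setup mentioned in the abstract, the following is a minimal sketch of how a 10-fold cross-validation split over a set of annotated sentences might be generated. The abstract does not describe how the authors built their folds, so the function name `k_fold_splits`, the use of contiguous folds, and the example size of 25 items are assumptions made purely for illustration.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Hypothetical helper: the paper's actual fold construction is not
    given in the abstract, so contiguous folds are an assumption.
    """
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 2 <= k <= n_items")
    # Distribute any remainder so fold sizes differ by at most one.
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# Example: 25 stand-in sentence indices split into 10 folds.
splits = list(k_fold_splits(25, k=10))
```

Each item appears in exactly one test fold, so a tagger trained on the `train` indices and scored on the `test` indices of each pair is evaluated once on every item; the per-fold scores can then be averaged, once with and once without the embeddings.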

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2204.01175

Last time updated on 26/04/2022