Easily accessible language technologies for Slovene, Croatian and Serbian

Abstract

In this paper we present the pipeline of recently developed language technology tools for Slovene, Croatian and Serbian. They currently cover text segmentation, text normalisation, part-of-speech tagging, lemmatisation and inflectional lexicon lookup. Most rely on machine learning approaches, such as statistical machine translation and conditional random fields, capable of producing high-quality models for the phenomenon covered. Special emphasis is put on easy accessibility of these tools by offering them and the trained models for all three languages as (1) open source via public git repositories and (2) online in the form of web applications and web services

    Similar works