Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets
Sparsity is one of the major problems in natural language processing. The
problem becomes even more severe in agglutinating languages that are highly
prone to inflection. We deal with sparsity in Turkish by adopting
morphological features for part-of-speech tagging. We learn inflectional and
derivational morpheme tags in Turkish by using conditional random fields (CRF)
and we employ the morpheme tags in part-of-speech (PoS) tagging by using hidden
Markov models (HMMs) to mitigate sparsity. Results show that using morpheme
tags in PoS tagging helps alleviate the sparsity in emission probabilities. Our
model outperforms other hidden Markov model based PoS tagging models for small
training datasets in Turkish. We obtain an accuracy of 94.1% in morpheme
tagging and 89.2% in PoS tagging on a 5K training dataset.
Comment: 13 pages, accepted and presented at the 17th International Conference on
Intelligent Text Processing and Computational Linguistics (CICLing)
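The core idea of the first abstract is that conditioning HMM emissions on morpheme-tag sequences (produced upstream by a CRF) rather than on surface forms collapses many rare word forms into a few shared entries. A minimal sketch of that emission table, using hypothetical toy tokens rather than the paper's data or code:

```python
from collections import Counter

# Hedged sketch, not the authors' implementation: an HMM emission table
# keyed on morpheme-tag sequences instead of surface forms. Each token is
# (surface, morpheme_tags, pos); in the paper's pipeline the morpheme tags
# would come from a CRF tagger. All tokens below are invented examples.
corpus = [
    ("evlerde",   ("Noun", "Plural", "Locative"), "NOUN"),
    ("okullarda", ("Noun", "Plural", "Locative"), "NOUN"),
    ("geldi",     ("Verb", "Past"),               "VERB"),
]

def emission_probs(corpus):
    """P(morpheme_tags | pos) by maximum-likelihood counts."""
    pos_counts = Counter(pos for _, _, pos in corpus)
    pair_counts = Counter((pos, tags) for _, tags, pos in corpus)
    return {
        (pos, tags): count / pos_counts[pos]
        for (pos, tags), count in pair_counts.items()
    }

probs = emission_probs(corpus)
# Two distinct surface forms ("evlerde", "okullarda") share one emission
# entry, which is the sparsity reduction the abstract describes:
print(probs[("NOUN", ("Noun", "Plural", "Locative"))])  # 1.0
```

A surface-form emission table over the same toy corpus would hold three singleton entries instead of two, and the gap widens rapidly on real agglutinative data.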
Cross-Lingual Word Embeddings for Turkic Languages
There has been an increasing interest in learning cross-lingual word
embeddings to transfer knowledge obtained from a resource-rich language, such
as English, to lower-resource languages for which annotated data is scarce,
such as Turkish, Russian, and many others. In this paper, we present the first
viability study of established techniques to align monolingual embedding spaces
for Turkish, Uzbek, Azeri, Kazakh and Kyrgyz, members of the Turkic family,
which is heavily affected by the low-resource constraint. These techniques are
known to require little explicit supervision, mainly in the form of bilingual
dictionaries, hence being easily adaptable to different domains, including
low-resource ones. We obtain new bilingual dictionaries and new word embeddings
for these languages and show the steps for obtaining cross-lingual word
embeddings using state-of-the-art techniques. Then, we evaluate the results
using the bilingual dictionary induction task. Our experiments confirm that the
obtained bilingual dictionaries outperform previously-available ones, and that
word embeddings from a low-resource language can benefit from resource-rich
closely-related languages when they are aligned together. Furthermore,
evaluation on an extrinsic task (sentiment analysis in Uzbek) shows that
monolingual word embeddings can, although slightly, benefit from cross-lingual
alignments.
Comment: Final version, published in the proceedings of LREC 202
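A standard member of the family of alignment techniques the study evaluates is supervised mapping with a bilingual dictionary via orthogonal Procrustes. The sketch below uses synthetic vectors, not the paper's embeddings or dictionaries, purely to illustrate the mechanics:

```python
import numpy as np

# Hedged sketch of dictionary-supervised alignment via orthogonal
# Procrustes. Rows of X and Y are embeddings of translation pairs from a
# bilingual dictionary (source word i <-> target word i). The data here is
# synthetic: Y is random, X is an exact rotation of Y.
rng = np.random.default_rng(0)
Y = rng.normal(size=(50, 10))                    # "target"-language vectors
R_true, _ = np.linalg.qr(rng.normal(size=(10, 10)))
X = Y @ R_true.T                                 # "source"-language vectors

def procrustes(X, Y):
    """Orthogonal W minimizing ||X W - Y||_F (Schonemann's solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

W = procrustes(X, Y)
print(np.allclose(X @ W, Y, atol=1e-8))  # True: the rotation is recovered
```

After mapping, bilingual dictionary induction (the intrinsic evaluation used in the paper) amounts to nearest-neighbour retrieval between `X @ W` and the full target embedding matrix, typically with a hubness correction such as CSLS.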
A suffix based part-of-speech tagger for Turkish
5th International Conference on Information Technology: New Generations, April 7-9, 2008, Las Vegas, NV.
In this paper, we present a stochastic part-of-speech tagger for Turkish. The tagger is primarily developed for information retrieval purposes, but it can also serve as a light-weight PoS tagger for other purposes. The tagger uses a well-established hidden Markov model of the language with a closed lexicon that consists of a fixed number of letters from the word endings. We have considered seven different lengths of word endings against 30 training corpus sizes. The best-case accuracy obtained is 90.2% with 5 characters. The main contribution of this paper is to present a way of constructing a closed vocabulary for part-of-speech tagging that can be useful for highly inflected languages like Turkish, Finnish, Hungarian, Estonian, and Czech.
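The closed-lexicon construction in the third abstract reduces to a very small transformation: replace every word by its final k characters, so the emission vocabulary is bounded regardless of how many inflected forms occur. A minimal sketch with invented Turkish examples, using k=5 to match the best-performing ending length reported:

```python
# Hedged sketch of the suffix-based closed lexicon: map each word to its
# last k characters (the whole word if it is shorter). The words below are
# illustrative examples, not the paper's training data.
def suffix_token(word, k=5):
    """Return the final k characters of a word, used as its HMM emission symbol."""
    return word[-k:]

words = ["evlerde", "kitaplarda", "geldi"]
print([suffix_token(w) for w in words])  # ['lerde', 'larda', 'geldi']

# The vocabulary is closed: over any corpus it can contain at most
# |alphabet|**k distinct symbols, independent of corpus size.
vocab = {suffix_token(w) for w in words}
```

Because inflectional morphology in Turkish (and in Finnish, Hungarian, Estonian, or Czech) concentrates at the word ending, these truncated forms retain most of the signal a PoS tagger needs while keeping the lexicon fixed.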