91 research outputs found
Language. An introduction to the study of speech
Mode of access: Internet
Predictive Text Entry for Agglutinative Languages Using Unsupervised Morphological Segmentation
Host publication title: Computational Linguistics and Intelligent Text Processing Host publication sub-title: 13th International Conference, CICLing 2012Peer reviewe
Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish
In morphologically complex languages, many high-level tasks in natural language
processing rely on accurate morphosyntactic analyses of the input. However, in
light of the risk of error propagation in present-day pipeline architectures for basic
linguistic pre-processing, the state of the art for morphosyntactic tagging is still
not satisfactory. The main obstacle here is data sparsity inherent to natural lan-
guage in general and highly inflected languages in particular.
In this work, we investigate whether semi-supervised systems may alleviate the
data sparsity problem. Our approach uses word clusters obtained from large
amounts of unlabelled text in an unsupervised manner in order to provide a su-
pervised probabilistic tagger with morphologically informed features. Our evalua-
tions on a number of datasets for the Polish language suggest that this simple
technique improves tagging accuracy, especially with regard to out-of-vocabulary
words. This may prove useful to increase cross-domain performance of taggers,
and to alleviate the dependency on large amounts of supervised training data,
which is especially important from the perspective of less-resourced languages
Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction.
Language-independent tokenisation (LIT) methods that do not require labelled
language resources or lexicons have recently gained popularity because of their
applicability in resource-poor languages. Moreover, they compactly represent a
language using a fixed size vocabulary and can efficiently handle unseen or
rare words. On the other hand, language-specific tokenisation (LST) methods
have a long and established history, and are developed using carefully created
lexicons and training resources. Unlike subtokens produced by LIT methods, LST
methods produce valid morphological subwords. Despite the contrasting
trade-offs between LIT vs. LST methods, their performance on downstream NLP
tasks remain unclear. In this paper, we empirically compare the two approaches
using semantic similarity measurement as an evaluation task across a diverse
set of languages. Our experimental results covering eight languages show that
LST consistently outperforms LIT when the vocabulary size is large, but LIT can
produce comparable or better results than LST in many languages with
comparatively smaller (i.e. less than 100K words) vocabulary sizes, encouraging
the use of LIT when language-specific resources are unavailable, incomplete or
a smaller model is required. Moreover, we find that smoothed inverse frequency
(SIF) to be an accurate method to create word embeddings from subword
embeddings for multilingual semantic similarity prediction tasks. Further
analysis of the nearest neighbours of tokens show that semantically and
syntactically related tokens are closely embedded in subword embedding spacesComment: To appear in the 12th Language Resources and Evaluation (LREC 2020)
Conferenc
Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish /
The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on the introduced representations performs the statistical morphological disambiguation of Turkish with a recall of as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, the problem occurs in disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this problem of pronunciation disambiguation and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase level accentuation based on content word/function word distinction. This approach seems easy and adequate for some right headed languages such as English but is not suitable for languages such as Turkish. We then use a a heuristic approach to mark up the phrase boundaries based on dependency parsing on a basis of phrase level accentuation for Turkish TTS synthesizers
Named Entity Recognition for Manipuri Using Support Vector Machine
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Modeling Target-Side Inflection in Neural Machine Translation
NMT systems have problems with large vocabulary sizes. Byte-pair encoding
(BPE) is a popular approach to solving this problem, but while BPE allows the
system to generate any target-side word, it does not enable effective
generalization over the rich vocabulary in morphologically rich languages with
strong inflectional phenomena. We introduce a simple approach to overcome this
problem by training a system to produce the lemma of a word and its
morphologically rich POS tag, which is then followed by a deterministic
generation step. We apply this strategy for English-Czech and English-German
translation scenarios, obtaining improvements in both settings. We furthermore
show that the improvement is not due to only adding explicit morphological
information.Comment: Accepted as a research paper at WMT17. (Updated version with
corrected references.
Attaching Translations to Proper Lexical Senses in DBnary
International audienceThe DBnary project aims at providing high quality Lexical Linked Data extracted from different Wiktionary language editions. Data from 10 different languages is currently extracted for a total of over 3.16M translation links that connect lexical entries from the 10 extracted languages, to entries in more than one thousand languages. In Wiktionary, glosses are often associated with translations to help users understand to what sense they refer to, whether through a textual definition or a target sense number. In this article we aim at the extraction of as much of this information as possible and then the disambiguation of the corresponding translations for all languages available. We use an adaptation of various textual and semantic similarity techniques based on partial or fuzzy gloss overlaps to disambiguate the translation relations (To account for the lack of normalization, e.g. lemmatization and PoS tagging) and then extract some of the sense number information present to build a gold standard so as to evaluate our disambiguation as well as tune and optimize the parameters of the similarity measures. We obtain F-measures of the order of 80\% (on par with similar work on English only), across the three languages where we could generate a gold standard (French, Portuguese, Finnish) and show that most of the disambiguation errors are due to inconsistencies in Wiktionary itself that cannot be detected at the generation of DBnary (shifted sense numbers, inconsistent glosses, etc.)
Automated Implementation Process of Machine Translation System for Related Languages
The paper presents an attempt to automate all data creation processes of a rule-based shallow-transfer machine translation system. The presented methods were tested on four fully functional translation systems covering language pairs: Slovenian paired with Serbian, Czech, English and Estonian language. An extensive range of evaluation tests was performed to assess the applicability of the methods
- …