170 research outputs found
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a
morphological disambiguation tool for Kazakh. Starting with a
previously developed rule-based approach, we tried to cope with
the complex morphology of Kazakh by breaking up lexical forms
across their derivational boundaries into inflectional groups
and modeling their behavior with statistical methods. A hybrid
rule-based/statistical approach appears to benefit morphological
disambiguation, demonstrating a per-token accuracy of 91% in
running text.
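The step of breaking an analysis across its derivational boundaries can be sketched as follows. This is only an illustration under stated assumptions: the `^DB` boundary marker, the `+`-separated tag format, and the function name `split_into_igs` are hypothetical and are not taken from the tool described in the abstract.

```python
# Hypothetical marker for a derivational boundary inside an analysis string.
DERIV_BOUNDARY = "^DB"


def split_into_igs(analysis):
    """Split 'root+Tag+Tag^DB+Tag...' into (root, list of inflectional groups).

    Each inflectional group is the list of tags found between two
    derivational boundaries.
    """
    root, _, tags = analysis.partition("+")
    groups = [chunk.strip("+") for chunk in tags.split(DERIV_BOUNDARY)]
    return root, [group.split("+") for group in groups if group]


# Placeholder analysis; "stem" and the tag names are illustrative only.
root, igs = split_into_igs("stem+Noun+Pl^DB+Adj+Attr^DB+Noun+Zero+Loc")
print(root, igs)
```

A statistical model can then be estimated over these smaller groups rather than over whole tag strings, which is the motivation the abstract gives for the split.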
Morphological annotation of Korean with Directly Maintainable Resources
This article describes an exclusively resource-based method of morphological
annotation of written Korean text. Korean is an agglutinative language. Our
annotator is designed to process text before the operation of a syntactic
parser. In its present state, it annotates one-stem words only. The output is a
graph of morphemes annotated with accurate linguistic information. The
granularity of the tagset is 3 to 5 times higher than that of usual tagsets. A
comparison with a reference annotated corpus showed that it achieves 89% recall
without any corpus training. The language resources used by the system are
lexicons of stems, transducers of suffixes and transducers of generation of
allomorphs. All can be easily updated, which allows users to control the
evolution of the system's performance. It has been claimed that
morphological annotation of Korean text could only be performed by a
morphological analysis module accessing a lexicon of morphemes. We show that it
can also be performed directly with a lexicon of words and without applying
morphological rules at annotation time, which speeds up annotation to 1,210
words/s. The lexicon of words is obtained from the maintainable language
resources through a fully automated compilation process.
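The idea of compiling a lexicon of words ahead of time and annotating by direct lookup can be sketched as below. This is a toy illustration, not the system's actual resources: the stems, suffix table, tag names, and function names are all assumptions, and the allomorph choice is reduced to a single rule (whether the stem ends in a final consonant).

```python
# Toy resources; entries and tag names are illustrative assumptions.
STEMS = {"학교": "Noun", "책": "Noun"}
# Each suffix: (form after a vowel-final stem, form after a consonant-final stem, tag)
SUFFIXES = [("가", "이", "Subj"), ("를", "을", "Obj"), ("에", "에", "Loc")]


def ends_in_consonant(word):
    """True if the last Hangul syllable carries a final consonant."""
    code = ord(word[-1]) - 0xAC00
    return 0 <= code < 11172 and code % 28 != 0


def compile_word_lexicon(stems, suffixes):
    """Expand stems and suffixes into a full-form lexicon (done once, offline)."""
    lexicon = {}
    for stem, stem_tag in stems.items():
        lexicon.setdefault(stem, []).append([(stem, stem_tag)])
        for after_vowel, after_consonant, tag in suffixes:
            form = after_consonant if ends_in_consonant(stem) else after_vowel
            analysis = [(stem, stem_tag), (form, tag)]
            lexicon.setdefault(stem + form, []).append(analysis)
    return lexicon


def annotate(tokens, lexicon):
    """Annotation is a plain dictionary lookup; no rules run at this stage."""
    return [(token, lexicon.get(token, [])) for token in tokens]


LEXICON = compile_word_lexicon(STEMS, SUFFIXES)
print(annotate(["책이", "학교에"], LEXICON))
```

Because all rule application happens in `compile_word_lexicon`, updating a stem or suffix entry and recompiling is enough to change the annotator's behavior, which mirrors the maintainability argument in the abstract.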
СУЧАСНІ МЕТОДИ ВИРІШЕННЯ ПРОБЛЕМИ ГРАМАТИЧНОЇ ОМОНІМІЇ В ТЕКСТІ (Modern methods for resolving grammatical homonymy in text)
The article examines the phenomenon of grammatical homonymy, specifically its morphological variety, from a text-centered perspective. It analyzes the main approaches, experience, and prospects for resolving this problem in the automatic morphological analysis of text, in particular for Ukrainian and other morphologically complex languages.
Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron (Clause and chunk identification for Basque using Filtering and Ranking with the Perceptron)
This paper presents systems for syntactic chunking and clause identification for
Basque, combining rule-based grammars with machine-learning techniques. Specifically, we use
Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005), a learning model that
recognizes partial syntactic structures in sentences and obtains state-of-the-art performance on
these tasks in English. The model can incorporate a rich set of features to represent
syntactic phrases, which makes it possible to use information from different sources. We used this
property to add more linguistic features to the learning model, and the chunking results
improved considerably, compensating for the relatively small training corpus
available for Basque. For clause identification, our preliminary results are low,
probably because of the free word order of Basque and the small corpus currently available.
Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06-185).
Statistical modeling of agglutinative languages
Ankara: Department of Computer Engineering and the Institute of Engineering and Science of Bilkent Univ., 2000. Thesis (Ph.D.) -- Bilkent University, 2000. Includes bibliographical references (leaves 107-116). Hakkani-Tür, Dilek Z. Ph.D.
Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish
The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on these representations performs the statistical morphological disambiguation of Turkish with a recall as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, the problem arises when disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this problem of pronunciation disambiguation and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase-level accentuation based on the content word/function word distinction. This approach seems easy and adequate for some right-headed languages such as English but is not suitable for languages such as Turkish. We then use a heuristic approach, based on dependency parsing, to mark up phrase boundaries as a basis for phrase-level accentuation in Turkish TTS synthesizers.
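Reporting both recall and precision makes sense when a disambiguator may keep more than one analysis per token. The sketch below assumes one common convention, which the abstract itself does not spell out: recall is the share of tokens whose correct analysis survives, and precision is the share of retained analyses that are correct. The function name and data layout are hypothetical.

```python
def recall_precision(gold, predicted):
    """Score a disambiguator that may keep several analyses per token.

    gold:      list with the single correct analysis for each token
    predicted: list with the set of analyses retained for each token
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    # A token counts as correct if its gold analysis is among those retained.
    correct = sum(1 for answer, kept in zip(gold, predicted) if answer in kept)
    retained = sum(len(kept) for kept in predicted)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / retained if retained else 0.0
    return recall, precision


# Placeholder analyses: four tokens, one left ambiguous, one resolved wrongly.
gold = ["a1", "b1", "c1", "d1"]
predicted = [{"a1"}, {"b1", "b2"}, {"c2"}, {"d1"}]
print(recall_precision(gold, predicted))
```

Under this convention a fully disambiguating system has equal recall and precision, so a gap between the two figures indicates residual ambiguity in the output.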
Development and Design of Deep Learning-based Parts-of-Speech Tagging System for Azerbaijani language
Part-of-speech (POS) tagging, also referred to as word-class disambiguation, is a prerequisite technique used in the pre-processing stage of the pipeline in the majority of natural language processing (NLP) applications. Using it as a preliminary step, most NLP software, such as chatbots, translation engines, and voice recognition systems, assigns a part of speech to each word in the given data in order to identify its grammatical category, which helps in determining the meaning of the word.
This thesis presents a novel approach to clarifying word context for the Azerbaijani language, using a deep learning-based automatic part-of-speech tagger on a clean (manually annotated) dataset. Azerbaijani is an agglutinative language of the Turkic family. In contrast to other languages, recent research on POS taggers for Azerbaijani has been unable to deliver state-of-the-art accuracy. This thesis therefore investigates how deep learning strategies such as simple recurrent neural networks (RNN), long short-term memory (LSTM), bi-directional long short-term memory (Bi-LSTM), and gated recurrent units (GRU) might be used to improve POS tagging for Azerbaijani.