170 research outputs found

    A free/open-source hybrid morphological disambiguation tool for Kazakh

    This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation, demonstrating a per-token accuracy of 91% in running text.
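
    To illustrate what "breaking up lexical forms across their derivational boundaries into inflectional groups" can look like, here is a minimal sketch. It assumes analyses are written as `root+Tag+Tag^DB+Tag...`, with `^DB` marking a derivational boundary, a notation common in work on Turkish; the tool described above may use a different format, and the function name and the example analysis are hypothetical.

```python
def split_inflectional_groups(analysis, boundary="^DB"):
    """Split one morphological analysis into its root and inflectional groups.

    Assumes the (hypothetical) format root+Tag+Tag^DB+Tag+Tag, where
    `boundary` marks a derivational boundary.
    """
    segments = analysis.split(boundary)
    # The first segment carries the root followed by the first group's tags.
    root, _, first_tags = segments[0].partition("+")
    # Each later segment starts with "+", which is dropped here.
    groups = [first_tags] + [seg.lstrip("+") for seg in segments[1:]]
    return root, groups


root, groups = split_inflectional_groups("kitap+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(root, groups)
```

    A statistical model can then be estimated over these shorter groups rather than over whole tag strings, which is one way to reduce data sparseness.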

    Morphological annotation of Korean with Directly Maintainable Resources

    This article describes an exclusively resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. Our annotator is designed to process text before the operation of a syntactic parser. In its present state, it annotates one-stem words only. The output is a graph of morphemes annotated with accurate linguistic information. The granularity of the tagset is 3 to 5 times higher than that of usual tagsets. A comparison with a reference annotated corpus showed that it achieves 89% recall without any corpus training. The language resources used by the system are lexicons of stems, transducers of suffixes, and transducers that generate allomorphs. All can be easily updated, which allows users to control how the performance of the system evolves. It has been claimed that morphological annotation of Korean text could only be performed by a morphological analysis module accessing a lexicon of morphemes. We show that it can also be performed directly with a lexicon of words and without applying morphological rules at annotation time, which speeds up annotation to 1,210 words/s. The lexicon of words is obtained from the maintainable language resources through a fully automated compilation process.

    СУЧАСНІ МЕТОДИ ВИРІШЕННЯ ПРОБЛЕМИ ГРАМАТИЧНОЇ ОМОНІМІЇ В ТЕКСТІ (Modern methods for resolving grammatical homonymy in text)

    The article examines grammatical homonymy, specifically morphological homonymy, from a text-centered perspective. It analyzes the main approaches, experience, and prospects for resolving this ambiguity during automatic morphological analysis of text, particularly for Ukrainian and other morphologically complex languages.

    Identificación de cláusulas y chunks para el Euskera, usando Filtrado y Ranking con el Perceptron (Clause and chunk identification for Basque using Filtering and Ranking with the Perceptron)

    This paper presents systems for syntactic chunking and clause identification for Basque that combine rule-based grammars with machine-learning techniques. Specifically, we use Filtering-Ranking with Perceptrons (Carreras, Màrquez and Castro, 2005), a learning model that recognizes partial syntactic structures in sentences and obtains state-of-the-art performance on these tasks in English. The model can incorporate a rich set of features to represent syntactic phrases, which makes it possible to use information from different sources. We used this property to add more linguistic features to the learning model, and the chunking results improved considerably, compensating for the relatively small training corpus available for Basque. For clause identification, our preliminary results are low, probably because of the free word order of Basque and the small corpus currently available.
    Research partly funded by the Basque Government (Department of Education, University and Research, IT-397-07), the Spanish Ministry of Education and Science (TIN2007-63173) and the ETORTEK-ANHITZ project from the Basque Government (Department of Culture and Industry, IE06-185).

    Statistical modeling of agglutinative languages

    Ankara: Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2000. Thesis (Ph.D.), Bilkent University, 2000. Includes bibliographical references (leaves 107-116). Hakkani-Tür, Dilek Z. Ph.D.

    Statistical morphological disambiguation with application to disambiguation of pronunciations in Turkish

    The statistical morphological disambiguation of agglutinative languages suffers from data sparseness. In this study, we introduce the notion of distinguishing tag sets (DTS) to overcome the problem. The morphological analyses of words are modeled with DTS and the root major part-of-speech tags. The disambiguator based on these representations performs the statistical morphological disambiguation of Turkish with a recall as high as 95.69 percent. In text-to-speech systems and in developing transcriptions for acoustic speech data, the problem arises when disambiguating the pronunciation of a token in context, so that the correct pronunciation can be produced or the transcription uses the correct set of phonemes. We apply the morphological disambiguator to this problem of pronunciation disambiguation and achieve 99.54 percent recall with 97.95 percent precision. Most text-to-speech systems perform phrase-level accentuation based on the content word/function word distinction. This approach seems easy and adequate for some right-headed languages such as English but is not suitable for languages such as Turkish. We then use a heuristic approach, based on dependency parsing, to mark up phrase boundaries as a basis for phrase-level accentuation in Turkish TTS synthesizers.
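
    Recall and precision can differ, as in the figures above, when a disambiguator is allowed to keep more than one candidate per token. The sketch below shows one common way to compute the two measures in that setting. It is an assumption about the evaluation setup, not the paper's stated procedure, and the function name and toy data are hypothetical.

```python
def precision_recall(system_outputs, gold):
    """Score a disambiguator that may keep several candidates per token.

    system_outputs: list of sets, the candidates kept for each token.
    gold: list of the single correct candidate for each token.
    Assumed definitions: recall = tokens whose correct candidate was kept
    / all tokens; precision = those same hits / all candidates kept.
    """
    hits = sum(1 for kept, correct in zip(system_outputs, gold) if correct in kept)
    total_kept = sum(len(kept) for kept in system_outputs)
    return hits / total_kept, hits / len(gold)


# Toy data: the second token stays ambiguous, the third is wrong.
outputs = [{"a"}, {"b", "c"}, {"d"}]
gold = ["a", "b", "x"]
precision, recall = precision_recall(outputs, gold)
print(precision, recall)
```

    Under these definitions, leaving a token ambiguous protects recall but lowers precision, since every extra candidate kept counts against it.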

    Development and Design of Deep Learning-based Parts-of-Speech Tagging System for Azerbaijani language

    Part-of-speech (POS) tagging, also referred to as word-class disambiguation, is a prerequisite technique used in the pre-processing stage of the pipeline in most natural language processing (NLP) applications. Using it as a preliminary step, most NLP software, such as chatbots, translation engines, and voice recognition systems, assigns a part of speech to each word in the given data to identify its grammatical category, which makes the meaning of the word easier to determine. This thesis presents a novel approach to clarifying word context for the Azerbaijani language, using a deep learning-based automatic part-of-speech tagger on a clean (manually annotated) dataset. Azerbaijani is a member of the Turkic family and an agglutinative language. In contrast to work on other languages, recent studies of part-of-speech taggers for Azerbaijani have not delivered state-of-the-art accuracy. This thesis therefore investigates how deep learning architectures such as simple recurrent neural networks (RNN), long short-term memory (LSTM), bi-directional long short-term memory (Bi-LSTM), and gated recurrent units (GRU) might be used to improve POS tagging for Azerbaijani.