7 research outputs found

    A tagger environment for Galician

    In this paper, we introduce a tagger environment for Galician, the native language of Galicia. Galician belongs to the group of Romance languages that developed from the Latin imposed on the north-west of the Iberian Peninsula by the Romans, with additions from the languages of the peoples living there before the colonization, as well as contributions from other languages after the break-up of the Roman Empire. Various historical circumstances prevented it from becoming a State language, and although it was relegated to informal usage, our vernacular managed to survive well into the twentieth century when, in parallel with the recovery of the institutions of self-government, Galician was once again granted the status of official language of Galicia, together with Spanish. From an operational point of view, our proposal is based on the notion of finite automaton and separates the execution strategy from the implementation of the tagging interpreter. This eases maintenance while ensuring the robustness of the architecture. Empirical tests support the validity of our approach for a language whose morphology is non-trivial.
    Funding: España. Gobierno; HF97-223. Xunta de Galicia; XUGA10505B96. Xunta de Galicia; XUGA20402B97.

    Use of Weighted Finite State Transducers in Part of Speech Tagging

    This paper addresses part-of-speech disambiguation with finite-state transducers and makes two main contributions to the field. The first is the use of finite-state machines for part-of-speech tagging: linguistic and statistical information is represented as weights on the transitions of weighted finite-state transducers. The second is the successful combination of linguistic and statistical techniques for word disambiguation, together with the notion of word classes.
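To make the idea of "weights on transitions" concrete, here is a minimal sketch, not the paper's system. It assumes that weights behave like costs that add up along a path and that the cheapest path gives the tagging. The tag names, the two weight tables, the `DEFAULT_COST` value and the function `best_tags` are all invented for illustration.

```python
# Hypothetical toy weights (lower is better); none of these come from the paper.
LEXICAL = {
    "the": {"DET": 0.1},
    "can": {"NOUN": 1.6, "VERB": 0.9, "AUX": 0.4},
    "rusts": {"VERB": 0.3, "NOUN": 2.0},
}
TRANSITION = {
    ("<s>", "DET"): 0.2,
    ("DET", "NOUN"): 0.2, ("DET", "VERB"): 3.0, ("DET", "AUX"): 3.0,
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 1.5,
    ("VERB", "VERB"): 2.5, ("VERB", "NOUN"): 1.0,
    ("AUX", "VERB"): 0.3, ("AUX", "NOUN"): 2.0,
}
DEFAULT_COST = 5.0  # assumed penalty for a tag pair missing from TRANSITION


def best_tags(words):
    """Return the lowest-cost tag sequence for a list of known words."""
    # best maps the last tag of a partial path to (total cost, tags so far).
    best = {"<s>": (0.0, [])}
    for word in words:
        step = {}
        for tag, lexical_cost in LEXICAL[word].items():
            # Extend every surviving path with this tag and keep the cheapest.
            step[tag] = min(
                (cost + TRANSITION.get((prev, tag), DEFAULT_COST) + lexical_cost,
                 path + [tag])
                for prev, (cost, path) in best.items()
            )
        best = step
    return min(best.values())[1]


print(best_tags(["the", "can", "rusts"]))
```

In this toy example the ambiguous word "can" is resolved by the surrounding transitions rather than by its own lexical weight, which is the effect the abstract attributes to combining both kinds of information.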

    Development of an unknown-word analyzer program for Hungarian (Magyar ismeretlen szó-elemző program fejlesztése)

    When processing large corpora by computer, we inevitably run into the problem that the available morphological analyzer cannot analyze a very significant share of the word forms occurring in the texts, because the stem of the word is missing from its database. At the same time, most of these unanalyzable word forms contain suffixes, so they too must be analyzed in some way. To handle this problem, a hybrid procedure can be applied that consists of a symbolic partial morphological analyzer and a statistical tool that narrows the hypothesis space produced in the first step by the symbolic unknown-word analyzer to the desired size

    A morphological-syntactical analysis approach for Arabic textual tagging

    Part-of-Speech (POS) tagging is the process of labeling or classifying each word in written text with its grammatical category or part of speech, i.e. noun, verb, preposition, adjective, etc. It is the most common disambiguation process in the field of Natural Language Processing (NLP). POS tagging systems are often preprocessors in many NLP applications. The Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of the word. An Arabic text is partially vocalised (vocalisation is also referred to as diacritisation or vowelisation) when a diacritical mark is assigned to one or at most two letters in the word. Diacritics in Arabic texts are extremely important, especially at the end of the word. They not only help determine the correct POS tag for each word in the sentence, but also provide full information about the inflectional features of the words, such as tense, number, gender, etc. They add semantic information to words, which helps resolve ambiguity in the meaning of words. Furthermore, diacritics ascribe grammatical functions to the words, differentiating the word from other words and determining the syntactic position of the word in the sentence. This thesis presents a rule-based Part-of-Speech tagging system called AMT (short for Arabic Morphosyntactic Tagger). The main function of the AMT system is to assign the correct tag to each word in an untagged raw partially-vocalised Arabic corpus, and to produce a POS tagged corpus without using a manually tagged or untagged lexicon (dictionary) for training. Two different techniques were used in this work: the pattern-based technique and the lexical and contextual technique. The rules in the pattern-based technique are based on the pattern of the testing word. A novel algorithm, the Pattern-Matching Algorithm (PMA), has been designed and introduced in this work. The aim of this algorithm is to match the testing word with its correct pattern in the pattern lexicon. The lexical and contextual technique, on the other hand, is used to assist the pattern-based technique in assigning the correct tag to those words that do not have a pattern to follow. The rules in the lexical and contextual technique are based on the character(s), the last diacritical mark, the word itself, and the tags of the surrounding words. The importance of utilizing the diacritic feature of the Arabic language to reduce the lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag set and a new partially-vocalised Arabic corpus to test AMT have been compiled and presented in this work. The AMT system has achieved an average accuracy of 91%
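The two-stage idea described above (try a pattern lexicon first, then fall back on lexical and contextual rules) can be sketched as follows. This is a toy illustration and not the AMT system or its Pattern-Matching Algorithm: it assumes fully vowelled words in a simplified Latin transliteration, and the three patterns, the tag names, the preposition list and the function names are all assumptions made for the example.

```python
import re

C = "[^aiu]"  # assumed consonant class for the toy transliteration

# Hypothetical pattern lexicon: word template -> tag.
PATTERN_LEXICON = [
    (re.compile(f"^{C}a{C}a{C}a$"), "VERB"),   # e.g. kataba
    (re.compile(f"^{C}aa{C}i{C}$"), "NOUN"),   # e.g. kaatib
    (re.compile(f"^ma{C}{C}uu{C}$"), "NOUN"),  # e.g. maktuub
]
PREPOSITIONS = {"fii", "min"}  # rule on "the word itself"
NUNATION_ENDINGS = ("un", "an", "in")  # stand-in for a final diacritic rule


def tag_word(word, prev_tag):
    """Tag one word using patterns first, then fallback rules."""
    if word in PREPOSITIONS:
        return "PREP"
    for pattern, tag in PATTERN_LEXICON:
        if pattern.match(word):
            return tag
    # Fallback 1: the last diacritical mark (here, a nunation-like ending).
    if word.endswith(NUNATION_ENDINGS):
        return "NOUN"
    # Fallback 2: the tag of the preceding word.
    if prev_tag == "PREP":
        return "NOUN"
    return "UNK"


def tag_sentence(words):
    """Tag words left to right, passing each tag on as context."""
    tags = []
    for word in words:
        tags.append(tag_word(word, tags[-1] if tags else None))
    return tags


print(tag_sentence(["kataba", "fii", "bayt"]))
```

Ordering the rules from most specific (closed-class words, patterns) to most general (context) is a design choice of this sketch; the abstract does not state how AMT orders or weights its rules.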

    1st Hungarian Conference on Computational Linguistics (I. Magyar Számítógépes Nyelvészeti Konferencia)


    Argumentative zoning information extraction from scientific text

    Let me tell you, writing a thesis is not always a barrel of laughs, and strange things can happen, too. For example, at the height of my thesis paranoia, I had a recurrent dream in which my cat Amy gave me detailed advice on how to restructure the thesis chapters, which was awfully nice of her. But I also had a lot of human help throughout this time, whether things were going fine or berserk. Most of all, I want to thank Marc Moens: I could not have had a better or more knowledgeable supervisor. He always took time for me, however busy he might have been, reading chapters thoroughly in two days. He had both the calmness of mind to give me lots of freedom in research and the right judgement to guide me away, tactfully but determinedly, from the occasional catastrophe or other waiting along the way. He was great fun to work with and also became a good friend. My work has profited from the interdisciplinary, interactive and enlightened atmosphere at the Human Communication Centre and the Centre for Cognitive Science (which is now called something else). The Language Technology Group was a great place to work in, as my research was grounded in practical applications develope