
    A Morphological Analyzer for Japanese Nouns, Verbs and Adjectives

    We present an open source morphological analyzer for Japanese nouns, verbs and adjectives. The system builds upon the morphological analysis capabilities of MeCab to incorporate finer classification details such as politeness, tense, mood and voice attributes. We implemented our analyzer as a finite state transducer using the open source finite state compiler toolkit FOMA. The source code and the tool are available at https://bitbucket.org/skylander/yc-nlplab/

    Efficient deep processing of Japanese

    We present a broad-coverage Japanese grammar written in the HPSG formalism with MRS semantics. The grammar is created for use in real-world applications, so robustness and performance play an important role. It is connected to a POS tagging and word segmentation tool. The grammar is being developed in a multilingual context, which requires MRS structures that are easily comparable across languages.

    Dependency parsing of Turkish

    The suitability of different parsing methods for different languages is an important topic in syntactic parsing. Lesser-studied languages in particular, which are typologically different from the languages for which the methods were originally developed, pose interesting challenges in this respect. This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative, free constituent order language that can be seen as representative of a wider class of languages of similar type. Our investigations show that morphological structure plays an essential role in finding syntactic relations in such a language. In particular, we show that employing sublexical representations called inflectional groups, rather than word forms, as the basic parsing units improves parsing accuracy. We compare two different parsing methods, one based on a probabilistic model with beam search, the other based on discriminative classifiers and a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless of parsing method. We examine the impact of morphological and lexical information in detail and show that, properly used, this kind of information can improve parsing accuracy substantially. Applying the techniques presented in this article, we achieve the highest reported accuracy for parsing the Turkish Treebank.
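
    The idea of using inflectional groups instead of word forms as parsing units can be illustrated with a minimal sketch. It assumes that each word arrives as a morphological analysis string in which derivational boundaries are marked with `^DB`; that notation, the example analyses and the function names are illustrative assumptions and are not taken from the abstract.

```python
# Assumption: a derivational boundary inside one word's analysis is written "^DB".
def split_inflectional_groups(analysis, boundary="^DB"):
    """Split one word's morphological analysis into its inflectional groups."""
    return [group.strip("+") for group in analysis.split(boundary)]


def parsing_units(analyses):
    """Flatten a sentence into (word_index, group_index, tags) parsing units."""
    units = []
    for word_index, analysis in enumerate(analyses):
        groups = split_inflectional_groups(analysis)
        for group_index, tags in enumerate(groups):
            units.append((word_index, group_index, tags))
    return units


# Hypothetical analyses for a two-word phrase; the first word has two groups.
sentence = [
    "okul+Noun+A3sg+Pnon+Loc^DB+Adj+Rel",
    "öğrenci+Noun+A3sg+Pnon+Nom",
]
units = parsing_units(sentence)
for unit in units:
    print(unit)
```

    A parser working on these units would attach dependencies to individual groups rather than to whole words, so the two-word phrase above gives it three units to link.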

    Annotated Disjunct for Machine Translation

    Most information found on the Internet is available in English. However, most people in the world are not English speakers. Hence, a reliable Machine Translation tool would be of great advantage to them. There are many approaches to developing Machine Translation (MT) systems, among them the direct, rule-based/transfer, interlingua, and statistical approaches. This thesis focuses on developing an MT system for less-resourced languages, i.e. languages that have no available grammar formalism, parser, or corpus, such as some languages in South East Asia. The nonexistence of bilingual corpora motivates us to use direct or transfer approaches. Moreover, the unavailability of a grammar formalism and parser for the target languages motivates us to develop a hybrid between the direct and transfer approaches. This hybrid is referred to as a hybrid transfer approach. It uses the Annotated Disjunct (ADJ) method. This method, based on the Link Grammar (LG) formalism, can theoretically handle one-to-one, many-to-one, and many-to-many word translations. It consists of a transfer rules module that maps source words in a source sentence (SS) to target words in the correct position in a target sentence (TS). The developed transfer rules are demonstrated on English → Indonesian translation tasks. An experimental evaluation is conducted to compare the performance of the developed system with that of available English-Indonesian MT systems. The developed ADJ-based MT system translated simple, compound, and complex English sentences in the present, present continuous, present perfect, past, past perfect, and future tenses with better precision than other systems, with an accuracy of 71.17% on the Subjective Sentence Error Rate metric.

    Computer-Assisted Language Learning Based on Authentic Texts: Applications for Italian

    Computer-Assisted Language Learning (CALL) is one of the sub-disciplines within the area of Second Language Acquisition. Clozes, also called fill-in-the-blank exercises, are widely used in language learning applications. A cloze is an exercise where the learner is asked to provide a fragment that has been removed from the text. For language learning purposes, in addition to open-end clozes, where one or more words are removed and the student must fill the gap, another type of cloze is commonly used, namely the multiple-choice cloze. In a multiple-choice cloze, a fragment is removed from the text and the student must choose the correct answer from multiple options. Multiple-choice exercises are a common way of practicing and testing grammatical knowledge. The aim of this work is to identify relevant learning constructs for Italian to be applied to automatic exercise creation based on authentic texts in the Revita Framework. Learning constructs are units that represent language knowledge. Revita is a free-to-use online platform that was designed to provide language learning tools with the aim of revitalizing endangered languages, including several Finno-Ugric languages such as North Saami. Non-endangered languages were added later. Italian is the first majority language to be added in a principled way. This work paves the way towards adding new languages in the future. Its purpose is threefold: it contributes to raising Italian from its beta status towards a full development stage; it formulates best practices for defining support for a new language; and it serves as documentation of what has been done, how, and what remains to be done. Grammars and linguistic resources were consulted to compile an inventory of learning constructs for Italian.
Analytic and pronominal verbs, verb government with prepositions, and noun phrase agreement were implemented by designing pattern rules that match sequences of tokens with specific parts of speech, surface forms and morphological tags. The rules were tested on test sentences, which allowed them to be further refined and corrected. The 47 rules for analytic and pronominal verbs currently reach a precision of 100% and a recall of 96.4% on 177 test sentences. The 5 noun phrase agreement rules reach 96.0% for both precision and recall on 34 test sentences. Analytic and pronominal verb patterns, as well as noun phrase agreement patterns, were used to generate open-end clozes. Verb government pattern rules were implemented as multiple-choice exercises where one of the four presented options is the correct preposition and the other three are prepositions that do not fit in context. The patterns were designed on the basis of colligations, that is, combinations of tokens (collocations) that are also explained by grammatical constraints. Verb government exercises were generated on a specifically collected corpus of 29,074 words. The corpus included three types of text: biography sections from Wikipedia, Italian news articles and Italian language matriculation exams. The last text type generated the most exercises, at a rate of 19 exercises per 10,000 words, suggesting that the semi-authentic text best matched the level of the verb government exercises because of its appropriate vocabulary frequency and sentence structure complexity. Four native language experts, either teachers of Italian as L2 or linguists, evaluated the usability of the generated multiple-choice clozes, which resulted in a score of 93.55%. This result suggests that minor adjustments, i.e., the exclusion of target verbs that cause multiple admissibility, are sufficient to consider verb government patterns usable until the possibility of dealing with multiple admissible answers is addressed.
The implementation of some of the most important learning constructs for Italian proved feasible with current NLP tools, although a quantitative evaluation of the precision and recall of the designed rules is needed to evaluate the generation of exercises on authentic text. This work paves the way towards a full development stage of Italian in Revita and enables further pilot studies with actual learners, which will make it possible to measure learning outcomes in quantitative terms.
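
The verb government exercises described above (a pattern rule matched against tagged tokens, then a four-option cloze) can be sketched roughly as follows. This is a minimal illustration, not Revita's implementation: the `Token` class, the `GOVERNMENT` table, the preposition list and both function names are hypothetical. The distractors here are drawn at random, so unlike the rules in the abstract the sketch does not check that they fail to fit the context.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str
    pos: str        # coarse part-of-speech tag, e.g. "VERB" or "ADP"
    lemma: str = ""


# Hypothetical rule table: verb lemma -> preposition it governs.
GOVERNMENT = {"pensare": "a", "dipendere": "da"}
PREPOSITIONS = ["a", "da", "di", "in", "con", "su", "per"]


def find_government(tokens):
    """Return indices of prepositions that directly follow a verb governing them."""
    hits = []
    for i in range(len(tokens) - 1):
        verb, prep = tokens[i], tokens[i + 1]
        if (verb.pos == "VERB" and prep.pos == "ADP"
                and GOVERNMENT.get(verb.lemma) == prep.surface.lower()):
            hits.append(i + 1)
    return hits


def make_cloze(tokens, index, rng):
    """Blank out the token at `index` and offer four options, one of them correct."""
    answer = tokens[index].surface.lower()
    # Assumption: random distractors; a real system must exclude admissible ones.
    distractors = rng.sample([p for p in PREPOSITIONS if p != answer], 3)
    options = distractors + [answer]
    rng.shuffle(options)
    text = " ".join("____" if j == index else t.surface
                    for j, t in enumerate(tokens))
    return {"text": text, "options": options, "answer": answer}


sentence = [
    Token("Maria", "PROPN", "Maria"),
    Token("pensa", "VERB", "pensare"),
    Token("a", "ADP", "a"),
    Token("suo", "DET", "suo"),
    Token("fratello", "NOUN", "fratello"),
]
hits = find_government(sentence)
cloze = make_cloze(sentence, hits[0], random.Random(0))
print(cloze["text"])
print(cloze["options"])
```

Random distractors are the weak point of this sketch: a verb such as "pensare" also admits "di" in some contexts, which is the multiple-admissibility problem the abstract mentions.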