4 research outputs found

    Using Stanford Part-of-Speech Tagger for the Morphologically-rich Filipino Language

    Application of Lexical Features Towards Improvement of Filipino Readability Identification of Children's Literature

    Proper identification of the grade levels of children's reading materials is an important step towards effective learning. Recent studies in readability assessment for English have applied modern natural language processing (NLP) approaches, such as machine learning (ML) techniques, to automate the process. Modeling readability formulas also requires extracting the correct linguistic features. For the Filipino language, limited work has been done [1, 2], especially work that treats the language's lexical complexity as the main features. In this paper, we explore the use of lexical features to improve readability identification for children's books written in Filipino. Results show that combining lexical features (LEX), consisting of type-token ratio, lexical density, lexical variation, and foreign word count, with the traditional features (TRAD) used in previous work, such as sentence length, average syllable length, polysyllabic words, and word, sentence, and phrase counts, increased the performance of readability models by about 5 percentage points (from 42% to 47.2%). A further analysis ranks the features to identify which contribute the most to reading complexity. Comment: 8 tables, 1 figure. Presented at the Philippine Computing Science Congress 202
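To make two of the lexical features named in this abstract concrete, here is a minimal Python sketch that computes type-token ratio and lexical density for a short text. It is an illustration under stated assumptions, not the paper's implementation: the `FUNCTION_WORDS` list, the `tokenize` helper, and the `lexical_features` function are hypothetical names introduced here, and the abstract does not say how the authors separate content words from function words.

```python
import re

# Hypothetical, deliberately tiny list of Filipino function words.
# This is an assumption for illustration only; the abstract does not
# specify how content words are identified.
FUNCTION_WORDS = {"ang", "ng", "sa", "si", "ni", "at", "ay", "mga", "na", "kay"}


def tokenize(text):
    """Lowercase the text and return its word tokens.

    A token is a run of letters, optionally joined by hyphens or
    apostrophes (for example "mag-aaral").
    """
    return re.findall(r"[a-zñ]+(?:[-'][a-zñ]+)*", text.lower())


def lexical_features(text):
    """Return type-token ratio and lexical density for the given text."""
    tokens = tokenize(text)
    if not tokens:
        return {"type_token_ratio": 0.0, "lexical_density": 0.0}

    # Type-token ratio: distinct word forms divided by total tokens.
    types = set(tokens)

    # Lexical density: share of tokens that are not function words.
    content_tokens = [t for t in tokens if t not in FUNCTION_WORDS]

    return {
        "type_token_ratio": len(types) / len(tokens),
        "lexical_density": len(content_tokens) / len(tokens),
    }


if __name__ == "__main__":
    sample = "Ang bata ay naglaro sa parke at ang aso ay tumakbo."
    print(lexical_features(sample))
```

In a fuller pipeline, values like these would be computed per document and combined with traditional counts such as sentence length before being passed to a classifier. A part-of-speech tagger would likely be a more reliable way to identify content words than a fixed word list.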

    Examining voice choice in Tagalog: A corpus of web-based Tagalog

    This study is a corpus-based analysis of web-based Tagalog (Austronesian) investigating how different prominence features influence voice in basic, declarative, transitive clauses. A large sample of these structures was extracted from a web-based corpus of Tagalog. The arguments were annotated for animacy, definiteness, and other factors proposed to influence voice choice. Preliminary results suggest that, despite the morphosyntactic symmetry of voice alternations in the language, the Undergoer Voice appears to be the preferred structure regardless of these factors. Moreover, the Actor Voice may be used only in highly constrained contexts when describing two-participant, transitive events. This work has implications for how we understand the notion of prominence more generally and how languages might have specific requirements for the mappings between different prominence hierarchies.

    Using Stanford part-of-speech tagger for the morphologically-rich Filipino Language

    This research focuses on the implementation of a Maximum Entropy-based Part-of-Speech (POS) tagger for Filipino. It uses the Stanford POS tagger, a trainable POS tagger that has been trained on English, Chinese, Arabic, and other languages and produces some of the highest results in each language. The tagger was trained for Filipino on a 406k-token corpus, taking into account Filipino-specific linguistic phenomena such as rich morphology and intra-sentential code-switching. The Filipino POS tagger achieved 96.15% tagging accuracy, which is currently the highest among existing POS taggers for Filipino, by a large margin. Copyright © 2017 Matthew Phillip Go and Nicco Noco