
    An open, extendible, and fast Turkish morphological analyzer

    In this paper, we present a two-level morphological analyzer for Turkish which consists of five main components: a finite state transducer, a rule engine for suffixation, a lexicon, a trie data structure, and an LRU cache. We use Java to implement the finite state machine logic and the rule engine, and XML to describe the finite state transducer rules of the Turkish language, which makes the morphological analyzer both easily extendible and easily applicable to other languages. Empowered with a comprehensive lexicon of 54,000 bare-forms including 19,000 proper nouns, our morphological analyzer is amongst the most reliable analyzers produced so far. The analyzer is compared with Turkish morphological analyzers in the literature. By using an LRU cache and a trie data structure, the system can analyze 100,000 words per second, which enables users to analyze huge corpora in a few hours.
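The combination of a trie and an LRU cache mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the `Trie` class, the `candidate_splits` function, and the four-entry toy lexicon are hypothetical, and the remainder after each root is not validated by any suffixation rules.

```python
from functools import lru_cache


class Trie:
    """Minimal character trie holding lexicon root forms."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def prefixes(self, word):
        """Return every prefix of `word` stored as a root, shortest first."""
        found = []
        node = self
        for i, ch in enumerate(word):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_word:
                found.append(word[: i + 1])
        return found


# Hypothetical toy lexicon (assumption); the paper reports 54,000 bare-forms.
LEXICON = Trie()
for root in ("ev", "evli", "göz", "gözlük"):
    LEXICON.insert(root)


@lru_cache(maxsize=10_000)
def candidate_splits(word):
    """Return (root, remainder) pairs for `word`.

    Repeated surface forms are answered from the LRU cache instead of
    walking the trie again.
    """
    return tuple((root, word[len(root):]) for root in LEXICON.prefixes(word))
```

Calling `candidate_splits("evlilik")` walks the trie once and yields the splits `ev + lilik` and `evli + lik`; a second call with the same word is a cache hit, which is where the speed-up on large corpora with many repeated tokens would come from.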

    Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing

    Released only a year ago as the outputs of a research project (“Parsing Web 2.0 Sentences”, supported in part by a TUBİTAK 1001 grant (No. 112E276) and a part of the ICT COST Action PARSEME (IC1207)), IMST and IWT are currently the most comprehensive Turkish dependency treebanks in the literature. This article introduces the final states of our treebanks, as well as a newly integrated hierarchical categorization of the multiheaded dependencies and their organization in an exclusive deep dependency layer in the treebanks. It also presents the adaptation of recent studies on standardizing multiword expression and named entity annotation schemes for the Turkish language and integration of benchmark annotations into the dependency layers of our treebanks and the mapping of the treebanks to the latest Universal Dependencies (v2.0) standard, ensuring further compliance with rising universal annotation trends. In addition to significantly boosting the universal recognition of Turkish treebanks, our recent efforts have shown an improvement in their syntactic parsing performance (up to 77.8%/82.8% LAS and 84.0%/87.9% UAS for IMST/IWT, respectively). The final states of the treebanks are expected to be more suited to different natural language processing tasks, such as named entity recognition, multiword expression detection, transfer-based machine translation, semantic parsing, and semantic role labeling.

    IMST: A Revisited Turkish Dependency Treebank

    In this paper, we present a critical analysis of the dependency annotation framework used in the METU-Sabancı Treebank (MST), and propose new annotation schemes that would alleviate the issues we have identified. Later, we describe our attempt at reannotating the treebank from the ground up using the proposed schemes, and then compare the consistencies of the two versions via cross validation using a dependency parser. According to our experiments, the reannotated version of the original treebank, which we call the ITU-METU-Sabancı Treebank (IMST), demonstrates a labeled attachment score of 75.3% and an unlabeled attachment score of 83.7%, surpassing the corresponding scores of 65.9% and 76.0% for MST by a very large margin.
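For readers unfamiliar with the two metrics quoted above, the following Python sketch shows how labeled and unlabeled attachment scores are conventionally computed. The function name, the `(head, label)` token representation, and the example labels are assumptions for illustration; the abstract does not describe the authors' evaluation script, and details such as punctuation handling vary between evaluations.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as fractions in [0, 1].

    Each argument is a list with one (head_index, label) pair per token.
    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the dependency label to match.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / len(gold), las_hits / len(gold)


# Illustrative four-token sentence (labels are placeholders, not IMST's tag set).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]
uas, las = attachment_scores(gold, pred)
```

In this example three of four heads are correct and two of four tokens have both head and label correct, so UAS is 0.75 and LAS is 0.5. LAS can never exceed UAS, which matches the ordering of the figures reported for MST and IMST.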

    Building Phrase Polarity Lexicons for Sentiment Analysis

    Many approaches to sentiment analysis benefit from polarity lexicons. Most polarity lexicons include a list of polar (positive/negative) words, and sentiment analysis systems attempt to capture the occurrence of those words in text using polarity lexicons. Although polarity lexicons exist for many natural languages, most languages lack phrase polarity lexicons. Phrases play an important role in sentiment analysis because the polarity of a phrase cannot always be estimated from the polarity of its parts. In this work, a hybrid approach is proposed for building phrase polarity lexicons and is evaluated on Turkish as a low-resource language. The classification accuracies obtained in extracting and classifying phrases as positive, negative, or neutral confirm the effectiveness of the proposed methodology.

    Sentiment analysis in Turkish: resources and techniques

    Due to the ever-increasing amount of online information, manual processing of data is impractical. Social media such as Twitter play an important role in storing such information and helping people share their ideas. Extracting the attitude and opinion of people from user-entered data is worthwhile for companies. Sentiment analysis attempts to extract the embedded polarity from a segment of text (or other data types) and has many commercial and non-commercial applications. Companies are interested in the opinions of their customers. On the other hand, customers are interested in the opinions of other customers. Politicians and policy makers are also interested in the public's feedback on political events. The above-mentioned opinions can be (semi)automatically extracted from social media such as Twitter or Facebook with the help of sentiment analysis techniques. Sentiment analysis is a language-dependent (e.g. English) task that relies on natural language processing techniques. The richest language in terms of resources and research in sentiment analysis is English, while many other languages such as Turkish suffer from a lack of resources and techniques for sentiment analysis. In this thesis, we try to fill this gap by designing and implementing a framework for sentiment analysis in Turkish. This framework can also be adapted to other languages with some minor changes. In the scope of the framework, we have built a few Turkish polarity lexicons for the first time in the literature. We also comprehensively investigated the problem of sentiment analysis in Turkish and suggested some solutions. Experimental evaluation shows the effectiveness of the proposed resources and techniques for Turkish.

    Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

    Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, has advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration options and its support for a variety of natural language processing tasks in a uniform toolkit, from text classification and sequence labeling to dependency parsing, masked language modeling, and text generation. Comment: https://machamp-nlp.github.io