
    Statistical dependency parsing of Turkish

    This paper presents results from the first statistical dependency parser for Turkish. Turkish is a free-constituent-order language with complex agglutinative inflectional and derivational morphology, and it presents interesting challenges for statistical parsing because, in general, dependency relations hold between “portions” of words called inflectional groups. We have explored statistical models that use different representational units for parsing. We have used the Turkish Dependency Treebank to train and test our parser, but have limited this initial exploration to the subset of treebank sentences with only left-to-right, non-crossing dependency links. Our results indicate that the best accuracy in terms of the dependency relations between inflectional groups is obtained when inflectional groups are used as parsing units and when contexts around the dependent are employed.
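    A minimal sketch of the inflectional-group idea described above, assuming a simplified representation: a word is a sequence of inflectional groups (IGs), and a dependency link points at a specific IG inside the head word rather than at the whole word. The class names, the example analysis of "evdeki", and the link tuple are illustrative assumptions, not the paper's actual data structures.

```python
# Treat inflectional groups (IGs), not whole words, as parsing units.
from dataclasses import dataclass, field

@dataclass
class InflectionalGroup:
    form: str        # root or derived portion within the word
    features: list   # POS and inflectional features of this IG

@dataclass
class Word:
    surface: str
    igs: list = field(default_factory=list)  # a word = one or more IGs

# "evdeki" ("the one in the house") is commonly analyzed as two IGs:
# a nominal IG plus an adjectival IG derived by the -ki suffix.
word = Word(
    surface="evdeki",
    igs=[
        InflectionalGroup("ev", ["Noun", "Loc"]),
        InflectionalGroup("ki", ["Adj", "Rel"]),
    ],
)

# A dependency link then targets a specific IG of the head word, e.g.
# (dependent_word_idx, head_word_idx, head_ig_idx, relation) -- toy indices.
link = (0, 1, 0, "MODIFIER")
print(word.surface, [ig.features for ig in word.igs], link)
```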

    Chunking in Turkish with Conditional Random Fields

    In this paper, we report our work on chunking in Turkish. We used data that we generated by manually translating a subset of the Penn Treebank, and we exploited the tags already available in the trees to automatically identify and label chunks in the Turkish translations. We used conditional random fields (CRFs) to train a model over the annotated data and report results at different levels of chunk resolution.
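    A minimal sketch of CRF-based chunking in the spirit of the paper above, using the sklearn-crfsuite library; the toy sentence, the feature template, and the BIO chunk labels are illustrative assumptions rather than the paper's actual features or data.

```python
import sklearn_crfsuite

def token_features(sent, i):
    # Simple word- and POS-based features for the token at position i.
    word, pos = sent[i]
    feats = {"bias": 1.0, "word.lower": word.lower(), "pos": pos}
    if i > 0:
        feats["-1:pos"] = sent[i - 1][1]
    if i < len(sent) - 1:
        feats["+1:pos"] = sent[i + 1][1]
    return feats

# One toy training sentence: (token, POS) pairs with BIO chunk labels.
sent = [("Ali", "Noun"), ("eve", "Noun"), ("gitti", "Verb")]
labels = ["B-NP", "B-NP", "B-VP"]

X_train = [[token_features(sent, i) for i in range(len(sent))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```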

    Dependency parsing of Turkish

    The suitability of different parsing methods for different languages is an important topic in syntactic parsing. Lesser-studied languages, typologically different from those for which the methods were originally developed, pose especially interesting challenges in this respect. This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative free-constituent-order language that can be seen as representative of a wider class of typologically similar languages. Our investigations show that morphological structure plays an essential role in finding syntactic relations in such a language. In particular, we show that employing sublexical representations called inflectional groups, rather than word forms, as the basic parsing units improves parsing accuracy. We compare two parsing methods, one based on a probabilistic model with beam search and the other based on discriminative classifiers and a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless of parsing method. We examine the impact of morphological and lexical information in detail and show that, properly used, this kind of information can improve parsing accuracy substantially. Applying the techniques presented in this article, we achieve the highest reported accuracy for parsing the Turkish Treebank.
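    To make the "deterministic parsing strategy" concrete, a minimal shift-reduce style parsing loop driven by a classifier is sketched below; the oracle_classifier stub and the toy two-token clause stand in for a trained discriminative classifier and real treebank data, and are assumptions for illustration only.

```python
def oracle_classifier(stack, buffer):
    # A real system would featurize the configuration and query a trained
    # classifier; this toy rule shifts everything, then attaches each
    # remaining stack item to the word on its right (Turkish is largely head-final).
    if buffer:
        return "SHIFT"
    return "LEFT-ARC" if len(stack) > 1 else "DONE"

def parse(tokens):
    stack, buffer, arcs = [], list(tokens), []
    while True:
        action = oracle_classifier(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            dependent = stack.pop(-2)
            arcs.append((dependent, stack[-1]))  # (dependent, head)
        else:
            return arcs

print(parse(["kitabı", "okudu"]))  # toy clause: "(he) read the book"
```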

    Indonesian Named-entity Recognition for 15 Classes Using Ensemble Supervised Learning

    Here, we describe our effort in building an Indonesian Named Entity Recognition (NER) system for newspaper articles with 15 classes, a larger number of class types than in existing Indonesian NER systems. We employed supervised machine learning and conducted experiments to find the attribute combination and the algorithm yielding the highest accuracy, comparing attributes at the word, sentence, and document levels. On the algorithm side, we compared several single machine learning algorithms as well as an ensemble. Using 457 news articles, the best accuracy was achieved with an ensemble technique in which the outputs of several machine learning algorithms were used as features for one machine learning algorithm.
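    A minimal sketch of the ensemble idea described above (predictions of several classifiers fed as features to a final classifier), using scikit-learn's stacking implementation; the toy feature matrix, the three-class labels, and the choice of base learners are illustrative assumptions, not the paper's actual 15-class setup or attributes.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy per-token feature vectors and entity-class labels (0 = O, 1 = PERSON, 2 = ORG).
X = np.random.RandomState(0).rand(60, 5)
y = np.arange(60) % 3

base_learners = [
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
]
# The final estimator learns from the base learners' predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=200))
stack.fit(X, y)
print(stack.predict(X[:5]))
```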

    Explaining Russian-German code-mixing

    The study of grammatical variation in language mixing has been at the core of research into bilingual language practices. Although various motivations have been proposed in the literature to account for possible mixing patterns, some of them are either controversial or remain untested. Little is known about whether and how the frequency of use of linguistic elements can contribute to the patterning of bilingual talk. This book is the first to systematically explore usage frequency as a factor in a corpus of bilingual speech. Its two aims are (i) to describe and analyze the variation in mixing patterns in the speech of Russian-German adolescents and young adults in Germany, and (ii) to propose and test usage-based explanations of variation in mixing patterns in three morphosyntactic contexts: the adjective-modified noun phrase, the prepositional phrase, and the plural marking of German noun insertions in bilingual sentences. In these contexts, German noun insertions combine with either Russian or German words and grammatical markers, yielding mixed bilingual and German monolingual constituents in otherwise Russian sentences, the latter also labelled embedded-language islands. The results suggest that the frequency with which words are used together mediates the distribution of mixing patterns in each of the examined contexts. The differing impact of co-occurrence frequency is attributed to the distributional and semantic specifics of the analyzed morphosyntactic configurations. Lexical frequency is another important determinant of this variation. Other factors include recency, or lexical priming, in discourse in the case of prepositional phrases, and phonological and structural similarities and differences in the inflectional systems of the contact languages in the case of plural marking.
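    As a small illustration of the kind of co-occurrence frequency count that the usage-based explanations above rely on, the sketch below tallies how often adjacent word pairs occur together in a toy corpus; the corpus and the adjacency-based definition of co-occurrence are assumptions for illustration, not the book's actual data or measures.

```python
from collections import Counter

corpus = [
    "das kleine Haus steht dort",
    "das kleine Kind spielt dort",
    "ein kleines Haus steht hier",
]

pair_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    pair_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

# Pairs used together more often are expected to resist being split up in mixing.
print(pair_counts.most_common(3))
```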

    Theorizing L2 metalinguistic knowledge
