
    Bahasa Indonesia-English Code-Mixing in Writing Business Emails: Understanding the Communicative Purpose

    The study had two objectives: to investigate the use of code-mixing in business emails, which can be categorized as asynchronous computer-mediated communication, and to calculate the most prevalent occurrences of code-mixing in those emails. The data were gathered from the business emails exchanged between a Corporate Sales Officer at The British Institute Surabaya and its corporate clients from January to June 2017. A code-mixing typology served as the primary tool for identifying and classifying the code-mixing in the data. The analysis yielded a total of 209 code-mixing occurrences. Insertion was the most prevalent category of code-mixing with 115 occurrences (55.02%), followed by congruent lexicalization with 78 occurrences (37.32%) and alternation with 16 occurrences (7.66%). Furthermore, the results show that the occurrences of these code-mixing categories in the business emails may serve communicative purposes, such as emphasizing the main idea of an utterance or referring to specific lexical items tied to the knowledge of certain fields.
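The category shares reported above are simple proportions of a tally. A minimal sketch of that computation (the labels below reproduce the study's reported totals, 115 + 78 + 16 = 209):

```python
from collections import Counter

def category_shares(labels):
    """Return each category's count and its percentage of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 2)) for cat, n in counts.items()}

# Illustrative tallies matching the reported occurrence counts.
labels = (["insertion"] * 115
          + ["congruent lexicalization"] * 78
          + ["alternation"] * 16)
shares = category_shares(labels)
# insertion -> (115, 55.02), congruent lexicalization -> (78, 37.32),
# alternation -> (16, 7.66)
```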

    Complexity of Lexical Descriptions and its Relevance to Partial Parsing

    In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. Our thesis is that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. However, increasing the complexity of descriptions makes the number of different descriptions for each lexical item much larger and hence increases the local ambiguity for a parser. This local ambiguity can be resolved by using supertag co-occurrence statistics collected from parsed corpora. We have explored these ideas in the context of the Lexicalized Tree-Adjoining Grammar (LTAG) framework, wherein supertag disambiguation provides a representation that is an almost parse. We have used the disambiguated supertag sequence in conjunction with a lightweight dependency analyzer to compute noun groups, verb groups, dependency linkages, and even partial parses. We have shown that a trigram-based supertagger achieves an accuracy of 92.1% on Wall Street Journal (WSJ) texts. Furthermore, we have shown that the lightweight dependency analysis on the output of the supertagger identifies 83% of the dependency links accurately. We have exploited the representation of supertags with Explanation-Based Learning to improve parsing efficiency. In this approach, parsing in limited domains can be modeled as a finite-state transduction. We have implemented such a system for the ATIS domain which improves parsing efficiency by a factor of 15. We have used the supertagger in a variety of applications to provide lexical descriptions at an appropriate granularity. In an information retrieval application, we show that the supertag-based system performs at higher levels of precision compared to a system based on part-of-speech tags. In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application.
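The core idea of statistical supertag disambiguation can be sketched as sequence decoding: each word admits several supertags, and co-occurrence statistics from a parsed corpus select the most probable sequence. The toy lexicon, transition, and emission tables below are hypothetical, and for brevity this sketch uses bigram transitions rather than the trigrams used in the dissertation:

```python
import math

def viterbi_supertag(words, lexicon, emit, trans, start="<s>"):
    """Return the highest-scoring supertag sequence for `words`."""
    # best[tag] = (log-score, best path ending in tag) at the current word
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for tag in lexicon[w]:
            e = math.log(emit.get((w, tag), 1e-6))
            score, path = max(
                ((s + math.log(trans.get((prev, tag), 1e-6)) + e, p + [tag])
                 for prev, (s, p) in best.items()),
                key=lambda x: x[0])
            new_best[tag] = (score, path)
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

# Toy ambiguity: "can" is either an auxiliary or a noun; the statistics
# favor the noun reading after a determiner.
lexicon = {"the": ["Det"], "can": ["Aux", "Noun"], "rusts": ["Verb"]}
emit = {("the", "Det"): 1.0, ("can", "Noun"): 0.3,
        ("can", "Aux"): 0.7, ("rusts", "Verb"): 1.0}
trans = {("<s>", "Det"): 0.9, ("Det", "Noun"): 0.8, ("Det", "Aux"): 0.1,
         ("Noun", "Verb"): 0.7, ("Aux", "Verb"): 0.5}
tags = viterbi_supertag(["the", "can", "rusts"], lexicon, emit, trans)
# -> ["Det", "Noun", "Verb"]
```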

    CRPC-DB – A Discourse Bank for Portuguese


    Towards a Universal Wordnet by Learning from Combined Evidence

    Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
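The evidence-combination step can be sketched as scoring candidate word-to-synset links by the weighted sum of their independent supporting sources and keeping those above a threshold. The source names, weights, and example entries below are purely illustrative, not the paper's actual scoring functions:

```python
from collections import defaultdict

def score_links(evidence, weights, threshold=1.0):
    """evidence: iterable of (word, synset, source) triples.
    Returns the set of (word, synset) links whose combined score
    meets the threshold."""
    scores = defaultdict(float)
    for word, synset, source in evidence:
        scores[(word, synset)] += weights.get(source, 0.0)
    return {link for link, s in scores.items() if s >= threshold}

evidence = [
    ("Hund", "dog.n.01", "dictionary"),
    ("Hund", "dog.n.01", "parallel_corpus"),
    ("Hund", "hotdog.n.01", "parallel_corpus"),  # noisy alignment
]
weights = {"dictionary": 0.8, "parallel_corpus": 0.5, "wordnet": 1.0}
accepted = score_links(evidence, weights)
# Only the link supported by two independent sources passes the threshold.
```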

    Cross-Lingual and Low-Resource Sentiment Analysis

    Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. 
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
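The direct-transfer idea can be sketched with a toy model: a classifier is fit on source-language sentences and applied unchanged to the target language, relying on bilingual word vectors that place translations near each other in a shared space. The tiny vectors and nearest-centroid classifier below are hypothetical stand-ins for the actual model and embeddings:

```python
def sent_vec(tokens, vectors):
    """Average the word vectors of a sentence (shared bilingual space)."""
    vs = [vectors[t] for t in tokens if t in vectors]
    return [sum(dim) / len(vs) for dim in zip(*vs)]

def fit_centroids(examples, vectors):
    """Per-label mean sentence vector, trained on source-language data."""
    sums, counts = {}, {}
    for tokens, label in examples:
        v = sent_vec(tokens, vectors)
        acc = sums.setdefault(label, [0.0] * len(v))
        sums[label] = [a + b for a, b in zip(acc, v)]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in s] for lab, s in sums.items()}

def predict(tokens, centroids, vectors):
    v = sent_vec(tokens, vectors)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(v, centroids[lab]))

# Toy bilingual space: English/Spanish translation pairs lie close together.
vectors = {
    "good": [1.0, 0.1], "bueno": [0.9, 0.2],
    "bad": [-1.0, 0.0], "malo": [-0.9, 0.1],
    "movie": [0.0, 1.0], "pelicula": [0.1, 0.9],
}
centroids = fit_centroids(
    [(["good", "movie"], "pos"), (["bad", "movie"], "neg")], vectors)
# Trained on English only, applied directly to Spanish.
label = predict(["pelicula", "bueno"], centroids, vectors)
```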

    A Feature-Based Lexicalized Tree Adjoining Grammar for Korean

    This document describes an ongoing project of developing a grammar of Korean, the Korean XTAG grammar, written in the TAG formalism and implemented for use with the XTAG system enriched with a Korean morphological analyzer. The Korean XTAG grammar described in this report is based on the TAG formalism (Joshi et al. (1975)), which has been extended to include lexicalization (Schabes et al. (1988)) and unification-based feature structures (Vijay-Shanker and Joshi (1991)). The document first describes the modifications that we have made to the XTAG system (The XTAG-Group (1998)) to handle the rich inflectional morphology of Korean. Then various syntactic phenomena that can currently be handled are described, including adverb modification, relative clauses, complex noun phrases, auxiliary verb constructions, gerunds, and adjunct clauses. The work reported here is a first step towards the development of an implemented TAG grammar for Korean, which is continuously updated with the addition of new analyses and the modification of old ones.