    Efficient Parsing for Korean and English: A Parameterized Message Passing Approach

    This article presents an efficient, implemented approach to cross-linguistic parsing based on Government-Binding (GB) Theory (Chomsky, 1986) and subsequent work. One drawback of alternative GB-based parsing approaches is that they generally adopt a filter-based paradigm. These approaches typically generate all candidate structures of the sentence that satisfy X-bar theory, and then apply filters to eliminate those structures that violate GB principles. (See, for example, (Abney, 1989; Correa, 1991; Dorr, 1993; Fong, 1991).) The current approach provides an alternative to filter-based designs which avoids these difficulties by applying principles to description

    Development of Cross-Linguistic Syntactic and Semantic Parameters for Parsing and Generation

    This document reports on research conducted at the University of Maryland for the Korean/English Machine Translation (MT) project. The translation approach adopted here is interlingual, i.e., a single underlying representation called Lexical Conceptual Structure (LCS) is used for both Korean and English. The primary focus of this investigation concerns the notion of 'parameterization', i.e., a mechanism that accounts for both syntactic and lexical-semantic distinctions between Korean and English. We present our assumptions about the syntactic structure of Korean-type languages vs. English-type languages and describe our investigation of syntactic parameterization for distinguishing between these two types of languages. We also present the details of the LCS structure and describe how this representation is parameterized so that it accommodates both languages. We address critical issues concerning interlingual machine translation such as locative postpositions and the dividing line between the interlingua and the knowledge representation. Difficulties in translation and transliteration of Korean are discussed and complex morphological properties of Korean are presented. Finally, we describe recent work on lexical acquisition and conclude with a discussion of two hypotheses concerning semantic classification that are currently being tested. (Also cross-referenced as UMIACS-TR-94-26.)

    Research in the Language, Information and Computation Laboratory of the University of Pennsylvania

    This report takes its name from the Computational Linguistics Feedback Forum (CLiFF), an informal discussion group for students and faculty. However, the scope of the research covered in this report is broader than the title might suggest; this is the yearly report of the LINC Lab, the Language, Information and Computation Laboratory of the University of Pennsylvania. It may at first be hard to see the threads that bind together the work presented here, work by faculty, graduate students and postdocs in the Computer Science and Linguistics Departments, and the Institute for Research in Cognitive Science. It includes prototypical Natural Language fields such as Combinatory Categorial Grammars, Tree Adjoining Grammars, syntactic parsing and the syntax-semantics interface; but it extends to statistical methods, plan inference, instruction understanding, intonation, causal reasoning, free word order languages, geometric reasoning, medical informatics, connectionism, and language acquisition. Naturally, this introduction cannot spell out all the connections between these abstracts; we invite you to explore them on your own. In fact, with this issue it is easier than ever to do so: this document is accessible on the "information superhighway". Just call up http://www.cis.upenn.edu/~cliff-group/94/cliffnotes.html In addition, you can find many of the papers referenced in the CLiFF Notes on the net. Most can be obtained by following links from the authors' abstracts in the web version of this report. The abstracts describe the researchers' many areas of investigation, explain their shared concerns, and present some interesting work in Cognitive Science. We hope its new online format makes the CLiFF Notes a more useful and interesting guide to Computational Linguistics activity at Penn.

    Empirical studies on word representations

    One of the most fundamental tasks in natural language processing is representing words with mathematical objects (such as vectors). These word representations, which are most often estimated from data, capture the meaning of words. They enable comparing words according to their semantic similarity, and have been shown to work extremely well when included in complex real-world applications. A large part of our work deals with ways of estimating word representations directly from large quantities of text. Our methods exploit the idea that words which occur in similar contexts have a similar meaning. How we define the context is an important focus of our thesis. The context can consist of a number of words to the left and to the right of the word in question, but, as we show, obtaining context words via syntactic links (such as the link between a verb and its subject) often works better. We furthermore investigate word representations that accurately capture multiple meanings of a single word. We show that the translation of a word in context contains information that can be used to disambiguate the meaning of that word.
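    The distributional idea in the abstract — words occurring in similar contexts have similar meanings, with the context taken as a window of words to the left and right — can be sketched with a toy count-based model. The function names and the three-sentence corpus below are illustrative only, not from the thesis:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Build count-based context vectors: one Counter per word,
    counting neighbors within `window` positions on either side."""
    vectors = {}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            ctx = vectors.setdefault(w, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks fell on the news".split(),
]
vecs = cooccurrence_vectors(corpus)
# "cat" and "dog" share contexts ("the", "sat", "on"), so they come out
# more similar to each other than either is to "stocks".
assert cosine(vecs["cat"], vecs["dog"]) > cosine(vecs["cat"], vecs["stocks"])
```

    Replacing the linear window with contexts drawn from syntactic links, as the thesis proposes, would only change how the neighbor positions `j` are selected; the counting and similarity machinery stays the same.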

    UDapter: Typology-based Language Adapters for Multilingual Dependency Parsing and Sequence Labeling

    Recent advances in multilingual language modeling have brought the idea of a truly universal parser closer to reality. However, such models are still not immune to the "curse of multilinguality": cross-language interference and restrained model capacity remain major obstacles. To address this, we propose a novel language adaptation approach by introducing contextual language adapters to a multilingual parser. Contextual language adapters make it possible to learn adapters via language embeddings while sharing model parameters across languages based on contextual parameter generation. Moreover, our method allows for an easy but effective integration of existing linguistic typology features into the parsing model. Because not all typological features are available for every language, we further combine typological feature prediction with parsing in a multi-task model that achieves very competitive parsing performance without the need for an external prediction system for missing features. The resulting parser, UDapter, can be used for dependency parsing as well as sequence labeling tasks such as POS tagging, morphological tagging, and NER. In dependency parsing, it outperforms strong monolingual and multilingual baselines on the majority of both high-resource and low-resource (zero-shot) languages, showing the success of the proposed adaptation approach. In sequence labeling tasks, our parser surpasses the baselines on high-resource languages, and performs very competitively in a zero-shot setting. Our in-depth analyses show that adapter generation via typological features of languages is key to this success.
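    The contextual-parameter-generation idea in the abstract — a single shared generator that maps a language embedding to that language's adapter weights, so all languages share parameters through the generator — can be sketched in a few lines. The layer sizes, random initialization, and ReLU bottleneck below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: encoder hidden size, language-embedding size,
# and adapter bottleneck size (not UDapter's real configuration).
HIDDEN, LANG_DIM, BOTTLENECK = 8, 4, 2

# The shared generator: two matrices that map a language embedding to
# the down- and up-projection weights of that language's adapter.
W_gen_down = rng.normal(0, 0.1, (LANG_DIM, HIDDEN * BOTTLENECK))
W_gen_up = rng.normal(0, 0.1, (LANG_DIM, BOTTLENECK * HIDDEN))

def adapter_for(lang_emb):
    """Generate per-language adapter weights from a language embedding."""
    W_down = (lang_emb @ W_gen_down).reshape(HIDDEN, BOTTLENECK)
    W_up = (lang_emb @ W_gen_up).reshape(BOTTLENECK, HIDDEN)
    return W_down, W_up

def apply_adapter(h, lang_emb):
    """Bottleneck adapter with a residual connection around it."""
    W_down, W_up = adapter_for(lang_emb)
    return h + np.maximum(h @ W_down, 0.0) @ W_up  # ReLU bottleneck

h = rng.normal(size=HIDDEN)         # a token representation from the encoder
emb_en = rng.normal(size=LANG_DIM)  # stand-in "English" language embedding
emb_ko = rng.normal(size=LANG_DIM)  # stand-in "Korean" language embedding
out_en = apply_adapter(h, emb_en)
out_ko = apply_adapter(h, emb_ko)
# Different language embeddings yield different generated adapter weights,
# while the generator's parameters are shared across all languages.
```

    Because only the generator is trained, adding a language costs one embedding vector rather than a full set of adapter weights; this is what lets typological feature vectors stand in for (or initialize) the language embedding.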

    Chinese information access through internet on X-open system.

    by Yao Jian. Thesis (M.Phil.), Chinese University of Hong Kong, 1997. Includes bibliographical references. Contents:
    1 Introduction
    2 Basic Concepts and Related Work: Codeset and Codeset Conversion; HTML Language; HTTP Protocol; I18N and L10N; Proxy Server; Related Work
    3 Design Principles and System Architecture: Use of Existing Web System (Protocol; Avoid Duplication of Documents for Different Codesets; Support On-line Codeset Conversion Facility; Provide Internationalized Interface of Web Browser); Our Approach (Enhancing the Existing Browsers and Servers; Incorporating Proxies in Our Scheme; Automatic Codeset Conversion); Overall System Architecture (Architecture of Our Web System; Flexibility of Our Design; Which Side Does the Codeset Conversion?; Caching)
    4 Design Details of an Enhanced Server: Architecture of the Enhanced Server; Procedure for Processing a Client's Request; Modifications of the Enhanced Server (Interpretation of the Client's Codeset Announcement; Codeset Identification of Web Documents on the Server; Codeset Notification to the Web Client; Codeset Conversion); Experiment Results
    5 Design Details of an Enhanced Browser: Architecture of the Enhanced Browser; Procedure for Processing Users' Requests; Event Management and Handling (Basic Control Flow of the Browser; Event Handlers); Internationalization of the Browser Interface (Locale; Resource File; Message Catalog System); Experiment Results
    6 Another Scheme, CGI: Form and CGI; CGI Control Flow; Automatic Codeset Detection (Analysis of Code Range for GB and Big5; Control Flow of Automatic Codeset Detection); Experiment Results
    7 Conclusions and Future Work: Current Status; System Efficiency; Future Work
    Appendix A Programmer's Guide: Data Structure; Calling Sequence of Functions; Modification of Source Code; Modification of Resources
    Appendix B User Manual
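    The automatic codeset detection in chapter 6 rests on the fact that GB and Big5 occupy different byte ranges: GB2312 (EUC-CN) uses 0xA1-0xFE for both bytes of a two-byte code, while Big5 allows a trail byte in 0x40-0x7E, which is impossible in GB. A minimal range-based heuristic in that spirit might look as follows; this is my own simplification, not the thesis's exact algorithm:

```python
def detect_codeset(data: bytes) -> str:
    """Guess whether `data` is GB2312, Big5, or plain ASCII by byte ranges.
    A trail byte in 0x40-0x7E after a high lead byte can only be Big5;
    pairs entirely in 0xA1-0xFE are valid in both, counted as weak GB
    evidence here (a real detector would weight them statistically)."""
    big5_votes = gb_votes = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:          # plain ASCII byte, skip
            i += 1
            continue
        if i + 1 >= len(data):   # dangling lead byte at end of input
            break
        trail = data[i + 1]
        if 0x40 <= trail <= 0x7E:
            big5_votes += 1      # impossible in GB2312
        elif 0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE:
            gb_votes += 1        # valid in both encodings
        i += 2
    if big5_votes > 0:
        return "big5"
    return "gb2312" if gb_votes > 0 else "ascii"
```

    For example, the GB2312 encoding of 你好 is b"\xc4\xe3\xba\xc3" (all bytes in 0xA1-0xFE), while its Big5 encoding is b"\xa7\x41\xa6\x6e", whose 0x41 and 0x6E trail bytes immediately identify it as Big5.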

    Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

    Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. 
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared both to unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a large number of target languages, in the setting where no annotated training data is available in the target language.