157 research outputs found

    Toward a Principle-Based Translator

    A principle-based computational model of natural language translation consists of two components: (1) a module which makes use of a set of principles and parameters to transform the source language into an annotated surface form that can be easily converted into a "base" syntactic structure; and (2) a module which makes use of the same set of principles, but a different set of parameter values, to transform the "base" syntactic structure into the target language surface structure. This proposed scheme of language translation is an improvement over existing schemes since it is based on interactions between principles and parameters rather than on complex interactions between language-specific rules as found in older schemes. The background research for the problem includes an examination of existing schemes of computerized language translation and an analysis of their shortcomings. Construction of the proposed scheme requires a preliminary investigation of the common "universal" principles and parametric variations across different languages within the framework of current linguistic theory. The work to be done includes: construction of a module which uses linguistic principles and source language parameter values to parse and output the corresponding annotated surface structures of source language sentences; creation of procedures which handle the transformation of an annotated surface structure into a "base" syntactic structure; and development of a special purpose generation scheme which converts a "base" syntactic structure into a surface form in the target language.
    MIT Artificial Intelligence Laboratory

    LEXICALL: Lexicon Construction for Foreign Language Tutoring

    We focus on the problem of building large repositories of lexical conceptual structure (LCS) representations for verbs in multiple languages. One of the main results of this work is the definition of a relation between broad semantic classes and LCS meaning components. Our acquisition program, LEXICALL, takes, as input, the result of previous work on verb classification and thematic grid tagging, and outputs LCS representations for different languages. These representations have been ported into English, Arabic and Spanish lexicons, each containing approximately 9000 verbs. We are currently using these lexicons in operational foreign language tutoring and machine translation. (Also cross-referenced as UMIACS-TR-97-09)

    Development of Cross-Linguistic Syntactic and Semantic Parameters for Parsing and Generation

    This document reports on research conducted at the University of Maryland for the Korean/English Machine Translation (MT) project. The translation approach adopted here is interlingual, i.e., a single underlying representation called Lexical Conceptual Structure (LCS) is used for both Korean and English. The primary focus of this investigation concerns the notion of 'parameterization', i.e., a mechanism that accounts for both syntactic and lexical-semantic distinctions between Korean and English. We present our assumptions about the syntactic structure of Korean-type languages vs. English-type languages and describe our investigation of syntactic parameterization for distinguishing between these two types of languages. We also present the details of the LCS structure and describe how this representation is parameterized so that it accommodates both languages. We address critical issues concerning interlingual machine translation such as locative postpositions and the dividing line between the interlingua and the knowledge representation. Difficulties in translation and transliteration of Korean are discussed and complex morphological properties of Korean are presented. Finally, we describe recent work on lexical acquisition and conclude with a discussion about two hypotheses concerning semantic classification that are currently being tested. (Also cross-referenced as UMIACS-TR-94-26)

    Knowledge Graphs Effectiveness in Neural Machine Translation Improvement

    Maintaining semantic relations between words during the translation process yields more accurate target-language output from Neural Machine Translation (NMT). Although difficult to achieve from training data alone, it is possible to leverage Knowledge Graphs (KGs) to retain source-language semantic relations in the corresponding target-language translation. The core idea is to use KG entity relations as embedding constraints to improve the mapping from source to target. This paper describes two embedding constraints, both of which employ Entity Linking (EL), i.e., assigning a unique identity to entities, to associate words in training sentences with those in the KG: (1) a monolingual embedding constraint that supports an enhanced semantic representation of the source words through access to relations between entities in a KG; and (2) a bilingual embedding constraint that forces entity relations in the source language to be carried over to the corresponding entities in the target-language translation. The method is evaluated for English-Spanish translation exploiting Freebase as a source of knowledge. Our experimental results show that exploiting KG information not only decreases the number of unknown words in the translation but also improves translation quality.

    LonXplain: Lonesomeness as a Consequence of Mental Disturbance in Reddit Posts

    Social media is a potential source of information from which latent mental states can be inferred through Natural Language Processing (NLP). While narrating real-life experiences, social media users convey their feeling of loneliness or isolated lifestyle, impacting their mental well-being. Existing literature on psychological theories points to loneliness as the major consequence of interpersonal risk factors, propounding the need to investigate loneliness as a major aspect of mental disturbance. We formulate lonesomeness detection in social media posts as an explainable binary classification problem, discovering the users at risk and suggesting the need for resilience for early control. To the best of our knowledge, there is no existing explainable dataset, i.e., one with human-readable, annotated text spans, to facilitate further research and development in detecting loneliness that causes mental disturbance. In this work, three experts (a senior clinical psychologist, a rehabilitation counselor, and a social NLP researcher) define annotation schemes and perplexity guidelines to mark the presence or absence of lonesomeness, along with the marking of text spans in original posts as explanation, in 3,521 Reddit posts. We expect the public release of our dataset, LonXplain, and traditional classifiers as baselines via GitHub.

    Automatic Extraction of Semantic Classes from Syntactic Information in Online Resources

    This paper addresses the issue of word-sense ambiguity in extraction from machine-readable resources for the construction of large-scale knowledge sources. We describe two experiments: one which took word-sense distinctions into account, resulting in 97.9% accuracy for semantic classification of verbs based on (Levin, 1993); and one which ignored word-sense distinctions, resulting in 6.3% accuracy. These experiments were dual purpose: (1) to validate the central thesis of the work of (Levin, 1993), i.e., that verb semantics and syntactic behavior are predictably related; (2) to demonstrate that a 20-fold improvement can be achieved in deriving semantic information from syntactic cues if we first divide the syntactic cues into distinct groupings that correlate with different word senses. Finally, we show that we can provide effective acquisition techniques for novel word senses using a combination of online sources. (Also cross-referenced as UMIACS-TR-95-65)

    Bilingual Lexicon Construction Using Large Corpora

    This paper introduces a method for learning bilingual term and sentence level alignments for the purpose of building lexicons. Combining statistical techniques with linguistic knowledge, a general algorithm is developed for learning term and sentence alignments from large bilingual corpora with high accuracy. This is achieved through the use of filtered linguistic feedback between term and sentence alignment processes. An implementation of this algorithm, TAG-ALIGN, is evaluated against approaches similar to [Brown et al. 1993], which apply Bayesian techniques for term alignment, and [Gale and Church 1991], a dynamic programming method for aligning sentences. The ultimate goal is to produce large bilingual lexicons with a high degree of accuracy from potentially noisy corpora. (Also cross-referenced as UMIACS-TR-97-50)

    On automatic filtering of multilingual texts

    An emerging requirement to sift through the increasing flood of text information has led to the rapid development of information filtering technology in the past five years. This study introduces novel approaches for filtering texts regardless of their source language. We begin with a brief description of related developments in text filtering and multilingual information retrieval. We then present three alternative approaches to selecting texts from a multilingual information stream which represent a logical evolution from existing techniques in related disciplines. Finally, a practical automated performance evaluation technique is proposed.

    A Survey of Multilingual Text Retrieval

    This report reviews the present state of the art in selection of texts in one language based on queries in another, a problem we refer to as "multilingual" text retrieval. Present applications of multilingual text retrieval systems are limited by the cost and complexity of developing and using the multilingual thesauri on which they are based and by the level of user training that is required to achieve satisfactory search effectiveness. A general model for multilingual text retrieval is used to review the development of the field and to describe modern production and experimental systems. The report concludes with some observations on the present state of the art and an extensive bibliography of the technical literature on multilingual text retrieval. The research reported herein was supported, in part, by Army Research Office contract DAAL03-91-C-0034 through Battelle Corporation, NSF NYI IRI-9357731, Alfred P. Sloan Research Fellow Award BR3336, and a General Research Board Semester Award. (Also cross-referenced as UMIACS-TR-96-19)