Toward a Principle-Based Translator
A principle-based computational model of natural language translation consists of two components: (1) a module which makes use of a set of principles and parameters to transform the source language into an annotated surface form that can be easily converted into a "base" syntactic structure; and (2) a module which makes use of the same set of principles, but a different set of parameter values, to transform the "base" syntactic structure into the target language surface structure. This proposed scheme of language translation is an improvement over existing schemes since it is based on interactions between principles and parameters rather than on complex interactions between language-specific rules as found in older schemes.
The background for research of the problem includes: an examination of existing schemes of computerized language translation and an analysis of their shortcomings. Construction of the proposed scheme requires a preliminary investigation of the common "universal" principles and parametric variations across different languages within the framework of current linguistic theory.
The work to be done includes: construction of a module which uses linguistic principles and source language parameter values to parse and output the corresponding annotated surface structures of source language sentences; creation of procedures which handle the transformation of an annotated surface structure into a "base" syntactic structure; and development of a special-purpose generation scheme which converts a "base" syntactic structure into a surface form in the target language.
MIT Artificial Intelligence Laboratory
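The two-module design described in this abstract can be illustrated with a toy sketch. It assumes a single hypothetical parameter, `head_initial`, and a "base" structure that is only a chain of heads and complements; the names `Params`, `Phrase`, `to_base`, and `to_surface` are illustrative and are not taken from the report. The sketch reorders words only and does no lexical translation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Params:
    # Hypothetical parameter: does a head precede its complement?
    head_initial: bool


@dataclass(frozen=True)
class Phrase:
    # Toy "base" structure: a head plus an optional nested complement.
    head: str
    complement: Optional["Phrase"] = None


def to_base(tokens: List[str], params: Params) -> Optional[Phrase]:
    """Module 1 (toy): surface tokens -> order-neutral base structure."""
    ordered = list(tokens) if params.head_initial else list(reversed(tokens))
    base = None
    for word in reversed(ordered):  # build from the innermost complement outward
        base = Phrase(head=word, complement=base)
    return base


def to_surface(base: Optional[Phrase], params: Params) -> List[str]:
    """Module 2 (toy): base structure -> surface order under other parameter values."""
    words = []
    node = base
    while node is not None:
        words.append(node.head)
        node = node.complement
    return words if params.head_initial else list(reversed(words))


def translate_order(tokens: List[str], source: Params, target: Params) -> List[str]:
    # Same procedures on both sides; only the parameter values differ.
    return to_surface(to_base(tokens, source), target)
```

For example, `translate_order(["read", "the-book"], Params(True), Params(False))` places the complement before the head. The point of the sketch is that both directions share one set of procedures and differ only in parameter values.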
LEXICALL: Lexicon Construction for Foreign Language Tutoring
We focus on the problem of building large repositories of lexical
conceptual structure (LCS) representations for verbs in multiple
languages. One of the main results of this work is the definition of a
relation between broad semantic classes and LCS meaning components.
Our acquisition program---LEXICALL---takes, as input, the result of
previous work on verb classification and thematic grid tagging, and
outputs LCS representations for different languages. These
representations have been ported into English, Arabic and Spanish
lexicons, each containing approximately 9000 verbs. We are currently
using these lexicons in operational foreign language tutoring and
machine translation.
(Also cross-referenced as UMIACS-TR-97-09)
Development of Cross-Linguistic Syntactic and Semantic Parameters for Parsing and Generation
This document reports on research conducted at the University of
Maryland for the Korean/English Machine Translation (MT) project. The
translation approach adopted here is interlingual, i.e., a single
underlying representation called Lexical Conceptual Structure (LCS) is
used for both Korean and English.
The primary focus of this investigation concerns the notion of
`parameterization', i.e., a mechanism that accounts for both syntactic
and lexical-semantic distinctions between Korean and English. We
present our assumptions about the syntactic structure of Korean-type
languages vs. English-type languages and describe our investigation of
syntactic parameterization for distinguishing between these two types
of languages. We also present the details of the LCS structure and
describe how this representation is parameterized so that it
accommodates both languages.
We address critical issues concerning interlingual machine translation
such as locative postpositions and the dividing line between the
interlingua and the knowledge representation. Difficulties in
translation and transliteration of Korean are discussed and complex
morphological properties of Korean are presented. Finally, we
describe recent work on lexical acquisition and conclude with a
discussion about two hypotheses concerning semantic classification
that are currently being tested.
(Also cross-referenced as UMIACS-TR-94-26)
Knowledge Graphs Effectiveness in Neural Machine Translation Improvement
Maintaining semantic relations between words during the translation process yields more accurate target-language output from Neural Machine Translation (NMT). Although difficult to achieve from training data alone, it is possible to leverage Knowledge Graphs (KGs) to retain source-language semantic relations in the corresponding target-language translation. The core idea is to use KG entity relations as embedding constraints to improve the mapping from source to target. This paper describes two embedding constraints, both of which employ Entity Linking (EL), i.e., assigning a unique identity to entities, to associate words in training sentences with those in the KG: (1) a monolingual embedding constraint that supports an enhanced semantic representation of the source words through access to relations between entities in a KG; and (2) a bilingual embedding constraint that forces entity relations in the source language to be carried over to the corresponding entities in the target-language translation. The method is evaluated for English-Spanish translation, exploiting Freebase as a source of knowledge. Our experimental results show that exploiting KG information not only decreases the number of unknown words in the translation but also improves translation quality.
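One plausible reading of the bilingual embedding constraint is a penalty that grows when the offset between two related entities in the source embedding space differs from the offset between their linked target-language entities. The sketch below is an assumption for illustration only, not the paper's formulation: the function name, the squared-difference form, and the premise that both spaces share one dimensionality are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def bilingual_constraint_penalty(
    src_emb: Dict[str, List[float]],
    tgt_emb: Dict[str, List[float]],
    entity_links: Dict[str, str],
    kg_relations: Iterable[Tuple[str, str]],
) -> float:
    """Sum of squared differences between source and target relation offsets.

    entity_links maps a source word to its linked target word (assumed to be
    the output of an entity-linking step); kg_relations lists (head, tail)
    source-word pairs that the knowledge graph marks as related.
    """
    penalty = 0.0
    for head, tail in kg_relations:
        if head not in entity_links or tail not in entity_links:
            continue  # no linked counterpart, so nothing to constrain
        src_h, src_t = src_emb[head], src_emb[tail]
        tgt_h, tgt_t = tgt_emb[entity_links[head]], tgt_emb[entity_links[tail]]
        for a, b, c, d in zip(src_h, src_t, tgt_h, tgt_t):
            diff = (a - b) - (c - d)  # mismatch between the two offsets
            penalty += diff * diff
    return penalty
```

In a training setup, a term of this kind would typically be added to the translation loss with a weighting coefficient; how the paper combines its constraints with the NMT objective is not stated in the abstract.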
LonXplain: Lonesomeness as a Consequence of Mental Disturbance in Reddit Posts
Social media is a potential source of information that infers latent mental
states through Natural Language Processing (NLP). While narrating real-life
experiences, social media users convey their feeling of loneliness or isolated
lifestyle, impacting their mental well-being. Existing literature on
psychological theories points to loneliness as the major consequence of
interpersonal risk factors, propounding the need to investigate loneliness as a
major aspect of mental disturbance. We formulate lonesomeness detection in
social media posts as an explainable binary classification problem, discovering
the users at-risk, suggesting the need of resilience for early control. To the
best of our knowledge, there is no existing explainable dataset, i.e., one with
human-readable, annotated text spans, to facilitate further research and
development in loneliness detection causing mental disturbance. In this work,
three experts: a senior clinical psychologist, a rehabilitation counselor, and
a social NLP researcher define annotation schemes and perplexity guidelines to
mark the presence or absence of lonesomeness, along with the marking of
text-spans in original posts as explanation, in 3,521 Reddit posts. We expect
the public release of our dataset, LonXplain, and traditional classifiers as
baselines via GitHub.
Automatic Extraction of Semantic Classes from Syntactic Information in Online Resources
This paper addresses the issue of word-sense ambiguity in extraction
from machine-readable resources for the construction of large-scale
knowledge sources. We describe two experiments: one which took
word-sense distinctions into account, resulting in 97.9% accuracy for
semantic classification of verbs based on (Levin, 1993); and one which
ignored word-sense distinctions, resulting in 6.3% accuracy. These
experiments were dual purpose: (1) to validate the central thesis of
the work of (Levin, 1993), i.e., that verb semantics and syntactic
behavior are predictably related; (2) to demonstrate that a 20-fold
improvement can be achieved in deriving semantic information from
syntactic cues if we first divide the syntactic cues into distinct
groupings that correlate with different word senses. Finally, we show
that we can provide effective acquisition techniques for novel word
senses using a combination of online sources.
(Also cross-referenced as UMIACS-TR-95-65)
Bilingual Lexicon Construction Using Large Corpora
This paper introduces a method for learning bilingual term and sentence
level alignments for the purpose of building lexicons. Combining
statistical techniques with linguistic knowledge, a general algorithm
is developed for learning term and sentence alignments from large
bilingual corpora with high accuracy. This is achieved through the
use of filtered linguistic feedback between term and sentence alignment
processes. An implementation of this algorithm, TAG-ALIGN, is evaluated
against approaches similar to [Brown et al. 1993] that apply Bayesian
techniques for term alignment, and [Gale and Church 1991] a dynamic
programming method for aligning sentences. The ultimate goal is to
produce large bilingual lexicons with a high degree of accuracy from
potentially noisy corpora.
(Also cross-referenced as UMIACS-TR-97-50)
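The dynamic-programming sentence alignment mentioned in this abstract can be sketched in simplified form. The code below is not TAG-ALIGN and not the Gale and Church model: it swaps their probabilistic length cost for a plain absolute length difference plus fixed penalties per alignment type, and the penalty values in `BEADS` are arbitrary assumptions chosen for illustration.

```python
from typing import List, Tuple

# Allowed alignment types (source count, target count) with assumed penalties.
BEADS = {(1, 1): 0.0, (1, 0): 10.0, (0, 1): 10.0, (2, 1): 3.0, (1, 2): 3.0}


def align_sentences(
    src_lens: List[int], tgt_lens: List[int]
) -> List[Tuple[List[int], List[int]]]:
    """Align sentences by length; returns (source indices, target indices) pairs."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Cost: length mismatch of the grouped sentences plus type penalty.
                mismatch = abs(sum(src_lens[i:ni]) - sum(tgt_lens[j:nj]))
                cost = best[i][j] + penalty + mismatch
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the alignment.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

With source lengths `[20, 15, 30]` and target lengths `[21, 44]`, the cheapest path pairs the first sentences one-to-one and merges the second and third source sentences against the second target sentence.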
On automatic filtering of multilingual texts
An emerging requirement to sift through the increasing flood of text information has led to the rapid development of information filtering technology in the past five years. This study introduces novel approaches for filtering texts regardless of their source language. We begin with a brief description of related developments in text filtering and multilingual information retrieval. We then present three alternative approaches to selecting texts from a multilingual information stream which represent a logical evolution from existing techniques in related disciplines. Finally, a practical automated performance evaluation technique is proposed.
A Survey of Multilingual Text Retrieval
This report reviews the present state of the art
in selection of texts in one language based on queries in another, a
problem we refer to as ``multilingual'' text retrieval. Present
applications of multilingual text retrieval systems are limited by the
cost and complexity of developing and using the multilingual thesauri
on which they are based and by the level of user training that is
required to achieve satisfactory search effectiveness. A general
model for multilingual text retrieval is used to review the
development of the field and to describe modern production and
experimental systems. The report concludes with some observations on
the present state of the art and an extensive bibliography of the
technical literature on multilingual text retrieval. The research
reported herein was supported, in part, by Army Research
Office contract DAAL03-91-C-0034 through Battelle Corporation, NSF NYI
IRI-9357731, Alfred P. Sloan Research Fellow Award BR3336, and a
General Research Board Semester Award.
(Also cross-referenced as UMIACS-TR-96-19)