Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques
Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair.
This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system taking advantage of available resources such as parallel corpora that are used to extract dictionaries and lexical and structural transfer rules.
The final system is freely available online and open source. Although performance lags behind standard SMT systems for an in-domain test set, the results show that the RBMT's coverage is competitive and it outperforms the SMT system in an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems.
Peer reviewed. Postprint (author's final draft)
Bootstrapping Lexical Choice via Multiple-Sequence Alignment
An important component of any generation system is the mapping dictionary, a
lexicon of elementary semantic expressions and corresponding natural language
realizations. Typically, labor-intensive knowledge-based methods are used to
construct the dictionary. We instead propose to acquire it automatically via a
novel multiple-pass algorithm employing multiple-sequence alignment, a
technique commonly used in bioinformatics. Crucially, our method leverages
latent information contained in multi-parallel corpora -- datasets that supply
several verbalizations of the corresponding semantics rather than just one.
We used our techniques to generate natural language versions of
computer-generated mathematical proofs, with good results on both a
per-component and overall-output basis. For example, in evaluations involving a
dozen human judges, our system produced output whose readability and
faithfulness to the semantic input rivaled that of a traditional generation
system.
Comment: 8 pages; to appear in the proceedings of EMNLP-200
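The alignment idea in this abstract can be illustrated with a much simpler pairwise case. The sketch below is not the authors' multiple-pass algorithm and does not perform true multiple-sequence alignment; it only aligns two verbalizations word by word using Python's standard `difflib`, so that shared wording and interchangeable phrasings become visible. The function name `align_pair` and the example sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align_pair(first, second):
    """Align two verbalizations at the word level.

    Returns (shared, variants): `shared` lists the word runs present in
    both sentences, `variants` lists pairs of differing spans that occupy
    the same position in the alignment.
    """
    a_tokens, b_tokens = first.split(), second.split()
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    shared, variants = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            shared.append(" ".join(a_tokens[i1:i2]))
        else:
            # "replace", "delete" and "insert" all mark a slot where the
            # two verbalizations disagree.
            variants.append((" ".join(a_tokens[i1:i2]),
                             " ".join(b_tokens[j1:j2])))
    return shared, variants


# Two hypothetical verbalizations of the same proof step.
shared, variants = align_pair("assume that x is even",
                              "suppose that x is even")
print(shared)
print(variants)
```

In this toy case the differing slot pairs "assume" with "suppose", which is the kind of candidate lexical choice a mapping dictionary would record. Aligning more than two verbalizations at once requires a real multiple-sequence alignment procedure, which this sketch does not attempt.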
MoNoise: Modeling Noise Using a Modular Normalization System
We propose MoNoise, a normalization model focused on generalizability and
efficiency that aims to be easily reusable and adaptable. Normalization is
the task of translating texts from a non-canonical domain to a more canonical
domain, in our case from social media data to standard language. Our proposed
model is based on a modular candidate generation in which each module is
responsible for a different type of normalization action. The most important
generation modules are a spelling correction system and a word embeddings
module. Depending on the definition of the normalization task, a static lookup
list can be crucial for performance. We train a random forest classifier to
rank the candidates, which generalizes well to all different types of
normalization actions. Most features for the ranking originate from the
generation modules; besides these features, N-gram features prove to be an
important source of information. We show that MoNoise beats the
state-of-the-art on different normalization benchmarks for English and Dutch,
which all define the task of normalization slightly differently.
Comment: Source code: https://bitbucket.org/robvanderg/monois
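The modular design described in this abstract can be sketched roughly as follows. This is not the MoNoise code: the vocabulary, the lookup list, and the module weights are hypothetical, the word embeddings module is left out, and a hand-set weighted score stands in for the trained random forest ranker. The point is only the shape of the pipeline, in which each module proposes candidates independently and a ranker then picks one per token.

```python
import difflib

# Hypothetical resources, for illustration only.
VOCAB = ["tomorrow", "you", "see", "great", "to", "too"]
LOOKUP = {"u": "you", "2morrow": "tomorrow", "gr8": "great"}


def lookup_module(token):
    """Static lookup list: known replacement pairs."""
    return [LOOKUP[token]] if token in LOOKUP else []


def spelling_module(token):
    """Crude spelling correction: vocabulary words with a similar spelling."""
    return difflib.get_close_matches(token, VOCAB, n=3, cutoff=0.7)


def identity_module(token):
    """Keeping the token unchanged is always a candidate."""
    return [token]


MODULES = {
    "lookup": lookup_module,
    "spelling": spelling_module,
    "identity": identity_module,
}

# Hand-set weights standing in for a trained ranker (an assumption).
WEIGHTS = {"lookup": 3.0, "spelling": 2.0, "identity": 1.0}


def generate_candidates(token):
    """Map each candidate to the set of modules that proposed it."""
    candidates = {}
    for name, module in MODULES.items():
        for candidate in module(token):
            candidates.setdefault(candidate, set()).add(name)
    return candidates


def score(candidate, sources):
    """Sum the weights of the proposing modules, plus an in-vocabulary bonus."""
    total = sum(WEIGHTS[name] for name in sources)
    if candidate in VOCAB:
        total += 1.0
    return total


def normalize(sentence):
    """Normalize each whitespace-separated token independently."""
    output = []
    for token in sentence.split():
        candidates = generate_candidates(token)
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda c: score(c, candidates[c]))
        output.append(best)
    return " ".join(output)


print(normalize("see u 2morrow"))
```

Because the modules are independent functions registered in one dictionary, adding or removing a type of normalization action only touches `MODULES` and `WEIGHTS`, which is the reusability property the abstract emphasizes.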
Statistical Inferences for Polarity Identification in Natural Language
Information forms the basis for all human behavior, including the ubiquitous
decision-making that people constantly perform in their everyday lives. It is
thus the mission of researchers to understand how humans process information to
reach decisions. In order to facilitate this task, this work proposes a novel
method of studying the reception of granular expressions in natural language.
The approach utilizes LASSO regularization as a statistical tool to extract
decisive words from textual content and draw statistical inferences based on
the correspondence between the occurrences of words and an exogenous response
variable. Accordingly, the method immediately suggests significant implications
for social sciences and Information Systems research: everyone can now identify
text segments and word choices that are statistically relevant to authors or
readers and, based on this knowledge, test hypotheses from behavioral research.
We demonstrate the contribution of our method by examining how authors
communicate subjective information through narrative materials. This allows us
to answer the question of which words to choose when communicating negative
information. We also show that investors trade not only upon
facts in financial disclosures but are also distracted by filler words and
non-informative language. Practitioners, for example those in the fields of
investor communications or marketing, can exploit our insights to enhance
their writings based on the true perception of word choice.
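To make the word-selection step concrete, here is a minimal sketch of LASSO regression on bag-of-words counts, solved by cyclic coordinate descent in plain Python. It is not the authors' implementation: the vocabulary, the six toy documents, the response values, and the penalty of 0.1 are all made up for illustration, and the sketch performs selection only, without the statistical inference step the abstract mentions.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within +/- penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso(X, y, penalty, sweeps=500):
    """Cyclic coordinate descent for the objective
    (1 / 2n) * sum((y - b0 - X w)^2) + penalty * sum(|w|).
    Returns (intercept, weights)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b0 = sum(y) / n
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared length of feature j
            for i in range(n):
                partial = b0 + sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (norm / n)
        # Unpenalized intercept: mean of the current residuals.
        b0 = sum(y[i] - sum(X[i][k] * w[k] for k in range(p))
                 for i in range(n)) / n
    return b0, w


# Toy data (hypothetical): word counts per document and an exogenous
# response, e.g. a stock return following each disclosure.
WORDS = ["loss", "growth", "the"]
X = [
    [2, 0, 1],
    [0, 1, 1],
    [0, 0, 2],
    [1, 1, 1],
    [0, 2, 0],
    [1, 0, 0],
]
y = [-2.0, 1.0, 0.0, 0.0, 2.0, -1.0]

intercept, weights = lasso(X, y, penalty=0.1)
selected = {word: round(weight, 3)
            for word, weight in zip(WORDS, weights)
            if abs(weight) > 1e-9}
print(selected)
```

On this toy data the penalty drives the coefficient of the filler word "the" to exactly zero, while "loss" keeps a negative coefficient and "growth" a positive one. That is the sense in which the L1 penalty extracts decisive words: terms whose occurrences do not help explain the response are dropped from the model.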
Take-ings
The word "property" had many meanings in 1789, as it does today, and a critical aspect of the ongoing debate about the meaning of the Fifth Amendment's Takings Clause has centered on how the word should be read in the context of the Clause. "Property" has been read by Professor Thomas Merrill to refer to ownership interests, by Richard Epstein in terms of a broad Blackstonian conception of the individual control of the possession, use, and disposition of resources, by Benjamin Barros as reflective of constructions through individual expectations and state law, and by the author as physical control of material possessions.
As a textual matter, however, the Takings Clause is not simply concerned with governmental actions that affect property. The Clause provides that "private property [shall not] be taken for public use without just compensation." It is thus concerned with property "taken" for public use, and the word "taken" is the key, at least for a textualist, to understanding both which types of governmental actions fall within the ambit of the Clause and what types of property the Clause protects. The centrality of the concept of takings to the Clause's meaning is reflected by the name by which the Clause is known. It is the Takings Clause, not the Property Clause. Although it has, ironically, not figured prominently in takings scholarship, the word "taken" is of fundamental importance to the Clause's meaning. In this essay, the author explores that importance from a textualist perspective and argues that a textualist will reject the doctrine of regulatory takings.