Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques

Author: Centelles Jordi
Ruiz Costa-Jussà Marta
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese-Spanish language pair. This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system that takes advantage of available resources, such as parallel corpora, which are used to extract dictionaries and lexical and structural transfer rules. The final system is open source and freely available online. Although its performance lags behind standard SMT systems on an in-domain test set, the results show that the RBMT system's coverage is competitive and that it outperforms the SMT system on an out-of-domain test set. This RBMT system is available to the general public, can be further enhanced, and opens up the possibility of creating future hybrid MT systems. Peer reviewed. Postprint (author's final draft).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Bootstrapping Lexical Choice via Multiple-Sequence Alignment

Author: Barzilay Regina
Lee Lillian
Publication date: 01/01/2002

An important component of any generation system is the mapping dictionary, a lexicon of elementary semantic expressions and corresponding natural language realizations. Typically, labor-intensive knowledge-based methods are used to construct the dictionary. We instead propose to acquire it automatically via a novel multiple-pass algorithm employing multiple-sequence alignment, a technique commonly used in bioinformatics. Crucially, our method leverages latent information contained in multi-parallel corpora, datasets that supply several verbalizations of the corresponding semantics rather than just one. We used our techniques to generate natural language versions of computer-generated mathematical proofs, with good results on both a per-component and overall-output basis. For example, in evaluations involving a dozen human judges, our system produced output whose readability and faithfulness to the semantic input rivaled that of a traditional generation system. Comment: 8 pages; to appear in the proceedings of EMNLP-200

arXiv.org e-Print Archive

CiteSeerX

Columbia University Academic Commons

MoNoise: Modeling Noise Using a Modular Normalization System

Author: van der Goot Rob
van Noord Gertjan
Publication date: 01/01/2017

We propose MoNoise, a normalization model focused on generalizability and efficiency that aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain, in our case from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state of the art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently. Comment: Source code: https://bitbucket.org/robvanderg/monois

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Statistical Inferences for Polarity Identification in Natural Language

Author: Feuerriegel Stefan
Neumann Dirk
Pröllochs Nicolas
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2018

Information forms the basis for all human behavior, including the ubiquitous decision-making that people constantly perform in their everyday lives. It is thus the mission of researchers to understand how humans process information to reach decisions. To facilitate this task, this work proposes a novel method of studying the reception of granular expressions in natural language. The approach uses LASSO regularization as a statistical tool to extract decisive words from textual content and to draw statistical inferences based on the correspondence between the occurrences of words and an exogenous response variable. Accordingly, the method immediately suggests significant implications for social sciences and Information Systems research: everyone can now identify text segments and word choices that are statistically relevant to authors or readers and, based on this knowledge, test hypotheses from behavioral research. We demonstrate the contribution of our method by examining how authors communicate subjective information through narrative materials. This allows us to answer the question of which words to choose when communicating negative information. We also show that investors trade not only upon facts in financial disclosures but are also distracted by filler words and non-informative language. Practitioners, for example those in the fields of investor communications or marketing, can exploit our insights to enhance their writings based on the true perception of word choice.
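The word-selection idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the texts, the response values, the penalty `alpha`, and every function name below are hypothetical, and the solver is a plain cyclic coordinate-descent LASSO written only for illustration. It regresses a toy response on word counts and keeps the words whose coefficients are not shrunk to zero.

```python
from collections import Counter


def bag_of_words(texts):
    """Turn whitespace-tokenised texts into a sorted vocabulary and a count matrix."""
    vocab = sorted({word for text in texts for word in text.split()})
    rows = []
    for text in texts:
        counts = Counter(text.split())
        rows.append([float(counts[word]) for word in vocab])
    return vocab, rows


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is what produces exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, alpha, n_sweeps=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm_sq = sum(v * v for v in col)
            if norm_sq == 0.0:
                continue
            # Correlation of column j with the residual that ignores word j.
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n))
            new_w = soft_threshold(rho / n, alpha) / (norm_sq / n)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                w[j] = new_w
    return w


# Hypothetical toy data: the response rises with "growth" and falls with "loss".
texts = [
    "the loss",
    "the growth",
    "the the loss loss",
    "growth growth",
    "the",
    "the growth loss",
]
response = [-1.0, 1.0, -2.0, 2.0, 0.0, 0.0]

vocab, X = bag_of_words(texts)
weights = lasso(X, response, alpha=0.1)
selected = {word: round(w, 3) for word, w in zip(vocab, weights) if w != 0.0}
print(selected)
```

On this toy data the penalty keeps "growth" and "loss" with shrunken coefficients of opposite sign and drops the filler word "the". A real analysis would typically add an intercept, standardise the features, and choose `alpha` by cross-validation.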

arXiv.org e-Print Archive

Repository for Publications and Research Data

Directory of Open Access Journals

FigShare

Take-ings

Author: Treanor William Michael
Publication venue: Scholarship @ GEORGETOWN LAW
Publication date: 01/01/2008

The word "property" had many meanings in 1789, as it does today, and a critical aspect of the ongoing debate about the meaning of the Fifth Amendment's Takings Clause has centered on how the word should be read in the context of the Clause. "Property" has been read by Professor Thomas Merrill to refer to ownership interests, by Richard Epstein in terms of a broad Blackstonian conception of the individual control of the possession, use, and disposition of resources, by Benjamin Barros as reflective of constructions through individual expectations and state law, and by the author as physical control of material possessions. As a textual matter, however, the Takings Clause is not simply concerned with governmental actions that affect property. The Clause provides that "private property [shall not] be taken for public use without just compensation." It is thus concerned with property "taken" for public use, and the word "taken" is the key, at least for a textualist, to understanding both which types of governmental actions fall within the ambit of the Clause and what types of property the Clause protects. The centrality of the concept of takings to the Clause's meaning is reflected by the name by which the Clause is known. It is the Takings Clause, not the Property Clause. Although it has, ironically, not figured prominently in takings scholarship, the word "taken" is of fundamental importance to the Clause's meaning. In this essay, the author explores the importance from a textualist perspective and argues that a textualist will reject the doctrine of regulatory takings.

bepress Legal Repository

Georgetown Law Scholarly Commons

University of San Diego