
5,062 research outputs found

BIKE: Bilingual Keyphrase Experiments

Author: Barrière Caroline
Foster George
Nadeau David
Publication date: 01/01/2005

This paper presents a novel strategy for translating lists of keyphrases. Typical keyphrase lists appear in scientific articles, information retrieval systems and web page meta-data. Our system combines a statistical translation model trained on a bilingual corpus of scientific papers with sense-focused look-up in a large bilingual terminological resource. For the latter, we developed a novel technique that benefits from viewing the keyphrase list as contextual help for sense disambiguation. The optimal combination of modules was discovered by a genetic algorithm. Our work applies to the French/English language pair.

NRC Publications Archive

CogPrints Cognitive Sciences Eprint Archive

UGENT-LT3 SCATE system for machine translation quality estimation

Author: Desmet Bart
Hoste Veronique
Macken Lieve
Tezcan Arda
Publication date: 01/01/2015

This paper describes the submission of the UGENT-LT3 SCATE system to the WMT15 Shared Task on Quality Estimation (QE), viz. English-Spanish word- and sentence-level QE. We conceived QE as a supervised Machine Learning (ML) problem and designed additional features, which we combined with the baseline feature set to estimate quality. The sentence-level QE system re-uses the predictions of the word-level QE system. We experimented with different learning methods and observed improvements over the baseline system for word-level QE with the use of the new features and by combining learning methods into ensembles. For sentence-level QE, we show that using a single feature based on word-level predictions can perform better than the baseline system, and that using this in combination with additional features led to further improvements in performance.

Crossref

Ghent University Academic Bibliography

Archivsystem Ask23

Sentence Alignment using MR and GA

Author: Abdel Fattah Mohamed
Publication venue: The International Institute for Science, Technology and Education (IISTE)
Publication date: 28/07/2011

In this paper, two new approaches to aligning English-Arabic sentences in bilingual parallel corpora are presented, based on mathematical regression (MR) and genetic algorithm (GA) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was used to train the mathematical regression and genetic algorithm models. Another set of data was used for testing. The MR and GA results outperform those of the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain more, fewer, or different features than those used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.

International Institute for Science, Technology and Education (IISTE): E-Journals

Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance

Author: Asgari Ehsaneddin
Mofrad Mohammad R. K.
Publication date: 28/04/2016

We introduce a new measure of distance between languages based on word embedding, called word embedding language divergence (WELD). WELD is defined as the divergence between the unified similarity distributions of words between languages. Using this measure, we perform language comparison for fifty natural languages and twelve genetic languages. Our natural language dataset is a collection of sentence-aligned parallel corpora from Bible translations for fifty languages spanning a variety of language families. Although we use parallel corpora, which guarantees the same content in all languages, in many cases languages within the same family cluster together. In addition to natural languages, we perform language comparison for the coding regions in the genomes of 12 different organisms (4 plants, 6 animals, and two human subjects). Our result confirms a significant high-level difference in the genetic language model of humans/animals versus plants. The proposed method is a step toward defining a quantitative measure of similarity between languages, with applications in language classification, genre identification, dialect identification, and evaluation of translations.

arXiv.org e-Print Archive

eScholarship - University of California

Adapting a relation extraction pipeline for the BioCreAtIvE II task

Author: Grover Claire
Haddow Barry
Klein Ewan
Matthews Michael
Nielsen Leif Arda
Tobin Richard
Wang Xinglong
Publication date: 01/01/2007

Edinburgh Research Explorer

A competence-performance based model to develop a syntactic language for artificial agents

Author: Aler Ricardo
Mingo Jack Mario
Publication venue: Elsevier BV
Publication date: 10/12/2016

The hypothesis of language use is an attractive theory for explaining how natural languages evolve and develop in social populations. In this paper we present a model partially based on the idea of language games, so that a group of artificial agents is able to produce and share a symbolic language with syntactic structure. Grammatical structure is induced by grammatical evolution of stochastic regular grammars with learning capabilities, while language development is refined by means of language games where the agents apply on-line probabilistic reinforcement learning. Within this framework, the model adapts the concepts of competence and performance in language, as they have been proposed in some linguistic theories. The first experiments in this article have been organized around the linguistic description of visual scenes with the possibility of changing the referential situations. A second and more complicated experimental setting is also analyzed, where linguistic descriptions are required to respect word order constraints. The second author has been supported by the Spanish Ministry of Science under contract ENE2014-56126-C2-2-R (AOPRIN-SOL).

Universidad Carlos III de Madrid e-Archivo

Data-driven sentence simplification: Survey and benchmark

Author: Alva-Manchego F.
Scarton C.
Specia L.
Publication venue: MIT Press
Publication date: 15/09/2019

Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. To do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common datasets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.

Online Research @ Cardiff

Spiral - Imperial College Digital Repository

White Rose Research Online