580 research outputs found

TExSIS: bilingual terminology extraction from parallel corpora using chunk-based alignment

Author: Hoste Veronique
Lefever Els
Macken Lieve
Publication venue: 'John Benjamins Publishing Company'
Publication date: 01/01/2013

Crossref

Ghent University Academic Bibliography

Archivsystem Ask23

Dutch parallel corpus: a balanced parallel corpus for Dutch-English and Dutch-French

Author: FJ Och
G Sutter De
G Vanderbauwhede
Isabelle Delaere
L Macken
Lieve Macken
M Kay
M Simard
MP Marcus
P Keirsbilck Van
PF Brown
R Moore
W Daelemans
WA Gale
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

status: published

Lirias

Crossref

Springer - Publisher Connector

Ghent University Academic Bibliography

A Machine Learning Approach For Opinion Holder Extraction In Arabic Language

Author: AbdelRahman Samir
Elarnaoty Mohamed
Fahmy Aly
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 06/04/2012

Opinion mining aims at extracting useful subjective information from reliable amounts of text. Opinion holder recognition is a task that has not yet been considered for the Arabic language. This task essentially requires deep understanding of clause structures. Unfortunately, the lack of a robust, publicly available Arabic parser further complicates the research. This paper presents pioneering research on opinion holder extraction in Arabic news that is independent of any lexical parsers. We investigate constructing a comprehensive feature set to compensate for the lack of parsing structural outcomes. The proposed feature set is adapted from previous work on English, coupled with our proposed semantic field and named entity features. Our feature analysis is based on Conditional Random Fields (CRF) and semi-supervised pattern recognition techniques. Different research models are evaluated via cross-validation experiments, achieving an F-measure of 54.03. We publicly release our research corpus and lexicon to the opinion mining community to encourage further research.
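The F-measure reported in this abstract combines precision and recall into a single score. The following is a minimal sketch, assuming the balanced variant (beta = 1) and inputs on a 0 to 1 scale; the abstract does not state which variant was used, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    With beta = 1 this is the balanced F1 score. Inputs are assumed
    to lie in the range 0..1 (an assumption of this sketch).
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + b2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper.
    print(f_measure(0.6, 0.5))
```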

arXiv.org e-Print Archive

Crossref

Sub-sentential alignment of translational correspondences

Author: Macken Lieve
Publication venue: UPA University Press Antwerp
Publication date: 01/01/2010

The focus of this thesis is sub-sentential alignment, i.e. the automatic alignment of translational correspondences below sentence level. The system that we developed takes as its input sentence-aligned parallel texts and aligns translational correspondences at the sub-sentential level, which can be words, word groups or chunks. The research described in this thesis aims to be of value to the developers of computer-assisted translation tools and to human translators in general. Two important aspects of this research are its focus on different text types and its focus on precision. In order to cover a wide range of syntactic and stylistic phenomena that emerge from different writing and translation styles, we used parallel texts of different text types. As the intended users are ultimately human translators, our explicit aim was to develop a model that aligns segments with a very high precision. This thesis consists of three major parts. The first part is introductory and focuses on the manual annotation, the resources used and the evaluation methodology. The second part forms the main contribution of this thesis and describes the sub-sentential alignment system that was developed. In the third part, two different applications are discussed. Although the global architecture of our sub-sentential alignment module is language-independent, the main focus is on the English-Dutch language pair. At the beginning of the research project, a Gold Standard was created. The manual reference corpus contains three different types of links: regular links for straightforward correspondences, fuzzy links for translation-specific shifts of various kinds, and null links for words for which no correspondence could be indicated. The different writing and translation styles in the different text types were reflected in the number of regular, fuzzy and null links. The sub-sentential alignment system is conceived as a cascaded model consisting of two phases.
In the first phase, anchor chunks are linked on the basis of lexical correspondences and syntactic similarity. In the second phase, we use a bootstrapping approach to extract language-pair specific translation patterns. The alignment system is chunk-driven and requires only shallow linguistic processing tools for the source and the target languages, i.e. part-of-speech taggers and chunkers. To generate the lexical correspondences, we experimented with two different types of bilingual dictionaries: a handcrafted bilingual dictionary and probabilistic bilingual dictionaries. In the bootstrapping experiments, we started from the precise GIZA++ intersected word alignments. The proposed system improves the recall of the intersected GIZA++ word alignments without sacrificing precision, which makes the resulting alignments more useful for incorporation in CAT tools or bilingual terminology extraction tools. Moreover, the system's ability to align discontiguous chunks makes the system useful for languages containing split verbal constructions and phrasal verbs. In the last part of this thesis, we demonstrate the usefulness of the sub-sentential alignment module in two different applications. First, we used the sub-sentential alignment module to guide bilingual terminology extraction on three different language pairs, viz. French-English, French-Italian and French-Dutch. Second, we compare the performance of our alignment system with a commercial sub-sentential translation memory system.
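The abstract's starting point, intersected word alignments, can be illustrated with a short sketch. It assumes each directional alignment is given as a set of index pairs and that links are scored against a manually created gold standard; the function names and the toy data are hypothetical and are not taken from the thesis or from GIZA++ itself.

```python
def intersect_alignments(src2tgt, tgt2src):
    """Keep only the links that both alignment directions agree on.

    src2tgt holds (source_index, target_index) pairs.
    tgt2src holds (target_index, source_index) pairs, so it is
    flipped before the two sets are intersected.
    """
    flipped = {(s, t) for (t, s) in tgt2src}
    return set(src2tgt) & flipped


def precision_recall(predicted, gold):
    """Precision and recall of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data for illustration only.
    src2tgt = {(0, 0), (1, 1), (2, 1)}
    tgt2src = {(0, 0), (1, 1), (2, 3)}
    gold = {(0, 0), (1, 1), (2, 2)}
    links = intersect_alignments(src2tgt, tgt2src)
    print(sorted(links), precision_recall(links, gold))
```

In the toy data the intersection keeps only the links both directions propose, so every kept link is correct but one gold link is missed. This is the high-precision, lower-recall behaviour that the abstract says the proposed system improves on.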

Ghent University Academic Bibliography

Searching to Translate and Translating to Search: When Information Retrieval Meets Machine Translation

Author: Ture Ferhan
Publication date: 01/01/2013

With the adoption of web services in daily life, people have access to tremendous amounts of information, beyond any human's reading and comprehension capabilities. As a result, search technologies have become a fundamental tool for accessing information. Furthermore, the web contains information in multiple languages, introducing another barrier between people and information. Therefore, search technologies need to handle content written in multiple languages, which requires techniques to account for the linguistic differences. Information Retrieval (IR) is the study of search techniques, in which the task is to find material relevant to a given information need. Cross-Language Information Retrieval (CLIR) is a special case of IR when the search takes place in a multi-lingual collection. Of course, it is not helpful to retrieve content in languages the user cannot understand. Machine Translation (MT) studies the translation of text from one language into another efficiently (within a reasonable amount of time) and effectively (fluent and retaining the original meaning), which helps people understand what is being written, regardless of the source language. Putting these together, we observe that search and translation technologies are part of an important user application, calling for a better integration of search (IR) and translation (MT), since these two technologies need to work together to produce high-quality output. In this dissertation, the main goal is to build better connections between IR and MT, for which we present solutions to two problems: Searching to translate explores approximate search techniques for extracting bilingual data from multilingual Wikipedia collections to train better translation models. Translating to search explores the integration of a modern statistical MT system into the cross-language search processes. In both cases, our best-performing approach yielded improvements over strong baselines for a variety of language pairs. 
Finally, we propose a general architecture, in which various components of IR and MT systems can be connected together into a feedback loop, with potential improvements to both search and translation tasks. We hope that the ideas presented in this dissertation will spur more interest in the integration of search and translation technologies.

CiteSeerX

Digital Repository at the University of Maryland

Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines

Author: Angelov Krasimir
Gruzitis N.
Kolachina Prasanth
Ranta Aarne
Publication venue: 'MIT Press - Journals'
Publication date: 01/01/2020

Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
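The idea of abstract syntax as an interlingua can be sketched in a few lines: a sentence is parsed into a language-neutral tree, and the tree is then linearized into another language. The Python below is a toy illustration under that assumption; it is not GF code and does not use any GF API, and the tiny lexicon and all names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Pred:
    """Abstract tree: a subject concept combined with a verb concept."""
    subject: str
    verb: str


SUBJECTS = ("cat", "dog")
VERBS = ("sleep", "run")

# Concrete syntax: one toy lexicon per language (illustrative only).
LEXICON = {
    "eng": {"cat": "the cat", "dog": "the dog", "sleep": "sleeps", "run": "runs"},
    "fre": {"cat": "le chat", "dog": "le chien", "sleep": "dort", "run": "court"},
}


def linearize(tree: Pred, lang: str) -> str:
    """Render an abstract tree as a string in the given language."""
    words = LEXICON[lang]
    return f"{words[tree.subject]} {words[tree.verb]}"


def parse(sentence: str, lang: str) -> Pred:
    """Find the abstract tree whose linearization matches the sentence."""
    for subject, verb in product(SUBJECTS, VERBS):
        tree = Pred(subject, verb)
        if linearize(tree, lang) == sentence:
            return tree
    raise ValueError(f"cannot parse {sentence!r} in {lang}")


def translate(sentence: str, source: str, target: str) -> str:
    """Translate by parsing to the abstract tree, then linearizing."""
    return linearize(parse(sentence, source), target)


if __name__ == "__main__":
    print(translate("the cat sleeps", "eng", "fre"))
```

Because both languages share the same abstract tree, adding a further language in this sketch only requires a new lexicon entry, not a new translation routine for each language pair.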

Chalmers Research

wEBMT: developing and validating an example-based machine translation system using the world wide web

Author: Gough Nano
Way Andy
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2003

We have developed an example-based machine translation (EBMT) system that uses the World Wide Web for two different purposes: First, we populate the system’s memory with translations gathered from rule-based MT systems located on the Web. The source strings input to these systems were extracted automatically from an extremely small subset of the rule types in the Penn-II Treebank. In subsequent stages, the (source, target) translation pairs obtained are automatically transformed into a series of resources that render the translation process more successful. Despite the fact that the output from on-line MT systems is often faulty, we demonstrate in a number of experiments that when used to seed the memories of an EBMT system, they can in fact prove useful in generating translations of high quality in a robust fashion. In addition, we demonstrate the relative gain of EBMT in comparison to on-line systems. Second, despite the perception that the documents available on the Web are of questionable quality, we demonstrate in contrast that such resources are extremely useful in automatically post-editing translation candidates proposed by our system.

CiteSeerX

Irish Universities

DCU Online Research Access Service

Recommended from our members

Parsing Arabic Dialects

Author: Chiang David
Diab Mona T.
Habash Nizar Y.
Hwa Rebecca
Lacey Vincent
Levy Roger
Nichols Carol
Rambow Owen C.
Shareef Safiullah
Sima'an Khalil
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2006

The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel corpus LA-MSA. Instead, we use explicit knowledge about the relation between LA and MSA.

Columbia University Academic Commons