3 research outputs found

Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor

Author: Callison-Burch Chris
Ganitkevitch Juri
Lopez Adam
Post Matt
Weese Jonathan
Publication date: 01/07/2011

We present progress on Joshua, an open source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.

CiteSeerX

Edinburgh Research Explorer

Generating Politically-Relevant Event Data

Author: Beieler John
Publication date: 01/01/2016

Automatically generated political event data is an important part of the social science data ecosystem. The approaches for generating this data, though, have remained largely the same for two decades. During this time, the field of computational linguistics has progressed tremendously. This paper presents an overview of political event data, including methods and ontologies, and a set of experiments to determine the applicability of deep neural networks to the extraction of political events from news text.

arXiv.org e-Print Archive

Crossref

A new model for Persian multi-part words edition based on statistical machine translation

Author: A. Arjomandzadeh
M. Zahedi
Publication venue: 'International Digital Organization for Scientific Information (IDOSI)'
Publication date: 01/01/2016

Multi-part words in English are hyphenated, with the hyphen separating the parts. Persian has multi-part words as well. In Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space character causes serious problems for Persian text processing and text readability. To address these problems, this work proposes a new model for correcting spacing in multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into text in a destination language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The method applies statistical machine translation techniques, treating unedited multi-part words as the source language and space-edited multi-part words as the destination language. The results show that the proposed method can edit Persian multi-part words and improve the spacing correction process with a statistically significant accuracy rate.

Directory of Open Access Journals