TMX markup: a challenge when adapting SMT to the localisation environment
Translation memory (TM) plays an important role in localisation workflows and is used as an efficient and fundamental tool to carry out translation. In recent years, statistical machine translation (SMT) techniques have developed rapidly, and translation quality and speed have improved significantly as well. However, when applying SMT techniques to facilitate post-editing in the localisation industry, we need to adapt SMT to TM data that is formatted with special mark-up. In this paper, we explore some issues that arise when adapting SMT to Symantec-formatted TM data.
Three different methods are proposed to handle the Translation Memory eXchange (TMX) markup, and a comparative study is carried out between them. Furthermore, we also compare the TMX-based SMT systems with a customised SYSTRAN system through human evaluation and automatic evaluation metrics. The results of experiments conducted on the French and English language pair show that SMT can perform well using TMX as the input format either during training or at runtime.
A detailed analysis of phrase-based and syntax-based machine translation: the search for systematic differences
This paper describes a range of automatic and manual comparisons of phrase-based and syntax-based statistical machine translation methods applied to English-German and English-French translation of user-generated content. The syntax-based methods underperform the phrase-based models and the relaxation of syntactic constraints to broaden translation rule coverage means that these models do not necessarily generate output which is more grammatical than the output produced by the phrase-based models. Although the systems generate different output and can potentially be fruitfully combined, the lack of systematic difference between these models makes the combination task more challenging.
Improving the post-editing experience using translation recommendation: a user study
We report findings from a user study with professional post-editors using a translation recommendation framework (He et al., 2010) to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM. We analyze the effectiveness of the model as well as the reaction of potential users. Based on the performance statistics and the users' comments, we find that translation recommendation can reduce the workload of professional post-editors and improve the acceptance of MT in the localization industry.
Influence of ground substrate on establishment of reindeer lichen after artificial dispersal
Methods to improve the recovery of reindeer lichen after soil disturbance or overgrazing are being sought for areas where reindeer are herded. The effects of four substrates (mineral soil, moss, twigs and pine bark) on the establishment of lichen fragments after total removal of the vegetation were therefore studied in a middle-aged pine stand and a clear-cut, both located in a lichen-rich pine-heath. Cladina mitis fragments of two sizes were manually dispersed in 1 m² quadrats and their movements from their respective dispersal points were registered after one year. The natural re-establishment of lichens in the quadrats was monitored over three years by using digital pictures. In the forest stand, no significant differences were detected in either the fragment movement or the lichen establishment between the different substrates, but the fragment size had positive effects on both parameters. In the clear-cut, the moss substrate was the most suitable not only for the artificially dispersed lichens to fasten to, but also for the natural settlement of lichens from the surrounding lichen mat. More lichen thalli fastened to the bark and twig substrates than to the mineral soil, but the settlement of lichens from the surroundings was greater on the bare mineral soil substrate. The results indicate that artificial dispersal of lichen thalli on an appropriate substrate could be a successful strategy for promoting lichen recovery.
Community-based post-editing of machine-translated content: monolingual vs. bilingual
We carried out a machine-translation post-editing pilot study with users of an IT support forum community. For each of the two language pairs (English to German, English to French), four native speakers of the target language were recruited. They performed monolingual and bilingual post-editing tasks on machine-translated forum content. The post-edited content was evaluated using human evaluation (fluency, comprehensibility, fidelity). We found that monolingual post-editing can lead to improved fluency and comprehensibility scores similar to those achieved through bilingual post-editing, while fidelity improved considerably more in the bilingual set-up. Furthermore, performance varied greatly across post-editors, and some post-editors were able to produce better quality in a monolingual set-up than others.
Foreebank: Syntactic Analysis of Customer Support Forums
We present a new treebank of English and French technical forum content which has been annotated for grammatical errors and phrase structure. This double annotation allows us to empirically measure the effect of errors on parsing performance. While it is slightly easier to parse the corrected versions of the forum sentences, the errors are not the main factor in making this kind of text hard to parse.
DCU-Symantec submission for the WMT 2012 quality estimation task
This paper describes the features and the machine learning methods used by Dublin City University (DCU) and SYMANTEC for the WMT 2012 quality estimation task. Two sets of features are proposed: one constrained, i.e. respecting the data limitation suggested by the workshop organisers, and one unconstrained, i.e. using data or tools trained on data that was not provided by the workshop organisers. In total, more than 300 features were extracted and used to train classifiers in order to predict the translation quality of unseen data. In this paper, we focus on a subset of our feature set that we consider to be relatively novel: features based on a topic model built using the Latent Dirichlet Allocation approach, and features based on source and target language syntax extracted using part-of-speech (POS) taggers and parsers. We evaluate nine feature combinations using four classification-based and four regression-based machine learning techniques
DCU-Symantec at the WMT 2013 Quality Estimation Shared Task
We describe the two systems submitted by the DCU-Symantec team to Task 1.1 of the WMT 2013 Shared Task on Quality Estimation for Machine Translation. Task 1.1 involves estimating post-editing effort for English-Spanish translation pairs in the news domain. The two systems use a wide variety of features, of which the most effective are the word-alignment, n-gram frequency, language model, POS-tag-based and pseudo-reference features. Both systems perform at a similarly high level in the two tasks of scoring and ranking translations, although there is some evidence that the systems are over-fitting to the training data.
An investigation into the impact of controlled English rules on the comprehensibility, usefulness and acceptability of machine-translated technical documentation for French and German users
Previous studies suggest that the application of Controlled Language (CL) rules can significantly improve the readability, consistency, and machine-translatability of source text. One of the justifications for the application of CL rules is that they can have a similar impact on several target languages by reducing the post-editing effort required to bring Machine Translation (MT) output to acceptable quality. In certain situations, however, post-editing services may not always be a viable solution. Web-based information is often expected to be made available in real-time to ensure that its access is not restricted to certain users based on their locale. Uncertainties remain with regard to the actual usefulness of MT output for such users, as no empirical study has examined the impact of CL rules on the usefulness, comprehensibility, and acceptability of MT technical documents from a Web user's perspective. In this study, a two-phase approach is used to determine whether Controlled English rules can have a significant impact on these three variables. First, individual CL rules are evaluated within an experimental environment, which is loosely based on a test suite. Two documents are then published and subjected to a randomised evaluation within the framework of an online experiment using a customer satisfaction questionnaire. The findings indicate that a limited number of CL rules have a similar impact on the comprehensibility of French and German output at the segment level. The results of the online experiment show that the application of certain CL rules has the potential to significantly improve the comprehensibility of German MT technical documentation. Our findings also show that the introduction of CL rules did not lead to any significant improvement of the comprehensibility, usefulness, and acceptability of French MT technical documentation.
Establishment of Cladonia stellaris after artificial dispersal in an unfenced forest in northern Sweden
In 2002, fragments and whole thalli of reindeer lichen, mainly Cladonia stellaris, were spread in a typical Scots pine forest in northern boreal Sweden to study the survival and development after artificial lichen dispersal. The forest was not fenced, allowing reindeer access to graze. Lichens were dispersed in intact vegetation in 1 m² plots by one of two methods: either as an intact lichen mat (patch) of 0.25 m² in the centre of the plot or as fragments scattered (scatter) across the whole plot. The lichen was then monitored by photo inventory. In 2006, three years after the first inventory, all patch plots had been partially grazed by reindeer and the lichen cover measured in both patch and scatter plots had decreased severely. In 2008, the lichen cover in the patch and scatter plots had increased by up to 54% and 88%, respectively, of the cover measured during the first inventory in 2003. A significant increase in the number of fragments in the plots was also observed between 2006 and 2008, suggesting that in addition to growing like naturally established thalli, the lichen had spread and slowly colonized the plots. Dispersing lichen by the “patch” method appears to be less cost-efficient than the “scatter” method if the area is grazed by reindeer. These results support the hypothesis that dispersal of reindeer lichen could be an effective means of restoring lichen stands, which are important for reindeer husbandry, even if the area is open to reindeer grazing.
Abstract in Swedish / Sammanfattning:
Etablering av Cladonia stellaris efter artificiell spridning i ej inhägnad skog i norra Sverige
Renlav (främst Cladonia stellaris) spreds manuellt 2002 i en talldominerad skog i norra Sverige för att studera lavens etablering efter artificiell spridning. Försöksområdet var inte hägnat utan öppet för renbete. Laven spreds i intakt markvegetation på 1 m2-ytor, antingen i form av intakta lavbålar (0,25 m2) i ytans centrum eller som fragment över hela provytan. Lavens etablering följdes med hjälp av fotoinventering. År 2006, tre år efter första inventeringen, hade alla provytor betats av ren och lavens täckningsgrad hade reducerats betydligt. Vid inventeringen 2008 hade lavens täckningsgrad ökat med upp till 54% (intakt lav) resp. 88% (lavfragment), i jämförelse med täckningsgraden den första inventeringen. Mellan 2006 och 2008 ökade antalet fragment per provyta signifikant vilket indikerar en fortsatt naturlig etablering med spridning via fragment. Att sprida lav i form av intakta lavbålar förefaller mindre kostnadseffektivt än spridning av lav i fragmentform om spridningsområdet är öppet för renbete. Resultaten utgör ett stöd för hypotesen att artificiell spridning av renlav kan vara ett effektivt sätt att restaurera viktiga renbetesområden, även om området inte är skyddat för renbete
