
    wEBMT: developing and validating an example-based machine translation system using the world wide web

    We have developed an example-based machine translation (EBMT) system that uses the World Wide Web for two different purposes. First, we populate the system’s memory with translations gathered from rule-based MT systems located on the Web. The source strings input to these systems were extracted automatically from an extremely small subset of the rule types in the Penn-II Treebank. In subsequent stages, the (source, target) translation pairs obtained are automatically transformed into a series of resources that render the translation process more successful. Although the output from on-line MT systems is often faulty, we demonstrate in a number of experiments that, when used to seed the memories of an EBMT system, it can in fact prove useful in generating high-quality translations in a robust fashion. In addition, we demonstrate the relative gain of EBMT in comparison to on-line systems. Second, despite the perception that the documents available on the Web are of questionable quality, we demonstrate in contrast that such resources are extremely useful in automatically postediting translation candidates proposed by our system.
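
    The memory-seeding pipeline described above can be pictured in a few lines of code. The sketch below is an illustrative assumption, not the paper’s implementation: `translate_via_online_mt` stands in for whatever client a given on-line rule-based MT system exposes, and the source strings are hard-coded rather than generated from Penn-II Treebank rule types.

```python
# Minimal sketch: seed an EBMT translation memory with (source, target)
# pairs obtained from an on-line MT system. The translation function is a
# hypothetical stand-in; real on-line systems need their own clients.
from typing import Callable

def seed_memory(sources: list[str],
                translate_via_online_mt: Callable[[str], str]) -> dict[str, str]:
    """Build a sententially aligned memory of (source, target) pairs."""
    memory = {}
    for src in sources:
        # The on-line system's output may be faulty; the paper's later
        # stages transform and validate these pairs before use.
        memory[src] = translate_via_online_mt(src)
    return memory

# Toy usage with a dummy "MT system":
dummy_mt = lambda s: {"the black cat": "le chat noir"}.get(s, s)
print(seed_memory(["the black cat"], dummy_mt))
# {'the black cat': 'le chat noir'}
```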

    Acquiring phrasal lexicons from corpora


    Emotions in the face: biology or culture? – Using idiomatic constructions as indirect evidence to inform a psychological research controversy

    Research on the facial expression of emotions has become a bone of contention in psychological research. On the one hand, Ekman and his colleagues have argued for a universal set of six basic emotions that are recognized with a considerable degree of accuracy across cultures and automatically displayed in highly similar ways by people. On the other hand, more recent research in cognitive science has provided results that are supportive of a cultural-relativist position. In this paper this controversy is approached from a contrastive perspective on phraseological constructions. It focuses on how emotional displays are codified in somatic idioms in some European (English, German, French, Spanish) and East Asian (Japanese, Korean, Chinese [Cantonese]) languages. Using somatic idioms such as make big eyes or die Nase rümpfen (‘to turn up one’s nose’) as a pool of evidence to shed linguistic light on the psychological controversy, the paper engages with the following general research question: Is there a significant difference between European and East Asian somatic idioms, or do these constructions rather speak for a universal apprehension of facial emotion displays? To answer this question, the paper compares somatic expressions selected from (idiom) dictionaries of the languages listed above. Moreover, native speakers of the East Asian languages were consulted to support the analysis of the respective data. All corresponding entries were analysed categorically, i.e. with regard to whether or not they encode a given facial area to denote a specific emotion. The results show arguments both for and against the universalist and the cultural-relativist positions. In general, they speak for an opportunistic encoding of facial emotion displays.

    Controlled generation in example-based machine translation

    The theme of controlled translation is currently in vogue in the area of MT. Recent research (Schäler et al., 2003; Carl, 2003) hypothesises that EBMT systems are perhaps best suited to this challenging task. In this paper, we present an EBMT system where the generation of the target string is filtered by data written according to controlled language specifications. As far as we are aware, this is the only research available on this topic: in the field of controlled language applications, it is more usual to constrain the source language in this way rather than the target. We translate a small corpus of controlled English into French using the on-line MT system Logomedia, and seed the memories of our EBMT system with a set of automatically induced lexical resources, using the Marker Hypothesis as a segmentation tool. We test our system on a large set of sentences extracted from a Sun Translation Memory, and provide both an automatic and a human evaluation. For comparative purposes, we also provide results for Logomedia itself.
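
    To make the filtering idea concrete, here is a deliberately simplified sketch of checking target-language candidates against a controlled-language specification. It reduces the specification to an approved lexicon; the function names and data are illustrative assumptions, not the system described in the paper.

```python
# Sketch: keep only target strings whose every token is licensed by the
# controlled-language specification (here reduced to an approved lexicon).

def conforms(candidate: str, approved_lexicon: set[str]) -> bool:
    """True if every token of the candidate is in the approved lexicon."""
    return all(tok in approved_lexicon for tok in candidate.lower().split())

def filter_candidates(candidates: list[str],
                      approved_lexicon: set[str]) -> list[str]:
    """Filter generated target strings by the controlled-language spec."""
    return [c for c in candidates if conforms(c, approved_lexicon)]

lexicon = {"appuyez", "sur", "le", "bouton", "rouge"}
print(filter_candidates(
    ["appuyez sur le bouton rouge", "pressez la touche écarlate"], lexicon))
# ['appuyez sur le bouton rouge']
```

    A real controlled-language specification would typically constrain syntax as well as vocabulary, but the shape of the filter, a predicate applied to each candidate target string, stays the same.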

    Example-based machine translation using the marker hypothesis

    The development of large-scale rules and grammars for a Rule-Based Machine Translation (RBMT) system is labour-intensive, error-prone and expensive. Current research in Machine Translation (MT) tends to focus on the development of corpus-based systems which can overcome the problem of knowledge acquisition. Corpus-Based Machine Translation (CBMT) can take the form of Statistical Machine Translation (SMT) or Example-Based Machine Translation (EBMT). Despite the benefits of EBMT, SMT is currently the dominant paradigm, and many systems classified as example-based integrate additional rule-based and statistical techniques. The benefits of an EBMT system which does not require extensive linguistic resources and can produce reasonably intelligible and accurate translations cannot be overlooked. We show that our linguistics-lite EBMT system can outperform an SMT system trained on the same data.

    The work reported in this thesis describes the development of a linguistics-lite EBMT system which does not have recourse to extensive linguistic resources. We apply the Marker Hypothesis (Green, 1979), a psycholinguistic theory which states that all natural languages are ‘marked’ for complex syntactic structure at surface form by a closed set of specific lexemes and morphemes. We use this technique in different environments to segment aligned (English, French) phrases and sentences. We then apply an alignment algorithm which can deduce smaller aligned chunks and words. Following a process similar to that of Block (2000), we generalise these alignments by replacing certain function words with an associated tag. In so doing, we cluster on marker words and add flexibility to our matching process. In a post hoc stage we treat the World Wide Web as a large corpus and validate and correct instances of determiner-noun and noun-verb boundary friction.

    We have applied our marker-based EBMT system to different bitexts and have explored its applicability in various environments. We have developed a phrase-based EBMT system (Gough et al., 2002; Way and Gough, 2003). We show that despite the perceived low quality of on-line MT systems, our EBMT system can produce good quality translations when such systems are used to seed its memories. Carl (2003a) and Schäler et al. (2003) suggest that EBMT is more suited to controlled translation than RBMT, as it has been known to overcome the ‘knowledge acquisition bottleneck’. To this end, we developed the first controlled EBMT system (Gough and Way, 2003; Way and Gough, 2004). Given the lack of controlled bitexts, we used the on-line MT system Logomedia to translate a set of controlled English sentences. We performed experiments using controlled analysis and generation and assessed the performance of our system at each stage. We made a number of improvements to our sub-sentential alignment algorithm and, following some minimal adjustments to our system, we show that our controlled EBMT system can outperform an RBMT system.

    We applied the Marker Hypothesis to a more scalable data set. We trained our system on 203,529 sentences extracted from a Sun Microsystems Translation Memory. We thus reduced problems of data sparseness and limited our dependence on Logomedia. We show that scaling up the data in a marker-based EBMT system improves the quality of our translations. We also report on the benefits of extracting lexical equivalences from the corpus using Mutual Information.
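
    The marker-based chunking at the heart of the thesis lends itself to a compact illustration. In the sketch below, a new chunk is opened at each marker word, with the common constraint that a chunk is closed only after it has acquired at least one non-marker word; the marker set is a small assumed sample, not the thesis’s full inventory.

```python
# Illustrative marker-based segmentation: closed-class "marker" words
# (determiners, prepositions, conjunctions, pronouns, ...) signal the
# start of a new chunk. Assumed sample marker set, not a full inventory.
MARKERS = {"the", "a", "an", "in", "on", "of", "to", "into", "and",
           "but", "he", "she", "it", "that", "with", "for"}

def marker_segment(sentence: str) -> list[str]:
    """Split a sentence into chunks that each begin at a marker word."""
    chunks, current, has_content = [], [], False
    for tok in sentence.lower().split():
        # Only close the current chunk once it holds a non-marker word,
        # so runs of consecutive markers stay together.
        if tok in MARKERS and has_content:
            chunks.append(" ".join(current))
            current, has_content = [], False
        current.append(tok)
        if tok not in MARKERS:
            has_content = True
    if current:
        chunks.append(" ".join(current))
    return chunks

print(marker_segment("the dog ran into the garden with a bone"))
# ['the dog ran', 'into the garden', 'with a bone']
```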

    How to talk cricket: on linguistic competence in a subject matter


    Chapter 1 Formulaic sequences: a drop in the ocean of constructions or something more significant?

    This article investigates how formulaic sequences fit into a constructionist approach to grammar, which is a major post-Chomskyan family of approaches to linguistic structure. The author considers whether, in this framework, formulaic sequences represent a phenomenon that is sufficiently different to warrant special status or whether they might best be studied in terms of the larger set of all constructions found in language. Based on data drawn from a large corpus of Wikipedia texts, it is argued that it is extremely difficult to form a distinct class of formulaic sequences without creating highly arbitrary boundaries. On the other hand, based on existing theoretical claims that formulaic sequences are the basis of first language acquisition, a marker of proficiency in a language, critical to the success of communicative acts and key to rapid language processing, it is argued that formulaic sequences as constructions are nevertheless significant enough to be the focus of research, and a theoretical category meriting particular attention. These findings have key repercussions both for research primarily interested in formulaic language and phraseology as well as for construction grammatical research.