1,170 research outputs found

    Description of the Chinese-to-Spanish rule-based machine translation system developed with a hybrid combination of human annotation and statistical techniques

    Two of the most popular Machine Translation (MT) paradigms are rule-based (RBMT) and corpus-based, the latter including statistical systems (SMT). When parallel corpora are scarce, RBMT becomes particularly attractive. This is the case for the Chinese–Spanish language pair. This article presents the first RBMT system for Chinese to Spanish. We describe a hybrid method for constructing this system that takes advantage of available resources such as parallel corpora, which are used to extract dictionaries and lexical and structural transfer rules. The final system is freely available online and open source. Although performance lags behind standard SMT systems on an in-domain test set, the results show that the RBMT’s coverage is competitive and that it outperforms the SMT system on an out-of-domain test set. This RBMT system is available to the general public, it can be further enhanced, and it opens up the possibility of creating future hybrid MT systems. Peer Reviewed. Postprint (author's final draft).

    Evaluation of the Statistical Machine Translation Service for Croatian-English

    Much thought has been given to formalizing the translation process. As a result, various approaches to machine translation (MT) have been taken. With the exception of statistical translation, all approaches require cooperation between language and computer science experts. Most of the models use various hybrid approaches. The statistical translation approach is completely language independent, if we disregard the fact that it requires a huge parallel corpus that needs to be split into sentences and words. This paper compares and discusses state-of-the-art statistical machine translation (SMT) models and evaluation methods. Results of the statistically based Google Translate tool for Croatian-English translations are presented and a multilevel analysis is given. Three different types of texts are manually evaluated and the results are analysed with the χ²-test.

    Improving machine translation performance using comparable corpora

    The overwhelming majority of the languages in the world are spoken by fewer than 50 million native speakers, and automatic translation of many of these languages is less investigated due to the lack of linguistic resources such as parallel corpora. In the ACCURAT project we will work on novel methods by which comparable corpora can compensate for this shortage and improve machine translation systems for under-resourced languages. Translation systems for eighteen European language pairs will be investigated and methodologies in corpus linguistics will be greatly advanced. We will explore the use of preliminary SMT models to identify the parallel parts within comparable corpora, which will allow us to derive better SMT models via a bootstrapping loop.

    Multilingual Lexicon Extraction under Resource-Poor Language Pairs

    In general, bilingual and multilingual lexicons are important resources in many natural language processing fields such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs such as Korean–French. It is important that such resources for these language pairs be publicly available or easily accessible when a monolingual resource is considered. This thesis presents efficient approaches for extracting bilingual single-/multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish. The goal of this thesis is to present several efficient methods of extracting translated single-/multi-words from bilingual corpora based on a statistical method. Three approaches for single words and one approach for multi-words are proposed. The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect source and target languages. It builds context vectors from two parallel corpora sharing one pivot language and calculates their similarity scores to choose the best translation equivalents. The approach can reduce the effort required when using a seed dictionary for translation by using parallel corpora rather than comparable corpora. The second approach is the extended pivot context-based approach (EPCA). This approach gathers similar context vectors for each source word to augment its context. The approach assumes that similar vectors can enrich contexts. For example, young and youth can augment the context of baby. In the investigation described here, such similar vectors were collected by similarity measures such as cosine similarity. The third approach for single words uses a competitive neural network algorithm (i.e., self-organizing maps, SOM).
The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (i.e., source and target SOMs) in different ways. A source SOM is trained in an unsupervised way, while a target SOM is trained in a supervised way. The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs). This approach reinforces the PCA for multi-words (PCAM). It extracts bilingual MWEs taking all constituents of the source MWEs into consideration. The PCAM identifies MWE candidates by pointwise mutual information first and then adds them to the input data as single units in order to use the PCA directly. The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA have demonstrated good performance for such language pairs. The EPCA did not perform as strongly as expected. The CTA performs well even when word contexts are insufficient. Overall, the experimental results show that the CTA significantly outperforms the PCAM. In the future, homonyms (i.e., homographs such as lead or tear) should be considered. In particular, the domains of bilingual corpora should be identified. In addition, more parts of speech such as verbs, adjectives, or adverbs could be tested. In this thesis, only nouns are discussed for simplicity.
Finally, thorough error analysis should also be conducted.
Table of contents: Abstract; List of Abbreviations; List of Tables; List of Figures; Acknowledgement
Chapter 1 Introduction: 1.1 Multilingual Lexicon Extraction; 1.2 Motivations and Goals; 1.3 Organization
Chapter 2 Background and Literature Review: 2.1 Extraction of Bilingual Translations of Single-words (2.1.1 Context-based approach; 2.1.2 Extended approach; 2.1.3 Pivot-based approach); 2.2 Extraction of Bilingual Translations of Multi-Word Expressions (2.2.1 MWE identification; 2.2.2 MWE alignment); 2.3 Self-Organizing Maps; 2.4 Evaluation Measures
Chapter 3 Pivot Context-Based Approach: 3.1 Concept of Pivot-Based Approach; 3.2 Experiments (3.2.1 Resources; 3.2.2 Results); 3.3 Summary
Chapter 4 Extended Pivot Context-Based Approach: 4.1 Concept of Extended Pivot Context-Based Approach; 4.2 Experiments (4.2.1 Resources; 4.2.2 Results); 4.3 Summary
Chapter 5 SOM-Based Approach: 5.1 Concept of SOM-Based Approach; 5.2 Experiments (5.2.1 Resources; 5.2.2 Results); 5.3 Summary
Chapter 6 Constituent-Based Approach: 6.1 Concept of Constituent-Based Approach; 6.2 Experiments (6.2.1 Resources; 6.2.2 Results); 6.3 Summary
Chapter 7 Conclusions and Future Work: 7.1 Conclusions; 7.2 Future Work
Reference
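
As an illustration of the pivot context-based approach summarized in this abstract, the following is a minimal sketch, not the thesis's implementation. It assumes two tiny, pre-tokenized parallel corpora that share English as the pivot language, raw co-occurrence counts as context vectors, and cosine similarity as the scoring function (the abstract names cosine similarity only for the EPCA). The function names and toy sentences are hypothetical.

```python
import math
from collections import Counter


def context_vector(aligned_pairs, word):
    """Count pivot-language words in every sentence pair whose
    non-pivot side contains `word` (raw counts; an assumption)."""
    counts = Counter()
    for sentence, pivot_sentence in aligned_pairs:
        if word in sentence:
            counts.update(pivot_sentence)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(source_pairs, target_pairs, source_word, candidates):
    """Rank target-language candidates by similarity of their pivot
    context vectors to that of the source word."""
    src_vec = context_vector(source_pairs, source_word)
    scored = [
        (cand, cosine(src_vec, context_vector(target_pairs, cand)))
        for cand in candidates
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy corpora: (non-pivot tokens, English pivot tokens).
ko_en = [
    (["아기", "운다"], ["baby", "cries"]),
    (["개", "짖는다"], ["dog", "barks"]),
]
fr_en = [
    (["bébé", "pleure"], ["baby", "cries"]),
    (["chien", "aboie"], ["dog", "barks"]),
]

ranked = rank_translations(ko_en, fr_en, "아기", ["bébé", "chien"])
best_candidate = ranked[0][0]
```

Because both context vectors live in the pivot vocabulary, they can be compared directly without a Korean–French seed dictionary. A realistic system would likely weight the counts with an association measure and work from far larger corpora.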

    Language technologies for a multilingual Europe

    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu)