
    Augmenting Translation Lexica by Learning Generalised Translation Patterns

    Bilingual lexicons improve the quality of parallel corpora alignment, of newly extracted translation pairs, of machine translation, and of cross-language information retrieval, among other applications. In this regard, the first problem addressed in this thesis pertains to the classification of translations automatically extracted from parallel corpora, that is, collections of sentence pairs that are translations of each other. The second problem is concerned with machine learning of bilingual morphology, with applications in the solution of the first problem and in the generation of Out-Of-Vocabulary translations. With respect to the problem of translation classification, two separate classifiers for handling multi-word and word-to-word translations are trained, using translation pairs that were previously extracted and manually classified as correct or incorrect. Several features are useful for distinguishing adequate multi-word candidates from inadequate ones, such as the lack or presence of parallelism; spurious terms at translation ends, such as determiners and coordinating conjunctions; properties such as orthographic similarity between translations; and the occurrence and co-occurrence frequency of the translation pairs. Morphological coverage, reflecting stem and suffix agreements, is explored as a key feature in classifying word-to-word translations. Given that the evaluation of extracted translation equivalents depends heavily on the human evaluator, incorporating an automated filter for appropriate and inappropriate translation pairs prior to human evaluation greatly reduces this work, thereby saving time and progressively improving alignment and extraction quality. It can also be applied to filtering translation tables used for training machine translation engines, and to detecting bad translation choices made by translation engines, thus enabling significant productivity gains in the post-editing of machine-made translations. 
An important attribute of the translation lexicon is the coverage it provides. Learning suffixes and suffixation operations from the lexicon or corpus of a language is an extensively researched approach to tackling out-of-vocabulary terms. However, beyond mere words or word forms, translations and their variants are a powerful source of information for automatic structural analysis. This source is explored from the perspective of improving word-to-word translation coverage and constitutes the second part of this thesis. In this context, as a phase prior to the suggestion of out-of-vocabulary bilingual lexicon entries, an approach is proposed to automatically induce segmentation and learn bilingual morph-like units by identifying and pairing word stems and suffixes, using the bilingual corpus of translations automatically extracted from aligned parallel corpora, whether manually validated or automatically classified. A minimally supervised technique is proposed to enable bilingual morphology learning for language pairs whose bilingual lexicons are highly deficient in word-to-word translations representing inflection diversity. Apart from the above-mentioned applications in the classification of machine-extracted translations and in the generation of Out-Of-Vocabulary translations, learned bilingual morph-units may also have a great impact on establishing correspondences between sub-word constituents in word-to-multi-word and multi-word-to-multi-word translations, and on compression, full-text indexing and retrieval applications.

    Genuine phrase-based statistical machine translation with supervision

    This thesis mainly addresses two issues that have not been addressed in Statistical Machine Translation (SMT). One issue is that, even though research has been evolving from word-based approaches to phrase-based ones because words were consistently found to be inappropriate translation units, words are still considered in the composition of phrases, either to determine translation equivalents or to check language fluency. Such consideration might lead to attempts to establish relations between words within a phrase translation equivalent even when the phrases should be considered as a whole. Attempts to further partition such phrases would produce incorrect translation units that would introduce unwanted noise into the translation process. Besides, the internal fluency of an identified multi-word phrase should not require checking. As such, phrases should indeed be considered units, avoiding incorrect translation equivalents that might be identified from their partition, and considering only the fluency of a phrase with respect to other phrases and not within the phrase itself. The other issue is that supervision, in the form of translation lexica, is generally overlooked, with SMT research focusing mainly on the identification of translation units without any human intervention and without considering already known translation units. As such, no importance has been attributed to the inclusion of verified lexica; only some rarely used dictionaries serve to score translation candidates, and not really as a source of translation units. Indeed, translation equivalents should be memorized, checked and used as a source of translation units, avoiding the need to keep identifying the same translation units, in particular if those are frequently used. 
This thesis presents a truly phrase-based approach to SMT, using contiguous and non-contiguous phrases, along with supervision, in which phrases are not divided and verified lexica are built, kept and used to propose translations of complete sentences.

    Translation Alignment and Extraction Within a Lexica-Centered Iterative Workflow

    This thesis addresses two closely related problems. The first, translation alignment, consists of identifying bilingual document pairs that are translations of each other within multilingual document collections (document alignment); identifying sentences, titles, etc., that are translations of each other within bilingual document pairs (sentence alignment); and identifying corresponding word and phrase translations within bilingual sentence pairs (phrase alignment). The second is extraction of bilingual pairs of equivalent word and multi-word expressions, which we call translation equivalents (TEs), from sentence- and phrase-aligned parallel corpora. While these same problems have been investigated by other authors, their focus has been on fully unsupervised methods based mostly or exclusively on parallel corpora. Bilingual lexica, which are basically lists of TEs, have not been considered or given enough importance as resources in the treatment of these problems. Human validation of TEs, which consists of manually classifying TEs as correct or incorrect translations, has also not been considered in the context of alignment and extraction. Validation strengthens the importance of infrequent TEs (most of the entries of a validated lexicon) that otherwise would be statistically unimportant. The main goal of this thesis is to revisit the alignment and extraction problems in the context of a lexica-centered iterative workflow that includes human validation. Therefore, the methods proposed in this thesis were designed to take advantage of knowledge accumulated in human-validated bilingual lexica and translation tables obtained by unsupervised methods. Phrase-level alignment is a stepping stone for several applications, including the extraction of new TEs, the creation of statistical machine translation systems, and the creation of bilingual concordances. 
Therefore, for phrase-level alignment, the higher accuracy of human-validated bilingual lexica is crucial for achieving higher quality results in these downstream applications. There are two main conceptual contributions. The first is the coverage maximization approach to alignment, which makes direct use of the information contained in a lexicon, or in translation tables when the lexicon is small or does not exist. The second is the introduction of translation patterns, which combine novel and old ideas and enable precise and productive extraction of TEs. As material contributions, the alignment and extraction methods proposed in this thesis have produced source materials for three lines of research, in the context of three PhD theses (two of them already defended), all supervised by my advisor. The topics of these lines of research are statistical machine translation, algorithms and data structures for indexing and querying phrase-aligned parallel corpora, and bilingual lexica classification and generation. Four publications have resulted directly from the work presented in this thesis and twelve from the collaborative lines of research.

    Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation

    In this paper, we compare the rule-based and data-driven approaches in the context of Spanish-to-Basque Machine Translation. The rule-based system we consider has been developed specifically for Spanish-to-Basque machine translation, and is tuned to this language pair. In contrast, the data-driven system we use is generic, and has not been specifically designed to deal with Basque. Spanish-to-Basque Machine Translation is a challenge for data-driven approaches for at least two reasons. First, there is a lack of bilingual data on which a data-driven MT system can be trained. Second, Basque is a morphologically rich agglutinative language, and translating into Basque requires generating a large amount of morphological information, a difficult task for a generic system not specifically tuned to Basque. We present the results of a series of experiments, obtained on two different corpora, one being “in-domain” and the other one “out-of-domain” with respect to the data-driven system. We show that n-gram based automatic evaluation and edit-distance-based human evaluation yield two different sets of results. According to BLEU, the data-driven system outperforms the rule-based system on the in-domain data, while according to the human evaluation, the rule-based approach achieves higher scores for both corpora.

    The acquisition of interlanguage morphology : a study into the role of morphology in the L2 learner's mental lexicon

    Introduction
1.1 Morphology and second language learning
If Dutch learners of English encounter a word like undoable, they may recognise it because they have seen it before and have remembered it. They may also fail to recognise it and guess the meaning of the word on the basis of the context. A third possibility is that they do not recognise the word and attempt to guess the meaning by decomposing the word into parts they do recognise and arrive at its meaning on the basis of these parts. In doing so, they may use the knowledge of word analysis they have gained in their mother tongue. The strategy a learner will employ in this situation will depend on a range of interrelated factors. Is the word used frequently? Is the word decomposable? Are the parts of the word used frequently? Does the learner know these parts? Are the parts similar to parts in the learner’s native language? Does the combined meaning of the parts make sense? All these questions are related to the acquisition and use of morphology in a second language, which is the subject of this study. Many of the questions raised here are related to questions that go beyond this specific subject. The first question that has to be answered is how do adult native speakers of a language store and retrieve (parts of) words? Secondly, how do people acquire knowledge and skills to produce and recognise words? Thirdly, how can this be explained by theories of language learning and theories of morphological structure? Furthermore, are the processes underlying the acquisition of morphological skills different for L1 learners and L2 learners? What, for instance, is the role of the second language learner’s first language? 
All of these questions and many more will be addressed here, as they constitute essential parts of the central research question of this study: What are the mechanisms and processes underlying the acquisition and use of morphological complexity in the production and comprehension of polymorphemic words by learners of a second language? More specifically, the current study will focus on the acquisition of these mechanisms and processes for Dutch learners of English. In this dissertation, an integrated multi-disciplinary model will be proposed to account for the acquisition and use of second-language morphology. The acquisition of second language morphology is a relatively new area of research, which means that there is little material to draw on. Previous studies of L2 morphology have mainly concentrated on the order of acquisition of morphemes as a function of the learner’s L1 background (the “morpheme order studies”). These studies provide ample (though contradictory) information about the order of acquisition of several specific morphemes and may contribute to the overall picture of foreign language morphological acquisition. These studies, however, have hardly paid attention to the underlying strategies applied by the learner in acquiring, processing and producing morphologically complex words, or to the general organisation and development of the foreign language learner’s lexicon. As studies in this particular area have been sparse, clues for the strategies and processes of L2 learners in the production and comprehension of morphologically complex words must be obtained from other sources. One of these sources is morphological theory. This area has seen the introduction of many concepts and ideas that could solve part of the puzzle. Another major source is the psycholinguistic study of the (adult native speaker’s) mental lexicon. 
Many such models of the mental lexicon address the problem exemplified in the opening lines of this chapter: does the mental lexicon consist of whole words, parts of words, or a combination of the two? And in case of the latter situation, how is it determined which mechanisms the speaker or listener employs to produce or comprehend morphologically complex words? Out of the many approaches to this problem, one particular model will be selected that serves as the sound basis for a model of the acquisition of L2 morphology. A third source is found in case studies describing the acquisition of L1 morphology. Studies of children’s lexical innovations reveal that children make use of morphological generalisations on a large scale. These data provide invaluable insights into the mechanisms and processes of the acquisition of morphology, which can partly be generalised to the acquisition of L2 morphology. This also holds for studies of the bilingual mental lexicon. Although these studies do not have anything to say about the acquisition of morphology explicitly, models of the bilingual mental lexicon provide useful insights into the differences between the monolingual and the bilingual lexicon. These insights can be utilised to define the role of morphology in the acquisition of the L2 mental lexicon. For example, an obvious question that is amply addressed in this field is whether the bilingual mental lexicon consists of discrete lexicons for each language or of one unified lexicon. If morphology is regarded as part of the lexicon, the question of the discreteness of L1 and L2 morphology will largely depend on the model of the bilingual lexicon adhered to. A final source of information is found in theories of second language acquisition. This field provides an insight into the importance of the second language learner’s native language and into the role of the learner’s input. 
Furthermore, this area addresses developmental issues and supplies data to complete the interdisciplinary model of the acquisition of L2 morphology.
1.1.1 Relevance of morphology for second language learning
Many words in a language are morphologically related, at different levels and with different strengths. The verb “to learn”, for example, is not only linked to inflectional forms like “learns”, “learned” and “learning”, but also to the noun “learner” and the adjective “learnable”. It would not be very economical if all these related forms had to be learned and stored separately. This would be unlikely considering the impressive number of words that can be composed using morphology. For purely agglutinative languages like Turkish, impressive absolute figures have been calculated to this effect. Hankamer (1989) computed that a typical educated speaker of Turkish, with a lexicon size of approximately 20,000 noun roots and 10,000 verb roots, could have at their disposal more than 200 billion entries based on this lexicon. These figures demonstrate the power of morphological relations and show the relevance of morphology for learners of a language: with a limited knowledge of morphological regularities, the learner can achieve a tremendous expansion of her [1] vocabulary. Secondly, morphology can be a helpful tool to facilitate the acquisition and use of words. Recent research into the acquisition and retention of foreign and second language vocabulary has shown that newly acquired words are better retained if they were initially inferred through linguistic cues rather than through context (see e.g. Haastrup, 1989) [2]. Drawing attention to the morphological structure of words in a second language may result in an increasing awareness of morphological complexity, which can be an important strategy for inferring and acquiring words. In “printed school English”, 84 per cent of the prefixed words and 86 per cent of the derivationally suffixed words are semantically transparent, i.e. 
their meaning can be inferred on the basis of their constituent morphemes (Nagy & Anderson, 1984). Obviously, morphological cues for the inference of words in a second or foreign language can be essential to vocabulary acquisition. This is confirmed by other studies, for instance Freyd and Baron (1982), which indicate that learners who are good at analysing words are the more successful word learners.
1.1.2 Relevance of the study of interlanguage morphology
The study of interlanguage morphology can provide insights into the relative importance of morphology teaching in SLA. Knowledge of processes underlying the learner’s use of morphology may support teaching, as it will make clear on which areas of morphology language teaching should concentrate and will help determine the best way of teaching morphology. Secondly, this line of study could support the work that is being done in the area of vocabulary acquisition. As many words are related by form and/or by meaning, studying the nature of these relations may shed new light on the processes and factors that are relevant to the acquisition of vocabulary. Thirdly, the study of L2 morphology may contribute to general theories of second language acquisition. The role of the learner’s native language, for instance, is one of the factors that will play a major role in the study of both L2 morphology and other areas of SLA research, and findings in the field of morphology could be generalised to the other fields. Finally, insights in the field of interlanguage morphology may contribute to models of morphological processing in L1 and L2 and models of the bilingual mental lexicon.
[1] I will try to be consistent in referring to learners as female human beings. Readers who feel offended by this may rest assured that I will use “he” and “him” in my next dissertation.
[2] For a complete overview of the effects of context on the acquisition of vocabulary, see Mondria (1996).
1.2 Focus, aim and structure of the present study
The primary focus of this study is to investigate the processes and principles underlying the acquisition of English morphology by Dutch learners. To this end, an interdisciplinary model of the acquisition of L2 morphology is proposed and tested. This model draws on different sources that are discussed in Chapters 2 and 3. Chapter 2 focuses on the role of morphology in the comprehension and production of polymorphemic words by adult native speakers. It first discusses some pertinent current theories of morphology that contribute to the model. Next, it surveys the most influential models of the mental lexicon and expresses a preference for one particular model. Then, the most relevant issues related to morphology and the lexicon are discussed thematically, focused on determining the most powerful model of morphology in the mental lexicon. It will be argued that a model of morphology should comply with a more general model of language processing, and requirements will be set for the adjustment of morphological models in this direction. In the conclusion of Chapter 2, additional support is provided for the model selected, and further adjustments are suggested. Chapter 3 concentrates on the role of morphology in first and second language learning and on the structure of the bilingual mental lexicon. After a detailed discussion of diary data describing children’s lexical innovations, the main conclusions about the principles and processes of L1 acquisition are listed. These observations give rise to elaboration and adjustment of the model proposed in Chapter 2. Additional information about the L2 learner’s lexicon is provided by models of the bilingual lexicon, which are predominantly based on speech error data. This culminates in further elaboration of the model. 
Finally, the model is checked against observations from the area of (general) second language acquisition research, in which particularly the role of the learner’s first language will be highlighted. Chapter 3 concludes with a sketch of the overall picture as it emerges from Chapters 2 and 3. Aspects of the model thus proposed are tested in Chapter 4. This chapter describes three empirical studies (some of which consist of several experiments), all concentrating on testing the one aspect of the model that appears to be most crucial: the role of the learner’s first language. The first study explores the relations between Dutch and English morphology in a series of three experiments. In the second study, an implication of the model is tested in a psycholinguistic priming experiment involving reaction time measurement. The third study includes a typological comparison of Dutch and English derivational morphology based on corpus data. Predictions originating from this comparison are tested in a production experiment in which learners from three levels of L2 proficiency participate. In Chapter 5, the model proposed in Chapters 2 and 3 is evaluated in terms of the results of the empirical studies in Chapter 4. Finally, some implications of the studies are mentioned and some suggestions are put forward for further research.

    Survival factors in the early Middle English lexicon

    Using a corpus linguistic approach, this article aims to answer the question of which factors contribute to a better chance of survival for words in the early Middle English lexicon. Because of the cognitive benefits of rhyme that have been shown in modern studies, there is a particular interest in rhyming position as a potential factor; other factors include frequency, suffix and geographical spread. The data are analysed using survival analysis, random forests and conditional inference trees in R. The results show that geographical spread is the most important factor, usually in combination with particular suffixes. Rhyme is not generally a significant factor to the same degree, and its importance seems to be restricted to individual cases.

    Canonical and non-canonical patterns in the adpositional phrase of Western Uralic : constraints of borrowing

    Notions about adpositions and adpositional phrases (AdpP) reflect the ambiguous nature of this particular domain. While postpositions and prepositions are often dealt with as lexical categories, their syntactic context determines the grammatical relations of individual postpositions. In the diachronic development of individual adpositions, the phrasal unit of AdpP plays a crucial role, either enhancing or diminishing the possibility of adopting new adpositions. In Uralic both the head and complement may be inflected, which increases the divergence of the adpositional system in comparison with most neighboring contact languages. This is clearly seen in the bulk of adpositions in Finnic, Saamic and Mordvinic, which only exceptionally include borrowed lexemes. The focus of this article is twofold. Firstly, it briefly outlines the main structural types of AdpP, particularly in Western Uralic. Secondly, it discusses why loanwords only seldom occur in the adpositional system of languages that are strongly influenced by language contact and widespread bilingualism, such as contemporary Veps and Erzya.
    Peer reviewed