568 research outputs found

    A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

    The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot-language and cognate-recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry-assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using the metrics of precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate-synonym threshold) with various combinations of heuristics and numbers of symmetry-assumption cycles to gain the highest F-score. Our proposed methods show statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word-alignment models over parallel corpora for high-resource languages, while handling low-resource languages well.
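The evaluation protocol described above can be sketched as follows. This is an illustrative sketch, not the article's code; the toy gold dictionary and induced pairs (Indonesian/Malay-flavored words paired with English) are invented for the example.

```python
# Hedged sketch: scoring induced translation pairs against a gold dictionary
# with precision, recall, and F-score, the metrics named in the abstract.
# The word pairs below are invented for illustration.

def evaluate(induced, gold):
    """Compute precision, recall, and F1 of a set of induced translation pairs."""
    induced, gold = set(induced), set(gold)
    tp = len(induced & gold)                      # correctly induced pairs
    precision = tp / len(induced) if induced else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("air", "water"), ("api", "fire"), ("tanah", "earth")}
induced = {("air", "water"), ("api", "fire"), ("air", "juice")}
p, r, f = evaluate(induced, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.67 R=0.67 F=0.67
```

Cross-validation over the cognate and cognate-synonym thresholds then amounts to re-running such an evaluation per hyperparameter setting and keeping the one with the highest F-score.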

    Multilingual Lexicon Extraction under Resource-Poor Language Pairs

    In general, bilingual and multilingual lexicons are important resources in many natural language processing fields such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs such as Korean–French, so it is important that resources for these language pairs be publicly available or easily obtainable from monolingual resources. This thesis presents efficient approaches for extracting bilingual single-word and multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish. The goal of this thesis is to present several efficient statistical methods for extracting translated single words and multi-words from bilingual corpora. Three approaches for single words and one approach for multi-words are proposed. The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect source and target languages: it builds context vectors from two parallel corpora sharing one pivot language and calculates their similarity scores to choose the best translation equivalents. The approach reduces the effort of using a seed dictionary for translation by using parallel rather than comparable corpora. The second approach is the extended pivot context-based approach (EPCA). This approach gathers similar context vectors for each source word to augment its context, on the assumption that similar vectors can enrich contexts; for example, young and youth can augment the context of baby. In the investigation described here, such similar vectors were collected with similarity measures such as cosine similarity. The third approach for single words uses a competitive neural network algorithm (i.e., self-organizing maps, SOMs).
The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (i.e., source and target SOMs) in different ways: the source SOM is trained in an unsupervised way, while the target SOM is trained in a supervised way. The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs). This approach reinforces the PCA for multi-words (PCAM). It extracts bilingual MWEs taking all constituents of the source MWEs into consideration. The PCAM first identifies MWE candidates by pointwise mutual information and then adds them to the input data as single units in order to use the PCA directly. The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA demonstrated good performance for such language pairs; the EPCA did not perform as well as expected. The CTA performs well even when word contexts are insufficient, and the experimental results show that it significantly outperforms the PCAM. In the future, homonyms (i.e., homographs such as lead or tear) should be considered, and the domains of the bilingual corpora should be identified. In addition, more parts of speech such as verbs, adjectives, or adverbs could be tested; in this thesis, only nouns are discussed, for simplicity.
Finally, thorough error analysis should also be conducted. Contents: Abstract; List of Abbreviations; List of Tables; List of Figures; Acknowledgement; Chapter 1 Introduction (1.1 Multilingual Lexicon Extraction; 1.2 Motivations and Goals; 1.3 Organization); Chapter 2 Background and Literature Review (2.1 Extraction of Bilingual Translations of Single Words: 2.1.1 Context-based approach, 2.1.2 Extended approach, 2.1.3 Pivot-based approach; 2.2 Extraction of Bilingual Translations of Multi-Word Expressions: 2.2.1 MWE identification, 2.2.2 MWE alignment; 2.3 Self-Organizing Maps; 2.4 Evaluation Measures); Chapter 3 Pivot Context-Based Approach; Chapter 4 Extended Pivot Context-Based Approach; Chapter 5 SOM-Based Approach; Chapter 6 Constituent-Based Approach (Chapters 3–6 each cover Concept, Experiments with Resources and Results, and Summary); Chapter 7 Conclusions and Future Work (7.1 Conclusions; 7.2 Future Work); References
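The core ranking step of the pivot context-based approach can be sketched as follows. This is an illustrative toy, not the thesis implementation: the pivot vocabulary, the Korean source word, the French candidates, and all co-occurrence counts are invented for the example.

```python
# Illustrative sketch of the pivot context-based approach (PCA): source and
# target words are represented as context vectors over a shared pivot
# vocabulary (e.g., English), and translation candidates are ranked by
# cosine similarity. All vectors here are toy values.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

pivot_vocab = ["baby", "school", "music"]          # shared pivot dimensions
source_vec = [5, 1, 0]                              # Korean word "아기" (baby)
target_vecs = {"bébé": [4, 0, 1], "école": [0, 6, 1]}  # French candidates

best = max(target_vecs, key=lambda t: cosine(source_vec, target_vecs[t]))
print("아기 ->", best)  # 아기 -> bébé
```

The EPCA extension would add the vectors of distributionally similar source words (e.g., young, youth) to `source_vec` before ranking.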

    A survey of cross-lingual word embedding models

    Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
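One family the survey's typology covers is mapping-based methods, which learn a linear map between two monolingual embedding spaces from a seed dictionary. A minimal sketch of the orthogonal (Procrustes) variant follows; the 2-D toy embeddings are invented so that an exact rotation exists.

```python
# Minimal sketch of a mapping-based cross-lingual embedding method: learn an
# orthogonal map W from the source space to the target space over a seed
# dictionary by solving the Procrustes problem. Toy 2-D vectors only.
import numpy as np

# Rows are seed-dictionary pairs: X[i] (source) translates to Y[i] (target).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = np.array([[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]])  # X rotated 90 degrees

# Procrustes solution: with U S V^T = SVD(X^T Y), the best orthogonal map
# minimizing ||X W - Y||_F is W = U V^T.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y, atol=1e-8))  # True: the map recovers the rotation
```

With real embeddings the fit is only approximate, and translation is done by nearest-neighbor search of `x @ W` in the target space.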

    Bilingual distributed word representations from document-aligned comparable data

    We propose a new model for learning bilingual word representations from non-parallel document-aligned data. Following the recent advances in word representation learning, our model learns dense real-valued word vectors, that is, bilingual word embeddings (BWEs). Unlike prior work on inducing BWEs, which heavily relied on parallel sentence-aligned corpora and/or readily available translation resources such as dictionaries, the article reveals that BWEs may be learned solely on the basis of document-aligned comparable data, without any additional lexical resources or syntactic information. We present a comparison of our approach with previous state-of-the-art models for learning bilingual word representations from comparable data that rely on the framework of multilingual probabilistic topic modeling (MuPTM), as well as with distributional local context-counting models. We demonstrate the utility of the induced BWEs in two semantic tasks: (1) bilingual lexicon extraction, and (2) suggesting word translations in context for polysemous words. Our simple yet effective BWE-based models significantly outperform the MuPTM-based and context-counting representation models from comparable data as well as prior BWE-based models, and achieve the best reported results on both tasks for all three tested language pairs. This work was done while Ivan Vulić was a postdoctoral researcher at the Department of Computer Science, KU Leuven, supported by the PDM Kort fellowship (PDMK/14/117). The work was also supported by the SCATE project (IWT-SBO 130041) and the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (648909).
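One well-known way to train embeddings from document-aligned comparable data, in the spirit of this line of work, is to build pseudo-bilingual documents by mixing the tokens of each aligned pair so a standard monolingual embedding model sees mixed-language contexts. The sketch below shows only that corpus-construction step; the documents and the fixed shuffle seed are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of the merge-and-shuffle idea for document-aligned comparable data:
# tokens from each aligned document pair are interleaved into one
# pseudo-bilingual document. Training a standard embedding model (e.g.,
# word2vec) on many such documents yields a shared bilingual space.
import random

def merge_and_shuffle(doc_src, doc_tgt, seed=0):
    """Mix two aligned documents' tokens into one pseudo-bilingual stream."""
    merged = doc_src + doc_tgt
    random.Random(seed).shuffle(merged)   # deterministic shuffle for the demo
    return merged

en = ["the", "parliament", "votes"]
nl = ["het", "parlement", "stemt"]
pseudo = merge_and_shuffle(en, nl)
print(pseudo)  # one mixed English/Dutch token stream
```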

    A Survey of Paraphrasing and Textual Entailment Methods

    Paraphrasing methods recognize, generate, or extract phrases, sentences, or longer natural language expressions that convey almost the same information. Textual entailment methods, on the other hand, recognize, generate, or extract pairs of natural language expressions such that a human who reads (and trusts) the first element of a pair would most likely infer that the other element is also true. Paraphrasing can be seen as bidirectional textual entailment, and methods from the two areas are often similar. Both kinds of methods are useful, at least in principle, in a wide range of natural language processing applications, including question answering, summarization, text generation, and machine translation. We summarize key ideas from the two areas by considering in turn recognition, generation, and extraction methods, also pointing to prominent articles and resources. Comment: Technical Report, Natural Language Processing Group, Department of Informatics, Athens University of Economics and Business, Greece, 201
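To make the recognition task concrete, a deliberately naive entailment recognizer of the kind such surveys use as a baseline is sketched below; the overlap threshold and the example sentences are invented for illustration.

```python
# A word-overlap baseline for textual entailment *recognition*: predict
# entailment when most hypothesis words also appear in the text. Real
# systems use far richer features; this only illustrates the task setup.

def entails(text, hypothesis, threshold=0.8):
    """Guess entailment from the fraction of hypothesis words found in the text."""
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    overlap = len(t & h) / len(h) if h else 0.0
    return overlap >= threshold

print(entails("the cat sat on the mat", "the cat sat"))     # True
print(entails("the cat sat on the mat", "the dog barked"))  # False
```

Running the same check in both directions gives a crude paraphrase test, mirroring the view of paraphrasing as bidirectional entailment.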

    Unsupervised neural machine translation between the Portuguese language and the Chinese and Korean languages

    Master's thesis, Informatics, 2023, Universidade de Lisboa, Faculdade de Ciências. The purpose of this dissertation is to present a comparative and reproduction study of Unsupervised Neural Machine Translation techniques for the language pairs Portuguese (PT) → Chinese (ZH) and Portuguese (PT) → Korean (KR), taking advantage of online tools and resources. These language pairs were chosen for two main reasons. The first is the importance of Asian languages, notably Chinese, in the global panorama, together with the influence that Portuguese exerts worldwide, especially in the southern hemisphere. The second reason is purely academic: given the scarcity of Natural Language Processing (NLP) studies involving non-Germanic languages (owing to the hegemony of English), we set out to study the behaviour of unsupervised translation techniques on little-studied language pairs in order to test their robustness. Spoken by a quarter of the world's population, the Chinese language is China's trump card. According to International Chinese Language Education Week, in 2020 an estimated 200 million non-native speakers had already learned Chinese and more than 25 million were studying it that year. Given the influence the Chinese language wields, it becomes imperative to develop tools that close communication gaps, and in this global context machine translation emerges as a communication bridge between China and other cultures. South Korea, also known as one of the four Asian tigers, achieved the extraordinary feat of rising from extreme poverty to become one of the most developed countries in the world within two generations.
Although it lacks China's economic hegemony, South Korea exerts considerable influence through the soft power of its entertainment industry, known as hallyu. This "wave" of Korean pop culture draws crowds to learn about the culture. To break down the communication barrier between fans of Korean culture and native speakers, machine translation is a strong ally, as it enables people to interact instantly without having to learn a new language. Although Portugal has no direct cultural ties with Korea, it has strong links with the Macao Special Administrative Region, where Portuguese is one of the official languages; machine translation between the two official languages is one of the local government's strategic areas, and a Machine Translation laboratory has been established at the Macao Polytechnic Institute to build a system that can assist translators in the public administration. In this work, two approaches were pursued: (i) Unsupervised Neural Machine Translation and (ii) the pivot approach. Since the focus of the dissertation is unsupervised techniques, neither architecture used parallel data between the language pairs in question: the first approach used monolingual data, while the second introduced a third, pivot language to bridge the source and target languages. This approach to machine translation arose from the need to build translation systems for language pairs with little or no parallel data. As shown by Koehn and Knowles [2017a], neural machine translation needs large amounts of data to outperform Statistical Machine Translation (SMT), which is not feasible for language pairs with few linguistic resources.
To that end, the unsupervised machine translation architecture requires only monolingual data. The chosen implementation was that of Artetxe et al. [2018d], which consists of an encoder-decoder architecture. Since it contains a double encoder, both directions were considered for this approach: Portuguese ↔ Chinese and Portuguese ↔ Korean. In addition to the reproduction for dissimilar low-resource languages, a replication study of the original article was also carried out using the data of one of the language pairs studied by its authors: English ↔ French. Another alternative to the lack of parallel corpora is the pivot approach. In this approach, the system makes use of a third language, called the pivot, which links the source language to the target language. This option is considered when parallel data between each of the two languages and the pivot exists in abundance. The motivation of this method is to exploit the performance neural networks achieve when fed large volumes of data: with large parallel corpora between all the languages in question and the pivot, the performance of the networks compensates for the error propagation introduced by the intermediate language. In our case, the chosen pivot language was English, given the strong availability of parallel data between the pivot and the other three languages. The system first translates from Portuguese into English and then translates the pivot into Korean or Chinese. Unlike the first approach, only the directions Portuguese → Chinese and Portuguese → Korean were considered. This approach was implemented with the OpenNMT framework developed by Klein et al. [2017]. The results were evaluated using the BLEU metric [Papineni et al., 2002b], which made it possible to compare the performance of the two architectures and determine which method is more effective for dissimilar low-resource language pairs.
For the directions Portuguese → Chinese and Portuguese → Korean, the pivot approach was superior, obtaining a BLEU of 13.37 points for Portuguese → Chinese and a BLEU of 17.28 points for Portuguese → Korean. With the unsupervised neural machine translation approach, the highest value obtained for Portuguese → Korean was a BLEU of 0.69, while for Portuguese → Chinese it was 0.32 BLEU (out of a total of 100). The unsupervised translation values are in line with those obtained by Guzmán et al. [2019] and Kim et al. [2020]. The explanation given for these low values lies in the quality of the cross-lingual embeddings: their performance tends to degrade when mapping distant language pairs, and since the unsupervised translation model is initialized with these embeddings, when they are of low quality the model does not converge to a good local optimum, resulting in the values obtained in this dissertation. Of the two methods tested, the pivot approach performs best. As both the current literature and the results obtained in this dissertation show, the unsupervised neural method proposed by Artetxe et al. [2018d] is not robust enough to bootstrap a translation system supported only by monolingual texts in distant languages. It is nevertheless a promising approach, because it would fill one of the major gaps in machine translation, namely the lack of good-quality parallel data; however, the problem of cross-lingual embeddings mapping distant languages would need further attention.
This work provides insight into the study of unsupervised techniques for distant language pairs and offers a path for building machine translation systems for the Portuguese–Chinese and Portuguese–Korean language pairs using monolingual data. This dissertation presents a comparative and reproduction study of Unsupervised Neural Machine Translation techniques for the language pairs Portuguese (PT) → Chinese (ZH) and Portuguese (PT) → Korean (KR). We chose these language pairs for two main reasons. The first refers to the importance that Asian languages play in the global panorama and the influence that Portuguese has in the southern hemisphere. The second reason is purely academic: since there is a lack of studies in the area of Natural Language Processing (NLP) regarding non-Germanic languages, we focused on studying the behaviour of unsupervised techniques on under-studied languages. In this dissertation, we worked on two approaches: (i) Unsupervised Neural Machine Translation and (ii) the pivot approach. The first approach uses only monolingual corpora; the second uses parallel corpora between the pivot and the non-pivot languages. The unsupervised approach was devised to mitigate the problem of low-resource languages, for which training traditional Neural Machine Translation systems is unfeasible because large amounts of data are required to achieve promising results; as such, unsupervised machine translation only requires monolingual corpora. In this dissertation we chose the implementation of Artetxe et al. [2018d] to develop our work. Another alternative to the lack of parallel corpora is the pivot approach, in which the system uses a third language (called the pivot) that connects the source language to the target language.
The reasoning behind this is to take advantage of the performance of neural networks when fed large amounts of data, enough to counterbalance the error propagation introduced by adding a third language. The results were evaluated using the BLEU metric and showed that for both language pairs, Portuguese → Chinese and Portuguese → Korean, the pivot approach performed better, making it a more suitable choice for these dissimilar low-resource language pairs.
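The BLEU scores reported above can be illustrated with a minimal sentence-level computation. This sketch uses only unigram and bigram precision with a brevity penalty; the thesis's actual evaluation would use standard 4-gram corpus BLEU (e.g., via sacreBLEU), so treat this as a toy of the metric's shape, not the exact scorer.

```python
# Hedged sketch of sentence-level BLEU: n-gram precisions (here up to
# bigrams) combined by a geometric mean, scaled by a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=2):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c[g], r[g]) for g in c)   # clipped n-gram matches
        total = max(sum(c.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)  # floor avoids log(0)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

score = bleu("o gato sentou no tapete", "o gato sentou no tapete")
print(round(100 * score, 2))  # 100.0 for a perfect match
```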

    Multilingual Neural Translation

    Machine translation (MT) refers to technology that automatically translates content in one language into other languages. An important research area in the field of natural language processing, machine translation has typically been considered one of the most challenging yet exciting problems. Thanks to research progress in data-driven statistical machine translation (SMT), MT is now capable of providing adequate translation services in many language directions and has been widely deployed in various practical applications and scenarios. Nevertheless, the SMT framework has several drawbacks: its dependence on separate components, its simple modeling approach, and its ignorance of global context in the translation process. These inherent drawbacks prevent over-tuned SMT models from gaining any noticeable improvements. Furthermore, SMT is unable to formulate a multilingual approach in which more than two languages are involved; the typical workaround is to develop multiple pair-wise SMT systems and connect them in a complex bundle to perform multilingual translation. These limitations have called for innovative approaches to address them effectively. On the other hand, research on artificial neural networks has progressed rapidly since the beginning of the last decade, thanks to improvements in computation, i.e., faster hardware. Among other machine learning approaches, neural networks are known to be able to capture complex dependencies and learn latent representations, so it is natural to apply them to machine translation. First attempts revolved around replacing SMT sub-components with neural counterparts. Later attempts were more revolutionary, fundamentally replacing the whole core of SMT with neural networks, in what is now popularly known as neural machine translation (NMT).
NMT is an end-to-end system that directly estimates the translation model between the source and target sentences. It was later discovered to capture the inherent hierarchical structure of natural language; this key property of NMT enables a new training paradigm and a less complex approach to multilingual machine translation using neural models. This thesis contributes to the transition from using neural components in SMT to completely end-to-end NMT and, most importantly, is among the pioneers in building a neural multilingual translation system. First, we proposed an advanced neural-based component: the neural network discriminative word lexicon, which provides global coverage of the source sentence during the translation process. We aim to alleviate the problems of phrase-based SMT models that are caused by the way phrase-pair likelihoods are estimated: such models are unable to gather information from beyond the phrase boundaries. In contrast, our discriminative word lexicon exploits both the local and global contexts of the source sentences and models the translation using deep neural architectures. Our model greatly improved translation quality when applied to different translation tasks. Moreover, it motivated the later development of end-to-end NMT architectures, in which both the source and target sentences are represented with deep neural networks. The second and most significant contribution of this thesis is the idea of extending an NMT system to a multilingual neural translation framework without modifying its architecture. Based on the ability of deep neural networks to model complex relationships and structures, we use NMT to learn and share cross-lingual information that benefits all translation directions.
To achieve that purpose, we take two steps: first, we incorporate language information into the training corpora so that the NMT system learns a common semantic space across languages, and then we force it to translate into the desired target languages. The compelling aspect of the approach compared to other multilingual methods is that our multilingual extension is applied in the preprocessing phase, so no change needs to be made inside the NMT architecture. Our proposed method, a universal approach for multilingual MT, couples seamlessly with any NMT architecture and thus makes the multilingual extension of NMT systems effortless. Our experiments, and studies by others, have successfully employed the approach with numerous different NMT architectures, demonstrating its universality. Our multilingual neural machine translation accommodates cross-lingual information in a learned common semantic space to improve every translation direction together, and it has been effectively applied and evaluated in various scenarios. We developed a multilingual translation system that relies on both source and target data to boost the quality of a single translation direction. Another system can be deployed as a multilingual translation system that needs to be trained only once on a multilingual corpus yet can translate between many languages simultaneously, delivering quality more favorable than that of many separately trained translation systems. Such a system, able to learn from large corpora of well-resourced language pairs such as English → German or English → French, has proved to enhance translation directions of low-resourced language pairs like English → Lithuanian or German → Romanian. Moreover, we show that this kind of approach can be applied even to the extreme case of zero-resource translation, where no parallel data is available for training, without the need for pivot techniques.
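The preprocessing-only multilingual extension described above can be sketched as follows. The `<2xx>` tag format and the toy corpus are assumptions for illustration, not necessarily the thesis's exact conventions.

```python
# Sketch of the universal multilingual extension: multilingual training data
# is built purely in preprocessing by tagging each source sentence with its
# desired target language, so an unmodified NMT architecture can be trained
# on the mixed corpus and steered at inference time by the same tag.

def tag_for_target(src_sentence, target_lang):
    """Prepend a target-language token to steer a shared NMT model."""
    return f"<2{target_lang}> {src_sentence}"

corpus = [
    ("A cat sits.", "Eine Katze sitzt.", "de"),
    ("A cat sits.", "Un chat est assis.", "fr"),
]
mixed = [(tag_for_target(src, lang), tgt) for src, tgt, lang in corpus]
for src, tgt in mixed:
    print(src, "=>", tgt)
# <2de> A cat sits. => Eine Katze sitzt.
# <2fr> A cat sits. => Un chat est assis.
```

Because the tag lives in the data rather than the model, the same trick transfers to any sequence-to-sequence architecture without code changes.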
The research topics of this thesis are not limited to broadening the application scope of our multilingual approach; we also focus on improving its efficiency in practice. Our multilingual models have been further improved to adequately address multilingual systems with a large number of languages. The proposed strategies are effective at achieving better performance in multi-way translation scenarios with greatly reduced training time. Beyond academic evaluations, we deployed the multilingual ideas in the lecture-themed spontaneous speech translation service (Lecture Translator) at KIT. Interestingly, a derivative product of our systems, a multilingual word embedding corpus available in a dozen languages, can serve as a useful resource for cross-lingual applications such as cross-lingual document classification, information retrieval, textual entailment, or question answering. Detailed analysis shows excellent performance with regard to semantic-similarity metrics when using the embeddings on standard cross-lingual classification tasks.
