596 research outputs found

    Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

    Similar text fragments extraction from weakly formalized data is a task in natural language processing and intelligent data analysis, used to automatically identify connected knowledge fields. To search for such common communities in Wikipedia, we propose to use, as an additional stage, a logical-algebraic model for similar collocations extraction. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. With WordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can provide an indication of key common up-to-date Wikipedia communities.
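    The core matching step described above can be sketched as follows. This is a minimal illustration only: the tiny hand-coded synonym sets stand in for WordNet synsets, and the word-by-word comparison stands in for the paper's logical-algebraic model.

```python
# Toy synonym lexicon standing in for WordNet synsets.
SYNSETS = {
    "big": {"big", "large"},
    "large": {"big", "large"},
    "data": {"data", "information"},
    "information": {"data", "information"},
}

def synonymous(word_a, word_b):
    """Two words match if they are equal or share a synonym set."""
    return word_a == word_b or word_b in SYNSETS.get(word_a, set())

def similar_collocations(colloc_a, colloc_b):
    """Equal-length collocations match word-by-word via synonyms."""
    tokens_a, tokens_b = colloc_a.split(), colloc_b.split()
    if len(tokens_a) != len(tokens_b):
        return False
    return all(synonymous(a, b) for a, b in zip(tokens_a, tokens_b))

print(similar_collocations("big data", "large information"))  # True
print(similar_collocations("big data", "large numbers"))      # False
```

    Counting how often such synonymous pairs occur across articles from different portals would then give the frequency statistics the abstract reports.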

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system which combines query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve system performance.
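    The word-by-word compound translation with probabilistic disambiguation can be sketched as below. The dictionary entries and probability values are invented for illustration; the paper's actual model is estimated from data.

```python
from itertools import product

# Toy P(english | japanese base word) dictionary; values are invented.
BASE_DICT = {
    "情報": [("information", 0.9), ("intelligence", 0.1)],
    "検索": [("retrieval", 0.7), ("search", 0.3)],
}

def translate_compound(base_words):
    """Enumerate compound translations, ranked by joint probability."""
    candidates = []
    for combo in product(*(BASE_DICT[w] for w in base_words)):
        phrase = " ".join(eng for eng, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p  # assume base words translate independently
        candidates.append((phrase, prob))
    return sorted(candidates, key=lambda c: -c[1])

best, _ = translate_compound(["情報", "検索"])[0]
print(best)  # information retrieval
```

    An unlisted base word would instead be routed to the transliteration module, which searches for the phonetically closest English string.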

    Bootstrapping word alignment via word packing

    We introduce a simple method to pack words for statistical word alignment. Our goal is to simplify the task of automatic word alignment by packing several consecutive words together when we believe they correspond to a single word in the opposite language. This is done using the word aligner itself, i.e. by bootstrapping on its output. We evaluate the performance of our approach on a Chinese-to-English machine translation task, and report a 12.2% relative increase in BLEU score over a state-of-the-art phrase-based SMT system.
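    The packing step can be sketched as follows: consecutive source words that a first-pass aligner links to the same target word are glued into one token before re-alignment. The French example and the alignment links are invented for illustration (the paper evaluates on Chinese-to-English), and real aligner output would allow many-to-many links.

```python
def pack_words(source_tokens, alignment):
    """Glue consecutive source words aligned to one target word.

    alignment: list of (source_index, target_index) links,
    assumed here to have at most one link per source word.
    """
    tgt_of = dict(alignment)
    packed, i = [], 0
    while i < len(source_tokens):
        j = i + 1
        # extend the span while following words share i's target word
        while (j < len(source_tokens)
               and tgt_of.get(i) is not None
               and tgt_of.get(j) == tgt_of.get(i)):
            j += 1
        packed.append("_".join(source_tokens[i:j]))
        i = j
    return packed

tokens = ["je", "ne", "sais", "pas", "bien"]
links = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]  # "ne sais pas" -> "don't"
print(pack_words(tokens, links))  # ['je', 'ne_sais_pas', 'bien']
```

    Re-running the aligner on the packed corpus then yields the bootstrapped alignment the abstract describes.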

    Bilingual contexts from comparable corpora to mine for translations of collocations

    Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016.
    Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
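    The ranking principle behind such distributional methods can be sketched as below: each expression gets a context vector of co-occurrence counts, and candidates are ranked by cosine similarity to the source collocation's vector. The context features, counts, and Spanish candidates are invented toy data, not the paper's model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Context vector of the source collocation "make a decision" (toy counts
# over shared bilingual context features).
src_vec = {"decision": 4, "court": 3, "appeal": 2}

# Candidate target expressions with their context vectors.
candidates = {
    "tomar una decisión": {"decision": 5, "court": 2, "appeal": 1},
    "hacer una pregunta": {"question": 6, "answer": 3},
}

best = max(candidates, key=lambda c: cosine(src_vec, candidates[c]))
print(best)  # tomar una decisión
```

    The paper's contribution is in how the shared bilingual context space is constructed from comparable corpora; the similarity ranking itself is the standard step shown here.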

    Bilingually motivated word segmentation for statistical machine translation

    We introduce a bilingually motivated word segmentation approach for languages whose word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Our approach is motivated by the insight that PB-SMT systems can be improved by optimizing the input representation to reduce the predictive power of translation models. We first present an approach to optimize the existing segmentation of both source and target languages for PB-SMT, and demonstrate its effectiveness on a Chinese–English MT task by measuring the influence of the segmentation on the performance of PB-SMT systems. We report a 5.44% relative increase in BLEU score and consistent increases according to other metrics. We then generalize this method to Chinese word segmentation without relying on any segmenters, and show that, using our segmentation, PB-SMT can achieve more consistent state-of-the-art performance across two domains. Our approach has two main advantages. First, it is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, it does not rely on manually segmented training data, so it can be automatically adapted to different domains.
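    For readers unfamiliar with the underlying problem, a generic segmenter for unmarked-boundary text can be sketched as a Viterbi search over all segmentations under a unigram model. This illustrates the search space only; the paper's method scores segmentations bilingually via alignment rather than with the monolingual toy probabilities used here.

```python
import math

# Toy unigram word probabilities (invented for illustration).
UNIGRAM = {"中国": 0.4, "中": 0.1, "国": 0.1, "银行": 0.4}

def segment(text):
    """Best segmentation under a unigram model (DP over prefixes)."""
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)  # best (log-prob, words) per prefix
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in UNIGRAM:
                score = best[start][0] + math.log(UNIGRAM[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]

print(segment("中国银行"))  # ['中国', '银行']
```

    Replacing the unigram scores with scores derived from word alignment to the target language is what makes a segmentation "bilingually motivated" in the sense of the abstract.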

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of multiword expressions (MWEs). These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches.

    Developing online parallel corpus-based processing tools for translation research and pedagogy

    Master's thesis (dissertação de mestrado), Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente, Florianópolis, 2013.
    Abstract: This study describes the key steps in developing online parallel corpus-based tools for processing COPA-TRAD (Corpus Paralelo de Tradução, copa-trad.ufsc.br), a parallel corpus compiled for translation research and pedagogy. The study draws on Fernandes's (2009) proposal for corpus compilation, which divides the compiling process into three main parts: corpus design, corpus building and corpus processing. This compiling process received contributions from the good development practices of Software Engineering, especially the ones advocated by Pressman (2011). The tools developed can, for example, assist in the investigation of certain types of texts and of translational practices related to linguistic patterns such as collocations and semantic prosody. As a result of these applications, COPA-TRAD becomes a suitable tool for the investigation of empirical phenomena with a view to translation research and pedagogy.
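    The kind of collocation lookup such corpus tools support can be sketched as a minimal parallel concordancer: given a query, return the aligned sentence pairs whose source side contains it. The sentence pairs and query below are invented examples, not COPA-TRAD data or its API.

```python
# Toy parallel corpus of (English, Portuguese) sentence pairs.
corpus = [
    ("He made a decision.", "Ele tomou uma decisão."),
    ("She made a cake.", "Ela fez um bolo."),
    ("They made a decision quickly.", "Eles tomaram uma decisão rapidamente."),
]

def concordance(query, pairs):
    """Return aligned pairs whose source side contains the query."""
    query = query.lower()
    return [(src, tgt) for src, tgt in pairs if query in src.lower()]

for src, tgt in concordance("made a decision", corpus):
    print(src, "|", tgt)
```

    Browsing such hits side by side is what lets a researcher or student observe how a collocation is actually rendered in translated text.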