596 research outputs found

    Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

    Similar text fragments extraction from weakly formalized data is a task in natural language processing and intelligent data analysis, used to automatically identify connected knowledge fields. To search for such common communities in Wikipedia, we propose to use, as an additional stage, a logical-algebraic model for similar collocations extraction. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. With WordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can provide an indication of key common up-to-date Wikipedia communities.
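    The core matching step described above can be sketched as follows. This is a minimal illustration only: the tiny hand-coded synonym sets stand in for WordNet synsets, and the word-by-word comparison stands in for the paper's logical-algebraic model.

```python
# Toy synonym lexicon standing in for WordNet synsets.
SYNSETS = {
    "big": {"big", "large"},
    "large": {"big", "large"},
    "data": {"data", "information"},
    "information": {"data", "information"},
}

def synonymous(word_a, word_b):
    """Two words match if they are equal or share a synonym set."""
    return word_a == word_b or word_b in SYNSETS.get(word_a, set())

def similar_collocations(colloc_a, colloc_b):
    """Equal-length collocations match word-by-word via synonyms."""
    tokens_a, tokens_b = colloc_a.split(), colloc_b.split()
    if len(tokens_a) != len(tokens_b):
        return False
    return all(synonymous(a, b) for a, b in zip(tokens_a, tokens_b))

print(similar_collocations("big data", "large information"))  # True
print(similar_collocations("big data", "large numbers"))      # False
```

    Counting how often such synonymous pairs occur across articles from different portals would then give the frequency statistics the abstract reports.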

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system which combines query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords based on its special phonogram. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve system performance.
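    The word-by-word compound translation with probabilistic disambiguation can be sketched as below. The dictionary entries and probability values are invented for illustration; the paper's actual model is estimated from data.

```python
from itertools import product

# Toy P(english | japanese base word) dictionary; values are invented.
BASE_DICT = {
    "情報": [("information", 0.9), ("intelligence", 0.1)],
    "検索": [("retrieval", 0.7), ("search", 0.3)],
}

def translate_compound(base_words):
    """Enumerate compound translations, ranked by joint probability."""
    candidates = []
    for combo in product(*(BASE_DICT[w] for w in base_words)):
        phrase = " ".join(eng for eng, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p  # assume base words translate independently
        candidates.append((phrase, prob))
    return sorted(candidates, key=lambda c: -c[1])

best, _ = translate_compound(["情報", "検索"])[0]
print(best)  # information retrieval
```

    An unlisted base word would instead be routed to the transliteration module, which searches for the phonetically closest English string.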

    Bootstrapping word alignment via word packing

    We introduce a simple method to pack words for statistical word alignment. Our goal is to simplify the task of automatic word alignment by packing several consecutive words together when we believe they correspond to a single word in the opposite language. This is done using the word aligner itself, i.e. by bootstrapping on its output. We evaluate the performance of our approach on a Chinese-to-English machine translation task, and report a 12.2% relative increase in BLEU score over a state-of-the-art phrase-based SMT system.
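    The packing step can be sketched as follows: consecutive source words that a first-pass aligner links to the same target word are glued into one token before re-alignment. The French example and the alignment links are invented for illustration (the paper evaluates on Chinese-to-English), and real aligner output would allow many-to-many links.

```python
def pack_words(source_tokens, alignment):
    """Glue consecutive source words aligned to one target word.

    alignment: list of (source_index, target_index) links,
    assumed here to have at most one link per source word.
    """
    tgt_of = dict(alignment)
    packed, i = [], 0
    while i < len(source_tokens):
        j = i + 1
        # extend the span while following words share i's target word
        while (j < len(source_tokens)
               and tgt_of.get(i) is not None
               and tgt_of.get(j) == tgt_of.get(i)):
            j += 1
        packed.append("_".join(source_tokens[i:j]))
        i = j
    return packed

tokens = ["je", "ne", "sais", "pas", "bien"]
links = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]  # "ne sais pas" -> "don't"
print(pack_words(tokens, links))  # ['je', 'ne_sais_pas', 'bien']
```

    Re-running the aligner on the packed corpus then yields the bootstrapped alignment the abstract describes.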

    Bilingual contexts from comparable corpora to mine for translations of collocations

    Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016.
    Due to the limited availability of parallel data in many languages, we propose a methodology that benefits from comparable corpora to find translation equivalents for collocations (as a specific type of difficult-to-translate multi-word expressions). Finding translations is known to be more difficult for collocations than for words. We propose a method based on bilingual context extraction and build a word (distributional) representation model drawing on these bilingual contexts (bilingual English-Spanish contexts in our case). We show that the bilingual context construction is effective for the task of translation equivalent learning and that our method outperforms a simplified distributional similarity baseline in finding translation equivalents.
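    The ranking principle behind such distributional methods can be sketched as below: each expression gets a context vector of co-occurrence counts, and candidates are ranked by cosine similarity to the source collocation's vector. The context features, counts, and Spanish candidates are invented toy data, not the paper's model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Context vector of the source collocation "make a decision" (toy counts
# over shared bilingual context features).
src_vec = {"decision": 4, "court": 3, "appeal": 2}

# Candidate target expressions with their context vectors.
candidates = {
    "tomar una decisión": {"decision": 5, "court": 2, "appeal": 1},
    "hacer una pregunta": {"question": 6, "answer": 3},
}

best = max(candidates, key=lambda c: cosine(src_vec, candidates[c]))
print(best)  # tomar una decisión
```

    The paper's contribution is in how the shared bilingual context space is constructed from comparable corpora; the similarity ranking itself is the standard step shown here.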

    Bilingually motivated word segmentation for statistical machine translation

    We introduce a bilingually motivated word segmentation approach for languages whose word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Our approach is motivated by the insight that PB-SMT systems can be improved by optimizing the input representation to reduce the predictive power of translation models. We first present an approach to optimize the existing segmentation of both source and target languages for PB-SMT, and demonstrate its effectiveness on a Chinese–English MT task by measuring the influence of the segmentation on the performance of PB-SMT systems. We report a 5.44% relative increase in BLEU score and consistent increases according to other metrics. We then generalize this method to Chinese word segmentation without relying on any segmenters, and show that, using our segmentation, PB-SMT can achieve more consistent state-of-the-art performance across two domains. Our approach has two main advantages. First, it is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, it does not rely on manually segmented training data, so it can be automatically adapted to different domains.
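    For readers unfamiliar with the underlying problem, a generic segmenter for unmarked-boundary text can be sketched as a Viterbi search over all segmentations under a unigram model. This illustrates the search space only; the paper's method scores segmentations bilingually via alignment rather than with the monolingual toy probabilities used here.

```python
import math

# Toy unigram word probabilities (invented for illustration).
UNIGRAM = {"中国": 0.4, "中": 0.1, "国": 0.1, "银行": 0.4}

def segment(text):
    """Best segmentation under a unigram model (DP over prefixes)."""
    n = len(text)
    best = [(-math.inf, [])] * (n + 1)  # best (log-prob, words) per prefix
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in UNIGRAM:
                score = best[start][0] + math.log(UNIGRAM[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]

print(segment("中国银行"))  # ['中国', '银行']
```

    Replacing the unigram scores with scores derived from word alignment to the target language is what makes a segmentation "bilingually motivated" in the sense of the abstract.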

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of multiword expressions (MWEs). These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches.

    Developing online parallel corpus-based processing tools for translation research and pedagogy

    Master's thesis (dissertação de mestrado), Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente, Florianópolis, 2013.
    Abstract: This study describes the key steps in developing online parallel corpus-based tools for processing COPA-TRAD (Corpus Paralelo de Tradução, copa-trad.ufsc.br), a parallel corpus compiled for translation research and pedagogy. The study draws on Fernandes's (2009) proposal for corpus compilation, which divides the compiling process into three main parts: corpus design, corpus building and corpus processing. This compiling process received contributions from the good development practices of Software Engineering, especially the ones advocated by Pressman (2011). The tools developed can, for example, assist in the investigation of certain types of texts and of translational practices related to linguistic patterns such as collocations and semantic prosody. As a result of these applications, COPA-TRAD becomes a suitable tool for the investigation of empirical phenomena with a view to translation research and pedagogy.
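    The kind of collocation lookup such corpus tools support can be sketched as a minimal parallel concordancer: given a query, return the aligned sentence pairs whose source side contains it. The sentence pairs and query below are invented examples, not COPA-TRAD data or its API.

```python
# Toy parallel corpus of (English, Portuguese) sentence pairs.
corpus = [
    ("He made a decision.", "Ele tomou uma decisão."),
    ("She made a cake.", "Ela fez um bolo."),
    ("They made a decision quickly.", "Eles tomaram uma decisão rapidamente."),
]

def concordance(query, pairs):
    """Return aligned pairs whose source side contains the query."""
    query = query.lower()
    return [(src, tgt) for src, tgt in pairs if query in src.lower()]

for src, tgt in concordance("made a decision", corpus):
    print(src, "|", tgt)
```

    Browsing such hits side by side is what lets a researcher or student observe how a collocation is actually rendered in translated text.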