Learning to match names across languages
We report on research on matching names in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules based on phonological comparisons of the two languages carried out by humans. The second, a monolingual approach, relies only on automatic comparison of the phonological representations of each pair. Alignments produced by each approach are fed to a machine learning algorithm. Results show that the monolingual approach yields machine-learning-based comparison of person names in English and Chinese with an F-measure of over 97.0.
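As a rough illustration of the monolingual idea, the sketch below aligns two phoneme sequences with a standard edit-distance dynamic program and returns the aligned pairs, which could then serve as features for a classifier. The function name `align_phonemes`, the unit edit costs, and the toy phoneme symbols are assumptions made for illustration; the abstract does not describe the paper's actual alignment procedure.

```python
from typing import List, Optional, Tuple

# An aligned pair; None marks a gap on that side.
Pair = Tuple[Optional[str], Optional[str]]


def align_phonemes(a: List[str], b: List[str]) -> Tuple[int, List[Pair]]:
    """Levenshtein alignment of two phoneme sequences (unit costs assumed)."""
    n, m = len(a), len(b)
    # dist[i][j] holds the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # delete a[i-1]
                dist[i][j - 1] + 1,        # insert b[j-1]
                dist[i - 1][j - 1] + sub,  # match or substitute
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    # Toy phoneme sequences, not taken from the paper.
    distance, alignment = align_phonemes(["k", "a", "t"], ["k", "a"])
    print(distance, alignment)
```

In a setup like the one the abstract outlines, the aligned pairs (or counts derived from them) would be passed to a learner that decides whether the two names match.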
Generating Paired Transliterated-cognates Using Multiple Pronunciation Characteristics from Web corpora
A novel approach to automatically extracting paired transliterated-cognates from Web corpora is proposed in this paper. One of the most important issues addressed is that of taking multiple pronunciation characteristics into account. Terms from various languages may be pronounced very differently. Incorporating knowledge of word origin may improve the pronunciation accuracy of terms. The accuracy of generated phonetic information has an important impact on term transliteration and hence on transliterated-term extraction. Transliterated-term extraction is a fundamental task in natural language processing that extracts paired transliterated terms for studying term transliteration. An experiment on transliterated-term extraction from two kinds of Web resources, Web pages and anchored texts, has been conducted and evaluated. The experimental results show that many transliterated-term pairs that cannot be extracted using an approach exploiting only English pronunciation characteristics have been successfully extracted using the proposed approach. Taking multiple language-specific pronunciation transformations into account may further improve the output of transliterated-term extraction.
Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Many important forms of data are stored digitally in XML format. Errors can
occur in the textual content of the data in the fields of the XML. Fixing these
errors manually is time-consuming and expensive, especially for large amounts
of data. There is increasing interest in the research, development, and use of
automated techniques for assisting with data cleaning. Electronic dictionaries
are an important form of data, frequently stored in XML format, that often
contain errors introduced through a mixture of manual typographical entry errors
and optical character recognition errors. In this paper we describe methods for
flagging statistical anomalies as likely errors in electronic dictionaries
stored in XML format. We describe six systems based on different sources of
information. The systems detect errors using various signals in the data
including uncommon characters, text length, character-based language models,
word-based language models, tied-field length ratios, and tied-field
transliteration models. Four of the systems detect errors based on expectations
automatically inferred from content within elements of a single field type. We
call these single-field systems. Two of the systems detect errors based on
correspondence expectations automatically inferred from content within elements
of multiple related field types. We call these tied-field systems. For each
system, we provide an intuitive analysis of the type of error that it is
successful at detecting. Finally, we describe two larger-scale evaluations
using crowdsourcing with Amazon's Mechanical Turk platform and using the
annotations of a domain expert. The evaluations consistently show that the
systems are useful for improving the efficiency with which errors in XML
electronic dictionaries can be detected.
Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016
IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna
Hills, CA, USA, pages 79-86, February 201
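
Two of the signals named in the abstract, uncommon characters (a single-field signal) and tied-field length ratios, can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names, the relative-frequency cutoff, and the z-score threshold are hypothetical choices, not the paper's systems or parameters.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def flag_uncommon_chars(values: List[str], min_rel_freq: float = 0.001) -> List[int]:
    """Single-field signal: indices of values containing a character whose
    relative frequency across the whole field is below min_rel_freq (assumed cutoff)."""
    counts = Counter(ch for v in values for ch in v)
    total = sum(counts.values())
    if total == 0:
        return []
    rare = {ch for ch, c in counts.items() if c / total < min_rel_freq}
    return [i for i, v in enumerate(values) if any(ch in rare for ch in v)]


def flag_length_ratio(pairs: List[Tuple[str, str]], z_threshold: float = 3.0) -> List[int]:
    """Tied-field signal: indices of field pairs whose length ratio is a
    z-score outlier relative to the other pairs (assumed outlier test)."""
    ratios = [len(a) / max(len(b), 1) for a, b in pairs]
    if len(ratios) < 2:
        return []
    mu, sigma = mean(ratios), pstdev(ratios)
    if sigma == 0:
        return []
    return [i for i, r in enumerate(ratios) if abs(r - mu) / sigma > z_threshold]


if __name__ == "__main__":
    # Toy data: one entry with a stray '#', and one pair with an unusually long first field.
    headwords = ["abc"] * 1000 + ["ab#"]
    print(flag_uncommon_chars(headwords))
    tied = [("abcd", "ab")] * 50 + [("a" * 40, "ab")]
    print(flag_length_ratio(tied))
```

Flagged indices would then be candidates for human review rather than automatic correction, in keeping with the abstract's framing of the systems as aids to error detection.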