A random forest system combination approach for error detection in digital dictionaries
When digitizing a print bilingual dictionary, whether via optical character
recognition or manual entry, it is inevitable that errors are introduced into
the electronic version that is created. We investigate automating the process
of detecting errors in an XML representation of a digitized print dictionary
using a hybrid approach that combines rule-based, feature-based, and language
model-based methods. We investigate combining methods and show that using
random forests is a promising approach. We find that in isolation, unsupervised
methods rival the performance of supervised methods. Random forests typically
require training data, so we investigate how we can apply random forests to
combine individual base methods that are themselves unsupervised without
requiring large amounts of training data. Experiments reveal empirically that a
relatively small amount of data is sufficient and can potentially be further
reduced through specific selection criteria.
Comment: 9 pages, 7 figures, 10 tables; appeared in Proceedings of the
Workshop on Innovative Hybrid Approaches to the Processing of Textual Data,
April 201
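The combination idea in this abstract can be illustrated with a minimal sketch. It assumes each base method (rule-based, feature-based, language model-based) emits a suspicion score in [0, 1] for every dictionary entry, and it trains a forest of bootstrapped one-split trees (decision stumps) on a handful of labeled entries. The stump forest is a simplified stand-in for a full random forest, not the authors' implementation; the scores, labels, and function names are hypothetical.

```python
import random

# Hypothetical scores per entry: (rule_score, feature_score, lm_score).
# Label 1 = entry contains an error, 0 = entry is clean.
TRAIN_ROWS = [
    (0.10, 0.20, 0.15), (0.05, 0.30, 0.10),
    (0.20, 0.10, 0.60), (0.15, 0.25, 0.20),   # third row: LM false alarm
    (0.90, 0.80, 0.85), (0.10, 0.85, 0.90),   # second row: rules missed it
    (0.80, 0.30, 0.75), (0.85, 0.90, 0.20),
]
TRAIN_LABELS = [0, 0, 0, 0, 1, 1, 1, 1]


def train_stump(rows, labels, feature):
    """Choose the threshold on one feature with the fewest training errors.

    The stump predicts "error" when the score is >= the threshold.
    """
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        errors = sum((r[feature] >= threshold) != bool(y)
                     for r, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return feature, best[1]


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random feature
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Majority vote of the stumps: True means 'likely error'."""
    votes = sum(row[feature] >= threshold for feature, threshold in forest)
    return votes * 2 > len(forest)


forest = train_forest(TRAIN_ROWS, TRAIN_LABELS)
```

Only the combiner needs labels here; the base scores themselves could come from unsupervised methods, which is why a small labeled set may be enough in a setup like this.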
Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Many important forms of data are stored digitally in XML format. Errors can
occur in the textual content of the data in the fields of the XML. Fixing these
errors manually is time-consuming and expensive, especially for large amounts
of data. There is increasing interest in the research, development, and use of
automated techniques for assisting with data cleaning. Electronic dictionaries
are an important form of data, often stored in XML format, that frequently
have errors introduced through a mixture of manual typographical entry errors
and optical character recognition errors. In this paper we describe methods for
flagging statistical anomalies as likely errors in electronic dictionaries
stored in XML format. We describe six systems based on different sources of
information. The systems detect errors using various signals in the data
including uncommon characters, text length, character-based language models,
word-based language models, tied-field length ratios, and tied-field
transliteration models. Four of the systems detect errors based on expectations
automatically inferred from content within elements of a single field type. We
call these single-field systems. Two of the systems detect errors based on
correspondence expectations automatically inferred from content within elements
of multiple related field types. We call these tied-field systems. For each
system, we provide an intuitive analysis of the type of error that it is
successful at detecting. Finally, we describe two larger-scale evaluations
using crowdsourcing with Amazon's Mechanical Turk platform and using the
annotations of a domain expert. The evaluations consistently show that the
systems are useful for improving the efficiency with which errors in XML
electronic dictionaries can be detected.
Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016
IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna
Hills, CA, USA, pages 79-86, February 201
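Three of the signals named in this abstract (text length, uncommon characters, and tied-field length ratios) can be sketched as simple statistical checks. This is an illustration under assumptions, not the paper's systems: the z-score cutoff, the character-share threshold, the function names, and the sample data are all hypothetical, and field values are assumed to have been extracted from the XML already.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_outliers(items, scores, cutoff=2.0):
    """Return items whose score is more than `cutoff` std devs from the mean."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []  # all scores identical, nothing stands out
    return [item for item, s in zip(items, scores)
            if abs(s - mu) / sigma > cutoff]


def length_anomalies(values, cutoff=2.0):
    """Single-field check: flag values with unusual text length."""
    return zscore_outliers(values, [len(v) for v in values], cutoff)


def rare_char_anomalies(values, min_share=0.01):
    """Single-field check: flag values containing characters that make up
    less than `min_share` of all characters seen in this field."""
    counts = Counter(ch for v in values for ch in v)
    total = sum(counts.values())
    rare = {ch for ch, n in counts.items() if n / total < min_share}
    return [v for v in values if any(ch in rare for ch in v)]


def length_ratio_anomalies(pairs, cutoff=2.0):
    """Tied-field check: flag (field_a, field_b) pairs whose length ratio
    is unusual compared with the other pairs."""
    ratios = [len(a) / max(len(b), 1) for a, b in pairs]
    return zscore_outliers(pairs, ratios, cutoff)


# Hypothetical headword field; the last value looks like merged entries.
HEADWORDS = ["casa", "perro", "gato", "mesa", "libro", "silla", "puerta",
             "casaperrogatomesalibro"]
flagged = length_anomalies(HEADWORDS)
```

Each check infers its expectations from the data itself, so no labeled examples are required; flagged values are candidates for human review rather than confirmed errors.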