Translation Memory Retrieval Methods

Authors: Bloodgood Michael, Strauss Benjamin
Publication date: 01/01/2014

Translation Memory (TM) systems are one of the most widely used translation technologies. An important part of TM systems is the matching algorithm that determines what translations get retrieved from the bank of available translations to assist the human translator. Although detailed accounts of the matching algorithms used in commercial systems cannot be found in the literature, it is widely believed that edit distance algorithms are used. This paper investigates and evaluates the use of several matching algorithms, including the edit distance algorithm that is believed to be at the heart of most modern commercial TM systems. This paper presents results showing how well various matching algorithms correlate with human judgments of helpfulness (collected via crowdsourcing with Amazon's Mechanical Turk). A new algorithm based on weighted n-gram precision that can be adjusted for translator length preferences consistently returns translations judged to be most helpful by translators for multiple domains and language pairs.

Comment: 9 pages, 6 tables, 3 figures; appeared in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, April 201
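As a rough illustration of the two families of matchers the abstract mentions, the sketch below scores translation-memory candidates with a word-level edit distance and with a plain n-gram precision. It is not the paper's method: the normalization in `fuzzy_match_score`, the unweighted clipped counts in `ngram_precision`, and every function name here are assumptions made for the example. The abstract does not specify the weighting or the length adjustment of the proposed algorithm, so neither is attempted.

```python
from collections import Counter


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match_score(query, candidate):
    """Assumed normalization: 1 - distance / length of the longer segment."""
    q, c = query.split(), candidate.split()
    longest = max(len(q), len(c))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(q, c) / longest


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(query, candidate, n=2):
    """Unweighted share of candidate n-grams that also occur in the query."""
    q_counts = Counter(ngrams(query.split(), n))
    c_counts = Counter(ngrams(candidate.split(), n))
    total = sum(c_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, q_counts[g]) for g, count in c_counts.items())
    return matched / total


def retrieve(query, memory, score=fuzzy_match_score):
    """Return the stored source segment with the highest score for the query."""
    return max(memory, key=lambda src: score(query, src))


# Hypothetical translation-memory source segments.
memory = ["the dog sat", "a bird flew away", "the cat sat down"]
best = retrieve("the cat sat", memory)
```

Passing `score=ngram_precision` to `retrieve` swaps in the other matcher, which is one simple way to compare how the two rank the same candidates.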

arXiv.org e-Print Archive

CiteSeerX

Crossref

Digital Repository at the University of Maryland

Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

Authors: Bloodgood Michael, Strauss Benjamin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the data in the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data. There is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format that frequently have errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type. We call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types. We call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations using crowdsourcing with Amazon's Mechanical Turk platform and using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected.

Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 201
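To make the idea of a single-field, text-length signal concrete, here is a minimal sketch that flags elements whose text length is a statistical outlier among elements of the same field type. The z-score rule, the threshold of 2.0, the function name, and the `headword` field are all assumptions for illustration. The abstract does not say how the paper's text-length system scores anomalies.

```python
import statistics
import xml.etree.ElementTree as ET


def flag_length_anomalies(xml_text, field, threshold=2.0):
    """Return texts of <field> elements whose length is an outlier.

    Assumed rule: flag when |length - mean| / population stdev > threshold.
    """
    root = ET.fromstring(xml_text)
    texts = [(el.text or "") for el in root.iter(field)]
    lengths = [len(t) for t in texts]
    if len(lengths) < 2:
        return []
    mean = statistics.mean(lengths)
    stdev = statistics.pstdev(lengths)
    if stdev == 0:
        return []  # all values have the same length, so nothing stands out
    return [t for t, n in zip(texts, lengths)
            if abs(n - mean) / stdev > threshold]


# Hypothetical dictionary: nine short headwords and one suspicious entry.
words = ["cat", "dog", "sun", "map", "pen", "cup", "hat", "bed", "key",
         "thisisaveryverylongcorruptedentry"]
sample = "<dictionary>" + "".join(
    f"<entry><headword>{w}</headword></entry>" for w in words
) + "</dictionary>"

flagged = flag_length_anomalies(sample, "headword")
```

A flagged value is only a candidate for review, in line with the abstract's framing of anomalies as likely errors, not confirmed ones.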

arXiv.org e-Print Archive

Crossref

Digital Repository at the University of Maryland

Short-time Fourier transform laser Doppler holography

Authors: Atlan Michael, Samson Benjamin
Publication venue: 'European Optical Society'
Publication date: 05/05/2013

We report a demonstration of laser Doppler holography at a sustained acquisition rate of 250 Hz on a 1 Megapixel complementary metal-oxide-semiconductor (CMOS) sensor array and image display at 10 Hz frame rate. The holograms are optically acquired in off-axis configuration, with a frequency-shifted reference beam. Wide-field imaging of optical fluctuations in a 250 Hz frequency band is achieved by turning time-domain samplings to the dual domain via short-time temporal Fourier transformation. The measurement band can be positioned freely within the low radio-frequency spectrum by tuning the frequency of the reference beam in real-time. Video-rate image rendering is achieved by streamline image processing with commodity computer graphics hardware. This experimental scheme is validated by a non-contact vibrometry experiment.
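The short-time temporal Fourier transformation named in the abstract can be sketched for a single time series, for example the samples recorded at one pixel. This is a generic textbook short-time Fourier transform with a rectangular window and a direct DFT, not the authors' GPU pipeline. The 250 Hz sampling rate comes from the abstract, while the window size of 25 samples, the hop, and the 50 Hz test tone are assumptions chosen to keep the arithmetic easy to follow.

```python
import cmath
import math


def stft_magnitudes(samples, window_size, hop):
    """Magnitude spectra of successive windows (rectangular window, direct DFT).

    Returns one list per window holding bins 0 .. window_size // 2.
    """
    frames = []
    for start in range(0, len(samples) - window_size + 1, hop):
        window = samples[start:start + window_size]
        spectrum = []
        for k in range(window_size // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / window_size)
                      for n, x in enumerate(window))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


sample_rate = 250.0   # Hz, acquisition rate stated in the abstract
window_size = 25      # assumed: gives a bin spacing of 250 / 25 = 10 Hz
tone_hz = 50.0        # assumed test signal

signal = [math.cos(2 * math.pi * tone_hz * t / sample_rate)
          for t in range(100)]
frames = stft_magnitudes(signal, window_size, hop=window_size)

# Index of the strongest bin in each window, converted back to Hz.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
peak_hz = [b * sample_rate / window_size for b in peak_bins]
```

With these assumed values every window peaks at bin 5, which is 50 Hz. Applying the same transform independently to each pixel's time series is what would turn a stack of frames into a per-frequency image.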

arXiv.org e-Print Archive

Hal-Diderot

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

Authors: Heinzerling Benjamin, Strube Michael
Publication date: 01/01/2017

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpem
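For readers unfamiliar with Byte-Pair Encoding, the sketch below shows the generic merge-learning procedure on a toy word list: count adjacent symbol pairs, merge the most frequent pair, and repeat. It illustrates the general technique only and is not BPEmb's training code or API. The function names, the toy corpus, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # word as symbols -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Assumed tie-break: highest count, then lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=2)
segments = apply_bpe("lowest", merges)
```

On this corpus the two learned merges build the subword `est`, so the unseen word `lowest` is segmented into `l`, `o`, `w`, `est`. Each resulting subword unit is the kind of item that would then be assigned an embedding.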

arXiv.org e-Print Archive

TUbiblio