Translation Memory Retrieval Methods
Translation Memory (TM) systems are one of the most widely used translation
technologies. An important part of TM systems is the matching algorithm that
determines what translations get retrieved from the bank of available
translations to assist the human translator. Although detailed accounts of the
matching algorithms used in commercial systems cannot be found in the
literature, it is widely believed that edit distance algorithms are used. This
paper investigates and evaluates the use of several matching algorithms,
including the edit distance algorithm that is believed to be at the heart of
most modern commercial TM systems. We present results showing how well
various matching algorithms correlate with human judgments of helpfulness
(collected via crowdsourcing with Amazon's Mechanical Turk). A new algorithm
based on weighted n-gram precision that can be adjusted for translator length
preferences consistently returns translations judged to be most helpful by
translators for multiple domains and language pairs.
Comment: 9 pages, 6 tables, 3 figures; appeared in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, April 2014
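To make the matching idea concrete, the following minimal Python sketch ranks TM source segments by a weighted n-gram precision score with a simple length-preference knob. The uniform weights over 1- to 4-grams and the exact form of the length discount are illustrative assumptions, not the paper's tuned formulation.

    from collections import Counter

    def ngrams(tokens, n):
        """Return the multiset of n-grams in a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def weighted_ngram_precision(query, candidate,
                                 weights=(0.25, 0.25, 0.25, 0.25),
                                 length_penalty=1.0):
        """Score a TM candidate segment against the segment to be translated.

        weights        -- per-order n-gram weights (uniform here; the paper's
                          tuned values are not reproduced)
        length_penalty -- exponent controlling how strongly longer candidates
                          are discounted, standing in for the adjustable
                          translator length preference
        """
        q, c = query.split(), candidate.split()
        score = 0.0
        for n, w in enumerate(weights, start=1):
            cand = ngrams(c, n)
            if not cand:
                continue
            overlap = sum((ngrams(q, n) & cand).values())  # clipped matches
            score += w * overlap / sum(cand.values())
        ratio = min(1.0, len(q) / max(len(c), 1))  # discount long candidates
        return score * ratio ** length_penalty

    def retrieve(query, tm_sources):
        """Return TM source segments ranked best-first by match score."""
        return sorted(tm_sources,
                      key=lambda s: weighted_ngram_precision(query, s),
                      reverse=True)

In a real TM system, the stored translation of each retrieved source segment would be shown to the translator alongside its match score.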
Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Many important forms of data are stored digitally in XML format. Errors can
occur in the textual content of the data in the fields of the XML. Fixing these
errors manually is time-consuming and expensive, especially for large amounts
of data. There is increasing interest in the research, development, and use of
automated techniques for assisting with data cleaning. Electronic dictionaries
are an important form of data frequently stored in XML format; they often
have errors introduced through a mixture of manual typographical entry errors
and optical character recognition errors. In this paper we describe methods for
flagging statistical anomalies as likely errors in electronic dictionaries
stored in XML format. We describe six systems based on different sources of
information. The systems detect errors using various signals in the data,
including uncommon characters, text length, character-based language models,
word-based language models, tied-field length ratios, and tied-field
transliteration models. Four of the systems detect errors based on expectations
automatically inferred from content within elements of a single field type. We
call these single-field systems. Two of the systems detect errors based on
correspondence expectations automatically inferred from content within elements
of multiple related field types. We call these tied-field systems. For each
system, we provide an intuitive analysis of the type of error that it is
successful at detecting. Finally, we describe two larger-scale evaluations
using crowdsourcing with Amazon's Mechanical Turk platform and using the
annotations of a domain expert. The evaluations consistently show that the
systems are useful for improving the efficiency with which errors in XML
electronic dictionaries can be detected.
Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 2016
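As one concrete example of a single-field system, the following Python sketch fits a smoothed character bigram language model to all values of one field type and flags the lowest-probability entries as likely errors. The add-alpha smoothing and length normalization are illustrative choices, not necessarily the paper's exact configuration.

    import math
    from collections import Counter

    def train_char_lm(field_values, alpha=0.5):
        """Fit an add-alpha smoothed character bigram model to one field type."""
        bigrams, contexts = Counter(), Counter()
        for text in field_values:
            padded = "^" + text + "$"            # boundary markers
            contexts.update(padded[:-1])
            bigrams.update(zip(padded, padded[1:]))
        vocab = len(contexts) + 1                # +1 leaves mass for unseen chars
        def avg_logprob(text):
            padded = "^" + text + "$"
            lp = sum(math.log((bigrams[(a, b)] + alpha) /
                              (contexts[a] + alpha * vocab))
                     for a, b in zip(padded, padded[1:]))
            return lp / (len(padded) - 1)        # length-normalized score
        return avg_logprob

    def flag_anomalies(field_values, k=10):
        """Return the k entries of a field type the model finds least probable."""
        score = train_char_lm(field_values)
        return sorted(field_values, key=score)[:k]

A tied-field system would instead score correspondences between related fields, for example the ratio of lengths between a headword and its transliteration.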
Using Global Constraints and Reranking to Improve Cognates Detection
Global constraints and reranking have not been used in cognates detection
research to date. We propose methods for using global constraints by performing
rescoring of the score matrices produced by state-of-the-art cognates detection
systems. Using global constraints to perform rescoring is complementary to
state-of-the-art cognates detection methods and yields significant improvements
beyond current state-of-the-art performance on publicly available datasets
covering different language pairs, different levels of baseline performance,
and different data sizes, including larger, more realistic data sizes than have
been evaluated in the past.
Comment: 10 pages, 6 figures, 6 tables; published in the Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1983-1992, Vancouver, Canada, July 2017
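One simple way to impose a global constraint is to require a one-to-one correspondence between source and target words and rescore the pairs selected by the globally optimal assignment. The sketch below uses the Hungarian algorithm via SciPy; the additive bonus is an illustrative rescoring scheme, not necessarily the paper's exact method.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def rescore_with_global_constraint(scores, bonus=0.5):
        """Rescore a cognates score matrix under a one-to-one global constraint.

        scores -- (n_source_words, n_target_words) matrix from a baseline
                  cognates detection system; higher means more cognate-like
        bonus  -- additive boost for pairs chosen by the globally optimal
                  one-to-one assignment (hypothetical rescoring scheme)
        """
        rows, cols = linear_sum_assignment(-scores)  # negate to maximize
        rescored = scores.astype(float).copy()
        rescored[rows, cols] += bonus
        return rescored

Because each source word can then dominate at most one target word, spuriously high scores that conflict with stronger matches elsewhere in the matrix are not reinforced.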
Geocoding Large Population‐level Administrative Datasets at Highly Resolved Spatial Scales
Using geographic information systems to link administrative databases with demographic, social, and environmental data allows researchers to use spatial approaches to explore relationships between exposures and health. Traditionally, spatial analysis in public health has focused on the county, ZIP code, or tract level because of limitations to geocoding at highly resolved scales. Using 2005 birth and death data from North Carolina, we examine our ability to geocode population-level datasets at three spatial resolutions: ZIP code, street, and parcel. We achieve high geocoding rates at all three resolutions, with statewide street geocoding rates of 88.0% for births and 93.2% for deaths. We observe differences in geocoding rates across demographics and health outcomes, with lower geocoding rates in disadvantaged populations and the most dramatic differences occurring across the urban-rural spectrum. Our results suggest that highly resolved spatial data architectures for population-level datasets are viable through geocoding individual street addresses. We recommend routinely geocoding administrative datasets to the highest spatial resolution feasible, allowing public health researchers to choose the spatial resolution used in analysis based on an understanding of the spatial dimensions of the health outcomes and exposures being investigated. Such research, however, must acknowledge how disparate geocoding success across subpopulations may affect findings.
Peer reviewed: http://deepblue.lib.umich.edu/bitstream/2027.42/108258/1/tgis12052.pdf
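The subgroup comparison at the heart of the evaluation can be expressed in a few lines of pandas. The column names and resolution coding below are hypothetical placeholders, not the actual North Carolina data dictionary.

    import pandas as pd

    def geocoding_rates(df, group_col, resolution_col="geocode_resolution"):
        """Percent of records geocoded to street level or better, by subgroup.

        df             -- one row per vital record (birth or death)
        group_col      -- demographic or urban/rural column to stratify by
        resolution_col -- best resolution achieved per record, coded here as
                          one of {"parcel", "street", "zip", "none"}
        """
        resolved = df[resolution_col].isin(["parcel", "street"])
        return (resolved.groupby(df[group_col]).mean() * 100).round(1)

    records = pd.DataFrame({
        "geocode_resolution": ["street", "zip", "parcel", "none"],
        "urban_rural": ["urban", "rural", "urban", "rural"],
    })
    print(geocoding_rates(records, "urban_rural"))  # rural 0.0, urban 100.0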
Predicting Switch-like Behavior in Proteins using Logistic Regression on Sequence-based Descriptors
Ligands can bind at specific protein locations, inducing conformational changes such as those involving secondary structure. Identifying these possible switches from sequence, including via homology, is an important ongoing area of research. We attempt to predict possible secondary structure switches from sequence in proteins using machine learning, specifically a logistic regression approach, with 48 N-acetyltransferases as our learning set and 5 sirtuins as our test set. Validated per-residue binary assignments of 0 (no change in secondary structure) and 1 (change in secondary structure) were determined with DSSP from 3D X-ray structures for sets of virtually identical chains crystallized under different conditions. Our sequence descriptors include amino acid type, six- and twenty-term sequence entropy, Lobanov-Galzitskaya's residue disorder propensity, Vkabat (variability with respect to predictions from sequence of helix, sheet, and other), and all possible combinations. We find optimal AUC values approaching 70% for the two single-descriptor models using residue disorder propensity alone and Vkabat alone. We hope to follow up with a larger learning set and with residue charge as an additional descriptor.
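The modeling setup translates directly into scikit-learn. The descriptor matrices below are random placeholders standing in for the real per-residue features (sequence entropy, disorder propensity, Vkabat); only the train-on-one-family, test-on-another evaluation pattern is the point.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Placeholder descriptors: columns might hold sequence entropy,
    # disorder propensity, and Vkabat variability.
    X_train = rng.normal(size=(5000, 3))  # residues from the 48 N-acetyltransferases
    y_train = rng.integers(0, 2, 5000)    # 1 = secondary structure switch (DSSP)
    X_test = rng.normal(size=(800, 3))    # residues from the 5 sirtuins
    y_test = rng.integers(0, 2, 800)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"held-out AUC: {auc:.3f}")

Evaluating on a held-out protein family rather than a random residue split guards against the model memorizing family-specific sequence quirks.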
The photomultiplier tube calibration system of the MicroBooNE experiment
We report on the design and construction of an LED-based fiber calibration system for large liquid argon time projection chamber (LArTPC) detectors. This system was developed to calibrate the optical systems of the MicroBooNE experiment. As well as detailing the materials and installation procedure, we provide technical drawings and specifications so that the system may be easily replicated in future LArTPC detectors.
National Science Foundation (U.S.) (Grant PHY-1205175)
Classification and Casimir Invariants of Lie-Poisson Brackets
We classify Lie-Poisson brackets that are formed from Lie algebra extensions.
The problem is relevant because many physical systems owe their Hamiltonian
structure to such brackets. A classification involves reducing all brackets to
a set of normal forms, and is achieved partially through the use of Lie algebra
cohomology. For extensions of order less than five, the number of normal forms
is small and they involve no free parameters. We derive a general method of
finding Casimir invariants of Lie-Poisson bracket extensions. The Casimir
invariants of all low-order brackets are explicitly computed. We treat in
detail a four-field model of compressible reduced magnetohydrodynamics.
Comment: 59 pages, Elsevier macros. To be published in Physica D
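For reference, the standard Lie-Poisson bracket on the dual of a Lie algebra, and the condition defining a Casimir invariant, are (in LaTeX; these are the textbook definitions, not the paper's extension-specific constructions):

    % Lie-Poisson bracket of functionals F, G on \mathfrak{g}^*:
    \{F, G\}(\mu) = \left\langle \mu,
        \left[ \frac{\delta F}{\delta \mu}, \frac{\delta G}{\delta \mu} \right]
    \right\rangle, \qquad \mu \in \mathfrak{g}^*,
    % where [\cdot,\cdot] is the Lie bracket of \mathfrak{g}.
    % A Casimir invariant C Poisson-commutes with every functional F:
    \{C, F\} = 0 \quad \text{for all } F,
    % and is therefore conserved by every Hamiltonian flow of the bracket.

The paper's contribution concerns brackets built from Lie algebra extensions, whose Casimirs are computed by the general method described above.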