6,209 research outputs found

    Translation Memory Retrieval Methods

    Get PDF
    Translation Memory (TM) systems are one of the most widely used translation technologies. An important part of a TM system is the matching algorithm, which determines which translations are retrieved from the bank of available translations to assist the human translator. Although detailed accounts of the matching algorithms used in commercial systems cannot be found in the literature, it is widely believed that edit distance algorithms are used. This paper investigates and evaluates several matching algorithms, including the edit distance algorithm believed to be at the heart of most modern commercial TM systems, and presents results showing how well the various algorithms correlate with human judgments of helpfulness (collected via crowdsourcing with Amazon's Mechanical Turk). A new algorithm based on weighted n-gram precision, which can be adjusted for translator length preferences, consistently returns the translations judged most helpful by translators across multiple domains and language pairs.
    Comment: 9 pages, 6 tables, 3 figures; appeared in Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, April 2014
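    As a rough illustration of the two families of matching algorithms mentioned above (not the paper's exact implementations), the following Python sketch scores TM candidates both by word-level edit distance and by a simple weighted n-gram precision whose weights could be biased toward longer or shorter matches; the example sentences and weight values are invented.

```python
# Illustrative sketch only: edit-distance similarity vs. a simple
# weighted n-gram precision for ranking TM candidates against a query.
from collections import Counter

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def ngram_precision(query, candidate, max_n=4, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted n-gram precision of the TM candidate measured against the query."""
    score = 0.0
    for n, w in zip(range(1, max_n + 1), weights):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        qry = Counter(tuple(query[i:i + n]) for i in range(len(query) - n + 1))
        overlap = sum((cand & qry).values())          # clipped n-gram matches
        score += w * overlap / max(sum(cand.values()), 1)
    return score

query = "the red button starts the engine".split()
tm = ["the red button stops the engine".split(),
      "press the red button".split()]
best = max(tm, key=lambda seg: ngram_precision(query, seg))
print(best, edit_distance(query, best))
```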

    Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

    Get PDF
    Many important forms of data are stored digitally in XML format, and errors can occur in the textual content of the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data, so there is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format, and they often contain errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data, including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type; we call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types; we call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations, one using crowdsourcing with Amazon's Mechanical Turk platform and one using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected.
    Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 2016
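    A minimal sketch of what one single-field signal might look like (hypothetical code, not the authors' systems): a character bigram language model trained on all values of one field type, combined with an uncommon-character check, flags values that score poorly.

```python
# Hypothetical single-field anomaly signals: a smoothed character bigram
# model plus a rare-character check, trained on all values of one field type.
import math
from collections import Counter

def train_char_bigram(values):
    unigrams, bigrams = Counter(), Counter()
    for v in values:
        padded = "^" + v + "$"                  # boundary markers
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def avg_logprob(value, unigrams, bigrams, alpha=1.0):
    """Average add-alpha smoothed bigram log-probability; low values look anomalous."""
    padded = "^" + value + "$"
    vocab = len(unigrams) + 1
    pairs = list(zip(padded, padded[1:]))
    total = sum(math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * vocab))
                for a, b in pairs)
    return total / max(len(pairs), 1)

# Toy field values; "k1taab" contains a stray digit. With realistic amounts
# of field data, rarely seen characters become a much stronger signal.
field_values = ["kitaab", "qalam", "bayt", "k1taab"]
uni, bi = train_char_bigram(field_values)
rare = {c for c, n in uni.items() if n == 1}
for v in field_values:
    print(v, round(avg_logprob(v, uni, bi), 2), [c for c in v if c in rare])
```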

    Using Global Constraints and Reranking to Improve Cognates Detection

    Full text link
    Global constraints and reranking have not previously been used in cognates detection research. We propose methods for using global constraints by rescoring the score matrices produced by state-of-the-art cognates detection systems. Rescoring with global constraints is complementary to existing detection methods and yields significant improvements beyond current state-of-the-art performance on publicly available datasets covering different language pairs and a range of conditions, including different levels of baseline performance and different data sizes, among them larger and more realistic data sizes than have been evaluated in the past.
    Comment: 10 pages, 6 figures, 6 tables; published in the Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1983-1992, Vancouver, Canada, July 2017
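    One simple way to picture rescoring a score matrix under a global constraint (an illustration only, not necessarily the paper's procedure) is to reward mutual best matches between the two word lists and penalize unreciprocated ones, approximating a one-to-one correspondence:

```python
# Illustrative global-constraint rescoring of a cognate score matrix:
# cells that are mutual best matches get boosted, other row-best cells
# get penalized, nudging the matrix toward a one-to-one alignment.
import numpy as np

def rescore_mutual_best(scores, boost=0.2, penalty=0.2):
    """scores[i, j] = base cognate score for source word i and target word j."""
    rescored = scores.copy()
    row_best = scores.argmax(axis=1)   # best target for each source word
    col_best = scores.argmax(axis=0)   # best source for each target word
    for i, j in enumerate(row_best):
        if col_best[j] == i:           # mutual best match
            rescored[i, j] += boost
        else:                          # row-best but not reciprocated
            rescored[i, j] -= penalty
    return np.clip(rescored, 0.0, 1.0)

base = np.array([[0.9, 0.4, 0.1],
                 [0.5, 0.6, 0.3],
                 [0.2, 0.3, 0.8]])
print(rescore_mutual_best(base))
```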

    Geocoding Large Population‐level Administrative Datasets at Highly Resolved Spatial Scales

    Full text link
    Using geographic information systems to link administrative databases with demographic, social, and environmental data allows researchers to use spatial approaches to explore relationships between exposures and health. Traditionally, spatial analysis in public health has focused on the county, ZIP code, or tract level because of limitations to geocoding at highly resolved scales. Using 2005 birth and death data from North Carolina, we examine our ability to geocode population-level datasets at three spatial resolutions: ZIP code, street, and parcel. We achieve high geocoding rates at all three resolutions, with statewide street geocoding rates of 88.0% for births and 93.2% for deaths. We observe differences in geocoding rates across demographics and health outcomes, with lower geocoding rates in disadvantaged populations and the most dramatic differences occurring across the urban-rural spectrum. Our results suggest that highly resolved spatial data architectures for population-level datasets are viable through geocoding individual street addresses. We recommend routinely geocoding administrative datasets to the highest spatial resolution feasible, allowing public health researchers to choose the spatial resolution used in analysis based on an understanding of the spatial dimensions of the health outcomes and exposures being investigated. Such research, however, must acknowledge how disparate geocoding success across subpopulations may affect findings.
    Peer Reviewed: http://deepblue.lib.umich.edu/bitstream/2027.42/108258/1/tgis12052.pd
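    A small, hypothetical example of the kind of rate comparison described above, with invented column names and toy records, might compute geocoding success by resolution and by subgroup as follows:

```python
# Hypothetical summary of geocoding success rates at several spatial
# resolutions, overall and by subgroup (columns and values are invented).
import pandas as pd

records = pd.DataFrame({
    "urban_rural":     ["urban", "urban", "rural", "rural", "rural"],
    "geocoded_zip":    [1, 1, 1, 1, 1],
    "geocoded_street": [1, 1, 1, 0, 1],
    "geocoded_parcel": [1, 0, 0, 0, 1],
})

resolutions = ["geocoded_zip", "geocoded_street", "geocoded_parcel"]
print(records[resolutions].mean())                         # statewide rates
print(records.groupby("urban_rural")[resolutions].mean())  # urban-rural gaps
```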

    Online Tracking: Can the Free Market Create Choice Where None Exists?

    Get PDF

    Predicting Switch-like Behavior in Proteins using Logistic Regression on Sequence-based Descriptors

    Get PDF
    Ligands can bind at specific protein locations, inducing conformational changes such as those involving secondary structure. Identifying these possible switches from sequence, including by homology, is an important ongoing area of research. We attempt to predict possible secondary structure switches from sequence in proteins using machine learning, specifically a logistic regression approach with 48 N-acetyltransferases as our learning set and 5 sirtuins as our test set. Validated residue binary assignments of 0 (no change in secondary structure) and 1 (change in secondary structure) were determined with DSSP from 3D X-ray structures for sets of virtually identical chains crystallized under different conditions. Our sequence descriptors include amino acid type, six- and twenty-term sequence entropy, Lobanov-Galzitskaya's residue disorder propensity, Vkabat (variability with respect to predictions from sequence of helix, sheet, and other), and all possible combinations. We find optimal AUC values approaching 70% for the two models based on residue disorder propensity alone and on Vkabat alone. We hope to follow up with a larger learning set and with residue charge as an additional descriptor.
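    A hedged sketch of this modeling setup, with placeholder data standing in for the real per-residue descriptors (such as residue disorder propensity and Vkabat), would fit a logistic regression on per-residue features and report ROC AUC on a held-out set:

```python
# Sketch of the modeling setup described above with synthetic stand-in data:
# per-residue descriptors feed a logistic regression classifier, and the
# held-out performance is summarized with ROC AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Two toy descriptors per residue (stand-ins for disorder propensity and Vkabat).
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] + rng.normal(size=500) > 0).astype(int)
X_test = rng.normal(size=(200, 2))
y_test = (X_test[:, 0] + 0.5 * X_test[:, 1] + rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]      # probability of a switch (label 1)
print("AUC:", round(roc_auc_score(y_test, scores), 3))
```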

    The photomultiplier tube calibration system of the MicroBooNE experiment

    Get PDF
    We report on the design and construction of an LED-based fiber calibration system for large liquid argon time projection chamber (LArTPC) detectors. This system was developed to calibrate the optical systems of the MicroBooNE experiment. As well as detailing the materials and installation procedure, we provide technical drawings and specifications so that the system may be easily replicated in future LArTPC detectors.
    National Science Foundation (U.S.) (Grant PHY-1205175)

    Equity Crowdfunding Market: Assets and Drawbacks

    Get PDF

    Classification and Casimir Invariants of Lie-Poisson Brackets

    Full text link
    We classify Lie-Poisson brackets that are formed from Lie algebra extensions. The problem is relevant because many physical systems owe their Hamiltonian structure to such brackets. A classification involves reducing all brackets to a set of normal forms, and is achieved partially through the use of Lie algebra cohomology. For extensions of order less than five, the number of normal forms is small and they involve no free parameters. We derive a general method for finding Casimir invariants of Lie-Poisson bracket extensions. The Casimir invariants of all low-order brackets are explicitly computed. We treat in detail a four-field model of compressible reduced magnetohydrodynamics.
    Comment: 59 pages, Elsevier macros. To be published in Physica D
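    For reference, the standard form of a Lie-Poisson bracket on the dual of a Lie algebra, and the defining condition for a Casimir invariant, are as follows (general definitions, not the paper's specific normal forms for extensions):

```latex
% Standard definitions: the Lie-Poisson bracket on the dual of a Lie algebra,
% and the condition characterizing a Casimir invariant.
\[
  \{F, G\}(\mu) \;=\; \Bigl\langle \mu,\;
  \Bigl[\frac{\delta F}{\delta \mu},\, \frac{\delta G}{\delta \mu}\Bigr] \Bigr\rangle ,
\]
\[
  C \ \text{is a Casimir invariant} \quad\Longleftrightarrow\quad
  \{C, F\} = 0 \ \ \text{for all functionals } F .
\]
```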