187,903 research outputs found
Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection
Many important forms of data are stored digitally in XML format. Errors can
occur in the textual content of the data in the fields of the XML. Fixing these
errors manually is time-consuming and expensive, especially for large amounts
of data. There is increasing interest in the research, development, and use of
automated techniques for assisting with data cleaning. Electronic dictionaries
are an important form of data frequently stored in XML format that frequently
have errors introduced through a mixture of manual typographical entry errors
and optical character recognition errors. In this paper we describe methods for
flagging statistical anomalies as likely errors in electronic dictionaries
stored in XML format. We describe six systems based on different sources of
information. The systems detect errors using various signals in the data
including uncommon characters, text length, character-based language models,
word-based language models, tied-field length ratios, and tied-field
transliteration models. Four of the systems detect errors based on expectations
automatically inferred from content within elements of a single field type. We
call these single-field systems. Two of the systems detect errors based on
correspondence expectations automatically inferred from content within elements
of multiple related field types. We call these tied-field systems. For each
system, we provide an intuitive analysis of the type of error that it is
successful at detecting. Finally, we describe two larger-scale evaluations
using crowdsourcing with Amazon's Mechanical Turk platform and using the
annotations of a domain expert. The evaluations consistently show that the
systems are useful for improving the efficiency with which errors in XML
electronic dictionaries can be detected.Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016
IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna
Hills, CA, USA, pages 79-86, February 201
A cascaded approach to normalising gene mentions in biomedical literature
Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%
Design and Implementation of the UniProt Website
The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The "www.uniprot.org":http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access
Recommended from our members
Immigration: Visa Security Policies
[Excerpt] The report opens with an overview of visa issuance policy. It then explains the key provisions that guide the documentary requirements and approval/disapproval process. The section on consular screening procedures includes an analysis of trends over time in denying visas. Visa revocation, a reoccurring issue of concern to Congress, and the visa security program are discussed as well
Recommended from our members
Visa Waiver Program
Since the events of September 11, 2001, concerns have been raised about the
ability of terrorists to enter the United States under the visa waiver program. The
visa waiver program (VWP) allows nationals from certain countries to enter the
United States as temporary visitors (nonimmigrants) for business or pleasure without
first obtaining a visa from a U.S. consulate abroad. Temporary visitors for business
or pleasure from non-VWP countries must obtain a visa from Department of State
(DOS) officers at a consular post abroad before coming to the United States. The
VWP constitutes one of a few exceptions under the Immigration and Nationality Act
(INA) in which foreign nationals are admitted into the United States without a valid
visa
Automated attendance accounting system
An automated accounting system useful for applying data to a computer from any or all of a multiplicity of data terminals is disclosed. The system essentially includes a preselected number of data terminals which are each adapted to convert data words of decimal form to another form, i.e., binary, usable with the computer. Each data terminal may take the form of a keyboard unit having a number of depressable buttons or switches corresponding to selected data digits and/or function digits. A bank of data buffers, one of which is associated with each data terminal, is provided as a temporary storage. Data from the terminals is applied to the data buffers on a digit by digit basis for transfer via a multiplexer to the computer
- …