187,903 research outputs found

Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

Author: Bloodgood Michael
Strauss Benjamin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the data in the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data. There is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format that frequently have errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type. We call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types. We call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations using crowdsourcing with Amazon's Mechanical Turk platform and using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected. Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 201

arXiv.org e-Print Archive

Crossref

Digital Repository at the University of Maryland

A cascaded approach to normalising gene mentions in biomedical literature

Author: Keane John A.
Nenadic Goran
Yang Hui
Publication venue: 'Biomedical Informatics'
Publication date: 01/01/2007

Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%

Crossref

University of Birmingham Research Portal

Open Research Online (The Open University)

PubMed Central

The University of Manchester - Institutional Repository

Digitising the 1941 National Farm Survey: an initial assessment

Author: Southall Humphrey
Publication venue: 'International Archives of Obstetrics and Gynecology'
Publication date: 01/01/2006

Portsmouth University Research Portal (Pure)

The MONTRASEC demo. A bench-mark for member state and EU automated data collection and reporting on trafficking in human beings and sexual exploitation of children

Author: Paterson Neil
Vermeulen Gert
Publication venue: Maklu
Publication date: 01/01/2010

Ghent University Academic Bibliography

Design and Implementation of the UniProt Website

Author: Amos Bairoch
Elisabeth Gasteiger
Eric Jain
Isabelle Phan
Maria J. Martin
Nicole Redaschi
Peter McGarvey
Severine Duvaud
Publication date: 06/12/2008

The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The www.uniprot.org website (http://www.uniprot.org) is the primary access point to this data and to documentation and basic tools for the data. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access.

Nature Precedings

Immigration: Visa Security Policies

Author: Wasem Ruth Ellen
Publication venue: DigitalCommons@ILR
Publication date: 18/11/2015

[Excerpt] The report opens with an overview of visa issuance policy. It then explains the key provisions that guide the documentary requirements and approval/disapproval process. The section on consular screening procedures includes an analysis of trends over time in denying visas. Visa revocation, a reoccurring issue of concern to Congress, and the visa security program are discussed as well.

UNT Digital Library

DigitalCommons@ILR

eCommons@Cornell

Visa Waiver Program

Author: Siskin Alison
Publication venue: DigitalCommons@ILR
Publication date: 10/02/2004

Since the events of September 11, 2001, concerns have been raised about the ability of terrorists to enter the United States under the visa waiver program. The visa waiver program (VWP) allows nationals from certain countries to enter the United States as temporary visitors (nonimmigrants) for business or pleasure without first obtaining a visa from a U.S. consulate abroad. Temporary visitors for business or pleasure from non-VWP countries must obtain a visa from Department of State (DOS) officers at a consular post abroad before coming to the United States. The VWP constitutes one of a few exceptions under the Immigration and Nationality Act (INA) in which foreign nationals are admitted into the United States without a valid visa

UNT Digital Library

DigitalCommons@ILR

eCommons@Cornell

Automated attendance accounting system

Author: Chapman C. P.
Publication date: 19/06/1973

An automated accounting system useful for applying data to a computer from any or all of a multiplicity of data terminals is disclosed. The system essentially includes a preselected number of data terminals which are each adapted to convert data words of decimal form to another form, i.e., binary, usable with the computer. Each data terminal may take the form of a keyboard unit having a number of depressable buttons or switches corresponding to selected data digits and/or function digits. A bank of data buffers, one of which is associated with each data terminal, is provided as a temporary storage. Data from the terminals is applied to the data buffers on a digit by digit basis for transfer via a multiplexer to the computer

NASA Technical Reports Server