1,467 research outputs found
Error-tolerant Finite State Recognition with Applications to Morphological Analysis and Spelling Correction
Error-tolerant recognition enables the recognition of strings that deviate
mildly from any string in the regular set recognized by the underlying finite
state recognizer. Such recognition has applications in error-tolerant
morphological processing, spelling correction, and approximate string matching
in information retrieval. After a description of the concepts and algorithms
involved, we give examples from two applications: In the context of
morphological analysis, error-tolerant recognition allows misspelled input word
forms to be corrected, and morphologically analyzed concurrently. We present an
application of this to error-tolerant analysis of agglutinative morphology of
Turkish words. The algorithm can be applied to morphological analysis of any
language whose morphology is fully captured by a single (and possibly very
large) finite state transducer, regardless of the word formation processes and
morphographemic phenomena involved. In the context of spelling correction,
error-tolerant recognition can be used to enumerate correct candidate forms
from a given misspelled string within a certain edit distance. Again, it can be
applied to any language with a word list comprising all inflected forms, or
whose morphology is fully described by a finite state transducer. We present
experimental results for spelling correction for a number of languages. These
results indicate that such recognition works very efficiently for candidate
generation in spelling correction for many European languages such as English,
Dutch, French, German, Italian (and others) with very large word lists of root
and inflected forms (some containing well over 200,000 forms), generating all
candidate solutions within 10 to 45 milliseconds (with edit distance 1) on a
SparcStation 10/41. For spelling correction in Turkish, error-tolerantComment: Replaces 9504031. gzipped, uuencoded postscript file. To appear in
Computational Linguistics Volume 22 No:1, 1996, Also available as
ftp://ftp.cs.bilkent.edu.tr/pub/ko/clpaper9512.ps.
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR
Macro Grammars and Holistic Triggering for Efficient Semantic Parsing
To learn a semantic parser from denotations, a learning algorithm must search
over a combinatorially large space of logical forms for ones consistent with
the annotated denotations. We propose a new online learning algorithm that
searches faster as training progresses. The two key ideas are using macro
grammars to cache the abstract patterns of useful logical forms found thus far,
and holistic triggering to efficiently retrieve the most relevant patterns
based on sentence similarity. On the WikiTableQuestions dataset, we first
expand the search space of an existing model to improve the state-of-the-art
accuracy from 38.7% to 42.7%, and then use macro grammars and holistic
triggering to achieve an 11x speedup and an accuracy of 43.7%.Comment: EMNLP 201
Combining Visual Layout and Lexical Cohesion Features for Text Segmentation
We propose integrating features from lexical cohesion with elements from layout recognition to build a composite framework. We use supervised machine learning on this composite feature set to derive discourse structure on the topic level. We demonstrate a system based on this principle and use both an intrinsic evaluation as well as the task of genre classification to assess its performance
Machine Learning Approaches to Identify Nicknames from A Statewide Health Information Exchange
Patient matching is essential to minimize fragmentation of patient data. Existing patient matching efforts often do not account for nickname use. We sought to develop decision models that could identify true nicknames using features representing the phonetical and structural similarity of nickname pairs. We identified potential male and female name pairs from the Indiana Network for Patient Care (INPC), and developed a series of features that represented their phonetical and structural similarities. Next, we used the XGBoost classifier and hyperparameter tuning to build decision models to identify nicknames using these feature sets and a manually reviewed gold standard. Decision models reported high Precision/Positive Predictive Value and Accuracy scores for both male and female name pairs despite the low number of true nickname matches in the datasets under study. Ours is one of the first efforts to identify patient nicknames using machine learning approaches
Matching XML schemas by a new tree matching algorithm.
Schema matching is one of the key operations in XML-based information integration and exchanging applications. Automatically or semi-automatically matching XML schemas has attracted a lot of attentions in academia and industry due to the extensive adoption of XML techniques. One of the most difficult tasks in this problem is to identify structural relations between two XML schemas. This thesis builds an automatic XML schema matching system which generates two types of outputs: the element mappings and schema similarity. In this system, an XML schema is modeled as a tree. We have proposed a new tree matching algorithm to compute the structural relation by extracting the most similar common substructures. The algorithm has been designed to achieve a trade-off between matching optimality and time complexity. The experimental results show that this system can be used to match large XML schemas.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .W366. Source: Masters Abstracts International, Volume: 43-05, page: 1758. Adviser: Jianguo Lu. Thesis (M.Sc.)--University of Windsor (Canada), 2004
A Survey on Retrieval of Mathematical Knowledge
We present a short survey of the literature on indexing and retrieval of
mathematical knowledge, with pointers to 72 papers and tentative taxonomies of
both retrieval problems and recurring techniques.Comment: CICM 2015, 20 page
Machine Learning-Friendly Biomedical Datasets for Equivalence and Subsumption Ontology Matching
Ontology Matching (OM) plays an important role in many domains such as
bioinformatics and the Semantic Web, and its research is becoming increasingly
popular, especially with the application of machine learning (ML) techniques.
Although the Ontology Alignment Evaluation Initiative (OAEI) represents an
impressive effort for the systematic evaluation of OM systems, it still suffers
from several limitations including limited evaluation of subsumption mappings,
suboptimal reference mappings, and limited support for the evaluation of
ML-based systems. To tackle these limitations, we introduce five new biomedical
OM tasks involving ontologies extracted from Mondo and UMLS. Each task includes
both equivalence and subsumption matching; the quality of reference mappings is
ensured by human curation, ontology pruning, etc.; and a comprehensive
evaluation framework is proposed to measure OM performance from various
perspectives for both ML-based and non-ML-based OM systems. We report
evaluation results for OM systems of different types to demonstrate the usage
of these resources, all of which are publicly available as part of the new
BioML track at OAEI 2022.Comment: Accepted paper in the 21st International Semantic Web Conference
(ISWC-2022); DOI for Bio-ML Dataset: 10.5281/zenodo.651008
- …