A Bayesian mixture model for term re-occurrence and burstiness
This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term's re-occurrence rate and within-document burstiness. The model works for all kinds of terms, be they rare content words, medium-frequency terms, or frequent function words. A measure is proposed to account for a term's importance based on its distribution pattern in the corpus.
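As a rough illustration of the gap model, the sketch below fits a two-component exponential mixture to inter-occurrence gaps. It substitutes plain maximum-likelihood EM for the paper's Bayesian estimation, and the data are synthetic; all names are illustrative:

```python
import numpy as np

def fit_exp_mixture(gaps, n_iter=200):
    """Fit a two-component exponential mixture to inter-occurrence gaps
    with EM (a maximum-likelihood stand-in for the paper's Bayesian fit).
    One component models the baseline re-occurrence rate; the other
    (higher rate, shorter gaps) models within-document burstiness."""
    gaps = np.asarray(gaps, dtype=float)
    mean = gaps.mean()
    lam = np.array([0.5 / mean, 5.0 / mean])   # slow and fast initial rates
    pi = np.array([0.5, 0.5])                  # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each gap.
        dens = pi * lam * np.exp(-np.outer(gaps, lam))    # shape (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and rates from responsibilities.
        pi = resp.mean(axis=0)
        lam = resp.sum(axis=0) / (resp * gaps[:, None]).sum(axis=0)
    return pi, lam

# Synthetic gaps: a bursty short-gap process mixed with a slow background.
rng = np.random.default_rng(0)
gaps = np.concatenate([rng.exponential(scale=1.0, size=500),
                       rng.exponential(scale=20.0, size=500)])
pi, lam = fit_exp_mixture(gaps)   # rates separate into fast and slow components
```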
Beyond TREC's filtering track
Following the withdrawal of the filtering track from the latest TREC conferences, there is a niche for new evaluation standards. To this end, we suggest two new evaluation methodologies based on variations of TREC's routing subtask. The first can be used to evaluate single, multi-topic profiles, and the second to test the ability of a multi-topic profile to adapt to both modest variations and radical drifts in user interests.
ComTax: community-driven curation for taxonomic databases
This poster presents the work of the ComTax project to develop a community-driven curation process among practicing scientists and citizen scientists. The project provides tools to help scientists identify and validate appropriate taxonomic names from the scanned historical literature. The system operates on scanned documents, typically taken from the Biodiversity Heritage Library, although documents sourced from other repositories could be used.
The system is intended to be used on uncorrected text after optical character recognition (OCR) on the scanned images. The key stages are:
1. Identify possible taxonomic names in the scanned text using machine learning techniques.
2. Verify the extracted names against existing databases. If present, the source scanned text can be automatically marked-up with the name.
3. Names that cannot be verified may simply be absent from the verification databases, typically because the old name used in the literature has since been reclassified, or because OCR errors have corrupted the name's transcription in the scanned text. In either case:
3.1. Present the proposed name to domain experts or citizen scientists for validation or correction, potentially through a voting mechanism to collect expert judgments on the putative taxonomic name.
3.2. Mark-up the scanned text with the corrected spelling of the name and offer validated taxonomic names for further use by the community.
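The stages above can be sketched end to end. The regex stand-in for the project's machine-learning name finder and the toy name database are both illustrative assumptions, not ComTax components:

```python
import re

# Hypothetical stand-in for the project's machine-learning name finder:
# a regex for Latin binomials ("Genus species") in noisy OCR text.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def curate(ocr_text, verified_names):
    """Split candidate taxonomic names into auto-verified names (stage 2)
    and a queue for expert or citizen-scientist review (stage 3)."""
    candidates = set(BINOMIAL.findall(ocr_text))
    verified = candidates & verified_names
    for_review = candidates - verified_names
    return verified, for_review

# Toy name database and OCR text with one error ("Qvercus" for "Quercus").
db = {"Panthera leo", "Quercus robur"}
text = "Specimens of Panthera leo and Qvercus robur were collected."
ok, review = curate(text, db)   # "Qvercus robur" lands in the review queue
```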
This poster will describe the technical challenges facing the ComTax project, and highlight potential extensions of the work to the curation of other entities of interest in the legacy literature, or to the literature of other disciplines.
SVO triple based Latent Semantic Analysis for recognising textual entailment
Burek G, Pietsch C, De Roeck A. SVO triple based Latent Semantic Analysis for recognising textual entailment. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing (WTEP). Association for Computational Linguistics; 2007: 113-118.
Latent Semantic Analysis has only recently been applied to textual entailment recognition. However, these efforts have suffered from inadequate bag-of-words vector representations. Our prototype implementation for the Third Recognising Textual Entailment Challenge (RTE-3) improves the approach by applying it to vector representations that contain semi-structured representations of words. It uses variable-size n-grams of word stems to independently model the verbs, subjects and objects displayed in textual statements. The system's performance shows positive results and provides insights into how to improve them further.
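A toy illustration of why slot-wise SVO comparison is stricter than a single bag of words (the helper and the triples are invented for illustration; the actual system uses variable-size n-grams of word stems):

```python
def svo_overlap(triple_a, triple_b):
    """Slot-wise comparison of (subject, verb, object) triples: entailment
    candidates need word overlap in every slot, not merely overall, which a
    single bag-of-words representation cannot enforce."""
    return all(set(a.split()) & set(b.split())
               for a, b in zip(triple_a, triple_b))

svo_overlap(("the cat", "chased", "a mouse"), ("cat", "chased", "mouse"))  # True
svo_overlap(("the dog", "chased", "a mouse"), ("cat", "chased", "mouse"))  # False
```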
Hybrid mappings of complex questions over an integrated semantic space
We address the issue of measuring semantic similarity between ontologies and text by means of applying Latent Semantic Analysis. This method allows ranking of vector representations describing semantic relations according to their cosine similarity with a particular query. Our work is expected to make contributions including the introduction of reasoning about uncertainty when mapping between ontologies, an algorithm that can perform automatic mapping between concepts or relations derived from text and concepts or relations belonging to different ontologies, and the capability to infer implicit similarity between concepts or relations
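The cosine-ranking step described above can be sketched with toy latent vectors (the relation labels, dimensionality, and values are invented for illustration):

```python
import numpy as np

def rank_by_cosine(query_vec, vectors, labels):
    """Rank candidate concept/relation vectors by cosine similarity
    with a query vector in the shared latent space."""
    q = query_vec / np.linalg.norm(query_vec)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                       # cosine of each row with the query
    order = np.argsort(-sims)          # highest similarity first
    return [(labels[i], float(sims[i])) for i in order]

# Invented 3-d latent vectors for two ontology relations and a text query.
labels = ["partOf", "locatedIn"]
vectors = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.8, 0.3]])
query = np.array([0.85, 0.2, 0.05])
ranking = rank_by_cosine(query, vectors, labels)   # "partOf" ranks first
```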
Literature-driven Curation for Taxonomic Name Databases
Digitized biodiversity literature provides a wealth of content from which machines can draw biodiversity knowledge. However, identifying taxonomic names and the associated semantic metadata is a difficult and labour-intensive process. We present a system to support human-assisted creation of semantic metadata. Information extraction techniques automatically identify taxonomic names in scanned documents. These are then presented to users for manual correction or verification. The tools that support the curation process include taxonomic name identification and mapping, and community-driven taxonomic name verification. Our research shows the potential for these information extraction techniques to support research and curation in disciplines dependent upon scanned documents.
Handling instance coreferencing in the KnoFuss architecture
Finding RDF individuals that refer to the same real-world entities but have different URIs is necessary for the efficient use of data across sources. The requirements for such instance-level integration of RDF data are different from both database record linkage and ontology schema matching scenarios. Flexible configuration and reuse of different methods is needed to achieve good performance. Our data integration architecture, called KnoFuss, implements a component-based approach, which allows flexible selection and tuning of methods and takes the ontological schemata into account to improve the reusability of methods
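One pluggable matching method of the kind KnoFuss selects among might look like the following sketch, which combines schema-type agreement with label similarity. The threshold and helper are illustrative assumptions, not KnoFuss's actual components:

```python
from difflib import SequenceMatcher

def same_individual(label_a, label_b, type_a, type_b, threshold=0.85):
    """Merge two RDF individuals only if their ontological types agree
    (schema-aware pruning) and their labels are sufficiently similar."""
    if type_a != type_b:
        return False
    sim = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
    return sim >= threshold

same_individual("Milton Keynes", "Milton-Keynes", "ex:City", "ex:City")  # True
same_individual("Paris", "London", "ex:City", "ex:City")                 # False
```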
Detecting dangerous coordination ambiguities using word distribution
In this paper we present heuristics for resolving coordination ambiguities. We test the hypothesis that the most likely reading of a coordination can be predicted using word distribution information from a generic corpus. Our heuristics are based upon the relative frequency of the coordination in the corpus, the distributional similarity of the coordinated words, and the collocation frequency between the coordinated words and their modifiers. These heuristics have varying but useful predictive power. They also take into account our view that many ambiguities cannot be effectively disambiguated, since human perceptions vary widely.
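The collocation-frequency heuristic can be illustrated with invented counts (not the paper's corpus data):

```python
def attachment_score(counts, modifier, head2):
    """Collocation-frequency heuristic: how often does `modifier` occur
    with the second conjunct, relative to that conjunct's frequency?
    A high ratio favours the wide-scope reading ("old [men and women]")
    over the narrow one ("[old men] and children")."""
    pair = counts.get((modifier, head2), 0)
    total = counts.get(head2, 0)
    return pair / total if total else 0.0

# Invented corpus counts, for illustration only.
counts = {("old", "women"): 120, "women": 1000,
          ("old", "children"): 5, "children": 2000}
wide = attachment_score(counts, "old", "women")        # 0.12
narrow = attachment_score(counts, "old", "children")   # 0.0025
```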