A Bayesian mixture model for term re-occurrence and burstiness
This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term's re-occurrence rate and within-document burstiness. The model works for all kinds of terms, be they rare content words, medium-frequency terms, or frequent function words. A measure is proposed to account for a term's importance based on its distribution pattern in the corpus.
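As a rough illustration of the gap model, the sketch below fits a two-component exponential mixture to inter-occurrence gaps. It substitutes plain maximum-likelihood EM for the paper's Bayesian estimation, and the data are synthetic; all names are illustrative:

```python
import numpy as np

def fit_exp_mixture(gaps, n_iter=200):
    """Fit a two-component exponential mixture to inter-occurrence gaps
    with EM (a maximum-likelihood stand-in for the paper's Bayesian fit).
    One component models the baseline re-occurrence rate; the other
    (higher rate, shorter gaps) models within-document burstiness."""
    gaps = np.asarray(gaps, dtype=float)
    mean = gaps.mean()
    lam = np.array([0.5 / mean, 5.0 / mean])   # slow and fast initial rates
    pi = np.array([0.5, 0.5])                  # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each gap.
        dens = pi * lam * np.exp(-np.outer(gaps, lam))    # shape (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and rates from responsibilities.
        pi = resp.mean(axis=0)
        lam = resp.sum(axis=0) / (resp * gaps[:, None]).sum(axis=0)
    return pi, lam

# Synthetic gaps: a bursty short-gap process mixed with a slow background.
rng = np.random.default_rng(0)
gaps = np.concatenate([rng.exponential(scale=1.0, size=500),
                       rng.exponential(scale=20.0, size=500)])
pi, lam = fit_exp_mixture(gaps)   # rates separate into fast and slow components
```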
Beyond TREC's filtering track
Following the withdrawal of the filtering track from the latest TREC conferences, there is a niche for new evaluation standards. To this end, we suggest two new evaluation methodologies based on variations of TREC's routing subtask. The first can be used to evaluate single, multi-topic profiles, and the second to test the ability of a multi-topic profile to adapt to both modest variations and radical drifts in user interests.
ComTax: community-driven curation for taxonomic databases
This poster presents the work of the ComTax project to develop a community-driven curation process among practicing scientists and citizen scientists. The project provides tools to help scientists identify and validate appropriate taxonomic names from the scanned historical literature. The system operates on scanned documents, typically taken from the Biodiversity Heritage Library, although documents sourced from other repositories could be used.
The system is intended to be used on uncorrected text after optical character recognition (OCR) on the scanned images. The key stages are:
1. Identify possible taxonomic names in the scanned text using machine learning techniques.
2. Verify the extracted names against existing databases. If present, the source scanned text can be automatically marked-up with the name.
3. Names that cannot be verified may simply be absent from the verification databases, typically because the old name used in the literature has since been reclassified, or because OCR errors have corrupted the name's transcription in the scanned text. In either case:
3.1. Present the proposed name to domain experts or citizen scientists for validation or correction, potentially through a voting mechanism to collect expert judgments on the putative taxonomic name.
3.2. Mark-up the scanned text with the corrected spelling of the name and offer validated taxonomic names for further use by the community.
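The stages above can be sketched end to end. The regex stand-in for the project's machine-learning name finder and the toy name database are both illustrative assumptions, not ComTax components:

```python
import re

# Hypothetical stand-in for the project's machine-learning name finder:
# a regex for Latin binomials ("Genus species") in noisy OCR text.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def curate(ocr_text, verified_names):
    """Split candidate taxonomic names into auto-verified names (stage 2)
    and a queue for expert or citizen-scientist review (stage 3)."""
    candidates = set(BINOMIAL.findall(ocr_text))
    verified = candidates & verified_names
    for_review = candidates - verified_names
    return verified, for_review

# Toy name database and OCR text with one error ("Qvercus" for "Quercus").
db = {"Panthera leo", "Quercus robur"}
text = "Specimens of Panthera leo and Qvercus robur were collected."
ok, review = curate(text, db)   # "Qvercus robur" lands in the review queue
```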
This poster will describe the technical challenges facing the ComTax project, and highlight potential extensions of the work to the curation of other entities of interest in the legacy literature, or to the literature of other disciplines.
SVO triple based Latent Semantic Analysis for recognising textual entailment
Burek G, Pietsch C, De Roeck A. SVO triple based Latent Semantic Analysis for recognising textual entailment. In: Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing (WTEP). Association for Computational Linguistics; 2007: 113-118.
Latent Semantic Analysis has only recently been applied to textual entailment recognition. However, these efforts have suffered from inadequate bag-of-words vector representations. Our prototype implementation for the Third Recognising Textual Entailment Challenge (RTE-3) improves the approach by applying it to vector representations that contain semi-structured representations of words. It uses variable-size n-grams of word stems to independently model the verbs, subjects and objects displayed in textual statements. The system's performance shows positive results and provides insights into how to improve them further.
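A toy illustration of why slot-wise SVO comparison is stricter than a single bag of words (the helper and the triples are invented for illustration; the actual system uses variable-size n-grams of word stems):

```python
def svo_overlap(triple_a, triple_b):
    """Slot-wise comparison of (subject, verb, object) triples: entailment
    candidates need word overlap in every slot, not merely overall, which a
    single bag-of-words representation cannot enforce."""
    return all(set(a.split()) & set(b.split())
               for a, b in zip(triple_a, triple_b))

svo_overlap(("the cat", "chased", "a mouse"), ("cat", "chased", "mouse"))  # True
svo_overlap(("the dog", "chased", "a mouse"), ("cat", "chased", "mouse"))  # False
```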
Hybrid mappings of complex questions over an integrated semantic space
We address the issue of measuring semantic similarity between ontologies and text by means of applying Latent Semantic Analysis. This method allows ranking of vector representations describing semantic relations according to their cosine similarity with a particular query. Our work is expected to make contributions including the introduction of reasoning about uncertainty when mapping between ontologies, an algorithm that can perform automatic mapping between concepts or relations derived from text and concepts or relations belonging to different ontologies, and the capability to infer implicit similarity between concepts or relations
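The cosine-ranking step described above can be sketched with toy latent vectors (the relation labels, dimensionality, and values are invented for illustration):

```python
import numpy as np

def rank_by_cosine(query_vec, vectors, labels):
    """Rank candidate concept/relation vectors by cosine similarity
    with a query vector in the shared latent space."""
    q = query_vec / np.linalg.norm(query_vec)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ q                       # cosine of each row with the query
    order = np.argsort(-sims)          # highest similarity first
    return [(labels[i], float(sims[i])) for i in order]

# Invented 3-d latent vectors for two ontology relations and a text query.
labels = ["partOf", "locatedIn"]
vectors = np.array([[0.9, 0.1, 0.0],
                    [0.1, 0.8, 0.3]])
query = np.array([0.85, 0.2, 0.05])
ranking = rank_by_cosine(query, vectors, labels)   # "partOf" ranks first
```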
Literature-driven Curation for Taxonomic Name Databases
Digitized biodiversity literature provides a wealth of content from which machines can draw biodiversity knowledge. However, identifying taxonomic names and the associated semantic metadata is a difficult and labour-intensive process. We present a system to support human-assisted creation of semantic metadata. Information extraction techniques automatically identify taxonomic names in scanned documents. These are then presented to users for manual correction or verification. The tools that support the curation process include taxonomic name identification and mapping, and community-driven taxonomic name verification. Our research shows the potential for these information extraction techniques to support research and curation in disciplines dependent upon scanned documents.
Handling instance coreferencing in the KnoFuss architecture
Finding RDF individuals that refer to the same real-world entities but have different URIs is necessary for the efficient use of data across sources. The requirements for such instance-level integration of RDF data are different from both database record linkage and ontology schema matching scenarios. Flexible configuration and reuse of different methods is needed to achieve good performance. Our data integration architecture, called KnoFuss, implements a component-based approach, which allows flexible selection and tuning of methods and takes the ontological schemata into account to improve the reusability of methods
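One pluggable matching method of the kind KnoFuss selects among might look like the following sketch, which combines schema-type agreement with label similarity. The threshold and helper are illustrative assumptions, not KnoFuss's actual components:

```python
from difflib import SequenceMatcher

def same_individual(label_a, label_b, type_a, type_b, threshold=0.85):
    """Merge two RDF individuals only if their ontological types agree
    (schema-aware pruning) and their labels are sufficiently similar."""
    if type_a != type_b:
        return False
    sim = SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()
    return sim >= threshold

same_individual("Milton Keynes", "Milton-Keynes", "ex:City", "ex:City")  # True
same_individual("Paris", "London", "ex:City", "ex:City")                 # False
```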
Detecting dangerous coordination ambiguities using word distribution
In this paper we present heuristics for resolving coordination ambiguities. We test the hypothesis that the most likely reading of a coordination can be predicted using word distribution information from a generic corpus. Our heuristics are based upon the relative frequency of the coordination in the corpus, the distributional similarity of the coordinated words, and the collocation frequency between the coordinated words and their modifiers. These heuristics have varying but useful predictive power. They also take into account our view that many ambiguities cannot be effectively disambiguated, since human perceptions vary widely.
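The collocation-frequency heuristic can be illustrated with invented counts (not the paper's corpus data):

```python
def attachment_score(counts, modifier, head2):
    """Collocation-frequency heuristic: how often does `modifier` occur
    with the second conjunct, relative to that conjunct's frequency?
    A high ratio favours the wide-scope reading ("old [men and women]")
    over the narrow one ("[old men] and children")."""
    pair = counts.get((modifier, head2), 0)
    total = counts.get(head2, 0)
    return pair / total if total else 0.0

# Invented corpus counts, for illustration only.
counts = {("old", "women"): 120, "women": 1000,
          ("old", "children"): 5, "children": 2000}
wide = attachment_score(counts, "old", "women")        # 0.12
narrow = attachment_score(counts, "old", "children")   # 0.0025
```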