163 research outputs found
Evaluating two methods for Treebank grammar compaction
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be "read off" the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar.
In this paper, we explore ways by which a treebank grammar can be reduced in size, or "compacted", using two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
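The occurrence-count thresholding of technique (i) can be sketched in a few lines; the rule encoding and the threshold value below are illustrative assumptions, not the paper's exact setup:

```python
from collections import Counter

def threshold_rules(rule_occurrences, min_count):
    """Keep only grammar rules seen at least min_count times in the treebank.

    rule_occurrences: iterable of rules read off parse annotations,
    e.g. ("NP", ("DT", "NN")) meaning NP -> DT NN.
    """
    counts = Counter(rule_occurrences)
    return {rule: n for rule, n in counts.items() if n >= min_count}

# Toy example with a hypothetical rule inventory:
rules = [("NP", ("DT", "NN"))] * 5 + [("NP", ("DT", "JJ", "NN"))] * 1
kept = threshold_rules(rules, min_count=2)
```

Raising `min_count` trades grammar size against coverage, which is the tension the reported 62%/71% reductions quantify.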
University of Sheffield TREC-8 Q & A System
The system entered by the University of Sheffield in the question answering track of TREC-8 is the result of coupling two existing technologies - information retrieval (IR) and information extraction (IE). In essence the approach is this: the IR system treats the question as a query and returns a set of top ranked documents or passages; the IE system uses NLP techniques to parse the question, analyse the top ranked documents or passages returned by the IR system, and instantiate a query variable in the semantic representation of the question against the semantic representation of the analysed documents or passages. Thus, while the IE system by no means attempts "full text understanding", this approach is a relatively deep approach which attempts to work with meaning representations.
Since the information retrieval systems we used were not our own (AT&T and UMass) and were used more or less "off the shelf", this paper concentrates on describing the modifications made to our existing information extraction system to allow it to participate in the Q & A task.
Generating Image Descriptions with Gold Standard Visual Inputs: Motivation, Evaluation and Baselines
In this paper, we present the task of generating image descriptions with gold standard visual detections as input, rather than directly from an image. This allows the Natural Language Generation community to focus on the text generation process, rather than dealing with the noise and complications arising from the visual detection process. We propose a fine-grained evaluation metric specifically for evaluating the content selection capabilities of image description generation systems. To demonstrate the evaluation metric on the task, several baselines are presented using bounding box information and textual information as priors for content selection. The baselines are evaluated using the proposed metric, showing that the fine-grained metric is useful for evaluating the content selection phase of an image description generation system.
Cross-validating Image Description Datasets and Evaluation Metrics
The task of automatically generating sentential descriptions of image content has become increasingly popular in recent years, resulting in the development of large-scale image description datasets and the proposal of various metrics for evaluating image description generation systems. However, not much work has been done to analyse and understand both datasets and the metrics. In this paper, we propose using a leave-one-out cross validation (LOOCV) process as a means to analyse multiply annotated, human-authored image description datasets and the various evaluation metrics, i.e. evaluating one image description against other human-authored descriptions of the same image. Such an evaluation process affords various insights into the image description datasets and evaluation metrics, such as the variations of image descriptions within and across datasets and also what the metrics capture. We compute and analyse (i) human upper-bound performance; (ii) ranked correlation between metric pairs across datasets; (iii) lower-bound performance by comparing a set of descriptions describing one image to another sentence not describing that image. Interesting observations are made about the evaluation metrics and image description datasets, and we conclude that such cross-validation methods are extremely useful for assessing and gaining insights into image description datasets and evaluation metrics for image descriptions.
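The LOOCV process described above amounts to holding out each human description in turn and scoring it against the remaining descriptions of the same image. The unigram-overlap F1 below is a simple stand-in for the metrics actually studied (e.g. BLEU or METEOR), and the toy dataset is invented:

```python
def token_f1(candidate, references):
    """Unigram-overlap F1 against the best-matching reference
    (a simple placeholder for standard image description metrics)."""
    cand = set(candidate.lower().split())
    best = 0.0
    for ref in references:
        r = set(ref.lower().split())
        overlap = len(cand & r)
        if overlap == 0:
            continue
        p, rec = overlap / len(cand), overlap / len(r)
        best = max(best, 2 * p * rec / (p + rec))
    return best

def loocv_upper_bound(descriptions_per_image):
    """Hold out each human description and score it against the remaining
    descriptions of the same image; average over all hold-outs."""
    scores = []
    for descs in descriptions_per_image:
        for i, held_out in enumerate(descs):
            others = descs[:i] + descs[i + 1:]
            scores.append(token_f1(held_out, others))
    return sum(scores) / len(scores)

dataset = [["a dog runs on grass", "a dog running in a field", "the dog is outside"]]
upper = loocv_upper_bound(dataset)
```

The resulting average is the "human upper-bound performance" of point (i): the score a system would get if it produced descriptions indistinguishable from human ones under the metric.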
Extracting information from short messages
Much currently transmitted information takes the form of e-mails or SMS text messages, and so extracting information from such short messages is increasingly important. The words in a message can be partitioned into the syntactic structure, terms from the domain of discourse, and the data being transmitted. This paper describes a light-weight Information Extraction component which uses pattern matching to separate the three aspects: the structure is supplied as a template; domain terms are the metadata of a data source (or their synonyms); and data is extracted as those words matching placeholders in the templates.
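The template/placeholder separation described above can be sketched with a regular expression; the template, the domain terms acting as anchors, and the message are all hypothetical examples, not taken from the paper:

```python
import re

# Hypothetical template: fixed syntactic structure with domain terms
# ("transfer", "from", "to") as anchors and named placeholders for data.
TEMPLATE = re.compile(r"transfer (?P<amount>\d+) from (?P<src>\w+) to (?P<dst>\w+)")

def extract(message):
    """Return the data bound to the template's placeholders, or None."""
    m = TEMPLATE.search(message.lower())
    return m.groupdict() if m else None

result = extract("Transfer 500 from savings to checking")
```

The named groups play the role of the placeholders: everything the template and domain terms do not account for is, by elimination, the transmitted data.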
Using Section Headings to Compute Cross-Lingual Similarity of Wikipedia Articles
Measuring the similarity of interlanguage-linked Wikipedia articles often requires the use of suitable language resources (e.g., dictionaries and MT systems), which can be problematic for languages with limited or poor translation resources. The size of Wikipedia can also present computational demands when computing similarity. This paper presents a "lightweight" approach to measure cross-lingual similarity in Wikipedia using section headings rather than the entire Wikipedia article, and language resources derived from Wikipedia and Wiktionary to perform translation. Using an existing dataset we evaluate the approach for 7 language pairs. Results show that the performance using section headings is comparable to using all article content, that dictionaries derived from Wikipedia and Wiktionary are sufficient to compute cross-lingual similarity, and that combinations of features can further improve results.
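A minimal sketch of the heading-based comparison, assuming a bilingual dictionary of the kind derived from Wikipedia and Wiktionary, and using Jaccard overlap as an illustrative similarity measure (the entries below are invented):

```python
def heading_similarity(headings_a, headings_b, bilingual_dict):
    """Translate article A's section headings word-by-word with a
    bilingual dictionary, then compare to article B's headings
    using Jaccard overlap over the resulting word sets."""
    translated = set()
    for heading in headings_a:
        for word in heading.lower().split():
            # Fall back to the original word when no translation is known.
            translated.update(bilingual_dict.get(word, {word}))
    target = {w for h in headings_b for w in h.lower().split()}
    if not translated or not target:
        return 0.0
    return len(translated & target) / len(translated | target)

# Toy German->English dictionary for illustration:
d = {"geschichte": {"history"}, "wirtschaft": {"economy"}}
sim = heading_similarity(["Geschichte", "Wirtschaft"], ["History", "Economy"], d)
```

Because only headings are translated and compared, the computation stays cheap regardless of article length, which is the "lightweight" property the abstract highlights.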
Joining up health and bioinformatics: e-science meets e-health
CLEF (Co-operative Clinical e-Science Framework) is an MRC-sponsored project in the e-Science programme that aims to establish methodologies and a technical infrastructure for the next generation of integrated clinical and bioscience research. It is developing methods for managing and using pseudonymised repositories of the long-term patient histories which can be linked to genetic and genomic information or used to support patient care. CLEF concentrates on removing key barriers to managing such repositories: ethical issues, information capture, integration of disparate sources into coherent "chronicles" of events, user-oriented mechanisms for querying and displaying the information, and compiling the required knowledge resources. This paper describes the overall information flow and technical approach designed to meet these aims within a Grid framework.
Don't mention the shoe! A learning to rank approach to content selection for image description generation
We tackle the sub-task of content selection as part of the broader challenge of automatically generating image descriptions. More specifically, we explore how decisions can be made to select what object instances should be mentioned in an image description, given an image and labelled bounding boxes. We propose casting the content selection problem as a learning to rank problem, where object instances that are most likely to be mentioned by humans when describing an image are ranked higher than those that are less likely to be mentioned. Several features are explored: those derived from bounding box localisations, from concept labels, and from image regions. Object instances are then selected based on the ranked list, where we investigate several methods for choosing a stopping criterion as the "cut-off" point for objects in the ranked list. Our best-performing method achieves state-of-the-art performance on the ImageCLEF2015 sentence generation challenge.
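The rank-then-cut-off selection described above can be sketched as follows; both stopping criteria shown (a score threshold and a fixed cap), as well as the scores, are hypothetical stand-ins for the methods the paper investigates:

```python
def select_objects(scored_instances, min_score=0.5, max_mentions=3):
    """Rank object instances by a predicted 'likely to be mentioned' score,
    then cut the ranked list off at a stopping criterion."""
    ranked = sorted(scored_instances, key=lambda x: x[1], reverse=True)
    return [obj for obj, score in ranked if score >= min_score][:max_mentions]

# Hypothetical ranker outputs for one image:
scored = [("person", 0.9), ("dog", 0.8), ("shoe", 0.1), ("frisbee", 0.6)]
selected = select_objects(scored)
```

The title's point is visible in the toy example: a low-ranked but detectable object (the shoe) falls below the cut-off and is never mentioned, matching how humans describe images.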
Cross-lingual document retrieval, categorisation and navigation based on distributed services
The widespread use of the Internet across countries has increased the need for access to document collections that are often written in languages different from a user's native language. In this paper we describe Clarity, a Cross Language Information Retrieval (CLIR) system for English, Finnish, Swedish, Latvian and Lithuanian. Clarity is a fully-fledged retrieval system that supports the user during the whole process of query formulation, text retrieval and document browsing. We address four of the major aspects of Clarity: (i) the user-driven methodology that formed the basis for the iterative design cycle and framework in the project, (ii) the system architecture that was developed to support the interaction and coordination of Clarity's distributed services, (iii) the data resources and methods for query translation, and (iv) the support for Baltic languages. Clarity is an example of a distributed CLIR system built with minimal translation resources and, to our knowledge, the only such system that currently supports Baltic languages.
The SENSEI Overview of Newspaper Readers' Comments
Automatic summarization of reader comments in on-line news is a challenging but clearly useful task. Work to date has produced extractive summaries using well-known techniques from other areas of NLP. But do users really want these, and do they support users in realistic tasks? We specify an alternative summary type for reader comments, based on the notions of issues and viewpoints, and demonstrate our user interface to present it. An evaluation to assess how well summarization systems support users in time-limited tasks (identifying issues and characterizing opinions) gives good results for this prototype.
- …