5,233 research outputs found

    Image annotation with Photocopain

    Photo annotation is a resource-intensive task, yet it is increasingly essential as image archives and personal photo collections grow in size. There is an inherent conflict in the process of describing and archiving personal experiences, because casual users are generally unwilling to expend large amounts of effort on creating the annotations they need to organise their collections and make the best use of them. This paper describes the Photocopain system, a semi-automatic image annotation system which combines information about the context in which a photograph was captured with information from other readily available sources in order to generate outline annotations for that photograph that the user may further extend or amend.

    Vagueness and referential ambiguity in a large-scale annotated corpus

    In this paper, we argue that difficulties in the definition of coreference itself contribute to lower inter-annotator agreement in certain cases. Data from a large referentially annotated corpus serves to corroborate this point, using a quantitative investigation to assess which effects or problems are likely to be the most prominent. Several examples where such problems occur are discussed in more detail. We then propose a generalisation of Poesio, Reyle and Stevenson’s Justified Sloppiness Hypothesis to provide a unified model for these cases of disagreement, and argue that a deeper understanding of the phenomena involved allows us to tackle problematic cases in a more principled fashion than would be possible using only pre-theoretic intuitions.

    Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis

    The Market Blended Insight project has the objective of improving UK business-to-business marketing performance using semantic web technologies. In this project, we are implementing an ontology-driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. The framework handles both semi-structured data and unstructured text on the web, annotating the extracted data and then translating it according to the backend schema.

    CHORUS Deliverable 4.4: Report of the 2nd CHORUS Conference

    The Second CHORUS Conference and third Yahoo! Research Workshop on the Future of Web Search were held on April 4-5, 2008, in Granvalira, Andorra, to discuss future directions in multimedia information access and other specialised topics in the near future of retrieval. Attendance was at capacity, with 97 participants from 11 countries and 3 continents.

    A plant disease extension of the Infectious Disease Ontology

    Plants from a handful of species provide the primary source of food for all people, yet this source is vulnerable to multiple stressors, such as disease, drought, and nutrient deficiency. With rapid population growth and climate uncertainty, the need to produce crops that can tolerate or resist plant stressors is more crucial than ever. Traditional plant breeding methods may not be sufficient to overcome this challenge, and methods such as high-throughput sequencing and automated scoring of phenotypes can provide significant new insights. Ontologies are essential tools for accessing and analysing the large quantities of data that come with these newer methods. As part of a larger project to develop ontologies that describe plant phenotypes and stresses, we are developing a plant disease extension of the Infectious Disease Ontology (IDOPlant). The IDOPlant is envisioned as a reference ontology designed to cover any plant infectious disease. In addition to novel terms for infectious diseases, IDOPlant includes terms imported from other ontologies that describe plants, pathogens, and vectors, the geographic location and ecology of diseases and hosts, and molecular functions and interactions of hosts and pathogens. To encompass this range of data, we are suggesting in-house ontology development complemented with reuse of terms from orthogonal ontologies developed as part of the Open Biomedical Ontologies (OBO) Foundry. The study of plant diseases provides an example of how an ontological framework can be used to model complex biological phenomena such as plant disease, and how plant infectious diseases differ from, and are similar to, infectious diseases in other organisms.

    Recent developments in linguistic annotations of the TüBa-D/Z treebank

    The purpose of this paper is to describe recent developments in the morphological, syntactic, and semantic annotation of the TüBa-D/Z treebank of German. The TüBa-D/Z annotation scheme is derived from the Verbmobil treebank of spoken German [4, 10], but has been extended along various dimensions to accommodate the characteristics of written texts. TüBa-D/Z uses as its data source the "die tageszeitung" (taz) newspaper corpus. The Verbmobil treebank annotation scheme distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level. The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German, and which are widely accepted among descriptive linguists of German [3, 6]. The TüBa-D/Z annotation relies on a context-free backbone (i.e. proper trees without crossing branches) of phrase structure combined with edge labels that specify the grammatical function of the phrase in question. The syntactic annotation scheme of the TüBa-D/Z is described in more detail in [12, 11]. TüBa-D/Z currently comprises approximately 15 000 sentences, with approximately 7 000 sentences being in the correction phase. The latter will be released along with an updated version of the existing treebank before the end of this year. The treebank is available in an XML format, in the NEGRA export format [1] and in the Penn treebank bracketing format. The XML format contains all types of information as described above, the NEGRA export format contains all sentence-internal information, while the Penn treebank format includes only those layers of information that can be expressed as pure tree structures. Over the course of the last year, more fine-grained linguistic annotations have been added along the following dimensions: (1) the labels of the basic Stuttgart-Tübingen tagset (STTS) [9] have been enriched by relevant features of inflectional morphology, (2) named entity information has been encoded as part of the syntactic annotation, and (3) a set of anaphoric and coreference relations has been added to link referentially dependent noun phrases. In the following sections, we will describe each of these innovations in turn and will demonstrate how the additional annotations can be incorporated into one comprehensive annotation scheme.
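    The idea of a context-free backbone with function-bearing edge labels can be illustrated with a minimal Python sketch. It is not the treebank's own tooling: the `Node` class, the `to_brackets` helper, and the `CATEGORY-FUNCTION` label convention in the output are assumptions made for illustration, and the example labels (SIMPX, VF, LK, NX, VXFIN, ON, HD) are used only as plausible placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a proper tree (no crossing branches)."""
    category: str                    # phrase, field, or part-of-speech label
    function: Optional[str] = None   # edge label to the parent (hypothetical field name)
    word: Optional[str] = None       # set only on leaves
    children: List["Node"] = field(default_factory=list)


def to_brackets(node: Node) -> str:
    """Serialise a tree as a bracketed string, attaching the edge label to the category."""
    label = node.category if node.function is None else f"{node.category}-{node.function}"
    if node.word is not None:
        return f"({label} {node.word})"
    inner = " ".join(to_brackets(child) for child in node.children)
    return f"({label} {inner})"


# Illustrative two-word clause: clause > topological fields > phrases > words.
sentence = Node("SIMPX", children=[
    Node("VF", children=[
        Node("NX", function="ON", children=[Node("NE", word="Peter")]),
    ]),
    Node("LK", children=[
        Node("VXFIN", function="HD", children=[Node("VVFIN", word="lacht")]),
    ]),
])

print(to_brackets(sentence))
```

    Because every node has exactly one parent, the four levels (lexical, phrasal, topological field, clausal) nest cleanly and can be written as a single bracketed string; information that is not tree-shaped, such as coreference links between noun phrases, would need a separate layer.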

    Ontological representation of CDC Active Bacterial Core Surveillance Case Reports

    The Centers for Disease Control and Prevention’s Active Bacterial Core Surveillance (CDC ABCs) Program is a collaborative effort between the CDC, state health departments, laboratories, and universities to track invasive bacterial pathogens of particular importance to public health [1]. The year-end surveillance reports produced by this program help to shape public policy and coordinate responses to emerging infectious diseases over time. The ABCs case report form (CRF) data represents an excellent opportunity for data reuse beyond the original surveillance purposes.

    ATLAS: A flexible and extensible architecture for linguistic annotation

    We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on "Annotation Graphs," a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic "signals," including both naturally occurring phenomena (as recorded in images, video, multi-modal interactions, etc.) and the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture. Comment: 8 pages, 9 figures.
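    A rough Python sketch of annotations on a linear signal indexed by intervals is given here as an illustration only. It is not the ATLAS API: the `Arc` and `AnnotationGraph` classes, the `tier` field, and the `overlapping` query are hypothetical names, and the sketch simplifies the graph model by storing each annotation directly as a labelled interval.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Arc:
    """A labelled annotation spanning the half-open interval [start, end) of a signal."""
    start: float   # offset into the signal (characters, seconds, ...)
    end: float
    tier: str      # hypothetical annotation type, e.g. "word" or "phrase"
    label: str


@dataclass
class AnnotationGraph:
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start: float, end: float, tier: str, label: str) -> Arc:
        if end < start:
            raise ValueError("interval end precedes start")
        arc = Arc(start, end, tier, label)
        self.arcs.append(arc)
        return arc

    def overlapping(self, start: float, end: float, tier: Optional[str] = None) -> List[Arc]:
        """Return arcs whose interval intersects [start, end), optionally limited to one tier."""
        return [
            a for a in self.arcs
            if a.start < end and start < a.end and (tier is None or a.tier == tier)
        ]


# Annotate the text signal "the cat" with character offsets.
graph = AnnotationGraph()
graph.add(0, 3, "word", "the")
graph.add(4, 7, "word", "cat")
graph.add(0, 7, "phrase", "NP")

print([a.label for a in graph.overlapping(4, 7, tier="word")])
```

    The linear scan in `overlapping` keeps the sketch short; the interval indexing is what would allow a database-backed store to answer the same query efficiently.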