
    Resource-aware annotation through active learning

    The annotation of corpora has become a crucial prerequisite for information extraction systems, which rely heavily on supervised machine learning techniques and therefore require large amounts of annotated training material. Annotation, however, requires human intervention and is thus an extremely costly, labor-intensive, and error-prone process. The burden of annotation is one of the major obstacles when well-established information extraction systems are to be applied to real-world problems, and so a pressing research question is how annotation can be made more efficient. Most annotated corpora are built by collecting the documents to be annotated by random sampling or by simple keyword search. Only recently have more sophisticated approaches to selecting the base material, with the aim of reducing annotation effort, been investigated. One promising direction is known as Active Learning (AL), where only examples of high utility for classifier training are selected for manual annotation. Because of this intelligent selection, classifiers of a certain target performance can be obtained with fewer labeled data points. This thesis centers on the question of how AL can be applied as a resource-aware strategy for linguistic annotation. A set of requirements is defined, and several approaches and adaptations to the standard form of AL are proposed to meet these requirements. This includes: (1) a novel method to monitor and stop the AL-driven annotation process; (2) an approach to semi-supervised AL where only highly critical tokens actually have to be manually annotated while the rest are automatically tagged; (3) a discussion and empirical investigation of the reusability of actively drawn samples; (4) a comparative study of how class imbalance can be reduced up front during AL-driven data acquisition; (5) two methods for selective sampling of examples which are useful for multiple learning problems; (6) an extensive evaluation of the proposed approaches to AL for Named Entity Recognition with respect to both savings in corpus size and actual annotation time; and finally (7) three methods by which these approaches can be made cost-conscious so as to reduce annotation time even more.
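The selection step described in the abstract above, picking only high-utility examples for manual annotation, can be sketched roughly as follows. This is a minimal illustration that assumes least-confidence uncertainty sampling as the utility measure; the abstract does not say which utility function the thesis uses, and the names `least_confidence`, `select_for_annotation`, and `predict_proba` are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def least_confidence(probs: Sequence[float]) -> float:
    """Utility score (assumed): 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_for_annotation(
    pool: Sequence[T],
    predict_proba: Callable[[T], Sequence[float]],
    batch_size: int,
) -> List[int]:
    """Return indices of the `batch_size` pool items the model is least sure about.

    `predict_proba` is a hypothetical callable that maps one unlabeled example
    to the current classifier's class-probability distribution.
    """
    scored = [(least_confidence(predict_proba(x)), i) for i, x in enumerate(pool)]
    # Highest uncertainty first; ties are broken by pool order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:batch_size]]
```

In a full AL loop, the selected items would be sent to a human annotator, added to the labeled set, and the classifier retrained before the next selection round.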

    Context Aware Computing for The Internet of Things: A Survey

    As we move towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown significant growth in sensor deployments over the past decade and has predicted a significant increase in the growth rate in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data we need to understand it. Collection, modelling, reasoning, and distribution of context in relation to sensor data play a critical role in this challenge. Context-aware computing has proven to be successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We first present the necessary background by introducing the IoT paradigm and context-aware fundamentals. Then we provide an in-depth analysis of the context life cycle. Based on our own taxonomy, we evaluate a subset of projects (50) which represent the majority of research and commercial solutions proposed in the field of context-aware computing over the last decade (2001-2011). Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and IoT. Our goal is not only to analyse, compare and consolidate past research work but also to appreciate their findings and discuss their applicability towards the IoT. Comment: IEEE Communications Surveys & Tutorials Journal, 201

    Building a semantically annotated corpus of clinical texts

    In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, and its value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains.

    A Context-aware Attention Network for Interactive Question Answering

    Neural network-based sequence-to-sequence models in an encoder-decoder framework have been successfully applied to solve Question Answering (QA) problems, predicting answers from statements and questions. However, almost all previous models have failed to consider detailed context information and unknown states in which systems do not have enough information to answer given questions. These scenarios with incomplete or ambiguous information are very common in the setting of Interactive Question Answering (IQA). To address this challenge, we develop a novel model, employing context-dependent word-level attention for more accurate statement representations and question-guided sentence-level attention for better context modeling. We also generate unique IQA datasets to test our model, which will be made publicly available. Employing these attention mechanisms, our model accurately determines, depending on the context, when it can output an answer and when it needs to generate a supplementary question to obtain additional input. When available, user feedback is encoded and directly applied to update sentence-level attention to infer an answer. Extensive experiments on QA and IQA datasets quantitatively demonstrate the effectiveness of our model, with significant improvement over state-of-the-art conventional QA models. Comment: 9 page

    A quick guide for student-driven community genome annotation

    High-quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms. Manual curation is time-consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities, but involving undergraduates in community genome annotation consortiums can benefit both education and genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.

    Magpie: towards a semantic web browser

    Web browsing involves two tasks: finding the right web page and then making sense of its content. So far, research has focused on supporting the task of finding web resources through ‘standard’ information retrieval mechanisms, or semantics-enhanced search. Much less attention has been paid to the second problem. In this paper we describe Magpie, a tool which supports the interpretation of web pages. Magpie offers complementary knowledge sources, which a reader can call upon to quickly gain access to any background knowledge relevant to a web resource. Magpie automatically associates an ontology-based semantic layer with web resources, allowing relevant services to be invoked within a standard web browser. Hence, Magpie may be seen as a step towards a semantic web browser. The functionality of Magpie is illustrated using examples of how it has been integrated with our lab’s web resources.

    An artefact repository to support distributed software engineering

    The Open Source Component Artefact Repository (OSCAR) system is a component of the GENESIS platform designed to inter-operate non-invasively with work-flow management systems, development tools and existing repository systems to support a distributed software engineering team working collaboratively. Every artefact possesses a collection of associated meta-data, both standard and domain-specific, presented as an XML document. Within OSCAR, artefacts are made aware of changes to related artefacts using notifications, allowing them to modify their own meta-data actively, in contrast to other software repositories where users must perform any and all modifications, however trivial. This recording of events, including user interactions, provides a complete picture of an artefact's life from creation to (eventual) retirement, with the intention of supporting collaboration both amongst the members of the software engineering team and agents acting on their behalf.