386 research outputs found

    Towards a query language for annotation graphs

    The multidimensional, heterogeneous, and temporal nature of speech databases raises interesting challenges for representation and query. Recently, annotation graphs have been proposed as a general-purpose representational framework for speech databases. Typical queries on annotation graphs require path expressions similar to those used in semistructured query languages. However, the underlying model is rather different from the customary graph models for semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are important. We develop a query language and describe optimization techniques for an underlying relational representation. Comment: 8 pages, 10 figure

    The Mate Workbench - a tool for annotating XML corpora

    This paper describes the design and implementation of the MATE workbench, a program which provides support for flexible display and editing of XML annotations, and complex querying of a set of linked files. The workbench was designed to support the annotation of XML-coded linguistic corpora, but it could be used to annotate any kind of data, as it is not dependent on any particular annotation scheme. Rather than being a general-purpose XML-aware editor, it is a system for writing specialised editors tailored to a particular annotation task. A particular editor is defined using a transformation language, with suitable display formats and allowable editing operations. The workbench is written in Java, which means that it is platform-independent. This paper outlines the design of the workbench software and compares it with other annotation programs. 1. Introduction The annotation or markup of files with linguistic or other complex information usually requires either human coding or human ..

    Designing Focused and Efficient Annotation Tools


    Report on the 2015 NSF Workshop on Unified Annotation Tooling

    On March 30 & 31, 2015, an international group of twenty-three researchers with expertise in linguistic annotation convened in Sunny Isles Beach, Florida to discuss problems with and potential solutions for the state of linguistic annotation tooling. The participants comprised 14 researchers from the U.S. and 9 from outside the U.S., with 7 countries and 4 continents represented, and hailed from fields and specialties including computational linguistics, artificial intelligence, speech processing, multi-modal data processing, clinical & medical natural language processing, linguistics, documentary linguistics, sign-language linguistics, corpus linguistics, and the digital humanities. The motivating problem of the workshop was the balkanization of annotation tooling, namely, that even though linguistic annotation requires sophisticated tool support to efficiently generate high-quality data, the landscape of tools for the field is fractured, incompatible, inconsistent, and lacks key capabilities. The overall goal of the workshop was to chart the way forward, centering on five key questions: (1) What are the problems with the current tool landscape? (2) What are the possible benefits of solving some or all of these problems? (3) What capabilities are most needed? (4) How should we go about implementing these capabilities? And, (5) How should we ensure longevity and sustainability of the solution? I surveyed the participants before their arrival, which provided significant raw material for ideas, and the workshop discussion itself resulted in the identification of ten specific classes of problems and five sets of most-needed capabilities. Importantly, we identified annotation project managers in computational linguistics as the key recipients and users of any solution, thereby succinctly addressing questions about the scope and audience of potential solutions. We discussed management and sustainability of potential solutions at length. The participants agreed on sixteen recommendations for future work. This technical report contains a detailed discussion of all these topics, a point-by-point review of the discussion in the workshop as it unfolded, detailed information on the participants and their expertise, and the summarized data from the surveys

    FinnFN 1.0: The Finnish frame semantic database

    The article describes the process of creating a Finnish-language FrameNet, or FinnFN, based on the original English-language FrameNet hosted at the International Computer Science Institute in Berkeley, California. We outline the goals and results relating to the FinnFN project and especially to the creation of the FinnFrame corpus. The main aim of the project was to test the universal applicability of frame semantics by annotating real Finnish using the same frames and annotation conventions as in the original Berkeley FrameNet project. From Finnish newspaper corpora, 40,721 sentences were automatically retrieved and manually annotated as example sentences evoking certain frames. This became the FinnFrame corpus. Applying the Berkeley FrameNet annotation conventions to the Finnish language required some modifications due to Finnish morphology, and a convention for annotating individual morphemes within words was introduced for phenomena such as compounding, comparatives and case endings. Various questions about cultural salience across the two languages arose during the project, but problematic situations occurred only in a few examples, which we also discuss in the article. The article shows that, barring a few minor instances, the universality hypothesis of frames is largely confirmed for languages as different as Finnish and English. Peer reviewe

    Communicating with Culture: How Humans and Machines Detect Narrative Elements

    To understand how people communicate, we must understand how they leverage shared stories and all the knowledge, information, and associations contained within those stories. I examine three classes of narrative elements that convey a wealth of cultural knowledge: Propp's morphology, motifs, and discourse structure. Propp's morphology communicates how roles and actions drive a narrative forward; motifs fill those roles and actions with specific, remarkable events; discourse groups these into a coherent structure to convey a point. My thesis has three aims: first, to demonstrate that people can reliably detect and identify all three of these narrative elements; second, to develop automatic detectors for discourse and motifs; third, to demonstrate the deep relation between these narrative elements and other theories of narrative structure and knowledge representation that I refer to as the continuum of communication. The first step of my work answers two key questions about Propp's morphology by demonstrating the reliability of annotators applying Propp's scheme across a variety of experiments, in a double-blind annotation study. Additionally, I demonstrate a shortcoming in Propp's scheme, identifying elements present in the folktales he analyzed that are not part of his morphology. The second step of my work, showing that people familiar with motifs can reliably detect when they are being used to share information and associations, approaches this problem by performing a large-scale annotation study of 21,000 examples into four categories performed by three pairs of annotators over a period of 11 weeks. I show that, in a double-blind annotation study, people familiar with the motifs had a moderate to high degree of agreement, demonstrating the reliability of humans at this task. The third step demonstrates the reliability of applying a theory of news discourse structure to news articles via a double-blind annotation study and, using the results of this annotation, demonstrates a preliminary detector of the news discourse function of paragraphs in news articles. The fourth step of my work, detecting motific usage automatically, consists of a large-scale pipeline that achieves moderate performance. This pipeline is the first work towards automatically detecting motific usage and beats out simple baselines while comparing favorably to and generalizing better than a simple neural network baseline system. Additionally, the pipeline uses explainable features that can be used in future work to further develop our understanding of how humans automatically detect motifs. Finally, I describe an exploration of the broader scope of narrative elements that communicate information between individuals who share a cultural or sub-cultural background. This work is based on a small-scale, in-lab annotation of posts from the “incel” subculture, a niche internet community with extremist elements and, at times, disturbing content. This small annotation has revealed a complex landscape encompassing fourteen categories, more than three times the number of elements as the large-scale annotation, many of which resemble the moving parts of other theories on narrative structure and cognition, including Vladimir Propp's morphology of folktales and Silvan Tomkins' script theory. I describe these relations and provide a rough continuum of the landscape of narrative communication

    SPEEDy. A Practical Editor for Texts Annotated with Standoff Properties

    Standoff properties can be used to record textual properties or annotations that may freely overlap and need not conform to a context-free grammar. In this way they avoid the ‘overlapping hierarchies’ problem inherent in markup languages like XML. Instead of embedding markup tags directly into the text stream, standoff properties are stored separately, and refer to positions in the text where each property starts and ends. However, this has the effect of tightly binding the properties to the text, and hence any change in the underlying text invalidates them. This limitation usually makes this method impractical in cases where the text is mutable, and is mostly used when the text is already fixed or proofread to a high standard. However, if it did become feasible to use standoff properties on mutable texts, this method could also be used in the process of text production, on dynamically evolving texts, such as emails, forum messages, personal notes and even drafts of academic papers. Digitised transcriptions of historical documents, whether produced manually or through OCR, could then be easily corrected at an earlier stage of typographic correctness. By overcoming the overlapping hierarchies problem this technique thus offers the prospect of significant productivity gains for producing digital editions, as well as a new mode of engagement for annotation. This paper describes the SPEEDy editor, a practical realisation of this technique. It outlines the editor’s foundational concepts, its standoff properties model, and its main interface features
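The standoff idea described in this abstract can be illustrated with a minimal sketch. It is not the SPEEDy implementation: the `StandoffProperty` class, the `shift_after_insert` helper, and the sample text are all hypothetical names chosen for illustration. The sketch shows two overlapping properties stored apart from the text, how an insertion makes their offsets stale, and one simple way of shifting offsets so they stay valid.

```python
from dataclasses import dataclass


@dataclass
class StandoffProperty:
    """Hypothetical standoff property: a label over text[start:end]."""
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def covered_text(text, prop):
    """Return the span of text that a property currently points at."""
    return text[prop.start:prop.end]


def shift_after_insert(props, position, length):
    """Return adjusted copies of props after `length` characters
    were inserted at `position` (assumed policy: an insertion at a
    property's start offset pushes the property to the right)."""
    adjusted = []
    for p in props:
        start = p.start + length if p.start >= position else p.start
        end = p.end + length if p.end > position else p.end
        adjusted.append(StandoffProperty(p.label, start, end))
    return adjusted


text = "The quick brown fox"
props = [
    StandoffProperty("emphasis", 4, 15),  # "quick brown"
    StandoffProperty("colour", 10, 19),   # "brown fox", overlaps the first
]

# Editing the text without touching the properties invalidates them.
edited = text[:4] + "very " + text[4:]
stale = [covered_text(edited, p) for p in props]

# Shifting the offsets by the inserted length restores the spans.
fixed = [covered_text(edited, p) for p in shift_after_insert(props, 4, 5)]
```

Here `stale` no longer matches the annotated words, while `fixed` does. Deletions and edits inside a span would need additional rules that this sketch leaves out.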

    A polychromatic 'greenbeard' locus determines patterns of cooperation in a social amoeba

    Cheaters disrupt cooperation by reaping the benefits without paying their fair share of associated costs. Cheater impact can be diminished if cooperators display a tag (‘greenbeard’) and recognise and preferentially direct cooperation towards other tag carriers. Despite its popular appeal, the feasibility of such greenbeards has been questioned because the complex patterns of partner-specific cooperative behaviours seen in nature require greenbeards to come in different colours. Here we show that a locus (‘Tgr’) of a social amoeba represents a polychromatic greenbeard. Patterns of natural Tgr locus sequence polymorphisms predict partner-specific patterns of cooperation by underlying variation in partner-specific protein–protein binding strength and recognition specificity. Finally, Tgr locus polymorphisms increase fitness because they help avoid potential costs of cooperating with incompatible partners. These results suggest that a polychromatic greenbeard can provide a key mechanism for the evolutionary maintenance of cooperation

    Validating a sentiment dictionary for German political language - a workbench note

    Automated sentiment scoring offers relevant empirical information for many political science applications. However, apart from English language resources, validated dictionaries are rare. This note introduces a German sentiment dictionary and assesses its performance against human intuition in parliamentary speeches, party manifestos, and media coverage. The tool published with this note is indeed able to discriminate positive and negative political language. But the validation exercises indicate that positive language is easier to detect than negative language, while the scores are numerically biased to zero. This warrants caution when interpreting sentiment scores as interval or even ratio scales in applied research