
    Validation of Coding Schemes and Coding Workbench

    This report presents the methodology and results of the validation of the MATE best-practice coding schemes and the MATE workbench. The validation phase covered the period from September 1999 to February 2000 and involved project partners as well as Advisory Panel members who kindly volunteered to act as external evaluators. The first part of the report focuses on the evaluation of the theoretical work in MATE, while the second part concentrates on the workbench. In both cases, a questionnaire was used as the core tool to obtain feedback from evaluators. A major problem was the short time available for evaluation, which meant that less feedback than originally expected could be obtained. Evaluation of MATE results will continue after the end of the project.

    Exploring lexical patterns in text : lexical cohesion analysis with WordNet

    We present a system for the linguistic exploration and analysis of lexical cohesion in English texts. Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspecting the annotated text. We describe the system and report on some sample linguistic analyses carried out using the combined thesaurus-corpus resource.
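    The core idea of lexical chaining is to group words that are semantically related according to a thesaurus. As a minimal sketch of the greedy chaining strategy, the toy example below substitutes a tiny hand-made relatedness table for WordNet (the `RELATED` dictionary and all its entries are hypothetical, not drawn from the paper):

```python
# Minimal sketch of greedy lexical chaining. A toy relatedness table
# stands in for WordNet; two words are "related" if either lists the other.
RELATED = {
    "car": {"vehicle", "engine", "wheel"},
    "engine": {"car", "machine"},
    "wheel": {"car", "tyre"},
    "banana": {"fruit"},
    "fruit": {"banana", "apple"},
}

def related(w1, w2):
    """True if either word lists the other as a semantic neighbour."""
    return w2 in RELATED.get(w1, set()) or w1 in RELATED.get(w2, set())

def build_chains(tokens):
    """Greedily attach each token to the first chain holding a related word."""
    chains = []
    for tok in tokens:
        for chain in chains:
            if any(related(tok, member) for member in chain):
                chain.append(tok)
                break
        else:
            chains.append([tok])  # no related chain found: start a new one
    return chains

print(build_chains(["car", "engine", "banana", "wheel", "fruit"]))
# → [['car', 'engine', 'wheel'], ['banana', 'fruit']]
```

    A real implementation would resolve each token to WordNet synsets and score relations (synonymy, hypernymy) rather than using a flat lookup table.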

    The Mate Workbench - a tool for annotating XML corpora

    This paper describes the design and implementation of the MATE workbench, a program which provides support for flexible display and editing of XML annotations, and complex querying of a set of linked files. The workbench was designed to support the annotation of XML-coded linguistic corpora, but it could be used to annotate any kind of data, as it is not dependent on any particular annotation scheme. Rather than being a general-purpose XML-aware editor, it is a system for writing specialised editors tailored to a particular annotation task. A particular editor is defined using a transformation language, with suitable display formats and allowable editing operations. The workbench is written in Java, which means that it is platform-independent. This paper outlines the design of the workbench software and compares it with other annotation programs.
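    The abstract mentions querying a set of linked annotation files. As a rough illustration of the stand-off style this implies (not MATE's actual query language, and with entirely hypothetical element names and ids), one layer of word tokens can be cross-referenced by a second layer via id attributes:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off annotation: a word layer, plus a dialogue-act
# layer that points at word ids, mimicking a pair of linked XML files.
words_xml = """<words>
  <w id="w1">uh</w><w id="w2">yes</w><w id="w3">please</w>
</words>"""
acts_xml = """<acts>
  <act type="accept" refs="w2 w3"/>
  <act type="filler" refs="w1"/>
</acts>"""

words = {w.get("id"): w.text for w in ET.fromstring(words_xml)}
acts = ET.fromstring(acts_xml)

# Query: recover the surface text of every 'accept' act by following ids.
for act in acts.findall("act[@type='accept']"):
    print(" ".join(words[i] for i in act.get("refs").split()))
# → yes please
```

    The point of the stand-off design is that many annotation layers can reference the same base tokens without modifying them.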

    A draft genome sequence of the rose black spot fungus Diplocarpon rosae reveals a high degree of genome duplication

    Background: Black spot is one of the most severe and damaging diseases of garden roses. We present the draft genome sequence of its causative agent Diplocarpon rosae as a working tool to generate molecular markers and to analyze functional and structural characteristics of this fungus. Results: The isolate DortE4 was sequenced with 191x coverage of different read types, which were assembled into 2457 scaffolds. By evidence-supported genome annotation with the MAKER pipeline, 14,004 gene models were predicted, and transcriptomic data indicated that 88.5% of them are expressed during the early stages of infection. Analyses of k-mer distributions resulted in unexpectedly large genome size estimations between 72.5 and 91.4 Mb, which cannot be attributed to its repeat structure and content of transposable elements alone, factors that explain such differences in other fungal genomes. In contrast, different lines of evidence demonstrate that a large proportion (approximately 80%) of genes are duplicated, which might indicate a whole-genome duplication event. By PCR-RFLP analysis of six paralogous gene pairs of BUSCO orthologs, which are expected to be single-copy genes, we could show experimentally that the duplication is not due to technical error and that not all isolates tested possess all of the paralogs. Conclusions: The presented genome sequence is still a fragmented draft but contains almost the complete gene space. Therefore, it provides a useful working tool to study the interaction of D. rosae with the host and the influence of a genome duplication outside of the model yeast in the background of a phytopathogen.
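    The k-mer based size estimation mentioned above rests on a simple relation: total k-mer count divided by the modal k-mer multiplicity (the sequencing coverage peak) approximates genome size. The toy sketch below illustrates the arithmetic on an error-free mini "genome"; real pipelines additionally model sequencing errors and heterozygosity:

```python
from collections import Counter

def kmer_size_estimate(reads, k):
    """Estimate genome size as total k-mers / modal k-mer multiplicity.

    Toy illustration of the k-mer spectrum method; real tools fit a
    model to the spectrum and discard low-frequency error k-mers.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    # The peak of the multiplicity histogram approximates coverage depth.
    histogram = Counter(counts.values())
    peak = max(histogram, key=histogram.get)
    return total // peak

# A 20 bp "genome" sequenced at 3x coverage (three identical reads).
genome = "ACGTACGGTTCAGGACCTGA"
print(kmer_size_estimate([genome] * 3, k=5))  # → 16 (= 20 - 5 + 1 positions)
```

    Unresolved gene duplication inflates exactly this estimate: duplicated regions contribute distinct k-mers at normal depth, so the spectrum reports a genome larger than the assembly.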

    Corpus access for beginners: the W3Corpora project


    Report on the 2015 NSF Workshop on Unified Annotation Tooling

    On March 30 & 31, 2015, an international group of twenty-three researchers with expertise in linguistic annotation convened in Sunny Isles Beach, Florida to discuss problems with and potential solutions for the state of linguistic annotation tooling. The participants comprised 14 researchers from the U.S. and 9 from outside the U.S., with 7 countries and 4 continents represented, and hailed from fields and specialties including computational linguistics, artificial intelligence, speech processing, multi-modal data processing, clinical & medical natural language processing, linguistics, documentary linguistics, sign-language linguistics, corpus linguistics, and the digital humanities. The motivating problem of the workshop was the balkanization of annotation tooling, namely, that even though linguistic annotation requires sophisticated tool support to efficiently generate high-quality data, the landscape of tools for the field is fractured, incompatible, inconsistent, and lacks key capabilities. The overall goal of the workshop was to chart the way forward, centering on five key questions: (1) What are the problems with the current tool landscape? (2) What are the possible benefits of solving some or all of these problems? (3) What capabilities are most needed? (4) How should we go about implementing these capabilities? And, (5) How should we ensure longevity and sustainability of the solution? I surveyed the participants before their arrival, which provided significant raw material for ideas, and the workshop discussion itself resulted in the identification of ten specific classes of problems and five sets of most-needed capabilities. Importantly, we identified annotation project managers in computational linguistics as the key recipients and users of any solution, thereby succinctly addressing questions about the scope and audience of potential solutions. We discussed management and sustainability of potential solutions at length.
The participants agreed on sixteen recommendations for future work. This technical report contains a detailed discussion of all these topics, a point-by-point review of the discussion in the workshop as it unfolded, detailed information on the participants and their expertise, and the summarized data from the surveys.

    Creating and exploiting multimodal annotated corpora

    The paper presents a project of the Laboratoire Parole et Langage which aims at collecting, annotating and exploiting a corpus of spoken French in a multimodal perspective. The project directly meets present needs in linguistics, where a growing number of researchers have become aware that a theory of communication aiming to describe real interactions must take into account the complexity of those interactions. However, in order to do so, linguists need access to spoken corpora annotated at different levels. The paper presents the annotation schemes used at the LPL for phonetics, morphology and syntax, prosody, and gesture, together with the type of linguistic description derived from the annotations, illustrated by two examples.

    Capturing and sharing our collective expertise on climate data: the CHARMe project

    For users of climate services, the ability to quickly determine the datasets that best fit one's needs would be invaluable. The volume, variety and complexity of climate data make this judgment difficult. The ambition of CHARMe ("Characterization of metadata to enable high-quality climate services") is to give a wider interdisciplinary community access to a range of supporting information, such as journal articles, technical reports or feedback on previous applications of the data. The capture and discovery of this "commentary" information, often created by data users rather than data providers, and currently not linked to the data themselves, has not been significantly addressed previously. CHARMe applies the principles of Linked Data and open web standards to associate, record, search and publish user-derived annotations in a way that can be read both by users and automated systems. Tools have been developed within the CHARMe project that enable annotation capability for data delivery systems already in wide use for discovering climate data. In addition, the project has developed advanced tools for exploring data and commentary in innovative ways, including an interactive data explorer and comparator ("CHARMe Maps") and a tool for correlating climate time series with external "significant events" (e.g. instrument failures or large volcanic eruptions) that affect the data quality. Although the project focuses on climate science, the concepts are general and could be applied to other fields. All CHARMe system software is open-source, released under a liberal licence, permitting future projects to re-use the source code as they wish.
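    Linking user commentary to a dataset with open web standards can be pictured with the W3C Web Annotation model, where an annotation has a body (the commentary resource) and a target (the dataset). The sketch below builds such a record as JSON-LD; the vocabulary context is the real W3C one, but the identifiers, creator, and exact profile are hypothetical and may differ from what CHARMe actually uses:

```python
import json

# Hypothetical annotation linking a paper (body) to a dataset (target),
# following the W3C Web Annotation data model.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "linking",
    # Commentary resource: a publication assessing the dataset.
    "body": {"id": "https://doi.org/10.0000/example-paper", "type": "SpecificResource"},
    # The climate dataset being commented on (illustrative URL).
    "target": "https://example.org/datasets/sst-monthly-v2",
    "creator": {"type": "Person", "name": "A. Data-User"},
}

print(json.dumps(annotation, indent=2))
```

    Because the record is plain JSON-LD, both humans and automated agents can follow the body and target links, which is the machine-readability the abstract describes.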

    Corpora for Computational Linguistics

    Since the mid-1990s, corpora have become very important for computational linguistics. This paper offers a survey of how they are currently used in different fields of the discipline, with particular emphasis on anaphora and coreference resolution, automatic summarisation and term extraction. Their influence on other fields is also briefly discussed.