
    GeoAnnotator: A Collaborative Semi-Automatic Platform for Constructing Geo-Annotated Text Corpora

    Ground-truth datasets are essential for the training and evaluation of any automated algorithm. As such, gold-standard annotated corpora underlie most advances in natural language processing (NLP). However, only a few relatively small (geo-)annotated datasets are available for geoparsing, i.e., the automatic recognition and geolocation of place references in unstructured text. Creating geoparsing corpora that include both the recognition of place names in text and the matching of those names to toponyms in a geographic gazetteer (a process we call geo-annotation) is a laborious, time-consuming and expensive task. The field lacks both efficient geo-annotation tools to support corpus building and design guidelines for the development of such tools. Here, we present the iterative design of GeoAnnotator, a web-based, semi-automatic and collaborative visual analytics platform for geo-annotation. GeoAnnotator facilitates the collaborative, multi-annotator creation of large corpora of geo-annotated text by providing computationally-generated pre-annotations that human annotators can review and improve. The resulting corpora can be used for improving and benchmarking geoparsing algorithms as well as various other spatial-language methods. Further, the iterative design process and the resulting design decisions can be reused in annotation platforms tailored to other application domains of NLP.
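The geo-annotation described above pairs a recognized place-name span with a gazetteer entry. A minimal sketch of such an annotation record, assuming illustrative field names and a hypothetical GeoNames-style identifier (not GeoAnnotator's actual data model):

```python
from dataclasses import dataclass

@dataclass
class GeoAnnotation:
    start: int        # character offset where the place name begins
    end: int          # character offset where it ends (exclusive)
    surface: str      # the place name as it appears in the text
    toponym_id: str   # identifier of the matched gazetteer entry
    lat: float        # coordinates taken from the gazetteer
    lon: float

text = "Floods hit Paris on Tuesday."
# A pre-annotation a human annotator would then review and confirm or fix.
ann = GeoAnnotation(start=11, end=16, surface=text[11:16],
                    toponym_id="geonames:2988507", lat=48.8566, lon=2.3522)
print(ann.surface)
```

Stand-off offsets like these leave the source text untouched, which is what makes collaborative, multi-annotator correction of pre-annotations practical.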

    ATLAS: A flexible and extensible architecture for linguistic annotation

    We describe a formal model for annotating linguistic artifacts, from which we derive an application programming interface (API) to a suite of tools for manipulating these annotations. The abstract logical model provides for a range of storage formats and promotes the reuse of tools that interact through this API. We focus first on ``Annotation Graphs,'' a graph model for annotations on linear signals (such as text and speech) indexed by intervals, for which efficient database storage and querying techniques are applicable. We note how a wide range of existing annotated corpora can be mapped to this annotation graph model. This model is then generalized to encompass a wider variety of linguistic ``signals,'' including both naturally occurring phenomena (as recorded in images, video, multi-modal interactions, etc.) and the derived resources that are increasingly important to the engineering of natural language processing systems (such as word lists, dictionaries, aligned bilingual corpora, etc.). We conclude with a review of the current efforts towards implementing key pieces of this architecture.
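The annotation-graph idea can be sketched as labelled arcs between anchors on a linear signal, with multiple annotation layers sharing the same anchors. This is an illustrative toy (character offsets as anchors, invented class and method names), not the ATLAS API:

```python
class AnnotationGraph:
    """Labelled arcs over a linear signal; anchors are offsets."""

    def __init__(self):
        self.arcs = []  # (start_anchor, end_anchor, type, label)

    def add(self, start, end, ann_type, label):
        self.arcs.append((start, end, ann_type, label))

    def query(self, ann_type):
        """Return all arcs of a given annotation type, a simple example
        of the interval-indexed querying the model supports."""
        return [a for a in self.arcs if a[2] == ann_type]

g = AnnotationGraph()
# "the cat sat": word and part-of-speech layers share the same anchors.
g.add(0, 3, "word", "the"); g.add(0, 3, "pos", "DT")
g.add(4, 7, "word", "cat"); g.add(4, 7, "pos", "NN")
g.add(8, 11, "word", "sat"); g.add(8, 11, "pos", "VBD")
print([label for _, _, _, label in g.query("pos")])  # ['DT', 'NN', 'VBD']
```

Because arcs only reference anchors, tools that manipulate one layer need not know about the others, which is the reuse property the abstract model is after.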

    Compilation and Exploitation of Parallel Corpora

    With more and more text becoming available in electronic form, it is becoming relatively easy to obtain digital texts together with their translations. The paper presents the processing steps necessary to compile such texts into parallel corpora, an extremely useful language resource. Parallel corpora can be used as a translation aid for second-language learners, for translators and lexicographers, or as a data source for various language technology tools. We present our work in this direction, which is characterised by the use of open standards for text annotation, the use of publicly available third-party tools, and the wide availability of the produced resources. We explain the corpus annotation chain, which involves normalisation, tokenisation, segmentation, alignment, word-class syntactic tagging, and lemmatisation. Two exploitation results over our annotated corpora are also presented, namely a Web concordancer and the extraction of bilingual lexica.
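The early steps of the annotation chain above can be sketched in a few lines. This is a toy illustration under strong simplifying assumptions: a regex tokeniser and a naive positional sentence aligner, whereas real pipelines use dedicated tools and length- or lexicon-based aligners that handle 1:2 and 2:1 alignments:

```python
import re

def tokenise(sentence):
    """Split into word and punctuation tokens (toy tokenisation step)."""
    return re.findall(r"\w+|[^\w\s]", sentence, re.UNICODE)

def align(src_sents, tgt_sents):
    """Pair sentences positionally; a real aligner also handles
    one-to-many and many-to-one alignment beads."""
    return list(zip(src_sents, tgt_sents))

src = ["Hello world.", "How are you?"]
tgt = ["Hallo Welt.", "Wie geht es dir?"]
pairs = align(src, tgt)
print(tokenise(pairs[0][0]))  # ['Hello', 'world', '.']
```

Each later stage of the chain (tagging, lemmatisation) then enriches these tokenised, aligned segments with further annotation layers.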

    OpenTagger: A flexible and user-friendly linguistic tagger

    Linguistic annotation adds valuable information to a corpus. Annotated corpora are highly useful for linguists, since they increase the range of linguistic phenomena that may be registered, categorised and retrieved. In addition, they are also significant for machines, as Natural Language Processing applications involve working with well-annotated data (e.g. Imran, Mitra and Castillo 2016) and some machine learning classifiers employ annotated data to test or train new language annotation tools, among other uses. In this regard, Pustejovsky and Stubbs (2012) report on stages for building annotated corpora to train machine learning algorithms. This paper describes OpenTagger, a new linguistic tagger that allows users to attach any type of information to the paragraphs, sentences, or words that compose a text. OpenTagger is characterised by its high usability and flexibility. It is a web application that allows users to manually annotate texts using their own predefined tag set or by creating a new one. Thus, it offers an answer to any need for a tailor-made annotation system. This tagset may include nested categories. In addition, multiple layers of annotation are possible. The annotation process is very easy and provides two options: i) selecting text and tagging it; ii) selecting a tag and annotating as much text as required. OpenTagger also includes a search box to query the text and retrieve relevant sections for tagging. In sum, the open character of this tool and its user-friendliness extend the benefits of annotation to a wider variety of research questions. OpenTagger differs from other well-known taggers such as Nooj (Silberztein, 2005) in its simplicity and web access, as it is not specialised for grammar construction or other complex processes. Potential users range from novice linguists to experts.
    Last, it should be mentioned that a further integration within the corpus analysis software ACTRES Corpus Manager (Sanjurjo-González, 2017) is planned for the future. OpenTagger will make the process of building and querying custom annotated corpora more straightforward using ACM.
    Funding: ACTRES, TRALIMA/ITZULIK, GIU19/067, Gobierno Vasco IT1209/1
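A user-defined tagset with nested categories, as described above, can be sketched as a small tree that annotations are validated against. The names and structure here are illustrative assumptions, not OpenTagger's actual implementation:

```python
# A nestable, user-defined tagset: 'entity' has subcategories.
tagset = {"entity": {"person": {}, "place": {}}}

def valid_tag(path, tags=tagset):
    """Check that a tag path like 'entity/place' exists in the tagset."""
    for part in path.split("/"):
        if part not in tags:
            return False
        tags = tags[part]
    return True

annotations = []

def annotate(text, start, end, tag):
    """Record a stand-off annotation for a span, enforcing the tagset."""
    assert valid_tag(tag), f"unknown tag: {tag}"
    annotations.append({"span": (start, end),
                        "text": text[start:end], "tag": tag})

doc = "Ada Lovelace was born in London."
annotate(doc, 0, 12, "entity/person")
annotate(doc, 25, 31, "entity/place")
print([a["text"] for a in annotations])  # ['Ada Lovelace', 'London']
```

Keeping annotations stand-off like this is also what makes multiple layers of annotation over the same text straightforward.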


    Cell line name recognition in support of the identification of synthetic lethality in cancer from text

    Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers
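The F-scores reported above combine precision and recall over predicted entity spans. A small sketch of standard span-level NER evaluation, with toy data rather than the Gellus or CLL annotations (this is the common metric, not NERsuite's own code):

```python
def f_score(gold, predicted):
    """Span-level F1: a prediction counts only if it matches a gold
    span exactly (same offsets and type)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Spans as (start, end, type) tuples; toy example with one match.
gold = [(0, 4, "cell_line"), (20, 25, "cell_line")]
pred = [(0, 4, "cell_line"), (30, 35, "cell_line")]
print(round(f_score(gold, pred), 2))  # 0.5
```

Exact-match span evaluation is strict by design: a boundary error counts as both a false positive and a false negative, which keeps reported scores comparable across systems.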

    Unity in diversity : integrating differing linguistic data in TUSNELDA

    This paper describes the creation and preparation of TUSNELDA, a collection of corpus data built for linguistic research. This collection contains a number of linguistically annotated corpora which differ in various aspects such as language, text type, encoded annotation levels, and the linguistic theories underlying the annotation. The paper focuses on this variation on the one hand, and on the way these heterogeneous data are integrated into one resource on the other.

    Exploring lexical patterns in text : lexical cohesion analysis with WordNet

    We present a system for the linguistic exploration and analysis of lexical cohesion in English texts. Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspection of the annotated text. We describe the system and report on some sample linguistic analyses carried out using the combined thesaurus-corpus resource
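Lexical chaining, as used above, groups words in a text that are related in a thesaurus. A simplified greedy chainer is sketched below; a toy relatedness table stands in for WordNet (with NLTK's WordNet interface one would test synset relations instead), so both the data and the function names are illustrative:

```python
# Toy stand-in for thesaurus relatedness (WordNet in the real system).
related = {("car", "vehicle"), ("vehicle", "truck"), ("cat", "animal")}

def is_related(a, b):
    return a == b or (a, b) in related or (b, a) in related

def build_chains(words):
    """Greedily attach each word to the first chain containing a
    related word; start a new chain otherwise."""
    chains = []
    for w in words:
        for chain in chains:
            if any(is_related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

print(build_chains(["car", "vehicle", "cat", "truck", "animal"]))
# [['car', 'vehicle', 'truck'], ['cat', 'animal']]
```

The resulting chains are exactly the kind of annotation a graphical interface can overlay on the text for inspection of cohesive ties.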

    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a pressing need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology, and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
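The inter-annotator agreement mentioned above is commonly measured with chance-corrected statistics such as Cohen's kappa. A minimal sketch for two annotators labelling the same candidate terms, with invented toy labels rather than the paper's data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over the same items:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

ann1 = ["term", "term", "not", "term", "not", "not"]
ann2 = ["term", "not", "not", "term", "not", "term"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.33
```

Because kappa discounts the agreement two annotators would reach by chance, it stays low when the term/non-term boundary is genuinely fuzzy, which is exactly the difficulty the abstract describes.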

    BibGlimpse: The case for a light-weight reprint manager in distributed literature research

    Background: While text-mining and distributed annotation systems both aim at capturing knowledge and presenting it in a standardized form, there have been few attempts to investigate potential synergies between the two fields. For instance, distributed annotation would be very well suited for providing topic-focused, expert-knowledge-enriched text corpora. A key limitation for this approach is the availability of literature annotation systems that can be routinely used by groups of collaborating researchers on a day-to-day basis, without distracting from the main focus of their work. Results: For this purpose, we have designed BibGlimpse. Features like drop-to-file, SVM-based automated retrieval of PubMed bibliography for PDF reprints, and annotation support make BibGlimpse an efficient, light-weight reprint manager that facilitates distributed literature research for work groups. Building on an established open search engine, full-text search and structured queries are supported, while at the same time shared collections of annotated reprints are made accessible to literature classification and text-mining tools. Conclusion: BibGlimpse offers scientists a tool that enhances their own literature management. Moreover, it may be used to create content-enriched, annotated text corpora for research in text-mining.