
    Sequence Labeling for Citation Field Extraction from Cyrillic Script References

    Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references, while the remainder are generated synthetically. With random samples of varying size of this data, we train multiple well-performing sequence labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model we retrain and evaluate on our data.

    LaTeX, metadata, and publishing workflows

    The field of scientific publishing that is served by LaTeX is increasingly dependent on the availability of metadata about publications. We discuss how to use LaTeX classes and BibTeX styles to curate metadata throughout the life cycle of a published article. Our focus is on streamlining and automating much of the publishing workflow. We survey the various options and drawbacks of the existing approaches and outline our approach as applied in a new LaTeX style file, whose main goal is to let authors specify their metadata only once and use it throughout the entire publishing pipeline. We believe this can help reduce the cost of publishing by reducing the human effort required to edit and provide publication metadata.

    Visual approaches to knowledge organization and contextual exploration

    This thesis explores possible visual approaches for the representation of semantic structures, such as zz-structures. Some holistic visual representations of complex domains have been investigated through the proposal of new views - the so-called zz-views - that both make visible the interconnections between elements and support a contextual and multilevel exploration of knowledge. The potential of this approach has been examined in the context of two case studies that have led to the creation of two Web applications. The first domain of study concerned the visual representation, analysis and management of scientific bibliographies. In this context, we modeled a Web application, which we called VisualBib, to support researchers in building, refining, analyzing and sharing bibliographies. We adopted a multi-faceted approach integrating features that are typical of three different classes of tools: bibliography visual analysis systems, bibliographic citation indexes and personal research assistants. The evaluation studies carried out on a first prototype highlighted the positive impact of our visual model and encouraged us to improve it and develop further visual analysis features, which we incorporated in version 3.0 of the application. The second case study concerned the modeling and development of a multimedia catalog of Web and mobile applications. The objective was to provide an overview of a significant number of tools that can help teachers in the implementation of active learning approaches supported by technology and in the design of Teaching and Learning Activities (TLAs). We analyzed and documented 281 applications, preparing for each of them a detailed multilingual card and a video presentation, and organizing all the material in an original purpose-based taxonomy, visually represented through a browsable holistic view. The catalog, which we called AppInventory, provides contextual exploration mechanisms based on zz-structures, collects user contributions and evaluations about the apps and offers visual analysis tools for the comparison of application data and user evaluations. The results of two user studies carried out on groups of teachers and students showed a very positive impact of our proposal in terms of graphical layout, semantic structure, navigation mechanisms and usability, also in comparison with two similar catalogs.

    BibGlimpse: The case for a light-weight reprint manager in distributed literature research

    Background: While text-mining and distributed annotation systems both aim at capturing knowledge and presenting it in a standardized form, there have been few attempts to investigate potential synergies between these two fields. For instance, distributed annotation would be very well suited for providing topic focussed, expert knowledge enriched text corpora. A key limitation for this approach is the availability of literature annotation systems that can be routinely used by groups of collaborating researchers on a day to day basis, not distracting from the main focus of their work. Results: For this purpose, we have designed BibGlimpse. Features like drop-to-file, SVM based automated retrieval of PubMed bibliography for PDF reprints, and annotation support make BibGlimpse an efficient, light-weight reprint manager that facilitates distributed literature research for work groups. Building on an established open search engine, full-text search and structured queries are supported, while at the same time making shared collections of annotated reprints accessible to literature classification and text-mining tools. Conclusion: BibGlimpse offers scientists a tool that enhances their own literature management. Moreover, it may be used to create content enriched, annotated text corpora for research in text-mining.

    The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research

    Recent advances in deep learning methods for natural language processing (NLP) have created new business opportunities and made NLP research critical for industry development. Because industry is one of the big players in the field of NLP, together with governments and universities, it is important to track its influence on research. In this study, we seek to quantify and characterize industry presence in the NLP community over time. Using a corpus with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP publication authors, we explore the industry presence in the field since the early 90s. We find that industry presence among NLP authors was steady before a steep increase over the past five years (180% growth from 2017 to 2022). A few companies account for most of the publications and provide funding to academic researchers through grants and internships. Our study shows that the presence and impact of the industry on natural language processing research are significant and fast-growing. This work calls for increased transparency of industry influence in the field.