Sequence Labeling for Citation Field Extraction from Cyrillic Script References
Extracting structured data from bibliographic references is a crucial task for the creation of scholarly databases. While approaches, tools, and evaluation data sets for the task exist, there is a distinct lack of support for languages other than English and scripts other than the Latin alphabet. A significant portion of the scientific literature that is thereby excluded consists of publications written in Cyrillic-script languages. To address this problem, we introduce a new multilingual and multidisciplinary data set of over 100,000 labeled reference strings. The data set covers multiple Cyrillic languages and contains over 700 manually labeled references, while the remainder are generated synthetically. Using random samples of varying size drawn from this data, we train multiple well-performing sequence labeling BERT models and thus show the usability of our proposed data set. To this end, we showcase an implementation of a multilingual BERT model trained on the synthetic data and evaluated on the manually labeled references. Our model achieves an F1 score of 0.93 and thereby significantly outperforms a state-of-the-art model we retrain and evaluate on our data.
LaTeX, metadata, and publishing workflows
The field of scientific publishing that is served by LaTeX is increasingly
dependent on the availability of metadata about publications. We discuss how to
use LaTeX classes and BibTeX styles to curate metadata throughout the life
cycle of a published article. Our focus is on streamlining and automating much
of the publishing workflow. We survey the options and drawbacks of the
existing approaches and outline our approach as applied in a new LaTeX style
file, whose main goal is to make it easier for authors to specify their
metadata only once and use it throughout the entire publishing pipeline. We
believe this can help to reduce the cost of publishing by reducing the amount
of human effort required for editing and providing publication metadata.
Visual approaches to knowledge organization and contextual exploration
This thesis explores possible visual approaches for the representation of semantic structures, such as zz-structures. Some holistic visual representations of complex domains have been investigated through the proposal of new views, the so-called zz-views, that both make the interconnections between elements visible and support a contextual and multilevel exploration of knowledge. The potential of this approach has been examined in the context of two case studies that have led to the creation of two Web applications.
The first domain of study regarded the visual representation, analysis and management of scientific bibliographies. In this context, we modeled a Web application, which we called VisualBib, to support researchers in building, refining, analyzing and sharing bibliographies. We adopted a multi-faceted approach integrating features that are typical of three different classes of tools: bibliography visual analysis systems, bibliographic citation indexes and personal research assistants. The evaluation studies carried out on a first prototype highlighted the positive impact of our visual model and encouraged us to improve it and develop further visual analysis features, which we incorporated in version 3.0 of the application.
The second case study concerned the modeling and development of a multimedia catalog of Web and mobile applications. The objective was to provide an overview of a significant number of tools that can help teachers in the implementation of active learning approaches supported by technology and in the design of Teaching and Learning Activities (TLAs). We analyzed and documented 281 applications, preparing for each of them a detailed multilingual card and a video presentation, and organizing all the material in an original purpose-based taxonomy, visually represented through a browsable holistic view. The catalog, which we called AppInventory, provides contextual exploration mechanisms based on zz-structures, collects user contributions and evaluations about the apps, and offers visual analysis tools for the comparison of the application data and user evaluations. The results of two user studies carried out on groups of teachers and students showed a very positive impact of our proposal in terms of graphical layout, semantic structure, navigation mechanisms and usability, also in comparison with two similar catalogs.
BibGlimpse: The case for a light-weight reprint manager in distributed literature research
Background
While text-mining and distributed annotation systems both aim at capturing knowledge and presenting it in a standardized form, there have been few attempts to investigate potential synergies between these two fields. For instance, distributed annotation would be very well suited for providing topic focussed, expert knowledge enriched text corpora. A key limitation for this approach is the availability of literature annotation systems that can be routinely used by groups of collaborating researchers on a day to day basis, not distracting from the main focus of their work.
Results
For this purpose, we have designed BibGlimpse. Features like drop-to-file, SVM based automated retrieval of PubMed bibliography for PDF reprints, and annotation support make BibGlimpse an efficient, light-weight reprint manager that facilitates distributed literature research for work groups. Building on an established open search engine, full-text search and structured queries are supported, while at the same time making shared collections of annotated reprints accessible to literature classification and text-mining tools.
Conclusion
BibGlimpse offers scientists a tool that enhances their own literature management. Moreover, it may be used to create content-enriched, annotated text corpora for research in text-mining.
The Elephant in the Room: Analyzing the Presence of Big Tech in Natural Language Processing Research
Recent advances in deep learning methods for natural language processing
(NLP) have created new business opportunities and made NLP research critical
for industry development. Because industry is one of the big players in NLP,
together with governments and universities, it is important to track its
influence on research. In this study, we seek to quantify and
characterize industry presence in the NLP community over time. Using a corpus
with comprehensive metadata of 78,187 NLP publications and 701 resumes of NLP
publication authors, we explore the industry presence in the field since the
early 90s. We find that industry presence among NLP authors was steady
before rising steeply over the past five years (180% growth from 2017 to
2022). A few companies account for most of the publications and provide funding
to academic researchers through grants and internships. Our study shows that
the presence and impact of the industry on natural language processing research
are significant and fast-growing. This work calls for increased transparency of
industry influence in the field.
- …