Measuring academic influence: Not all citations are equal
The importance of a research article is routinely measured by counting how
many times it has been cited. However, treating all citations with equal weight
ignores the wide variety of functions that citations perform. We want to
automatically identify the subset of references in a bibliography that have a
central academic influence on the citing paper. For this purpose, we examine
the effectiveness of a variety of features for determining the academic
influence of a citation. By asking authors to identify the key references in
their own work, we created a data set in which citations were labeled according
to their academic influence. Using automatic feature selection with supervised
machine learning, we found a model for predicting academic influence that
achieves good performance on this data set using only four features. The best
features, among those we evaluated, were those based on the number of times a
reference is mentioned in the body of a citing paper. The performance of these
features inspired us to design an influence-primed h-index (the hip-index).
Unlike the conventional h-index, it weights citations by how many times a
reference is mentioned. According to our experiments, the hip-index is a better
indicator of researcher performance than the conventional h-index.
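The abstract does not spell out the hip-index's exact weighting function, so the following is a minimal sketch under one assumption: each citing paper contributes a weight equal to the number of times it mentions the reference, and the index is then computed h-index-style over these weighted counts.

```python
def h_index(scores):
    # Largest h such that at least h of the scores are >= h.
    scores = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(scores, start=1):
        if score >= rank:
            h = rank
    return h

def hip_index(mention_counts):
    # Sketch of an influence-primed h-index.  `mention_counts` maps each
    # of a researcher's papers to a list of in-text mention counts, one
    # per citing paper.  The weighting below (a citation mentioned m
    # times counts m) is our assumption, not necessarily the paper's
    # exact scheme.
    weighted = [sum(m) for m in mention_counts.values()]
    return h_index(weighted)

papers = {
    "paper A": [1, 3, 2],  # cited by three papers, mentioned 1, 3 and 2 times
    "paper B": [2, 1],
    "paper C": [4],
}
print(h_index([len(m) for m in papers.values()]))  # conventional h-index: 2
print(hip_index(papers))                           # mention-weighted index: 3
```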
Language Models for Citation Classification
Authors reference academic works for a variety of reasons. As a result, not all citations in a research article have the same purpose. The need to understand and distinguish these citation purposes led to the development of automated approaches that consider semantic cues in the form of the context surrounding a citation. Identifying the semantic aspects of citations has proven valuable in various applications, including research assessment, information retrieval, document summarisation, and more.
Although automated citation classification has been studied since the early 2000s, current efforts to determine citation types from their contexts remain largely domain-specific. Moreover, there is a lack of standard benchmarks for evaluating citation classification models. Extracting valuable metadata related to the reason behind a citation in scientific articles, particularly across multiple domains, is laborious, and researchers still lack consensus on the optimal context size for effective detection of citation function. Current methods rely heavily on the amount of annotated data available for training, making them data-centric. The emergence of self-supervised language models, which efficiently learn contextual relationships from vast unannotated datasets, has brought substantial changes to Natural Language Processing in recent years. Despite these advances, the few-shot predictive capability of language models remains under-utilised in this field.
This thesis addresses the above shortcomings of citation classification. We systematically and comprehensively review the methodologies used in previous work, identify the research gaps, and outline potential future directions. This meta-analysis forms the foundation for the research problems addressed in Chapters 3, 4, 5 and 6.
Firstly, in Chapter 3 we introduce a novel benchmark in the form of an open shared-task competition for multi-disciplinary citation classification. The methods submitted to this shared task highlighted the superiority of deep learning-based approaches and hinted at the importance of incorporating additional context to enhance the performance of citation classification models.
Secondly, in Chapter 4 we create a new open-access, feature-enriched, multi-disciplinary citation classification dataset to overcome the challenges associated with extracting meta-data from both citing and cited articles. The feature extraction process, which draws on multiple sources and must handle missing meta-data values, illustrates the complexities involved in building features for a heterogeneous dataset.
In Chapter 5, we assess domain-specific and multi-disciplinary datasets by fine-tuning pre-trained scientific language models on them, specifically exploring various fixed citation context windows. We also introduce a new method for automatically extracting dynamic context windows in an unsupervised manner. Both sets of experiments emphasise the significance of additional context in citation context classification. Moreover, the experimental results show that the optimal citation context window is domain-dependent, providing evidence for the benefit of extracting context dynamically.
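The abstract does not describe the extraction procedures in detail; as an illustrative sketch (the citation-marker regex and sentence-level windowing are our assumptions), a fixed window of k sentences around the citing sentence might be extracted as follows:

```python
import re

# Assumed citation-marker formats, e.g. "[12]" or "(Smith et al., 2020)".
CITATION_MARKER = re.compile(r"\[\d+\]|\([A-Z]\w+ et al\., \d{4}\)")

def fixed_context_windows(sentences, k=1):
    # Return (citing_sentence, context) pairs, where the context is a
    # fixed window of k sentences before and after each sentence that
    # contains a citation marker.  The window size k is the
    # hyper-parameter whose domain dependence motivates dynamic
    # extraction instead of a fixed value.
    windows = []
    for i, sentence in enumerate(sentences):
        if CITATION_MARKER.search(sentence):
            context = " ".join(sentences[max(0, i - k): i + k + 1])
            windows.append((sentence, context))
    return windows
```

A dynamic variant would replace the fixed k with a data-driven stopping criterion, for example growing the window only while neighbouring sentences remain topically related to the citing sentence.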
Lastly, Chapter 6 presents novel prompting strategies for scientific and general-purpose language models that reduce the dependence on labelled citation classification datasets. The analysis of model performance under zero- and few-shot settings reveals the effectiveness of large language models with minimal supervision, particularly when employing the newly proposed dynamic citation context-based prompting strategy.
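The thesis's actual prompt wording and label set are not given in the abstract; the sketch below is a hypothetical zero-shot prompt that illustrates the general idea of embedding an extracted citation context in an instruction so that an LLM can classify the citation's function without fine-tuning:

```python
# Hypothetical label set, for illustration only.
LABELS = ("background", "uses", "extends", "motivation", "comparison", "future")

def build_zero_shot_prompt(context, labels=LABELS):
    # Embed the (possibly dynamically extracted) citation context in an
    # instruction so a general-purpose LLM can assign it a citation
    # function without any labelled training data.
    return (
        "Classify the function of the citation marked <CITE> in the passage "
        "below. Answer with exactly one label from: " + ", ".join(labels) + ".\n\n"
        "Passage: " + context + "\n\nLabel:"
    )

print(build_zero_shot_prompt(
    "Our approach builds on the transformer architecture <CITE>, "
    "which we extend with a dynamic context window."
))
```

A few-shot variant of the same strategy would simply prepend a handful of labelled (passage, label) examples to this template.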