Explainability for Large Language Models: A Survey
Large language models (LLMs) have demonstrated impressive capabilities in
natural language processing. However, their internal mechanisms are still
unclear, and this lack of transparency poses unwanted risks for downstream
applications. Therefore, understanding and explaining these models is crucial
for elucidating their behaviors, limitations, and social impacts. In this
paper, we introduce a taxonomy of explainability techniques and provide a
structured overview of methods for explaining Transformer-based language
models. We categorize techniques based on the training paradigms of LLMs:
traditional fine-tuning-based paradigm and prompting-based paradigm. For each
paradigm, we summarize the goals and dominant approaches for generating local
explanations of individual predictions and global explanations of overall model
knowledge. We also discuss metrics for evaluating generated explanations and
how explanations can be leveraged to debug models and improve performance.
Lastly, we examine key challenges and emerging opportunities for explanation
techniques in the era of LLMs in comparison to conventional machine learning
models.
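To make the notion of a local explanation concrete, here is a minimal sketch of one classic technique of this kind, occlusion-based feature attribution: each input token is removed in turn, and the drop in the model's output score is taken as that token's importance. The score function is a toy stand-in, not a model or method from the survey.

def score(tokens):
    # Toy stand-in for a model's scalar output, e.g. the probability of
    # the predicted class; any real LLM scorer would replace this.
    positive = {"impressive", "crucial", "good"}
    return sum(t in positive for t in tokens) / max(len(tokens), 1)

def occlusion_attributions(tokens):
    # Importance of each token = score drop when that token is removed.
    base = score(tokens)
    return {t: base - score(tokens[:i] + tokens[i + 1:])
            for i, t in enumerate(tokens)}

print(occlusion_attributions("the model shows impressive results".split()))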
SWAT: A System for Detecting Salient Wikipedia Entities in Texts
We study the problem of entity salience by proposing the design and
implementation of SWAT, a system that identifies the salient Wikipedia entities
occurring in an input document. SWAT consists of several modules that detect
Wikipedia entities on the fly and classify them as salient or not, based on a
large number of syntactic, semantic and latent features extracted via a
supervised process trained on millions of examples drawn from the New York
Times corpus. Validation is performed through a large experimental assessment,
showing that SWAT improves on known solutions across all publicly available
datasets. We release SWAT via an API, which we describe in the paper in order
to ease its use in other software.
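As a rough illustration of the supervised setup the abstract describes (a binary classifier over per-entity features), consider the sketch below. The feature names and the logistic-regression learner are assumptions made for the example; SWAT's actual features and training process are detailed in the paper.

from sklearn.linear_model import LogisticRegression

# Hypothetical per-entity features: [position of first mention (0-1),
# mention count, appears-in-title flag]; labels: 1 = salient, 0 = not.
X = [[0.05, 7, 1],   # mentioned early, often, and in the title
     [0.80, 1, 0],   # a single late mention
     [0.10, 4, 1],
     [0.95, 2, 0]]
y = [1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.02, 5, 1]]))  # likely [1]: salient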
All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch
Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts
and a crowd, we implement different types of text characteristics, ranging from easy-to-compute superficial features to features requiring deep linguistic processing, resulting in ten
different feature groups. Both a regression and a classification setup are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations by optimizing feature sets with a wrapper-based genetic algorithm is promising and provides considerable insight into which feature combinations contribute to overall readability prediction. Since we also have gold-standard information available for the features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we observe that the performance of our fully automatic readability prediction pipeline is on par with the pipeline using gold-standard deep syntactic and semantic information.
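The wrapper-based genetic algorithm mentioned above can be sketched compactly: candidate feature subsets are encoded as boolean masks, scored by the cross-validated performance of a model trained on the selected features, and evolved through selection, crossover, and mutation. The synthetic data and ridge regressor below are placeholders for the article's ten feature groups and readability models.

import random
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for ten feature groups
y = X[:, 0] * 2 + X[:, 3] - X[:, 7] + rng.normal(scale=0.1, size=200)

def fitness(mask):
    # Cross-validated R^2 of a regressor trained on the selected features.
    if not any(mask):
        return -1.0
    cols = [i for i, keep in enumerate(mask) if keep]
    return cross_val_score(Ridge(), X[:, cols], y, cv=5).mean()

def evolve(pop_size=20, generations=15, n=10):
    pop = [[random.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(n)            # point mutation
            child[i] = not child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print("selected feature groups:", [i for i, k in enumerate(best) if k])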
State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural
language processing (NLP). Yet, what 'good generalisation' entails and how it
should be evaluated is not well understood, nor are there any common standards
to evaluate it. In this paper, we aim to lay the groundwork to address both of
these issues. We present a taxonomy for characterising and understanding
generalisation research in NLP, we use that taxonomy to present a comprehensive
map of published generalisation studies, and we make recommendations for which
areas might deserve attention in the future. Our taxonomy is based on an
extensive literature review of generalisation research, and contains five axes
along which studies can differ: their main motivation, the type of
generalisation they aim to solve, the type of data shift they consider, the
source by which this data shift is obtained, and the locus of the shift within
the modelling pipeline. We use our taxonomy to classify over 400 previous
papers that test generalisation, for a total of more than 600 individual
experiments. Considering the results of this review, we present an in-depth
analysis of the current state of generalisation research in NLP, and make
recommendations for the future. Along with this paper, we release a webpage
where the results of our review can be dynamically explored, and which we
intend to update as new NLP generalisation studies are published. With this
work, we aim to take steps towards making state-of-the-art generalisation
testing the new status quo in NLP.
Comment: 35 pages of content + 53 pages of references
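One way to picture the five-axis taxonomy is as a simple record with one field per axis, used to classify each study. The sketch below is illustrative; the example values paraphrase the kinds of options the axes cover rather than quoting verbatim categories from the review.

from dataclasses import dataclass

@dataclass
class GeneralisationStudy:
    motivation: str            # e.g. practical, cognitive, intrinsic
    generalisation_type: str   # e.g. compositional, cross-domain, robustness
    shift_type: str            # e.g. covariate shift, label shift
    shift_source: str          # e.g. naturally occurring, fully generated
    shift_locus: str           # e.g. between training and testing

study = GeneralisationStudy(
    motivation="practical",
    generalisation_type="cross-domain",
    shift_type="covariate shift",
    shift_source="naturally occurring",
    shift_locus="between training and testing",
)
print(study)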
The educational research-practice interface revisited
The question of how the realms of research and practice might successfully relate to one another is a persisting one, and especially so in education. The article takes a fresh look at this issue by using the terminology of collaboration scripts to reflect upon various forms of this relationship. Under this perspective, several approaches towards bridging the research/practice gap are described with regard to the type and closeness of interaction between the two realms. As different focuses and blind spots become discernible, the question is raised as to which 'script' might be appropriate depending upon the starting conditions of research interacting with practice.
Biomedical Event Extraction with Machine Learning
Biomedical natural language processing (BioNLP) is a subfield of natural
language processing, an area of computational linguistics concerned
with developing programs that work with natural language: written texts and
speech. Biomedical relation extraction concerns the detection of
semantic relations such as protein-protein interactions (PPI) from scientific
texts. The aim is to enhance information retrieval by detecting relations
between concepts, not just individual concepts as with a keyword search.
In recent years, events have been proposed as a more detailed alternative to
simple pairwise PPI relations. Events provide a systematic, structural
representation for annotating the content of natural language texts. Events are
characterized by annotated trigger words, directed and typed arguments and the
ability to nest other events. For example, the sentence ``Protein A causes
protein B to bind protein C'' can be annotated with the nested event structure
CAUSE(A, BIND(B, C)). Converted to such formal representations, the
information of natural language texts can be used by computational
applications. Biomedical event annotations were introduced by the BioInfer and
GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task
on Event Extraction.
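The nested event representation can be captured with a small recursive data structure; the sketch below is purely illustrative and not TEES's internal format.

from dataclasses import dataclass, field

@dataclass
class Event:
    type: str       # e.g. CAUSE, BIND
    trigger: str    # the annotated trigger word
    args: list = field(default_factory=list)  # entities or nested Events

# "Protein A causes protein B to bind protein C" -> CAUSE(A, BIND(B, C))
bind = Event("BIND", trigger="bind", args=["Protein B", "Protein C"])
cause = Event("CAUSE", trigger="causes", args=["Protein A", bind])
print(cause)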
In this thesis we present a method for automated event extraction, implemented
as the Turku Event Extraction System (TEES). A unified graph format is defined
for representing event annotations and the problem of extracting complex event
structures is decomposed into a number of independent classification tasks.
These classification tasks are solved using SVM and RLS classifiers, utilizing
rich feature representations built from full dependency parsing. Building on
earlier work on pairwise relation extraction and using a generalized graph
representation, the resulting TEES system is capable of detecting binary
relations as well as complex event structures.
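Schematically, the decomposition described above might look like the following sketch: trigger detection and argument-edge detection are trained and applied as independent classification tasks, whose outputs are later assembled into a graph and unravelled into events. The toy features and data are placeholders; TEES uses SVM and RLS classifiers with rich features from full dependency parses.

from sklearn.svm import LinearSVC

# Step 1: trigger detection -- each token is classified independently.
# Toy feature vectors and labels; real features come from dependency parses.
X_trig = [[1, 0], [0, 1], [1, 0], [0, 0], [0, 1], [1, 0]]
y_trig = ["NONE", "CAUSE", "NONE", "NONE", "BIND", "NONE"]
trigger_clf = LinearSVC().fit(X_trig, y_trig)

# Step 2: argument-edge detection -- each candidate (trigger, argument)
# pair is classified independently as an argument type or no edge.
X_edge = [[1, 2], [1, 1], [2, 3], [4, 1]]
y_edge = ["Theme", "Cause", "Theme", "NONE"]
edge_clf = LinearSVC().fit(X_edge, y_edge)

# Predicted triggers and edges form a graph that is finally unravelled
# into nested event structures such as CAUSE(A, BIND(B, C)).
print(trigger_clf.predict([[0, 1]]), edge_clf.predict([[1, 2]]))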
We show that this event extraction system performs well, achieving first
place in the BioNLP'09 Shared Task on Event Extraction. Subsequently, TEES
has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared Tasks,
and has shown competitive performance in the binary-relation Drug-Drug
Interaction Extraction 2011 and 2013 shared tasks.
The Turku Event Extraction System is published as a freely available open-source
project, documenting the research in detail as well as making the method
available for practical applications. In particular, in this thesis we
describe the application of the event extraction method to PubMed-scale text
mining, showing that the developed approach not only performs well but is also
generalizable and applicable to large-scale real-world text mining projects.
Finally, we discuss related literature, summarize the contributions of the work
and present some thoughts on future directions for biomedical event extraction.
This thesis includes and builds on six original research publications. The first
of these introduces the analysis of dependency parses that leads to the
development of TEES. The entries in the three BioNLP Shared Tasks, as well as
in the DDIExtraction 2011 task are covered in four publications, and the sixth
one demonstrates the application of the system to PubMed-scale text mining.
Data discovery through profile-based similarity metrics
The most essential step in a data integration process is to find the datasets whose combined information provides relevant insights. This task, defined as data discovery, is highly dependent on the definition of the similarity between the candidate attributes to join, which commonly involves assessing the closeness of the semantic concepts that the two attributes represent. Most state-of-the-art approaches to this issue rely on syntactic methodologies, that is, procedures in which the instances of the two columns are compared to determine whether they are similar. These approaches suffice when the two sets of instances share the same syntactic representation but fail to detect cases in which the same semantic idea is represented by different sets of values. Such cases are increasingly common, given the characteristics of big-data environments and the lack of data standardization. The aim of this project is to develop a system that solves this issue and facilitates the establishment of relationships between related data that do not share a syntactic relationship. The approach presented in this work combines the extensively studied syntactic methodologies for data discovery with a new formulation for semantic similarity: the resemblance of probability distributions. Additionally, the system is designed to be scalable and able to handle vast quantities of data.
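As a rough sketch of the core idea (comparing attributes by the resemblance of their value distributions rather than their raw instances), the example below profiles each column by the distribution of its value lengths and compares profiles with Jensen-Shannon divergence. Both the profile feature and the divergence choice are assumptions for illustration, not the system's actual design.

from collections import Counter
from math import log2

def profile(column, bins=range(1, 21)):
    # Simple column profile: the distribution of value lengths.
    counts = Counter(min(len(str(v)), 20) for v in column)
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in bins]

def js_divergence(p, q):
    # Jensen-Shannon divergence between two discrete distributions.
    def kl(x, y):
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

emails_a = ["ann@example.org", "bob@mail.com", "carol@uni.edu"]
emails_b = ["dave@corp.net", "erin@example.org", "frank@mail.com"]
countries = ["Spain", "Italy", "France"]
print(js_divergence(profile(emails_a), profile(emails_b)))   # smaller: alike
print(js_divergence(profile(emails_a), profile(countries)))  # 1.0: disjoint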
Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations
Identifying academic plagiarism is a pressing task for educational and
research institutions, publishers, and funding agencies. Current plagiarism
detection systems reliably find instances of copied and moderately reworded
text. However, reliably detecting concealed plagiarism, such as strong
paraphrases, translations, and the reuse of nontextual content and ideas, is an
open research problem. In this paper, we extend our prior research on analyzing
mathematical content and academic citations. Both are promising approaches for
improving the detection of concealed academic plagiarism primarily in Science,
Technology, Engineering and Mathematics (STEM). We make the following
contributions: i) We present a two-stage detection process that combines
similarity assessments of mathematical content, academic citations, and text.
ii) We introduce new similarity measures that consider the order of
mathematical features and outperform the measures in our prior research. iii)
We compare the effectiveness of the math-based, citation-based, and text-based
detection approaches using confirmed cases of academic plagiarism. iv) We
demonstrate that the combined analysis of math-based and citation-based content
features allows identifying potentially suspicious cases in a collection of
102K STEM documents. Overall, we show that analyzing the similarity of
mathematical content and academic citations is a valuable complement to
conventional text-based detection approaches for academic literature in the
STEM disciplines.
Comment: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries
(JCDL) 2019. The data and code of our study are openly available at
https://purl.org/hybridP
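The paper's new similarity measures consider the order of mathematical features. As one illustration of an order-sensitive comparison (not necessarily the measure the paper defines), the sketch below scores two documents by the longest common subsequence of their extracted math-identifier sequences.

def lcs_len(a, b):
    # Length of the longest common subsequence of sequences a and b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def order_aware_similarity(a, b):
    return lcs_len(a, b) / max(len(a), len(b), 1)

# Identifier sequences extracted from two documents' formulas:
doc1 = ["E", "m", "c", "sigma", "mu", "n"]
doc2 = ["E", "m", "c", "mu", "sigma", "n"]
print(order_aware_similarity(doc1, doc2))  # 5/6: same identifiers, one swap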