
    Inverse Classification for Comparison-based Interpretability in Machine Learning

    In the context of post-hoc interpretability, this paper addresses the task of explaining the prediction of a classifier when no information is available about the classifier itself or the processed data (neither the training nor the test set). It proposes an instance-based approach whose principle is to determine the minimal changes needed to alter a prediction: given a data point whose classification must be explained, the proposed method identifies a close neighbour classified differently, where the closeness definition integrates a sparsity constraint. This principle is implemented using observation generation in the Growing Spheres algorithm. Experimental results on two datasets illustrate the relevance of the proposed approach, which can be used to gain knowledge about the classifier. Comment: preprint
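The layer-by-layer search can be sketched as follows. This is a minimal illustrative sketch of the growing-spheres idea (sample in expanding spherical layers around the instance, return the closest differently classified point), not the authors' implementation: it omits the sparsity-constrained feature-reduction step, and the sampling scheme, parameters, and `classify` callable are all assumptions.

```python
import numpy as np

def growing_spheres(x, classify, n_samples=500, step=0.1, max_radius=10.0, seed=0):
    """Sample candidate points in spherical layers of increasing radius around x
    until one is classified differently; return the closest such point, or None
    if the search exhausts max_radius."""
    rng = np.random.default_rng(seed)
    target = classify(x)
    low = 0.0
    while low < max_radius:
        high = low + step
        # random directions on the unit sphere ...
        d = rng.normal(size=(n_samples, x.size))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        # ... scaled by radii drawn uniformly in [low, high]
        # (a simplification of uniform-in-volume layer sampling)
        r = rng.uniform(low, high, size=(n_samples, 1))
        candidates = x + d * r
        enemies = [c for c in candidates if classify(c) != target]
        if enemies:
            # closest "enemy" found in this layer is the counterfactual
            return min(enemies, key=lambda c: np.linalg.norm(c - x))
        low = high
    return None

# toy usage: explain the point (0, 0) against the boundary x0 = 1
classify = lambda p: int(p[0] > 1.0)
counterfactual = growing_spheres(np.zeros(2), classify)
```

Here `counterfactual` lands just beyond the decision boundary, so its distance from the origin is barely above 1: the minimal change that flips the prediction.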

    Bibliometric Analysis to Scan and Scrape New Datasets: It’s all about that BASS

    Objectives
    The main objective of this poster is to present a pilot project in determining emerging population health themes and identifying key research-enabling datasets ahead of time. At present, large-scale databanks, such as the Secure Anonymised Information Linkage (SAIL) Databank at Swansea University Medical School, already manage large quantities of linked health and administrative datasets. While these datasets are valuable for research purposes, collaborating researchers may require complementary datasets to answer detailed population health research questions. Dataset acquisition can take several years, a serious delay for a project with time-limited funding. The ability to acquire datasets pre-emptively, so that they are ready for use before a researcher requests them, would clearly be beneficial. However, a recent study conducted by the Farr Cipher team at Swansea University identified over 800 health and administrative datasets in Wales alone. With limited resources, such as available funding and time, which of these datasets are worth the effort of acquiring?

    Approach
    Bibliometrics has long been a means of measuring the impact of papers on the wider academic community. Lately, the focus of analyses has been extended to include the topics, authorship and citations of publications. Existing bibliometric data-mining techniques suggest that it is possible to identify emerging topic trends and thereby assist in prioritising dataset identification and acquisition. The project explored mining the available literature through bibliometric analysis in order to predict emerging trends and, through these, identify potentially relevant and valuable datasets for acquisition on behalf of the Dementias Platform UK (DPUK). Literature searches were conducted for papers published on the topic of “dementia” over the last 20 years. Additional keywords and topics were extracted to identify emerging areas of research and clinical interest. These were then compared against an existing list of over 800 Welsh datasets currently not held in SAIL.

    Results
    Results focus on:
    • Using bibliometric methods in the context of DPUK cohort publications
    • Identifying emerging trends in the field of dementia research
    • Identifying and prioritising datasets which might be useful for the SAIL Databank to acquire
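One way the trend-detection step could look in miniature: compare per-paper keyword rates between an early window and a recent window, and flag terms whose rate has grown past a threshold. The keyword records, windows, and growth threshold below are hypothetical, for illustration only; they are not the project's actual method or data.

```python
from collections import Counter

def emerging_terms(papers, recent_years, min_growth=2.0):
    """papers: iterable of (year, keywords) records. Flags keywords whose
    per-paper rate in recent_years is at least min_growth times the earlier
    rate, or which did not appear earlier at all."""
    recent, earlier = Counter(), Counter()
    n_recent = n_earlier = 0
    for year, keywords in papers:
        if year in recent_years:
            recent.update(set(keywords)); n_recent += 1
        else:
            earlier.update(set(keywords)); n_earlier += 1
    flagged = {}
    for term, count in recent.items():
        rate_recent = count / max(n_recent, 1)
        rate_earlier = earlier[term] / max(n_earlier, 1)
        if rate_earlier == 0 or rate_recent / rate_earlier >= min_growth:
            flagged[term] = (rate_earlier, rate_recent)
    return flagged

# hypothetical keyword records extracted from dementia papers
papers = [(2005, ["amyloid"]), (2006, ["amyloid", "imaging"]),
          (2022, ["amyloid", "wearables"]), (2023, ["wearables", "imaging"])]
trends = emerging_terms(papers, recent_years={2022, 2023})
```

A flagged term (here "wearables") would then point at candidate datasets, such as device-derived activity data, worth prioritising for acquisition.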

    Context-Aware Process Performance Indicator Prediction

    It is well known that context impacts running instances of a process. Thus, defining and using contextual information may help to improve the predictive monitoring of business processes, which is one of the main challenges in process mining. However, identifying this contextual information is not an easy task, because it might change depending on the target of the prediction. In this paper, we propose a novel methodology named CAP3 (Context-aware Process Performance indicator Prediction), which involves two phases. The first phase guides process analysts in identifying the context for the predictive monitoring of process performance indicators (PPIs), which are quantifiable metrics that measure progress towards the strategic objectives intended to improve the process. The second phase involves a context-aware predictive monitoring technique that incorporates the relevant context information as input for the prediction. Our methodology leverages context-oriented domain knowledge and experts’ feedback to discover contextual information that improves the quality of PPI prediction, with lower error rates in most cases, by adding this information as features to the datasets used as input to the predictive monitoring process. We experimentally evaluated our approach using two real-life organizations. Process experts from both organizations applied the CAP3 methodology and identified the contextual information to be used for prediction. The model learned using this information achieved lower error rates in most cases than the model learned without contextual information, confirming the benefits of CAP3.
    European Union Horizon 2020 No. 645751 (RISE BPM)
    Ministerio de Ciencia, Innovación y Universidades Horatio RTI2018-101204-B-C21
    Ministerio de Ciencia, Innovación y Universidades OPHELIA RTI2018-101204-B-C2
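The core effect, that adding context as extra feature columns can reduce prediction error, can be demonstrated with a toy least-squares experiment. Everything below is synthetic and hypothetical (a "workload" context variable influencing a cycle-time PPI); it illustrates the general idea, not the CAP3 technique itself.

```python
import numpy as np

def fit_and_score(X, y):
    """Ordinary least squares with an intercept; returns in-sample MAE
    (illustrative only -- a real study would evaluate on held-out cases)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return float(np.mean(np.abs(Xb @ w - y)))

rng = np.random.default_rng(42)
n = 200
activities = rng.uniform(1, 5, n)                # per-case process feature
workload = rng.integers(0, 2, n).astype(float)   # hypothetical context variable
# synthetic PPI: cycle time driven by both the process feature and the context
cycle_time = 2.0 * activities + 6.0 * workload + rng.normal(0.0, 0.5, n)

err_without = fit_and_score(activities.reshape(-1, 1), cycle_time)
err_with = fit_and_score(np.column_stack([activities, workload]), cycle_time)
```

Because the context variable genuinely drives the PPI here, the model that includes it as a feature column achieves a markedly lower error.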

    Detecting Online Hate Speech Using Both Supervised and Weakly-Supervised Approaches

    In the wake of a polarizing election, social media is laden with hateful content. The context accompanying a hate speech text is useful for identifying hate speech, yet it has been largely overlooked in existing datasets and hate speech detection models. We provide an annotated corpus of hate speech in which context information is preserved. We then propose two types of supervised hate speech detection models that incorporate context information: a logistic regression model with context features and a neural network model with learning components for context. Further, to address various limitations of supervised hate speech classification methods, including corpus bias and the high cost of annotation, we propose a weakly supervised two-path bootstrapping approach for online hate speech detection that leverages large-scale unlabeled data. This system significantly outperforms hate speech detection systems trained in a supervised manner on manually annotated data. Applying this model to a large quantity of tweets collected before, after, and on election day reveals motivations and patterns of inflammatory language.
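A minimal stand-in for the first model type, logistic regression over text features concatenated with context features, might look as follows. The feature columns and toy labels are invented for illustration; the paper's actual feature set and training procedure are richer than this sketch.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=5000):
    """Plain gradient-descent logistic regression."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigmoid probabilities
        w -= lr * X.T @ (p - y) / len(y)   # mean log-loss gradient
    return w

# columns: [offensive-word count in the post, hostility score of the
# preceding thread (the "context" feature), bias term] -- all hypothetical
X = np.array([[2.0, 0.9, 1.0], [1.0, 0.8, 1.0], [0.0, 0.1, 1.0],
              [1.0, 0.2, 1.0], [3.0, 0.7, 1.0], [0.0, 0.3, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
w = train_logreg(X, y)

def predict(x):
    return 1.0 / (1.0 + np.exp(-x @ w)) > 0.5
```

Note the two rows with identical text-feature counts but different context scores receive different labels: the context column is what lets the linear model separate them.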

    Detrimental Contexts in Open-Domain Question Answering

    For knowledge-intensive NLP tasks, it has been widely accepted that accessing more information contributes to improvements in the model's end-to-end performance. Counter-intuitively, however, too much context can have a negative impact on the model when evaluated on common question answering (QA) datasets. In this paper, we analyze how passages can have a detrimental effect on the retrieve-then-read architectures used in question answering. Our empirical evidence indicates that the current reader architecture does not fully leverage the retrieved passages, and its performance degrades significantly when using whole passages compared to subsets of them. Our findings demonstrate that model accuracy can be improved by 10% on two popular QA datasets by filtering out detrimental passages. Moreover, these gains are attained with existing retrieval methods, without further training or data. We further highlight the challenges associated with identifying detrimental passages. First, even with the correct context, the model can make an incorrect prediction, which makes it difficult to determine which passages are most influential. Second, evaluation typically relies on lexical matching, which is not robust to variations of correct answers. Despite these limitations, our experimental results underscore the pivotal role of identifying and removing detrimental passages for a context-efficient retrieve-then-read pipeline. Code and data are available at https://github.com/xfactlab/emnlp2023-damaging-retrieval. Comment: Findings of EMNLP 2023
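The filtering idea can be sketched as a greedy leave-one-out pass over the retrieved set: drop any passage whose removal does not lower the reader's answer score. The `answer_score` callable standing in for the reader model is hypothetical, and the paper's actual identification procedure differs; this is only the shape of the intervention.

```python
def filter_detrimental(question, passages, answer_score):
    """Greedy leave-one-out filter: drop any passage whose removal does not
    lower the reader's answer score for the question."""
    kept = list(passages)
    for p in list(kept):                       # iterate over a snapshot
        trial = [q for q in kept if q is not p]
        if answer_score(question, trial) >= answer_score(question, kept):
            kept = trial                       # p was detrimental (or useless)
    return kept

# mock reader score: rewards supporting passages, penalises distractors
def mock_score(question, passages):
    return (sum(1 for p in passages if p.startswith("support"))
            - 2 * sum(1 for p in passages if p.startswith("distract")))

kept = filter_detrimental("who wrote Hamlet?",
                          ["support-1", "distract-1", "support-2"], mock_score)
```

With the mock scorer, the distractor passage is removed while both supporting passages survive, mirroring the accuracy gain the abstract reports from discarding detrimental context.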