165 research outputs found

k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

Author: Cunningham Padraig
Delany Sarah Jane
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 29/04/2020

Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because poor run-time performance is not such a problem these days, given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods. Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN

arXiv.org e-Print Archive

Arrow@TUDublin
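
The classification step described in the abstract above (find the nearest neighbours of a query, then let them determine its class) can be sketched as follows. This is a minimal illustration, not the code from the paper's Appendix: the function name `knn_classify`, the Euclidean distance, the majority vote and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Return the majority label among the k training examples nearest to query.

    train is a list of (feature_vector, label) pairs. Euclidean distance and
    simple majority voting are assumed here; the paper surveys other options.
    """
    # Rank every training example by its distance to the query and keep the k closest.
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    # Each neighbour casts one vote for its own label.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
near_a = knn_classify(train, (1.1, 1.0), k=3)
near_b = knn_classify(train, (5.0, 5.1), k=3)
```

Sorting the whole training set is the brute-force approach; the retrieval speed-up techniques mentioned in the abstract exist to avoid exactly this cost on large data.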

The Proof Is In The Pudding – Using Perceived Stress To Measure Short-Term Impact in Initiatives to Enhance Gender Balance in Computing Education

Author: Berry Alina
Delany Sarah Jane
Publication venue: Technological University Dublin
Publication date: 01/01/2023

The problem of gender imbalance in computing higher education has forced academics and professionals to implement a wide range of initiatives. Many initiatives use recruitment or retention numbers as their most obvious evidence of impact. This type of evidence is, however, more resource-heavy to obtain and often requires a longitudinal approach. There are many shorter-term initiatives that use other ways to measure their success. First, this poster presents a review of existing evaluation measures in interventions to recruit and retain women in computing education across the board. Three main groups of evaluation come out of this review: statistical data, feedback and instruments. Second, this work reveals what type of evaluation is typically present in what types of initiatives. Finally, it recommends the Perceived Stress Scale instrument, with data collected in a retrospective pre- and then post-survey, as a lightweight evaluation method for short-term impact. This research aims to assist creators of initiatives in demonstrating quick wins of their efforts to enhance gender balance in STEM disciplines.

Arrow@TUDublin

Proceedings of the 18th Irish Conference on Artificial Intelligence and Cognitive Science

Author: Delany Sarah Jane
Madden Michael
Publication venue: Dublin Institute of Technology
Publication date: 29/08/2007

These proceedings contain the papers that were accepted for publication at AICS-2007, the 18th Annual Conference on Artificial Intelligence and Cognitive Science, which was held in the Technological University Dublin, Dublin, Ireland, from the 29th to the 31st of August 2007. AICS is the annual conference of the Artificial Intelligence Association of Ireland (AIAI).

Arrow@TUDublin

Deep Level Lexical Features for Cross-lingual Authorship Attribution

Author: Delany Sarah Jane
Llorens Marisa
Publication venue: Dublin Institute of Technology
Publication date: 01/01/2016

Crosslingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and others based on machine translation deliver state-of-the-art classification accuracy; however, because of their reliance on external resources, poorly resourced languages present a challenge for these types of methods. In this paper, we propose a novel set of language-independent features that capture language use from a document at a deep level, using features that are intrinsic to the document. These features are based on vocabulary richness measurements and are text-length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of crosslingual authorship attribution, outperforming similar methods.

Arrow@TUDublin

k-Nearest Neighbour Classifiers - A Tutorial

Author: Cunningham Padraig
Delany Sarah Jane
Publication venue: Technological University Dublin
Publication date: 01/01/2021

Perhaps the most straightforward classifier in the arsenal of Machine Learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because poor run-time performance is not such a problem these days, given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.

Arrow@TUDublin

Identity Term Sampling for Measuring Gender Bias in Training Data

Author: Delany Sarah Jane
Sobhani Nasim
Publication venue: Technological University Dublin
Publication date: 01/12/2022

Predictions from machine learning models can reflect biases in the data on which they are trained. Gender bias has been identified in natural language processing systems such as those used for recruitment. The development of approaches to mitigate gender bias in training data typically needs to be able to isolate the effect of gender on the output in order to see its impact. While it is possible to isolate and identify gender for some types of training data, e.g. CVs in recruitment, for most textual corpora there is no obvious gender label. This paper proposes a general approach to measure bias in textual training data for NLP prediction systems by providing a gender label identified from the textual content of the training data. The approach is compared with the identity term template approach currently in use, also known as Gender Bias Evaluation Datasets (GBETs), which involves the design of synthetic test datasets that isolate gender and are used to probe for gender bias in a dataset. We show that our Identity Term Sampling (ITS) approach is capable of identifying gender bias at least as well as identity term templates and can be used on training data that has no obvious gender label.

Arrow@TUDublin

Textual Case-based Reasoning for Spam Filtering: a Comparison of Feature-based and Feature-free Approaches

Author: Bridge Derek
Delany Sarah Jane
Publication venue: Dublin Institute of Technology
Publication date: 01/10/2006

Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach

Arrow@TUDublin
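
The abstract above describes computing distances between cases from text compression rather than from extracted features. The sketch below illustrates the general idea with the normalised compression distance and Python's `zlib` module. Both are assumptions chosen for illustration: the abstract does not state which compression-based measure the authors use, and the function names and sample emails are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes needed to store text after zlib compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance between two strings.

    If y shares a lot of material with x, compressing x + y costs little more
    than compressing x alone, so the distance is small.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical example messages: two similar spam emails and one legitimate email.
spam_a = "Win a free prize now! Click here to claim your free prize today and win cash."
spam_b = "Click here to claim your free prize today! Win a free prize now and win cash."
ham = "The meeting agenda for Tuesday covers the quarterly budget and the project timeline."

d_spam_spam = ncd(spam_a, spam_b)
d_spam_ham = ncd(spam_a, ham)
```

No tokenisation or feature selection is needed, which matches the "no set-up costs" advantage noted in the abstract. Every comparison requires running the compressor, which is consistent with the longer classification times reported there.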

TechMate: A Research-Driven Toolkit to Enhance Gender Balance in Computing Education

Author: Berry Alina
Delany Sarah Jane
Publication venue: Technological University Dublin
Publication date: 01/01/2023

This poster presents a toolkit of practical initiatives and guidance on how to enhance gender balance in computing higher education. The suggested initiatives are designed so that they can be adapted for use in a local context, especially in universities in the UK or in Ireland. The initiatives are categorised under four main areas: Policy, Pedagogy, Influence & Support and Promotion & Engagement. Additionally, guidance is given on mechanisms to evaluate the impact of these initiatives. This work will be of interest to champions looking to enhance gender balance in their computing courses.

Arrow@TUDublin

Exploring the Impact of Gender Bias Mitigation Approaches on a Downstream Classification Task

Author: Delany Sarah Jane
Sobhani Nasim
Publication venue: Technological University Dublin
Publication date: 01/10/2022

Natural language models and systems have been shown to reflect gender bias existing in training data. This bias can impact on the downstream task that machine learning models, built on this training data, are to accomplish. A variety of techniques have been proposed to mitigate gender bias in training data. In this paper we compare different gender bias mitigation approaches on a classification task. We consider mitigation techniques that manipulate the training data itself, including data scrubbing, gender swapping and counterfactual data augmentation approaches. We also look at using de-biased word embeddings in the representation of the training data. We evaluate the effectiveness of the different approaches at reducing the gender bias in the training data and consider the impact on task performance. Our results show that the performance of the classification task is not affected adversely by many of the bias mitigation techniques but we show a significant variation in the effectiveness of the different gender bias mitigation techniques.

Arrow@TUDublin

Identifying Gendered Language

Author: Delany Sarah Jane
Soundararajan Shweta
Publication venue: Technological University Dublin
Publication date: 01/01/2023

Gendered language refers to the use of words that indicate the gender of an individual. It can be explicit, where the gender is directly implied by the specific words used (e.g., mother, she, man), or it can be implicit, where societal roles and behaviors convey a person's gender. For example, there are expectations that women display communal traits (e.g., affectionate, caring, gentle) and men display agentic traits (e.g., assertive, competitive, decisive). The presence of gendered language in natural language processing (NLP) systems can reinforce gender stereotypes and bias. Our work introduces an approach to creating gendered language datasets using ChatGPT. These datasets are designed to support data-driven methods for identifying gender stereotypes and mitigating gender bias. The approach focuses on generating implicit gendered language that captures and reflects stereotypical characteristics or traits associated with a specific gender. This is achieved by constructing prompts for ChatGPT that incorporate gender-coded words sourced from gender-coded lexicons. The evaluation of the datasets generated demonstrates good examples of English-language gendered sentences that can be categorized as either contradictory to or consistent with gender stereotypes. Additionally, the generated data exhibits a strong gender bias.

Arrow@TUDublin