165 research outputs found

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods. Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN
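
    The core idea in the abstract above (find the nearest neighbours of a query, then let them vote) can be sketched in a few lines of Python. This is a minimal illustration and not the code from the paper's Appendix; the function name `knn_classify`, the choice of Euclidean distance, the default `k=3` and the toy data are all assumptions made here.

```python
from collections import Counter
import math


def knn_classify(query, examples, labels, k=3):
    """Return the majority class among the k training examples nearest to query."""
    # Euclidean distance from the query to every training example (assumed metric).
    distances = [math.dist(query, x) for x in examples]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(examples)), key=lambda i: distances[i])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated 2-D clusters.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify((1.1, 1.0), train_X, train_y))
```

    This brute-force version computes the distance to every stored example for each query, which is the run-time cost the abstract refers to.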

    The Proof Is In The Pudding – Using Perceived Stress To Measure Short-Term Impact in Initiatives to Enhance Gender Balance in Computing Education

    The problem of gender imbalance in computing higher education has forced academics and professionals to implement a wide range of initiatives. Many initiatives use recruitment or retention numbers as their most obvious evidence of impact. This type of evidence is, however, resource-heavy to obtain and often requires a longitudinal approach. There are many shorter-term initiatives that use other ways to measure their success. First, this poster presents a review of existing evaluation measures in interventions to recruit and retain women in computing education across the board. Three main groups of evaluation come out of this review: statistical data, feedback and instruments. Second, this work reveals what type of evaluation is typically present in what types of initiatives. Finally, it recommends the Perceived Stress Scale instrument, with data collected in a retrospective pre- and then post-survey, as a lightweight evaluation method for short-term impact. This research aims to assist creators of initiatives in demonstrating quick wins of their efforts to enhance gender balance in STEM disciplines.

    Deep Level Lexical Features for Cross-lingual Authorship Attribution

    Crosslingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and others based on machine translation deliver state-of-the-art classification accuracy; however, because of their reliance on external resources, poorly resourced languages present a challenge for these types of methods. In this paper, we propose a novel set of language independent features that capture language use from a document at a deep level, using features that are intrinsic to the document. These features are based on vocabulary richness measurements and are text length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of crosslingual authorship attribution, outperforming similar methods
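
    The abstract does not say which vocabulary richness measurements are used. As an illustration only, the sketch below computes Yule's K, a classical vocabulary-richness statistic that needs no external resources and is often described as roughly insensitive to text length. The function name `yules_k` and the simple regular-expression tokenisation are assumptions made here, not details taken from the paper.

```python
from collections import Counter
import re


def yules_k(text):
    """Yule's K for a text; lower values indicate a richer vocabulary."""
    # Naive tokenisation into lower-cased word tokens (an assumption).
    tokens = re.findall(r"\w+", text.lower())
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    # freq_of_freq[i] = number of distinct words that occur exactly i times.
    freq_of_freq = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freq.items())
    # K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    return 1e4 * (s2 - n) / (n * n)


print(yules_k("the cat sat on the mat and the dog sat too"))
```

    A feature vector for authorship attribution could combine several such statistics per document; which ones the authors actually combine is not stated in the abstract.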

    Proceedings of the 18th Irish Conference on Artificial Intelligence and Cognitive Science

    These proceedings contain the papers that were accepted for publication at AICS-2007, the 18th Annual Conference on Artificial Intelligence and Cognitive Science, which was held at the Technological University Dublin, Dublin, Ireland, from the 29th to the 31st of August 2007. AICS is the annual conference of the Artificial Intelligence Association of Ireland (AIAI)

    k-Nearest Neighbour Classifiers - A Tutorial

    Perhaps the most straightforward classifier in the arsenal of Machine Learning techniques is the Nearest Neighbour Classifier – classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods

    Identity Term Sampling for Measuring Gender Bias in Training Data

    Predictions from machine learning models can reflect biases in the data on which they are trained. Gender bias has been identified in natural language processing systems such as those used for recruitment. The development of approaches to mitigate gender bias in training data typically needs to be able to isolate the effect of gender on the output to see the impact of gender. While it is possible to isolate and identify gender for some types of training data, e.g. CVs in recruitment, for most textual corpora there is no obvious gender label. This paper proposes a general approach to measure bias in textual training data for NLP prediction systems by providing a gender label identified from the textual content of the training data. The approach is compared with the identity term template approach currently in use, also known as Gender Bias Evaluation Datasets (GBETs), which involves the design of synthetic test datasets which isolate gender and are used to probe for gender bias in a dataset. We show that our Identity Term Sampling (ITS) approach is capable of identifying gender bias at least as well as identity term templates and can be used on training data that has no obvious gender label

    Textual Case-based Reasoning for Spam Filtering: a Comparison of Feature-based and Feature-free Approaches

    Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust over different compression algorithms in that we find that the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach
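
    To make the idea of a feature-free, compression-based distance concrete, the sketch below computes the normalised compression distance (NCD), one well-known measure of this kind. The abstract does not state which formula the authors use, so NCD is an assumption here, as are the function names and the toy emails. Python's `zlib` module implements DEFLATE, the Lempel-Ziv-based algorithm that GZip also uses.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance between two texts (smaller = more similar)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    # If y shares structure with x, compressing the concatenation costs
    # little more than compressing x alone.
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up example emails, for illustration only.
spam = "Win a free prize now! Click here to claim your free prize today. " * 10
ham = "The project meeting is moved to Thursday; please review the attached agenda. " * 10

print(ncd(spam, spam), ncd(spam, ham))
```

    No feature extraction or vocabulary is needed, which matches the "no set-up costs" advantage described above; the price is that each comparison runs the compressor, which is consistent with the longer classification times the abstract reports.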

    TechMate: A Research-Driven Toolkit to Enhance Gender Balance in Computing Education

    This poster presents a toolkit of practical initiatives and guidance on how to enhance gender balance in computing higher education. The suggested initiatives are designed so that they can be adapted for use in a local context, especially in universities in the UK or in Ireland. The initiatives are categorised under four main areas: Policy, Pedagogy, Influence & Support and Promotion & Engagement. Additionally, guidance is given on mechanisms to evaluate the impact of these initiatives. This work will be of interest to champions looking to enhance gender balance in their computing courses.

    Exploring the Impact of Gender Bias Mitigation Approaches on a Downstream Classification Task

    Natural language models and systems have been shown to reflect gender bias existing in training data. This bias can impact on the downstream task that machine learning models, built on this training data, are to accomplish. A variety of techniques have been proposed to mitigate gender bias in training data. In this paper we compare different gender bias mitigation approaches on a classification task. We consider mitigation techniques that manipulate the training data itself, including data scrubbing, gender swapping and counterfactual data augmentation approaches. We also look at using de-biased word embeddings in the representation of the training data. We evaluate the effectiveness of the different approaches at reducing the gender bias in the training data and consider the impact on task performance. Our results show that the performance of the classification task is not affected adversely by many of the bias mitigation techniques but we show a significant variation in the effectiveness of the different gender bias mitigation techniques
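
    The three data-level techniques named above can be illustrated with a toy sketch. The word list, function names and placeholder token below are hypothetical and far smaller than anything a real study would use; the paper's actual procedures are not described in the abstract. Ambiguous words such as "her" (which may map to "him" or "his") are deliberately left out.

```python
import re

# Tiny illustrative word pairs (an assumption, not taken from the paper).
SWAP_PAIRS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}
_PATTERN = re.compile(r"\b(" + "|".join(SWAP_PAIRS) + r")\b", re.IGNORECASE)


def gender_swap(text):
    """Replace each listed gendered word with its counterpart."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP_PAIRS[word.lower()]
        # Preserve an initial capital letter.
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, text)


def scrub(text, placeholder="[X]"):
    """Data scrubbing: replace listed gendered words with a neutral placeholder."""
    return _PATTERN.sub(placeholder, text)


def augment(corpus):
    """Counterfactual data augmentation: keep originals and add swapped copies."""
    return corpus + [gender_swap(t) for t in corpus]


print(gender_swap("She thanked the man."))
print(scrub("She thanked the man."))
```

    In this sketch, scrubbing removes the gender signal, swapping inverts it, and augmentation balances it by training on both versions of each text.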

    Identifying Gendered Language

    Gendered language refers to the use of words that indicate the gender of an individual. It can be explicit, where the gender is directly implied by the specific words used (e.g., mother, she, man), or it can be implicit, where societal roles and behaviors convey a person's gender. For example, women are expected to display communal traits (e.g., affectionate, caring, gentle) and men to display agentic traits (e.g., assertive, competitive, decisive). The presence of gendered language in natural language processing (NLP) systems can reinforce gender stereotypes and bias. Our work introduces an approach to creating gendered language datasets using ChatGPT. These datasets are designed to support data-driven methods for identifying gender stereotypes and mitigating gender bias. The approach focuses on generating implicit gendered language that captures and reflects stereotypical characteristics or traits associated with a specific gender. This is achieved by constructing prompts for ChatGPT that incorporate gender-coded words sourced from gender-coded lexicons. The evaluation of the datasets generated demonstrates good examples of English-language gendered sentences that can be categorized as either contradictory to or consistent with gender stereotypes. Additionally, the generated data exhibits a strong gender bias.