2,356 research outputs found

    Algorithms that Remember: Model Inversion Attacks and Data Protection Law

    Many individuals are concerned about the governance of machine learning systems and the prevention of algorithmic harms. The EU's recent General Data Protection Regulation (GDPR) has been seen as a core tool for achieving better governance of this area. While the GDPR does apply to the use of models in some limited situations, most of its provisions relate to the governance of personal data, while models have traditionally been seen as intellectual property. We present recent work from the information security literature around 'model inversion' and 'membership inference' attacks, which indicates that the process of turning training data into machine-learned systems is not one-way, and demonstrate how this could lead some models to be legally classified as personal data. Taking this as a probing experiment, we explore the different rights and obligations this would trigger and their utility, and posit future directions for algorithmic governance and regulation. Comment: 15 pages, 1 figure

    Accurate, timely and portable: course-agnostic early prediction of student performance from LMS logs

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Learning management systems are essential intermediaries between students and educational content in the digital era. Among other factors, the institutional adoption of such systems is meant to foster student engagement and lead to better educational outcomes in a scalable manner. However, a significant challenge facing educators and institutions is the timely identification of students who may require special attention and feedback. Early identification of students allows educators to provide necessary feedback and adopt suitable corrective measures. Therefore, a significant body of research has been dedicated to developing early warning systems with clickstream data. However, comprehensive studies that attempt prediction on multiple courses are few and far between. Moreover, most predictive models require sophisticated domain knowledge, data skills and computational power that may not be available in practice. In this work, we used an academic year’s worth of data collected from all courses at a Portuguese information management school to perform two main experiments on two binary classification problems: the first being students at risk vs students not at risk and the second being high-performing students vs not high-performing students. In the first experiment, we compared the performances obtained with traditional machine learning classifiers against majority class classifiers at multiple stages of course completion (more specifically, the 10%, 25%, 33%, 50% and 100% course completion thresholds). 
For both classification problems, performances on all metrics peaked when using all of the data collected throughout the course: 88.6% accuracy and 92.3% Area Under the Receiver Operating Characteristic (AUROC) using Random Forest (RF) for students at risk, and 78.2% accuracy and 79.6% AUROC using ExtraTrees for high-performing students. Concerning early prediction, acceptable performances for classifying at-risk students are achieved as early as the 25% course duration threshold (72.8% AUROC using RF). Performances for high-performing students were generally lower, with AUROC at earlier stages peaking at the courses’ midway point (64.4% AUROC using RF). Our second experiment deployed long short-term memory (LSTM) units trained with a time-dependent representation of a single feature (number of total clicks). While this approach achieved inferior performances, we argue that its more straightforward data pre-processing may represent a worthwhile tradeoff against relatively small losses in model performance, especially at earlier moments of prediction. We found the best tradeoff at 33% course duration (64% AUROC against 74% AUROC using RF) to predict at-risk students. To predict high-performing students, we found the best tradeoff to occur at 25% course duration (56% AUROC against 61% using RF). Results obtained using a different set of logs validate the portability of our approach when it comes to static aggregate models. However, our deep learning approach did not generalize well on this data, which suggests that portability between courses using this approach may only be possible in specific instances.
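
As a rough illustration of why the abstract reports AUROC alongside accuracy when comparing against majority class classifiers, the sketch below computes both for a toy at-risk problem. Everything in it is an assumption for illustration: the labels, the scores, and the helper names `auroc` and `majority_baseline` are hypothetical and are not taken from the dissertation.

```python
from collections import Counter


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def majority_baseline(train_labels, n_test):
    """Score every test item with the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [float(majority)] * n_test


# Hypothetical toy data: 1 = at risk, 0 = not at risk.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0, 1, 0]
model_scores = [0.2, 0.7, 0.1, 0.4, 0.35, 0.3]  # made-up classifier outputs

baseline_scores = majority_baseline(train_labels, len(test_labels))
baseline_accuracy = sum(
    int(s == y) for s, y in zip(baseline_scores, test_labels)
) / len(test_labels)
baseline_auroc = auroc(test_labels, baseline_scores)
model_auroc = auroc(test_labels, model_scores)

print(f"baseline accuracy={baseline_accuracy:.3f} AUROC={baseline_auroc:.3f}")
print(f"model AUROC={model_auroc:.3f}")
```

On imbalanced data the majority baseline can reach a respectable accuracy while its AUROC stays at chance level, which is one reason a ranking metric is useful for this kind of comparison.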

    APIC: A method for automated pattern identification and classification

    Machine Learning (ML) is a transformative technology at the forefront of many modern research endeavours. The technology is generating a tremendous amount of attention from researchers and practitioners, providing new approaches to solving complex classification and regression tasks. While concepts such as Deep Learning have existed for many years, the computational power for realising the utility of these algorithms in real-world applications has only recently become available. This dissertation investigated the efficacy of a novel, general method for deploying ML in a variety of complex tasks, where best feature selection, data-set labelling, model definition and training processes were determined automatically. Models were developed in an iterative fashion, evaluated using both training and validation data sets. The proposed method was evaluated using three distinct case studies, describing complex classification tasks often requiring significant input from human experts. The results achieved demonstrate that the proposed method compares with, and often outperforms, less general, comparable methods designed specifically for each task. Feature selection, data-set annotation, model design and training processes were optimised by the method, where less complex, comparatively accurate classifiers with lower dependency on computational power and human expert intervention were produced. In chapter 4, the proposed method demonstrated improved efficacy over comparable systems, automatically identifying and classifying complex application protocols traversing IP networks. In chapter 5, the proposed method was able to discriminate between normal and anomalous traffic, maintaining accuracy in excess of 99%, while reducing false alarms to a mere 0.08%. Finally, in chapter 6, the proposed method discovered more optimal classifiers than those implemented by comparable methods, with classification scores rivalling those achieved by state-of-the-art systems. 
This research concluded that it is possible to develop a fully automated, general method that exhibits efficacy in a wide variety of complex classification tasks with minimal expert intervention. The method and various artefacts produced in each case study of this dissertation are thus significant contributions to the field of ML.

    Optimisation Method for Training Deep Neural Networks in Classification of Non-functional Requirements

    Non-functional requirements (NFRs) are regarded as critical to a software system's success. The majority of NFR detection and classification solutions have relied on supervised machine learning models. These solutions are hindered by the lack of labelled training data and require a significant amount of time for feature engineering. In this work we explore emerging deep learning techniques to reduce the burden of feature engineering. The goal of this study is to develop an autonomous system that can classify NFRs into multiple classes based on a labelled corpus. In the first section of the thesis, we standardise the NFR ontology and annotations to produce a corpus based on five attributes: usability, reliability, efficiency, maintainability, and portability. In the second section, the design and implementation of four neural networks (an artificial neural network, a convolutional neural network (CNN), a long short-term memory network, and a gated recurrent unit network) are examined for classifying NFRs. These models require a large corpus. To overcome this limitation, we propose a new paradigm for data augmentation. This method uses a sort-and-concatenate strategy to combine two phrases from the same class, resulting in a two-fold increase in data size while keeping the domain vocabulary intact. We compared our method to a baseline (no augmentation) and to an existing approach, Easy Data Augmentation (EDA), with pre-trained word embeddings. All training was performed under two settings: augmentation of the entire data set before the train/validation split, and augmentation of the train set only. Our findings show that, compared to EDA and the baseline, NFR classification improved greatly, and the CNN outperformed the other models when trained using our suggested technique in the first setting. However, we saw only a slight boost in the second experimental setup, with train-set-only augmentation. 
As a result, we conclude that augmentation of the validation set is required in order to achieve acceptable results with our proposed approach. We hope that our ideas will inspire new data augmentation techniques, whether generic or task-specific. Furthermore, it would be useful to apply this strategy to other languages.
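
The abstract does not spell out how phrases are sorted or paired, so the following is only a minimal sketch of one possible reading of a sort-and-concatenate augmentation. It assumes alphabetical sorting within each class and pairing each phrase with its next neighbour (wrapping around). The function names and the example requirements are hypothetical. It also shows the two experimental settings as data-set sizes only.

```python
def sort_and_concatenate(texts):
    """Return one new phrase per input by joining sorted neighbours.

    Assumption: phrases are sorted alphabetically and each is joined with the
    next one (wrapping around), so the class size doubles once the new phrases
    are added to the originals.
    """
    ordered = sorted(texts)
    n = len(ordered)
    if n < 2:
        return []  # nothing to pair with
    return [ordered[i] + " " + ordered[(i + 1) % n] for i in range(n)]


def augment(samples):
    """Augment (text, label) pairs class by class, keeping the originals."""
    by_class = {}
    for text, label in samples:
        by_class.setdefault(label, []).append(text)
    extra = [
        (new_text, label)
        for label, texts in by_class.items()
        for new_text in sort_and_concatenate(texts)
    ]
    return list(samples) + extra


# Hypothetical requirement statements, two classes, interleaved.
samples = [
    ("the system shall respond within two seconds", "efficiency"),
    ("the interface shall be usable without training", "usability"),
    ("pages shall load quickly on mobile networks", "efficiency"),
    ("error messages shall be easy to understand", "usability"),
    ("batch jobs shall finish within one hour", "efficiency"),
    ("menus shall follow a consistent layout", "usability"),
    ("search results shall appear within one second", "efficiency"),
    ("forms shall support keyboard navigation", "usability"),
]

# Setting 1: augment the entire data set, then split (split not shown here).
augmented_all = augment(samples)

# Setting 2: split first, then augment the train set only.
train, validation = samples[:6], samples[6:]
augmented_train = augment(train)

print(len(samples), "->", len(augmented_all))    # whole set doubles
print(len(train), "->", len(augmented_train))    # train set doubles
print(len(validation), "validation samples left untouched")
```

Because new phrases are built only from existing phrases of the same class, no word outside the original vocabulary is introduced, which matches the abstract's remark about keeping the domain vocabulary intact.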

    "May I borrow Your Filter?" Exchanging Filters to Combat Spam in a Community

    Leveraging social networks in computer systems can be effective in dealing with a number of trust and security issues. Spam is one such issue where the "wisdom of crowds" can be harnessed by mining the collective knowledge of ordinary individuals. In this paper, we present a mechanism through which members of a virtual community can exchange information to combat spam. Previous attempts at collaborative spam filtering have concentrated on digest-based indexing techniques to share digests or fingerprints of emails that are known to be spam. We take a different approach and allow users to share their spam filters instead, thus dramatically reducing the amount of traffic generated in the network. The resultant diversity in the filters and cooperation in a community allows it to respond to spam in an autonomic fashion. As a test case for exchanging filters we use the popular SpamAssassin spam filtering software and show that exchanging spam filters provides an alternative method to improve spam filtering performance

    Affective games: a multimodal classification system

    Affective gaming is a relatively new field of research that exploits human emotions to influence gameplay for an enhanced player experience. Changes in a player's psychology are reflected in their behaviour and physiology; hence, recognition of such variation is a core element in affective games. Complementary sources of affect offer more reliable recognition, especially in contexts where one modality is partial or unavailable. As multimodal recognition systems, affect-aware games are subject to the practical difficulties met by traditional trained classifiers. In addition, inherent game-related challenges in terms of data collection and performance arise while attempting to sustain an acceptable level of immersion. Most existing scenarios employ sensors that offer limited freedom of movement, resulting in less realistic experiences. Recent advances now offer technology that allows players to communicate more freely and naturally with the game and, furthermore, to control it without the use of input devices. However, the affective game industry is still in its infancy and needs to catch up with the current life-like level of adaptation provided by graphics and animation.

    Advancing Chronic Respiratory Disease Care with Real-Time Vital Sign Prediction

    Cardiovascular and chronic respiratory diseases, being pervasive in nature, pose formidable challenges to the overall well-being of the global populace. With an alarming annual mortality rate of approximately 19 million individuals across the globe, these diseases have emerged as significant public health concerns warranting immediate attention and comprehensive understanding. The mitigation of this elevated mortality rate can be achieved through the application of cutting-edge technological innovations within the realm of medical science, which possess the capacity to enable the perpetual surveillance of various physiological indicators, including but not limited to blood pressure, cholesterol levels, and blood glucose concentrations. The forward-thinking implications of these pivotal physiological or vital sign parameters not only facilitate prompt intervention from medical professionals and carers, but also empower patients to effectively navigate their health status through the receipt of pertinent periodic notifications and guidance from healthcare practitioners. In this research endeavour, we present a novel framework that leverages the power of machine learning algorithms to forecast and categorise forthcoming values of pertinent physiological indicators in the context of cardiovascular and chronic respiratory ailments. Drawing upon prognostications of prospective values, the envisaged framework possesses the capacity to effectively categorise the health condition of individuals, thereby alerting both caretakers and medical professionals. In the present study, a machine-learning-driven prediction and classification framework has been employed, wherein a genuine dataset comprising vital signs has been utilised. In order to anticipate the forthcoming 1-3 minutes of vital sign values, a series of regression techniques, namely linear regression and polynomial regression of degrees 2, 3, and 4, have been subjected to rigorous examination and evaluation. 
For caregiving, a short 60-second forecast is used to enable the rapid provision of emergency medical aid, and a longer 3-minute forecast of vital signs is used for the same purpose. The patient's overall health is evaluated from the forecast vital sign values using three machine learning classifiers, namely Support Vector Machine (SVM), Decision Tree and Random Forest. The findings of our study indicate that the Decision Tree algorithm achieves high accuracy in categorising a patient's health status from anomalous vital sign values. This approach demonstrates its potential in facilitating prompt and effective medical interventions, thereby enhancing the overall quality of care provided to patients.
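
The forecast-then-classify idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' framework: it covers only the linear (degree 1) regression case, uses made-up heart-rate readings, and replaces the trained SVM, Decision Tree and Random Forest classifiers with a hypothetical fixed-threshold rule. The names `fit_line`, `forecast` and `classify` and the threshold values are invented for the example.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit of y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    if sxx == 0:
        raise ValueError("need at least two distinct time points")
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept


def forecast(ts, ys, horizon_s):
    """Extrapolate the fitted line horizon_s seconds past the last reading."""
    slope, intercept = fit_line(ts, ys)
    return slope * (ts[-1] + horizon_s) + intercept


def classify(heart_rate, low=50.0, high=120.0):
    """Hypothetical threshold rule standing in for a trained classifier."""
    return "normal" if low <= heart_rate <= high else "abnormal"


# Made-up readings: time in seconds, heart rate in beats per minute.
times = [0, 10, 20, 30, 40, 50]
heart_rates = [90.0, 94.0, 98.0, 102.0, 106.0, 110.0]

in_60s = forecast(times, heart_rates, 60)    # short horizon
in_180s = forecast(times, heart_rates, 180)  # long horizon

print(f"60 s forecast: {in_60s:.1f} bpm -> {classify(in_60s)}")
print(f"180 s forecast: {in_180s:.1f} bpm -> {classify(in_180s)}")
```

The current reading (110 bpm) would pass the toy rule, while the 60-second forecast would not. That is the point of classifying predicted rather than observed values. Higher polynomial degrees, as evaluated in the study, would need a general least-squares solver in place of `fit_line`.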

    Tweak: Towards Portable Deep Learning Models for Domain-Agnostic LoRa Device Authentication

    Deep learning based device fingerprinting has emerged as a key method of identifying and authenticating devices solely via their captured RF transmissions. Conventional approaches are not portable to different domains in that if a model is trained on data from one domain, it will not perform well on data from a different but related domain. Examples of such domains include the receiver hardware used for collecting the data, the day/time on which data was captured, and the protocol configuration of devices. This work proposes Tweak, a technique that, using metric learning and a calibration process, enables a model trained with data from one domain to perform well on data from another domain. This process is accomplished with only a small amount of training data from the target domain and without changing the weights of the model, which makes the technique computationally lightweight and thus suitable for resource-limited IoT networks. This work evaluates the effectiveness of Tweak vis-a-vis its ability to identify IoT devices using a testbed of real LoRa-enabled devices under various scenarios. The results of this evaluation show that Tweak is viable and especially useful for networks with limited computational resources and applications with time-sensitive missions