11 research outputs found

    Certified Data Removal from Machine Learning Models

    Good data stewardship requires removal of data at the request of the data's owner. This raises the question of whether, and how, a trained machine-learning model, which implicitly stores information about its training data, should be affected by such a removal request. Is it possible to "remove" data from a machine-learning model? We study this problem by defining certified removal: a very strong theoretical guarantee that a model from which data is removed cannot be distinguished from a model that never observed the data to begin with. We develop a certified-removal mechanism for linear classifiers and empirically study learning settings in which this mechanism is practical.
    Comment: Accepted to ICML 202
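The paper's mechanism for linear classifiers is built around a Newton-style correction of the trained weights (plus noise for the certification guarantee). As a minimal sketch of just the correction step, consider ridge regression, where the quadratic loss makes a single Newton update on the remaining data *exactly* equal to retraining from scratch; the quadratic-loss setting and variable names here are our simplification, not the paper's full construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_full = ridge_fit(X, y, lam)

# "Remove" the first training point with one Newton step on the
# remaining data -- exact for quadratic losses, approximate otherwise.
X_rem, y_rem = X[1:], y[1:]
grad = X_rem.T @ (X_rem @ w_full - y_rem) + lam * w_full
hess = X_rem.T @ X_rem + lam * np.eye(d)
w_removed = w_full - np.linalg.solve(hess, grad)

w_retrain = ridge_fit(X_rem, y_rem, lam)
print(np.allclose(w_removed, w_retrain))  # the two models coincide
```

For non-quadratic losses such as logistic regression, the same step leaves a residual, which is why the paper's certified guarantee additionally perturbs the training objective with random noise.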

    Adaptive scheduling for adaptive sampling in POS taggers construction

    We introduce adaptive scheduling for adaptive sampling as a novel machine-learning approach to the construction of part-of-speech (POS) taggers. The goal is to speed up training on large data sets without significant loss of performance with respect to an optimal configuration. In contrast to previous methods, which use a random, fixed, or regularly rising spacing between instances, ours analyzes the shape of the learning curve geometrically, in conjunction with a functional model, to increase or decrease the spacing at any time. The algorithm is formally correct with respect to our working hypotheses: given a case, the next one chosen is the nearest that ensures a net gain in learning ability over the former, and the level of requirement for this condition can be modulated. We also improve the robustness of sampling by paying greater attention to those regions of the training database subject to a temporary inflation in performance, thus preventing the learning from stopping prematurely. The proposal has been evaluated on the basis of its reliability in identifying the convergence of models, corroborating our expectations. While a concrete halting condition is used for testing, users can choose any condition to suit their own specific needs.
    Funding: Agencia Estatal de Investigación (Ref. TIN2017-85160-C2-1-R, Ref. TIN2017-85160-C2-2-R); Xunta de Galicia (Ref. ED431C 2018/50, Ref. ED431D 2017/1)
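The idea of growing or shrinking the spacing between evaluated training-set sizes based on the slope of the learning curve can be caricatured as follows; the threshold, growth factors, and function name are our own illustrative choices, not the paper's functional model:

```python
def next_sample_size(sizes, scores, step, min_gain=1e-5,
                     grow=2.0, shrink=0.5, min_step=100, max_step=10000):
    """Hypothetical scheduler: widen the spacing between evaluated
    training-set sizes while the learning curve still climbs, and
    narrow it once the per-instance gain becomes small."""
    if len(scores) < 2:
        return sizes[-1] + int(step), step
    # slope of the learning curve over the last interval
    gain = (scores[-1] - scores[-2]) / (sizes[-1] - sizes[-2])
    if gain > min_gain:
        step = min(step * grow, max_step)    # still learning: jump further
    else:
        step = max(step * shrink, min_step)  # flattening: probe more finely
    return sizes[-1] + int(step), step

sizes, scores = [1000, 2000], [0.80, 0.90]
nxt, step = next_sample_size(sizes, scores, step=1000)
print(nxt)  # the curve is still climbing, so the spacing doubles: 4000
```

The paper additionally fits a functional model of the curve and guards against temporary performance inflation; this sketch only shows the increase/decrease decision itself.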

    Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study

    Machine unlearning is the task of updating machine learning (ML) models after a subset of their training data is deleted. Methods for the task should combine effectiveness and efficiency: they should effectively "unlearn" deleted data, but without requiring excessive computational effort (e.g., a full retraining) for a small number of deletions. Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of "the right to be forgotten" have given rise to requirements for certifiability, i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model. In this paper, we present an experimental study of three state-of-the-art approximate unlearning methods for logistic regression and demonstrate the trade-offs between efficiency, effectiveness, and certifiability offered by each method. In implementing this study, we extend some of the existing works and describe a common unlearning pipeline to compare and evaluate the unlearning methods on six real-world datasets and a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models, and into the performance of each unlearning method in different settings. We also propose a practical online strategy for determining when the accumulated error from approximate unlearning is large enough to warrant a full retraining of the ML model.
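The online retrain-or-not decision described at the end can be pictured as a running error budget; the class below is a toy version of that idea (the paper's actual error estimate and criterion may differ), not its implementation:

```python
class RetrainTrigger:
    """Toy online policy (the paper's actual criterion may differ):
    accumulate a per-deletion error estimate and signal a full retrain
    once the running total exceeds a budget."""

    def __init__(self, budget):
        self.budget = budget
        self.accumulated = 0.0

    def record_deletion(self, error_estimate):
        self.accumulated += error_estimate
        if self.accumulated > self.budget:
            self.accumulated = 0.0  # a full retrain resets the error
            return True             # caller should retrain from scratch
        return False

trigger = RetrainTrigger(budget=1.0)
print([trigger.record_deletion(e) for e in (0.4, 0.4, 0.4)])  # [False, False, True]
```

The budget trades certifiability against efficiency: a tight budget forces frequent full retrains, a loose one lets approximation error accumulate between them.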

    Previsão em tempo real da qualidade dos efluentes de uma ETAR (Real-time prediction of the effluent quality of a WWTP)

    Master's dissertation in Informatics Engineering (abstract also available in Portuguese). An analysis of the development of society, especially over recent decades, shows that an ever-increasing amount of information is generated in all types of organizations. This amount of information results from the constant search for knowledge. The emergence of Data Mining techniques has opened new horizons in this quest for knowledge, helping organizations become more competitive and thrive. Data Mining techniques support numerous activities, from obtaining knowledge that is intrinsic to the data and hard to extract by observation alone, to monitoring and forecasting the various situations that arise in organizational processes. In the context of WWTPs and the improvement of their treatment process, the use of Data Mining proves to be an activity of great interest, with several studies in the literature. Currently, one of the Data Mining techniques that has most attracted the attention of specialists in the area is Support Vector Machines (SVM), owing to their generalization capacity and the results obtained in work done in the domain. In a typical WWTP environment, new values are recorded daily from readings made by sensors that measure the physical-chemical, biological, and microbiological parameters of the wastewater. These sensors are located along the various stages of the treatment process. One of the parameters analyzed, and the prediction target of this project, is the Biochemical Oxygen Demand (BOD), which is very important for the removal of suspended solids in aerobic treatment and for pH control. The constituents of the effluents that arrive daily at a WWTP vary widely in concentration and kind, and the daily arrival of new, highly variable data brings new trends and patterns relating the various wastewater parameters. One disadvantage of SVM techniques is the training time of the prediction models when the data set is extremely large, and in particular when a model must be updated to assimilate new characteristics of the data. To address this problem, several studies have focused on incrementally updating the prediction models, which avoids relearning a new model from scratch and allows the knowledge acquired in previously built models to be reused. This project seeks to show that the prediction models created can bring several improvements to the overall operation of a WWTP, especially to its treatment process and to its monitoring and evaluation, which are important for the conservation of the environment and of public health. The tools used for the various Data Mining tasks were RapidMiner, LIBLINEAR, and TinySVM. Following the adopted methodology, CRISP-DM, the analysis and preparation of the data were fundamental to obtaining forecasts with high accuracy. Evaluation methods were also used to assess and compare the prediction models produced.
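The incremental-update idea the dissertation builds on — folding each new daily sensor reading into the model instead of retraining on the full history — can be sketched with a single stochastic-gradient step on a linear model; the features, the "true" relationship, and the learning rate below are synthetic stand-ins, not the dissertation's actual data or SVM setup:

```python
import numpy as np

def sgd_update(w, x, y, lr=0.01):
    # One stochastic-gradient step for squared error: fold in a single
    # new sensor reading without retraining on the full history.
    return w - lr * ((w @ x) - y) * x

# Hypothetical daily readings: three standardized features -> BOD target.
rng = np.random.default_rng(1)
w_true = np.array([2.0, 0.5, -1.0])  # synthetic "true" relationship
w = np.zeros(3)
for _ in range(5000):
    x = rng.normal(size=3)
    w = sgd_update(w, x, w_true @ x)

print(np.round(w, 2))  # converges close to w_true
```

An incremental SVM replaces this squared-error step with updates that preserve the support-vector solution, but the operational benefit is the same: each day's readings update the model in constant time.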

    Incremental and Decremental Training for Linear Classification

    In classification, if a small number of instances is added or removed, incremental and decremental techniques can be applied to quickly update the model. However, the design of incremental and decremental algorithms involves many considerations. In this thesis, we focus on linear classifiers, including logistic regression (LR) and linear SVM, because of their simplicity over kernel or other methods. By applying a warm-start strategy, we investigate issues such as using the primal or the dual formulation, choosing optimization methods, and creating practical implementations. Through theoretical analysis and practical experiments, we conclude that a warm-start setting on a high-order optimization method for the primal formulation is more suitable than others for incremental and decremental learning of linear classification.
    Contents: I. Introduction; II. SVM, LR, and their incremental and decremental training (2.1 Existing Methods); III. Incremental and decremental learning with warm start (3.1 Initial Values for Incremental Learning; 3.2 Initial Values for Decremental Learning); IV. Optimization Methods and Incremental/Decremental Learning (4.1 Solving Primal Problem by a Trust Region Newton Method; 4.2 Solving Primal Problem by a Coordinate Descent Method; 4.3 Solving Dual Problem by a Coordinate Descent Method); V. Implementation Issues; VI. Experiments (6.1 Analysis on Initial Values; 6.2 Comparison of Optimization Methods for Incremental and Decremental Learning); VII. Conclusions; Appendices; Bibliography
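The thesis's conclusion — warm-starting a high-order primal solver pays off for decremental updates — can be illustrated by comparing cold and warm starts for a plain Newton solver on L2-regularized logistic regression; the data, solver details, and tolerances below are our own toy choices, not the thesis's trust-region implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, w0, lam=0.1, tol=1e-8, max_iter=50):
    """L2-regularized logistic regression by Newton's method.
    Returns the solution and the number of iterations used."""
    w = w0.copy()
    for it in range(max_iter):
        p = sigmoid(X @ w)
        g = X.T @ (p - y) / len(y) + lam * w
        if np.linalg.norm(g) < tol:
            return w, it
        # Hessian: X^T diag(p(1-p)) X / n + lam*I
        H = (X.T * (p * (1 - p))) @ X / len(y) + lam * np.eye(len(w))
        w -= np.linalg.solve(H, g)
    return w, max_iter

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

w_full, _ = newton_logreg(X, y, np.zeros(5))

# Decremental step: delete the last 10 instances, then compare a cold
# start (from zero) with a warm start (from the previous solution).
X2, y2 = X[:-10], y[:-10]
_, iters_cold = newton_logreg(X2, y2, np.zeros(5))
_, iters_warm = newton_logreg(X2, y2, w_full)
print(iters_warm, iters_cold)  # warm start needs fewer Newton iterations
```

Because removing a few instances barely moves the optimum, the warm start lands inside Newton's fast local-convergence region, which is exactly why the thesis favors a high-order primal method for this setting.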