Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect
Data Science and Machine Learning have become fundamental assets for
companies and research institutions alike. As one of its fields, supervised
classification allows for class prediction of new samples, learning from given
training data. However, some properties can cause datasets to be problematic to
classify.
In order to evaluate a dataset a priori, data complexity metrics have been
used extensively. They provide information regarding different intrinsic
characteristics of the data, which serve to evaluate classifier compatibility
and a course of action that improves performance. However, most complexity
metrics focus on just one characteristic of the data, which can be insufficient
to properly evaluate a dataset with respect to classifier performance. In fact,
class overlap, a feature that is very detrimental to the classification process
(especially when imbalance among class labels is also present), is hard to
assess.
This research work focuses on revisiting complexity metrics based on data
morphology. In accordance with their nature, the premise is that they provide
both good estimates of class overlap and strong correlations with
classification performance. To that end, a novel family of metrics has been
developed. Being based on ball coverage by classes, they are named Overlap
Number of Balls. Finally, some prospects for adapting this family of metrics
to singular (more complex) problems are discussed.
Comment: 23 pages, 9 figures, preprint
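As a rough illustration of the idea behind ball-coverage metrics, the following is a minimal, hypothetical sketch, not the paper's actual definition of the Overlap Number of Balls family: each point proposes a ball whose radius reaches just short of its nearest enemy (other-class) point, balls are kept greedily until all points are covered, and the fraction of balls needed per point serves as an overlap estimate. The nearest-enemy radius and greedy covering are simplifying assumptions.

```python
from math import dist

def onb_like_score(points, labels):
    """Toy ball-coverage overlap score (illustrative, not the paper's metric).

    Each point gets a candidate ball with radius equal to the distance to its
    nearest enemy point; a ball covers same-class points strictly inside it.
    Balls are kept greedily until every point is covered. Many small balls
    (score near 1.0) indicate heavy class overlap; a few large balls (low
    score) indicate well-separated classes. Assumes no identical points with
    different labels, so every radius is positive.
    """
    n = len(points)
    radii = [min(dist(points[i], points[j])
                 for j in range(n) if labels[j] != labels[i])
             for i in range(n)]
    cover = [{j for j in range(n)
              if labels[j] == labels[i]
              and dist(points[i], points[j]) < radii[i]}
             for i in range(n)]
    uncovered, balls = set(range(n)), 0
    while uncovered:
        best = max(range(n), key=lambda i: len(cover[i] & uncovered))
        uncovered -= cover[best]
        balls += 1
    return balls / n

# Well-separated classes: one ball per class suffices (score 0.5).
separated = onb_like_score([(0, 0), (0, 1), (5, 5), (5, 6)], [0, 0, 1, 1])
# Interleaved classes on a line: one ball per point (score 1.0).
mixed = onb_like_score([(0, 0), (0.1, 0), (0.2, 0), (0.3, 0)], [0, 1, 0, 1])
```

A higher score for the interleaved configuration than for the separated one mirrors the intended behavior of the metric family: overlap is read off from how finely the classes must be tiled by single-class balls.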
The role of classifiers and data complexity in learned Bloom filters: insights and recommendations
Bloom filters, since their introduction over 50 years ago, have become a pillar for handling membership queries in small space, with relevant applications in Big Data Mining and Stream Processing. Further improvements have recently been proposed through the use of Machine Learning techniques: learned Bloom filters. The latter considerably complicate the proper parameter setting of this multi-criteria data structure, in particular with regard to the choice of one of its key components (the classifier) and to accounting for the classification complexity of the input dataset. Given this state of the art, our contributions are as follows. (1) A novel methodology, supported by software, for designing, analyzing and implementing learned Bloom filters that accounts for their multi-criteria nature, in particular concerning classifier type choice and data classification complexity. Extensive experiments show the validity of the proposed methodology and, since our software is public, we offer a valid tool to practitioners interested in using learned Bloom filters. (2) Further contributions to the advancement of the state of the art that are of great practical relevance are the following: (a) the classifier inference time should not be taken as a proxy for the filter reject time; (b) of the many classifiers we have considered, only two offer good performance; this result agrees with and further strengthens early findings in the literature; (c) the Sandwiched Bloom filter, already known as one of the references in this area, is further shown here to have the remarkable property of robustness to both data complexity and classifier performance variability.
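To make the structure concrete, here is a minimal sketch of a learned Bloom filter following the standard design from the literature: a classifier scores each query, scores above a threshold are answered "present" directly, and keys the classifier would miss are stored in a conventional backup Bloom filter, so false negatives are impossible and only false positives remain. The length-based score function is a hypothetical stand-in for a trained classifier, and all parameters are illustrative, not the paper's methodology.

```python
import hashlib

class BloomFilter:
    """Minimal backup Bloom filter using double hashing over a blake2b digest."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0
    def _indexes(self, item):
        h = hashlib.blake2b(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]
    def add(self, item):
        for i in self._indexes(item):
            self.bits |= 1 << i
    def __contains__(self, item):
        return all(self.bits >> i & 1 for i in self._indexes(item))

class LearnedBloomFilter:
    """Learned Bloom filter sketch: `score` stands in for the classifier.

    Keys the classifier scores below the threshold (i.e., would reject) are
    inserted into a backup filter, guaranteeing no false negatives.
    """
    def __init__(self, keys, score, threshold):
        self.score, self.threshold = score, threshold
        self.backup = BloomFilter()
        for key in keys:
            if score(key) < threshold:   # classifier would miss this key
                self.backup.add(key)
    def __contains__(self, item):
        return self.score(item) >= self.threshold or item in self.backup

# Hypothetical key set and "classifier" (string length as a toy score).
keys = ["server-01", "server-02", "db"]
lbf = LearnedBloomFilter(keys, score=lambda s: len(s) / 10, threshold=0.5)
assert all(k in lbf for k in keys)  # no false negatives, by construction
```

The sketch also makes the paper's multi-criteria point visible: the threshold trades classifier false positives against the size of the backup filter, so classifier choice and data complexity jointly determine the overall space/error balance.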
Improving and Securing Machine Learning Systems
Machine Learning (ML) models are systems that automatically learn patterns from data and make predictions on it, without explicit programming from humans. They play an integral role in a wide range of critical applications, from classification systems like facial and iris recognition, to voice interfaces for home assistants, to creating artistic images and guiding self-driving cars.

As ML models are made up of complex numerical operations, they naturally appear to humans as non-transparent boxes. The fundamental architectural difference between ML models and human brains makes it extremely difficult to understand how ML models operate internally. What patterns do ML models learn from data? How do they produce prediction results? How well do they generalize to untested inputs? These questions are among the biggest challenges in computing today. Despite intense work and effort from the community in recent years, we still see very limited progress towards fully understanding ML models.

The non-transparent nature of ML models has severe implications for some of their most important properties, namely performance and security. First, it is hard to understand the impact of ML model design on end-to-end performance. Without an understanding of how ML models operate internally, it is difficult to isolate the performance bottlenecks of ML models and improve on them. Second, it is hard to measure the robustness of ML models. The lack of transparency into the model suggests that the model might not generalize its performance to untested inputs, especially when inputs are adversarially crafted to trigger unexpected behavior. Third, it opens up possibilities of injecting unwanted malicious behaviors into ML models. The lack of tools to "translate" ML models means that humans cannot verify what an ML model has learned and whether those behaviors are benign and required to solve the task. This opens possibilities for an attacker to hide malicious behaviors inside ML models, which would trigger unexpected behaviors on certain inputs. These implications reduce the performance and security of ML, which greatly hinders its wide adoption, especially in security-sensitive areas.

Even though advances in making ML models fully transparent would resolve most of these implications, progress towards this ultimate goal unfortunately remains unsatisfying, and recent work along this direction does not suggest any significant breakthrough in the near future. In the meantime, the issues and implications caused by non-transparency are imminent and threaten all currently deployed ML systems. Given this conflict between imminent threats and unsatisfying progress towards full transparency, we need immediate solutions for some of the most important issues. By identifying and addressing these issues, we can ensure an effective and safe adoption of such opaque systems.

In this dissertation, we cover our effort to improve ML models' performance and security by performing end-to-end measurements and designing auxiliary systems and solutions. More specifically, my dissertation consists of three components, one targeting each of the three aforementioned implications.

First, we focus on performance and seek to understand the impact of ML model design on end-to-end performance. To achieve this goal, we adopt a data-driven approach, measuring ML model performance under different high-level design choices on a large number of real datasets. By comparing different design choices and their performance, we quantify the high-level design tradeoffs between complexity, performance, and performance variability. Apart from that, we can also understand which key components of ML models have the biggest impact on performance, and design generalized techniques to optimize those components.

Second, we try to understand the robustness of ML models against adversarial inputs. In particular, we focus on practical scenarios where ordinary users train ML models under data constraints, and study the most common practice in such scenarios, referred to as transfer learning. We explore new attacks that can efficiently exploit models trained using transfer learning, and propose defenses to patch insecure models.

Third, we study defenses against potential attacks that embed hidden malicious behaviors into ML models. Such hidden behavior, referred to as a "backdoor", does not affect the model's performance on normal inputs, but changes the model's behavior when a specific trigger is present in the input. In this work, we design a series of tools to detect and identify hidden backdoors in Deep Learning models. We then propose defenses that can filter adversarial inputs and mitigate backdoors, rendering them ineffective.

In summary, we provide immediate solutions to improve the utility and the security of Machine Learning models. Even though complete transparency of ML remains out of reach today, and may remain so in the near future, we hope our work strengthens ML models as opaque systems and ensures their effective and secure adoption.
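The backdoor property described above can be illustrated with a toy, hypothetical classifier; this is not one of the dissertation's models or detection tools, just a minimal demonstration of the defining behavior: output is normal on clean inputs, but a planted trigger pattern overrides it with the attacker's target label.

```python
def backdoored_classifier(pixels):
    """Toy stand-in for a backdoored model.

    Normally classifies a flat list of pixel intensities by mean brightness
    (0 = dark, 1 = bright), but a hypothetical planted trigger (first pixel
    exactly 255) overrides the output with the attacker's target label,
    regardless of the rest of the input.
    """
    TRIGGER_TARGET = 7                       # attacker-chosen target class
    if pixels[0] == 255:                     # trigger pattern present
        return TRIGGER_TARGET
    return 0 if sum(pixels) / len(pixels) < 128 else 1

clean = [10, 20, 30, 40]
assert backdoored_classifier(clean) == 0            # normal behavior preserved
assert backdoored_classifier([255] + clean[1:]) == 7  # trigger flips the output
```

The difficulty the dissertation targets is exactly that, in a real Deep Learning model, this conditional is not a readable line of code but a pattern buried in learned weights, which is why dedicated detection tools are needed.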