Confidential Boosting with Random Linear Classifiers for Outsourced User-generated Data
User-generated data is crucial to predictive modeling in many applications.
With a web/mobile/wearable interface, a data owner can continuously record data
generated by distributed users and build various predictive models from the
data to improve their operations, services, and revenue. Due to the large size
and evolving nature of users' data, data owners may rely on public cloud service
providers (Cloud) for storage and computation scalability. Exposing sensitive
user-generated data and advanced analytic models to Cloud raises privacy
concerns. We present a confidential learning framework, SecureBoost, for data
owners that want to learn predictive models from aggregated user-generated data
but offload the storage and computational burden to Cloud without having to
worry about protecting the sensitive data. SecureBoost allows users to submit
encrypted or randomly masked data to designated Cloud directly. Our framework
utilizes random linear classifiers (RLCs) as the base classifiers in the
boosting framework to dramatically simplify the design of the proposed
confidential boosting protocols, yet still preserve the model quality. A
Cryptographic Service Provider (CSP) is used to assist the Cloud's processing,
reducing the complexity of the protocol constructions. We present two
constructions of SecureBoost: HE+GC and SecSh+GC, using combinations of
homomorphic encryption, garbled circuits, and random masking to achieve both
security and efficiency. For a boosted model, Cloud learns only the RLCs and
the CSP learns only the weights of the RLCs. Finally, the data owner collects
the two parts to get the complete model. We conduct extensive experiments to
understand the quality of the RLC-based boosting and the cost distribution of
the constructions. Our results show that SecureBoost can efficiently learn
high-quality boosting models from protected user-generated data.
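As a rough illustration of boosting with random linear classifiers, the sketch below runs an AdaBoost-style loop in which each base classifier is a randomly drawn hyperplane rather than a trained one. It works on plaintext data only and includes none of the encryption, garbled-circuit, or masking steps of the paper's constructions. The function names, the candidate pool size, and the toy data are assumptions made for this example. The classifiers and their weights are returned as two separate lists, loosely mirroring the two parts of the model that the abstract says Cloud and the CSP each hold.

```python
import math
import random


def toy_blobs(n_per_class, seed=0):
    """Hypothetical toy data: two Gaussian blobs with labels +1 and -1."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, -1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label, 0.5), rng.gauss(label, 0.5)])
            y.append(label)
    return X, y


def predict_rlc(rlc, x):
    """Classify x with one random hyperplane (w, b)."""
    w, b = rlc
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def boost_with_rlcs(X, y, rounds=30, pool_size=20, seed=0):
    """AdaBoost-style loop whose base learners are random, untrained hyperplanes."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    sample_w = [1.0 / n] * n
    rlcs, alphas = [], []
    for _ in range(rounds):
        best = None
        for _ in range(pool_size):
            # Draw a candidate hyperplane at random; nothing is fitted.
            rlc = ([rng.gauss(0, 1) for _ in range(dim)], rng.gauss(0, 1))
            preds = [predict_rlc(rlc, x) for x in X]
            err = sum(sw for sw, p, t in zip(sample_w, preds, y) if p != t)
            if best is None or abs(err - 0.5) > abs(best[0] - 0.5):
                best = (err, rlc, preds)
        err, rlc, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        # A negative alpha means the hyperplane is used with its sign flipped.
        alpha = 0.5 * math.log((1 - err) / err)
        rlcs.append(rlc)
        alphas.append(alpha)
        # Re-weight samples so misclassified points count more next round.
        sample_w = [sw * math.exp(-alpha * t * p)
                    for sw, p, t in zip(sample_w, preds, y)]
        total = sum(sample_w)
        sample_w = [sw / total for sw in sample_w]
    return rlcs, alphas


def predict_boosted(rlcs, alphas, x):
    """Combine both model parts: sign of the alpha-weighted vote."""
    vote = sum(a * predict_rlc(r, x) for r, a in zip(rlcs, alphas))
    return 1 if vote >= 0 else -1
```

Calling `boost_with_rlcs(*toy_blobs(100))` and then `predict_boosted` on each training point gives a quick check of how well purely random base classifiers can be combined on easy data.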
Feature selection using mutual information in network intrusion detection system
University of Technology Sydney. Faculty of Engineering and Information Technology.
Network technologies have made significant progress in development, while the security issues alongside these technologies have not been well addressed. Current research on network security mainly focuses on developing preventative measures, such as security policies and secure communication protocols. Meanwhile, attempts have been made to protect computer systems and networks against malicious behaviours by deploying Intrusion Detection Systems (IDSs). The collaboration of IDSs and preventative measures can provide a safe and secure communication environment. Intrusion detection systems are now an essential complement to the security infrastructure of most organisations. However, current IDSs suffer from three significant issues that severely restrict their utility and performance: a large number of false alarms, a very high volume of network traffic, and the classification problem that arises when class labels are not available.
In this thesis, these three issues are addressed and efficient intrusion detection systems are developed which are effective in detecting a wide variety of attacks and result in very few false alarms and low computational cost. The principal contribution is the efficient and effective use of mutual information, which offers a solid theoretical framework for quantifying the amount of information that two random variables share with each other. The goal of this thesis is to develop an IDS that is accurate in detecting attacks and fast enough to make real-time decisions.
First, a nonlinear correlation coefficient-based similarity measure, itself based on mutual information, is used to help extract both linear and nonlinear correlations between network traffic records. The extracted information is used to develop an IDS to detect malicious network behaviours. However, current network traffic data, which consist of a great number of traffic patterns, pose a serious challenge to IDSs. To address this issue, two feature selection methods are proposed, a filter-based algorithm and a hybrid algorithm, and added to our current IDS for supervised classification. These methods select a subset of features from the original feature set, and the selected subset is used to build our IDS and enhance its detection performance.
The filter-based feature selection algorithm, named Flexible Mutual Information Feature Selection (FMIFS), uses the theoretical analyses of mutual information as evaluation criteria to measure the relevance between the input features and the output classes. To eliminate the redundancy among selected features, FMIFS introduces a new criterion to estimate the redundancy of the current selected features with respect to the previously selected subset of features.
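To make the relevance-versus-redundancy idea concrete, here is a minimal sketch of greedy filter-based selection driven by mutual information. It is not FMIFS: the abstract does not give the FMIFS criterion, so the score used here (relevance to the class minus mean mutual information with the features already selected) is a generic stand-in. The function names are hypothetical, and the sketch assumes discrete feature values.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_mi_selection(columns, labels, k):
    """Pick k feature names by relevance minus mean redundancy.

    columns maps a feature name to its list of discrete values.
    The scoring rule is an assumed, generic criterion, not FMIFS itself.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in columns.items()}
    selected = []
    remaining = sorted(columns)  # sorted for deterministic tie-breaking
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(columns[name], columns[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a feature set that contains an exact duplicate of an informative feature, the redundancy term pushes the duplicate down the ranking, so a second, independent informative feature is chosen instead.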
The hybrid feature selection algorithm is a combination of filter and wrapper algorithms. The filter method searches for the best subset of features using mutual information as a measure of relevance between the input features and the output class. The wrapper method is used to further refine the selected subset from the previous phase and select the optimal subset of features that can produce better accuracy.
In addition to the supervised feature selection methods, the research is extended to unsupervised feature selection, and an Extended Laplacian score (EL) method and a Modified Laplacian score (ML) method are proposed, both of which can select features in unsupervised scenarios. More specifically, each of EL and ML consists of two main phases. In the first phase, the Laplacian score algorithm is applied to rank the features by evaluating the power of locality preservation for each feature in the initial data. In the second phase, a new redundancy penalization technique uses mutual information to remove the redundancy among the selected features. The final output of these algorithms is then used to build the detection model.
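The first phase described above relies on the standard Laplacian score, which can be sketched as follows. This covers only the ranking phase. The redundancy penalization of EL and ML is not specified in the abstract and is not reproduced. The neighbourhood size `k`, the heat-kernel width `t`, and the function name are assumptions for this example. A lower score indicates a feature that better preserves local structure.

```python
import math


def laplacian_scores(X, k=2, t=1.0):
    """Laplacian score of each feature column of X (lower is better).

    Builds a symmetric k-nearest-neighbour graph with heat-kernel weights,
    then scores feature r as (f' L f) / (f' D f) on the degree-centred column.
    """
    n, d = len(X), len(X[0])

    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Weighted adjacency matrix S of the kNN graph.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sqdist(X[i], X[j]))
        for j in others[:k]:
            w = math.exp(-sqdist(X[i], X[j]) / t)
            S[i][j] = S[j][i] = w

    degree = [sum(row) for row in S]
    total_degree = sum(degree)

    scores = []
    for r in range(d):
        f = [row[r] for row in X]
        # Remove the degree-weighted mean so constant features score poorly.
        mean = sum(fi * di for fi, di in zip(f, degree)) / total_degree
        fc = [fi - mean for fi in f]
        # f' L f equals half the weighted sum of squared differences over edges.
        num = 0.5 * sum(S[i][j] * (fc[i] - fc[j]) ** 2
                        for i in range(n) for j in range(n))
        den = sum(di * fi * fi for fi, di in zip(fc, degree))
        scores.append(num / den if den > 0 else float("inf"))
    return scores
```

On data with two well-separated clusters, a feature that encodes cluster membership receives a much lower score than one that oscillates between neighbouring points.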
The proposed IDSs are then tested on three publicly available datasets: KDD Cup 99, NSL-KDD and the Kyoto dataset. Experimental results confirm the effectiveness and feasibility of these proposed solutions in terms of detection accuracy, false alarm rate, computational complexity and the capability of utilising unlabelled data. The unsupervised feature selection methods have been further tested on five more well-known datasets from the UCI Machine Learning Repository. These newly added datasets are frequently used in the literature to evaluate the performance of feature selection methods. Furthermore, these datasets have different sample sizes and various numbers of features, so they are considerably more challenging for comprehensively testing feature selection algorithms. The experimental results show that ML performs better than EL and four other state-of-the-art methods (including the Variance score algorithm and the Laplacian score algorithm) in terms of classification accuracy.
NVIDIA FLARE: Federated Learning from Simulation to Real-World
Federated learning (FL) enables building robust and generalizable AI models
by leveraging diverse datasets from multiple collaborators without centralizing
the data. We created NVIDIA FLARE as an open-source software development kit
(SDK) to make it easier for data scientists to use FL in their research and
real-world applications. The SDK includes solutions for state-of-the-art FL
algorithms and federated machine learning approaches, which facilitate building
workflows for distributed learning across enterprises and enable platform
developers to create a secure, privacy-preserving offering for multiparty
collaboration utilizing homomorphic encryption or differential privacy. The SDK
is a lightweight, flexible, and scalable Python package. It allows researchers
to apply their data science workflows in any training libraries (PyTorch,
TensorFlow, XGBoost, or even NumPy) in real-world FL settings. This paper
introduces the key design principles of NVFlare and illustrates some use cases
(e.g., COVID analysis) with customizable FL workflows that implement different
privacy-preserving algorithms.
Code is available at https://github.com/NVIDIA/NVFlare.
Comment: Accepted at the International Workshop on Federated Learning, NeurIPS
2022, New Orleans, USA (https://federated-learning.org/fl-neurips-2022);
Revised version v2: added Key Components list, system metrics for homomorphic
encryption experiment; Extended v3 for journal submission
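The core idea of federated learning in the entry above, training across collaborators without centralizing their data, can be illustrated with a small federated averaging loop. This sketch does not use the NVFlare API. It is plain Python, and the model (one-dimensional linear regression), the function names, and the hyperparameters are all assumptions chosen to keep the example self-contained. Only model parameters leave each client, never the raw data.

```python
def local_update(params, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one client's private (x, y) pairs."""
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_averaging(client_datasets, rounds=100, lr=0.1, epochs=5):
    """Server loop: broadcast parameters, collect updates, average by data size."""
    global_params = (0.0, 0.0)
    total = sum(len(d) for d in client_datasets)
    for _ in range(rounds):
        # Each client trains locally, starting from the current global model.
        updates = [local_update(global_params, d, lr, epochs)
                   for d in client_datasets]
        # Weighted average of client parameters, proportional to sample count.
        global_params = tuple(
            sum(len(d) * u[i] for d, u in zip(client_datasets, updates)) / total
            for i in range(2)
        )
    return global_params
```

For example, two clients holding disjoint samples of the line y = 2x + 1 jointly recover parameters close to (2, 1) without either client seeing the other's points. Privacy mechanisms such as homomorphic encryption or differential privacy would be layered on top of the exchanged updates and are not shown here.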
SoK: Training Machine Learning Models over Multiple Sources with Privacy Preservation
Gathering high-quality training data from multiple data controllers while
preserving privacy is a key challenge in training high-quality machine
learning models. Potential solutions could dramatically break down the barriers
among isolated data corpora and consequently enlarge the range of data
available for processing. To this end, both academic researchers and industrial
vendors have recently been strongly motivated to propose two mainstream families
of solutions: 1) Secure Multi-party Learning (MPL for short); and 2) Federated
Learning (FL for short). These two solutions each have advantages and
limitations when evaluated in terms of privacy preservation, ways of
communication, communication overhead, format of data, the accuracy of trained
models, and application scenarios.
Motivated to demonstrate the research progress and discuss the insights on
the future directions, we thoroughly investigate these protocols and frameworks
of both MPL and FL. First, we define the problem of training machine
learning models over multiple data sources with privacy preservation (TMMPP for
short). Then, we compare the recent studies of TMMPP from the aspects of the
technical routes, parties supported, data partitioning, threat model, and
supported machine learning models, to show the advantages and limitations.
Next, we introduce the state-of-the-art platforms which support online training
over multiple data sources. Finally, we discuss the potential directions to
resolve the problem of TMMPP.
Comment: 17 pages, 4 figures
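Secure multi-party learning protocols are commonly built on primitives such as additive secret sharing, so a minimal sketch of that primitive may help contrast MPL with FL. The abstract does not describe any specific protocol. The modulus, the function names, and the three-party setting below are assumptions for illustration only. Each party holds one share that reveals nothing about the secret on its own, yet sums can be computed share by share.

```python
import secrets

# Assumed modulus for this example: the Mersenne prime 2**61 - 1.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split value into n_parties additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recover the secret by summing all shares modulo PRIME."""
    return sum(shares) % PRIME


def add_shared(shares_a, shares_b):
    """Each party adds its two local shares; no communication is needed."""
    return [(a + b) % PRIME for a, b in zip(shares_a, shares_b)]
```

Sharing two private values among three parties and applying `add_shared` yields shares of their sum, which `reconstruct` recovers only when all shares are combined.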