281 research outputs found
Context-Aware Generative Adversarial Privacy
Preserving the utility of published datasets while simultaneously providing
provable privacy guarantees is a well-known challenge. On the one hand,
context-free privacy solutions, such as differential privacy, provide strong
privacy guarantees, but often lead to a significant reduction in utility. On
the other hand, context-aware privacy solutions, such as information theoretic
privacy, achieve an improved privacy-utility tradeoff, but assume that the data
holder has access to dataset statistics. We circumvent these limitations by
introducing a novel context-aware privacy framework called generative
adversarial privacy (GAP). GAP leverages recent advancements in generative
adversarial networks (GANs) to allow the data holder to learn privatization
schemes from the dataset itself. Under GAP, learning the privacy mechanism is
formulated as a constrained minimax game between two players: a privatizer that
sanitizes the dataset in a way that limits the risk of inference attacks on the
individuals' private variables, and an adversary that tries to infer the
private variables from the sanitized dataset. To evaluate GAP's performance, we
investigate two simple (yet canonical) statistical dataset models: (a) the
binary data model, and (b) the binary Gaussian mixture model. For both models,
we derive game-theoretically optimal minimax privacy mechanisms, and show that
the privacy mechanisms learned from data (in a generative adversarial fashion)
match the theoretically optimal ones. This demonstrates that our framework can
be easily applied in practice, even in the absence of dataset statistics.Comment: Improved version of a paper accepted by Entropy Journal, Special
Issue on Information Theory in Machine Learning and Data Scienc
Federated Learning in Big Model Era: Domain-Specific Multimodal Large Models
Multimodal data, which can comprehensively perceive and recognize the
physical world, has become an essential path towards general artificial
intelligence. However, multimodal large models trained on public datasets often
underperform in specific industrial domains. This paper proposes a multimodal
federated learning framework that enables multiple enterprises to utilize
private domain data to collaboratively train large models for vertical domains,
achieving intelligent services across scenarios. The authors discuss in-depth
the strategic transformation of federated learning in terms of intelligence
foundation and objectives in the era of big model, as well as the new
challenges faced in heterogeneous data, model aggregation, performance and cost
trade-off, data privacy, and incentive mechanism. The paper elaborates a case
study of leading enterprises contributing multimodal data and expert knowledge
to city safety operation management , including distributed deployment and
efficient coordination of the federated learning platform, technical
innovations on data quality improvement based on large model capabilities and
efficient joint fine-tuning approaches. Preliminary experiments show that
enterprises can enhance and accumulate intelligent capabilities through
multimodal model federated learning, thereby jointly creating an smart city
model that provides high-quality intelligent services covering energy
infrastructure safety, residential community security, and urban operation
management. The established federated learning cooperation ecosystem is
expected to further aggregate industry, academia, and research resources,
realize large models in multiple vertical domains, and promote the large-scale
industrial application of artificial intelligence and cutting-edge research on
multimodal federated learning
Adversarial Purification for Data-Driven Power System Event Classifiers with Diffusion Models
The global deployment of the phasor measurement units (PMUs) enables
real-time monitoring of the power system, which has stimulated considerable
research into machine learning-based models for event detection and
classification. However, recent studies reveal that machine learning-based
methods are vulnerable to adversarial attacks, which can fool the event
classifiers by adding small perturbations to the raw PMU data. To mitigate the
threats posed by adversarial attacks, research on defense strategies is
urgently needed. This paper proposes an effective adversarial purification
method based on the diffusion model to counter adversarial attacks on the
machine learning-based power system event classifier. The proposed method
includes two steps: injecting noise into the PMU data; and utilizing a
pre-trained neural network to eliminate the added noise while simultaneously
removing perturbations introduced by the adversarial attacks. The proposed
adversarial purification method significantly increases the accuracy of the
event classifier under adversarial attacks while satisfying the requirements of
real-time operations. In addition, the theoretical analysis reveals that the
proposed diffusion model-based adversarial purification method decreases the
distance between the original and compromised PMU data, which reduces the
impacts of adversarial attacks. The empirical results on a large-scale
real-world PMU dataset validate the effectiveness and computational efficiency
of the proposed adversarial purification method
Privacy Preserving Data Publishing
Recent years have witnessed increasing interest among researchers in protecting individual privacy in the big data era, involving social media, genomics, and Internet of Things. Recent studies have revealed numerous privacy threats and privacy protection methodologies, that vary across a broad range of applications. To date, however, there exists no powerful methodologies in addressing challenges from: high-dimension data, high-correlation data and powerful attackers.
In this dissertation, two critical problems will be investigated: the prospects and some challenges for elucidating the attack capabilities of attackers in mining individuals’ private information; and methodologies that can be used to protect against such inference attacks, while guaranteeing significant data utility.
First, this dissertation has proposed a series of works regarding inference attacks laying emphasis on protecting against powerful adversaries with auxiliary information. In the context of genomic data, data dimensions and computation feasibility is highly challenging in conducting data analysis. This dissertation proved that the proposed attack can effectively infer the values of the unknown SNPs and traits in linear complexity, which dramatically improve the computation cost compared with traditional methods with exponential computation cost.
Second, putting differential privacy guarantee into high-dimension and high-correlation data remains a challenging problem, due to high-sensitivity, output scalability and signal-to-noise ratio. Consider there are tens-of-millions of genomes in a human DNA, it is infeasible for traditional methods to introduce noise to sanitize genomic data. This dissertation has proposed a series of works and demonstrated that the proposed differentially private method satisfies differential privacy; moreover, data utility is improved compared with the states of the arts by largely lowering data sensitivity.
Third, putting privacy guarantee into social data publishing remains a challenging problem, due to tradeoff requirements between data privacy and utility. This dissertation has proposed a series of works and demonstrated that the proposed methods can effectively realize privacy-utility tradeoff in data publishing.
Finally, two future research topics are proposed. The first topic is about Privacy Preserving Data Collection and Processing for Internet of Things. The second topic is to study Privacy Preserving Big Data Aggregation. They are motivated by the newly proposed data mining, artificial intelligence and cybersecurity methods
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.Comment: NeurIPS 2023 Datasets and Benchmarks Trac
- …