Deep Serial Number: Computational Watermarking for DNN Intellectual Property Protection
In this paper, we introduce DSN (Deep Serial Number), a new watermarking
approach that can prevent the stolen model from being deployed by unauthorized
parties. Recently, watermarking in DNNs has emerged as a new research direction
for owners to claim ownership of DNN models. However, the verification schemes
of existing watermarking approaches are vulnerable to various watermark
attacks. Different from existing work that embeds identification information
into DNNs, we explore a new DNN Intellectual Property Protection mechanism that
can prevent adversaries from deploying the stolen deep neural networks.
Motivated by the success of serial numbers in protecting conventional software
IP, we present the first attempt to embed a serial number into DNNs.
Specifically, the proposed DSN is implemented in the knowledge distillation
framework, where a private teacher DNN is first trained, then its knowledge is
distilled and transferred to a series of customized student DNNs. During the
distillation process, each customized student DNN is augmented with a unique
serial number, i.e., an encrypted 0/1 bit trigger pattern. A customer's DNN
works properly only when that customer enters a valid serial number. The
embedded serial number can also serve as a strong watermark for ownership
verification.
Experiments on various applications indicate that DSN is effective at
preventing unauthorized deployment without sacrificing the original DNN's
performance. Further experimental analysis shows that DSN is resistant to
different categories of attacks.
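The serial-number gating described above can be illustrated with a minimal sketch. Note that in the paper this behavior is learned through knowledge distillation, not hard-coded; the sizes, names, and the explicit equality check below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 16-bit serial number standing in for the encrypted 0/1
# trigger pattern; sizes and names here are illustrative, not the paper's.
SERIAL = rng.integers(0, 2, size=16)

# A stand-in for the distilled student's weights.
W = rng.standard_normal((10, 32))

def student_forward(x, serial):
    """Toy serial-gated student: meaningful logits only for a valid serial."""
    if not np.array_equal(serial, SERIAL):
        # Invalid serial: degenerate output, as if the model were untrained.
        return np.zeros(10)
    return W @ x

x = rng.standard_normal(32)
valid = student_forward(x, SERIAL)        # informative logits
invalid = student_forward(x, 1 - SERIAL)  # all-zero logits
```

The key property mirrored here is behavioral: the same network is useful with a valid serial number and useless without one, which is what makes the serial number double as an ownership watermark.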
Human-Readable Fingerprint for Large Language Models
Protecting the copyright of large language models (LLMs) has become crucial
due to their resource-intensive training and accompanying carefully designed
licenses. However, identifying the original base model of an LLM is challenging
due to potential parameter alterations. In this study, we introduce a
human-readable fingerprint for LLMs that uniquely identifies the base model
without exposing model parameters or interfering with training. We first
observe that the vector direction of LLM parameters remains stable after the
model has converged during pretraining, showing negligible perturbations
through subsequent training steps, including continued pretraining, supervised
fine-tuning (SFT), and RLHF, which makes it a sufficient condition to identify
the base model. Necessity is validated by continuing to train an LLM with an
extra loss term that drives the parameter direction away, which damages the
model. However, this direction is vulnerable to simple attacks such as
dimension permutation or matrix rotation, which significantly change it without
affecting performance. To address this, leveraging the Transformer structure,
we systematically analyze potential attacks and define three invariant terms
that identify an LLM's base model. We make these invariant terms human-readable
by mapping them to a Gaussian vector using a convolutional encoder and then
converting it into a natural image with StyleGAN2. Our method generates a dog
image as an identity fingerprint for an LLM, where the dog's appearance
strongly indicates the LLM's base model. The fingerprint provides intuitive
information for qualitative discrimination, while the invariant terms can be
employed for quantitative and precise verification. Experimental results across
various LLMs demonstrate the effectiveness of our method.
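The two observations underpinning the fingerprint, that fine-tuning barely moves the parameter direction while a permutation attack moves it drastically yet preserves certain invariants, can be sketched numerically. Singular values serve here only as one generic example of an orthogonal-transform invariant; the paper's three invariant terms are specific to the Transformer structure and are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def cos_sim(a, b):
    """Cosine similarity between two flattened parameter tensors."""
    return float(a.ravel() @ b.ravel() /
                 (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "weight matrix" standing in for one Transformer parameter block.
W = rng.standard_normal((8, 8))

# Fine-tuning perturbs parameters only slightly, so the direction
# (cosine similarity) of the flattened weights stays near 1.
W_ft = W + 0.01 * rng.standard_normal((8, 8))

# Attack: permuting rows changes the direction drastically...
P = np.eye(8)[rng.permutation(8)]
W_perm = P @ W

# ...but leaves the singular values unchanged, since P is orthogonal.
sv_match = np.allclose(np.linalg.svd(W, compute_uv=False),
                       np.linalg.svd(W_perm, compute_uv=False))
```

This is why raw parameter direction alone cannot serve as a fingerprint, and why the method instead extracts attack-invariant quantities before mapping them to an image.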
Data Protection in Big Data Analysis
"Big data" applications are collecting data from various aspects of our lives more and more every day. This fast transition has surpassed the development pace of data protection techniques and has resulted in innumerable data breaches and privacy violations. To prevent that, it is important to ensure the data is protected while at rest, in transit, in use, as well as during computation or dispersal. We investigate data protection issues in big data analysis in this thesis. We address a security or privacy concern in each phase of the data science pipeline. These phases are: i) data cleaning and preparation, ii) data management, iii) data modelling and analysis, and iv) data dissemination and visualization. In each of our contributions, we either address an existing problem and propose a resolving design (Chapters 2 and 4), or evaluate a current solution for a problem and analyze whether it meets the expected security/privacy goal (Chapters 3 and 5).
Starting with privacy in the data preparation phase, we investigate private query analysis using differential privacy techniques. We consider contextual outlier analysis and identify challenging queries that require releasing direct information about members of the dataset. We define a new sampling mechanism that releases this information in a differentially private manner. Our second contribution is in the data modelling and analysis phase. We investigate the effect of data properties and application requirements on the successful implementation of privacy techniques; in particular, we study the effect of data correlation on the protection guarantees of differential privacy. Our third contribution is in the data management phase. The problem is to efficiently protect data outsourced to a database management system (DBMS) provider while still allowing join operations. We provide an encryption method that minimizes leakage and efficiently guarantees data confidentiality. Our last contribution is in the data dissemination phase. We examine ownership/contract protection for prediction models trained on the data, and evaluate backdoor-based watermarking in deep neural networks, an important and recent line of work in model ownership/contract protection.
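As background for the differentially private query analysis described above, the standard Laplace mechanism shows the basic release pattern: perturb a low-sensitivity statistic with calibrated noise. This is a textbook sketch, not the thesis's contextual-outlier sampling mechanism, and the outlier predicate below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, predicate, epsilon):
    """Release a count query under epsilon-differential privacy.

    A count has sensitivity 1 (adding or removing one record changes it
    by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

data = rng.normal(size=1000)
# Toy "outlier" query: count values more than 2 standard deviations out.
noisy = laplace_count(data, lambda x: abs(x) > 2.0, epsilon=1.0)
```

The harder problem the thesis targets is that contextual outlier queries must release information *about specific members* of the dataset, which a simple noisy count cannot do; that gap motivates the new sampling mechanism.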