Deep Serial Number: Computational Watermarking for DNN Intellectual Property Protection
In this paper, we introduce DSN (Deep Serial Number), a new watermarking
approach that prevents stolen models from being deployed by unauthorized
parties. Recently, watermarking in DNNs has emerged as a new research direction
for owners to claim ownership of DNN models. However, the verification schemes
of existing watermarking approaches are vulnerable to various watermark
attacks. Different from existing work that embeds identification information
into DNNs, we explore a new DNN Intellectual Property Protection mechanism that
can prevent adversaries from deploying the stolen deep neural networks.
Motivated by the success of serial numbers in protecting conventional software
IP, we present the first attempt to embed a serial number into DNNs.
Specifically, the proposed DSN is implemented in the knowledge distillation
framework, where a private teacher DNN is first trained, then its knowledge is
distilled and transferred to a series of customized student DNNs. During the
distillation process, each customer DNN is augmented with a unique serial
number, i.e., an encrypted 0/1-bit trigger pattern. A customer DNN works
properly only when a valid serial number is entered. The embedded
serial number could be used as a strong watermark for ownership verification.
Experiments on various applications indicate that DSN is effective in
preventing unauthorized deployment without sacrificing the original DNN's
performance. Further experimental analysis shows that DSN is resistant to
different categories of attacks.
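The abstract names the mechanism (distillation plus an encrypted 0/1 trigger) without spelling out the details, so the following is a minimal, hypothetical PyTorch sketch of serial-number-conditioned distillation; the network sizes, the gating design, and the scramble loss are illustrative assumptions, not the paper's construction.

```python
# Hypothetical sketch of serial-number-conditioned distillation (DSN-style).
# Shapes, gating, and losses are assumptions; the paper's design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SerialStudent(nn.Module):
    """Student whose features are gated by an embedded 0/1 serial number."""
    def __init__(self, in_dim=784, num_classes=10, serial_bits=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.serial_proj = nn.Linear(serial_bits, 256)  # serial embedding
        self.head = nn.Linear(256, num_classes)

    def forward(self, x, serial):
        gate = torch.sigmoid(self.serial_proj(serial))  # (batch, 256)
        return self.head(self.backbone(x) * gate)

def dsn_step(student, teacher, x, valid_serial, opt, T=4.0):
    """One distillation step: match the teacher under the valid serial,
    push outputs toward uniform (useless) under a random invalid serial."""
    opt.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x)
    s_ok = student(x, valid_serial.expand(x.size(0), -1))
    kd = F.kl_div(F.log_softmax(s_ok / T, dim=1),
                  F.softmax(t_logits / T, dim=1), reduction="batchmean")
    bad = torch.randint(0, 2, valid_serial.shape).float()
    s_bad = student(x, bad.expand(x.size(0), -1))
    uniform = torch.full_like(s_bad, 1.0 / s_bad.size(1))
    scramble = F.kl_div(F.log_softmax(s_bad, dim=1), uniform,
                        reduction="batchmean")
    loss = kd + scramble
    loss.backward()
    opt.step()
    return loss.item()

# Usage (illustrative): valid_serial = torch.randint(0, 2, (1, 32)).float()
```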
The Science of Detecting LLM-Generated Texts
The emergence of large language models (LLMs) has resulted in the production
of LLM-generated texts that are highly sophisticated and almost
indistinguishable from those written by humans. However, this has also sparked
concerns about the potential misuse of such texts, such as spreading
misinformation and causing disruptions in the education system. Although many
detection approaches have been proposed, a comprehensive understanding of the
achievements and challenges is still lacking. This survey aims to provide an
overview of existing LLM-generated text detection techniques and enhance the
control and regulation of language generation models. Furthermore, we emphasize
crucial considerations for future research, including the development of
comprehensive evaluation metrics and the threat posed by open-source LLMs, to
drive progress in the area of LLM-generated text detection.
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Recent advancements in large language models (LLMs) have led to the
development of highly potent models like OpenAI's ChatGPT. These models have
exhibited exceptional performance in a variety of tasks, such as question
answering, essay composition, and code generation. However, their effectiveness
in the healthcare sector remains uncertain. In this study, we seek to
investigate the potential of ChatGPT to aid in clinical text mining by
examining its ability to extract structured information from unstructured
healthcare texts, with a focus on biological named entity recognition and
relation extraction. However, our preliminary results indicate that employing
ChatGPT directly for these tasks results in poor performance and raises
privacy concerns associated with uploading patients' information to the ChatGPT
API. To overcome these limitations, we propose a new training paradigm that
involves generating a vast quantity of high-quality synthetic data with labels
utilizing ChatGPT and fine-tuning a local model for the downstream task. Our
method has resulted in significant improvements in the performance of
downstream tasks, improving the F1-score from 23.37% to 63.99% for the named
entity recognition task and from 75.86% to 83.59% for the relation extraction
task. Furthermore, generating data using ChatGPT can significantly reduce the
time and effort required for data collection and labeling, as well as mitigate
data privacy concerns. In summary, the proposed framework presents a promising
solution for enhancing the applicability of LLMs to clinical text mining.
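As a sketch of the proposed paradigm, one can synthesize tagged NER sentences with a hosted LLM, parse them into BIO labels, and fine-tune a local model; the prompt wording, the `<d>` tag scheme, the `llm_call` stand-in, and the distilbert-base-uncased checkpoint are all assumptions for illustration, not the paper's exact setup.

```python
# Illustrative sketch (not the paper's exact pipeline): synthesize tagged
# NER sentences with a hosted LLM, then fine-tune a local model on them.
# `llm_call`, the prompt, and the <d> tag scheme are assumptions.
import re
from transformers import AutoTokenizer, AutoModelForTokenClassification

PROMPT = ("Write one synthetic clinical sentence and wrap each disease "
          "mention in <d>...</d> tags. Do not use real patient data.")

def parse_example(text):
    """Convert 'Signs of <d>septic shock</d> noted' into tokens + BIO tags."""
    tokens, labels = [], []
    for chunk in re.split(r"(<d>.*?</d>)", text):
        words = (chunk[3:-4] if chunk.startswith("<d>") else chunk).split()
        if chunk.startswith("<d>") and words:
            labels += ["B-DIS"] + ["I-DIS"] * (len(words) - 1)
        else:
            labels += ["O"] * len(words)
        tokens += words
    return tokens, labels

def build_synthetic_set(llm_call, n=1000):
    """llm_call: any function that sends a prompt to an LLM, returns text."""
    return [parse_example(llm_call(PROMPT)) for _ in range(n)]

# Fine-tune a small local model on the synthetic set so raw patient data
# never leaves the local environment (training loop omitted; any standard
# token-classification recipe applies).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # O, B-DIS, I-DIS
```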
LLM for Patient-Trial Matching: Privacy-Aware Data Augmentation Towards Better Performance and Generalizability
The process of matching patients with suitable clinical trials is essential
for advancing medical research and providing optimal care. However, current
approaches face challenges such as data standardization, ethical
considerations, and a lack of interoperability between Electronic Health
Records (EHRs) and clinical trial criteria. In this paper, we explore the
potential of large language models (LLMs) to address these challenges by
leveraging their advanced natural language generation capabilities to improve
compatibility between EHRs and clinical trial descriptions. We propose an
innovative privacy-aware data augmentation approach for LLM-based patient-trial
matching (LLM-PTM), which preserves the benefits of LLMs while ensuring the
security and confidentiality of sensitive patient data. Our experiments
demonstrate a 7.32% average improvement in performance using the proposed
LLM-PTM method, and the generalizability to new data is improved by 12.12%.
Additionally, we present case studies to further illustrate the effectiveness
of our approach and provide a deeper understanding of its underlying
principles.
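The abstract does not detail the augmentation itself, so the following is only one plausible reading, sketched under stated assumptions: scrub identifiers locally first, then let an LLM paraphrase the de-identified text into extra training examples. `llm_call`, the regex patterns, and the prompt are placeholders.

```python
# One plausible reading of "privacy-aware augmentation" (a sketch, not the
# paper's confirmed method): scrub identifiers locally, then let an LLM
# paraphrase the de-identified text into extra training examples.
import re

PHI_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b",        # SSN-like identifiers
                r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"]  # dates

def deidentify(text):
    """Crude placeholder scrubber; a real system would run a dedicated
    clinical de-identification tool before anything leaves the site."""
    for pat in PHI_PATTERNS:
        text = re.sub(pat, "[REDACTED]", text)
    return text

def augment(llm_call, patient_note, n_variants=3):
    """llm_call: stand-in for any chat-completion client."""
    prompt = ("Paraphrase this de-identified patient summary, preserving "
              "all clinical facts:\n" + deidentify(patient_note))
    return [llm_call(prompt) for _ in range(n_variants)]
```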
SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization
Electronic health records (EHRs) store an extensive array of patient
information, encompassing medical histories, diagnoses, treatments, and test
outcomes. These records are crucial for enabling healthcare providers to make
well-informed decisions regarding patient care. Summarizing clinical notes
further assists healthcare professionals in pinpointing potential health risks
and making better-informed decisions. This process contributes to reducing
errors and enhancing patient outcomes by ensuring providers have access to the
most pertinent and current patient data. Recent research has shown that
incorporating prompts with large language models (LLMs) substantially boosts
the efficacy of summarization tasks. However, we show that this approach also
leads to increased output variance, resulting in notably divergent outputs even
when prompts share similar meanings. To tackle this challenge, we introduce a
model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft
prompts to diminish variance while preserving the advantages of prompt-based
summarization. Experimental findings on multiple clinical note tasks and LLMs
indicate that our method not only bolsters performance but also effectively
curbs variance for various LLMs, providing a more uniform and dependable
solution for summarizing vital medical information.
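As a minimal sketch of the soft-prompt ingredient (the calibration objective itself is not given in the abstract): learnable virtual-token embeddings are prepended to the token embeddings of a frozen LM, so one trained prefix can be shared across differently worded user prompts. The sizes below are assumptions.

```python
# Minimal soft-prompt sketch; SPeC's calibration objective is assumed away.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual tokens prepended to a frozen LM's input embeddings."""
    def __init__(self, n_virtual_tokens=20, embed_dim=768):
        super().__init__()
        self.prompt = nn.Parameter(0.02 * torch.randn(n_virtual_tokens,
                                                      embed_dim))

    def forward(self, token_embeds):            # (batch, seq_len, embed_dim)
        p = self.prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([p, token_embeds], dim=1)

# Only the prompt parameters are trained; the LM stays frozen, so one
# calibrated prefix can be reused across differently worded user prompts.
```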
Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking
The vast amount of training data available on the Internet has been a key
factor in the success of deep learning models. However, this abundance of
publicly available data also raises concerns about the unauthorized
exploitation of datasets for commercial purposes, which is forbidden by dataset licenses. In
this paper, we propose a backdoor-based watermarking approach that serves as a
general framework for safeguarding publicly available data. By inserting a small
number of watermarking samples into the dataset, our approach enables the
learning model to implicitly learn a secret function set by defenders. This
hidden function can then be used as a watermark to track down third-party
models that use the dataset illegally. Unfortunately, existing backdoor
insertion methods often entail adding arbitrary and mislabeled data to the
training set, leading to a significant drop in performance and easy detection
by anomaly detection algorithms. To overcome this challenge, we introduce a
clean-label backdoor watermarking framework that uses imperceptible
perturbations to replace mislabeled samples. As a result, the watermarking
samples remain consistent with the original labels, making them difficult to
detect. Our experiments on text, image, and audio datasets demonstrate that the
proposed framework effectively safeguards datasets with minimal impact on
original task performance. We also show that adding just 1% of watermarking
samples can inject a traceable watermarking function and that our watermarking
samples are stealthy and look benign upon visual inspection.
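A minimal sketch of clean-label watermark injection for an image dataset, assuming images in [0, 1] and a bounded additive trigger; the paper's perturbation optimization is more involved, so `trigger`, `eps`, and the 1% rate are illustrative.

```python
# Minimal clean-label watermark injection sketch, assuming images in [0, 1]
# (numpy arrays) and a bounded additive `trigger` of the same image shape.
import numpy as np

def watermark_dataset(images, labels, target_class, trigger,
                      eps=8 / 255, rate=0.01):
    """Perturb `rate` of the target-class images with an imperceptible
    trigger, keeping their original (correct) labels: clean-label."""
    images = images.copy()
    idx = np.where(labels == target_class)[0]
    chosen = np.random.choice(idx, size=max(1, int(rate * len(idx))),
                              replace=False)
    delta = np.clip(trigger, -eps, eps)          # bounded, hard to spot
    images[chosen] = np.clip(images[chosen] + delta, 0.0, 1.0)
    return images, labels                        # labels untouched

# Ownership check (conceptually): a model trained on this set responds to
# `trigger` in a statistically detectable way on held-out probe inputs.
```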
TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos
by ad-hoc textual queries, CLIP-based methods are dominant. Compared to
CLIP4Clip, which is efficient and compact, state-of-the-art models tend to
compute video-text similarity through fine-grained cross-modal feature
interaction and matching, putting their scalability for large-scale T2VR in
doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching,
letting a CLIP4Clip-based student network learn from more advanced yet
computationally heavy models such as X-CLIP, TS2-Net, and X-Pool. To improve the student's
learning capability, we add an Attentional frame-Feature Aggregation (AFA)
block, which by design adds no extra storage/computation overhead at the
retrieval stage. While attentive weights produced by AFA are commonly used for
combining frame-level features, we propose a novel use of the weights to let
them imitate frame-text relevance estimated by the teacher network. As such,
AFA provides a fine-grained learning (teaching) channel for the student
(teacher). Extensive experiments on multiple public datasets justify the
viability of the proposed method.
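A sketch of the AFA idea as the abstract describes it, with assumed shapes: attention weights pool frame features into a video feature, and the same weights are trained to imitate the teacher's frame-text relevance distribution.

```python
# Sketch of the AFA idea with assumed shapes: attention weights both pool
# frame features and imitate the teacher's frame-text relevance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFA(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # no extra cost at retrieval time

    def forward(self, frame_feats):      # (batch, n_frames, dim)
        w = torch.softmax(self.score(frame_feats).squeeze(-1), dim=1)
        video_feat = torch.einsum("bn,bnd->bd", w, frame_feats)
        return video_feat, w

def teaching_loss(student_weights, teacher_relevance):
    """Fine-grained teaching: align the student's attention distribution
    with the teacher's (softmaxed) frame-text relevance scores."""
    return F.kl_div(torch.log(student_weights + 1e-8), teacher_relevance,
                    reduction="batchmean")
```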
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
In the field of natural language processing, the prevalent approach involves
fine-tuning pretrained language models (PLMs) using local samples. Recent
research has exposed the susceptibility of PLMs to backdoor attacks, wherein
the adversaries can embed malicious prediction behaviors by manipulating a few
training samples. In this study, our objective is to develop a
backdoor-resistant tuning procedure that yields a backdoor-free model
regardless of whether the fine-tuning dataset contains poisoned samples. To this end,
we propose and integrate a honeypot module into the original PLM, specifically
designed to absorb backdoor information exclusively. Our design is motivated by
the observation that lower-layer representations in PLMs carry sufficient
backdoor features while carrying minimal information about the original tasks.
Consequently, we can impose penalties on the information acquired by the
honeypot module to inhibit backdoor creation during the fine-tuning process of
the stem network. Comprehensive experiments conducted on benchmark datasets
substantiate the effectiveness and robustness of our defensive strategy.
Notably, these results indicate a substantial reduction in the attack success
rate, ranging from 10% to 40%, compared to prior state-of-the-art methods.
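The abstract gives the design intuition but not the penalty itself, so the following is a hedged sketch: a honeypot head reads lower-layer features, and samples it fits suspiciously well (likely poisoned) are down-weighted in the stem head's loss. `tap_layer`, the re-weighting rule, and the Hugging-Face-style encoder interface are assumptions.

```python
# Hedged sketch of a honeypot-style defense; `tap_layer`, the re-weighting
# rule, and the HF-style encoder interface are assumptions, not the paper's
# exact penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HoneypotPLM(nn.Module):
    def __init__(self, encoder, hidden=768, n_classes=2, tap_layer=1):
        super().__init__()
        self.encoder = encoder           # any PLM exposing hidden states
        self.tap_layer = tap_layer
        self.honeypot_head = nn.Linear(hidden, n_classes)
        self.main_head = nn.Linear(hidden, n_classes)

    def forward(self, **inputs):
        out = self.encoder(**inputs, output_hidden_states=True)
        low = out.hidden_states[self.tap_layer][:, 0]   # lower-layer [CLS]
        top = out.hidden_states[-1][:, 0]               # final-layer [CLS]
        return self.honeypot_head(low), self.main_head(top)

def defended_loss(hp_logits, main_logits, labels):
    hp_loss = F.cross_entropy(hp_logits, labels, reduction="none")
    # Low honeypot loss suggests a poisoned sample; give it little weight in
    # the stem objective so the backdoor stays trapped in the honeypot.
    weights = torch.sigmoid(hp_loss - hp_loss.mean()).detach()
    main = (weights * F.cross_entropy(main_logits, labels,
                                      reduction="none")).mean()
    return main + hp_loss.mean()
```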
Effect of arabinogalactan protein complex content on emulsification performance of gum arabic
The emulsification properties of standard (STD), matured (EM2 and EM10), and fractionated gum arabic samples, the latter obtained via phase-separation-induced molecular fractionation, were investigated to determine how the content of the arabinogalactan protein (AGP) complex affects the resulting emulsion properties. Phase separation and the accompanying molecular fractionation were induced by mixing with different hydrocolloids, including hyaluronan (HA), carboxymethyl cellulose (CMC), and maltodextrin (MD). Increasing the AGP content from 11% to 28% resulted in the formation of emulsions with relatively smaller droplet sizes and better stability. A further increase in AGP content to 41% resulted in the formation of emulsions with larger droplets. In spite of the larger droplet sizes, these emulsions were extremely stable. In addition, the emulsions prepared with gum arabic of higher AGP content showed better stability in the presence of ethanol. The results indicate that AGP content plays a vital role in emulsion stability and droplet size.