Deep Serial Number: Computational Watermarking for DNN Intellectual Property Protection
In this paper, we introduce DSN (Deep Serial Number), a new watermarking
approach that prevents stolen models from being deployed by unauthorized
parties. Recently, watermarking in DNNs has emerged as a new research direction
for owners to claim ownership of DNN models. However, the verification schemes
of existing watermarking approaches are vulnerable to various watermark
attacks. Different from existing work that embeds identification information
into DNNs, we explore a new DNN Intellectual Property Protection mechanism that
can prevent adversaries from deploying the stolen deep neural networks.
Motivated by the success of serial numbers in protecting conventional software
IP, we present the first attempt to embed a serial number into DNNs.
Specifically, the proposed DSN is implemented in the knowledge distillation
framework, where a private teacher DNN is first trained, then its knowledge is
distilled and transferred to a series of customized student DNNs. During the
distillation process, each customer DNN is augmented with a unique serial
number, i.e., an encrypted 0/1-bit trigger pattern. A customer DNN works
properly only when a valid serial number is entered. The embedded
serial number could be used as a strong watermark for ownership verification.
Experiments on various applications indicate that DSN is effective in
preventing unauthorized deployment without sacrificing the original DNN's
performance. Further experimental analysis shows that DSN is resistant to
different categories of attacks.
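The abstract names the mechanism (distillation plus an encrypted 0/1 trigger) without spelling out the details, so the following is a minimal, hypothetical PyTorch sketch of serial-number-conditioned distillation; the network sizes, the gating design, and the scramble loss are illustrative assumptions, not the paper's construction.

```python
# Hypothetical sketch of serial-number-conditioned distillation (DSN-style).
# Shapes, gating, and losses are assumptions; the paper's design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SerialStudent(nn.Module):
    """Student whose features are gated by an embedded 0/1 serial number."""
    def __init__(self, in_dim=784, num_classes=10, serial_bits=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.serial_proj = nn.Linear(serial_bits, 256)  # serial embedding
        self.head = nn.Linear(256, num_classes)

    def forward(self, x, serial):
        gate = torch.sigmoid(self.serial_proj(serial))  # (batch, 256)
        return self.head(self.backbone(x) * gate)

def dsn_step(student, teacher, x, valid_serial, opt, T=4.0):
    """One distillation step: match the teacher under the valid serial,
    push outputs toward uniform (useless) under a random invalid serial."""
    opt.zero_grad()
    with torch.no_grad():
        t_logits = teacher(x)
    s_ok = student(x, valid_serial.expand(x.size(0), -1))
    kd = F.kl_div(F.log_softmax(s_ok / T, dim=1),
                  F.softmax(t_logits / T, dim=1), reduction="batchmean")
    bad = torch.randint(0, 2, valid_serial.shape).float()
    s_bad = student(x, bad.expand(x.size(0), -1))
    uniform = torch.full_like(s_bad, 1.0 / s_bad.size(1))
    scramble = F.kl_div(F.log_softmax(s_bad, dim=1), uniform,
                        reduction="batchmean")
    loss = kd + scramble
    loss.backward()
    opt.step()
    return loss.item()

# Usage (illustrative): valid_serial = torch.randint(0, 2, (1, 32)).float()
```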
The Science of Detecting LLM-Generated Texts
The emergence of large language models (LLMs) has resulted in the production
of LLM-generated texts that are highly sophisticated and almost
indistinguishable from those written by humans. However, this has also sparked
concerns about the potential misuse of such texts, such as spreading
misinformation and causing disruptions in the education system. Although many
detection approaches have been proposed, a comprehensive understanding of the
achievements and challenges is still lacking. This survey aims to provide an
overview of existing LLM-generated text detection techniques and enhance the
control and regulation of language generation models. Furthermore, we emphasize
crucial considerations for future research, including the development of
comprehensive evaluation metrics and the threat posed by open-source LLMs, to
drive progress in the area of LLM-generated text detection.
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Recent advancements in large language models (LLMs) have led to the
development of highly potent models like OpenAI's ChatGPT. These models have
exhibited exceptional performance in a variety of tasks, such as question
answering, essay composition, and code generation. However, their effectiveness
in the healthcare sector remains uncertain. In this study, we seek to
investigate the potential of ChatGPT to aid in clinical text mining by
examining its ability to extract structured information from unstructured
healthcare texts, with a focus on biological named entity recognition and
relation extraction. However, our preliminary results indicate that employing
ChatGPT directly for these tasks results in poor performance and raises
privacy concerns associated with uploading patients' information to the ChatGPT
API. To overcome these limitations, we propose a new training paradigm that
involves generating a vast quantity of high-quality synthetic data with labels
utilizing ChatGPT and fine-tuning a local model for the downstream task. Our
method has resulted in significant improvements in the performance of
downstream tasks, improving the F1-score from 23.37% to 63.99% for the named
entity recognition task and from 75.86% to 83.59% for the relation extraction
task. Furthermore, generating data using ChatGPT can significantly reduce the
time and effort required for data collection and labeling, as well as mitigate
data privacy concerns. In summary, the proposed framework presents a promising
solution for enhancing the applicability of LLMs to clinical text mining.
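As a sketch of the proposed paradigm, one can synthesize tagged NER sentences with a hosted LLM, parse them into BIO labels, and fine-tune a local model; the prompt wording, the `<d>` tag scheme, the `llm_call` stand-in, and the distilbert-base-uncased checkpoint are all assumptions for illustration, not the paper's exact setup.

```python
# Illustrative sketch (not the paper's exact pipeline): synthesize tagged
# NER sentences with a hosted LLM, then fine-tune a local model on them.
# `llm_call`, the prompt, and the <d> tag scheme are assumptions.
import re
from transformers import AutoTokenizer, AutoModelForTokenClassification

PROMPT = ("Write one synthetic clinical sentence and wrap each disease "
          "mention in <d>...</d> tags. Do not use real patient data.")

def parse_example(text):
    """Convert 'Signs of <d>septic shock</d> noted' into tokens + BIO tags."""
    tokens, labels = [], []
    for chunk in re.split(r"(<d>.*?</d>)", text):
        words = (chunk[3:-4] if chunk.startswith("<d>") else chunk).split()
        if chunk.startswith("<d>") and words:
            labels += ["B-DIS"] + ["I-DIS"] * (len(words) - 1)
        else:
            labels += ["O"] * len(words)
        tokens += words
    return tokens, labels

def build_synthetic_set(llm_call, n=1000):
    """llm_call: any function that sends a prompt to an LLM, returns text."""
    return [parse_example(llm_call(PROMPT)) for _ in range(n)]

# Fine-tune a small local model on the synthetic set so raw patient data
# never leaves the local environment (training loop omitted; any standard
# token-classification recipe applies).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)  # O, B-DIS, I-DIS
```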
LLM for Patient-Trial Matching: Privacy-Aware Data Augmentation Towards Better Performance and Generalizability
The process of matching patients with suitable clinical trials is essential
for advancing medical research and providing optimal care. However, current
approaches face challenges such as data standardization, ethical
considerations, and a lack of interoperability between Electronic Health
Records (EHRs) and clinical trial criteria. In this paper, we explore the
potential of large language models (LLMs) to address these challenges by
leveraging their advanced natural language generation capabilities to improve
compatibility between EHRs and clinical trial descriptions. We propose an
innovative privacy-aware data augmentation approach for LLM-based patient-trial
matching (LLM-PTM), which preserves the benefits of LLMs while ensuring the
security and confidentiality of sensitive patient data. Our experiments
demonstrate a 7.32% average improvement in performance using the proposed
LLM-PTM method, and the generalizability to new data is improved by 12.12%.
Additionally, we present case studies to further illustrate the effectiveness
of our approach and provide a deeper understanding of its underlying
principles.
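The abstract does not detail the augmentation itself, so the following is only one plausible reading, sketched under stated assumptions: scrub identifiers locally first, then let an LLM paraphrase the de-identified text into extra training examples. `llm_call`, the regex patterns, and the prompt are placeholders.

```python
# One plausible reading of "privacy-aware augmentation" (a sketch, not the
# paper's confirmed method): scrub identifiers locally, then let an LLM
# paraphrase the de-identified text into extra training examples.
import re

PHI_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b",        # SSN-like identifiers
                r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"]  # dates

def deidentify(text):
    """Crude placeholder scrubber; a real system would run a dedicated
    clinical de-identification tool before anything leaves the site."""
    for pat in PHI_PATTERNS:
        text = re.sub(pat, "[REDACTED]", text)
    return text

def augment(llm_call, patient_note, n_variants=3):
    """llm_call: stand-in for any chat-completion client."""
    prompt = ("Paraphrase this de-identified patient summary, preserving "
              "all clinical facts:\n" + deidentify(patient_note))
    return [llm_call(prompt) for _ in range(n_variants)]
```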
SPeC: A Soft Prompt-Based Calibration on Mitigating Performance Variability in Clinical Notes Summarization
Electronic health records (EHRs) store an extensive array of patient
information, encompassing medical histories, diagnoses, treatments, and test
outcomes. These records are crucial for enabling healthcare providers to make
well-informed decisions regarding patient care. Summarizing clinical notes
further assists healthcare professionals in pinpointing potential health risks
and making better-informed decisions. This process contributes to reducing
errors and enhancing patient outcomes by ensuring providers have access to the
most pertinent and current patient data. Recent research has shown that
incorporating prompts with large language models (LLMs) substantially boosts
the efficacy of summarization tasks. However, we show that this approach also
leads to increased output variance, resulting in notably divergent outputs even
when prompts share similar meanings. To tackle this challenge, we introduce a
model-agnostic Soft Prompt-Based Calibration (SPeC) pipeline that employs soft
prompts to diminish variance while preserving the advantages of prompt-based
summarization. Experimental findings on multiple clinical note tasks and LLMs
indicate that our method not only bolsters performance but also effectively
curbs variance for various LLMs, providing a more uniform and dependable
solution for summarizing vital medical information.
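As a minimal sketch of the soft-prompt ingredient (the calibration objective itself is not given in the abstract): learnable virtual-token embeddings are prepended to the token embeddings of a frozen LM, so one trained prefix can be shared across differently worded user prompts. The sizes below are assumptions.

```python
# Minimal soft-prompt sketch; SPeC's calibration objective is assumed away.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable virtual tokens prepended to a frozen LM's input embeddings."""
    def __init__(self, n_virtual_tokens=20, embed_dim=768):
        super().__init__()
        self.prompt = nn.Parameter(0.02 * torch.randn(n_virtual_tokens,
                                                      embed_dim))

    def forward(self, token_embeds):            # (batch, seq_len, embed_dim)
        p = self.prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([p, token_embeds], dim=1)

# Only the prompt parameters are trained; the LM stays frozen, so one
# calibrated prefix can be reused across differently worded user prompts.
```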
Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking
The vast amount of training data available on the Internet has been a key
factor in the success of deep learning models. However, this abundance of
publicly available data also raises concerns about the unauthorized
exploitation of datasets for commercial purposes, which is forbidden by dataset licenses. In
this paper, we propose a backdoor-based watermarking approach that serves as a
general framework for safeguarding publicly available data. By inserting a small
number of watermarking samples into the dataset, our approach enables the
learning model to implicitly learn a secret function set by defenders. This
hidden function can then be used as a watermark to track down third-party
models that use the dataset illegally. Unfortunately, existing backdoor
insertion methods often entail adding arbitrary and mislabeled data to the
training set, leading to a significant drop in performance and easy detection
by anomaly detection algorithms. To overcome this challenge, we introduce a
clean-label backdoor watermarking framework that uses imperceptible
perturbations to replace mislabeled samples. As a result, the watermarking
samples remain consistent with the original labels, making them difficult to
detect. Our experiments on text, image, and audio datasets demonstrate that the
proposed framework effectively safeguards datasets with minimal impact on
original task performance. We also show that adding just 1% of watermarking
samples can inject a traceable watermarking function and that our watermarking
samples are stealthy and look benign upon visual inspection.
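A minimal sketch of clean-label watermark injection for an image dataset, assuming images in [0, 1] and a bounded additive trigger; the paper's perturbation optimization is more involved, so `trigger`, `eps`, and the 1% rate are illustrative.

```python
# Minimal clean-label watermark injection sketch, assuming images in [0, 1]
# (numpy arrays) and a bounded additive `trigger` of the same image shape.
import numpy as np

def watermark_dataset(images, labels, target_class, trigger,
                      eps=8 / 255, rate=0.01):
    """Perturb `rate` of the target-class images with an imperceptible
    trigger, keeping their original (correct) labels: clean-label."""
    images = images.copy()
    idx = np.where(labels == target_class)[0]
    chosen = np.random.choice(idx, size=max(1, int(rate * len(idx))),
                              replace=False)
    delta = np.clip(trigger, -eps, eps)          # bounded, hard to spot
    images[chosen] = np.clip(images[chosen] + delta, 0.0, 1.0)
    return images, labels                        # labels untouched

# Ownership check (conceptually): a model trained on this set responds to
# `trigger` in a statistically detectable way on held-out probe inputs.
```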
TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval
For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos
by ad-hoc textual queries, CLIP-based methods are dominant. Compared to
CLIP4Clip, which is efficient and compact, state-of-the-art models tend to
compute video-text similarity through fine-grained cross-modal feature
interaction and matching, putting their scalability for large-scale T2VR in
doubt. For efficient T2VR, we propose TeachCLIP with multi-grained teaching,
letting a CLIP4Clip-based student network learn from more advanced yet
computationally heavy models such as X-CLIP, TS2-Net, and X-Pool. To improve the student's
learning capability, we add an Attentional frame-Feature Aggregation (AFA)
block, which by design adds no extra storage/computation overhead at the
retrieval stage. While attentive weights produced by AFA are commonly used for
combining frame-level features, we propose a novel use of the weights to let
them imitate frame-text relevance estimated by the teacher network. As such,
AFA provides a fine-grained learning (teaching) channel for the student
(teacher). Extensive experiments on multiple public datasets justify the
viability of the proposed method.
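A sketch of the AFA idea as the abstract describes it, with assumed shapes: attention weights pool frame features into a video feature, and the same weights are trained to imitate the teacher's frame-text relevance distribution.

```python
# Sketch of the AFA idea with assumed shapes: attention weights both pool
# frame features and imitate the teacher's frame-text relevance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFA(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # no extra cost at retrieval time

    def forward(self, frame_feats):      # (batch, n_frames, dim)
        w = torch.softmax(self.score(frame_feats).squeeze(-1), dim=1)
        video_feat = torch.einsum("bn,bnd->bd", w, frame_feats)
        return video_feat, w

def teaching_loss(student_weights, teacher_relevance):
    """Fine-grained teaching: align the student's attention distribution
    with the teacher's (softmaxed) frame-text relevance scores."""
    return F.kl_div(torch.log(student_weights + 1e-8), teacher_relevance,
                    reduction="batchmean")
```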
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
In the field of natural language processing, the prevalent approach involves
fine-tuning pretrained language models (PLMs) using local samples. Recent
research has exposed the susceptibility of PLMs to backdoor attacks, wherein
the adversaries can embed malicious prediction behaviors by manipulating a few
training samples. In this study, our objective is to develop a
backdoor-resistant tuning procedure that yields a backdoor-free model
regardless of whether the fine-tuning dataset contains poisoned samples. To this end,
we propose and integrate a honeypot module into the original PLM, specifically
designed to absorb backdoor information exclusively. Our design is motivated by
the observation that lower-layer representations in PLMs carry sufficient
backdoor features while carrying minimal information about the original tasks.
Consequently, we can impose penalties on the information acquired by the
honeypot module to inhibit backdoor creation during the fine-tuning process of
the stem network. Comprehensive experiments conducted on benchmark datasets
substantiate the effectiveness and robustness of our defensive strategy.
Notably, these results indicate a substantial reduction in the attack success
rate, ranging from 10% to 40%, compared to prior state-of-the-art methods.
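The abstract gives the design intuition but not the penalty itself, so the following is a hedged sketch: a honeypot head reads lower-layer features, and samples it fits suspiciously well (likely poisoned) are down-weighted in the stem head's loss. `tap_layer`, the re-weighting rule, and the Hugging-Face-style encoder interface are assumptions.

```python
# Hedged sketch of a honeypot-style defense; `tap_layer`, the re-weighting
# rule, and the HF-style encoder interface are assumptions, not the paper's
# exact penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HoneypotPLM(nn.Module):
    def __init__(self, encoder, hidden=768, n_classes=2, tap_layer=1):
        super().__init__()
        self.encoder = encoder           # any PLM exposing hidden states
        self.tap_layer = tap_layer
        self.honeypot_head = nn.Linear(hidden, n_classes)
        self.main_head = nn.Linear(hidden, n_classes)

    def forward(self, **inputs):
        out = self.encoder(**inputs, output_hidden_states=True)
        low = out.hidden_states[self.tap_layer][:, 0]   # lower-layer [CLS]
        top = out.hidden_states[-1][:, 0]               # final-layer [CLS]
        return self.honeypot_head(low), self.main_head(top)

def defended_loss(hp_logits, main_logits, labels):
    hp_loss = F.cross_entropy(hp_logits, labels, reduction="none")
    # Low honeypot loss suggests a poisoned sample; give it little weight in
    # the stem objective so the backdoor stays trapped in the honeypot.
    weights = torch.sigmoid(hp_loss - hp_loss.mean()).detach()
    main = (weights * F.cross_entropy(main_logits, labels,
                                      reduction="none")).mean()
    return main + hp_loss.mean()
```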
Effect of arabinogalactan protein complex content on emulsification performance of gum arabic
The emulsification properties of standard (STD), matured (EM2 and EM10), and fractionated gum arabic samples, the latter obtained via phase-separation-induced molecular fractionation, were investigated to determine how the content of the arabinogalactan protein (AGP) complex affects the resulting emulsion properties. Phase separation and the accompanying molecular fractionation were induced by mixing with different hydrocolloids, including hyaluronan (HA), carboxymethyl cellulose (CMC), and maltodextrin (MD). Increasing the AGP content from 11% to 28% resulted in the formation of emulsions with relatively smaller droplet sizes and better stability. A further increase in AGP content to 41% resulted in the formation of emulsions with larger droplets. In spite of the larger droplet sizes, these emulsions were extremely stable. In addition, the emulsions prepared with gum arabic of higher AGP content showed better stability in the presence of ethanol. The results indicate that AGP content plays a vital role in emulsion stability and droplet size.