Comparative transcriptome analysis and simple sequence repeat marker development for two closely related Isodon species used as ‘Xihuangcao’ herbs

Author: Hu Weiming
Huang Shanshan
Mo Xiaolu
Wang Ying
Zeng Shaohua
Publication venue: 'African Journals Online (AJOL)'
Publication date: 13/02/2019

Purpose: To facilitate the molecular identification of original plants, resolve taxonomic problems and identify standards for ‘Xihuangcao’-based products on the market.
Methods: A transcriptomic analysis of two closely related species, i.e., Isodon serra (Maxim.) (IS) and I. lophanthoides (Buch.-Ham. ex D. Don) Hara, was conducted by using the Illumina HiSeq 2500 platform, and expressed sequence tag-derived simple sequence repeat (EST-SSR) markers were developed based on these transcriptomes.
Results: In total, 149,650 and 103,221 contigs were obtained, with N50 values of 1,400 and 1,516, from the IS and I. lophanthoides RNA-Seq datasets, respectively. These contigs were clustered into 107,777 and 68,220 unigenes, which were functionally annotated to identify the genes involved in therapeutic components. In total, 14,138 and 11,756 EST-SSR motifs were identified, and of these motifs, 7,453 and 6,428 were used to design primers for IS and I. lophanthoides, respectively. After PCR verification and fluorescence-based genotyping, 24 SSR markers with bright bands, high polymorphism, and single amplification were obtained and used to identify closely related Isodon species/varieties.
Conclusion: These data could help herbal scientists identify high-quality herbal plants and provide a reference for genetic improvement and population genetic and phylogenetic studies investigating ‘Xihuangcao’ herbs.
Keywords: Xihuangcao, Transcriptome, EST-SSR, Molecular marker

AJOL - African Journals Online
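
The abstract above reports EST-SSR motif counts but does not say which tool or thresholds were used to find them. As a rough illustration of what detecting a simple sequence repeat involves, the following Python sketch scans a nucleotide sequence for perfect tandem repeats of 2 to 6 bp motifs. The function name `find_ssrs` and the minimum repeat counts in `MIN_REPEATS` are assumptions made for this example, not values from the paper.

```python
import re

# Assumed thresholds (not from the paper): motif length -> minimum repeat count.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for k, n in MIN_REPEATS.items():
        # A k-mer followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


example = "GGC" + "AT" * 7 + "CCT" + "CAG" * 5 + "TT"
print(find_ssrs(example))
```

Mononucleotide runs and imperfect or compound repeats are deliberately left out of this sketch; a production pipeline would normally rely on a dedicated SSR-mining tool.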

Provably Secure Disambiguating Neural Linguistic Steganography

Author: Chen Kejiang
Qi Yuang
Yu Nenghai
Zeng Kai
Zhang Weiming
Publication date: 26/03/2024

Recent research in provably secure neural linguistic steganography has overlooked a crucial aspect: the sender must detokenize stegotexts to avoid raising suspicion from the eavesdropper. The segmentation ambiguity problem, which arises when using language models based on subwords, leads to occasional decoding failures in all neural language steganography implementations based on these models. Current solutions to this issue involve altering the probability distribution of candidate words, rendering them incompatible with provably secure steganography. We propose a novel secure disambiguation method named SyncPool, which effectively addresses the segmentation ambiguity problem. We group all tokens with prefix relationships in the candidate pool before the steganographic embedding algorithm runs to eliminate uncertainty among ambiguous tokens. To enable the receiver to synchronize the sampling process of the sender, a shared cryptographically-secure pseudorandom number generator (CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does not change the size of the candidate pool or the distribution of tokens and thus is applicable to provably secure language steganography methods. We provide theoretical proofs and experimentally demonstrate the applicability of our solution to various languages and models, showing its potential to significantly improve the reliability and security of neural linguistic steganography systems.

arXiv.org e-Print Archive
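
The two steps named in the SyncPool abstract above (grouping candidate tokens that stand in a prefix relationship, then letting a shared generator pick one token from the chosen group) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the candidate probabilities are made up, and the HMAC-SHA256 counter construction is only a stand-in for whatever CSPRNG the sender and receiver actually share.

```python
import hashlib
import hmac


def build_ambiguity_pools(candidates):
    """Group tokens so that every token in a pool starts with the pool's first token.

    candidates: dict mapping token string -> probability.
    """
    pools = []
    # In sorted order, all extensions of a token follow it contiguously.
    for tok in sorted(candidates):
        if pools and tok.startswith(pools[-1][0]):
            pools[-1].append(tok)
        else:
            pools.append([tok])
    return pools


def keyed_uniform(key, step):
    """Deterministic value in [0, 1) derived from a shared key and a step counter."""
    digest = hmac.new(key, step.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def pick_from_pool(pool, candidates, key, step):
    """Select a token from one pool in proportion to its probability."""
    total = sum(candidates[t] for t in pool)
    u = keyed_uniform(key, step) * total
    acc = 0.0
    for tok in pool:
        acc += candidates[tok]
        if u < acc:
            return tok
    return pool[-1]  # guard against floating-point rounding


# Hypothetical candidate pool for one generation step.
candidates = {"a": 0.1, "an": 0.2, "and": 0.3, "the": 0.4}
pools = build_ambiguity_pools(candidates)
shared_key = b"example shared secret"
print(pools, pick_from_pool(pools[0], candidates, shared_key, step=0))
```

In this sketch the steganographic embedding would operate on the pools (each carrying the summed probability of its members), and only the within-pool choice uses the shared generator, so a receiver holding the same key and step counter reproduces the same choice.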

HiCu: Leveraging Hierarchy for Curriculum Learning in Automated ICD Coding

Author: Krishnan Rahul G.
Ren Weiming
Wu Tongzi
Zeng Ruijing
Zhu Tianshu
Publication date: 03/08/2022

There are several opportunities for automation in healthcare that can improve clinician throughput. One such example is assistive tools to document diagnosis codes when clinicians write notes. We study the automation of medical code prediction using curriculum learning, which is a training strategy for machine learning models that gradually increases the hardness of the learning tasks from easy to difficult. One of the challenges in curriculum learning is the design of curricula -- i.e., in the sequential design of tasks that gradually increase in difficulty. We propose Hierarchical Curriculum Learning (HiCu), an algorithm that uses graph structure in the space of outputs to design curricula for multi-label classification. We create curricula for multi-label classification models that predict ICD diagnosis and procedure codes from natural language descriptions of patients. By leveraging the hierarchy of ICD codes, which groups diagnosis codes based on various organ systems in the human body, we find that our proposed curricula improve the generalization of neural network-based predictive models across recurrent, convolutional, and transformer-based architectures. Our code is available at https://github.com/wren93/HiCu-ICD.
Comment: To appear at Machine Learning for Healthcare Conference (MLHC2022)

arXiv.org e-Print Archive
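
To make the idea of a hierarchy-driven curriculum in the HiCu abstract above more concrete, here is a minimal Python sketch that derives coarse-to-fine label sets from a list of diagnosis codes. It assumes, purely for illustration, that an ancestor of a code can be obtained by truncating the dot-free code string (3 characters, then 4, then 5); the real ICD hierarchy and the curriculum used in the paper may differ, and `ancestor` and `curriculum_targets` are hypothetical names.

```python
def ancestor(code, depth):
    """Return the assumed ancestor of a code at the given depth (0 = coarsest)."""
    flat = code.replace(".", "")
    return flat[: 3 + depth]


def curriculum_targets(codes, max_depth=2):
    """Build one multi-label target set per curriculum stage, coarse to fine."""
    return [{ancestor(c, d) for c in codes} for d in range(max_depth + 1)]


# Hypothetical codes attached to one clinical note.
note_codes = ["250.01", "250.13", "401.9"]
for stage, labels in enumerate(curriculum_targets(note_codes)):
    print(stage, sorted(labels))
```

A model trained with such a curriculum would first learn to predict the small set of coarse labels and would then be trained on successively finer label sets, ending with the full codes.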

Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models

Author: Chen Kejiang
Fang Han
Yang Zijin
Yu Nenghai
Zeng Kai
Zhang Weiming
Publication date: 06/05/2024

Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. However, existing methods often compromise the model performance or require additional training, which is undesirable for operators and users. To address this issue, we propose Gaussian Shading, a diffusion model watermarking technique that is both performance-lossless and training-free, while serving the dual purpose of copyright protection and tracing of offending content. Our watermark embedding is free of model parameter modifications and thus is plug-and-play. We map the watermark to latent representations following a standard Gaussian distribution, which is indistinguishable from latent representations obtained from the non-watermarked diffusion model. Therefore we can achieve watermark embedding with lossless performance, for which we also provide theoretical proof. Furthermore, since the watermark is intricately linked with image semantics, it exhibits resilience to lossy processing and erasure attempts. The watermark can be extracted by Denoising Diffusion Implicit Models (DDIM) inversion and inverse sampling. We evaluate Gaussian Shading on multiple versions of Stable Diffusion, and the results demonstrate that Gaussian Shading not only is performance-lossless but also outperforms existing methods in terms of robustness.
Comment: 17 pages, 11 figures, accepted by CVPR 202

arXiv.org e-Print Archive

Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction

Author: Chen Shiwei
Hu Weiming
Wang Ziyi
Xu Ruifeng
Zeng Jie
Zhang Yice
Publication date: 26/06/2024

Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer's effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We release our code and data at https://github.com/HITSZ-HLT/ST-w-Scorer-ABSA.
Comment: Accepted to ACL 2024 Main Conference

arXiv.org e-Print Archive
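
The filtering step described in the self-training abstract above (a scorer judges how well each pseudo-label matches its review, and mismatches are dropped) might look roughly like the Python sketch below. The threshold rule, the function names, and especially `toy_scorer` are assumptions for illustration only; the paper's scorer is a trained generative model, not a string-matching heuristic.

```python
def filter_pseudo_labels(pairs, scorer, threshold=0.5):
    """Keep (review, quads) pairs whose scorer value reaches the threshold."""
    return [(review, quads) for review, quads in pairs if scorer(review, quads) >= threshold]


def toy_scorer(review, quads):
    """Stand-in scorer: fraction of quads whose aspect and opinion terms occur in the review."""
    if not quads:
        return 0.0
    text = review.lower()
    hits = sum(1 for aspect, _, opinion, _ in quads
               if aspect.lower() in text and opinion.lower() in text)
    return hits / len(quads)


# Each quad is (aspect term, aspect category, opinion term, sentiment polarity).
pseudo_labeled = [
    ("The pizza was great", [("pizza", "food quality", "great", "positive")]),
    ("The service was slow", [("pizza", "food quality", "great", "positive")]),
]
kept = filter_pseudo_labels(pseudo_labeled, toy_scorer)
print(len(kept), "of", len(pseudo_labeled), "pseudo-labeled reviews kept")
```

Only the retained pairs would then be added to the training set for the next self-training round.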

Advanced Information Technology Convergence

Author: Anthony T. S. Ho
Hui Cheng
Jucheng Yang
Sook Yoon
Weiming Zeng
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2016

Crossref

Directory of Open Access Journals

Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting

Author: Chen Zhuo
Chu Pengzhi
Yan Yichao
Yang Xiaokang
Zeng Weili
Zhao Weiming
Zhu Qi
Publication date: 22/04/2024

Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a main challenge, concept overfitting. To tackle this challenge, we first analyze overfitting, categorizing it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which confines customization to limited modalities, i.e., backgrounds, layouts, styles. To evaluate the overfitting degree, we further introduce two metrics, i.e., Latent Fisher divergence and Wasserstein metric, to measure the distribution changes of non-customized and customized concepts, respectively. Drawing from the analysis, we propose Infusion, a T2I customization method that enables the learning of target concepts to avoid being constrained by limited training modalities, while preserving non-customized knowledge. Remarkably, Infusion achieves this with high efficiency, requiring a mere 11KB of trained parameters. Extensive experiments also demonstrate that our approach outperforms state-of-the-art methods in both single and multi-concept customized generation.
Comment: 10 pages

arXiv.org e-Print Archive

PUMGPT: A Large Vision-Language Model for Product Understanding

Author: Cui Baoliang
Guo Zongyi
Lu Weiming
Wang Xiufei
Wu Shuhui
Xing Zheng
Xue Wei
Zeng Xiaoyi
Publication date: 16/06/2024

E-commerce platforms benefit from accurate product understanding to enhance user experience and operational efficiency. Traditional methods often focus on isolated tasks such as attribute extraction or categorization, posing adaptability issues to evolving tasks and leading to usability challenges with noisy data from the internet. Current Large Vision Language Models (LVLMs) lack domain-specific fine-tuning, thus falling short in precision and instruction following. To address these issues, we introduce PumGPT, the first e-commerce specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress, filtering out non-inferable attributes using a universal hallucination detection framework, resulting in 663k high-quality data samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce PumBench, a benchmark to evaluate product understanding across LVLMs. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks. We also conduct extensive analytical experiments to delve deeply into the superiority of PumGPT, demonstrating the necessity for a specialized model in the e-commerce domain.

arXiv.org e-Print Archive