Comparative transcriptome analysis and simple sequence repeat marker development for two closely related Isodon species used as ‘Xihuangcao’ herbs
Purpose: To facilitate the molecular identification of original plants, resolve taxonomic problems and identify standards for ‘Xihuangcao’-based products on the market.
Methods: A transcriptomic analysis of two closely related species, i.e., Isodon serra (Maxim.) (IS) and I. lophanthoides (Buch.-Ham. ex D. Don) Hara, was conducted by using the Illumina HiSeq 2500 platform, and expressed sequence tag-derived simple sequence repeat (EST-SSR) markers were developed based on these transcriptomes.
Results: In total, 149,650 and 103,221 contigs were obtained, with N50 values of 1,400 and 1,516, from the IS and I. lophanthoides RNA-Seq datasets, respectively. These contigs were clustered into 107,777 and 68,220 unigenes, which were functionally annotated to identify the genes involved in therapeutic components. In total, 14,138 and 11,756 EST-SSR motifs were identified, and of these motifs, 7,453 and 6,428 were used to design primers for IS and I. lophanthoides, respectively. After PCR verification and fluorescence-based genotyping, 24 SSR markers with bright bands, high polymorphism, and single amplification were obtained and used to identify closely related Isodon species/varieties.
Conclusion: These data could help herbal scientists identify high-quality herbal plants and provide a reference for genetic improvement and population genetic and phylogenetic studies investigating ‘Xihuangcao’ herbs.
Keywords: Xihuangcao, Transcriptome, EST-SSR, Molecular marker
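The N50 values reported in the abstract above summarize assembly contiguity. As a minimal sketch (not the authors' pipeline), N50 can be computed as the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example: lengths sum to 54, and 10 + 9 + 8 = 27 reaches half of that.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_lengths)
```

For the toy input, the third-longest contig is the one that brings the running total to half of the assembly, so the result is 8.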
Provably Secure Disambiguating Neural Linguistic Steganography
Recent research in provably secure neural linguistic steganography has
overlooked a crucial aspect: the sender must detokenize stegotexts to avoid
raising suspicion from the eavesdropper. The segmentation ambiguity problem,
which arises when using language models based on subwords, leads to occasional
decoding failures in all neural language steganography implementations based on
these models. Current solutions to this issue involve altering the probability
distribution of candidate words, rendering them incompatible with provably
secure steganography. We propose a novel secure disambiguation method named
SyncPool, which effectively addresses the segmentation ambiguity problem. We
group all tokens with prefix relationships in the candidate pool before the
steganographic embedding algorithm runs to eliminate uncertainty among
ambiguous tokens. To enable the receiver to synchronize the sampling process of
the sender, a shared cryptographically-secure pseudorandom number generator
(CSPRNG) is deployed to select a token from the ambiguity pool. SyncPool does
not change the size of the candidate pool or the distribution of tokens and
thus is applicable to provably secure language steganography methods. We
provide theoretical proofs and experimentally demonstrate the applicability of
our solution to various languages and models, showing its potential to
significantly improve the reliability and security of neural linguistic
steganography systems.
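The grouping idea described in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SyncPool implementation: the helper names (`group_by_prefix`, `shared_uniform`, `pick_from_pool`) are hypothetical, HMAC-SHA256 keyed with a shared secret stands in for the shared CSPRNG, and selecting inside a pool in proportion to the original token probabilities is an assumption made here so that the overall token distribution stays unchanged.

```python
import hashlib
import hmac


def group_by_prefix(probs):
    """Group tokens so that a token and all tokens it prefixes share one pool."""
    pools = []
    # In lexicographic order, every extension of a token follows it contiguously.
    for tok in sorted(probs):
        if pools and tok.startswith(pools[-1][0]):
            pools[-1].append(tok)
        else:
            pools.append([tok])
    return pools


def shared_uniform(key, step):
    """Derive a number in [0, 1) from a shared key and step counter (CSPRNG stand-in)."""
    digest = hmac.new(key, step.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def pick_from_pool(pool, probs, key, step):
    """Pick one token from an ambiguity pool, weighted by its original probability."""
    u = shared_uniform(key, step) * sum(probs[t] for t in pool)
    acc = 0.0
    for tok in pool:
        acc += probs[tok]
        if u < acc:
            return tok
    return pool[-1]  # guard against floating-point round-off


# Hypothetical candidate pool: "a" is a prefix of "ab" and "abc".
probs = {"a": 0.2, "ab": 0.1, "abc": 0.05, "b": 0.65}
pools = group_by_prefix(probs)
```

In this sketch the steganographic embedding step would choose among pools (each weighted by the sum of its members' probabilities), and both sender and receiver would then call `pick_from_pool` with the same key and step to resolve the pool to the same token.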
HiCu: Leveraging Hierarchy for Curriculum Learning in Automated ICD Coding
There are several opportunities for automation in healthcare that can improve
clinician throughput. One such example is assistive tools to document diagnosis
codes when clinicians write notes. We study the automation of medical code
prediction using curriculum learning, which is a training strategy for machine
learning models that gradually increases the hardness of the learning tasks
from easy to difficult. One of the challenges in curriculum learning is the
design of curricula, i.e., the sequential design of tasks that gradually
increase in difficulty. We propose Hierarchical Curriculum Learning (HiCu), an
algorithm that uses graph structure in the space of outputs to design curricula
for multi-label classification. We create curricula for multi-label
classification models that predict ICD diagnosis and procedure codes from
natural language descriptions of patients. By leveraging the hierarchy of ICD
codes, which groups diagnosis codes based on various organ systems in the human
body, we find that our proposed curricula improve the generalization of neural
network-based predictive models across recurrent, convolutional, and
transformer-based architectures. Our code is available at
https://github.com/wren93/HiCu-ICD.
Comment: To appear at Machine Learning for Healthcare Conference (MLHC2022)
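To make the idea of a hierarchy-driven curriculum concrete, here is a minimal sketch that builds progressively finer label sets for each training stage. It is not the HiCu algorithm: treating a character prefix of an ICD code as its ancestor is a simplifying assumption, and the names `coarsen` and `curriculum_targets`, along with the example codes and depths, are hypothetical.

```python
def coarsen(code, depth):
    """Map a code to an assumed ancestor: its first `depth` characters, dot removed."""
    return code.replace(".", "")[:depth]


def curriculum_targets(note_codes, depths):
    """Build one set of training targets per curriculum stage.

    note_codes: list of code lists, one per clinical note.
    depths: increasing prefix lengths, from coarse (easy) to fine (hard).
    Returns targets[stage][note] as a sorted list of coarsened labels.
    """
    return [
        [sorted({coarsen(code, depth) for code in codes}) for codes in note_codes]
        for depth in depths
    ]


# One note with three codes; the stages go from 2 labels to 3 labels.
stages = curriculum_targets([["250.01", "250.02", "401.9"]], depths=[1, 3, 5])
```

A model would be trained on the coarse stage first and then on each finer stage in turn, so that early stages involve fewer, broader labels than the final task.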
Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models
Ethical concerns surrounding copyright protection and inappropriate content
generation pose challenges for the practical implementation of diffusion
models. One effective solution involves watermarking the generated images.
However, existing methods often compromise the model performance or require
additional training, which is undesirable for operators and users. To address
this issue, we propose Gaussian Shading, a diffusion model watermarking
technique that is both performance-lossless and training-free, while serving
the dual purpose of copyright protection and tracing of offending content. Our
watermark embedding is free of model parameter modifications and thus is
plug-and-play. We map the watermark to latent representations following a
standard Gaussian distribution, which is indistinguishable from latent
representations obtained from the non-watermarked diffusion model. Therefore we
can achieve watermark embedding with lossless performance, for which we also
provide theoretical proof. Furthermore, since the watermark is intricately
linked with image semantics, it exhibits resilience to lossy processing and
erasure attempts. The watermark can be extracted by Denoising Diffusion
Implicit Models (DDIM) inversion and inverse sampling. We evaluate Gaussian
Shading on multiple versions of Stable Diffusion, and the results demonstrate
that Gaussian Shading not only is performance-lossless but also outperforms
existing methods in terms of robustness.
Comment: 17 pages, 11 figures, accepted by CVPR 202
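The claim that watermark bits can be mapped to latents that still follow a standard Gaussian can be illustrated with a one-bit-per-dimension toy. This is a simplified sketch under assumptions, not the Gaussian Shading implementation: it assumes the bits are already uniformly random (for example after encryption), uses the sign of each latent value to carry one bit, and omits DDIM inversion entirely. The names `embed_bits` and `extract_bits` are hypothetical.

```python
import math
import random


def embed_bits(bits, rng):
    """Sample one latent value per bit: positive half for 1, negative half for 0.

    If the bits are uniform, magnitude |g| with a random sign is distributed
    as a standard Gaussian, so the latent is statistically unchanged.
    """
    return [
        math.copysign(abs(rng.gauss(0.0, 1.0)), 1.0 if bit else -1.0)
        for bit in bits
    ]


def extract_bits(latent):
    """Recover the bits from the sign of each latent value."""
    return [1 if math.copysign(1.0, z) > 0 else 0 for z in latent]


rng = random.Random(0)  # fixed seed only to make the demo repeatable
message = [1, 0, 1, 1, 0, 0, 1, 0]
latent = embed_bits(message, rng)
recovered = extract_bits(latent)
```

In a real pipeline the latent would be recovered only approximately after image generation and inversion, so redundancy across dimensions would be needed for robust extraction.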
Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect
term, aspect category, opinion term, sentiment polarity) for a given review,
which is the most representative and challenging task in aspect-based sentiment
analysis. A key challenge in the ASQP task is the scarcity of labeled data,
which limits the performance of existing methods. To tackle this issue, we
propose a self-training framework with a pseudo-label scorer, wherein a scorer
assesses the match between reviews and their pseudo-labels, aiming to filter
out mismatches and thereby enhance the effectiveness of self-training. We
highlight two critical aspects to ensure the scorer's effectiveness and
reliability: the quality of the training dataset and its model architecture. To
this end, we create a human-annotated comparison dataset and train a generative
model on it using ranking-based objectives. Extensive experiments on public
ASQP datasets reveal that using our scorer can greatly and consistently improve
the effectiveness of self-training. Moreover, we explore the possibility of
replacing humans with large language models for comparison dataset annotation,
and experiments demonstrate its feasibility. We release our code and data at
https://github.com/HITSZ-HLT/ST-w-Scorer-ABSA.
Comment: Accepted to ACL 2024 Main Conference
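The filtering step described in the abstract above can be sketched as a simple threshold on scorer output. This is an illustration only: the `scorer` callable, the fixed `threshold`, and the function name `filter_pseudo_labels` are assumptions, and the paper's trained generative scorer is not reproduced here.

```python
def filter_pseudo_labels(reviews, pseudo_labels, scorer, threshold):
    """Keep only (review, pseudo-label) pairs that the scorer rates as a good match.

    scorer: any callable (review, label) -> float, higher meaning a better match.
    threshold: minimum score for a pair to be kept for self-training.
    """
    kept = []
    for review, label in zip(reviews, pseudo_labels):
        if scorer(review, label) >= threshold:
            kept.append((review, label))
    return kept
```

The retained pairs would then be added to the labeled data for the next round of self-training, while low-scoring pairs are discarded.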
Infusion: Preventing Customized Text-to-Image Diffusion from Overfitting
Text-to-image (T2I) customization aims to create images that embody specific
visual concepts delineated in textual descriptions. However, existing works
still face a main challenge, concept overfitting. To tackle this challenge, we
first analyze overfitting, categorizing it into concept-agnostic overfitting,
which undermines non-customized concept knowledge, and concept-specific
overfitting, which confines customization to limited modalities, i.e.,
backgrounds, layouts, and styles. To evaluate the degree of overfitting, we further
introduce two metrics, i.e., the Latent Fisher divergence and the Wasserstein metric, to
measure the distribution changes of non-customized and customized concepts,
respectively. Drawing from the analysis, we propose Infusion, a T2I
customization method that enables the learning of target concepts to avoid
being constrained by limited training modalities, while preserving
non-customized knowledge. Infusion achieves this with
remarkable efficiency, requiring a mere 11 KB of trained parameters. Extensive
experiments also demonstrate that our approach outperforms state-of-the-art
methods in both single- and multi-concept customized generation.
Comment: 10 pages
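As background for the Wasserstein metric mentioned in the abstract above, here is a minimal sketch of the 1-Wasserstein distance between two equal-sized one-dimensional samples, which reduces to the mean absolute difference between their sorted values. It does not reproduce how the paper applies the metric to model distributions; the function name `wasserstein_1d` and the sample values are illustrative assumptions.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-sized 1-D empirical samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # The optimal transport plan in 1-D pairs the i-th smallest values.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Shifting every point by one unit gives a distance of one unit.
shift_distance = wasserstein_1d([0, 1, 2], [3, 1, 2])
```

A larger value indicates that the two samples, for instance features before and after customization, have drifted further apart.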
PUMGPT: A Large Vision-Language Model for Product Understanding
E-commerce platforms benefit from accurate product understanding to enhance
user experience and operational efficiency. Traditional methods often focus on
isolated tasks such as attribute extraction or categorization, posing
adaptability issues to evolving tasks and leading to usability challenges with
noisy data from the internet. Current Large Vision Language Models (LVLMs) lack
domain-specific fine-tuning, thus falling short in precision and instruction
following. To address these issues, we introduce PumGPT, the first e-commerce
specialized LVLM designed for multi-modal product understanding tasks. We
collected and curated a dataset of over one million products from AliExpress,
filtering out non-inferable attributes using a universal hallucination
detection framework, resulting in 663k high-quality data samples. PumGPT
focuses on five essential tasks aimed at enhancing workflows for e-commerce
platforms and retailers. We also introduce PumBench, a benchmark to evaluate
product understanding across LVLMs. Our experiments show that PumGPT
outperforms five other open-source LVLMs and GPT-4V in product understanding
tasks. We also conduct extensive analytical experiments to delve deeply into
the superiority of PumGPT, demonstrating the necessity for a specialized model
in the e-commerce domain.
