BiHRNet: A Binary high-resolution network for Human Pose Estimation
Human Pose Estimation (HPE) plays a crucial role in computer vision
applications. However, it is difficult to deploy state-of-the-art models on
resource-limited devices due to the high computational costs of the networks. In
this work, a binary human pose estimator named BiHRNet (Binary HRNet) is
proposed, whose weights and activations are expressed as ±1. BiHRNet
retains the keypoint extraction ability of HRNet, while using fewer computing
resources by adopting a binary neural network (BNN). In order to reduce the
accuracy drop caused by network binarization, two categories of techniques are
proposed in this work. To optimize the training process of the binary pose
estimator, we propose a new loss function combining KL divergence loss with
AWing loss, which lets the binary network learn a more comprehensive output
distribution from its real-valued counterpart and reduces the information loss
caused by binarization. To design more binarization-friendly structures, we
propose a new information reconstruction bottleneck called IR Bottleneck to
retain more information in the initial stage of the network. In addition, we
also propose a multi-scale basic block called MS-Block for information
retention. Our work lowers computation cost with little drop in precision.
Experimental results demonstrate that BiHRNet achieves a PCKh of 87.9 on the
MPII dataset, which outperforms all binary pose estimation networks. On the
challenging COCO dataset, the proposed method enables the binary neural
network to achieve 70.8 mAP, which is better than most tested lightweight
full-precision networks.
Comment: 12 pages, 6 figures
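The computational saving from ±1 weights and activations can be illustrated with a small sketch. It is not taken from BiHRNet: the function names and sample values below are hypothetical, and it only shows the general idea that a dot product of two ±1 vectors can be computed with XNOR and a bit count instead of multiplications.

```python
def binarize(values):
    """Map each real value to +1 or -1 with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +1/-1 vector into an integer: bit i is set where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    # XNOR marks positions where both signs agree; each agreement adds +1
    # to the dot product and each disagreement adds -1.
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agree - n


# Illustrative values only (assumptions, not data from the paper).
weights = binarize([0.7, -0.2, 0.05, -1.3])
activations = binarize([-0.4, -0.9, 0.3, 0.8])

reference = sum(w * a for w, a in zip(weights, activations))
fast = binary_dot(pack_bits(weights), pack_bits(activations), len(weights))
print(reference, fast)
```

Both values agree because, for vectors of length n with k matching signs, the dot product is k - (n - k) = 2k - n.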
Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in inference
speed and synthesis quality when reconstructing an audible waveform from an
acoustic representation. This study focuses on improving the discriminator to
promote GAN-based vocoders. Most existing time-frequency-representation-based
discriminators are rooted in Short-Time Fourier Transform (STFT), whose
time-frequency resolution in a spectrogram is fixed, making it incompatible
with signals like singing voices that require flexible attention for different
frequency bands. Motivated by that, our study utilizes the Constant-Q Transform
(CQT), which has a dynamic resolution across frequencies, contributing to a
better modeling ability in pitch accuracy and harmonic tracking. Specifically,
we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates
on the CQT spectrogram at multiple scales and performs sub-band processing
according to different octaves. Experiments conducted on both speech and
singing voices confirm the effectiveness of our proposed method. Moreover, we
also verified that the CQT-based and the STFT-based discriminators could be
complementary under joint training. Specifically, enhanced by the proposed
MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN can be
boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen
singers.
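As a rough illustration of why the CQT suits octave-wise sub-band processing, the sketch below computes geometrically spaced center frequencies and groups the bins by octave. It uses the standard CQT definitions, not the authors' discriminator; the function names and the values of `f_min`, `bins_per_octave`, and `n_octaves` are assumptions chosen for the example.

```python
def cqt_center_frequencies(f_min, bins_per_octave, n_octaves):
    """Geometrically spaced center frequencies: f_k = f_min * 2**(k / B)."""
    n_bins = bins_per_octave * n_octaves
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]


def q_factor(bins_per_octave):
    """Ratio of center frequency to bin spacing; identical for every bin."""
    return 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)


def split_by_octave(freqs, bins_per_octave):
    """Group consecutive bins so that each sub-band spans exactly one octave."""
    return [freqs[i:i + bins_per_octave]
            for i in range(0, len(freqs), bins_per_octave)]


# Assumed settings for illustration: 12 bins per octave over 3 octaves.
freqs = cqt_center_frequencies(f_min=32.7, bins_per_octave=12, n_octaves=3)
bands = split_by_octave(freqs, bins_per_octave=12)

for band in bands:
    print(f"{band[0]:.1f} Hz to {band[-1]:.1f} Hz, {len(band)} bins")
print(f"Q = {q_factor(12):.2f}")
```

Because the spacing between bins grows in proportion to the center frequency, low bands get finer frequency resolution than high bands, unlike the fixed bin width of an STFT.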
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in both
inference speed and synthesis quality when reconstructing an audible waveform
from an acoustic representation. This study focuses on improving the
discriminator for GAN-based vocoders. Most existing Time-Frequency
Representation (TFR)-based discriminators are rooted in Short-Time Fourier
Transform (STFT), which has a constant Time-Frequency (TF) resolution,
linearly scaled center frequencies, and a fixed decomposition basis, making it
incompatible with signals like singing voices that require dynamic attention
for different frequency bands and different time intervals. Motivated by that,
we propose a Multi-Scale Sub-Band Constant-Q Transform CQT (MS-SB-CQT)
discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet
Transform CWT (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF
resolution for different frequency bands. In contrast, CQT has a better
modeling ability in pitch information, and CWT has a better modeling ability in
short-time transients. Experiments conducted on both speech and singing voices
confirm the effectiveness of our proposed discriminators. Moreover, the STFT,
CQT, and CWT-based discriminators can be used jointly for better performance.
The proposed discriminators can boost the synthesis quality of various
state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
Comment: arXiv admin note: text overlap with arXiv:2311.1495
SingVisio: Visual Analytics of Diffusion Model for Singing Voice Conversion
In this study, we present SingVisio, an interactive visual analysis system
that aims to explain the diffusion model used in singing voice conversion.
SingVisio provides a visual display of the generation process in diffusion
models, showcasing the step-by-step denoising of the noisy spectrum and its
transformation into a clean spectrum that captures the desired singer's timbre.
The system also facilitates side-by-side comparisons of different conditions,
such as source content, melody, and target timbre, highlighting the impact of
these conditions on the diffusion generation process and resulting conversions.
Through comprehensive evaluations, SingVisio demonstrates its effectiveness in
terms of system design, functionality, explainability, and user-friendliness.
It offers users of various backgrounds valuable learning experiences and
insights into the diffusion model for singing voice conversion.
Mining Dual Emotion for Fake News Detection
Emotion plays an important role in detecting fake news online. When
leveraging emotional signals, the existing methods focus on exploiting the
emotions of news content conveyed by the publishers (i.e., publisher
emotion). However, fake news often evokes high-arousal or activating emotions
in people, so the emotions of news comments aroused in the crowd (i.e., social
emotion) should not be ignored. Furthermore, it remains to be explored whether
there exists a relationship between publisher emotion and social emotion (i.e.,
dual emotion), and how the dual emotion appears in fake news. In this paper, we
verify that dual emotion is distinctive between fake and real news and propose
Dual Emotion Features to represent dual emotion and the relationship between
them for fake news detection. Further, we show that our proposed features
can be easily plugged into existing fake news detectors as an enhancement.
Extensive experiments on three real-world datasets (one in English and the
others in Chinese) show that our proposed feature set: 1) outperforms the
state-of-the-art task-related emotional features; 2) is compatible
with existing fake news detectors and effectively improves the performance of
detecting fake news.
Comment: Accepted by WWW 202
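One way to picture a feature that represents publisher emotion, social emotion, and the relationship between them is sketched below. The abstract does not specify the construction, so everything here is an assumption: the function names are hypothetical, comment emotions are aggregated by a simple mean, and the relationship is modeled as an element-wise difference.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized emotion vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dual_emotion_features(publisher, comments):
    """Concatenate publisher emotion, social emotion, and their gap.

    Assumption: social emotion is the mean of the comment emotion vectors,
    and the relationship between the two is their element-wise difference.
    """
    social = mean_vector(comments)
    gap = [p - s for p, s in zip(publisher, social)]
    return publisher + social + gap


# Hypothetical 3-dimensional emotion scores (e.g. anger, joy, neutral).
publisher_emotion = [0.75, 0.25, 0.0]
comment_emotions = [[0.25, 0.5, 0.25], [0.25, 0.0, 0.75]]

features = dual_emotion_features(publisher_emotion, comment_emotions)
print(features)
```

The resulting vector could be appended to the input of an existing detector, which matches the plug-in usage the abstract describes.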
Zoom Out and Observe: News Environment Perception for Fake News Detection
Fake news detection is crucial for preventing the dissemination of
misinformation on social media. To differentiate fake news from real ones,
existing methods observe the language patterns of the news post and "zoom in"
to verify its content with knowledge sources or check its readers' replies.
However, these methods neglect the information in the external news environment
where a fake news post is created and disseminated. The news environment
represents recent mainstream media opinion and public attention, which is an
important inspiration for fake news fabrication because fake news is often
designed to ride the wave of popular events and catch public attention with
unexpected novel content for greater exposure and spread. To capture the
environmental signals of news posts, we "zoom out" to observe the news
environment and propose the News Environment Perception Framework (NEP). For
each post, we construct its macro and micro news environment from recent
mainstream news. Then we design a popularity-oriented and a novelty-oriented
module to perceive useful signals and further assist final prediction.
Experiments on our newly built datasets show that the NEP can efficiently
improve the performance of basic fake news detectors.
Comment: ACL 2022 Main Conference (Long Paper)
