Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in inference
speed and synthesis quality when reconstructing an audible waveform from an
acoustic representation. This study focuses on improving the discriminator to
promote GAN-based vocoders. Most existing time-frequency-representation-based
discriminators are rooted in Short-Time Fourier Transform (STFT), whose
time-frequency resolution in a spectrogram is fixed, making it incompatible
with signals like singing voices that require flexible attention for different
frequency bands. Motivated by that, our study utilizes the Constant-Q Transform
(CQT), which has dynamic resolution across frequencies, contributing to a
better modeling ability in pitch accuracy and harmonic tracking. Specifically,
we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates
on the CQT spectrogram at multiple scales and performs sub-band processing
according to different octaves. Experiments conducted on both speech and
singing voices confirm the effectiveness of our proposed method. Moreover, we
also verified that the CQT-based and the STFT-based discriminators could be
complementary under joint training. Specifically, enhanced by the proposed
MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN can be
boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen
singers
An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in both
inference speed and synthesis quality when reconstructing an audible waveform
from an acoustic representation. This study focuses on improving the
discriminator for GAN-based vocoders. Most existing Time-Frequency
Representation (TFR)-based discriminators are rooted in Short-Time Fourier
Transform (STFT), which has a constant Time-Frequency (TF) resolution,
linearly scaled center frequencies, and a fixed decomposition basis, making it
incompatible with signals like singing voices that require dynamic attention
for different frequency bands and different time intervals. Motivated by that,
we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT)
discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet
Transform (MS-TC-CWT) discriminator. Both CQT and CWT have a dynamic TF
resolution for different frequency bands. Of the two, CQT is better at
modeling pitch information, and CWT is better at modeling
short-time transients. Experiments conducted on both speech and singing voices
confirm the effectiveness of our proposed discriminators. Moreover, the STFT,
CQT, and CWT-based discriminators can be used jointly for better performance.
The proposed discriminators can boost the synthesis quality of various
state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
Comment: arXiv admin note: text overlap with arXiv:2311.1495
Motor Current Based Misalignment Diagnosis on Linear Axes with Short-Time Fourier Transform (STFT)
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds
We study Neural Foley, the automatic generation of high-quality sound effects
synchronizing with videos, enabling an immersive audio-visual experience.
Despite its wide range of applications, existing approaches encounter
limitations when it comes to simultaneously synthesizing high-quality and
video-aligned (i.e., semantically relevant and temporally synchronized) sounds. To
overcome these limitations, we propose FoleyCrafter, a novel framework that
leverages a pre-trained text-to-audio model to ensure high-quality audio
generation. FoleyCrafter comprises two key components: the semantic adapter for
semantic alignment and the temporal controller for precise audio-video
synchronization. The semantic adapter utilizes parallel cross-attention layers
to condition audio generation on video features, producing realistic sound
effects that are semantically relevant to the visual content. Meanwhile, the
temporal controller incorporates an onset detector and a timestamp-based adapter
to achieve precise audio-video alignment. One notable advantage of FoleyCrafter
is its compatibility with text prompts, enabling the use of text descriptions
to achieve controllable and diverse video-to-audio generation according to user
intents. We conduct extensive quantitative and qualitative experiments on
standard benchmarks to verify the effectiveness of FoleyCrafter. Models and
codes are available at https://github.com/open-mmlab/FoleyCrafter.
Comment: Project page: https://foleycrafter.github.io
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
A primary hurdle of autonomous driving in urban environments is understanding
complex and long-tail scenarios, such as challenging road conditions and
delicate human behaviors. We introduce DriveVLM, an autonomous driving system
leveraging Vision-Language Models (VLMs) for enhanced scene understanding and
planning capabilities. DriveVLM integrates a unique combination of reasoning
modules for scene description, scene analysis, and hierarchical planning.
Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy
computational requirements, we propose DriveVLM-Dual, a hybrid system that
synergizes the strengths of DriveVLM with the traditional autonomous driving
pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset
demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and
unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a
production vehicle, verifying it is effective in real-world autonomous driving
environments.
Comment: Project Page: https://tsinghua-mars-lab.github.io/DriveVLM
Absence of metallicity and bias-dependent resistivity in low-carrier-density EuCd2As2
EuCd2As2 was theoretically predicted to be a minimal model of Weyl semimetals
with a single pair of Weyl points in the ferromagnetic state. However, the
heavily p-doped EuCd2As2 crystals in previous experiments prevent direct
identification of the semimetal hypothesis. Here we present a comprehensive
magneto-transport study of high-quality EuCd2As2 crystals with ultralow bulk
carrier density (10^13 cm^-3). In contrast to the general expectation of a Weyl
semimetal phase, EuCd2As2 shows insulating behavior in both antiferromagnetic
and ferromagnetic states as well as surface-dominated conduction from band
bending. Moreover, the application of a dc bias current can dramatically
modulate the resistance by over one order of magnitude, and induce a periodic
resistance oscillation due to the geometric resonance. Such nonlinear transport
results from the highly nonequilibrium state induced by the electric field near
the band edge. Our results suggest an insulating phase in EuCd2As2 and put a
strong constraint on the underlying mechanism of anomalous transport properties
in this system.
Comment: 13 pages, 4 figure
Leveraging Diverse Semantic-based Audio Pretrained Models for Singing Voice Conversion
Singing Voice Conversion (SVC) is a technique that enables any singer to perform any song. To achieve this, it is essential to obtain speaker-agnostic representations from the source audio, which poses a significant challenge. A common solution involves utilizing a semantic-based audio pretrained model as a feature extractor. However, the degree to which the extracted features can meet the SVC requirements remains an open question. This includes their capability to accurately model melody and lyrics, the speaker-independency of their underlying acoustic information, and their robustness for in-the-wild acoustic environments. In this study, we investigate the knowledge within classical semantic-based pretrained models in detail. We discover that the knowledge of different models is diverse and can be complementary for SVC. Based on the above, we design a Singing Voice Conversion framework based on Diverse Semantic-based Feature Fusion (DSFF-SVC). Experimental results demonstrate that DSFF-SVC can be generalized and improve various existing SVC models, particularly in challenging real-world conversion tasks. Our demo website is available at https://diversesemanticsvc.github.io/.
Accepted by IEEE SLT 202
Current-induced domain wall motion in a van der Waals ferromagnet Fe3GeTe2
The manipulation of spin textures by spin currents is of fundamental and technological interest. A particularly interesting system is the 2D van der Waals ferromagnet Fe3GeTe2, in which Néel-type skyrmions have recently been observed. The origin of these chiral spin textures is of considerable interest. Recently, it was proposed that these derive from defects in the structure that lower the symmetry and allow for a bulk vector Dzyaloshinsky-Moriya interaction. Here, we demonstrate current-induced domain wall motion in Fe3GeTe2 flakes, in which the maximum domain wall velocity is an order of magnitude higher than those reported in previous studies. In heterostructures with Pt or W layers on top of the Fe3GeTe2 flakes, domain walls can be moved via a combination of spin transfer and spin-orbit torques. The competition between these torques leads to a change in the direction of domain wall motion with increasing magnitude of the injected current
From Beginner to Expert: Modeling Medical Knowledge into General LLMs
Recently, large language model (LLM) based artificial intelligence (AI)
systems have demonstrated remarkable capabilities in natural language
understanding and generation. However, these models face a significant
challenge when it comes to sensitive applications, such as reasoning over
medical knowledge and answering medical questions in a physician-like manner.
Prior studies attempted to overcome this challenge by increasing the model size
(>100B) to learn more general medical knowledge, while there is still room for
improvement in LLMs with smaller-scale model sizes (<100B). In this work, we
start from a pre-trained general LLM model (AntGLM-10B) and fine-tune it from a
medical beginner towards a medical expert (called AntGLM-Med-10B), which
leverages a 3-stage optimization procedure, i.e., general medical knowledge
injection, medical domain instruction tuning, and specific medical task
adaptation. Our contributions are threefold: (1) We specifically investigate
how to adapt a pre-trained general LLM to the medical domain, especially for a
specific medical task. (2) We collect and construct large-scale medical
datasets for each stage of the optimization process. These datasets encompass
various data types and tasks, such as question-answering, medical reasoning,
multi-choice questions, and medical conversations. (3) Specifically for
multi-choice questions in the medical domain, we propose a novel
Verification-of-Choice approach for prompt engineering, which significantly
enhances the reasoning ability of LLMs. Remarkably, by combining the above
approaches, our AntGLM-Med-10B model can outperform most LLMs on
PubMedQA, including both general and medical LLMs, even when these LLMs have
larger model size.
Comment: Developed by Ant Group for PubMedQA leaderboar
