192 research outputs found
LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models
Language model (LM) based audio generation frameworks, e.g., AudioLM, have
recently achieved new state-of-the-art performance in zero-shot audio
generation. In this paper, we explore the feasibility of LMs for zero-shot
voice conversion. An intuitive approach is to follow AudioLM: tokenizing
speech into semantic and acoustic tokens respectively by HuBERT and
SoundStream, and converting source semantic tokens to target acoustic tokens
conditioned on acoustic tokens of the target speaker. However, such an approach
encounters several issues: 1) the linguistic content contained in semantic
tokens may get dispersed during multi-layer modeling while the lengthy speech
input in the voice conversion task makes contextual learning even harder; 2)
the semantic tokens still contain speaker-related information, which may be
leaked to the target speech, lowering the target speaker similarity; 3) the
generation diversity in the sampling of the LM can produce unexpected outcomes
during inference, causing unnatural pronunciation and speech quality
degradation. To mitigate these problems, we propose LM-VC, a two-stage language
modeling approach that generates coarse acoustic tokens for recovering the
source linguistic content and target speaker's timbre, and then reconstructs
the fine acoustic details as converted speech. Specifically, to enhance
content preservation and facilitate better disentanglement, a masked prefix LM
with a mask prediction strategy is used for coarse acoustic modeling. This
model is encouraged to recover the masked content from the surrounding context
and generate target speech based on the target speaker's utterance and
corrupted semantic tokens. Besides, to further alleviate the sampling error in
the generation, an external LM, which employs window attention to capture the
local acoustic relations, is introduced to participate in the coarse acoustic
modeling
An extensive analysis of the presence of altmetric data for Web of Science publications across subject fields and research topics
Sufficient data presence is one of the key preconditions for applying metrics
in practice. Based on both Altmetric.com data and Mendeley data collected up to
2019, this paper presents a state-of-the-art analysis of the presence of 12
kinds of altmetric events for nearly 12.3 million Web of Science publications
published between 2012 and 2018. Results show that even though an upward trend
of data presence can be observed over time, except for Mendeley readers and
Twitter mentions, the overall presence of most altmetric data is still low. The
majority of altmetric events go to publications in the fields of Biomedical and
Health Sciences, Social Sciences and Humanities, and Life and Earth Sciences.
As to research topics, the level of attention received by research topics
varies across altmetric data, and specific altmetric data show different
preferences for research topics, on the basis of which a framework for
identifying hot research topics is proposed and applied to detect research
topics with higher levels of attention garnered on certain altmetric data
sources. Twitter mentions and policy document citations were selected as two
examples to identify hot research topics of interest to Twitter users and
policy-makers, respectively, shedding light on the potential of altmetric data
in monitoring research trends of specific social attention
Numerical Simulation Analysis of Mechanical Properties on Rock Brittle–Ductility Transformation Under Different Loading Rates
A large number of physical tests and numerical simulations have been carried out to study the effect of confining pressure on the rock deformation mechanism, and some progress has been made; however, the mechanism of rock deformation in actual mine engineering needs further study. For example, rock-burst is actually a unilateral unloading process of the rock mass, and this process cannot be reproduced by physical tests. In this paper, RFPA3D was used to simulate the brittle–ductility transformation of the mechanical properties of rock under different confining pressures. The damage constitutive equation of rock was derived from continuum damage mechanics, and the damage coefficients of different rocks were determined based on the numerical results of stress acoustic emission, thereby verifying the correctness of the rock damage constitutive equation. According to the derived brittle–ductility damage equation and the fitting results of ductility cumulative damage data, the development trend of the rock brittleness stage was almost the same, and extended separation occurred after entering the ductility stage. The larger the Poisson’s ratio, the longer the ductility stage; the smaller the Poisson’s ratio, the shorter the ductility stage but the larger the bearing capacity. At the late loading stage, the ductility cumulative damage of rock showed a linear upward trend, the bearing capacity decreased sharply, rock stability failure occurred, and the ductility damage coefficient increased gradually. Studying the brittle–ductile mechanical properties of rocks can aid rock-burst prediction and prevention in deep mines and is of considerable engineering significance
Delivering Speaking Style in Low-resource Voice Conversion with Multi-factor Constraints
Conveying the linguistic content and maintaining the source speech's speaking
style, such as intonation and emotion, is essential in voice conversion (VC).
However, in a low-resource situation, where only limited utterances from the
target speaker are accessible, existing VC methods struggle to meet this
requirement and capture the target speaker's timbre. In this work, a novel VC
model, referred to as MFC-StyleVC, is proposed for the low-resource VC task.
Specifically, a speaker timbre constraint generated by a clustering method is
proposed to guide target speaker timbre learning in different stages.
Meanwhile, to prevent over-fitting to the target speaker's limited data,
perceptual regularization constraints explicitly maintain model performance on
specific aspects, including speaking style, linguistic content, and speech
quality. Besides, a simulation mode is introduced to simulate the inference
process to alleviate the mismatch between training and inference. Extensive
experiments performed on highly expressive speech demonstrate the superiority
of the proposed method in low-resource VC. Comment: Accepted by ICASSP 202
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling
In addition to conveying the linguistic content from source speech to
converted speech, maintaining the speaking style of source speech also plays an
important role in the voice conversion (VC) task, which is essential in many
scenarios with highly expressive source speech, such as dubbing and data
augmentation. Previous work generally took explicit prosodic features or
fixed-length style embedding extracted from source speech to model the speaking
style of source speech, which is insufficient to achieve comprehensive style
modeling and target speaker timbre preservation. Inspired by the
multi-scale nature of style in human speech, a multi-scale style modeling method for the
VC task, referred to as MSM-VC, is proposed in this paper. MSM-VC models the
speaking style of source speech from different levels. To effectively convey
the speaking style and meanwhile prevent timbre leakage from source speech to
converted speech, each level's style is modeled by a specific representation.
Specifically, prosodic features, bottleneck features from a pre-trained ASR model,
and features extracted by a model trained with a self-supervised strategy are
adopted to model the frame, local, and global-level styles, respectively.
Besides, to balance the performance of source style modeling and target speaker
timbre preservation, an explicit constraint module consisting of a pre-trained
speech emotion recognition model and a speaker classifier is introduced to
MSM-VC. This explicit constraint module also makes it possible to simulate the
style transfer inference process during the training to improve the
disentanglement ability and alleviate the mismatch between training and
inference. Experiments performed on a highly expressive speech corpus
demonstrate that MSM-VC is superior to the state-of-the-art VC methods for
modeling source speech style while maintaining good speech quality and speaker
similarity. Comment: This work was submitted on April 10, 2022 and accepted on August 29,
202
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning
Zero-shot speaker cloning aims to synthesize speech for any target speaker
unseen during TTS system building, given only a single speech reference of the
speaker at hand. Although more practical in real applications, the current
zero-shot methods still produce speech with undesirable naturalness and speaker
similarity. Moreover, endowing the target speaker with arbitrary speaking
styles in the zero-shot setup has not been considered. This is because the
unique challenge of zero-shot speaker and style cloning is to learn the
disentangled speaker and style representations from only short references
representing an arbitrary speaker and an arbitrary style. To address this
challenge, we propose U-Style, which employs Grad-TTS as the backbone,
particularly cascading a speaker-specific encoder and a style-specific encoder
between the text encoder and the diffusion decoder. Thus, leveraging signal
perturbation, U-Style is explicitly decomposed into speaker- and style-specific
modeling parts, achieving better speaker and style disentanglement. To improve
unseen speaker and style modeling ability, these two encoders conduct
multi-level speaker and style modeling by skip-connected U-nets, incorporating
the representation extraction and information reconstruction process. Besides,
to improve the naturalness of synthetic speech, we adopt mean-based instance
normalization and style adaptive layer normalization in these encoders to
perform representation extraction and condition adaptation, respectively.
Experiments show that U-Style significantly surpasses the state-of-the-art
methods in unseen speaker cloning regarding naturalness and speaker similarity.
Notably, U-Style can transfer the style from an unseen source speaker to
another unseen target speaker, achieving flexible combinations of desired
speaker timbre and style in zero-shot voice cloning
Integrated microbiome and metabolomics analysis reveal the relationship between plant-specialized metabolites and microbial community in Phellodendron amurense
Phellodendron amurense is an essential source of bisbenzylisoquinoline alkaloids (BIAs), making it a highly valued raw material in traditional Chinese medicine. The plant’s root secondary metabolism is intricately linked to the microbial communities that surround it. However, the root-associated microbiomes of P. amurense, as well as the potential correlation between its bioactive compounds and these microbiomes, remain poorly understood. Here, the metabolic profiles of the root, rhizosphere soil, and bulk soil of P. amurense revealed dramatic differences in the relative content of plant-specialized metabolites. A total of 31, 21, and 0 specialized metabolites of P. amurense were identified in the root, rhizosphere soil, and bulk soil, respectively, with a higher content of the seven major BIAs observed in the rhizosphere soil than in the bulk soil. The composition of the bulk and rhizosphere microbiomes was noticeably distinct from that of the endospheric microbiome. The phylum Cyanobacteria accounted for over 60% of the root endosphere communities, and α-diversity was lowest in the root. The seven targeted BIAs, namely, berberine, palmatine, magnocurarine, phellodendrine, jatrorrhizine, tetrahydropalmatine, and magnoflorine, were significantly positively correlated with Nectriaceae and Sphingobacteriaceae. This study has illuminated the intricate interaction networks between P. amurense root-associated microorganisms and their key chemical compounds, providing the theoretical foundation for discovering biological fertilizers and laying the groundwork for cultivating high-quality medicinal plants
Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance
Streaming voice conversion (VC) is the task of converting the voice of one
person to another in real-time. Previous streaming VC methods use phonetic
posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems
to represent speaker-independent information. However, PPGs lack the prosody
and vocalization information of the source speaker, and streaming PPGs contain
undesired leaked timbre of the source speaker. In this paper, we propose to use
intermediate bottleneck features (IBFs) to replace PPGs. VC systems trained
with IBFs retain more prosody and vocalization information of the source
speaker. Furthermore, we propose a non-streaming teacher guidance (TG)
framework that addresses the timbre leakage problem. Experiments show that our
proposed IBFs and the TG framework achieve a state-of-the-art streaming VC
naturalness of 3.85, a content consistency of 3.77, and a timbre similarity of
3.77 under a future receptive field of 160 ms, significantly outperforming
previous streaming VC systems. Comment: The paper has been submitted to ICASSP202
The honeysuckle genome provides insight into the molecular mechanism of carotenoid metabolism underlying dynamic flower coloration
Lonicera japonica is a widespread member of the Caprifoliaceae (honeysuckle) family utilized in traditional medical practices. This twining vine honeysuckle is also a much-sought ornamental, in part due to its dynamic flower coloration, which changes from white to gold during development. The molecular mechanism underlying dynamic flower coloration in L. japonica was elucidated by integrating whole genome sequencing, transcriptomic analysis, and biochemical assays. Here, we report a chromosome-level genome assembly of L. japonica, comprising nine pseudo-chromosomes with a total size of 843.2 Mb. We also provide evidence for a whole genome duplication event in the lineage leading to L. japonica, which occurred after its divergence from Dipsacales and Asterales. Moreover, gene expression analysis not only revealed correlated expression of the relevant biosynthetic genes with carotenoid accumulation, but also suggested a role for carotenoid degradation in L. japonica's dynamic flower coloration. The variation of flower color is consistent not only with the observed carotenoid accumulation pattern, but also with the release of volatile apocarotenoids that presumably serve as pollinator attractants. Beyond novel insights into the evolution and dynamics of flower coloration, the high-quality L. japonica genome sequence also provides a foundation for molecular breeding to improve desired characteristics