192 research outputs found

    LM-VC: Zero-shot Voice Conversion via Speech Generation based on Language Models

    Language model (LM) based audio generation frameworks, e.g., AudioLM, have recently achieved new state-of-the-art performance in zero-shot audio generation. In this paper, we explore the feasibility of LMs for zero-shot voice conversion. An intuitive approach is to follow AudioLM: tokenize speech into semantic and acoustic tokens with HuBERT and SoundStream, respectively, and convert source semantic tokens to target acoustic tokens conditioned on acoustic tokens of the target speaker. However, such an approach encounters several issues: 1) the linguistic content contained in semantic tokens may get dispersed during multi-layer modeling, while the lengthy speech input in the voice conversion task makes contextual learning even harder; 2) the semantic tokens still contain speaker-related information, which may leak into the target speech and lower the target speaker similarity; 3) the generation diversity in the sampling of the LM can lead to unexpected outcomes during inference, causing unnatural pronunciation and speech quality degradation. To mitigate these problems, we propose LM-VC, a two-stage language modeling approach that generates coarse acoustic tokens for recovering the source linguistic content and target speaker's timbre, and then reconstructs the fine acoustic details as converted speech. Specifically, to enhance content preservation and facilitate better disentanglement, a masked prefix LM with a mask prediction strategy is used for coarse acoustic modeling. This model is encouraged to recover the masked content from the surrounding context and generate target speech based on the target speaker's utterance and corrupted semantic tokens. Besides, to further alleviate the sampling error in the generation, an external LM, which employs window attention to capture the local acoustic relations, is introduced to participate in the coarse acoustic modeling.

    An extensive analysis of the presence of altmetric data for Web of Science publications across subject fields and research topics

    Sufficient data presence is one of the key preconditions for applying metrics in practice. Based on both Altmetric.com data and Mendeley data collected up to 2019, this paper presents a state-of-the-art analysis of the presence of 12 kinds of altmetric events for nearly 12.3 million Web of Science publications published between 2012 and 2018. Results show that even though an upward trend of data presence can be observed over time, the overall presence of most altmetric data, except for Mendeley readers and Twitter mentions, is still low. The majority of altmetric events go to publications in the fields of Biomedical and Health Sciences, Social Sciences and Humanities, and Life and Earth Sciences. As to research topics, the level of attention they receive varies across altmetric data, and specific altmetric data show different preferences for research topics. On this basis, a framework for identifying hot research topics is proposed and applied to detect research topics that garner higher levels of attention on certain altmetric data sources. Twitter mentions and policy document citations were selected as two examples to identify hot research topics of interest to Twitter users and policy-makers, respectively, shedding light on the potential of altmetric data in monitoring research trends of specific social attention.

    Numerical Simulation Analysis of Mechanical Properties on Rock Brittle–Ductility Transformation Under Different Loading Rates

    At present, a large number of physical tests and numerical simulations have been carried out to study the effect of confining pressure on the rock deformation mechanism, and some progress has been made. However, the mechanism of rock deformation in actual mine engineering needs further study: rock-burst, for example, is actually a unilateral unloading process of the rock mass, and this process cannot be reproduced by physical tests. In this paper, RFPA3D was used to simulate the brittle–ductility transformation mechanical properties of rock under different confining pressures. The damage constitutive equation of rock was derived from continuum damage mechanics, and the damage coefficients of different rocks were determined from the numerical results of stress and acoustic emission, which verified the correctness of the rock damage constitutive equation. According to the derived brittle–ductility damage equation and the fitting results of the ductility cumulative damage data, the development trend in the brittleness stage was almost the same across rocks, and the curves separated after entering the ductility stage. The larger the Poisson’s ratio, the longer the ductility stage; the smaller the Poisson’s ratio, the shorter the ductility stage but the larger the bearing capacity. At the late loading stage, the ductility cumulative damage of the rock showed a linear upward trend, the bearing capacity decreased sharply, stability failure of the rock occurred, and the ductility damage coefficient increased gradually. The study of the brittle–ductile mechanical properties of rocks can aid rock-burst prediction and prevention in deep mines and is of considerable engineering significance.

    Delivering Speaking Style in Low-resource Voice Conversion with Multi-factor Constraints

    Conveying the linguistic content and maintaining the source speech's speaking style, such as intonation and emotion, is essential in voice conversion (VC). However, in a low-resource situation, where only limited utterances from the target speaker are accessible, existing VC methods struggle to meet this requirement and to capture the target speaker's timbre. In this work, a novel VC model, referred to as MFC-StyleVC, is proposed for the low-resource VC task. Specifically, a speaker timbre constraint generated by a clustering method is newly proposed to guide target speaker timbre learning in different stages. Meanwhile, to prevent over-fitting to the target speaker's limited data, perceptual regularization constraints explicitly maintain model performance on specific aspects, including speaking style, linguistic content, and speech quality. Besides, a simulation mode is introduced to simulate the inference process to alleviate the mismatch between training and inference. Extensive experiments performed on highly expressive speech demonstrate the superiority of the proposed method in low-resource VC. Comment: Accepted by ICASSP 202

    MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling

    In addition to conveying the linguistic content from source speech to converted speech, maintaining the speaking style of source speech also plays an important role in the voice conversion (VC) task, which is essential in many scenarios with highly expressive source speech, such as dubbing and data augmentation. Previous work generally took explicit prosodic features or a fixed-length style embedding extracted from source speech to model the speaking style of source speech, which is insufficient to achieve comprehensive style modeling and target speaker timbre preservation. Inspired by the multi-scale nature of style in human speech, a multi-scale style modeling method for the VC task, referred to as MSM-VC, is proposed in this paper. MSM-VC models the speaking style of source speech at different levels. To effectively convey the speaking style while preventing timbre leakage from source speech to converted speech, each level's style is modeled by a specific representation. Specifically, prosodic features, a pre-trained ASR model's bottleneck features, and features extracted by a model trained with a self-supervised strategy are adopted to model the frame, local, and global-level styles, respectively. Besides, to balance the performance of source style modeling and target speaker timbre preservation, an explicit constraint module consisting of a pre-trained speech emotion recognition model and a speaker classifier is introduced to MSM-VC. This explicit constraint module also makes it possible to simulate the style transfer inference process during training to improve the disentanglement ability and alleviate the mismatch between training and inference. Experiments performed on a highly expressive speech corpus demonstrate that MSM-VC is superior to state-of-the-art VC methods for modeling source speech style while maintaining good speech quality and speaker similarity. Comment: This work was submitted on April 10, 2022 and accepted on August 29, 202

    U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning

    Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, current zero-shot methods still produce speech with unsatisfactory naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of zero-shot speaker and style cloning is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose U-Style, which employs Grad-TTS as the backbone, particularly cascading a speaker-specific encoder and a style-specific encoder between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.

    Integrated microbiome and metabolomics analysis reveal the relationship between plant-specialized metabolites and microbial community in Phellodendron amurense

    Phellodendron amurense is an essential source of bisbenzylisoquinoline alkaloids (BIAs), making it a highly valued raw material in traditional Chinese medicine. The plant’s root secondary metabolism is intricately linked to the microbial communities that surround it. However, the root-associated microbiomes of P. amurense, as well as the potential correlation between its bioactive compounds and these microbiomes, remain poorly understood. Here, the metabolic profiles of root, rhizosphere, and bulk soils of P. amurense revealed dramatic differences in the relative content of plant-specialized metabolites. A total of 31, 21, and 0 specialized metabolites of P. amurense were identified in the root, rhizosphere soil, and bulk soil, respectively, with a higher content of the seven major BIAs observed in the rhizosphere than in the bulk soils. The composition of the bulk and rhizosphere microbiomes was noticeably distinct from that of the endospheric microbiome. The phylum Cyanobacteria accounted for over 60% of the root endosphere communities, and the α-diversity in the root was the lowest. The seven targeted BIAs, namely berberine, palmatine, magnocurarine, phellodendrine, jatrorrhizine, tetrahydropalmatine, and magnoflorine, were significantly positively correlated with Nectriaceae and Sphingobacteriaceae. This study has illuminated the intricate interaction networks between P. amurense root-associated microorganisms and their key chemical compounds, providing a theoretical foundation for discovering biological fertilizers and laying the groundwork for cultivating high-quality medicinal plants.

    Streaming Voice Conversion Via Intermediate Bottleneck Features And Non-streaming Teacher Guidance

    Streaming voice conversion (VC) is the task of converting the voice of one person to another in real time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leaked timbre of the source speaker. In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs. VC systems trained with IBFs retain more prosody and vocalization information of the source speaker. Furthermore, we propose a non-streaming teacher guidance (TG) framework that addresses the timbre leakage problem. Experiments show that our proposed IBFs and the TG framework achieve a state-of-the-art streaming VC naturalness of 3.85, a content consistency of 3.77, and a timbre similarity of 3.77 under a future receptive field of 160 ms, significantly outperforming previous streaming VC systems. Comment: The paper has been submitted to ICASSP202

    The honeysuckle genome provides insight into the molecular mechanism of carotenoid metabolism underlying dynamic flower coloration

    Lonicera japonica is a wide-spread member of the Caprifoliaceae (honeysuckle) family utilized in traditional medical practices. This twining vine honeysuckle is also a much-sought ornamental, in part due to its dynamic flower coloration, which changes from white to gold during development. The molecular mechanism underlying dynamic flower coloration in L. japonica was elucidated by integrating whole genome sequencing, transcriptomic analysis, and biochemical assays. Here, we report a chromosome-level genome assembly of L. japonica, comprising nine pseudo-chromosomes with a total size of 843.2 Mb. We also provide evidence for a whole genome duplication event in the lineage leading to L. japonica, which occurred after its divergence from Dipsacales and Asterales. Moreover, gene expression analysis not only revealed correlated expression of the relevant biosynthetic genes with carotenoid accumulation, but also suggested a role for carotenoid degradation in L. japonica's dynamic flower coloration. The variation of flower color is consistent with not only the observed carotenoid accumulation pattern, but also with the release of volatile apocarotenoids that presumably serve as pollinator attractants. Beyond novel insights into the evolution and dynamics of flower coloration, the high-quality L. japonica genome sequence also provides a foundation for molecular breeding to improve desired characteristics