1,066 research outputs found

    Efficient Parallel Audio Generation using Group Masked Language Modeling

    We present a fast and high-quality codec language model for parallel audio generation. While SoundStorm, a state-of-the-art parallel audio generation model, speeds up inference relative to autoregressive models, it still suffers from slow inference due to iterative sampling. To resolve this problem, we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel Decoding (G-IPD) for efficient parallel audio generation. Both the training and sampling schemes enable the model to synthesize high-quality audio with a small number of iterations by effectively modeling the group-wise conditional dependencies. In addition, our model employs a cross-attention-based architecture to capture the speaker style of the prompt voice and improve computational efficiency. Experimental results demonstrate that our proposed model outperforms the baselines in prompt-based audio generation. Comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
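
    The idea of decoding one group of tokens per iteration can be illustrated with a toy sketch. The abstract does not say how groups are formed or how the model is called, so the interleaved grouping and the `predict` callback below are assumptions made purely for illustration, not the authors' method.

```python
MASK = None  # placeholder for a token that has not been decoded yet


def group_iterative_decode(length, num_groups, predict):
    """Fill a fully masked sequence, one group of positions per iteration.

    `predict(tokens, positions)` is a hypothetical model call that returns
    one token per requested position, conditioned on the tokens decoded so
    far. Interleaved grouping (position i belongs to group i % num_groups)
    is an assumption of this sketch.
    """
    tokens = [MASK] * length
    for g in range(num_groups):
        positions = [i for i in range(length) if i % num_groups == g]
        # All positions of the current group are predicted in parallel.
        new_tokens = predict(tokens, positions)
        for pos, tok in zip(positions, new_tokens):
            tokens[pos] = tok
    return tokens


def toy_predict(tokens, positions):
    # Stand-in for a codec language model: returns, for every requested
    # position, the number of tokens that were already decoded.
    known = sum(t is not MASK for t in tokens)
    return [known for _ in positions]
```

    With this scheme the number of model calls equals the number of groups rather than the sequence length, which is where the reduction in iterations would come from.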

    SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech

    Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristics of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to use functions that predict the scale and bias parameters of the affine coupling layers from the given speaker embedding vector. In this letter, we improve on the previous speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer, which allows speech synthesis for unseen speakers in a zero-shot manner by leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector during training, enabling the inverse process of denormalizing with a new speaker embedding at inference. The proposed conditioning scheme yields state-of-the-art performance in terms of speech quality and speaker similarity in a ZSM-TTS setting. Comment: Accepted to IEEE Signal Processing Letters.
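
    The normalize-then-denormalize idea described above can be sketched in a few lines. This is a minimal illustration, not the SNAC layer itself: the linear predictors `w_mu` and `w_log_sigma`, and the per-channel mean and scale parameterization, are assumptions introduced here.

```python
import math


def predict_stats(spk_emb, w_mu, w_log_sigma):
    """Map a speaker embedding to per-channel mean and scale.

    `w_mu` and `w_log_sigma` are hypothetical weight matrices with one row
    per feature channel; the exponential keeps the scale positive.
    """
    mu = [sum(e * w for e, w in zip(spk_emb, row)) for row in w_mu]
    sigma = [math.exp(sum(e * w for e, w in zip(spk_emb, row)))
             for row in w_log_sigma]
    return mu, sigma


def normalize(frames, spk_emb, w_mu, w_log_sigma):
    """Remove speaker-dependent statistics from each frame."""
    mu, sigma = predict_stats(spk_emb, w_mu, w_log_sigma)
    return [[(x - m) / s for x, m, s in zip(frame, mu, sigma)]
            for frame in frames]


def denormalize(frames, spk_emb, w_mu, w_log_sigma):
    """Inverse of `normalize`: re-apply statistics of a chosen speaker."""
    mu, sigma = predict_stats(spk_emb, w_mu, w_log_sigma)
    return [[x * s + m for x, m, s in zip(frame, mu, sigma)]
            for frame in frames]
```

    Denormalizing with the same embedding recovers the input exactly, while denormalizing with a different embedding re-applies another speaker's statistics, which is the property the abstract relies on for unseen speakers.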

    Impact of sensory modality and tempo in motor timing

    Background: Accurate motor timing requires the coordinated control of actions in response to external stimuli. Over the past few years, several studies have investigated the effect of sensory input on motor timing; however, the evidence remains conflicting. The purpose of this study was to examine the impact of sensory modality and tempo on the accuracy of timed movements and to explore strategies for enhancing motor timing. Methods: Participants (n = 30) performed synchronization and adaptation circle drawing tasks in virtual reality. In Experiment 1, participants synchronized circle drawing with repeated stimuli that varied in sensory modality (auditory, visual, tactile, audio-visual, audio-tactile, and visual-tactile) and tempo (20, 30, and 60 bpm). In Experiment 2, we examined timing adaptation in circle drawing tasks under unexpected tempo changes, either increases or decreases. Results: A significant interaction effect between modality and tempo was observed in the comparison of timing accuracy. Tactile stimuli exhibited significantly higher timing accuracy at 60 bpm, whereas auditory stimuli showed peak accuracy at 30 bpm. The analysis revealed a significantly larger timing error when adapting to changes in the tempo-down condition than in the tempo-up condition. Discussion: Experiment 1 showed that sensory modality affects motor timing differently depending on the tempo, with the tactile modality being effective at a faster tempo and the auditory modality being beneficial at a moderate tempo. Experiment 2 revealed that adapting to changes by correcting timing errors is more challenging with decreasing tempo than with increasing tempo. Our findings suggest that motor timing is intricately influenced by sensory modality and tempo variation. Therefore, enhancing motor timing requires a comprehensive understanding of these factors and their applications.

    High-Performance PVC Gel for Adaptive Micro-Lenses with Variable Focal Length

    This paper presents a bio-inspired adaptive micro-lens with electrically tunable focus made of non-ionic high-molecular-weight polyvinyl chloride (PVC) gel. The optical device mimics the design of the crystalline lens and ciliary muscle of the human eye. It consists of a plano-convex PVC gel micro-lens on indium tin oxide (ITO) glass, confined by an annular electrode that operates as an artificial ciliary muscle. Upon electrical activation, the electroactive adhesive force of the PVC gel is exerted on the annular anode electrode, which reduces the sagittal height of the plano-convex PVC gel lens and thereby changes the focal length of the micro-lens. The focal length increases from 3.8 mm to 22.3 mm as the applied field is varied from 200 V/mm to 800 V/mm, a range comparable to that of the human lens. The device combines excellent optical characteristics with structural simplicity, fast response speed, silent operation, and low power consumption. The results suggest that the PVC gel micro-lens could open up new perspectives on practical tunable optics.
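
    Why a smaller sagittal height gives a longer focal length follows from standard thin-lens geometry. The sketch below assumes an ideal spherical-cap, thin plano-convex lens; the aperture radius and refractive index are illustrative parameters, since the abstract reports neither.

```python
def radius_of_curvature_mm(aperture_radius_mm, sag_height_mm):
    """Radius of a spherical cap from its base radius a and height h:
    R = (a^2 + h^2) / (2 h)."""
    return (aperture_radius_mm ** 2 + sag_height_mm ** 2) / (2.0 * sag_height_mm)


def focal_length_mm(aperture_radius_mm, sag_height_mm, refractive_index):
    """Thin plano-convex lens in air: f = R / (n - 1).

    The refractive index is an assumed input, not a value from the paper.
    """
    r = radius_of_curvature_mm(aperture_radius_mm, sag_height_mm)
    return r / (refractive_index - 1.0)
```

    For a fixed aperture, flattening the lens (lower sagittal height) increases the radius of curvature and therefore the focal length, which matches the direction of the reported change.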

    Feature Re-calibration based Multiple Instance Learning for Whole Slide Image Classification

    Whole slide image (WSI) classification is a fundamental task for the diagnosis and treatment of diseases, but curating accurate labels is time-consuming and limits the application of fully supervised methods. To address this, multiple instance learning (MIL) is a popular method that poses classification as a weakly supervised learning task with slide-level labels only. While current MIL methods apply variants of the attention mechanism to re-weight instance features with stronger models, scant attention is paid to the properties of the data distribution. In this work, we propose to re-calibrate the distribution of a WSI bag (instances) by using the statistics of the max-instance (critical) feature. We assume that in binary MIL, positive bags have larger feature magnitudes than negative bags; thus, we can train the model to maximize the discrepancy between bags with a metric feature loss that models positive bags as out-of-distribution. To achieve this, unlike existing MIL methods that use single-batch training modes, we propose balanced-batch sampling so that the feature loss sees positive and negative bags simultaneously. Further, we employ a position encoding module (PEM) to model spatial/morphological information, and perform pooling by multi-head self-attention (PSMA) with a Transformer encoder. Experimental results on existing benchmark datasets show that our approach is effective and improves over state-of-the-art MIL methods. Comment: MICCAI 202
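
    Balanced-batch sampling can be sketched as follows. The abstract only states that positive and negative bags are used together, so the equal pairing per batch, the function name, and the seed handling here are assumptions for illustration.

```python
import random


def balanced_batches(labels, pairs_per_batch, seed=0):
    """Group bag indices into batches holding equally many positive (1)
    and negative (0) bags. Surplus bags of the larger class are dropped.

    This pairing scheme is a hypothetical reading of "balanced-batch
    sampling", not the authors' exact procedure.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_batches = min(len(pos), len(neg)) // pairs_per_batch
    batches = []
    for b in range(n_batches):
        lo, hi = b * pairs_per_batch, (b + 1) * pairs_per_batch
        batches.append(pos[lo:hi] + neg[lo:hi])
    return batches
```

    Each batch then contains both classes, so a loss that contrasts positive and negative bag features always has something to compare.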

    Clinical Efficacy of Primary Tumor Volume Measurements: Comparison of Different Primary Sites

    Objectives: The purpose of this study was to determine the clinical efficacy of primary tumor volume measurements for different primary sites, comparing the oropharynx with the oral cavity. Methods: A retrospective analysis was performed on 85 patients with oral cavity or oropharynx cancer. The tumor area was manually outlined on axial magnetic resonance (MR) series, and the software then calculated the tumor volumes automatically. The primary tumor volumes were subdivided into two groups (≤3,500 mm³ and >3,500 mm³). Results: On univariate analysis, the prognostic indicators were cT and cN for the oral cavity, and age, primary site, cT, cN, and primary tumor volume for the oropharynx. On multivariate analysis, there was no significant prognostic factor for oral cavity cancer, whereas primary site, cN, and primary tumor volume were independent prognostic indicators for oropharynx cancer. Conclusion: Primary tumor volume measurement is a reliable way to stratify outcome and compensates for weak points of the American Joint Committee on Cancer staging system in oropharynx cancer.
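
    A volume estimate from manually outlined axial slices can be sketched as below. The abstract does not describe how the software computes volume, so summing slice areas times a uniform slice spacing is an assumption, and the function names are hypothetical; only the 3,500 mm³ cutoff comes from the text.

```python
THRESHOLD_MM3 = 3500.0  # grouping cutoff stated in the abstract


def tumor_volume_mm3(slice_areas_mm2, slice_spacing_mm):
    """Approximate volume as the sum of outlined areas times the spacing
    between slices (assumed uniform)."""
    return sum(slice_areas_mm2) * slice_spacing_mm


def volume_group(volume_mm3):
    """Assign a volume to one of the two groups used in the study."""
    return "<=3500 mm3" if volume_mm3 <= THRESHOLD_MM3 else ">3500 mm3"
```

    For example, three slices with outlined areas of 100, 200, and 150 mm² at 5 mm spacing give 2,250 mm³, which falls in the smaller-volume group.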