Efficient Parallel Audio Generation using Group Masked Language Modeling
We present a fast and high-quality codec language model for parallel audio
generation. While SoundStorm, a state-of-the-art parallel audio generation
model, accelerates inference speed compared to autoregressive models, it still
suffers from slow inference due to iterative sampling. To resolve this problem,
we propose Group-Masked Language Modeling (G-MLM) and Group Iterative Parallel
Decoding (G-IPD) for efficient parallel audio generation. Both the training and
sampling schemes enable the model to synthesize high-quality audio with a small
number of iterations by effectively modeling the group-wise conditional
dependencies. In addition, our model employs a cross-attention-based
architecture to capture the speaker style of the prompt voice and improves
computational efficiency. Experimental results demonstrate that our proposed
model outperforms the baselines in prompt-based audio generation.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a
speech sample with the voice characteristic of an unseen speaker. The main
challenge of ZSM-TTS is to increase the overall speaker similarity for unseen
speakers. One of the most successful speaker conditioning methods for
flow-based multi-speaker text-to-speech (TTS) models is to utilize the
functions which predict the scale and bias parameters of the affine coupling
layers according to the given speaker embedding vector. In this letter, we
improve on the previous speaker conditioning method by introducing a
speaker-normalized affine coupling (SNAC) layer which allows for unseen speaker
speech synthesis in a zero-shot manner leveraging a normalization-based
conditioning technique. The newly designed coupling layer explicitly normalizes
the input by the parameters predicted from a speaker embedding vector while
training, enabling an inverse process of denormalizing for a new speaker
embedding at inference. The proposed conditioning scheme yields
state-of-the-art performance in terms of speech quality and speaker
similarity in a ZSM-TTS setting.
Comment: Accepted to IEEE Signal Processing Letters.
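The normalize-then-denormalize idea described in this abstract can be illustrated with a toy coupling layer. This is only a minimal sketch under stated assumptions, not the authors' layer: the linear predictors `W_mean` and `W_logs`, the coupling weights `W_shift`, and the sizes `DIM` and `SPK_DIM` are all hypothetical and randomly initialized for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, SPK_DIM = 4, 3  # hypothetical feature and speaker-embedding sizes

# Hypothetical linear predictors of a per-channel mean and log-scale
# from the speaker embedding (an assumption, not the paper's networks).
W_mean = rng.normal(size=(SPK_DIM, DIM // 2))
W_logs = 0.1 * rng.normal(size=(SPK_DIM, DIM // 2))
# Hypothetical speaker-independent coupling weights.
W_shift = rng.normal(size=(DIM // 2, DIM // 2))


def speaker_stats(spk):
    """Predict (mean, scale) from a speaker embedding."""
    return spk @ W_mean, np.exp(spk @ W_logs)


def forward(x, spk):
    """Training direction: normalize by speaker statistics, then shift."""
    xa, xb = np.split(x, 2, axis=-1)
    mean, scale = speaker_stats(spk)
    shift = np.tanh(((xa - mean) / scale) @ W_shift)
    yb = (xb - mean) / scale + shift
    return np.concatenate([xa, yb], axis=-1)


def inverse(y, spk):
    """Inference direction: undo the shift, then denormalize for `spk`."""
    ya, yb = np.split(y, 2, axis=-1)
    mean, scale = speaker_stats(spk)
    shift = np.tanh(((ya - mean) / scale) @ W_shift)
    xb = (yb - shift) * scale + mean
    return np.concatenate([ya, xb], axis=-1)


x = rng.normal(size=DIM)
train_spk = rng.normal(size=SPK_DIM)
new_spk = rng.normal(size=SPK_DIM)

z = forward(x, train_spk)
same_speaker = inverse(z, train_spk)  # recovers x
other_speaker = inverse(z, new_spk)   # denormalized for a different speaker
```

Inverting with the same embedding recovers the input exactly, while inverting with a different embedding rescales and shifts the second half of the features, which is the mechanism the abstract relies on for unseen speakers.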
Impact of sensory modality and tempo in motor timing
Background: Accurate motor timing requires the coordinated control of actions in response to external stimuli. Over the past few years, several studies have investigated the effect of sensory input on motor timing; however, the evidence remains conflicting. The purpose of this study was to examine the impact of sensory modality and tempo on the accuracy of timed movements and explore strategies for enhancing motor timing.
Methods: Participants (n = 30) performed synchronization and adaptation circle drawing tasks in virtual reality. In Experiment 1, participants synchronized circle drawing with repeated stimuli based on sensory modalities (auditory, visual, tactile, audio-visual, audio-tactile, and visual-tactile) and tempos (20, 30, and 60 bpm). In Experiment 2, we examined timing adaptation in circle drawing tasks under conditions of unexpected tempo changes, whether increased or decreased.
Results: A significant interaction effect between modality and tempo was observed in the comparison of timing accuracy. Tactile stimuli exhibited significantly higher timing accuracy at 60 bpm, whereas auditory stimuli demonstrated peak accuracy at 30 bpm. The analysis revealed a significantly larger timing error when adapting to changes in the tempo-down condition compared with the tempo-up condition.
Discussion: Through Experiment 1, we found that sensory modality impacts motor timing differently depending on the tempo, with the tactile modality being effective at a faster tempo and the auditory modality being beneficial at a moderate tempo. Additionally, Experiment 2 revealed that adapting to changes by correcting timing errors is more challenging with decreasing tempo than with increasing tempo. Our findings suggest that motor timing is intricately influenced by sensory modality and tempo variation. Therefore, to enhance motor timing, a comprehensive understanding of these factors and their applications is imperative.
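For orientation, the three tempos in Experiment 1 correspond to stimulus intervals of 3 s, 2 s, and 1 s. The sketch below converts tempo to interval and computes one possible timing-error measure. The mean absolute asynchrony used here is an assumption for illustration; the abstract does not state which error metric the study used, and the function names are hypothetical.

```python
def inter_onset_interval_s(tempo_bpm):
    """Seconds between consecutive stimuli at a given tempo."""
    return 60.0 / tempo_bpm


def mean_absolute_asynchrony_s(event_times_s, tempo_bpm):
    """Mean absolute offset between each movement event and its nearest beat.

    Beats are assumed to fall at integer multiples of the inter-onset
    interval, starting at t = 0 (an illustrative assumption).
    """
    interval = inter_onset_interval_s(tempo_bpm)
    errors = [abs(t - round(t / interval) * interval) for t in event_times_s]
    return sum(errors) / len(errors)


intervals = {bpm: inter_onset_interval_s(bpm) for bpm in (20, 30, 60)}
example_error = mean_absolute_asynchrony_s([0.9, 2.1, 3.0], tempo_bpm=60)
```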
High-Performance PVC Gel for Adaptive Micro-Lenses with Variable Focal Length.
This paper presents a bio-inspired adaptive micro-lens with electrically tunable focus made of non-ionic high-molecular-weight polyvinyl chloride (PVC) gel. The optical device mimics the design of the crystalline lens and ciliary muscle of the human eye. It consists of a plano-convex PVC gel micro-lens on Indium Tin Oxide (ITO) glass, confined with an annular electrode operating as an artificial ciliary muscle. Upon electrical activation, the electroactive adhesive force of the PVC gel is exerted on the annular anode electrode, which reduces the sagittal height of the plano-convex PVC gel lens, resulting in focal length variation of the micro-lens. The focal length increases from 3.8 mm to 22.3 mm as the applied field is varied from 200 V/mm to 800 V/mm, comparable to that of the human lens. The device combines excellent optical characteristics with structural simplicity, fast response speed, silent operation, and low power consumption. The results suggest that the PVC gel micro-lens could open up new perspectives on practical tunable optics.
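Why a lower sagittal height lengthens the focal length can be seen from standard thin-lens geometry. The sketch below assumes a spherical-cap plano-convex lens in air, for which the radius of curvature is R = (a² + h²) / (2h) and the focal length is f = R / (n − 1). The aperture radius, sagittal heights, and refractive index used here are hypothetical values chosen for illustration, not figures from the paper.

```python
def radius_of_curvature_mm(aperture_radius_mm, sag_height_mm):
    """Radius of a spherical cap with base radius a and height h."""
    a, h = aperture_radius_mm, sag_height_mm
    return (a * a + h * h) / (2.0 * h)


def focal_length_mm(aperture_radius_mm, sag_height_mm, refractive_index):
    """Thin plano-convex lens in air: f = R / (n - 1)."""
    radius = radius_of_curvature_mm(aperture_radius_mm, sag_height_mm)
    return radius / (refractive_index - 1.0)


# Hypothetical geometry: 1 mm aperture radius, refractive index 1.5.
f_relaxed = focal_length_mm(1.0, 0.2, 1.5)    # taller lens
f_flattened = focal_length_mm(1.0, 0.1, 1.5)  # sagittal height halved
```

With these illustrative numbers, halving the sagittal height roughly doubles the focal length, the same direction of change the abstract reports as the applied field increases.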
Feature Re-calibration based Multiple Instance Learning for Whole Slide Image Classification
Whole slide image (WSI) classification is a fundamental task for the
diagnosis and treatment of diseases; however, curation of accurate labels is
time-consuming and limits the application of fully-supervised methods. To
address this, multiple instance learning (MIL) is a popular method that poses
classification as a weakly supervised learning task with slide-level labels
only. While current MIL methods apply variants of the attention mechanism to
re-weight instance features with stronger models, scant attention is paid to
the properties of the data distribution. In this work, we propose to
re-calibrate the distribution of a WSI bag (instances) by using the statistics
of the max-instance (critical) feature. We assume that in binary MIL, positive
bags have larger feature magnitudes than negatives; thus, we can force the
model to maximize the discrepancy between bags with a metric feature loss that
models positive bags as out-of-distribution. To achieve this, unlike existing
MIL methods that use single-batch training modes, we propose balanced-batch
sampling to effectively use the feature loss, i.e., positive and negative (+/-) bags are used simultaneously.
Further, we employ a position encoding module (PEM) to model
spatial/morphological information, and perform pooling by multi-head
self-attention (PSMA) with a Transformer encoder. Experimental results on
existing benchmark datasets show our approach is effective and improves over
state-of-the-art MIL methods.
Comment: MICCAI 202
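Two of the ideas in this abstract, re-calibrating a bag with max-instance statistics and balanced-batch sampling, can be sketched in a few lines. This is one plausible reading, not the authors' implementation: choosing the critical instance by largest L2 norm, standardizing with its mean and standard deviation, and pairing one positive with one negative bag per step are all assumptions, and the function names are hypothetical.

```python
import random

import numpy as np


def recalibrate_bag(bag, eps=1e-6):
    """Shift and scale every instance by the critical instance's statistics.

    bag: array of shape (n_instances, feat_dim).
    The critical (max) instance is taken to be the row with the largest
    L2 norm, which is an assumption made for this sketch.
    """
    critical = bag[np.argmax(np.linalg.norm(bag, axis=1))]
    return (bag - critical.mean()) / (critical.std() + eps)


def balanced_batches(labels, seed=0):
    """Yield (positive_index, negative_index) pairs of bag indices.

    Each training step then sees a positive and a negative bag together,
    so a loss contrasting the two classes can be evaluated.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    for p, n in zip(pos, neg):
        yield p, n


bag = np.arange(12, dtype=float).reshape(3, 4)  # toy bag: 3 instances
calibrated = recalibrate_bag(bag)

labels = [1, 0, 0, 1, 0]  # toy slide-level labels
pairs = list(balanced_batches(labels))
```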
Clinical Efficacy of Primary Tumor Volume Measurements: Comparison of Different Primary Sites
Objectives: The purpose of this study was to determine the clinical efficacy of primary tumor volume measurements of different primary sites in the oropharynx compared to the oral cavity.
Methods: A retrospective analysis was performed of 85 patients with oral cavity or oropharynx cancer. The tumor area was manually outlined from axial magnetic resonance (MR) series. The software calculated the tumor volumes automatically. The values of the primary tumor volumes were then subdivided into separate groups (≤3,500 mm³, >3,500 mm³).
Results: On the univariate analysis, the prognostic indicators were the cT and cN (oral cavity) and age, primary site, cT, cN, and primary tumor volume (oropharynx). There was no significant prognostic factor for oral cavity cancer on the multivariate analysis. Primary site, cN, and primary tumor volume were independent prognostic indicators for oropharynx cancer by multivariate analysis.
Conclusion: Primary tumor volume measurement is a reliable way to stratify outcome and to make up for the weak points in the American Joint Committee on Cancer staging system with oropharynx cancer.
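A common way to turn per-slice outlines into a volume is to sum the outlined areas and multiply by the slice spacing. The abstract does not say how the software computed volumes, so the summation below is an assumption, and the areas and spacing are made-up illustrative values. Only the 3,500 mm³ grouping threshold comes from the abstract.

```python
def tumor_volume_mm3(slice_areas_mm2, slice_spacing_mm):
    """Approximate volume as the sum of outlined areas times slice spacing."""
    return sum(slice_areas_mm2) * slice_spacing_mm


def volume_group(volume_mm3, threshold_mm3=3500.0):
    """Assign a volume to one of the two groups used in the study."""
    return "<=3500 mm3" if volume_mm3 <= threshold_mm3 else ">3500 mm3"


# Hypothetical outlines on three axial slices, 5 mm apart.
volume = tumor_volume_mm3([100.0, 200.0, 150.0], slice_spacing_mm=5.0)
group = volume_group(volume)
```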