
    Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image

    Generating a voice from a face image is crucial for developing virtual humans capable of interacting using their unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech conditioned on a face image rather than reference speech. We hypothesize that learning both speaker identity and prosody from a face image poses a significant challenge. To address this issue, our TTS model incorporates both a face encoder and a prosody encoder. The prosody encoder is specifically designed to model prosodic features that a face image alone does not capture, allowing the face encoder to focus solely on capturing speaker identity from the face image. Experimental results demonstrate that Face-StyleSpeech generates more natural speech from a face image than the baselines, even for face images unseen during training. Samples are at our demo page https://face-stylespeech.github.io. Comment: Submitted to ICASSP 202

    ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models

    Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional speech for speakers seen during training, without considering generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Specifically, to enable a zero-shot adaptive TTS model to synthesize emotional speech, we propose domain adversarial learning and guidance methods on the diffusion model. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers. Samples are at https://ZET-Speech.github.io/ZET-Speech-Demo/. Comment: Accepted by INTERSPEECH 202
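
    The abstract mentions guidance methods on the diffusion model without detailing them. As an illustration only, the sketch below assumes a classifier-free-style guidance rule: the denoiser is run with and without the emotion label, and the two noise predictions are combined as eps_uncond + scale * (eps_cond - eps_uncond). The `guided_noise` and `toy_denoiser` functions are hypothetical stand-ins, not ZET-Speech's code.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical denoiser interface: (noisy features, emotion label or None) -> predicted noise.
Denoiser = Callable[[Sequence[float], Optional[str]], List[float]]

def guided_noise(denoiser: Denoiser, x: Sequence[float], emotion: str, scale: float) -> List[float]:
    """Combine unconditioned and emotion-conditioned noise predictions (classifier-free-style guidance)."""
    eps_uncond = denoiser(x, None)    # prediction with the emotion label dropped
    eps_cond = denoiser(x, emotion)   # prediction conditioned on the target emotion
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def toy_denoiser(x: Sequence[float], emotion: Optional[str]) -> List[float]:
    """Toy stand-in for a trained network: shifts its output only when a label is given."""
    offset = 0.5 if emotion is not None else 0.0
    return [v + offset for v in x]

print(guided_noise(toy_denoiser, [0.0, 1.0], "happy", scale=2.0))  # [1.0, 2.0]
```

    In this form, a scale of 1 returns the conditioned prediction, a scale of 0 returns the unconditioned one, and values above 1 push the prediction further toward the target emotion.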

    AD-YOLO: You Look Only Once in Training Multiple Sound Event Localization and Detection

    Sound event localization and detection (SELD) combines the identification of sound events with their corresponding directions of arrival (DOA). Recently, event-oriented track output formats have been adopted to solve this problem; however, they still generalize poorly to real-world problems in environments with unknown polyphony. To address this issue, we propose an angular-distance-based multiple SELD (AD-YOLO), an adaptation of the "You Only Look Once" algorithm for SELD. The AD-YOLO format lets the model learn sound occurrences in a location-sensitive manner by assigning class responsibility to DOA predictions. Hence, the format enables the model to handle polyphony regardless of the number of overlapping sounds. We evaluated AD-YOLO on the DCASE 2020-2022 Challenge Task 3 datasets using four SELD objective metrics. The experimental results show that AD-YOLO achieved outstanding overall performance and was robust in class-homogeneous polyphony environments. Comment: 5 pages, 3 figures, accepted for publication in IEEE ICASSP 202
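
    The abstract states that AD-YOLO assigns class responsibility to DOA predictions by angular distance, but it does not give the exact rule. A minimal sketch, assuming DOAs given as (azimuth, elevation) in degrees and a simple nearest-angle assignment in which each ground-truth event is assigned to the predicted DOA slot closest to it on the unit sphere. All function names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Doa = Tuple[float, float]  # (azimuth, elevation) in degrees; assumed representation

def to_unit_vector(doa: Doa) -> Tuple[float, float, float]:
    """Convert an (azimuth, elevation) pair in degrees to a Cartesian unit vector."""
    az, el = math.radians(doa[0]), math.radians(doa[1])
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def angular_distance_deg(a: Doa, b: Doa) -> float:
    """Great-circle angle between two DOAs, in degrees."""
    dot = sum(x * y for x, y in zip(to_unit_vector(a), to_unit_vector(b)))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding just outside [-1, 1]
    return math.degrees(math.acos(dot))

def assign_responsibility(predicted: Sequence[Doa], targets: Sequence[Tuple[str, Doa]]) -> Dict[int, List[str]]:
    """Map each predicted-DOA slot index to the classes of the targets nearest to it."""
    assignment: Dict[int, List[str]] = {}
    for cls, doa in targets:
        best = min(range(len(predicted)), key=lambda i: angular_distance_deg(predicted[i], doa))
        assignment.setdefault(best, []).append(cls)
    return assignment

slots = [(0.0, 0.0), (90.0, 0.0), (-90.0, 0.0)]
events = [("speech", (80.0, 5.0)), ("speech", (-100.0, 0.0))]
print(assign_responsibility(slots, events))  # {1: ['speech'], 2: ['speech']}
```

    In this toy version, responsibility follows location rather than class, so two events of the same class arriving from different directions land in different slots.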

    Workspace Force/Acceleration Disturbance Observer for Combined Motion/Impedance/Force Control

    Keywords: Disturbance observer, hybrid control, impedance control, motion control, force control.
    I. Introduction
    II. Problem Formulation
        2.1. Manipulator Dynamics
        2.2. Disturbance and Model Uncertainty in Impedance-based Motion Control
        2.3. Limitation of Conventional WSDOB in Impedance-based Motion Control
    III. Robust-Safe Motion Control and Force Control with Workspace Force and Acceleration Disturbance Observer
        3.1. Workspace Force-Acceleration Disturbance Observer for Impedance-based Motion Control
        3.2. Workspace Force-Acceleration Disturbance Observer for Force Tracking Impedance Control
    IV. Performance Verification through Simulation
        4.1. Simulation of Position Tracking Performance and Contact Behavior in Impedance-based Motion Control
        4.2. Simulation of Performance of External Force Response under Model Uncertainty in Impedance-based Motion Control
        4.3. Simulation of Performance of Force Control in Force Tracking Impedance Control
    V. Experimental Verification
        5.1. Experiment Condition
        5.2. Experiment of Performance of Position Tracking Performance and Contact Behavior in Impedance-based Motion Control
        5.3. Experiment of Performance of External Force Response in Impedance-based Motion Control
        5.4. Experiment of Performance of Force Control in Force Tracking Impedance Control
    VI. Conclusion
    Master's thesis (dCollection)

    Small molecule and fragment inhibitors of E. coli LpxA

    This is to obtain approval for LpxA inhibitors in order to write a manuscript.

    TRACER: Extreme Attention Guided Salient Object Tracing Network (Student Abstract)

    Existing studies on salient object detection (SOD) focus on extracting distinct objects with edge features and aggregating multi-level features to improve SOD performance. However, these approaches cannot achieve both performance gains and computational efficiency, which motivated us to study the inefficiencies of existing encoder-decoder structures in order to avoid this trade-off. We propose TRACER, which excludes multi-decoder structures and minimizes learnable-parameter usage by employing attention-guided tracing modules (ATMs), as shown in Fig. 1.

    COMMA: Propagating Complementary Multi-Level Aggregation Network for Polyp Segmentation

    Colonoscopy is an effective method for detecting polyps to prevent colon cancer. Existing studies have achieved satisfactory polyp detection performance by aggregating low-level boundary and high-level region information in convolutional neural networks (CNNs) for precise polyp segmentation in colonoscopy images. However, multi-level aggregation yields limited polyp segmentation performance owing to the distribution discrepancy that arises when integrating representations from different layers. To address this problem, previous studies have employed complementary low- and high-level representations. In contrast to existing methods, we focus on propagating complementary information so that combining the explicit low-level boundary with abstracted high-level representations diminishes the discrepancy. This study proposes COMMA, which propagates complementary multi-level aggregation to reduce distribution discrepancies. COMMA comprises a complementary masking module (CMM) and a boundary propagation module (BPM) as a multi-decoder. The CMM masks low-level boundary noise using the abstracted high-level representation and leverages the masked information at both levels. Similarly, the BPM combines the lowest- and highest-level representations to obtain explicit boundary information and propagates the boundary to the CMMs to improve polyp detection. Based on boundary and complementary representations, the CMMs can discriminate polyps more precisely than prior CMMs. Moreover, we propose a hybrid loss function to mitigate class imbalance and noisy annotations in polyp segmentation. To evaluate COMMA, we conducted experiments on five benchmark datasets using five metrics. The results show that the proposed network outperforms state-of-the-art methods on all datasets. Specifically, COMMA improved mIoU by 0.043 on average across all datasets compared with existing state-of-the-art methods.
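
    The abstract reports mIoU without defining it. A minimal sketch, assuming one common convention in polyp segmentation: the IoU of the polyp (foreground) class is computed per image on binary masks and then averaged over images. The function names are illustrative, and the paper's exact protocol (for example, how soft predictions are thresholded) may differ.

```python
from typing import Sequence

def iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Foreground IoU of two flattened binary masks (1 = polyp, 0 = background)."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Convention (assumed): two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union

def mean_iou(preds: Sequence[Sequence[int]], gts: Sequence[Sequence[int]]) -> float:
    """Average the per-image foreground IoU over a dataset."""
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)

preds = [[1, 1, 0, 0], [0, 1, 1, 0]]
gts = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(mean_iou(preds, gts))  # 0.75
```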

    The influences of encircling gill net fishery on fish organisms


    An approach to query decomposition for reader level filtering in RFID middleware

    In RFID systems, middleware filters the enormous stream of data gathered continuously from readers in order to process application requests. The high volume of data often leaves the middleware heavily overloaded. Readers are now becoming smart and provide filtering functionality, which can be used to reduce both the data volume and the middleware workload. However, if the middleware dispatches query conditions to readers without any adjustment, the readers may generate a huge amount of duplicate data, which imposes considerable load on the middleware. An appropriate data-volume reduction scheme is therefore required. In this paper, we propose a query decomposition technique that divides queries into sub-queries for middleware-level and reader-level execution. This approach to query execution resolves the problem of duplicate data generation. Our experiments show that the proposed approach considerably improves middleware performance by reducing both the query processing time and the network traffic between readers and middleware.
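
    The abstract does not spell out the decomposition rule, so the sketch below only illustrates the idea under assumed, hypothetical query conditions: each query has a reader-level condition (tag-ID prefixes) and a middleware-level condition (a minimum signal strength). Dispatching every query's prefixes to the reader separately makes the reader report a tag once per matching query; merging the reader-level conditions into one filter forwards each reading once, and the middleware then evaluates the remaining per-query conditions. The reading format, `Query` fields, and function names are all illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set, Tuple

Reading = Tuple[str, float]  # (tag ID, RSSI in dBm); hypothetical reading format

@dataclass
class Query:
    name: str
    tag_prefixes: Set[str]           # reader-level sub-query: tag-ID prefix match
    min_rssi: float = float("-inf")  # middleware-level sub-query: signal-strength check

def matches_prefix(tag_id: str, prefixes: Set[str]) -> bool:
    return any(tag_id.startswith(p) for p in prefixes)

def naive_forwarded(readings: Sequence[Reading], queries: Sequence[Query]) -> int:
    """Readings sent when each query's prefixes are dispatched to the reader separately."""
    return sum(1 for q in queries for tag_id, _ in readings if matches_prefix(tag_id, q.tag_prefixes))

def reader_level(readings: Sequence[Reading], queries: Sequence[Query]) -> List[Reading]:
    """Reader side: one merged prefix filter, so each reading is forwarded at most once."""
    merged: Set[str] = set().union(*(q.tag_prefixes for q in queries))
    return [r for r in readings if matches_prefix(r[0], merged)]

def middleware_level(forwarded: Sequence[Reading], queries: Sequence[Query]) -> Dict[str, List[str]]:
    """Middleware side: route each forwarded reading to every query whose full condition holds."""
    results: Dict[str, List[str]] = {q.name: [] for q in queries}
    for tag_id, rssi in forwarded:
        for q in queries:
            if matches_prefix(tag_id, q.tag_prefixes) and rssi >= q.min_rssi:
                results[q.name].append(tag_id)
    return results

queries = [Query("Q1", {"30A"}), Query("Q2", {"30A1", "40"}, min_rssi=-60.0)]
readings = [("30A1-001", -50.0), ("30A2-002", -70.0), ("40FF-003", -55.0), ("50AA-004", -40.0)]
forwarded = reader_level(readings, queries)
print(naive_forwarded(readings, queries), len(forwarded))  # 4 3
print(middleware_level(forwarded, queries))
# {'Q1': ['30A1-001', '30A2-002'], 'Q2': ['30A1-001', '40FF-003']}
```

    In this toy model, the saving equals the number of extra copies caused by overlapping reader-level conditions, so it grows as queries share more tag prefixes.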