
    Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations

    Full text link
    Recent text-to-speech models have reached the level of generating natural speech comparable to human speech, but they still have limitations in expressiveness. Existing emotional speech synthesis models offer controllability via interpolated features with scaling parameters in an emotional latent space. However, the latent space these models produce makes continuous emotional intensity difficult to control because features such as emotion and speaker identity are entangled. In this paper, we propose a novel method to control the continuous intensity of emotions using semi-supervised learning. The model learns emotions of intermediate intensity from pseudo-labels generated from phoneme-level sequences of speech information. The embedding space built by the proposed model satisfies a uniform grid geometry with an emotional basis. Experimental results show that the proposed method is superior in controllability and naturalness.
    Comment: Accepted by Interspeech 202
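    The intensity-interpolation idea can be illustrated with a minimal sketch. Assuming neutral and full-intensity emotion embeddings are available, intermediate-intensity training targets can be paired with interpolation weights as pseudo-labels; the function and variable names below are illustrative assumptions, and the paper derives pseudo-labels from phoneme-level speech features rather than raw interpolation:

        import numpy as np

        def make_intensity_pseudo_labels(neutral_emb: np.ndarray,
                                         emotion_emb: np.ndarray,
                                         num_levels: int = 5):
            # Linearly interpolate between a neutral embedding and a
            # full-intensity emotion embedding; each interpolated point
            # is paired with its interpolation weight as an intensity
            # pseudo-label. (Simplified sketch, not the paper's method.)
            intensities = np.linspace(0.0, 1.0, num_levels)
            embeddings = [(1.0 - a) * neutral_emb + a * emotion_emb
                          for a in intensities]
            return list(zip(embeddings, intensities))

        # Usage: build 5 pseudo-labeled points between neutral and "happy".
        neutral = np.zeros(256)
        happy = np.random.randn(256)
        for emb, intensity in make_intensity_pseudo_labels(neutral, happy):
            print(f"intensity={intensity:.2f}, norm={np.linalg.norm(emb):.3f}")

    A uniform spacing of the intensity levels is what gives the embedding space the grid-like geometry the abstract describes.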

    Mondrian: On-Device High-Performance Video Analytics with Compressive Packed Inference

    Full text link
    In this paper, we present Mondrian, an edge system that enables high-performance object detection on high-resolution video streams. Many lightweight models and system optimization techniques have been proposed for resource-constrained devices, but they do not fully utilize the potential of accelerators on dynamic, high-resolution videos. To enable such capability, we devise a novel Compressive Packed Inference that minimizes per-pixel processing costs by selectively determining which pixels need processing and combining them to maximize processing parallelism. In particular, our system quickly extracts ROIs and dynamically shrinks them, reflecting the fast-changing characteristics of objects and scenes. It then intelligently combines the scaled ROIs into large canvases to maximize the utilization of inference accelerators such as GPUs. Evaluation across various datasets, models, and devices shows that Mondrian outperforms state-of-the-art baselines (e.g., input rescaling, ROI extraction, ROI extraction + batching) with 15.0-19.7% higher accuracy, leading to 6.65× higher throughput than frame-wise inference when processing various 1080p video streams. We will release the code after the paper review.
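    The packing step can be sketched as follows. This is a simplified shelf-packing of rescaled ROI crops into a single canvas for one batched detector call; the fixed scale factor, the canvas size, and the detector call are assumptions standing in for Mondrian's dynamic shrink policy and actual components:

        import numpy as np
        import cv2

        def pack_rois(frame, rois, canvas_size=(640, 640), scale=0.5):
            # Rescale ROI crops and shelf-pack them into one canvas.
            # rois: list of (x, y, w, h) boxes in frame coordinates.
            # Returns the canvas plus each crop's placement, so canvas
            # detections can be mapped back to frame coordinates.
            canvas = np.zeros((canvas_size[1], canvas_size[0], 3),
                              dtype=frame.dtype)
            placements = []
            cur_x, cur_y, shelf_h = 0, 0, 0
            for (x, y, w, h) in rois:
                crop = cv2.resize(frame[y:y + h, x:x + w],
                                  (max(1, int(w * scale)),
                                   max(1, int(h * scale))))
                ch, cw = crop.shape[:2]
                if cur_x + cw > canvas_size[0]:   # start a new shelf
                    cur_x, cur_y = 0, cur_y + shelf_h
                    shelf_h = 0
                if cur_y + ch > canvas_size[1]:   # canvas full; stop packing
                    break
                canvas[cur_y:cur_y + ch, cur_x:cur_x + cw] = crop
                placements.append(((x, y, w, h), (cur_x, cur_y, cw, ch)))
                cur_x += cw
                shelf_h = max(shelf_h, ch)
            return canvas, placements

        # One detector call on the packed canvas replaces one call per ROI:
        # canvas, placements = pack_rois(frame, rois)
        # detections = detector(canvas)   # hypothetical detector function

    Running the detector once over the packed canvas is what converts many small, accelerator-underutilizing inferences into a single parallel one.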

    Three-dimensional Segmentation of Trees Through a Flexible Multi-Class Graph Cut Algorithm (MCGC)

    Get PDF
    Developing a robust algorithm for automatic individual tree crown (ITC) detection from airborne laser scanning datasets is important for tracking the responses of trees to anthropogenic change. Such approaches allow the size, growth and mortality of individual trees to be measured, enabling forest carbon stocks and dynamics to be tracked and understood. Many algorithms exist for structurally simple forests, including coniferous forests and plantations. Finding a robust solution for structurally complex, species-rich tropical forests remains a challenge; existing segmentation algorithms often perform less well than simple area-based approaches when estimating plot-level biomass. Here we describe a Multi-Class Graph Cut (MCGC) approach to tree crown delineation. This uses local three-dimensional geometry and density information, alongside knowledge of crown allometries, to segment individual tree crowns from airborne LiDAR point clouds. Our approach robustly identifies trees in the top and intermediate layers of the canopy, but cannot recognise small trees. From these three-dimensional crowns we are able to measure individual tree biomass. Comparing these estimates to those from permanent inventory plots, our algorithm produces robust estimates of hectare-scale carbon density, demonstrating the power of ITC approaches in monitoring forests. The flexibility of our method to incorporate additional dimensions of information, such as spectral reflectance, makes this approach an obvious avenue for future development and extension to other sources of three-dimensional data, such as structure-from-motion datasets.
    Jonathan Williams holds a NERC studentship [NE/N008952/1], a CASE partnership with support from the Royal Society for the Protection of Birds (RSPB). David Coomes was supported by an International Academic Fellowship from the Leverhulme Trust. Carola-Bibiane Schoenlieb was supported by the RISE projects CHiPS and NoMADS, the Cantab Capital Institute for the Mathematics of Information and the Alan Turing Institute. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Quadro P6000 GPU used for this research
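    A greatly simplified illustration of graph-based crown segmentation on a LiDAR point cloud is sketched below: build a k-nearest-neighbour graph over the 3D points, cut edges longer than a distance threshold, and take the connected components as crown candidates. The fixed threshold and the use of plain connected components are stand-ins for the MCGC's multi-class cut with density information and allometric constraints:

        import numpy as np
        from sklearn.neighbors import kneighbors_graph
        from scipy.sparse.csgraph import connected_components

        def segment_crowns(points: np.ndarray, k: int = 10,
                           max_edge: float = 1.5):
            # Assign a crown label to each LiDAR point (simplified sketch).
            # points: (N, 3) array of x, y, z coordinates in metres.
            # Edges of the k-NN graph longer than max_edge are cut; the
            # remaining connected components become crown candidates.
            graph = kneighbors_graph(points, n_neighbors=k, mode="distance")
            graph.data[graph.data > max_edge] = 0.0   # cut long (weak) edges
            graph.eliminate_zeros()
            n_crowns, labels = connected_components(graph, directed=False)
            return n_crowns, labels

        # Usage on a toy cloud of two well-separated "trees":
        cloud = np.vstack([np.random.randn(200, 3),
                           np.random.randn(200, 3) + [20.0, 0.0, 0.0]])
        n, labels = segment_crowns(cloud)
        print(f"{n} crown candidates found")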

    LIVE - Life-Immersive Virtual Environment with Physical Interaction-aware Adaptive Blending

    No full text
    We present LIVE, a system enabling a life-immersive Mixed Reality (MR) experience. Daily MR usage is challenging in that the user's interaction state with physical objects changes continuously over time, while immersion and utility must be supported simultaneously throughout. Because most approaches to blending the virtual and physical worlds are designed for a single interaction state, they are insufficient to support life-immersive MR. We propose an initial design of LIVE that (i) classifies the user's current context into one of three states of interaction with a physical object and (ii) applies the most suitable blending method to balance immersion and utility.
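    The two-step design (classify the interaction state, then pick a blending method) maps naturally onto a strategy table. A minimal sketch follows, with the three state names and the blending strategies as illustrative assumptions, since the abstract does not enumerate them:

        from enum import Enum, auto

        class InteractionState(Enum):
            # Hypothetical interaction states with a physical object.
            NO_INTERACTION = auto()   # object is out of reach / irrelevant
            APPROACHING = auto()      # user is reaching toward the object
            MANIPULATING = auto()     # user is actively handling the object

        # Strategy table: each state selects the blending that best trades
        # off immersion against utility (strategy names are assumptions).
        BLENDING = {
            InteractionState.NO_INTERACTION: "fully virtual overlay",
            InteractionState.APPROACHING: "ghosted physical outline",
            InteractionState.MANIPULATING: "physical passthrough window",
        }

        def select_blending(state: InteractionState) -> str:
            # Step (ii): apply the method suited to the current state.
            return BLENDING[state]

        print(select_blending(InteractionState.APPROACHING))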