494 research outputs found

    TripleNet: A Low Computing Power Platform of Low-Parameter Network

    Owing to the strong performance of deep learning in computer vision, convolutional neural network (CNN) architectures have become the main backbone for computer vision tasks. With the widespread use of mobile devices, neural network models designed for platforms with low computing power are gradually receiving attention. This paper proposes TripleNet, a lightweight convolutional neural network improved from HarDNet and ThreshNet that inherits the small memory usage and low power consumption of both models. TripleNet combines three different convolutional layers into a new architecture, which has fewer parameters than either HarDNet or ThreshNet. We verify the design on image classification with the CIFAR-10 and SVHN datasets using HarDNet, ThreshNet, and the proposed TripleNet. Experimental results show that, compared with HarDNet, TripleNet reduces parameters by 66% and increases accuracy by 18%; compared with ThreshNet, it reduces parameters by 37% and increases accuracy by 5%. Comment: 4 pages, 2 figures
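
    A minimal PyTorch sketch of the general idea of stacking three different convolution types in one lightweight block. The specific layer choices (pointwise, depthwise, standard 3x3) and all names are assumptions for illustration, not the authors' actual TripleNet configuration.

# Hypothetical sketch, loosely inspired by the abstract's description of
# combining three different convolutional layers; not the paper's architecture.
import torch
import torch.nn as nn

class TripleConvBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Three different convolution types: cheap channel mixing, cheap
        # spatial mixing, then a standard convolution.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1,
                                   groups=out_ch, bias=False)
        self.standard = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.pointwise(x))
        x = self.act(self.depthwise(x))
        return self.act(self.bn(self.standard(x)))

if __name__ == "__main__":
    block = TripleConvBlock(3, 32)
    out = block(torch.randn(1, 3, 32, 32))  # CIFAR-10-sized input
    print(out.shape)                        # torch.Size([1, 32, 32, 32])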

    Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications

    We consider the task of animating 3D facial geometry from a speech signal. Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set, they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly, the relationship between speech and facial motion is one-to-many, containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper, we identify and address key challenges that have so far limited the development of probabilistic models: the lack of datasets and metrics suitable for training and evaluating them, and the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal such as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then, we demonstrate a probabilistic model that achieves both diversity and fidelity to speech, outperforming other methods across the proposed benchmarks. Finally, we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips, and our synthetic meshes can be used to improve the performance of downstream audio-visual models.
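
    The one-to-many claim implies that sampling the model several times on the same audio should yield different but plausible motions. Below is a toy sketch of that sampling interface; the decoder, feature dimensions, and vertex count are hypothetical placeholders, not the paper's model.

# Toy sketch of sampling diverse facial motions for one speech input.
# All module names and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class SpeechToMotionDecoder(nn.Module):
    def __init__(self, audio_dim=128, latent_dim=32, num_vertices=5023):
        super().__init__()
        # Maps concatenated [audio feature, latent code] to per-frame vertex offsets.
        self.net = nn.Sequential(
            nn.Linear(audio_dim + latent_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_vertices * 3),
        )
        self.latent_dim = latent_dim

    def sample(self, audio_feats: torch.Tensor, num_samples: int = 3):
        # audio_feats: (T, audio_dim) frame-level speech features.
        T = audio_feats.shape[0]
        motions = []
        for _ in range(num_samples):
            # One latent "style" code per clip, shared across frames.
            z = torch.randn(1, self.latent_dim).expand(T, -1)
            motions.append(self.net(torch.cat([audio_feats, z], dim=-1)))
        return torch.stack(motions)  # (num_samples, T, num_vertices * 3)

decoder = SpeechToMotionDecoder()
samples = decoder.sample(torch.randn(100, 128), num_samples=3)
print(samples.shape)  # torch.Size([3, 100, 15069])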

    A social networking approach for mobile innovation in emerging countries

    Thesis (S.M. in Engineering and Management)--Massachusetts Institute of Technology, Engineering Systems Division, System Design and Management Program, February 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 121-122). Addressing global challenges, the MIT NextLab course engages students, industry partners, entrepreneurs, and the next billion mobile subscribers to develop innovative mobile services that improve the quality of life in emerging countries. In three years, NextLab teams developed and deployed 29 projects in 14 countries, and five teams founded their own ventures after perceiving strong demand from the vast number of mobile users in the developing world. However, the number and scale of NextLab projects are limited by the schedule and location of an academic course. The focus of this thesis is to research and develop a social networking platform that replicates the success of the NextLab course and reaches out to more participants around the world. In this document, I use a social analysis framework to identify the social processes among stakeholders in a typical NextLab project, specify possible social failures, and research possible solutions. I also review the NextLab projects from 2008 and 2009 and develop the NextLab Project Development Process (NLPDP), which highlights the 12 critical stages of a NextLab project. Finally, I propose the NextLab 2.0 Community, which integrates the social networking solutions and the NextLab Project Development Process. A case study of the mobile logistics (m-Logistics) project demonstrates how the proposed solution facilitates collaboration and communication for a large, cross-country mobile innovation project. A number of recommendations are also discussed for further research. by Jen-Hao Yang. S.M. in Engineering and Management

    Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis

    Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content. Comment: ICASSP 202
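
    A rough sketch of the content-based data-selection idea the abstract points to: rank candidate sentences by overlap with the target speaker's vocabulary, then synthesize them with a personalized TTS voice for ASR fine-tuning. All names here (select_texts, personal_tts) are hypothetical stand-ins, not the paper's components.

# Hypothetical sketch of content-driven data selection for ASR personalization.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    audio: bytes  # waveform placeholder

def select_texts(candidate_texts, target_vocab, top_k=100):
    """Rank candidate sentences by overlap with the target speaker's vocabulary,
    reflecting the finding that text content matters more than speaking style."""
    def overlap(t):
        words = set(t.lower().split())
        return len(words & target_vocab) / max(len(words), 1)
    return sorted(candidate_texts, key=overlap, reverse=True)[:top_k]

def build_personalized_set(candidate_texts, target_vocab, personal_tts):
    """Synthesize the selected texts with a speaker-adapted TTS voice."""
    selected = select_texts(candidate_texts, target_vocab)
    return [Utterance(text=t, audio=personal_tts(t)) for t in selected]

# Usage with dummy stand-ins (a real setup would call a controllable TTS system
# and feed the resulting utterances into ASR fine-tuning):
fake_tts = lambda text: b""
data = build_personalized_set(
    ["please schedule my physiotherapy appointment", "the weather is nice"],
    target_vocab={"physiotherapy", "appointment", "schedule"},
    personal_tts=fake_tts,
)
print([u.text for u in data])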