67 research outputs found

    Probabilistic Models of Speech Quality

    Tokyo Denki University 202

    AMBIQUAL – a Full Reference Objective Quality Metric for Ambisonic Spatial Audio

    Streaming spatial audio over networks requires efficient encoding techniques that compress the raw audio content without compromising quality of experience. Streaming service providers such as YouTube need a perceptually relevant objective audio quality metric to monitor users' perceived quality and spatial localization accuracy. In this paper we introduce AMBIQUAL, a full reference objective spatial audio quality metric that assesses both Listening Quality and Localization Accuracy. In our solution both metrics are derived directly from the B-format Ambisonic audio. The metric extends and adapts the algorithm used in ViSQOLAudio, a full reference objective metric designed for assessing speech and audio quality. In particular, Listening Quality is derived from the omnidirectional channel, and Localization Accuracy is derived from a weighted sum of similarities across the B-format directional channels. This paper evaluates whether AMBIQUAL can predict both factors by comparing its predictions with results from MUSHRA subjective listening tests. Specifically, we evaluated the Listening Quality and Localization Accuracy of First- and Third-Order Ambisonic audio compressed with the Opus 1.2 codec at various bitrates (32, 128, 256, and 512 kbps). The sample set for the tests comprised both recorded and synthetic audio clips with a wide range of time-frequency characteristics. To evaluate the Localization Accuracy of compressed audio, a number of fixed and dynamic (moving vertically and horizontally) source positions were selected for the test samples. Results showed a strong correlation between objective quality scores derived from the B-format Ambisonic audio using AMBIQUAL and subjective scores obtained during MUSHRA listening tests (PCC = 0.919, Spearman = 0.882 for Listening Quality; PCC = 0.854, Spearman = 0.842 for Localization Accuracy). AMBIQUAL displays very promising quality assessment predictions for spatial audio. Future work will optimise the algorithm and validate that it generalises to any Higher-Order Ambisonic format.
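    The abstract above pins down AMBIQUAL's core structure: Listening Quality comes from the omnidirectional (W) channel alone, while Localization Accuracy is a weighted sum of similarities over the directional (X, Y, Z) B-format channels. The sketch below illustrates only that structure; the placeholder spectral similarity, the uniform weights, and all function names are assumptions standing in for the NSIM-based patch comparison AMBIQUAL inherits from ViSQOLAudio.

    ```python
    # Minimal sketch of the AMBIQUAL structure described above. The similarity
    # measure here is a simple spectral stand-in for ViSQOLAudio's NSIM, and
    # the directional weights are illustrative, not the paper's values.
    import numpy as np

    def channel_similarity(ref: np.ndarray, deg: np.ndarray) -> float:
        """Placeholder similarity in [0, 1] between reference and degraded
        channels, based on magnitude spectra (stand-in for NSIM)."""
        n = min(len(ref), len(deg))
        ref_spec = np.abs(np.fft.rfft(ref[:n]))
        deg_spec = np.abs(np.fft.rfft(deg[:n]))
        num = 2.0 * np.dot(ref_spec, deg_spec) + 1e-9
        den = np.dot(ref_spec, ref_spec) + np.dot(deg_spec, deg_spec) + 1e-9
        return float(num / den)

    def ambiqual_like_scores(ref_bformat, deg_bformat, weights=(1/3, 1/3, 1/3)):
        """ref_bformat/deg_bformat: dicts of first-order B-format channels
        {'W': ..., 'X': ..., 'Y': ..., 'Z': ...} as 1-D float arrays.
        Returns (listening_quality, localization_accuracy)."""
        # Listening Quality: similarity of the omnidirectional (W) channel only.
        lq = channel_similarity(ref_bformat['W'], deg_bformat['W'])
        # Localization Accuracy: weighted sum of directional-channel similarities.
        la = sum(w * channel_similarity(ref_bformat[c], deg_bformat[c])
                 for w, c in zip(weights, ('X', 'Y', 'Z')))
        return lq, la
    ```

    Separating the two scores this way mirrors the paper's design choice: degradations that flatten the sound field can hurt the directional channels, and hence Localization Accuracy, even when the W channel, and hence Listening Quality, is largely preserved.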

    Debakarn Koorliny Wangkiny: steady walking and talking using First Nations-led participatory action research methodologies to build relationships

    Aboriginal participatory action research (APAR) has an ethical focus that corrects the imbalances of colonisation through participation and shared decision-making, positioning people, place, and intention at the centre of research. APAR supports researchers to respond to the community's local rhythms and culture. First Nations scholars and their allies do this in a way that decolonises mainstream approaches to research, disrupting its cherished ideals and endeavours. How these knowledges are co-created and translated is also critically scrutinised. We are a team of intercultural researchers working with community and mainstream health service providers to improve service access, responsiveness, and Aboriginal client outcomes. Our article begins with an overview of the APAR literature and pays homage to the decolonising scholarship that champions Aboriginal ways of knowing, being, and doing. We present a research program in which Aboriginal Elders, as cultural guides, hold the research through storying and cultural experiences that have deepened relationships between services and the local Aboriginal community. We conclude with the implications of a community-led engagement framework underpinned by a relational methodology that reflects the nuances of knowledge translation through the co-creation of new knowledge and knowledge exchange.

    A Comparison of Deep Learning MOS Predictors for Speech Synthesis Quality

    This paper presents a comparison of deep learning-based techniques for the MOS prediction task on synthesised speech in the Interspeech VoiceMOS Challenge. Using the data from the main track of the VoiceMOS Challenge, we explore both existing predictors and propose new ones. We evaluate two groups of models: NISQA-based models and techniques based on fine-tuning the self-supervised learning (SSL) model wav2vec2_base. Our findings show that a simplified version of NISQA with 40% fewer parameters achieves results close to the original NISQA architecture in both utterance-level and system-level performance. Pre-training NISQA with the NISQA corpus improves utterance-level performance but shows no benefit at the system level. The NISQA-based models also perform close to LDNet and MOSANet, two of the three challenge baselines. Fine-tuning wav2vec2_base yields better performance than the NISQA-based models. We explore the mismatch between natural and synthetic speech and find that the performance of the SSL model drops consistently when it is fine-tuned on natural speech samples. We show that adding CNN features to the SSL model does not improve the baseline performance. Finally, we show that the system type has an impact on the predictions of the non-SSL models.
    Comment: Submitted to INTERSPEECH 202
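    As a rough illustration of the SSL fine-tuning approach the abstract describes, the sketch below attaches a linear regression head to wav2vec2_base (via the Hugging Face checkpoint facebook/wav2vec2-base, an assumed stand-in for the paper's exact setup) and trains it with an MSE loss on utterance-level MOS labels. The mean pooling over hidden states and all hyperparameters are illustrative choices, not the paper's.

    ```python
    # Minimal sketch (not the paper's exact model) of fine-tuning wav2vec2_base
    # for utterance-level MOS regression: pool the SSL hidden states and train
    # a linear head with an MSE loss against subjective MOS labels.
    import torch
    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class MOSPredictor(nn.Module):
        def __init__(self, ssl_name: str = "facebook/wav2vec2-base"):
            super().__init__()
            self.ssl = Wav2Vec2Model.from_pretrained(ssl_name)
            self.head = nn.Linear(self.ssl.config.hidden_size, 1)

        def forward(self, wav: torch.Tensor) -> torch.Tensor:
            # wav: (batch, samples) of 16 kHz mono audio in [-1, 1]
            hidden = self.ssl(input_values=wav).last_hidden_state  # (B, T, H)
            pooled = hidden.mean(dim=1)                            # utterance-level
            return self.head(pooled).squeeze(-1)                   # predicted MOS

    model = MOSPredictor()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    loss_fn = nn.MSELoss()

    # Dummy batch: two 1-second clips with made-up MOS targets.
    wav = torch.randn(2, 16000)
    target_mos = torch.tensor([3.5, 4.1])

    optimizer.zero_grad()
    loss = loss_fn(model(wav), target_mos)
    loss.backward()
    optimizer.step()
    ```

    Because the whole SSL encoder is updated rather than frozen, a small learning rate is the usual choice; the abstract's finding that fine-tuning on natural speech hurts performance on synthetic speech is a property of the data mismatch, not of this training loop.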