47 research outputs found

    Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

    This paper proposes a zero-shot text-to-speech (TTS) method conditioned by a speech-representation model acquired through self-supervised learning (SSL). Conventional methods that use embedding vectors from x-vectors or global style tokens still fall short in reproducing the speaker characteristics of unseen speakers. The novelty of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data. We also introduce separate conditioning of the acoustic features and the phoneme duration predictor to obtain embeddings that disentangle rhythm-based speaker characteristics from acoustic-feature-based ones. The disentangled embeddings enable better reproduction performance for unseen speakers and rhythm transfer conditioned on different speech. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer. Comment: 5 pages, 3 figures, Accepted to IEEE ICASSP 2023 workshop Self-supervision in Audio, Speech and Beyon

    Streaming Target-Speaker ASR with Neural Transducer

    Although recent advances in deep learning technology have boosted automatic speech recognition (ASR) performance in the single-talker case, it remains difficult to recognize multi-talker speech in which many voices overlap. One conventional approach to this problem is to cascade a speech separation or target speech extraction front-end with an ASR back-end. However, the extra computation cost of the front-end module is a critical barrier to quick response, especially for streaming ASR. In this paper, we propose a target-speaker ASR (TS-ASR) system that implicitly integrates the target speech extraction functionality within a streaming end-to-end (E2E) ASR system, i.e., a recurrent neural network transducer (RNNT). Our system uses an idea similar to that adopted for target speech extraction, but implements it directly at the level of the RNNT encoder. This allows TS-ASR to be realized without the extra computation cost of a front-end. This study differs from prior studies on E2E TS-ASR in two major ways: we investigate streaming models and base our study on Conformer models, whereas prior studies used RNN-based systems and considered only offline processing. We confirm in experiments that our TS-ASR achieves recognition performance comparable to conventional cascade systems in the offline setting, while reducing computation costs and realizing streaming TS-ASR. Comment: Accepted to Interspeech 202

    What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

    Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these models represent information is essential for refining model efficiency and effectiveness. Unlike speech SSL, which has been analyzed extensively, there has been limited investigation into what information speaker SSL captures and how its representation differs from speech SSL or other fully-supervised speaker models. This paper addresses these fundamental questions. We explore the capacity to capture various speech properties by applying SUPERB evaluation probing tasks to speech and speaker SSL models. We also examine which layers are predominantly utilized for each task to identify differences in how speech is represented. Furthermore, we conduct direct comparisons to measure the similarities between layers within and across models. Our analysis reveals that 1) the capacity to represent content information is somewhat unrelated to enhanced speaker representation, 2) specific layers of speech SSL models are partly specialized in capturing linguistic information, and 3) speaker SSL models tend to disregard linguistic information but exhibit more sophisticated speaker representation. Comment: Accepted at ICASSP 202

    Differences in image between FTM and MTF as gender dysphoria : Semi-structured interviews with non-participants

    The purpose of this study is to clarify, from the perspective of non-participants, the process by which they perceive and form an image of gender dysphoria. The interviewees were six working adults in mainland China, three men and three women (mean age 26.8 years, SD = 1.33). Semi-structured interviews were conducted, and the interview data were analyzed with the Modified Grounded Theory Approach (M-GTA). As a result, four category groups were generated: 【recognition of the name gender dysphoria】, 【overall image of gender dysphoria】, 【image of FTM】, and 【image of MTF】. In addition, 8 categories and 36 concepts were extracted.

    Gender Differences of Gender Dysphoria Attitudes among College Students in Japan and China

    In both Japan and mainland China, few studies have assessed attitudes towards gender dysphoria. In Japan, such studies have mainly targeted students in medical settings, and the attitudes of students across a wide range of fields have not been sufficiently examined. In mainland China, attitudes towards gender dysphoria have been little studied, and it is highly likely that traditional gender-binary thinking dominates and that sufficient knowledge about gender dysphoria is not provided. The purpose of this study was therefore to compare attitudes towards gender dysphoria between Japanese college students (92 males, 251 females) and Chinese college students (103 males, 206 females). We collected 56 items measuring attitudes towards gender dysphoria from previous studies and administered a web survey to both groups. In both Japan and mainland China, males showed greater psychological distance and lower romantic feelings towards people with gender dysphoria than females. Chinese students showed greater psychological distance and higher romantic feelings than Japanese students, and the lowest level of social approval for gender dysphoria. Positive minority empathy towards gender dysphoria also differed between males and females.

    End-to-End Joint Target and Non-Target Speakers ASR

    This paper proposes a novel automatic speech recognition (ASR) system that can transcribe each speaker's speech from multi-talker overlapped speech while identifying whether the speaker is the target or a non-target speaker. Target-speaker ASR systems are a promising way to transcribe only a target speaker's speech by enrolling the target speaker's information. However, conversational ASR applications often require transcribing both the target speaker's speech and that of non-target speakers to understand interactive information. To naturally consider both target and non-target speakers in a single ASR model, our idea is to extend autoregressive modeling-based multi-talker ASR systems to utilize the enrollment speech of the target speaker. Our proposed ASR recursively generates both textual tokens and tokens that represent target or non-target speakers. Our experiments demonstrate the effectiveness of our proposed method. Comment: Accepted at Interspeech 202

    Fermiology of a topological line-nodal compound CaSb2 and its implication to superconductivity: angle-resolved photoemission study

    We performed angle-resolved photoemission spectroscopy with a micro-focused beam on the topological line-nodal compound CaSb2, which undergoes a superconducting transition at the onset Tc ~ 1.8 K, to clarify the Fermi-surface topology relevant to the occurrence of superconductivity. We found that a three-dimensional hole pocket at the Γ point is common to two types of single-crystalline samples fabricated under different growth conditions. On the other hand, the carrier-doping level estimated from the position of the chemical potential was found to be sensitive to the sample fabrication condition. The cylindrical electron pocket at the Y(C) point predicted by the calculations is absent in one of the two samples, even though both samples show superconductivity with similar Tc's. This suggests a key role of the three-dimensional hole pocket in the occurrence of superconductivity, and further points to an intriguing possibility of controlling the topological nature of superconductivity by carrier tuning in CaSb2. Comment: 7 pages, 3 figure