Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model
This paper proposes a zero-shot text-to-speech (TTS) method conditioned by a
speech-representation model acquired through self-supervised
learning (SSL). Conventional methods with embedding vectors from x-vector or
global style tokens still have a gap in reproducing the speaker characteristics
of unseen speakers. A novel point of the proposed method is the direct use of
the SSL model to obtain embedding vectors from speech representations trained
with a large amount of data. We also introduce the separate conditioning of
acoustic features and a phoneme duration predictor to obtain the disentangled
embeddings between rhythm-based speaker characteristics and
acoustic-feature-based ones. The disentangled embeddings enable better
reproduction of unseen speakers' characteristics and rhythm transfer
conditioned on different speech samples. Objective and subjective evaluations showed
that the proposed method can synthesize speech with improved speaker similarity and
achieve speech-rhythm transfer.
Comment: 5 pages, 3 figures, Accepted to IEEE ICASSP 2023 workshop
Self-supervision in Audio, Speech and Beyon
Streaming Target-Speaker ASR with Neural Transducer
Although recent advances in deep learning technology have boosted automatic
speech recognition (ASR) performance in the single-talker case, it remains
difficult to recognize multi-talker speech in which many voices overlap. One
conventional approach to tackle this problem is to use a cascade of a speech
separation or target speech extraction front-end with an ASR back-end. However,
the extra computation costs of the front-end module are a critical barrier to
quick response, especially for streaming ASR. In this paper, we propose a
target-speaker ASR (TS-ASR) system that implicitly integrates the target speech
extraction functionality within a streaming end-to-end (E2E) ASR system, namely a
recurrent neural network transducer (RNNT). Our system uses an idea similar to that
adopted for target speech extraction, but implements it directly at the level
of the encoder of RNNT. This allows TS-ASR to be realized without placing extra
computation costs on the front-end. Note that this study differs from prior
studies on E2E TS-ASR in two major respects: we investigate streaming
models and base our study on Conformer models, whereas prior studies used
RNN-based systems and considered only offline processing. We confirm in
experiments that our TS-ASR achieves recognition performance comparable to
conventional cascade systems in the offline setting, while reducing computation
costs and realizing streaming TS-ASR.
Comment: Accepted to Interspeech 202
What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Self-supervised learning (SSL) has attracted increased attention for learning
meaningful speech representations. Speech SSL models, such as WavLM, employ
masked prediction training to encode general-purpose representations. In
contrast, speaker SSL models, exemplified by DINO-based models, adopt
utterance-level training objectives primarily for speaker representation.
Understanding how these models represent information is essential for refining
model efficiency and effectiveness. Unlike the various analyses of speech SSL,
there has been limited investigation into what information speaker SSL captures
and how its representation differs from speech SSL or other fully-supervised
speaker models. This paper addresses these fundamental questions. We explore
the capacity to capture various speech properties by applying SUPERB evaluation
probing tasks to speech and speaker SSL models. We also examine which layers
are predominantly utilized for each task to identify differences in how speech
is represented. Furthermore, we conduct direct comparisons to measure the
similarities between layers within and across models. Our analysis unveils that
1) the capacity to represent content information is somewhat unrelated to
enhanced speaker representation, 2) specific layers of speech SSL models appear to
be partly specialized in capturing linguistic information, and 3) speaker SSL
models tend to disregard linguistic information but exhibit more sophisticated
speaker representation.
Comment: Accepted at ICASSP 202
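The abstract above mentions direct comparisons of similarity between layers within and across models but does not name the measure used. As a minimal sketch, assuming linear centered kernel alignment (CKA) as the similarity measure (an assumption, not necessarily the authors' choice), a layer-by-layer similarity matrix could be computed as follows. The function names `linear_cka` and `layer_similarity_matrix` are hypothetical, and each layer is assumed to be given as a `(frames, dims)` array extracted from the same input frames.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_frames, dim).

    Assumed similarity measure for illustration only; the two matrices
    must describe the same n_frames inputs but may differ in dim.
    """
    # Center every feature dimension over the frames.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def layer_similarity_matrix(layers_a, layers_b):
    """Compare every layer of model A with every layer of model B."""
    return np.array([[linear_cka(a, b) for b in layers_b] for a in layers_a])
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of the features, which is one reason it is a common choice when the compared layers have different dimensionalities.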
Differences in image between FTM and MTF as gender dysphoria : Semi-structured interviews with non-participants
The purpose of this study is to clarify, from the viewpoint of non-participants (people who do not themselves experience gender dysphoria), the process by which they perceive and form an image of gender dysphoria. The interviewees were six working adults in mainland China, three men and three women (mean age 26.8 years, SD = 1.33). Data were collected through semi-structured interviews and analyzed with the Modified Grounded Theory Approach (M-GTA). As a result, four category groups were generated: 【recognition of the name gender dysphoria】, 【overall image of gender dysphoria】, 【image of FTM】, and 【image of MTF】. In addition, 8 categories and 36 concepts were extracted.
Gender Differences of Gender Dysphoria Attitudes among College Students in Japan and China
In Japan, research on attitudes towards gender dysphoria has mainly targeted students in medical settings, and the attitudes of students across a wide range of fields have not been sufficiently examined. In mainland China, research on such attitudes is scarce; moreover, thinking is likely to be dominated by the traditional gender binary, and sufficient knowledge about gender dysphoria may not be provided. The purpose of this study was therefore to compare attitudes towards gender dysphoria between Japanese and Chinese college students. Japanese (92 males, 251 females) and Chinese (103 males, 206 females) college students answered, in a web survey, 56 items measuring attitudes towards gender dysphoria that were collected from previous studies. Regardless of country, males showed higher psychological distance and lower romantic feelings towards people with gender dysphoria than females did. Chinese students showed higher psychological distance and romantic feelings than Japanese students, and had the lowest level of social approval for gender dysphoria. Moreover, the results suggested that positive minority empathy is higher in females than in males.
End-to-End Joint Target and Non-Target Speakers ASR
This paper proposes a novel automatic speech recognition (ASR) system that
can transcribe each speaker's speech while identifying whether the speaker is the
target or a non-target speaker in multi-talker overlapped speech.
Target-speaker ASR systems are a promising way to transcribe only a target
speaker's speech by enrolling the target speaker's information. However, in
conversational ASR applications, transcribing both the target speaker's speech
and that of non-target speakers is often required to understand interactive
information. To naturally consider both target and non-target speakers in a
single ASR model, our idea is to extend autoregressive modeling-based
multi-talker ASR systems to utilize the enrollment speech of the target
speaker. Our proposed ASR is performed by recursively generating both textual
tokens and tokens that represent target or non-target speakers. Our experiments
demonstrate the effectiveness of our proposed method.
Comment: Accepted at Interspeech 202
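To illustrate the idea of a single output stream that mixes textual tokens with tokens marking target or non-target speakers, the following sketch splits such a stream into per-speaker transcripts. It is only an illustration under stated assumptions: the tag strings `<target>` and `<non-target>`, the convention that a tag precedes the text it labels, and the function name `split_by_speaker_tag` are all hypothetical and are not taken from the paper.

```python
# Hypothetical speaker-role tags; the paper's actual token inventory is not
# given in the abstract.
TARGET = "<target>"
NON_TARGET = "<non-target>"
ROLE_NAMES = {TARGET: "target", NON_TARGET: "non-target"}


def split_by_speaker_tag(tokens):
    """Group a tagged token stream into (role, transcript) pairs.

    Assumes each speaker tag precedes the textual tokens it labels.
    """
    transcripts = []
    current_role = None
    current_words = []
    for tok in tokens:
        if tok in ROLE_NAMES:
            # A new tag closes the previous speaker's segment, if any.
            if current_role is not None:
                transcripts.append((current_role, " ".join(current_words)))
            current_role = ROLE_NAMES[tok]
            current_words = []
        elif current_role is None:
            raise ValueError("textual token found before any speaker tag")
        else:
            current_words.append(tok)
    # Flush the final segment.
    if current_role is not None:
        transcripts.append((current_role, " ".join(current_words)))
    return transcripts
```

For example, `split_by_speaker_tag(["<target>", "hello", "there", "<non-target>", "hi"])` returns `[("target", "hello there"), ("non-target", "hi")]`.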
Fermiology of a topological line-nodal compound CaSb2 and its implication to superconductivity: angle-resolved photoemission study
We performed angle-resolved photoemission spectroscopy with a micro-focused
beam on the topological line-nodal compound CaSb2, which undergoes a
superconducting transition with an onset Tc of ~1.8 K, to clarify the Fermi-surface
topology relevant to the occurrence of superconductivity. We found that a
three-dimensional hole pocket at the G point is commonly seen for two types of
single-crystalline samples grown under different conditions. On the
other hand, the carrier-doping level estimated from the position of the
chemical potential was found to be sensitive to the sample fabrication
condition. The cylindrical electron pocket at the Y(C) point predicted by the
calculations is absent in one of the two samples, despite the fact that both
samples commonly show superconductivity with similar Tc's. This suggests a key
role of the three-dimensional hole pocket in the occurrence of
superconductivity, and further points to an intriguing possibility of controlling
the topological nature of superconductivity by carrier tuning in CaSb2.
Comment: 7 pages, 3 figure