What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis
Self-supervised learning (SSL) has attracted increased attention for learning
meaningful speech representations. Speech SSL models, such as WavLM, employ
masked prediction training to encode general-purpose representations. In
contrast, speaker SSL models, exemplified by DINO-based models, adopt
utterance-level training objectives primarily for speaker representation.
Understanding how these models represent information is essential for refining
model efficiency and effectiveness. Unlike the various analyses of speech SSL,
there has been limited investigation into what information speaker SSL captures
and how its representation differs from speech SSL or other fully-supervised
speaker models. This paper addresses these fundamental questions. We explore
the capacity to capture various speech properties by applying SUPERB evaluation
probing tasks to speech and speaker SSL models. We also examine which layers
are predominantly utilized for each task to identify differences in how speech
is represented. Furthermore, we conduct direct comparisons to measure the
similarities between layers within and across models. Our analysis unveils that
1) the capacity to represent content information is somewhat unrelated to
enhanced speaker representation, 2) specific layers of speech SSL models would
be partly specialized in capturing linguistic information, and 3) speaker SSL
models tend to disregard linguistic information but exhibit more sophisticated
speaker representation. Comment: Accepted at ICASSP 202
Hierarchical Latent Words Language Models for Robust Modeling to Out-Of Domain Tasks
This paper focuses on language modeling that is robust enough to support tasks in different domains. To this end, we propose a hierarchical latent words language model (h-LWLM). The proposed model can be regarded as a generalized form of the standard LWLM. The key advance is the introduction of a multiple latent variable space with a hierarchical structure. This structure can flexibly take account of linguistic phenomena not present in the training data. This paper details the model definition, a training method based on layer-wise inference, and a practical usage in natural language processing tasks that relies on an approximation technique. Experiments on speech recognition show the effectiveness of h-LWLM in out-of-domain tasks.
CRISPR/Cas9 mediated genome editing in ES cells and its application for chimeric analysis in mice
Oji, A., Noda, T., Fujihara, Y. et al. CRISPR/Cas9 mediated genome editing in ES cells and its application for chimeric analysis in mice. Sci Rep 6, 31666 (2016). https://doi.org/10.1038/srep3166
Spermatozoa lacking Fertilization Influencing Membrane Protein (FIMP) fail to fuse with oocytes in mice
Fujihara, Y., Lu, Y., Noda, T., Oji, A., Larasati, T., Kojima-Kita, K., . . . Ikawa, M. (2020). Spermatozoa lacking fertilization influencing membrane protein (FIMP) fail to fuse with oocytes in mice. Proceedings of the National Academy of Sciences of the United States of America, 117(17), 9393-9400. doi:10.1073/pnas.191706011
Identification of multiple male reproductive tract-specific proteins that regulate sperm migration through the oviduct in mice
Fujihara, Y., Noda, T., Kobayashi, K., Oji, A., Kobayashi, S., Matsumura, T., . . . Ikawa, M. (2019). Identification of multiple male reproductive tract-specific proteins that regulate sperm migration through the oviduct in mice. Proceedings of the National Academy of Sciences of the United States of America, 116(37), 18498-18506. doi:10.1073/pnas.190873611
Noise-Robust Speaker Verification Using F0 Features
This paper proposes a noise-robust speaker verification method augmented by fundamental frequency (F0). The paper first describes a noise-robust F0 extraction method using the Hough transform. Then, it proposes a robust speaker verification method using multi-stream HMMs which fuse the extracted F0 and cepstral features. Experiments are conducted using four-connected-digit utterances of Japanese by 37 male speakers recorded at five sessions over a half-year period. The utterances are contaminated with white noise at various SNR levels. Experimental results show that the F0 features improve the verification performance in all SNR conditions.
An Improved Approximation Algorithm for Wage Determination and Online Task Allocation in Crowd-Sourcing
Crowd-sourcing has attracted much attention due to its growing importance to society, and numerous studies have been conducted on task allocation and wage determination. Recent works have focused on optimizing task allocation and workers' wages simultaneously. However, existing methods do not provide good solutions for real-world crowd-sourcing platforms because of their low approximation ratios or myopic problem settings. We tackle an optimization problem for wage determination and online task allocation in crowd-sourcing and propose a fast 1-1/(k+3)^(1/2)-approximation algorithm, where k is the minimum of the tasks' budgets (numbers of possible assignments). This approximation ratio is greater than or equal to that of the existing method. The proposed method reduces the tackled problem to a non-convex multi-period continuous optimization problem by approximating the objective function. Then, the method transforms the reduced problem into a minimum convex cost flow problem, which is a well-known combinatorial optimization problem, and solves it by the capacity scaling algorithm. Synthetic experiments and simulation experiments using real crowd-sourcing data show that the proposed method solves the problem faster and outputs higher objective values than existing methods.
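To give a feel for the guarantee quoted in the abstract above, the short sketch below evaluates the ratio 1-1/(k+3)^(1/2) for a few values of k. It only computes the stated formula and is not the authors' algorithm; the function name `approximation_ratio`, the sample values of k, and the assumption that k is at least 1 are illustrative choices that do not come from the paper.

```python
import math


def approximation_ratio(k: int) -> float:
    """Evaluate 1 - 1/sqrt(k + 3), the ratio stated in the abstract.

    k is the minimum task budget; requiring k >= 1 is an assumption
    made for this illustration.
    """
    if k < 1:
        raise ValueError("k is assumed to be at least 1")
    return 1.0 - 1.0 / math.sqrt(k + 3)


# Hypothetical sample budgets, chosen so that k + 3 is a perfect square.
for k in (1, 6, 13, 97):
    print(f"k = {k:3d}  ratio = {approximation_ratio(k):.4f}")
```

Under these assumptions the ratio is 0.5 at k = 1 and approaches 1 as the minimum budget grows, so the guarantee is weakest when some task allows only a single assignment.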