A Fuzzy-Based Multimedia Content Retrieval Method Using Mood Tags and Their Synonyms in Social Networks
The preferences of Web information purchasers are rapidly evolving. Cost-effectiveness is now regarded as less important than cost-satisfaction, which emphasizes the purchaser's psychological satisfaction. One way to improve a user's cost-satisfaction in multimedia content retrieval is to exploit the mood inherent in multimedia items. An example of applications using this approach is SNS (Social Network Services), which is based on folksonomy, but such applications encounter problems with synonyms. To address the synonym problem, our previous study represented the mood of multimedia content with arousal and valence (AV) values in Thayer's two-dimensional model as an internal tag. Although some synonym problems could thereby be solved, the retrieval performance of the previous study fell short of a keyword-based method. In this paper, a new method is proposed that solves the synonym problem while maintaining the same performance as the keyword-based approach. In the proposed method, the mood of multimedia content is represented as a fuzzy set over the 12 moods of the Thayer model. For the analysis, the proposed method is compared with two methods, one based on AV values and the other based on keywords. The analysis results demonstrate that the proposed method is superior to both.
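To make the fuzzy representation concrete, the sketch below ranks multimedia items by the similarity of their fuzzy mood sets to a query mood set. It is an illustrative reading of the abstract, not the paper's exact formulation: the 12 mood labels, the min/max (Jaccard-style) similarity measure, and all function names are assumptions.

```python
# Hypothetical sketch of fuzzy mood-set retrieval; labels and similarity are assumptions.
import numpy as np

# Placeholder labels for the 12 moods of the Thayer model (not the paper's exact list).
MOODS = ["excited", "happy", "pleased", "relaxed", "peaceful", "calm",
         "sleepy", "bored", "sad", "nervous", "angry", "annoyed"]

def fuzzy_similarity(query, item):
    """Jaccard-style similarity between two fuzzy mood sets (memberships in [0, 1])."""
    return float(np.minimum(query, item).sum() / np.maximum(query, item).sum())

def rank_items(query, items):
    """Return item ids sorted by decreasing similarity to the query mood set."""
    scores = {name: fuzzy_similarity(query, vec) for name, vec in items.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: a query mood set and two catalogue items, each a 12-dim membership vector.
query = np.zeros(12)
query[MOODS.index("happy")] = 1.0
query[MOODS.index("excited")] = 0.6
items = {"clip_a": np.random.rand(12), "clip_b": np.random.rand(12)}
print(rank_items(query, items))
```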
Joint unsupervised and supervised learning for context-aware language identification
Language identification (LID) recognizes the language of a spoken utterance
automatically. According to recent studies, LID models trained with an
automatic speech recognition (ASR) task perform better than those trained with
a LID task only. However, we need additional text labels to train the model to
recognize speech, and acquiring these text labels is costly. In order to
overcome this problem, we propose context-aware language identification using a
combination of unsupervised and supervised learning without any text labels.
The proposed method learns the context of speech through a masked language
modeling (MLM) loss and is simultaneously trained to determine the language of
the utterance with a supervised learning loss. The proposed joint learning was found
to reduce the error rate by 15.6% compared to the same structure model trained
by supervised-only learning on a subset of the VoxLingua107 dataset consisting
of sub-three-second utterances in 11 languages.
Comment: Accepted by ICASSP 202
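A minimal sketch of how such a joint objective could be set up is shown below, assuming a Transformer encoder over speech features, a masked-frame reconstruction term standing in for the MLM loss, and a cross-entropy LID term. The encoder, masking scheme, and loss weight are assumptions rather than the paper's exact recipe.

```python
# Hedged sketch: joint masked-prediction ("MLM"-style) and supervised LID training.
import torch
import torch.nn as nn

class JointLID(nn.Module):
    def __init__(self, feat_dim=80, num_langs=11):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=4)
        self.reconstruct = nn.Linear(feat_dim, feat_dim)   # predicts masked frames
        self.classifier = nn.Linear(feat_dim, num_langs)   # utterance-level LID head

    def forward(self, feats, mask):
        # feats: (B, T, F) speech features; mask: (B, T) bool, True where frames are masked
        masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)
        enc = self.encoder(masked)
        mlm_loss = (self.reconstruct(enc)[mask] - feats[mask]).abs().mean()
        logits = self.classifier(enc.mean(dim=1))          # mean-pool over time
        return logits, mlm_loss

model = JointLID()
feats = torch.randn(4, 300, 80)
mask = torch.rand(4, 300) < 0.15                           # assumed masking ratio
labels = torch.randint(0, 11, (4,))
logits, mlm_loss = model(feats, mask)
loss = nn.functional.cross_entropy(logits, labels) + 0.5 * mlm_loss  # joint objective
loss.backward()
```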
Boosting Unknown-number Speaker Separation with Transformer Decoder-based Attractor
We propose a novel speech separation model designed to separate mixtures with
an unknown number of speakers. The proposed model stacks 1) a dual-path
processing block that can model spectro-temporal patterns, 2) a transformer
decoder-based attractor (TDA) calculation module that can deal with an unknown
number of speakers, and 3) triple-path processing blocks that can model
inter-speaker relations. Given a fixed, small set of learned speaker queries
and the mixture embedding produced by the dual-path blocks, TDA infers the
relations of these queries and generates an attractor vector for each speaker.
The estimated attractors are then combined with the mixture embedding by
feature-wise linear modulation conditioning, creating a speaker dimension. The
mixture embedding, conditioned with speaker information produced by TDA, is fed
to the final triple-path blocks, which augment the dual-path blocks with an
additional pathway dedicated to inter-speaker processing. The proposed approach
outperforms the previous best reported in the literature, achieving 24.0 and
23.7 dB SI-SDR improvement (SI-SDRi) on WSJ0-2mix and WSJ0-3mix respectively, with a
single model trained to separate 2- and 3-speaker mixtures. The proposed model
also exhibits strong performance and generalizability at counting sources and
separating mixtures with up to 5 speakers.
Comment: 5 pages, 4 figures, accepted by ICASSP 202
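The sketch below illustrates the TDA-plus-FiLM idea described above: a fixed set of learned speaker queries attends to the mixture embedding via a transformer decoder, and each resulting attractor produces a scale and shift that condition the embedding, creating a speaker dimension. Module sizes, the existence head, and all names are assumptions, not the authors' implementation.

```python
# Illustrative sketch of a transformer-decoder-based attractor with FiLM conditioning.
import torch
import torch.nn as nn

class TDAFiLM(nn.Module):
    def __init__(self, dim=128, max_spks=5, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_spks, dim))  # learned speaker queries
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=num_layers)
        self.exists = nn.Linear(dim, 1)        # predicts whether each attractor is active
        self.film = nn.Linear(dim, 2 * dim)    # produces per-speaker scale and shift

    def forward(self, mix_emb):
        # mix_emb: (B, T, D) mixture embedding produced by the dual-path blocks
        B, T, D = mix_emb.shape
        q = self.queries.unsqueeze(0).expand(B, -1, -1)          # (B, S, D)
        attractors = self.decoder(q, mix_emb)                    # (B, S, D)
        exist_logits = self.exists(attractors).squeeze(-1)       # (B, S) for source counting
        scale, shift = self.film(attractors).chunk(2, dim=-1)    # (B, S, D) each
        # FiLM: broadcast over time, yielding a per-speaker conditioned embedding
        cond = scale.unsqueeze(2) * mix_emb.unsqueeze(1) + shift.unsqueeze(2)  # (B, S, T, D)
        return cond, exist_logits

cond, exist_logits = TDAFiLM()(torch.randn(2, 200, 128))
print(cond.shape, exist_logits.shape)  # (2, 5, 200, 128) and (2, 5)
```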
Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling
We propose FSB-LSTM, a novel long short-term memory (LSTM) based architecture
that integrates full- and sub-band (FSB) modeling, for single- and
multi-channel speech enhancement in the short-time Fourier transform (STFT)
domain. The model maintains an information highway that flows an over-complete
input representation through multiple FSB-LSTM modules. Each FSB-LSTM module
consists of a full-band block to model spectro-temporal patterns at all
frequencies and a sub-band block to model patterns within each sub-band, where
each of the two blocks takes a down-sampled representation as input and returns
an up-sampled discriminative representation to be added to the block input via
a residual connection. The model is designed to have a low algorithmic
complexity, a small run-time buffer and a very low algorithmic latency, at the
same time producing a strong enhancement performance on a noisy-reverberant
speech enhancement task even if the hop size is as low as ms.
Comment: in ICASSP 202
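The sketch below gives one possible reading of a single FSB-LSTM-style module: a full-band path operates on a frequency-downsampled view of the whole spectrum, a shared sub-band path operates within groups of adjacent frequency bins, and each path's up-sampled output is added back residually. The tensor layout, downsampling scheme, and sizes are assumptions for illustration only, not the paper's architecture.

```python
# Rough sketch of one full-/sub-band module with residual connections (assumed layout).
import torch
import torch.nn as nn

class FSBModule(nn.Module):
    def __init__(self, channels=24, freq=64, down=4, hidden=64):
        super().__init__()
        self.down = down
        # full-band path: fold the frequency-downsampled spectrum into the feature dim
        self.full_lstm = nn.LSTM(channels * freq // down, hidden, batch_first=True)
        self.full_out = nn.Linear(hidden, channels * freq)
        # sub-band path: one shared LSTM applied to each sub-band independently
        self.sub_lstm = nn.LSTM(channels * down, hidden, batch_first=True)
        self.sub_out = nn.Linear(hidden, channels * down)

    def forward(self, x):
        B, C, T, F = x.shape  # (batch, channels, frames, frequency bins)
        d = self.down
        # full-band: downsample frequency by averaging groups of `d` bins
        fb = x.view(B, C, T, F // d, d).mean(-1)                         # (B, C, T, F/d)
        fb = fb.permute(0, 2, 1, 3).reshape(B, T, -1)                    # (B, T, C*F/d)
        fb = self.full_out(self.full_lstm(fb)[0]).view(B, T, C, F)       # upsample
        x = x + fb.permute(0, 2, 1, 3)                                   # residual add
        # sub-band: group every `d` adjacent bins into one sub-band
        sb = x.view(B, C, T, F // d, d)
        sb = sb.permute(0, 3, 2, 1, 4).reshape(B * (F // d), T, C * d)
        sb = self.sub_out(self.sub_lstm(sb)[0])
        sb = sb.view(B, F // d, T, C, d).permute(0, 3, 2, 1, 4).reshape(B, C, T, F)
        return x + sb                                                    # residual add

print(FSBModule()(torch.randn(2, 24, 100, 64)).shape)  # torch.Size([2, 24, 100, 64])
```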
TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation
We propose TF-GridNet for speech separation. The model is a novel multi-path
deep neural network (DNN) integrating full- and sub-band modeling in the
time-frequency (T-F) domain. It stacks several multi-path blocks, each
consisting of an intra-frame full-band module, a sub-band temporal module, and
a cross-frame self-attention module. It is trained to perform complex spectral
mapping, where the real and imaginary (RI) components of input signals are
stacked as features to predict target RI components. We first evaluate it on
monaural anechoic speaker separation. Without using data augmentation and
dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in
scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard
dataset for two-speaker separation. To show its robustness to noise and
reverberation, we evaluate it on monaural reverberant speaker separation using
the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!,
and obtain state-of-the-art performance on both datasets. We then extend
TF-GridNet to multi-microphone conditions through multi-microphone complex
spectral mapping, and integrate it into a two-DNN system with a beamformer in
between (named MISO-BF-MISO in earlier studies), where the beamformer
proposed in this paper is a novel multi-frame Wiener filter computed based on
the outputs of the first DNN. State-of-the-art performance is obtained on the
multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply
the proposed algorithms to speech dereverberation and noisy-reverberant speech
enhancement. State-of-the-art performance is obtained on a dereverberation
dataset and on the dataset of the recent L3DAS22 multi-channel speech
enhancement challenge.
Comment: In submission. A sound demo is available at https://zqwang7.github.io/demos/TF-GridNet-demo/index.htm
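As a small illustration of the complex spectral mapping setup the abstract refers to, the sketch below stacks the real and imaginary (RI) STFT components as input features, runs them through a placeholder network, and re-synthesises a waveform from the predicted RI components with the inverse STFT. The tiny convolutional network stands in for TF-GridNet's multi-path blocks and is not the actual model; the STFT parameters are assumptions.

```python
# Hedged sketch of complex spectral mapping: predict target RI components from stacked RI input.
import torch
import torch.nn as nn

n_fft, hop = 512, 128
win = torch.hann_window(n_fft)

def to_ri(wave):
    spec = torch.stft(wave, n_fft, hop, window=win, return_complex=True)  # (B, F, T)
    return torch.stack([spec.real, spec.imag], dim=1)                     # (B, 2, F, T)

def to_wave(ri, length):
    spec = torch.complex(ri[:, 0], ri[:, 1])
    return torch.istft(spec, n_fft, hop, window=win, length=length)

# Placeholder mapping network; in TF-GridNet this would be the stack of intra-frame
# full-band, sub-band temporal, and cross-frame self-attention modules.
net = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.PReLU(),
                    nn.Conv2d(32, 2, 3, padding=1))

mixture = torch.randn(4, 16000)            # 1 s of 16 kHz audio
est_ri = net(to_ri(mixture))               # predicted target RI components
est_wave = to_wave(est_ri, mixture.shape[-1])
print(est_wave.shape)                      # torch.Size([4, 16000])
```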
Effect of Wavelength and Intensity of Light on a-InGaZnO TFTs under Negative Bias Illumination Stress
We investigated the degradation mechanism of a-IGZO TFTs under NBIS with different wavelengths λ and intensities I_L of light. A negative gate bias was applied for 4000 s while the drain and source were grounded, and illumination at λ = 450, 530, or 700 nm was applied. Illumination with photon energy exceeding ~2.3 eV (530 nm) induced a noticeable threshold voltage shift ΔV_th, which can be interpreted in terms of ionization of oxygen vacancies V_O. In addition, I_L of the blue illumination (450 nm) was varied from 6 to 200 lux, and saturation in ΔV_th was observed above a certain I_L. We suggest that the saturation occurs because the V_O ionization rate is saturated by outward relaxation of metal atoms in the a-IGZO film. © The Author(s) 2016. Published by ECS.
That's What I Said: Fully-Controllable Talking Face Generation
The goal of this paper is to synthesise talking faces with controllable
facial motions. To achieve this goal, we propose two key ideas. The first is to
establish a canonical space where every face has the same motion patterns but
different identities. The second is to navigate a multimodal motion space that
only represents motion-related features while eliminating identity information.
To disentangle identity and motion, we introduce an orthogonality constraint
between the two different latent spaces. From this, our method can generate
natural-looking talking faces with fully controllable facial attributes and
accurate lip synchronisation. Extensive experiments demonstrate that our method
achieves state-of-the-art results in terms of both visual quality and lip-sync
score. To the best of our knowledge, we are the first to develop a talking face
generation framework that can accurately manifest full target facial motions
including lip, head pose, and eye movements in the generated video without any
additional supervision beyond RGB video with audio.
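One common way to impose such an orthogonality constraint is to penalise the cosine similarity between the identity latent and the motion latent, as in the hedged sketch below; the exact constraint used in the paper may differ.

```python
# Hedged sketch: orthogonality loss between identity and motion latent codes.
import torch

def orthogonality_loss(identity_z, motion_z):
    """identity_z, motion_z: (B, D) latent codes from the two encoders."""
    id_n = torch.nn.functional.normalize(identity_z, dim=-1)
    mo_n = torch.nn.functional.normalize(motion_z, dim=-1)
    # squared cosine similarity per sample, averaged over the batch
    return (id_n * mo_n).sum(dim=-1).pow(2).mean()

loss = orthogonality_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```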
Inspection System for Vehicle Headlight Defects Based on Convolutional Neural Network
This paper proposes a method to detect defects in the region of interest (ROI) using a convolutional neural network (CNN) after alignment (position and rotation calibration) of a manufacturer's headlights, in order to determine whether vehicle headlights are defective. The results were compared with an existing defect-discrimination method among the previously proposed methods. One hundred original headlight images were acquired for each of two vehicle types for this experiment, and 20,000 good-quality images and 20,000 defective images were obtained by applying position and rotation transformations to the original images. The proposed method demonstrated a performance improvement of more than 0.1569 (15.69% on average) compared to the existing method. An illustrative classifier sketch follows below.
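For illustration only, the sketch below shows a small CNN that classifies an aligned headlight ROI as good or defective; the input size, depth, and channel counts are assumptions and do not reflect the paper's actual architecture.

```python
# Hypothetical sketch: binary defect classification of an aligned headlight ROI.
import torch
import torch.nn as nn

class DefectCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, 2)  # good vs. defective

    def forward(self, roi):
        # roi: (B, 3, H, W) aligned region-of-interest crop
        return self.classifier(self.features(roi).flatten(1))

logits = DefectCNN()(torch.randn(4, 3, 128, 128))
print(logits.shape)  # torch.Size([4, 2])
```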