An Adversarial Perturbation Oriented Domain Adaptation Approach for Semantic Segmentation
We focus on Unsupervised Domain Adaptation (UDA) for the task of semantic segmentation. Recently, adversarial alignment has been widely adopted to globally match the marginal distributions of feature representations across two domains. However, this strategy fails to adapt the representations of tail classes or small objects for semantic segmentation, since the alignment objective is dominated by head categories or large objects. In contrast to adversarial alignment, we propose to explicitly train a domain-invariant classifier by generating and defending against pointwise adversarial perturbations in feature space. Specifically, we first perturb the intermediate feature maps with several attack objectives (i.e., the discriminator and the classifier) at each individual position for both domains, and then train the classifier to be invariant to these perturbations. By perturbing each position individually, our model treats every location equally regardless of category or object size and thus circumvents the aforementioned issue. Moreover, the domain gap in feature space is reduced by extrapolating the perturbed source and target features towards each other through an attack on the domain discriminator. Our approach achieves state-of-the-art performance on two challenging domain adaptation tasks for semantic segmentation: GTA5 -> Cityscapes and SYNTHIA -> Cityscapes. Comment: To appear in AAAI 2020.
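The core idea, perturbing each spatial position of a feature map independently and then training the classifier to resist the perturbation, can be illustrated with a short PyTorch sketch. The `classifier` and `discriminator` modules, the single gradient-ascent attack step, and the loss weighting below are illustrative assumptions, not the authors' released code; for target-domain data the segmentation labels would have to come from pseudo-labels.

```python
# Minimal sketch of a pointwise feature-space adversarial perturbation
# (assumed single-step attack; the paper's exact procedure may differ).
import torch
import torch.nn.functional as F

def pointwise_perturb(features, classifier, discriminator, seg_labels,
                      opposite_domain, eps=1.0):
    """Generate a pointwise adversarial perturbation of a feature map.

    features:        (B, C, H, W) intermediate feature map.
    seg_labels:      (B, H, W) ground-truth (source) or pseudo (target) labels.
    opposite_domain: scalar 0/1 label of the *other* domain, so the attack on
                     the discriminator extrapolates features towards it.
    """
    features = features.detach().requires_grad_(True)
    logits = classifier(features)                       # (B, K, H, W)
    dom = discriminator(features)                       # (B, 1, H, W)
    # Attack objectives: confuse the classifier and push the features
    # towards the opposite domain as judged by the discriminator.
    attack = (F.cross_entropy(logits, seg_labels)
              - F.binary_cross_entropy_with_logits(
                    dom, torch.full_like(dom, float(opposite_domain))))
    grad, = torch.autograd.grad(attack, features)
    # Normalize the gradient at each (h, w) position separately, so every
    # location receives an equally sized perturbation regardless of class
    # frequency or object size.
    norm = grad.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return (features + eps * grad / norm).detach()
```

The classifier would then be trained with the usual segmentation loss on both the clean and the perturbed features, which is what encourages invariance at every location.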
Cross-layer similarity knowledge distillation for speech enhancement
Speech enhancement (SE) algorithms based on deep neural networks (DNNs) often face limited hardware resources or strict latency requirements when deployed in real-world scenarios, yet a strong enhancement effect typically requires a large DNN. In this paper, a knowledge distillation framework for SE is proposed to compress the DNN model. We study a strategy of cross-layer connection paths, which fuses multi-level information from the teacher and transfers it to the student. To adapt the method to the SE task, we propose a frame-level similarity distillation loss. We apply this method to the deep complex convolution recurrent network (DCCRN) and make targeted adjustments. Experimental results show that the proposed method considerably improves the enhancement effect of the compressed DNN and outperforms other distillation methods.
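As a rough illustration of what a frame-level similarity distillation loss can look like, the sketch below compares the frame-by-frame cosine-similarity structure of teacher and student feature maps. The tensor layout, the flattening per frame, and the MSE matching are assumptions for illustration; the paper's exact formulation may differ.

```python
# Minimal sketch of a frame-level similarity distillation loss, assuming
# teacher/student features of shape (batch, channels, time, freq).
import torch
import torch.nn.functional as F

def frame_similarity_loss(student_feat, teacher_feat):
    """Match the frame-by-frame similarity structure of student and teacher.

    Each time frame is flattened to a vector and L2-normalized, and the
    (T x T) cosine-similarity Gram matrices of teacher and student are
    compared with MSE. Because the loss lives in similarity space, the
    teacher and student channel counts need not match.
    """
    def gram(feat):
        b, c, t, f = feat.shape
        frames = feat.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (B, T, C*F)
        frames = F.normalize(frames, dim=-1)
        return frames @ frames.transpose(1, 2)                  # (B, T, T)

    return F.mse_loss(gram(student_feat), gram(teacher_feat))
```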
Practical Speech Emotion Recognition Based on Online Learning: From Acted Data to Elicited Data
We study cross-database speech emotion recognition based on online learning. How to apply a classifier trained on acted data to naturalistic data, such as elicited data, remains a major challenge for today's speech emotion recognition systems. We introduce three different types of data sources: first, a basic speech emotion dataset collected from acted speech by professional actors and actresses; second, a speaker-independent dataset containing a large number of speakers; third, an elicited speech dataset collected during a cognitive task. Acoustic features are extracted from the emotional utterances and evaluated using the maximal information coefficient (MIC). A baseline valence and arousal classifier is designed based on Gaussian mixture models. The online training module is implemented using AdaBoost. While the offline recognizer is trained on the acted data, the online testing data include the speaker-independent data and the elicited data. Experimental results show that, by introducing the online learning module, our speech emotion recognition system adapts better to new data, which is an important characteristic for real-world applications.
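The pipeline described above, an offline GMM baseline plus an AdaBoost-based online module, can be sketched with scikit-learn stand-ins. The per-class GMM scoring, the batch-wise refitting scheme, and all data shapes below are assumptions for illustration, not the authors' implementation.

```python
# Sketch: offline GMM baseline trained on acted data, plus an AdaBoost
# ensemble refit as new (speaker-independent or elicited) batches arrive.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

class GMMBaseline:
    """One GMM per class; predicts the class with the highest log-likelihood."""
    def __init__(self, n_components=4):
        self.n_components = n_components
        self.models = {}

    def fit(self, X, y):
        for label in np.unique(y):
            gmm = GaussianMixture(self.n_components, covariance_type="diag")
            gmm.fit(X[y == label])
            self.models[label] = gmm
        return self

    def predict(self, X):
        labels = sorted(self.models)
        scores = np.stack([self.models[l].score_samples(X) for l in labels], 1)
        return np.array(labels)[scores.argmax(1)]

# Offline stage: train the baseline on (synthetic stand-in) acted data.
rng = np.random.default_rng(0)
X_acted, y_acted = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
baseline = GMMBaseline().fit(X_acted, y_acted)

# Online stage: accumulate labeled batches of new data and refit AdaBoost.
booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
X_seen, y_seen = X_acted, y_acted
for X_batch, y_batch in [(rng.normal(size=(20, 10)), rng.integers(0, 2, 20))]:
    X_seen = np.vstack([X_seen, X_batch])
    y_seen = np.concatenate([y_seen, y_batch])
    booster.fit(X_seen, y_seen)   # re-trained as each new batch arrives
```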
Detecting Depression from Speech through an Attentive LSTM Network
As a mental disorder, depression endangers people's health and disrupts their social functioning. Automatic depression detection, as an efficient aid to diagnosis, has attracted considerable research interest. This study presents an attention-based Long Short-Term Memory (LSTM) model for depression detection that exploits the differences between depressed and non-depressed speech across timeframes. The proposed model uses frame-level features, which capture the temporal information of depressive speech, to replace traditional statistical features as the input to the LSTM layers. To obtain richer multi-dimensional deep feature representations, the LSTM output is then passed to attention layers along both the time and feature dimensions. We concatenate the outputs of the attention layers and feed the fused feature representation into a fully connected layer, whose output is finally passed to a softmax layer. Experiments conducted on the DAIC-WOZ database demonstrate that the proposed attentive LSTM model achieves an average accuracy of 90.2%, outperforming the traditional LSTM network and the LSTM with local attention by 0.7% and 2.3%, respectively, which indicates its feasibility.
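A compact PyTorch sketch of such an attentive LSTM is given below: frame-level features pass through stacked LSTM layers, attention is applied along the time and feature dimensions, the two attended vectors are concatenated, and a fully connected layer followed by softmax produces the decision. Layer sizes and the precise form of the two attention mechanisms are assumptions; the paper's configuration may differ.

```python
# Minimal sketch of an attention-based LSTM for utterance-level classification.
import torch
import torch.nn as nn

class AttentiveLSTM(nn.Module):
    def __init__(self, n_feats=40, hidden=128, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2, batch_first=True)
        self.time_attn = nn.Linear(hidden, 1)       # attention over time steps
        self.feat_attn = nn.Linear(hidden, hidden)  # attention over features
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):            # x: (B, T, n_feats) frame-level features
        h, _ = self.lstm(x)          # (B, T, hidden)
        # Time attention: weighted sum of frames.
        a_t = torch.softmax(self.time_attn(h), dim=1)           # (B, T, 1)
        time_vec = (a_t * h).sum(dim=1)                         # (B, hidden)
        # Feature attention: reweight feature dims of the last frame.
        a_f = torch.softmax(self.feat_attn(h[:, -1]), dim=-1)   # (B, hidden)
        feat_vec = a_f * h[:, -1]                               # (B, hidden)
        fused = torch.cat([time_vec, feat_vec], dim=-1)  # concatenated features
        return torch.softmax(self.fc(fused), dim=-1)     # softmax decision

model = AttentiveLSTM()
probs = model(torch.randn(4, 200, 40))   # 4 utterances, 200 frames each
```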