980 research outputs found
Few-shot Class-incremental Audio Classification Using Stochastic Classifier
It is generally assumed in current audio classification methods that the number
of classes is fixed, and that the model can recognize only pre-given classes.
When new classes emerge, the model needs to be retrained with adequate samples
of all classes. If new classes emerge continually, these methods will not work
well and may even become infeasible. In this study, we propose a method for
few-shot class-incremental audio classification, which continually recognizes
new classes and remembers old ones. The proposed model consists of an embedding
extractor and a stochastic classifier. The former is trained in the base session
and frozen in incremental sessions, while the latter is incrementally expanded
in all sessions. Two datasets (NS-100 and LS-100) are built by choosing samples
from the audio corpora of NSynth and LibriSpeech, respectively. Results show
that our method exceeds four baseline methods in average accuracy and
performance dropping rate. Code is at https://github.com/vinceasvp/meta-sc.
Comment: 5 pages, 3 figures, 4 tables. Accepted for publication in INTERSPEECH
202
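The core idea above can be illustrated with a minimal sketch: the embedding extractor is frozen after the base session, and the classifier holds per-class weight distributions that are expanded with new class prototypes in each incremental session. The class names, `init_sigma` value, and cosine-similarity scoring below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class StochasticClassifier:
    """Sketch of an incrementally expandable stochastic classifier.

    Each class is represented by a weight distribution (mean + std).
    At inference, weights are (optionally) sampled and score an embedding
    by cosine similarity. New classes are appended from few-shot
    prototypes without retraining the old class weights.
    """

    def __init__(self, embed_dim, seed=0):
        self.mu = np.empty((0, embed_dim))     # per-class weight means
        self.sigma = np.empty((0, embed_dim))  # per-class weight stds
        self.rng = np.random.default_rng(seed)

    def expand(self, class_prototypes, init_sigma=0.1):
        """Add new classes from prototype (mean) embeddings."""
        protos = np.atleast_2d(np.asarray(class_prototypes, dtype=float))
        self.mu = np.vstack([self.mu, protos])
        self.sigma = np.vstack([self.sigma, np.full_like(protos, init_sigma)])

    def logits(self, embedding, sample=True):
        """Cosine similarity between embedding and (sampled) class weights."""
        noise = self.rng.standard_normal(self.mu.shape) if sample else 0.0
        w = self.mu + noise * self.sigma
        w = w / np.linalg.norm(w, axis=1, keepdims=True)
        e = embedding / np.linalg.norm(embedding)
        return w @ e
```

With `sample=False` the classifier degenerates to a deterministic cosine classifier over class means; the stochastic sampling during training is what regularizes the few-shot prototypes.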
A Graph Isomorphism Network with Weighted Multiple Aggregators for Speech Emotion Recognition
Speech emotion recognition (SER) is an essential part of human-computer
interaction. In this paper, we propose an SER network based on a Graph
Isomorphism Network with Weighted Multiple Aggregators (WMA-GIN), which can
effectively handle the problem of information confusion when neighbour nodes'
features are aggregated together in the GIN structure. Moreover, a Full-Adjacent
(FA) layer is adopted to alleviate the over-squashing problem, which exists
in all Graph Neural Network (GNN) structures, including GIN. Furthermore, a
multi-phase attention mechanism and a multi-loss training strategy are employed
to avoid missing useful emotional information in the stacked WMA-GIN layers.
We evaluated the performance of our proposed WMA-GIN on the popular IEMOCAP
dataset. The experimental results show that WMA-GIN outperforms other GNN-based
methods and is comparable to some advanced non-graph-based methods, achieving
72.48% weighted accuracy (WA) and 67.72% unweighted accuracy (UA).
Comment: Accepted by Interspeech 202
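The weighted-multiple-aggregators idea can be sketched as follows: instead of GIN's single sum aggregator, several aggregators (here sum, mean, max) run in parallel and are blended by per-aggregator weights before the standard GIN update. The specific aggregator set, the scalar blend weights, and the omitted MLP are assumptions for illustration, not the paper's exact layer.

```python
import numpy as np

def wma_gin_layer(H, A, weights, eps=0.0):
    """Sketch of a GIN update with weighted multiple aggregators.

    H: (N, d) node features; A: (N, N) binary adjacency matrix;
    weights: dict mapping aggregator name -> blend weight (assumed
    learnable in the real model). The GIN MLP is omitted for brevity,
    so this returns (1 + eps) * H + blended_aggregation.
    """
    deg = A.sum(axis=1, keepdims=True).clip(min=1)   # avoid divide-by-zero
    # Mask non-neighbours with -inf so max only sees actual neighbours.
    masked = np.where(A[:, :, None] > 0, H[None, :, :], -np.inf)
    aggs = {
        "sum": A @ H,
        "mean": (A @ H) / deg,
        "max": masked.max(axis=1),
    }
    agg = sum(w * aggs[name] for name, w in weights.items())
    return (1 + eps) * H + agg
```

Blending several aggregators is one way to reduce the "information confusion" the abstract mentions: sum preserves multiset counts (GIN's injectivity argument), while mean and max capture distributional and extremal neighbour statistics that a plain sum conflates.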
Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data
Conventional sequential learning methods such as Recurrent Neural Networks
(RNNs) focus on interactions between consecutive inputs, i.e. first-order
Markovian dependency. However, most sequential data, as seen with videos,
have complex dependency structures that imply variable-length semantic flows
and their compositions, which are hard to capture with conventional methods.
Here, we propose Cut-Based Graph Learning Networks (CB-GLNs) for learning
video data by discovering these complex structures of the video. The CB-GLNs
represent video data as a graph, with nodes and edges corresponding to frames
of the video and their dependencies, respectively. The CB-GLNs find
compositional dependencies of the data in multilevel graph forms via a
parameterized kernel with graph-cut and a message-passing framework. We
evaluate the proposed method on two different tasks for video understanding:
video theme classification (YouTube-8M dataset) and video question answering
(TVQA dataset). The experimental results show that our model efficiently
learns the semantic compositional structure of video data. Furthermore, our
model achieves the highest performance in comparison to other baseline methods.
Comment: 8 pages, 3 figures, Association for the Advancement of Artificial
Intelligence (AAAI 2020). arXiv admin note: substantial text overlap with
arXiv:1907.0170
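The frames-as-graph-nodes idea can be sketched with a simple stand-in for the paper's parameterized graph-cut kernel: build a cosine-similarity graph over frame features, then bipartition it with a spectral cut (the sign of the Fiedler vector). The spectral method and the single-level cut are assumptions for illustration; the actual CB-GLNs learn the cut and apply it at multiple levels.

```python
import numpy as np

def temporal_graph_cut(frame_feats):
    """Sketch: split a video into two segments via a spectral graph cut.

    frame_feats: (N, d) array of per-frame features. Frames become graph
    nodes, cosine similarities become edge weights, and the sign of the
    Fiedler vector (2nd-smallest eigenvector of the graph Laplacian)
    gives a two-way partition approximating a minimum normalized cut.
    Returns an (N,) array of 0/1 segment labels.
    """
    F = np.asarray(frame_feats, dtype=float)
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True)
    S = Fn @ Fn.T                      # cosine-similarity adjacency
    np.fill_diagonal(S, 0.0)           # no self-loops
    L = np.diag(S.sum(axis=1)) - S     # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)
```

Applying such a cut recursively to each segment would yield the multilevel, compositional structure the abstract describes: shots grouped into scenes, scenes into themes.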