Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events
In this paper, we propose a new strategy for acoustic scene classification
(ASC), namely recognizing acoustic scenes through identifying distinct sound
events. This differs from existing strategies, which focus on characterizing
global acoustical distributions of audio or the temporal evolution of
short-term audio features, without analysis down to the level of sound events.
To identify distinct sound events for each scene, we formulate ASC in a
multi-instance learning (MIL) framework, where each audio recording is mapped
into a bag-of-instances representation. Here, instances can be seen as
high-level representations for sound events inside a scene. We also propose a
MIL neural network model, which implicitly identifies distinct instances
(i.e., sound events). Furthermore, we propose two specially designed modules
that model the multi-temporal-scale and multi-modal natures of the sound events,
respectively. The experiments were conducted on the official development set of
the DCASE2018 Task1 Subtask B, and our best-performing model improves on the
official baseline by 9.4% absolute (68.3% vs. 58.9%) in classification accuracy.
This study indicates that recognizing acoustic scenes by identifying distinct
sound events is effective and paves the way for future studies that combine
this strategy with previous ones.
Comment: code URL typo; code is available at
https://github.com/hackerekcah/distinct-events-asc.gi
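As a rough illustration of the bag-of-instances idea described above, the sketch below shows attention-based multi-instance pooling in PyTorch: each instance embedding gets a learned weight, and high-weight instances can be read as the distinct sound events that drive the scene decision. The layer sizes and the specific attention-pooling formulation are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of attention-based multi-instance pooling (assumed setup,
# not the paper's exact model).
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Aggregate a bag of instance embeddings into one scene-level vector."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (batch, num_instances, dim)
        weights = torch.softmax(self.attn(instances), dim=1)   # (batch, N, 1)
        return (weights * instances).sum(dim=1)                # (batch, dim)

# Usage: pool 62 instance embeddings of size 256 into one bag embedding.
pool = AttentionMILPooling(dim=256)
bag = pool(torch.randn(8, 62, 256))  # -> (8, 256)
```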
Visible-Infrared Person Re-Identification via Patch-Mixed Cross-Modality Learning
Visible-infrared person re-identification (VI-ReID) aims to retrieve images
of the same pedestrian from different modalities, where the challenges lie in
the significant modality discrepancy. To alleviate the modality gap, recent
methods generate intermediate images by GANs, grayscaling, or mixup strategies.
However, these methods could introduce extra noise, and the semantic
correspondence between the two modalities is not well learned. In this paper,
we propose a Patch-Mixed Cross-Modality framework (PMCM), where two images of
the same person from two modalities are split into patches and stitched into a
new one for model learning. In this way, the model learns to recognize a person
through patches of different styles, and the modality semantic correspondence
is directly embodied. With this flexible image generation strategy, the ratio
of patches drawn from each modality in the mixed images can be freely adjusted,
which could further alleviate the modality imbalance problem. In addition, the
relationship between identity centers among modalities is explored to further
reduce the modality variance, and the global-to-part constraint is introduced
to regularize representation learning of part features. On two VI-ReID
datasets, we report new state-of-the-art performance with the proposed method.
Comment: IJCAI2
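A minimal sketch of the patch-mixing idea, assuming a simple grid split and a random per-patch choice of modality; the grid size, mixing ratio, and function name are illustrative and not the paper's exact procedure.

```python
# Assumed illustration of cross-modality patch mixing for VI-ReID.
import torch

def patch_mix(vis: torch.Tensor, ir: torch.Tensor, grid: int = 4, ratio: float = 0.5) -> torch.Tensor:
    """Stitch patches from a visible and an infrared image of the same person.

    vis, ir: (C, H, W) tensors with H and W divisible by `grid`.
    ratio:   fraction of patches taken from the infrared image.
    """
    c, h, w = vis.shape
    ph, pw = h // grid, w // grid
    mixed = vis.clone()
    # Randomly decide which grid cells come from the infrared modality.
    take_ir = torch.rand(grid, grid) < ratio
    for i in range(grid):
        for j in range(grid):
            if take_ir[i, j]:
                mixed[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw] = ir[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
    return mixed

# Example: mix a 3x256x128 visible/infrared pair, roughly half the patches from each.
mixed = patch_mix(torch.rand(3, 256, 128), torch.rand(3, 256, 128), grid=4, ratio=0.5)
```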
FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
This paper presents FunCodec, a fundamental neural speech codec toolkit,
which is an extension of the open-source speech processing toolkit FunASR.
FunCodec provides reproducible training recipes and inference scripts for the
latest neural speech codec models, such as SoundStream and Encodec. Thanks to
the unified design with FunASR, FunCodec can be easily integrated into
downstream tasks, such as speech recognition. Along with FunCodec, pre-trained
models are also provided, which can be used for academic or general
purposes. Based on the toolkit, we further propose the frequency-domain codec
models, FreqCodec, which can achieve comparable speech quality with much lower
computation and parameter complexity. Experimental results show that, under the
same compression ratio, FunCodec can achieve better reconstruction quality
compared with other toolkits and released models. We also demonstrate that the
pre-trained models are suitable for downstream tasks, including automatic
speech recognition and personalized text-to-speech synthesis. This toolkit is
publicly available at https://github.com/alibaba-damo-academy/FunCodec.
Comment: 5 pages, 3 figures, submitted to ICASSP 202
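For readers unfamiliar with SoundStream/Encodec-style codecs, the sketch below shows the residual vector quantization (RVQ) step such codecs rely on: each stage quantizes the residual left by the previous stages, yielding one code stream per stage. The dimensions and codebook sizes are made up for illustration; this is not FunCodec's actual API or configuration.

```python
# Assumed illustration of residual vector quantization (RVQ), the core
# quantizer in SoundStream/Encodec-style neural codecs.
import torch

def rvq_encode(frames: torch.Tensor, codebooks: list) -> list:
    """Quantize frame embeddings with a cascade of codebooks.

    frames:    (T, D) encoder outputs, one embedding per frame.
    codebooks: list of (K, D) tensors; each stage quantizes the residual
               left over from the previous stages.
    Returns one (T,) index tensor per stage.
    """
    residual = frames
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (T, K) distances to codewords
        idx = dists.argmin(dim=1)           # nearest codeword per frame
        indices.append(idx)
        residual = residual - cb[idx]       # pass the remainder to the next stage
    return indices

# Example: 100 frames of 128-dim embeddings, 4 stages of 1024-entry codebooks.
frames = torch.randn(100, 128)
codebooks = [torch.randn(1024, 128) for _ in range(4)]
codes = rvq_encode(frames, codebooks)  # 4 index streams of length 100
```

Fewer stages (or smaller codebooks) give a lower bitrate at the cost of reconstruction quality, which is the compression-ratio trade-off the abstract refers to.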
Scale-aware Test-time Click Adaptation for Pulmonary Nodule and Mass Segmentation
Pulmonary nodules and masses are crucial imaging features in lung cancer
screening that require careful management in clinical diagnosis. Despite the
success of deep learning-based medical image segmentation, robust performance
across lesions of widely varying sizes, from small nodules to large masses,
remains challenging. In this paper, we propose a multi-scale neural network with
scale-aware test-time adaptation to address this challenge. Specifically, we
introduce an adaptive Scale-aware Test-time Click Adaptation method based on
effortlessly obtainable lesion clicks as test-time cues to enhance segmentation
performance, particularly for large lesions. The proposed method can be
seamlessly integrated into existing networks. Extensive experiments on both
open-source and in-house datasets consistently demonstrate the effectiveness of
the proposed method over representative CNN- and Transformer-based segmentation methods.
Our code is available at https://github.com/SplinterLi/SaTTCA
Comment: 11 pages, 3 figures, MICCAI 202
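A hedged sketch of what click-guided test-time adaptation can look like: the lesion click supplies a single known-foreground voxel, and a few gradient steps nudge the model toward predicting foreground there before the final segmentation is produced. The pointwise loss, step count, and the choice to update only normalization parameters are assumptions, not the paper's exact method.

```python
# Assumed illustration of test-time adaptation driven by a lesion click.
import torch
import torch.nn.functional as F

def click_adapt(model, image, click_xyz, steps: int = 5, lr: float = 1e-3):
    """Fine-tune a segmentation model on one test volume using a lesion click.

    image:     (1, 1, D, H, W) CT volume.
    click_xyz: (d, h, w) voxel coordinates known to lie inside the lesion.
    """
    # Only adapt normalization parameters to limit drift (an assumption).
    params = [p for n, p in model.named_parameters() if "norm" in n or "bn" in n]
    opt = torch.optim.Adam(params, lr=lr)
    d, h, w = click_xyz
    for _ in range(steps):
        logits = model(image)                       # (1, 1, D, H, W)
        prob = torch.sigmoid(logits)[0, 0, d, h, w]
        # Encourage the clicked voxel to be predicted as foreground.
        loss = F.binary_cross_entropy(prob, torch.ones_like(prob))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```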
A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings
Speaker-attributed automatic speech recognition (SA-ASR) in multiparty
meeting scenarios is one of the most valuable and challenging ASR tasks. It was
shown that single-channel frame-level diarization with serialized output
training (SC-FD-SOT), single-channel word-level diarization with SOT
(SC-WD-SOT) and joint training of single-channel target-speaker separation and
ASR (SC-TS-ASR) can be exploited to partially solve this problem. SC-FD-SOT
obtains the speaker-attributed transcriptions by aligning the speaker
diarization results with the ASR hypotheses, SC-WD-SOT uses word-level
diarization to get rid of the alignment dependence on timestamps, and SC-TS-ASR
jointly trains target-speaker separation and ASR modules, which achieves the
best performance. In this paper, we propose three corresponding multichannel
(MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For
different tasks/models, different multichannel data fusion strategies are
considered, including channel-level cross-channel attention for MC-FD-SOT,
frame-level cross-channel attention for MC-WD-SOT and neural beamforming for
MC-TS-ASR. Experimental results on the AliMeeting corpus reveal that our
proposed multichannel SA-ASR models can consistently outperform the
corresponding single-channel counterparts in terms of the speaker-dependent
character error rate (SD-CER).
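The sketch below illustrates frame-level cross-channel attention in general terms: each frame of a reference channel attends over the same frame across all microphone channels, fusing them into a single stream for the downstream SA-ASR model. The shapes, head count, and module name are illustrative assumptions, not taken from the paper.

```python
# Assumed illustration of frame-level cross-channel attention for multichannel fusion.
import torch
import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Fuse per-channel encoder outputs by letting a reference channel attend
    over all channels at each frame."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, channels: torch.Tensor) -> torch.Tensor:
        # channels: (batch, num_channels, time, dim)
        b, c, t, d = channels.shape
        ref = channels[:, 0]                              # reference channel, (b, t, d)
        # Treat the channel axis as the attention sequence for every frame.
        q = ref.reshape(b * t, 1, d)
        kv = channels.permute(0, 2, 1, 3).reshape(b * t, c, d)
        fused, _ = self.mha(q, kv, kv)                    # (b*t, 1, d)
        return fused.reshape(b, t, d)

# Example: fuse 8 channels of 200 frames with 256-dim features.
fuse = CrossChannelAttention(dim=256)
out = fuse(torch.randn(2, 8, 200, 256))  # -> (2, 200, 256)
```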