Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies
Automatic Speech Recognition (ASR) has shown remarkable progress, yet it
still faces challenges in real-world distant scenarios involving diverse array
topologies, each comprising multiple recording devices. The focal point of the CHiME-7
Distant ASR task is to devise a unified system that generalizes across such
array topologies while offering reliable recognition performance in real-world
environments. Addressing this task, we
introduce an ASR system that demonstrates exceptional performance across
various array topologies. First of all, we propose two attention-based
automatic channel selection modules to select the most advantageous subset of
multi-channel signals from multiple recording devices for each utterance.
Furthermore, we introduce inter-channel spatial features that augment
multi-frame cross-channel attention and strengthen its awareness of spatial
information. Finally, we propose a
multi-layer convolution fusion module drawing inspiration from the U-Net
architecture to integrate the multi-channel output into a single-channel
output. Experimental results on the CHiME-7 corpus with oracle segmentation
demonstrate that the improvements introduced in our proposed ASR system lead to
a relative reduction of 40.1% in the Macro Diarization Attributed Word Error
Rates (DA-WER) on the Eval sets when compared to the baseline ASR system.
Comment: Accepted by ICASSP 202
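To make the first idea above concrete, below is a minimal PyTorch sketch of attention-based channel selection under our own assumptions, not the authors' implementation: per-channel embeddings are scored against a learned query and only the top-scoring channels are kept for each utterance. The class and parameter names (AttentiveChannelSelector, num_keep) and the tensor shapes are hypothetical.

```python
# Hedged sketch (not the paper's code): score each recorded channel with a learned
# query and keep the most advantageous subset per utterance.
import torch
import torch.nn as nn


class AttentiveChannelSelector(nn.Module):
    def __init__(self, feat_dim: int, num_keep: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(feat_dim))  # learned selection query
        self.proj = nn.Linear(feat_dim, feat_dim)
        self.num_keep = num_keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, feat_dim) multi-channel acoustic features
        chan_emb = self.proj(x.mean(dim=2))                        # (batch, channels, feat_dim)
        scores = torch.einsum("bcd,d->bc", chan_emb, self.query)  # (batch, channels)
        weights = scores.softmax(dim=-1)
        topk = weights.topk(self.num_keep, dim=-1).indices         # per-utterance channel subset
        idx = topk[..., None, None].expand(-1, -1, x.size(2), x.size(3))
        return torch.gather(x, dim=1, index=idx)                   # (batch, num_keep, frames, feat_dim)


# Example: keep the 4 most useful of 8 recorded channels for a batch of utterances.
feats = torch.randn(2, 8, 120, 256)
selected = AttentiveChannelSelector(feat_dim=256, num_keep=4)(feats)
print(selected.shape)  # torch.Size([2, 4, 120, 256])
```

In a full system along the lines described in the abstract, the retained channels would then feed the cross-channel attention and fusion stages.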
Highly efficient triazine/carbazole-based host material for green phosphorescent organic light-emitting diodes with low efficiency roll-off
Two novel triazine/carbazole-based host materials were designed and synthesized, demonstrating outstanding electroluminescence (EL) performance with maximum current efficiency (CE), power efficiency (PE) and external quantum efficiency (EQE) of 69.3 cd A−1, 54.2 lm W−1 and 21.9%, respectively.
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization
Recently, the remarkable advance of the Large Language Model (LLM) has
inspired researchers to transfer its extraordinary reasoning capability to both
vision and language data. However, the prevailing approaches primarily regard
the visual input as a prompt and focus exclusively on optimizing the text
generation process conditioned on the visual content with a frozen LLM. Such an
inequitable treatment of vision and language heavily constrains the model's
potential. In this paper, we break through this limitation by representing both
vision and language in a unified form. Specifically, we introduce a
well-designed visual tokenizer to translate the non-linguistic image into a
sequence of discrete tokens like a foreign language that LLM can read. The
resulting visual tokens carry high-level semantics comparable to words and
support a dynamic sequence length that varies with the image. Coupled with this
tokenizer, the presented foundation model, called LaVIT, can handle both images
and text uniformly under the same generative learning paradigm. This
unification empowers LaVIT to serve as an impressive generalist interface to
understand and generate multi-modal content simultaneously. Extensive
experiments further showcase that it outperforms existing models by a large
margin across a wide range of vision-language tasks. Our code and models will be available
at https://github.com/jy0205/LaVIT
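To illustrate the tokenizer idea, the following is a minimal sketch, not the released LaVIT code: patch features are filtered by a learned saliency score (yielding a dynamic sequence length) and mapped to their nearest codebook entry, producing discrete token ids an LLM could consume. The class name, codebook_size, and keep_ratio are assumptions made for this example.

```python
# Hedged sketch (not the LaVIT release): turn image patch features into a
# variable-length sequence of discrete visual token ids via codebook lookup.
import torch
import torch.nn as nn


class DiscreteVisualTokenizer(nn.Module):
    def __init__(self, feat_dim: int, codebook_size: int, keep_ratio: float = 0.5):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feat_dim)  # learnable visual "vocabulary"
        self.saliency = nn.Linear(feat_dim, 1)                 # scores how informative a patch is
        self.keep_ratio = keep_ratio

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, feat_dim) features from a vision backbone
        keep = max(1, int(self.keep_ratio * patches.size(0)))
        scores = self.saliency(patches).squeeze(-1)
        kept = patches[scores.topk(keep).indices]               # dynamic-length patch selection
        dists = torch.cdist(kept, self.codebook.weight)         # (keep, codebook_size)
        return dists.argmin(dim=-1)                             # discrete visual token ids


tok = DiscreteVisualTokenizer(feat_dim=64, codebook_size=1024)
ids = tok(torch.randn(196, 64))   # e.g. a 14x14 ViT patch grid
print(ids.shape)                  # variable-length sequence of token ids
```

The actual model additionally learns which patches to keep or merge; the saliency threshold here is only a crude stand-in for that mechanism.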