Accelerating Transducers through Adjacent Token Merging
Recent end-to-end automatic speech recognition (ASR) systems often utilize a
Transformer-based acoustic encoder that generates embeddings at a high frame
rate. However, this design is inefficient, particularly for long speech signals
due to the quadratic computation of self-attention. To address this, we propose
a new method, Adjacent Token Merging (A-ToMe), which gradually combines
adjacent tokens with high similarity scores between their key values. In this
way, the total number of time steps can be reduced, and the inference of both the
encoder and joint network is accelerated. Experiments on LibriSpeech show that
our method can reduce 57% of tokens and improve the inference speed on GPU by
70% without any notable loss of accuracy. Additionally, we demonstrate that
A-ToMe is also an effective solution to reduce tokens in long-form ASR, where
the input speech consists of multiple utterances. Comment: Interspeech 202
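The merging idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single greedy left-to-right pass, cosine similarity as the similarity score, plain averaging as the merge rule, and a fixed threshold. The names `cosine`, `merge_adjacent`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_adjacent(tokens, keys, threshold=0.9):
    """Greedy pass: average each adjacent token pair whose keys are similar.

    tokens and keys are lists of vectors of the same length. The threshold
    and the averaging rule are assumptions made for this sketch.
    """
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and cosine(keys[i], keys[i + 1]) > threshold:
            # Similar neighbours: replace the pair with its element-wise mean.
            merged.append([(x + y) / 2.0 for x, y in zip(tokens[i], tokens[i + 1])])
            i += 2
        else:
            # Dissimilar (or last) token: keep it unchanged.
            merged.append(list(tokens[i]))
            i += 1
    return merged
```

Because self-attention cost grows quadratically with sequence length, shortening the sequence in this way reduces the work done by every later layer, which is the source of the speed-up the abstract reports.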
Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition
The integration of Language Models (LMs) has proven to be an effective way to
address domain shifts in speech recognition. However, these approaches usually
require a significant amount of target domain text data for the training of
LMs. Different from these methods, in this work, with only a domain-specific
text prompt, we propose two zero-shot ASR domain adaptation methods using
LLaMA, a 7-billion-parameter large language model (LLM). The LLM is used in two
ways: 1) second-pass rescoring: reranking N-best hypotheses of a given ASR
system with LLaMA; 2) deep LLM-fusion: incorporating LLM into the decoder of an
encoder-decoder based ASR system. Experiments show that, with only one domain
prompt, both methods can effectively reduce word error rates (WER) on
out-of-domain TedLium-2 and SPGISpeech datasets. In particular, deep
LLM-fusion has the advantage of better recall of entity and out-of-vocabulary
words.
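Second-pass rescoring, the first of the two methods in this abstract, can be sketched as follows. This is only an illustration under stated assumptions: the scores are combined by a simple weighted sum, and `lm_logprob` is a hypothetical callable standing in for a real language model that scores a hypothesis given the domain prompt. `toy_lm_logprob` is a stand-in that counts word overlap and is not a language model; `rescore_nbest` and `lm_weight` are likewise illustrative names.

```python
def rescore_nbest(hypotheses, lm_logprob, prompt, lm_weight=0.5):
    """Rerank N-best hypotheses by ASR score plus a weighted LM score.

    hypotheses: list of (text, asr_score) pairs, higher score is better.
    lm_logprob: hypothetical callable (prompt, text) -> float.
    Returns the hypothesis texts sorted best-first.
    """
    scored = []
    for text, asr_score in hypotheses:
        total = asr_score + lm_weight * lm_logprob(prompt, text)
        scored.append((total, text))
    # Stable sort, so ties keep their original N-best order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


def toy_lm_logprob(prompt, text):
    """Stand-in scorer: number of distinct words shared with the prompt."""
    return float(len(set(prompt.split()) & set(text.split())))
```

For example, calling `rescore_nbest([("the sock fell", -0.9), ("the stock fell", -1.0)], toy_lm_logprob, "stock market earnings")` moves the hypothesis containing the in-domain word to the top, even though its first-pass score was lower.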
ACCELERATION ON DIFFERENT BODY POSITIONS DURING RUNNING ON A TREADMILL
Many fitness indices use V̇O2max and heart rate to estimate energy expenditure, but these methods require expensive equipment for direct measurement. This study sought a more convenient way to estimate energy expenditure by comparing the relationship between heart rate and acceleration at different body positions while running on a treadmill. Eight males (23-32 yr) wore three tri-axial accelerometers placed on the left wrist, trunk (low back), and left ankle. Each participant walked for 30 sec at 4 and 6 km·h⁻¹ and ran for 30 sec at 8, 10, 12, 14, and 16 km·h⁻¹ after their heart rate had stabilized at each speed. The total accelerations at all three placements were significantly correlated with heart rate, which indicates that acceleration measured on the human body is a good basis for estimating energy expenditure. This information is useful for developing a new watch-based device that estimates energy expenditure accurately and is more convenient than current devices on the market.
Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts
Large Language Models (LLMs) have achieved significant success across various
natural language processing (NLP) tasks, encompassing question-answering,
summarization, and machine translation, among others. While LLMs excel in
general tasks, their efficacy in domain-specific applications remains under
exploration. Additionally, LLM-generated text sometimes exhibits issues like
hallucination and disinformation. In this study, we assess LLMs' capability of
producing concise survey articles within the computer science-NLP domain,
focusing on 20 chosen topics. Automated evaluations indicate that GPT-4
outperforms GPT-3.5 when benchmarked against the ground truth. Furthermore,
four human evaluators provide insights from six perspectives across four model
configurations. Through case studies, we demonstrate that while GPT often
yields commendable results, there are instances of shortcomings, such as
incomplete information and the exhibition of lapses in factual accuracy.
VLM-Eval: A General Evaluation on Video Large Language Models
Despite the rapid development of video Large Language Models (LLMs), a
comprehensive evaluation is still absent. In this paper, we introduce a unified
evaluation that encompasses multiple video tasks, including captioning,
question answering, retrieval, and action recognition. In addition to
conventional metrics, we showcase how GPT-based evaluation can match human-like
performance in assessing response quality across multiple aspects. We propose a
simple baseline: Video-LLaVA, which uses a single linear projection and
outperforms existing video LLMs. Finally, we evaluate video LLMs beyond
academic datasets, which show encouraging recognition and reasoning
capabilities in driving scenarios with only hundreds of video-instruction pairs
for fine-tuning. We hope our work can serve as a unified evaluation for video
LLMs and help extend them to more practical scenarios. The evaluation code will be
available soon.
Implicit Image-to-Image Schrödinger Bridge for CT Super-Resolution and Denoising
Conditional diffusion models have gained recognition for their effectiveness
in image restoration tasks, yet their iterative denoising process, starting
from Gaussian noise, often leads to slow inference speeds. As a promising
alternative, the Image-to-Image Schrödinger Bridge (I2SB) initializes the
generative process from corrupted images and integrates training techniques
from conditional diffusion models. In this study, we extended the I2SB method
by introducing the Implicit Image-to-Image Schrödinger Bridge (I3SB),
transitioning its generative process to a non-Markovian process by
incorporating corrupted images in each generative step. This enhancement
empowers I3SB to generate images with better texture restoration using a small
number of generative steps. The proposed method was validated on CT
super-resolution and denoising tasks and outperformed existing methods,
including the conditional denoising diffusion probabilistic model (cDDPM) and
I2SB, in both visual quality and quantitative metrics. These findings
underscore the potential of I3SB in improving medical image restoration by
providing fast and accurate generative modeling.