Functional Linear Non-Gaussian Acyclic Model for Causal Discovery
In causal discovery, non-Gaussianity has been used to characterize the
complete configuration of a Linear Non-Gaussian Acyclic Model (LiNGAM),
encompassing both the causal ordering of variables and their respective
connection strengths. However, LiNGAM can only deal with the finite-dimensional
case. To expand this concept, we extend the notion of variables to encompass
vectors and even functions, leading to the Functional Linear Non-Gaussian
Acyclic Model (Func-LiNGAM). Our motivation stems from the desire to identify
causal relationships in brain-effective connectivity tasks involving, for
example, fMRI and EEG datasets. We demonstrate why the original LiNGAM fails to
handle these inherently infinite-dimensional datasets and explain the
applicability of functional data analysis from both empirical and theoretical
perspectives. We establish theoretical guarantees for the identifiability of
the causal relationships among non-Gaussian random vectors and even random
functions in infinite-dimensional Hilbert spaces. To address the issue of
sparsity in discrete time points within intrinsically infinite-dimensional
functional data, we propose optimizing the coordinates of the vectors using
functional principal component analysis. Experimental results on synthetic data
verify the ability of the proposed framework to identify causal relationships
among multivariate functions using the observed samples. For real data, we
focus on analyzing the brain connectivity patterns derived from fMRI data.
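As an illustrative aside, the coordinate-extraction step described above can be sketched in a few lines: each functional variable observed on a common time grid is reduced to a handful of principal-component scores, and a standard LiNGAM estimator is then run on the stacked scores. This is a minimal sketch under assumptions, not the authors' Func-LiNGAM implementation; using scikit-learn's PCA as a stand-in for FPCA (reasonable for densely, evenly sampled curves) and the third-party `lingam` package's DirectLiNGAM are both assumptions for illustration.

```python
# Minimal sketch: reduce functional observations to FPCA-like coordinates,
# then run a standard LiNGAM estimator on the scores.
# Assumes densely, evenly sampled curves so PCA approximates FPCA,
# and uses the third-party `lingam` package (pip install lingam scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
import lingam

rng = np.random.default_rng(0)
n_samples, n_timepoints, n_components = 200, 100, 3

# Toy functional data: X1(t) causes X2(t) through a linear map plus
# non-Gaussian (uniform) noise, mimicking the LiNGAM assumptions.
basis = rng.standard_normal((5, n_timepoints))
X1 = rng.uniform(-1, 1, (n_samples, 5)) @ basis
X2 = 0.8 * X1 + rng.uniform(-0.3, 0.3, (n_samples, 5)) @ basis

# Step 1: per-variable dimension reduction (PCA scores stand in for FPCA coordinates).
scores = [PCA(n_components=n_components).fit_transform(X) for X in (X1, X2)]

# Step 2: run DirectLiNGAM on the concatenated coordinate vectors.
Z = np.hstack(scores)                      # shape: (n_samples, 2 * n_components)
model = lingam.DirectLiNGAM()
model.fit(Z)
print("estimated causal order of coordinates:", model.causal_order_)
```

In the paper's setting the identifiability of the ordering among the coordinate blocks is what the theoretical guarantees cover; this snippet only illustrates the FPCA-coordinate idea the abstract refers to.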
Multi-energy X-ray linear-array detector enabled by the side-illuminated metal halide scintillator
Conventional scintillator-based X-ray imaging typically captures the full
spectrum of X-ray photons without distinguishing their energy. However, the
absence of X-ray spectral information often results in insufficient image
contrast, particularly for substances possessing similar atomic numbers and
densities. In this study, we present an innovative multi-energy X-ray
linear-array detector that leverages side-illuminated X-ray scintillation using
emerging metal halide Cs3Cu2I5. Its negligible self-absorption not only
improves the scintillation output but also benefits the energy resolution in
side-illuminated scintillation scenarios. By
exploiting Beer's law, which governs the absorption of X-ray photons with
different energies, the incident X-ray spectrum can be reconstructed by
analyzing the distribution of scintillation intensity when the scintillator is
illuminated from the side. The relative error between the reconstructed and
measured X-ray spectra was less than 5.63%. Our method offers an additional
energy-resolving capability for X-ray linear-array detectors commonly used in
computed tomography (CT) imaging setups, surpassing the capabilities of
conventional energy-integration approaches, all without requiring extra
hardware components. A proof-of-concept multi-energy CT imaging system
featuring eight energy channels was successfully implemented. This study
presents a simple and efficient strategy for achieving multi-energy X-ray
detection and CT imaging based on emerging metal halides.
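The reconstruction principle can be sketched from Beer's law: for photons of energy E with attenuation coefficient mu(E), the fraction absorbed (and converted to scintillation light) in a depth slice [d, d + dd] is exp(-mu(E) d) * (1 - exp(-mu(E) dd)), so the measured depth profile of scintillation intensity is a linear mixture of the unknown spectral weights and can be inverted, for example by non-negative least squares. The attenuation values, energy bins and geometry below are placeholders for illustration, not the paper's calibration.

```python
# Minimal sketch of Beer's-law-based spectrum reconstruction from a
# side-illuminated depth profile of scintillation intensity.
# The attenuation coefficients mu(E) and the geometry are made-up placeholders.
import numpy as np
from scipy.optimize import nnls

energies = np.array([20.0, 40.0, 60.0, 80.0])        # keV bins (assumed)
mu = np.array([8.0, 2.0, 0.9, 0.5])                  # 1/cm, decreasing with energy (assumed)
depths = np.linspace(0.0, 1.0, 64)                   # cm, pixel positions along the side
dd = depths[1] - depths[0]

# System matrix: A[i, j] = fraction of energy-j photons absorbed in depth bin i.
A = np.exp(-np.outer(depths, mu)) * (1.0 - np.exp(-mu * dd))

# Simulate a depth profile from a "true" spectrum, then invert it.
true_spectrum = np.array([0.10, 0.40, 0.35, 0.15])
profile = A @ true_spectrum + 1e-4 * np.random.default_rng(1).standard_normal(len(depths))

reconstructed, _ = nnls(A, profile)                  # non-negative least squares
reconstructed /= reconstructed.sum()
print("reconstructed spectrum:", np.round(reconstructed, 3))
```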
PromptASR for contextualized ASR with controllable style
Prompts are crucial to large language models as they provide context
information such as topic or logical relationships. Inspired by this, we
propose PromptASR, a framework that integrates prompts in end-to-end automatic
speech recognition (E2E ASR) systems to achieve contextualized ASR with
controllable style of transcriptions. Specifically, a dedicated text encoder
encodes the text prompts and the encodings are injected into the speech encoder
by cross-attending over the features from the two modalities. When the ground
truth text from preceding utterances is used as the content prompt, the proposed system
achieves 21.9% and 6.8% relative word error rate reductions on a book reading
dataset and an in-house dataset compared to a baseline ASR system. The system
can also take word-level biasing lists as prompts to improve recognition
accuracy on rare words. An additional style prompt can be given to the text
encoder and guide the ASR system to output different styles of transcriptions.
The code is available at icefall.
Comment: Submitted to ICASSP 202
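The injection mechanism described above can be sketched as a cross-attention layer in which speech-encoder features attend to the text-prompt encodings. The module layout, dimensions and names below are illustrative assumptions, not the PromptASR code in icefall.

```python
# Minimal sketch: inject text-prompt encodings into speech features via
# cross-attention. Shapes and module layout are illustrative assumptions.
import torch
import torch.nn as nn

class PromptInjection(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, speech_feats, prompt_encodings):
        # speech_feats: (batch, T_speech, d_model) from the speech encoder
        # prompt_encodings: (batch, T_text, d_model) from the text encoder
        attended, _ = self.cross_attn(
            query=speech_feats, key=prompt_encodings, value=prompt_encodings
        )
        # Residual connection keeps the acoustic information intact.
        return self.norm(speech_feats + attended)

speech = torch.randn(2, 100, 256)    # dummy speech-encoder output
prompt = torch.randn(2, 30, 256)     # dummy text-prompt encoding
out = PromptInjection()(speech, prompt)
print(out.shape)                     # torch.Size([2, 100, 256])
```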
Delay-penalized CTC implemented based on Finite State Transducer
Connectionist Temporal Classification (CTC) suffers from the latency problem
when applied to streaming models. We argue that in the CTC lattice, the alignments
that can access more future context are preferred during training, thereby
leading to higher symbol delay. In this work we propose the delay-penalized CTC
which is augmented with latency penalty regularization. We devise a flexible
and efficient implementation based on the differentiable Finite State
Transducer (FST). Specifically, by attaching a binary attribute to CTC
topology, we can locate the frames that first emit non-blank tokens on the
resulting CTC lattice, and add the frame offsets to the log-probabilities.
Experimental results demonstrate the effectiveness of our proposed
delay-penalized CTC, which is able to balance the delay-accuracy trade-off.
Furthermore, combining it with the delay-penalized transducer enables the CTC
model to achieve better performance and lower latency. Our work is open-sourced
and publicly available at https://github.com/k2-fsa/k2.
Comment: Accepted in INTERSPEECH 202
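The paper implements the penalty on the CTC lattice with k2's differentiable FSTs; the toy sketch below only illustrates the underlying idea on top of the standard PyTorch CTC loss by adding a frame-dependent offset to all non-blank log-probabilities. This is an approximation of the method (the exact approach offsets only the first-emission frames located on the lattice), and the penalty scale and helper name are assumptions.

```python
# Toy sketch: frame-dependent delay penalty added to non-blank log-probs
# before a standard CTC loss. This approximates the idea; the paper's exact
# method applies offsets to first-emission frames on the k2 FST lattice.
import torch
import torch.nn.functional as F

def delay_penalized_ctc(log_probs, targets, input_lens, target_lens,
                        blank: int = 0, penalty_scale: float = 0.01):
    # log_probs: (T, batch, vocab), already log-softmaxed
    T = log_probs.size(0)
    t = torch.arange(T, dtype=log_probs.dtype, device=log_probs.device)
    offset = penalty_scale * (T / 2.0 - t)          # reward early frames, penalize late ones
    penalized = log_probs.clone()
    non_blank = torch.ones(log_probs.size(-1), dtype=torch.bool, device=log_probs.device)
    non_blank[blank] = False
    penalized[:, :, non_blank] += offset.view(T, 1, 1)
    return F.ctc_loss(penalized, targets, input_lens, target_lens, blank=blank)

logits = torch.randn(50, 2, 30, requires_grad=True)           # (T, batch, vocab)
log_probs = logits.log_softmax(-1)
targets = torch.randint(1, 30, (2, 10))
loss = delay_penalized_ctc(log_probs, targets,
                           torch.full((2,), 50), torch.full((2,), 10))
loss.backward()
```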
Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context
In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting
of 50,000 hours of read English speech derived from LibriVox. To the best of
our knowledge, Libriheavy is the largest freely-available corpus of speech with
supervisions. Different from other open-sourced datasets that only provide
normalized transcriptions, Libriheavy contains richer information such as
punctuation, casing and text context, which brings more flexibility for system
building. Specifically, we propose a general and efficient pipeline to locate,
align and segment the audio in the previously published Librilight corpus against
its corresponding texts. Like Librilight, Libriheavy also has three training
subsets, small, medium and large, of 500h, 5,000h and 50,000h respectively. We
also extract the dev and test evaluation sets from the aligned audios and
guarantee that there are no overlapping speakers or books with the training sets. Baseline
systems are built on the popular CTC-Attention and transducer models.
Additionally, we open-source our dataset creation pipeline, which can also be
used for other audio alignment tasks.
Comment: Submitted to ICASSP 202
Zipformer: A faster and better encoder for automatic speech recognition
The Conformer has become the most popular encoder model for automatic speech
recognition (ASR). It adds convolution modules to a transformer to learn both
local and global dependencies. In this work we describe a faster, more
memory-efficient, and better-performing transformer, called Zipformer. Modeling
changes include: 1) a U-Net-like encoder structure where middle stacks operate
at lower frame rates; 2) reorganized block structure with more modules, within
which we re-use attention weights for efficiency; 3) a modified form of
LayerNorm called BiasNorm allows us to retain some length information; 4) new
activation functions SwooshR and SwooshL work better than Swish. We also
propose a new optimizer, called ScaledAdam, which scales the update by each
tensor's current scale to keep the relative change about the same, and also
explicitly learns the parameter scale. It achieves faster convergence and better
performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and
WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer
over other state-of-the-art ASR models. Our code is publicly available at
https://github.com/k2-fsa/icefall.
Comment: Published as a conference paper at ICLR 202
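A simplified sketch of the update-scaling rule described for ScaledAdam is given below: the Adam-style normalized gradient is multiplied by the tensor's current RMS, so the relative parameter change per step stays roughly constant across tensors. This is not the full icefall implementation (which additionally learns the parameter scale explicitly and handles parameter grouping); the class name and hyperparameters are assumptions.

```python
# Simplified sketch of the ScaledAdam scaling idea: scale each tensor's
# Adam-style update by the tensor's current RMS so the *relative* change
# per step is roughly constant. The full ScaledAdam in icefall additionally
# learns the parameter scale explicitly; this sketch omits that part.
import torch

class SimpleScaledAdam(torch.optim.Optimizer):
    def __init__(self, params, lr=3e-2, betas=(0.9, 0.98), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2, eps, lr = *group["betas"], group["eps"], group["lr"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                st = self.state[p]
                if not st:
                    st["step"] = 0
                    st["m"] = torch.zeros_like(p)
                    st["v"] = torch.zeros_like(p)
                st["step"] += 1
                st["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                st["v"].mul_(b2).addcmul_(p.grad, p.grad, value=1 - b2)
                m_hat = st["m"] / (1 - b1 ** st["step"])
                v_hat = st["v"] / (1 - b2 ** st["step"])
                direction = m_hat / (v_hat.sqrt() + eps)     # Adam-normalized gradient
                param_rms = p.norm() / (p.numel() ** 0.5)    # tensor's current scale
                p.add_(direction, alpha=-lr * param_rms.clamp(min=1e-5).item())
```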
Delay-penalized transducer for low-latency streaming ASR
In streaming automatic speech recognition (ASR), it is desirable to reduce
latency as much as possible while having minimum impact on recognition
accuracy. Although a few existing methods are able to achieve this goal, they
are difficult to implement due to their dependency on external alignments. In
this paper, we propose a simple way to penalize symbol delay in the transducer
model, so that we can balance the trade-off between symbol delay and accuracy
for streaming models without external alignments. Specifically, our method adds
a small constant times (T/2 - t), where T is the number of frames and t is the
current frame, to all the non-blank log-probabilities (after normalization)
that are fed into the two-dimensional transducer recursion. For both streaming
Conformer models and unidirectional long short-term memory (LSTM) models,
experimental results show that it can significantly reduce the symbol delay
with acceptable performance degradation. Our method achieves a similar
delay-accuracy trade-off to the previously published FastEmit, but we believe
our method is preferable because it has a better justification: it is
equivalent to penalizing the average symbol delay. Our work is open-sourced and
publicly available (https://github.com/k2-fsa/k2).
Comment: Submitted to 2023 IEEE International Conference on Acoustics, Speech
and Signal Processing
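The penalty term itself is easy to sketch: after the joiner's log-softmax, an offset lambda * (T/2 - t) is added to every non-blank log-probability before the transducer forward-backward recursion. The sketch below only constructs the penalized log-probabilities; the loss computation itself (e.g., the pruned transducer loss used in icefall) is omitted, and the tensor layout and blank id are assumptions.

```python
# Sketch: add the delay penalty lambda * (T/2 - t) to non-blank
# log-probabilities before they enter the transducer recursion.
# Tensor layout (batch, T, U, vocab) and blank id are assumptions.
import torch

def add_delay_penalty(logits, blank: int = 0, lam: float = 0.01):
    # logits: (batch, T, U, vocab) joiner outputs
    log_probs = logits.log_softmax(dim=-1)          # "after normalization"
    T = log_probs.size(1)
    t = torch.arange(T, dtype=log_probs.dtype, device=log_probs.device)
    offset = lam * (T / 2.0 - t)                    # positive early, negative late
    penalized = log_probs.clone()
    mask = torch.ones(log_probs.size(-1), dtype=torch.bool, device=log_probs.device)
    mask[blank] = False
    penalized[:, :, :, mask] += offset.view(1, T, 1, 1)
    return penalized   # feed into the transducer forward-backward recursion

penalized = add_delay_penalty(torch.randn(2, 50, 11, 500))
print(penalized.shape)   # torch.Size([2, 50, 11, 500])
```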