Robust Kalman filters with unknown covariance of multiplicative noise
In this paper, state and noise covariance estimation problems for linear
systems with unknown multiplicative noise are considered. The measurement
likelihood is modelled as a mixture of two Gaussian distributions and a
Student's t distribution, respectively. The unknown covariance of
multiplicative noise is modelled as an inverse Gamma/Wishart distribution and
the initial condition is formulated as the nominal covariance. By using robust
design and choosing hierarchical priors, two variational Bayesian based robust
Kalman filters are proposed. Stability and convergence of the proposed filters,
the covariance parameters, the VB inference, and the estimation error dynamics
are analyzed. The lower and upper bounds are also provided to guarantee the
performance of the proposed filters. A target tracking simulation is provided
to validate the effectiveness of the proposed filters.
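For orientation, the sketch below shows one predict/update cycle of a plain scalar Kalman filter. It is not the variational Bayesian robust filter the abstract proposes: the random-walk state model, the fixed noise variances `q` and `r`, and the name `kalman_step` are all assumptions made for illustration, and the multiplicative-noise covariance that the paper treats as unknown is not modelled here.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    Assumes a random-walk state with known process variance q and known
    measurement variance r (illustrative assumptions, not the paper's model).
    """
    # Predict: the state estimate is carried over, uncertainty grows by q.
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement z using the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


# Track a roughly constant signal near 1.0 from noisy measurements.
x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0]:
    x, p = kalman_step(x, p, z, q=0.01, r=0.25)
```

After four measurements the estimate sits close to 1.0 and the error variance `p` has shrunk well below its initial value. A robust variant would additionally adapt the noise statistics rather than fixing them.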
MV-Map: Offboard HD-Map Generation with Multi-view Consistency
While bird's-eye-view (BEV) perception models can be useful for building
high-definition maps (HD-Maps) with less human labor, their results are often
unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps
from different viewpoints. This is because BEV perception is typically set up
in an 'onboard' manner, which restricts the computation and consequently
prevents algorithms from reasoning over multiple views simultaneously. This paper
overcomes these limitations and advocates a more practical 'offboard' HD-Map
generation setup that removes the computation constraints, based on the fact
that HD-Maps are commonly reusable infrastructures built offline in data
centers. To this end, we propose a novel offboard pipeline called MV-Map that
capitalizes on multi-view consistency and can handle an arbitrary number of frames
with the key design of a 'region-centric' framework. In MV-Map, the target
HD-Maps are created by aggregating all the frames of onboard predictions,
weighted by the confidence scores assigned by an 'uncertainty network'. To
further enhance multi-view consistency, we augment the uncertainty network with
the global 3D structure optimized by a voxelized neural radiance field
(Voxel-NeRF). Extensive experiments on nuScenes show that our MV-Map
significantly improves the quality of HD-Maps, further highlighting the
importance of offboard methods for HD-Map generation.
Comment: ICCV 202
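The aggregation step described in the abstract (per-frame predictions weighted by confidence scores) can be pictured with a minimal sketch. The function name `fuse_region`, the scalar per-region predictions, and the normalized weighted average are assumptions for illustration; the abstract does not specify the exact fusion rule or how the uncertainty network produces its scores.

```python
def fuse_region(predictions, confidences):
    """Confidence-weighted average of per-frame predictions for one map region.

    predictions: one value per frame (e.g. the probability that the region
                 contains a lane divider), as seen from that frame.
    confidences: one non-negative weight per frame (hypothetical stand-in
                 for the scores an uncertainty network would assign).
    """
    if len(predictions) != len(confidences):
        raise ValueError("need one confidence per prediction")
    total = sum(confidences)
    if total == 0:
        raise ValueError("at least one frame needs nonzero confidence")
    return sum(p * c for p, c in zip(predictions, confidences)) / total


# Three frames observe the same region; the low-confidence outlier (0.2)
# contributes little to the fused value.
fused = fuse_region([0.9, 0.2, 0.8], [0.5, 0.1, 0.4])
```

Because the weights are normalized per region, the same function handles any number of frames, which matches the "arbitrary number of frames" property the abstract highlights.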
Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition
In recent years, speech-based self-supervised learning (SSL) has made
significant progress in various tasks, including automatic speech recognition
(ASR). An ASR model with decent performance can be realized by fine-tuning an
SSL model with a small fraction of labeled data. Reducing the demand for
labeled data is always of great practical value. In this paper, we further
extend the use of SSL to cut down labeling costs with active learning. Three
types of units on different granularities are derived from speech signals in an
unsupervised way, and their effects are compared by applying a contrastive data
selection method. The experimental results show that our proposed data
selection framework can effectively improve the word error rate (WER) by more
than 11% with the same amount of labeled data, or halve the labeling cost while
maintaining the same WER, compared to random selection.
Comment: 5 pages, 3 figures. Accepted to Interspeech 202
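Since the result above is stated in terms of WER, a small sketch of how WER is usually computed may help: word-level edit distance divided by the reference length. The helper names `wer` and `relative_reduction` are illustrative, and reading the "more than 11%" figure as a relative reduction is an assumption, since the abstract does not say whether it is relative or absolute.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fractional WER reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


score = wer("the cat sat on the mat", "the cat sit on mat")
```

Here the hypothesis has one substitution and one deletion against a six-word reference, so `score` is 2/6.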
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
This paper reveals that large language models (LLMs), despite being trained
solely on textual data, are surprisingly strong encoders for purely visual
tasks in the absence of language. Even more intriguingly, this can be achieved
by a simple yet previously overlooked strategy -- employing a frozen
transformer block from pre-trained LLMs as a constituent encoder layer to
directly process visual tokens. Our work pushes the boundaries of leveraging
LLMs for computer vision tasks, significantly departing from conventional
practices that typically necessitate a multi-modal vision-language setup with
associated language prompts, inputs, or outputs. We demonstrate that our
approach consistently enhances performance across a diverse range of tasks,
encompassing pure 2D and 3D visual recognition tasks (e.g., image and point
cloud classification), temporal modeling tasks (e.g., action recognition),
non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g.,
2D/3D visual question answering and image-text retrieval). Such improvements
are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and
OPT) and different LLM transformer blocks. We additionally propose the
information filtering hypothesis to explain the effectiveness of pre-trained
LLMs in visual encoding -- the pre-trained LLM transformer blocks discern
informative visual tokens and further amplify their effect. This hypothesis is
empirically supported by the observation that the feature activation, after
training with LLM transformer blocks, exhibits a stronger focus on relevant
regions. We hope that our work inspires new perspectives on utilizing LLMs and
deepening our understanding of their underlying mechanisms. Code is available
at https://github.com/ziqipang/LM4VisualEncoding.
Comment: 23 pages, 13 figures. Code at
https://github.com/ziqipang/LM4VisualEncodin
VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching
Although diffusion models in text-to-speech have become a popular choice due
to their strong generative ability, the intrinsic complexity of sampling from
diffusion models harms their efficiency. Alternatively, we propose VoiceFlow,
an acoustic model that utilizes a rectified flow matching algorithm to achieve
high synthesis quality with a limited number of sampling steps. VoiceFlow
formulates the process of generating mel-spectrograms as an ordinary
differential equation conditional on text inputs, whose vector field is then
estimated. The rectified flow technique then effectively straightens its
sampling trajectory for efficient synthesis. Subjective and objective
evaluations on both single and multi-speaker corpora showed the superior
synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation
studies further verified the validity of the rectified flow technique in
VoiceFlow.
Comment: 4 figures, 5 pages, submitted to ICASSP 202
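Why a straightened trajectory permits few sampling steps can be seen in a toy one-dimensional example. This is only a sketch: the velocity fields below are hand-written stand-ins (in the paper the vector field is estimated and conditioned on text), and `euler_sample`, `noise`, and `target` are hypothetical names.

```python
def euler_sample(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x


noise, target = -1.5, 2.0


def straight(x, t):
    # Constant velocity: the path from noise to target is a straight line.
    return target - noise


def curved(x, t):
    # Same endpoints, but the speed varies along the path.
    return 2.0 * t * (target - noise)


straight_one_step = euler_sample(noise, straight, 1)
curved_one_step = euler_sample(noise, curved, 1)
curved_many_steps = euler_sample(noise, curved, 1000)
```

With the straight field a single Euler step lands exactly on the target. With the curved field a single step does not move at all, because the velocity at t=0 is zero, and many steps are needed to get close to the target.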
Continuous-mode quantum key distribution with digital signal processing
Continuous-variable quantum key distribution (CVQKD) offers the specific
advantage of sharing keys remotely by the use of standard telecom components,
thereby promoting cost-effective and high-performance metropolitan
applications. Nevertheless, the introduction of high-rate spectrum broadening
has pushed CVQKD from a single-mode to a continuous-mode region, resulting in
the adoption of modern digital signal processing (DSP) technologies to recover
quadrature information from continuous-mode quantum states. However, the
security proof of DSP involving multi-point processing is a missing step. Here,
we propose a generalized method of analyzing continuous-mode state processing
by linear DSP via temporal-modes theory. The construction of temporal modes is
key in reducing the security proof to single-mode scenarios. The proposed
practicality-oriented security analysis method paves the way for building
classical-compatible digital CVQKD.
Comment: 10 pages, 4 figures
Lightweight Neural Path Planning
Learning-based path planning is becoming a promising robot navigation
methodology due to its adaptability to various environments. However, the
expensive computing and storage associated with networks impose significant
challenges for their deployment on low-cost robots. Motivated by this practical
challenge, we develop a lightweight neural path planning architecture with a
dual input network and a hybrid sampler for resource-constrained robotic
systems. Our architecture is designed with efficient task feature extraction
and fusion modules to translate the given planning instance into a guidance
map. The hybrid sampler is then applied to restrict the planning within the
prospective regions indicated by the guide map. To enable the network training,
we further construct a publicly available dataset with various successful
planning instances. Numerical simulations and physical experiments demonstrate
that, compared with baseline approaches, our approach has nearly an order of
magnitude smaller model size and five times lower computational cost while achieving
promising performance. Besides, our approach can also accelerate the planning
convergence process with fewer planning iterations compared to sample-based
methods.
Comment: 8 pages
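One way to picture a sampler that restricts planning to promising regions is a mix of guided and uniform sampling on a grid. This is a sketch under stated assumptions, not the paper's sampler: the grid representation, the mixing probability `guide_prob`, and the name `hybrid_sample` are all hypothetical, and the guidance cells are hard-coded here rather than predicted by a network.

```python
import random


def hybrid_sample(guide_cells, width, height, guide_prob, rng):
    """Draw one grid cell for a sampling-based planner.

    With probability guide_prob, pick a cell the guidance map marks as
    promising; otherwise sample uniformly over the whole grid so that
    regions the map missed can still be explored.
    """
    if guide_cells and rng.random() < guide_prob:
        return rng.choice(guide_cells)
    return (rng.randrange(width), rng.randrange(height))


# Hypothetical guidance: a diagonal corridor on a 10x10 grid.
guide = [(i, i) for i in range(10)]
rng = random.Random(0)
samples = [hybrid_sample(guide, 10, 10, 0.8, rng) for _ in range(100)]
```

Keeping a uniform fallback is a common design choice in guided samplers, since it lets the planner still find a path when the guidance is wrong or incomplete.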