288 research outputs found

    Robust Kalman filters with unknown covariance of multiplicative noise

    In this paper, state and noise covariance estimation problems for linear systems with unknown multiplicative noise are considered. The measurement likelihood is modelled as a mixture of two Gaussian distributions and a Student's t distribution, respectively. The unknown covariance of multiplicative noise is modelled as an inverse Gamma/Wishart distribution and the initial condition is formulated as the nominal covariance. By using robust design and choosing hierarchical priors, two variational Bayesian (VB) based robust Kalman filters are proposed. Stability and convergence of the proposed filters, the covariance parameters, the VB inference, and the estimation error dynamics are analyzed. Lower and upper bounds are also provided to guarantee the performance of the proposed filters. A target tracking simulation is provided to validate the effectiveness of the proposed filters.

    MV-Map: Offboard HD-Map Generation with Multi-view Consistency

    While bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from different viewpoints. This is because BEV perception is typically set up in an 'onboard' manner, which restricts the computation and consequently prevents algorithms from reasoning over multiple views simultaneously. This paper overcomes these limitations and advocates a more practical 'offboard' HD-Map generation setup that removes the computation constraints, based on the fact that HD-Maps are commonly reusable infrastructures built offline in data centers. To this end, we propose a novel offboard pipeline called MV-Map that capitalizes on multi-view consistency and can handle an arbitrary number of frames with the key design of a 'region-centric' framework. In MV-Map, the target HD-Maps are created by aggregating all the frames of onboard predictions, weighted by the confidence scores assigned by an 'uncertainty network'. To further enhance multi-view consistency, we augment the uncertainty network with the global 3D structure optimized by a voxelized neural radiance field (Voxel-NeRF). Extensive experiments on nuScenes show that our MV-Map significantly improves the quality of HD-Maps, further highlighting the importance of offboard methods for HD-Map generation. Comment: ICCV 202
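
    The confidence-weighted aggregation mentioned in the abstract above can be illustrated with a toy sketch. This is not MV-Map's implementation: the function name `fuse_predictions`, the scalar per-frame scores, and the fallback for zero total confidence are all assumptions made for illustration.

```python
def fuse_predictions(predictions, confidences):
    """Confidence-weighted average of per-frame scores for one map region.

    Hypothetical helper: `predictions` holds one score per frame and
    `confidences` holds the matching non-negative weights.
    """
    if len(predictions) != len(confidences):
        raise ValueError("predictions and confidences must have equal length")
    if not predictions:
        raise ValueError("at least one frame is required")
    total = sum(confidences)
    if total == 0:
        # Assumed fallback: with no confident frame, use a plain average.
        return sum(predictions) / len(predictions)
    weighted = sum(p * c for p, c in zip(predictions, confidences))
    return weighted / total


# Two frames disagree; the more confident frame dominates the fused score.
fused = fuse_predictions([1.0, 0.0], [3.0, 1.0])
print(fused)
```

    In this sketch a frame with three times the confidence contributes three times the weight, so frames the uncertainty model trusts less have less influence on the fused region.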

    Unsupervised Active Learning: Optimizing Labeling Cost-Effectiveness for Automatic Speech Recognition

    In recent years, speech-based self-supervised learning (SSL) has made significant progress in various tasks, including automatic speech recognition (ASR). An ASR model with decent performance can be realized by fine-tuning an SSL model with a small fraction of labeled data. Reducing the demand for labeled data is always of great practical value. In this paper, we further extend the use of SSL to cut down labeling costs with active learning. Three types of units on different granularities are derived from speech signals in an unsupervised way, and their effects are compared by applying a contrastive data selection method. The experimental results show that our proposed data selection framework can effectively improve the word error rate (WER) by more than 11% with the same amount of labeled data, or halve the labeling cost while maintaining the same WER, compared to random selection. Comment: 5 pages, 3 figures. Accepted to Interspeech 202

    Frozen Transformers in Language Models Are Effective Visual Encoder Layers

    This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a simple yet previously overlooked strategy -- employing a frozen transformer block from pre-trained LLMs as a constituent encoder layer to directly process visual tokens. Our work pushes the boundaries of leveraging LLMs for computer vision tasks, significantly departing from conventional practices that typically necessitate a multi-modal vision-language setup with associated language prompts, inputs, or outputs. We demonstrate that our approach consistently enhances performance across a diverse range of tasks, encompassing pure 2D and 3D visual recognition tasks (e.g., image and point cloud classification), temporal modeling tasks (e.g., action recognition), non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g., 2D/3D visual question answering and image-text retrieval). Such improvements are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and OPT) and different LLM transformer blocks. We additionally propose the information filtering hypothesis to explain the effectiveness of pre-trained LLMs in visual encoding -- the pre-trained LLM transformer blocks discern informative visual tokens and further amplify their effect. This hypothesis is empirically supported by the observation that the feature activation, after training with LLM transformer blocks, exhibits a stronger focus on relevant regions. We hope that our work inspires new perspectives on utilizing LLMs and deepening our understanding of their underlying mechanisms. Code is available at https://github.com/ziqipang/LM4VisualEncoding. Comment: 23 pages, 13 figures. Code at https://github.com/ziqipang/LM4VisualEncodin

    VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching

    Although diffusion models in text-to-speech have become a popular choice due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the process of generating mel-spectrograms as an ordinary differential equation conditional on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens its sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow. Comment: 4 figures, 5 pages, submitted to ICASSP 202
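
    To illustrate why a straightened sampling trajectory permits few sampling steps, here is a toy sketch of fixed-step Euler integration of an ODE dx/dt = v(x, t) from t = 0 to t = 1. It is not VoiceFlow's code: the names `euler_sample` and `straight_field`, the two-dimensional state, and the constant vector field are assumptions chosen for illustration, and no text conditioning or learned model is involved.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical, perfectly straight field: the velocity is the constant
# displacement from the start point (0, 0) to the target point (1, 2).
def straight_field(x, t):
    return [1.0, 2.0]


one_step = euler_sample([0.0, 0.0], straight_field, num_steps=1)
many_steps = euler_sample([0.0, 0.0], straight_field, num_steps=100)
print(one_step, many_steps)
```

    Because the toy field is constant along the path, a single Euler step already lands on the same endpoint as one hundred steps (up to floating-point rounding). A curved trajectory would not have this property, which is the intuition behind straightening.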

    Continuous-mode quantum key distribution with digital signal processing

    Continuous-variable quantum key distribution (CVQKD) offers the specific advantage of sharing keys remotely by the use of standard telecom components, thereby promoting cost-effective and high-performance metropolitan applications. Nevertheless, the introduction of high-rate spectrum broadening has pushed CVQKD from a single-mode to a continuous-mode region, resulting in the adoption of modern digital signal processing (DSP) technologies to recover quadrature information from continuous-mode quantum states. However, the security proof of DSP involving multi-point processing is a missing step. Here, we propose a generalized method of analyzing continuous-mode state processing by linear DSP via temporal-modes theory. The construction of temporal modes is key in reducing the security proof to single-mode scenarios. The proposed practicality-oriented security analysis method paves the way for building classical-compatible digital CVQKD. Comment: 10 pages, 4 figures

    Lightweight Neural Path Planning

    Learning-based path planning is becoming a promising robot navigation methodology due to its adaptability to various environments. However, the expensive computing and storage associated with networks impose significant challenges for their deployment on low-cost robots. Motivated by this practical challenge, we develop a lightweight neural path planning architecture with a dual input network and a hybrid sampler for resource-constrained robotic systems. Our architecture is designed with efficient task feature extraction and fusion modules to translate the given planning instance into a guidance map. The hybrid sampler is then applied to restrict the planning within the prospective regions indicated by the guidance map. To enable the network training, we further construct a publicly available dataset with various successful planning instances. Numerical simulations and physical experiments demonstrate that, compared with baseline approaches, our approach has nearly an order of magnitude smaller model size and five times lower computational cost while achieving promising performance. In addition, our approach can accelerate the planning convergence process with fewer planning iterations compared to sample-based methods. Comment: 8 pages