375 research outputs found
Agent oriented fault detection, isolation and recovery and aspect-oriented plug-and-play tracking mechanism
Fault detection, isolation, and recovery (FDIR) are among the most critical activities in which astronauts and flight controllers participate. Recent systems that perform the FDIR activity lack portability and extensibility, and do not provide any explanation of the system's activity. In this research, we explore the use of an agent-oriented paradigm and Java technology to improve the performance of the FDIR activity. We have also explored the use of explanation in agent-oriented systems and designed a system-activity tracking mechanism that helps the user understand the agents' behavior. We have explored different ways to generalize this mechanism for use by arbitrary agent systems. Furthermore, we studied mechanisms to automatically add the tracking mechanism to an existing agent system. Using AspectJ, an aspect-oriented tool, we built a plug-and-play tracking system that can easily add activity tracking to any JACK agent system. Our experience can help further research on combining aspect-oriented tools with agent-oriented paradigms to obtain better performance.
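The paper weaves its tracking mechanism into JACK agents with AspectJ; as a rough analogue only, the Python sketch below uses a decorator as the cross-cutting hook, logging every agent action without touching the agent code itself. The agent class and method names are hypothetical.

```python
# Illustrative analogue, not the paper's AspectJ/JACK implementation: a Python
# decorator plays the role of the cross-cutting tracking aspect.
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def track_activity(func):
    """Record entry, arguments, and result of an agent action (hypothetical hook)."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        logging.info("agent=%s action=%s args=%s", type(self).__name__, func.__name__, args)
        result = func(self, *args, **kwargs)
        logging.info("agent=%s action=%s -> %s", type(self).__name__, func.__name__, result)
        return result
    return wrapper


class FaultMonitorAgent:
    """Hypothetical FDIR agent; the decorator, not the agent logic, is the point."""

    @track_activity
    def isolate(self, component):
        return f"isolated {component}"


if __name__ == "__main__":
    FaultMonitorAgent().isolate("coolant-pump")
```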
An Online Sparse Streaming Feature Selection Algorithm
Online streaming feature selection (OSFS), which conducts feature selection in an online manner, plays an important role in dealing with high-dimensional data. In many real applications, such as intelligent healthcare platforms, streaming features often contain missing data, which raises a crucial challenge in conducting OSFS: how to establish the uncertain relationship between sparse streaming features and labels. Unfortunately, existing OSFS algorithms never consider such an uncertain relationship. To fill this gap, in this paper we propose an online sparse streaming feature selection with uncertainty (OS2FSU) algorithm. OS2FSU consists of two main parts: 1) latent factor analysis is utilized to pre-estimate the missing data in sparse streaming features before conducting feature selection, and 2) fuzzy logic and neighborhood rough sets are employed to alleviate the uncertainty between the estimated streaming features and labels during feature selection. In the experiments, OS2FSU is compared with five state-of-the-art OSFS algorithms on six real datasets. The results demonstrate that OS2FSU outperforms its competitors when missing data are encountered in OSFS.
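A minimal sketch of the first stage described above, not the authors' code: missing entries of the streaming feature matrix are pre-estimated with a plain SGD latent factor (matrix factorization) model fitted only on observed cells. The fuzzy/rough-set relevance step is omitted here; rank, learning rate, and data are placeholders.

```python
import numpy as np


def latent_factor_impute(X, rank=2, lr=0.01, reg=0.05, epochs=200, rng=None):
    """Fill NaN entries of X by fitting X ~ U @ V.T on the observed cells only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((d, rank))
    observed = np.argwhere(~np.isnan(X))
    for _ in range(epochs):
        for i, j in observed:
            ui, vj = U[i].copy(), V[j].copy()
            err = X[i, j] - ui @ vj
            U[i] += lr * (err * vj - reg * ui)   # gradient step on the user/row factor
            V[j] += lr * (err * ui - reg * vj)   # gradient step on the feature/column factor
    return np.where(np.isnan(X), U @ V.T, X)     # keep observed values, fill the gaps


if __name__ == "__main__":
    X = np.array([[1.0, np.nan, 3.0],
                  [2.0, 2.5, np.nan],
                  [np.nan, 1.0, 4.0],
                  [4.0, 3.5, 2.0]])
    print(latent_factor_impute(X).round(2))
```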
Active algorithm selection
Most previous studies on active learning have focused on the problem of model selection, i.e., how to identify the optimal classification model from a family of predefined models using a small, carefully selected training set. In this paper, we address the problem of active algorithm selection. The goal is to efficiently identify the optimal learning algorithm for a given dataset from a set of algorithms using a small training set. We present a general framework for active algorithm selection by extending the idea of the Hedge algorithm. It employs worst-case analysis to identify the example that most effectively increases the weighted loss function defined in the Hedge algorithm. We further extend the framework by incorporating correlation information among unlabeled examples to accurately estimate the change in the weighted loss function, and Maximum Entropy Discrimination to automatically determine the combination weights used by the Hedge algorithm. Our empirical study with the datasets of the WCCI 2006 performance prediction challenge shows promising performance of the proposed framework for active algorithm selection.
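A minimal sketch, assuming the standard Hedge update: each candidate algorithm keeps a weight that decays exponentially with its cumulative loss, and the next example to label is the one whose worst-case labeling would raise the weighted loss the most. The query rule below is a simplification for illustration, not the paper's exact criterion.

```python
import numpy as np


def hedge_update(weights, losses, eta=0.5):
    """Multiplicative-weights step: w_k <- w_k * exp(-eta * loss_k), renormalized."""
    w = weights * np.exp(-eta * losses)
    return w / w.sum()


def worst_case_query(candidate_losses, weights):
    """Pick the unlabeled example whose worst possible label maximizes weighted loss.

    candidate_losses: array of shape (n_examples, n_labels, n_algorithms) giving each
    algorithm's loss on example i if its true label were y (hypothetical estimates).
    """
    weighted = np.einsum("iyk,k->iy", candidate_losses, weights)  # (n_examples, n_labels)
    worst_per_example = weighted.max(axis=1)
    return int(worst_per_example.argmax())


if __name__ == "__main__":
    w = np.ones(3) / 3                              # three candidate algorithms
    observed_losses = np.array([0.2, 0.5, 0.1])     # losses on a newly labeled example
    w = hedge_update(w, observed_losses)
    pool = np.random.default_rng(0).random((5, 2, 3))  # 5 pool examples, binary labels
    print("weights:", w.round(3), "query index:", worst_case_query(pool, w))
```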
DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder
Generating high-quality, person-generic visual dubbing remains a challenge. Recent work has introduced a two-stage paradigm that decouples rendering from lip synchronization, using an intermediate representation as a conduit. Still, previous methodologies rely on rough landmarks or are confined to a single speaker, which limits their performance. In this paper, we propose DiffDub: diffusion-based dubbing. We first craft a diffusion auto-encoder with an inpainting renderer that incorporates a mask to delineate editable zones and unaltered regions. This allows seamless filling of the lower-face region while preserving the remaining parts. Throughout our experiments, we encountered several challenges. Primarily, the semantic encoder lacked robustness, constricting its ability to capture high-level features. In addition, the modeling ignored facial positioning, causing mouth or nose jitter across frames. To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance. Moreover, we incorporate a conformer-based reference encoder and motion generator fortified by a cross-attention mechanism. This enables our model to learn person-specific textures from varying references and reduces reliance on paired audio-visual data. Our rigorous experiments show that our approach outperforms existing methods by considerable margins and delivers seamless, intelligible videos in person-generic and multilingual scenarios. Comment: 5 pages, Accepted to ICASSP 202
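A rough sketch, not the released DiffDub code: the renderer is conditioned on the frame with the lower face masked out plus the mask itself, so the model only has to inpaint the editable region, while person-specific texture enters through cross-attention to reference-frame features. Shapes, dimensions, and module names below are assumptions for illustration.

```python
import torch
import torch.nn as nn


def build_inpainting_input(frame, lower_face_mask):
    """Zero out the editable lower-face region and append the mask as an extra channel."""
    masked = frame * (1.0 - lower_face_mask)             # keep upper face / background
    return torch.cat([masked, lower_face_mask], dim=1)   # (B, C+1, H, W)


class ReferenceCrossAttention(nn.Module):
    """Motion tokens attend to reference-frame tokens (assumed dimensions)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, motion_tokens, reference_tokens):
        fused, _ = self.attn(motion_tokens, reference_tokens, reference_tokens)
        return fused


if __name__ == "__main__":
    frame = torch.rand(1, 3, 64, 64)
    mask = torch.zeros(1, 1, 64, 64)
    mask[:, :, 32:, :] = 1.0                  # mark the lower half of the face as editable
    x = build_inpainting_input(frame, mask)   # -> (1, 4, 64, 64) renderer input
    fused = ReferenceCrossAttention()(torch.rand(1, 10, 256), torch.rand(1, 20, 256))
    print(x.shape, fused.shape)
```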
Knowledge Transfer from Pre-trained Language Models to Cif-based Speech Recognizers via Hierarchical Distillation
Large-scale pre-trained language models (PLMs) have shown great potential in natural language processing tasks. Leveraging the capabilities of PLMs to enhance automatic speech recognition (ASR) systems has also emerged as a promising research direction. However, previous works may be limited by the inflexible structures of PLMs and the insufficient utilization of PLMs. To alleviate these problems, we propose hierarchical knowledge distillation (HKD) for continuous integrate-and-fire (CIF) based ASR models. To transfer knowledge from PLMs to the ASR models, HKD employs cross-modal knowledge distillation with a contrastive loss at the acoustic level and knowledge distillation with a regression loss at the linguistic level. Compared with the original CIF-based model, our method achieves 15% and 9% relative error rate reductions on the AISHELL-1 and LibriSpeech datasets, respectively. Comment: Accepted by INTERSPEECH 202
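A minimal sketch of the two distillation terms described above, with assumed shapes: an InfoNCE-style contrastive loss pulls CIF acoustic embeddings toward the matching PLM token embeddings, and a plain MSE regression loss is applied at the linguistic level. This illustrates the general pattern, not the paper's exact loss formulation.

```python
import torch
import torch.nn.functional as F


def contrastive_distill(acoustic, plm, temperature=0.1):
    """InfoNCE over positions: row i of `acoustic` should match row i of `plm`."""
    a = F.normalize(acoustic, dim=-1)        # (T, D) CIF outputs, one vector per token
    p = F.normalize(plm, dim=-1)             # (T, D) teacher PLM embeddings
    logits = a @ p.t() / temperature         # (T, T) pairwise similarities
    targets = torch.arange(a.size(0))        # the diagonal entries are the positives
    return F.cross_entropy(logits, targets)


def regression_distill(student_linguistic, plm):
    """Linguistic-level KD: regress student hidden states onto teacher states."""
    return F.mse_loss(student_linguistic, plm)


if __name__ == "__main__":
    T, D = 8, 256
    acoustic, linguistic, teacher = torch.rand(T, D), torch.rand(T, D), torch.rand(T, D)
    loss = contrastive_distill(acoustic, teacher) + regression_distill(linguistic, teacher)
    print(float(loss))
```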
DMRM: A Dual-channel Multi-hop Reasoning Model for Visual Dialog
Visual Dialog is a vision-language task that requires an AI agent to engage in a conversation with humans grounded in an image. It remains challenging because the agent must fully understand a given question before making an appropriate response, drawing not only on the textual dialog history but also on the visually grounded information. Previous models typically leverage single-hop or single-channel reasoning to deal with this complex multimodal reasoning task, which is intuitively insufficient. In this paper, we therefore propose a novel and more powerful Dual-channel Multi-hop Reasoning Model for Visual Dialog, named DMRM. DMRM synchronously captures information from the dialog history and the image to enrich the semantic representation of the question by exploiting dual-channel reasoning. Specifically, DMRM maintains a dual channel to obtain question- and history-aware image features and question- and image-aware dialog history features via a multi-hop reasoning process in each channel. Additionally, we design an effective multimodal attention mechanism to further enhance the decoder and generate more accurate responses. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that the proposed model is effective and outperforms compared models by a significant margin. Comment: Accepted at AAAI 202
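A minimal sketch, with assumed shapes, of the dual-channel multi-hop pattern: one channel repeatedly refines the question against image features, the other against dialog-history features, for a fixed number of hops, and the two refined states are then concatenated. This illustrates the pattern only, not the actual DMRM architecture.

```python
import torch
import torch.nn.functional as F


def attend(query, memory):
    """Single attention hop: weight memory rows by similarity to the query vector."""
    scores = memory @ query                  # (N,) dot-product relevance
    weights = F.softmax(scores, dim=0)
    return weights @ memory                  # (D,) attended summary


def dual_channel_multihop(question, image_feats, history_feats, hops=2):
    """Refine the question in two channels over several hops, then fuse."""
    q_img, q_hist = question.clone(), question.clone()
    for _ in range(hops):
        q_img = q_img + attend(q_img, image_feats)       # image-channel refinement
        q_hist = q_hist + attend(q_hist, history_feats)  # history-channel refinement
    return torch.cat([q_img, q_hist], dim=-1)


if __name__ == "__main__":
    D = 128
    question = torch.rand(D)
    out = dual_channel_multihop(question, torch.rand(36, D), torch.rand(10, D))
    print(out.shape)  # torch.Size([256])
```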
VILAS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition
Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also be beneficial in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, empirical results are reported on the public Flickr8K and self-constructed VSDial datasets. We explore various cross-modal fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and provide insights into the effects of integrating multimodal information on speech recognition. Comment: Accepted to ICASSP 202
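A sketch with assumed dimensions only: CIF yields one acoustic embedding per emitted token, and the visual and textual context are folded in with two cross-attention blocks that can be applied together or individually, mirroring the "simultaneously or separately" integration described above. Module names and sizes are placeholders, not the ViLaS implementation.

```python
import torch
import torch.nn as nn


class ContextFusion(nn.Module):
    """Fuse CIF acoustic tokens with optional visual and textual context."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.vision_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cif_tokens, vision_ctx=None, text_ctx=None):
        x = cif_tokens                                    # (B, T, D) CIF acoustic embeddings
        if vision_ctx is not None:                        # (B, Nv, D) image-region features
            x = x + self.vision_attn(x, vision_ctx, vision_ctx)[0]
        if text_ctx is not None:                          # (B, Nt, D) context-word embeddings
            x = x + self.text_attn(x, text_ctx, text_ctx)[0]
        return x


if __name__ == "__main__":
    fuse = ContextFusion()
    out = fuse(torch.rand(2, 12, 256), torch.rand(2, 5, 256), torch.rand(2, 8, 256))
    print(out.shape)  # torch.Size([2, 12, 256])
```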