Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection
Out-of-distribution (OOD) detection is an important task for ensuring the
reliability and safety of deep learning, and discriminator models currently
outperform other approaches. However, the feature extraction of discriminator
models must compress the data and discard certain information, leaving room
for bad cases and malicious attacks. In this paper, we propose a new
assumption: discriminator models are more sensitive to some subareas of the
input space, and this perceptron bias causes bad cases and overconfidence
areas. Under this assumption, we design new detection methods and indicator
scores. For detection
methods, we introduce diffusion models (DMs) into OOD detection. We find that
the diffusion denoising process (DDP) of DMs also functions as a novel form of
asymmetric interpolation, well suited to enhancing the input and reducing the
overconfidence areas. For indicator scores, we find that the features that
discriminator models extract from OOD inputs undergo sharp changes under DDP,
and we use the norm of this dynamic change as our indicator score. We
therefore develop a new framework that combines discriminator and generative
models to perform OOD detection under our new assumption: the discriminator
models provide proper detection spaces, and the generative models reduce the
overconfidence problem. In experiments on CIFAR10 and CIFAR100, our methods
achieve results competitive with state-of-the-art methods. Our implementation is
available at https://github.com/luping-liu/DiffOOD
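As a rough illustration of the indicator score described above, the sketch below computes the norm of the change in a discriminator's features before and after a diffusion denoising pass. The callables `classifier_features` and `denoise` are hypothetical placeholders standing in for a pretrained classifier's feature extractor and a diffusion model's denoising process, not the released DiffOOD API.

```python
import torch

@torch.no_grad()
def ddp_ood_score(x, classifier_features, denoise):
    """Toy indicator score: norm of the discriminator-feature change under
    a diffusion denoising pass (all names here are placeholders).

    x                   -- batch of inputs, e.g. images of shape (B, C, H, W)
    classifier_features -- callable returning the discriminator's penultimate
                           features for a batch, shape (B, D)
    denoise             -- callable running a (partial) diffusion denoising
                           process on the batch
    """
    feats_before = classifier_features(x)           # features of the raw input
    feats_after = classifier_features(denoise(x))   # features after DDP enhancement
    # OOD inputs are expected to show a sharper feature shift; its norm is the score
    return torch.linalg.vector_norm(feats_after - feats_before, dim=-1)
```

A higher score would then flag an input as more likely out-of-distribution, with the threshold chosen on held-out in-distribution data.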
Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers
Recent research has demonstrated the significant potential of Large Language
Models (LLMs) in handling challenging tasks within 3D scenes. However, current
models are constrained to addressing object-centric tasks, where each
question-answer pair focuses solely on an individual object. In real-world
applications, users may pose queries involving multiple objects or expect
answers that precisely reference various objects. We introduce the use of
object identifiers to freely reference objects during a conversation. While
this solution appears straightforward, it presents two main challenges: 1) How
to establish a reliable one-to-one correspondence between each object and its
identifier? 2) How to incorporate complex spatial relationships among dozens of
objects into the embedding space of the LLM? To address these challenges, we
propose a two-stage alignment method, which involves learning an
attribute-aware token and a relation-aware token for each object. These tokens
capture the object's attributes and spatial relationships with surrounding
objects in the 3D scene. Once the alignment is established, we can fine-tune
our model on various downstream tasks using instruction tuning. Experiments
conducted on traditional datasets like ScanQA, ScanRefer, and Nr3D/Sr3D
showcase the effectiveness of our proposed method. Additionally, we create a 3D
scene captioning dataset annotated with rich object identifiers, with the
assistance of GPT-4. This dataset aims to further explore the capability of
object identifiers in effective object referencing and precise scene
understanding.
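As a loose sketch of how object identifiers could enter the LLM's input, the snippet below interleaves an identifier token with an attribute-aware and a relation-aware embedding for each object before appending the user's question. The token naming and sequence layout are assumptions based only on the abstract, not the released Chat-3D v2 code.

```python
def build_scene_prompt(object_tokens, question):
    """Interleave per-object identifiers with their learned embeddings.

    object_tokens -- list of (attr_token, rel_token) pairs, one per object,
                     produced by the two-stage alignment (placeholders here)
    question      -- the user's textual query
    Returns a mixed sequence of identifier strings, embeddings, and text
    that a multimodal LLM could consume after embedding the string parts.
    """
    sequence = []
    for obj_id, (attr_token, rel_token) in enumerate(object_tokens):
        sequence.append(f"<obj_{obj_id}>")  # identifier the model can cite in answers
        sequence.append(attr_token)         # attribute-aware token (appearance, category)
        sequence.append(rel_token)          # relation-aware token (spatial context)
    sequence.append(question)
    return sequence
```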
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one
language into another, and has demonstrated significant progress to date.
Despite the recent success, current S2ST models still suffer from distinct
degradation in noisy environments and fail to translate visual speech (i.e.,
the movement of lips and teeth). In this work, we present AV-TranSpeech, the
first audio-visual speech-to-speech (AV-S2ST) translation model without relying
on intermediate text. AV-TranSpeech complements the audio stream with visual
information to promote system robustness and opens up a host of practical
applications: dictating or dubbing archival films. To mitigate the data
scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised
pre-training with unlabeled audio-visual data to learn contextual
representation, and 2) introduce cross-modal distillation with S2ST models
trained on the audio-only corpus to further reduce the requirements of visual
data. Experimental results on two language pairs demonstrate that AV-TranSpeech
outperforms audio-only models under all settings regardless of the type of
noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation
yields an improvement of 7.6 BLEU on average compared with baselines. Audio
samples are available at https://AV-TranSpeech.github.io.
Comment: Accepted to ACL 2023
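The cross-modal distillation step can be pictured with the toy loss below, in which the audio-visual student's hidden states are regressed onto those of an audio-only S2ST teacher. The choice of an L2 objective over hidden states is an assumption for illustration and may differ from the paper's actual distillation target.

```python
import torch.nn.functional as F

def cross_modal_distillation_loss(student_hidden, teacher_hidden):
    """Regress the audio-visual student's hidden states onto those of an
    audio-only S2ST teacher (both of shape (B, T, D)); the L2 objective is
    an illustrative choice, not necessarily the paper's exact loss."""
    # teacher is frozen, so gradients flow only into the student
    return F.mse_loss(student_hidden, teacher_hidden.detach())
```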
Connecting Multi-modal Contrastive Representations
Multi-modal Contrastive Representation learning aims to encode different
modalities into a semantically aligned shared space. This paradigm shows
remarkable generalization ability on numerous downstream tasks across various
modalities. However, the reliance on massive high-quality data pairs limits its
further development on more modalities. This paper proposes a novel
training-efficient method for learning MCR without paired data called
Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given
two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project
them to a new space and use data from the overlapping modality B to align the
two MCRs in the new space. Meanwhile, since the modality pairs (A,
B) and (B, C) are already aligned within each MCR, the connection learned by
overlapping modality can also be transferred to non-overlapping modality pair
(A, C). To unleash the potential of C-MCR, we further introduce a
semantic-enhanced inter- and intra-MCR connection method. We first enhance the
semantic consistency and completion of embeddings across different modalities
for more robust alignment. Then we utilize the inter-MCR alignment to establish
the connection, and employ the intra-MCR alignment to better maintain the
connection for inputs from non-overlapping modalities. To demonstrate the
effectiveness of C-MCR, we connect CLIP and CLAP via texts to derive
audio-visual representations, and integrate CLIP and ULIP via images for
3D-language representations. Remarkably, without using any paired data, C-MCR
for audio-visual achieves state-of-the-art performance on audio-image
retrieval, audio-visual source localization, and counterfactual audio-image
recognition tasks. Furthermore, C-MCR for 3D-language also attains advanced
zero-shot 3D point cloud classification accuracy on ModelNet40.
Comment: NeurIPS 2023
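The core connection idea can be sketched as two small projection heads trained contrastively on the overlapping modality (text), after which images and audio reuse the same projections. The dimensions, the plain InfoNCE-style loss, and the module names below are assumptions for illustration; the paper's semantic-enhanced inter- and intra-MCR terms are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Projection heads mapping each MCR into a new shared space
# (the 512 -> 256 sizes are assumptions for illustration).
proj_clip = nn.Linear(512, 256)   # CLIP space -> shared space
proj_clap = nn.Linear(512, 256)   # CLAP space -> shared space

def connection_loss(clip_text_emb, clap_text_emb, temperature=0.07):
    """Align the two MCRs with the overlapping modality (text).
    clip_text_emb, clap_text_emb: (B, 512) embeddings of the same captions."""
    a = F.normalize(proj_clip(clip_text_emb), dim=-1)
    b = F.normalize(proj_clap(clap_text_emb), dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    # symmetric contrastive loss pulls matching caption pairs together
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# At inference, CLIP image embeddings go through proj_clip and CLAP audio
# embeddings through proj_clap, giving audio-visual representations without
# any paired audio-image training data.
```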
The roads one must walk down: Commute and depression for Beijing’s residents
Transportation Research Part D: Transport and Environment, 109, 103316. DOI: 10.1016/j.trd.2022.103316
Home-made blues: Residential crowding and mental health in Beijing, China
Urban Studies. DOI: 10.1177/00420980221101707
Active microfluidic mixer chip
We report the design and fabrication of a chaotic mixer based on electrorheological (ER) fluid-controlled valves. The flow in the main channel is perturbed by liquid flow in orthogonal side channels, driven by hydrodynamic pulsating pumps. Each pulsating pump consists of a chamber with a diaphragm plus two out-of-phase ER valves operating in a push-pull mode. All the valves, pumps, and mixing channels are integrated in one polydimethylsiloxane chip. Mixing characteristics in the main channel are controlled by the strength and frequency of external electric fields applied to the ER fluid.
Electrorheological fluid-actuated flexible platform
The design, fabrication, and performance of an electrorheological (ER) fluid-actuated flexible platform integrated on a microfluidic chip are reported in this letter. The digitally regulated ER microvalves control the four diaphragms on which a platform is sustained. With electrical input signals, the platform can perform vibrations at tunable frequencies as well as generate complex leveling modes. The flexible platform can potentially act as a microdamper when its inputs are generated from a sensor, in combination with a feedback control system.
“Standard Text” Relational Classification Model Based on Concatenated Word Vector Attention and Feature Concatenation
The task of relation classification is an important preliminary task in natural language processing. Relation classification can provide a high-quality corpus for tasks such as machine translation, human–computer dialogue, and structured text generation. In the digitization of standards, identifying the entity relationships in standard text is an important prerequisite for forming subsequent standard knowledge. Only by accurately labeling the relationships between entities can knowledge bases and knowledge graphs be built with high efficiency and accuracy. This study proposes a standard-text relation classification model based on concatenated word vector attention and feature concatenation. The model was compared against baselines and ablated on our labeled Chinese standard-text dataset. To further demonstrate its performance, the same experiments were carried out on two general English datasets, SemEval-2010 Task 8 and KBP37. On both the standard-text dataset and the general datasets, the proposed model achieved excellent results.
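A minimal sketch of what concatenated word vector attention plus feature concatenation might look like is given below: token vectors are scored by attention conditioned on the concatenated entity vectors, and the attended context is concatenated with both entity vectors before classification. The encoder, layer sizes, and 19-way output (matching SemEval-2010 Task 8) are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Sketch of concatenated word-vector attention plus feature concatenation;
    the encoder, sizes, and 19-way output are illustrative placeholders."""

    def __init__(self, hidden=768, num_relations=19):
        super().__init__()
        self.attn = nn.Linear(3 * hidden, 1)              # scores each token given both entities
        self.classifier = nn.Linear(3 * hidden, num_relations)

    def forward(self, token_states, head_vec, tail_vec):
        # token_states: (B, T, H) contextual word vectors from any encoder
        # head_vec, tail_vec: (B, H) vectors of the two entity mentions
        B, T, H = token_states.shape
        entity_pair = torch.cat([head_vec, tail_vec], dim=-1)            # (B, 2H)
        expanded = entity_pair.unsqueeze(1).expand(B, T, 2 * H)          # broadcast to tokens
        scores = self.attn(torch.cat([token_states, expanded], dim=-1))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)                           # attention over tokens
        context = (weights * token_states).sum(dim=1)                    # (B, H) attended context
        features = torch.cat([context, head_vec, tail_vec], dim=-1)      # feature concatenation
        return self.classifier(features)                                 # relation logits
```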