Dense Voxel 3D Reconstruction Using a Monocular Event Camera
Event cameras are sensors inspired by biological systems that specialize in
capturing changes in brightness. These emerging cameras offer many advantages
over conventional frame-based cameras, including high dynamic range, high frame
rates, and extremely low power consumption. Due to these advantages, event
cameras have increasingly been adopted in various fields, such as frame
interpolation, semantic segmentation, odometry, and SLAM. However, their use
in 3D reconstruction for VR applications remains underexplored. Previous
methods in this field mainly focused on 3D reconstruction through depth map
estimation. Methods that produce dense 3D reconstruction generally require
multiple cameras, while methods that utilize a single event camera can only
produce a semi-dense result. Other single-camera methods that can produce dense
3D reconstruction rely on creating a pipeline that either incorporates the
aforementioned methods or other existing Structure from Motion (SfM) or
Multi-view Stereo (MVS) methods. In this paper, we propose a novel approach for
solving dense 3D reconstruction using only a single event camera. To the best
of our knowledge, our work is the first attempt in this regard. Our preliminary
results demonstrate that the proposed method can produce visually
distinguishable dense 3D reconstructions directly without requiring pipelines
like those used by existing methods. Additionally, we have created a synthetic
dataset with object scans using an event camera simulator. This
dataset will help accelerate other relevant research in this field.
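As context for the abstract above, event streams are often converted into a dense tensor before being fed to a reconstruction network. The sketch below is a minimal, hypothetical illustration of one common such representation, a voxel grid of accumulated event polarities; the function name and dimensions are assumptions, not this paper's actual pipeline.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate events into a voxel grid.

    events: (N, 4) array of (x, y, t, polarity), polarity in {-1, +1}.
    Returns a (num_bins, height, width) grid of summed polarities.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 2]
    # Normalize timestamps to [0, num_bins - 1] and assign each event a bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    bins = np.clip(t_norm.astype(int), 0, num_bins - 1)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    ps = events[:, 3]
    # Unbuffered in-place add handles repeated (bin, y, x) indices correctly.
    np.add.at(grid, (bins, ys, xs), ps)
    return grid

# Four synthetic events on a 4x4 sensor, split into two temporal bins.
events = np.array([
    [0, 0, 0.00, +1],
    [1, 1, 0.25, -1],
    [2, 2, 0.60, +1],
    [3, 3, 1.00, +1],
])
grid = events_to_voxel_grid(events, num_bins=2, height=4, width=4)
```

A learned 3D reconstruction model would consume such grids in place of conventional intensity frames.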
Fine-grained Activity Classification In Assembly Based On Multi-visual Modalities
Assembly activity recognition and prediction help to improve productivity, quality control, and safety measures in smart factories. This study aims to sense, recognize, and predict a worker's continuous fine-grained assembly activities in a manufacturing platform. We propose a two-stage network for workers' fine-grained activity classification by leveraging scene-level and temporal-level activity features. The first stage is a feature awareness block that extracts scene-level features from multi-visual modalities, including red, green, and blue (RGB) and hand skeleton frames. We use the transfer learning method in the first stage and compare three different pre-trained feature extraction models. Then, we transmit the feature information from the first stage to the second stage to learn the temporal-level features of activities. The second stage consists of Recurrent Neural Network (RNN) layers and a final classifier. We compare the performance of two different RNNs in the second stage, the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). The partial video observation method is used in the prediction of fine-grained activities. In the experiments using the trimmed activity videos, our model achieves an accuracy of >99% on our dataset and >98% on the public dataset UCF 101, outperforming the state-of-the-art models. The prediction model achieves an accuracy of >97% in predicting activity labels using 50% of the onset activity video information. In the experiments using an untrimmed video with continuous assembly activities, we combine our recognition and prediction models and achieve an accuracy of >91% in real time, surpassing the state-of-the-art models for the recognition of continuous assembly activities.
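The second stage described above (RNN layers over per-frame features, then a classifier) can be sketched in plain numpy. This is an illustrative toy, assuming random placeholder weights and made-up dimensions; it shows the GRU recurrence and softmax classification, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden_dim, num_classes, seq_len = 8, 16, 5, 10

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# GRU parameters for the update gate (z), reset gate (r), candidate state (n).
W = {g: rng.normal(0, 0.1, (hidden_dim, feat_dim)) for g in "zrn"}
U = {g: rng.normal(0, 0.1, (hidden_dim, hidden_dim)) for g in "zrn"}

def gru_forward(x_seq):
    """Run the GRU over a sequence of per-frame feature vectors."""
    h = np.zeros(hidden_dim)
    for x in x_seq:
        z = sigmoid(W["z"] @ x + U["z"] @ h)         # update gate
        r = sigmoid(W["r"] @ x + U["r"] @ h)         # reset gate
        n = np.tanh(W["n"] @ x + U["n"] @ (r * h))   # candidate state
        h = (1 - z) * h + z * n
    return h

W_out = rng.normal(0, 0.1, (num_classes, hidden_dim))

def classify(x_seq):
    """Softmax over class logits computed from the final hidden state."""
    logits = W_out @ gru_forward(x_seq)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Stand-in for stage-one features of a 10-frame clip.
probs = classify(rng.normal(size=(seq_len, feat_dim)))
```

The partial-observation prediction setting corresponds to calling `classify` on a truncated prefix of the feature sequence.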
Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks
Deep learning models have achieved excellent recognition results on
large-scale video benchmarks. However, they perform poorly when applied to
videos with rare scenes or objects, primarily due to the bias of existing video
datasets. We tackle this problem from two different angles: algorithm and
dataset. From the perspective of algorithms, we propose Spatial-aware
Multi-Aspect Debiasing (SMAD), which incorporates both explicit debiasing with
multi-aspect adversarial training and implicit debiasing with the spatial
actionness reweighting module, to learn a more generic representation invariant
to non-action aspects. To neutralize the intrinsic dataset bias, we propose
OmniDebias, which selectively leverages web data for joint training and
achieves higher performance with far less web data. To verify their
effectiveness, we establish evaluation protocols and perform extensive
experiments on both re-distributed splits of existing datasets and a new
evaluation dataset focusing on actions with rare scenes. We also show that
the debiased representation can generalize better when transferred to other
datasets and tasks.
Comment: ECCVW 202
Experimental study on thermal runaway risk of 18650 lithium ion battery under side-heating condition
Sparse Array Enabled Near-Field Communications: Beam Pattern Analysis and Hybrid Beamforming Design
Extremely large-scale array (XL-array) has emerged as a promising technology
to enable near-field communications for achieving enhanced spectrum efficiency
and spatial resolution, by drastically increasing the number of antennas.
However, this also inevitably incurs higher hardware and energy cost, which may
not be affordable in future wireless systems. To address this issue, we propose
in this paper to exploit two types of sparse arrays (SAs) for enabling
near-field communications. Specifically, we first consider the linear sparse
array (LSA) and characterize its near-field beam pattern. It is shown that
despite the achieved beam-focusing gain, the LSA introduces several undesired
grating-lobes, which have beam power comparable to that of the main-lobe and are
focused on specific regions. An efficient hybrid beamforming design is then
proposed for the LSA to deal with the potential strong inter-user interference
(IUI). Next, we consider another form of SA, called extended coprime array
(ECA), which is composed of two LSA subarrays with different (coprime)
inter-antenna spacing. By characterizing the ECA near-field beam pattern, we
show that compared with the LSA with the same array sparsity, the ECA can
greatly suppress the beam power of near-field grating-lobes thanks to the
offset effect of the two subarrays, albeit with a larger number of
grating-lobes. This motivates us to propose a customized two-phase hybrid
beamforming design for the ECA. Finally, numerical results are presented to
demonstrate the rate performance gain of the proposed two SAs over the
conventional uniform linear array (ULA).
Comment: In this paper, we propose to exploit sparse arrays for enabling
near-field communications and characterize their unique beam patterns for
facilitating the hybrid beamforming design
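The near-field beam pattern the abstract analyzes can be illustrated numerically: place LSA elements with spacing widened by a sparsity factor, match a beamformer to a focal point using exact (spherical-wavefront) distances, and scan the correlation over candidate locations. The carrier frequency, aperture, and sparsity below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

c, f = 3e8, 30e9                  # speed of light, assumed 30 GHz carrier
lam = c / f
N, sparsity = 64, 4               # 64 elements, spacing = sparsity * lambda/2
d = (np.arange(N) - (N - 1) / 2) * sparsity * lam / 2   # element x-positions

def steering(x, y):
    """Near-field steering vector toward point (x, y), unit-normalized."""
    r = np.hypot(x - d, y)        # exact distance from each element
    return np.exp(-2j * np.pi * r / lam) / np.sqrt(N)

focus = steering(0.0, 5.0)        # matched filter focused 5 m broadside
xs = np.linspace(-3, 3, 601)
pattern = np.array([np.abs(focus.conj() @ steering(x, 5.0)) for x in xs])

# The main lobe sits at x = 0 (pattern value 1 by construction); with
# sparsity > 1, additional grating lobes appear at other x positions,
# which is the interference mechanism the abstract describes.
main = pattern[300]               # xs[300] == 0.0
```

Plotting `pattern` against `xs` for different sparsity factors (and for the coprime two-subarray layout) would visualize the grating-lobe suppression effect discussed above.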
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
The recent advancements in text-to-3D generation mark a significant milestone
in generative models, unlocking new possibilities for creating imaginative 3D
assets across various real-world scenarios. However, existing methods often fall
short of rendering detailed, high-quality 3D models. This problem is especially
prevalent among the many methods built on Score Distillation Sampling (SDS). This
paper identifies a notable deficiency of SDS: it produces inconsistent,
low-quality update directions for the 3D model, causing an over-smoothing
effect. To address this, we propose a novel approach called Interval Score
Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes
interval-based score matching to counteract over-smoothing. Furthermore, we
incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline.
Extensive experiments show that our model largely outperforms the
state-of-the-art in quality and training efficiency.
Comment: The first two authors contributed equally to this work. Our code will
be available at: https://github.com/EnVision-Research/LucidDreamer
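For context, the SDS gradient the abstract refers to is, in the formulation popularized by the earlier DreamFusion line of work (not taken from this paper), commonly written as

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,
    \bigl(\epsilon_\phi(x_t; y, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \right]
```

where $x$ is the rendered image with parameters $\theta$, $x_t$ its noised version at timestep $t$, $\epsilon_\phi$ the diffusion model's noise prediction conditioned on prompt $y$, and $w(t)$ a timestep weighting. The randomness of $\epsilon$ across updates is one source of the inconsistent directions the abstract mentions; ISM replaces this with interval-based score matching along deterministic diffusing trajectories.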
When Urban Region Profiling Meets Large Language Models
Urban region profiling from web-sourced data is of utmost importance for
urban planning and sustainable development. We are witnessing a rising trend of
applying LLMs to various fields, especially in multi-modal data research such
as vision-language learning, where the text modality serves as supplementary
information for the image. Since the textual modality has never been introduced
into modality combinations in urban region profiling, we aim to answer two
fundamental questions in this paper: i) Can the textual modality enhance urban
region profiling? ii) If so, in what ways and with regard to which aspects?
To answer the questions, we leverage the power of Large Language Models (LLMs)
and introduce the first-ever LLM-enhanced framework that integrates the
knowledge of textual modality into urban imagery profiling, named LLM-enhanced
Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP).
Specifically, it first generates a detailed textual description for each
satellite image by an open-source Image-to-Text LLM. Then, the model is trained
on the image-text pairs, seamlessly unifying natural language supervision for
urban visual representation learning, jointly with contrastive loss and
language modeling loss. Results on predicting three urban indicators in four
major Chinese metropolises demonstrate its superior performance, with an
average improvement of 6.1% on R^2 compared to the state-of-the-art methods.
Our code and the image-language dataset will be released upon paper
notification.
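The CLIP-style training objective the abstract describes (natural language supervision via a contrastive loss over image-text pairs) can be sketched in numpy. The embeddings below are random placeholders standing in for UrbanCLIP's encoder outputs; the batch size and temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim, temperature = 4, 32, 0.07

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img = l2_normalize(rng.normal(size=(batch, dim)))   # satellite-image embeddings
txt = l2_normalize(rng.normal(size=(batch, dim)))   # generated-text embeddings

# Pairwise cosine similarities scaled by temperature; row i should match col i.
logits = img @ txt.T / temperature

def cross_entropy(logits, targets):
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

targets = np.arange(batch)  # the i-th image is paired with the i-th caption
# Symmetric contrastive loss: image-to-text and text-to-image averaged.
loss = 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))
```

In the full framework this contrastive term would be combined with a language modeling loss, as the abstract states.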