Linguistically-driven framework for computationally efficient and scalable sign recognition
We introduce a new general framework for sign recognition from monocular video using limited quantities of annotated data. The novelty of the hybrid framework we describe here is that we exploit state-of-the-art learning methods while also incorporating features based on what we know about the linguistic composition of lexical signs. In particular, we analyze hand shape, orientation, location, and motion trajectories, and then use CRFs to combine this linguistically significant information for purposes of sign recognition. Our robust modeling and recognition of these sub-components of sign production allow an efficient parameterization of the sign recognition problem as compared with purely data-driven methods. This parameterization enables a scalable and extendable time-series learning approach that advances the state of the art in sign recognition, as shown by the results reported here for recognition of isolated, citation-form, lexical signs from American Sign Language (ASL).
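As a rough illustration of this kind of parameterization (not the authors' implementation), a linear-chain CRF can combine per-frame linguistic features. The Python sketch below assumes hypothetical upstream trackers that supply hand shape, orientation, location, and motion for each frame, and uses the sklearn-crfsuite library.

# Minimal sketch, assuming hypothetical per-frame measurements; not the paper's code.
import sklearn_crfsuite  # pip install sklearn-crfsuite

def frame_features(frame):
    """Turn one frame's (assumed) linguistic measurements into CRF features."""
    return {
        "handshape": frame["handshape"],      # categorical hand-shape class
        "orientation": frame["orientation"],  # palm orientation bin
        "location": frame["location"],        # location relative to the body
        "motion_dx": frame["motion"][0],      # motion trajectory components
        "motion_dy": frame["motion"][1],
    }

def to_sequences(videos):
    """videos: list of (frames, labels) pairs, one per annotated sign clip."""
    X = [[frame_features(f) for f in frames] for frames, _ in videos]
    y = [labels for _, labels in videos]
    return X, y

# A linear-chain CRF labels the whole frame sequence jointly, so the temporal
# structure of the sub-components contributes to the sign decision.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
# X_train, y_train = to_sequences(training_videos)
# crf.fit(X_train, y_train)
# predictions = crf.predict(X_test)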
Tactile Mapping and Localization from High-Resolution Tactile Imprints
This work studies the problem of shape reconstruction and object localization
using a vision-based tactile sensor, GelSlim. The main contributions are the
recovery of local shapes from contact, an approach to reconstruct the tactile
shape of objects from tactile imprints, and an accurate method for object
localization of previously reconstructed objects. The algorithms can be applied
to a large variety of 3D objects and provide accurate tactile feedback for
in-hand manipulation. Results show that by exploiting the dense tactile
information we can reconstruct the shape of objects with high accuracy and do
on-line object identification and localization, opening the door to reactive
manipulation guided by tactile sensing. We provide videos and supplemental
information in the project's website
http://web.mit.edu/mcube/research/tactile_localization.html.
Comment: ICRA 2019, 7 pages, 7 figures. Website:
http://web.mit.edu/mcube/research/tactile_localization.html Video:
https://youtu.be/uMkspjmDbq
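As one hedged illustration of the localization step (assumed details, not the paper's released code), a tactile imprint recovered as a small point cloud can be registered against a previously reconstructed object model with point-to-point ICP. The Python sketch below uses Open3D; the names imprint_points and model_points are placeholders invented for the example.

# Illustrative sketch only: pose estimation of a reconstructed object from one imprint.
import numpy as np
import open3d as o3d

def localize_from_imprint(imprint_points, model_points, init_pose=np.eye(4)):
    """Estimate the pose that aligns the tactile imprint to the object model.

    imprint_points: (N, 3) array of contact points recovered from one imprint.
    model_points:   (M, 3) array sampled from the reconstructed object shape.
    """
    source = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(imprint_points))
    target = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(model_points))
    result = o3d.pipelines.registration.registration_icp(
        source, target,
        max_correspondence_distance=0.005,  # 5 mm search radius (assumed scale)
        init=init_pose,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
    )
    return result.transformation, result.fitness  # 4x4 pose and inlier ratio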
CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
We present a method for teaching machines to understand and model the
underlying spatial common sense of diverse human-object interactions in 3D in a
self-supervised way. This is a challenging task, as there exist specific
manifolds of the interactions that can be considered human-like and natural,
but the human pose and the geometry of objects can vary even for similar
interactions. Such diversity makes annotating 3D interactions difficult and
hard to scale, which limits the potential to reason about them in a supervised
way. One way of learning the 3D spatial relationship between
humans and objects during interaction is by showing multiple 2D images captured
from different viewpoints when humans interact with the same type of objects.
The core idea of our method is to leverage a generative model that produces
high-quality 2D images from an arbitrary text prompt input as an "unbounded"
data generator with effective controllability and view diversity. Despite the
lower image quality compared to real images, we demonstrate that the
synthesized images are sufficient to learn the 3D human-object spatial
relations. We present multiple strategies to leverage the synthesized images,
including (1) the first method to leverage a generative image model for 3D
human-object spatial relation learning; (2) a framework to reason about the 3D
spatial relations from inconsistent 2D cues in a self-supervised manner via 3D
occupancy reasoning with pose canonicalization; (3) semantic clustering to
disambiguate different types of interactions with the same object types; and
(4) a novel metric to assess the quality of 3D spatial learning of interaction.
Comment: Accepted to ICCV 2023 (Oral Presentation). Project Page:
https://jellyheadandrew.github.io/projects/choru
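A minimal sketch of the aggregation idea, under assumptions not stated in the abstract: evidence about the object's position gathered from many synthesized views is mapped into a human-canonical frame and accumulated into a 3D occupancy grid. The inputs (per-view object points and a world-to-canonical transform) are hypothetical placeholders, not the authors' pipeline.

# Hedged sketch of canonicalized occupancy accumulation over many views.
import numpy as np

GRID, EXTENT = 64, 2.0  # 64^3 voxels covering an assumed 2 m cube around the person

def accumulate_occupancy(views):
    """views: iterable of (object_points (N, 3), world_to_canonical (4, 4))."""
    occ = np.zeros((GRID, GRID, GRID), dtype=np.float32)
    for points, T in views:
        # Canonicalize: express object evidence relative to the human pose.
        homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)
        canon = (homog @ T.T)[:, :3]
        # Discretize into voxel indices and count hits inside the grid.
        idx = np.floor((canon / EXTENT + 0.5) * GRID).astype(int)
        ok = np.all((idx >= 0) & (idx < GRID), axis=1)
        np.add.at(occ, tuple(idx[ok].T), 1.0)
    return occ / max(occ.max(), 1e-6)  # normalize to a soft occupancy prior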
Detection and Localization of Firearm Carriers in Complex Scenes for Improved Safety Measures
Detecting firearms and accurately localizing individuals carrying them in
images or videos is of paramount importance in security, surveillance, and
content customization. However, this task presents significant challenges in
complex environments due to clutter and the diverse shapes of firearms. To
address this problem, we propose a novel approach that leverages human-firearm
interaction information, which provides valuable clues for localizing firearm
carriers. Our approach incorporates an attention mechanism that effectively
distinguishes humans and firearms from the background by focusing on relevant
areas. Additionally, we introduce a saliency-driven locality-preserving
constraint to learn essential features while preserving foreground information
in the input image. By combining these components, our approach achieves
exceptional results on a newly proposed dataset. To handle inputs of varying
sizes, we pass paired human-firearm instances with attention masks as channels
through a deep network for feature computation, utilizing an adaptive average
pooling layer. We extensively evaluate our approach against existing methods in
human-object interaction detection and achieve significant results (AP=77.8%)
compared to the baseline approach (AP=63.1%). This demonstrates the
effectiveness of leveraging attention mechanisms and saliency-driven locality
preservation for accurate human-firearm interaction detection. Our findings
contribute to advancing the fields of security and surveillance, enabling more
efficient firearm localization and identification in diverse scenarios.
Comment: This paper is accepted in IEEE Transactions on Computational Social
Systems
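A minimal PyTorch sketch of the described input handling, with assumed layer sizes rather than the authors' architecture: a paired human-firearm crop is stacked with two attention-mask channels, and an adaptive average pooling layer produces a fixed-size feature regardless of the crop size.

# Hedged sketch only; channel counts and layer widths are assumptions.
import torch
import torch.nn as nn

class PairInteractionNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # 3 RGB channels + 1 human attention mask + 1 firearm attention mask = 5 channels.
        self.features = nn.Sequential(
            nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d((7, 7))  # fixed output size for any input size
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, rgb, human_mask, firearm_mask):
        x = torch.cat([rgb, human_mask, firearm_mask], dim=1)
        x = self.pool(self.features(x))
        return self.classifier(x.flatten(1))

# Usage: crops of different sizes yield logits of the same shape.
# net = PairInteractionNet()
# logits = net(torch.rand(1, 3, 180, 240),
#              torch.rand(1, 1, 180, 240),
#              torch.rand(1, 1, 180, 240))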