30 research outputs found
GRAB: A Dataset of Whole-Body Human Grasping of Objects
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application: we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de. Comment: ECCV 2020
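The abstract describes GrabNet only as a conditional generative network. As a rough illustration of what such a model could look like, the sketch below is a small conditional variational autoencoder that maps an object-shape feature vector to hand-grasp parameters; the class name, layer widths, and the 61-dimensional hand parameterization are assumptions made for the sketch, not details taken from the paper.

    # Illustrative sketch (not the authors' code): a conditional VAE that maps an
    # object-shape feature vector to hand-grasp parameters, in the spirit of GrabNet.
    import torch
    import torch.nn as nn

    class GraspCVAE(nn.Module):
        def __init__(self, obj_feat_dim=1024, hand_dim=61, latent_dim=16):
            super().__init__()
            # Encoder: (hand parameters, object features) -> latent Gaussian
            self.encoder = nn.Sequential(
                nn.Linear(hand_dim + obj_feat_dim, 512), nn.ReLU(),
                nn.Linear(512, 256), nn.ReLU(),
            )
            self.to_mu = nn.Linear(256, latent_dim)
            self.to_logvar = nn.Linear(256, latent_dim)
            # Decoder: (latent code, object features) -> hand parameters
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim + obj_feat_dim, 256), nn.ReLU(),
                nn.Linear(256, 512), nn.ReLU(),
                nn.Linear(512, hand_dim),
            )

        def forward(self, hand, obj_feat):
            h = self.encoder(torch.cat([hand, obj_feat], dim=-1))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
            recon = self.decoder(torch.cat([z, obj_feat], dim=-1))
            return recon, mu, logvar

        @torch.no_grad()
        def sample(self, obj_feat, n=1):
            # At test time: draw latent codes and decode candidate grasps for an unseen object.
            z = torch.randn(n, self.to_mu.out_features)
            return self.decoder(torch.cat([z, obj_feat.expand(n, -1)], dim=-1))

At test time only the decoder path is used: a latent code sampled from the prior plus the object features yields one plausible grasp, so repeated sampling produces a diverse set of grasps for the same unseen object.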
How Noisy Does a Noisy Miner Have to Be? Amplitude Adjustments of Alarm Calls in an Avian Urban ‘Adapter’
Background: Urban environments generate constant loud noise, which creates a formidable challenge for many animals relying on acoustic communication. Some birds make vocal adjustments that reduce auditory masking by altering, for example, the frequency (kHz) or timing of vocalizations. Another adjustment, well documented for birds under laboratory and natural field conditions, is a noise level-dependent change in sound signal amplitude (the ‘Lombard effect’). To date, however, field research on amplitude adjustments in urban environments has focused exclusively on bird song. Methods: We investigated amplitude regulation of alarm calls using, as our model, a successful urban ‘adapter’ species, the Noisy Miner, Manorina melanocephala. We compared several different alarm calls under contrasting noise conditions. Results: Individuals at noisier locations (arterial roads) alarm-called significantly more loudly than those at quieter locations (residential streets). Other mechanisms known to improve sound signal transmission in ‘noise’, namely use of higher perches and in-flight calling, did not differ between site types. Intriguingly, the observed preferential use of different alarm calls by Noisy Miners inhabiting arterial roads and residential streets was unlikely to have constituted a vocal modification made in response to sound-masking in the urban environment, because the calls involved fell within the main frequency range of background anthropogenic noise. Conclusions: The results of our study suggest that a species, which has the ability to adjust the amplitude of its signals
A Cervid Vocal Fold Model Suggests Greater Glottal Efficiency in Calling at High Frequencies
Male Rocky Mountain elk (Cervus elaphus nelsoni) produce loud and high fundamental frequency bugles during the mating season, in contrast to male European Red Deer (Cervus elaphus scoticus), which produce loud and low fundamental frequency roaring calls. A critical step in understanding vocal communication is to relate sound complexity to anatomy and physiology in a causal manner. Experimentation at the sound source, often difficult in vivo in mammals, is simulated here by a finite element model of the larynx and a wave propagation model of the vocal tract, both based on the morphology and biomechanics of the elk. The model can produce a wide range of fundamental frequencies. Low fundamental frequencies require low vocal fold strain, but large lung pressure and large glottal flow if the sound intensity level is to exceed 70 dB at 10 m distance. A high-frequency bugle requires both large muscular effort (to strain the vocal ligament) and high lung pressure (to overcome phonation threshold pressure), but at least 10 dB greater intensity level can be achieved. Glottal efficiency, the ratio of radiated sound power to aerodynamic power at the glottis, is higher in elk, suggesting an advantage of high-pitched signaling. This advantage is based on two aspects: first, the lower airflow required for aerodynamic power and, second, an acoustic radiation advantage at higher frequencies. Both signal types are used by the respective males during the mating season and probably serve as honest signals. The two signal types relate differently to physical qualities of the sender. The low-frequency sound (Red Deer call) relates to overall body size via a strong relationship between acoustic parameters and the size of vocal organs and body size. The high-frequency bugle may signal muscular strength and endurance, via a ‘vocalizing at the edge’ mechanism, for which efficiency is critical.
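The efficiency measure referred to in the abstract can be written compactly as a ratio; expressing aerodynamic power as lung pressure times mean glottal airflow is a standard approximation assumed here, not a formula quoted from the paper:

    \eta_{\text{glottal}} = \frac{P_{\text{radiated}}}{P_{\text{aero}}},
    \qquad P_{\text{aero}} \approx p_{\text{lung}} \, \bar{U}_{\text{glottis}}

On this reading, the lower airflow needed for high-frequency calling shrinks the denominator, which is consistent with the higher glottal efficiency reported for the elk bugle.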
Domain Transfer for 3D Pose Estimation from Color Images without Manual Annotations
We introduce a novel learning method for 3D pose estimation from color images. While acquiring annotations for color images is a difficult task, our approach circumvents this problem by learning a mapping from paired color and depth images captured with an RGB-D camera. We jointly learn the pose from synthetic depth images that are easy to generate, and learn to align these synthetic depth images with the real depth images. We demonstrate our approach on the tasks of 3D hand pose estimation and 3D object pose estimation, both from color images only. Our method achieves performance comparable to state-of-the-art methods on popular benchmark datasets, without requiring any annotations for the color images.
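The abstract states the training signal only in words. The sketch below is one possible way to assemble such an objective, assuming a pose regressor supervised on synthetic depth, a placeholder feature-alignment term between synthetic and real depth, and a pairing term that ties color features to depth features via RGB-D pairs; the network shapes, loss forms, and function names are illustrative assumptions, not the authors' code.

    # Schematic training step: supervised pose loss on synthetic depth plus two
    # alignment terms; the exact alignment objective in the paper may differ.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    feat_depth = nn.Sequential(nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
                               nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
    feat_color = nn.Sequential(nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
                               nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())
    pose_head = nn.Linear(64, 63)  # e.g. 21 hand joints x 3D coordinates (an assumption)

    def training_step(synth_depth, synth_pose, real_depth, real_color):
        # 1) Supervised pose loss: annotations come for free with synthetic depth.
        loss_pose = F.mse_loss(pose_head(feat_depth(synth_depth)), synth_pose)
        # 2) Crude stand-in for aligning real-depth features with synthetic-depth features
        #    (moment matching here; the paper's alignment objective may be different).
        loss_align = F.mse_loss(feat_depth(real_depth).mean(0),
                                feat_depth(synth_depth).mean(0).detach())
        # 3) Tie color features to depth features via RGB-D pairs (no color labels needed).
        loss_pair = F.mse_loss(feat_color(real_color), feat_depth(real_depth).detach())
        return loss_pose + loss_align + loss_pair

    # Example with dummy batches of 64x64 images and 63-dimensional pose targets.
    loss = training_step(torch.randn(4, 1, 64, 64), torch.randn(4, 63),
                         torch.randn(4, 1, 64, 64), torch.randn(4, 3, 64, 64))

Once the color branch is aligned with the depth branch, the pose head trained on (synthetic) depth features can be applied to color features at test time, which is how annotations for color images are avoided.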
HandSeg: An Automatically Labeled Dataset for Hand Segmentation from Depth Images
We introduce a large-scale RGBD hand segmentation dataset, with detailed and automatically generated high-quality ground-truth annotations. Existing real-world datasets are limited in quantity due to the difficulty of manually annotating ground-truth labels. By leveraging a pair of brightly colored gloves and an RGBD camera, we propose an acquisition pipeline that eases the task of annotating very large datasets with minimal human intervention. We then quantify the importance of a large annotated dataset in this domain, and compare the performance of existing datasets in the training of deep-learning architectures. Finally, we propose a novel architecture employing strided convolutions/deconvolutions in place of max-pooling and unpooling layers. Our variant outperforms baseline architectures while remaining computationally efficient at inference time. Source code and datasets will be made publicly available.
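To make the stated architectural choice concrete, the following is a minimal encoder-decoder sketch that downsamples with strided convolutions and upsamples with transposed convolutions instead of max-pooling/unpooling; the layer sizes, input channel count, and three-class output (e.g. background/left hand/right hand) are assumptions for illustration, not the released HandSeg model.

    # Illustrative per-pixel segmentation network using strided (de)convolutions.
    import torch
    import torch.nn as nn

    class StridedSegNet(nn.Module):
        def __init__(self, in_ch=1, num_classes=3):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),   # 1/2 resolution
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),      # 1/4
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),     # 1/8
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 1/4
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 1/2
                nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),     # full-resolution logits
            )

        def forward(self, depth):
            return self.decoder(self.encoder(depth))

    # Example: a batch of 2 single-channel depth maps -> per-pixel class logits.
    logits = StridedSegNet()(torch.randn(2, 1, 128, 128))  # shape (2, 3, 128, 128)

Because the strided layers learn their own down/upsampling filters and avoid storing pooling indices, such a design can be cheaper at inference than a pooling/unpooling counterpart, which matches the efficiency claim in the abstract.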