7 research outputs found

    Recognizing object surface material from impact sounds for robot manipulation

    Get PDF
    © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
    We investigated the use of impact sounds generated during exploratory behaviors in a robotic manipulation setup as cues for predicting object surface material and for recognizing individual objects. We collected and make available the YCB-impact sounds dataset, which includes over 3,000 impact sounds for the YCB set of everyday objects lying on a table. Impact sounds were generated in three modes: (i) a human holding a gripper and hitting, scratching, or dropping the object; (ii) a gripper attached to a teleoperated robot hitting the object from the top; (iii) an autonomously operated robot hitting the objects from the side at two different speeds. A convolutional neural network is trained from scratch to recognize the object material (steel, aluminium, hard plastic, soft plastic, other plastic, ceramic, wood, paper/cardboard, foam, glass, rubber) from a single impact sound. On the manually collected dataset, with more variability in the speed of the action, nearly 60% accuracy was achieved on the test set (objects not presented during training). On the robot setup with a stereotypical poking action from the top, 85% accuracy was achieved. This performance drops to 79% if multiple exploratory actions are combined. Individual objects from the set of 75 objects can be recognized with 79% accuracy. This work demonstrates promising results regarding the possibility of using impact sound for recognition in tasks like single-stream recycling, where objects have to be sorted based on their material composition.
    This work was supported by the project Interactive Perception-Action-Learning for Modelling Objects (IPALM) (H2020 – FET – ERA-NET Cofund – CHIST-ERA III / Technology Agency of the Czech Republic, EPSILON, no. TH05020001) and partially supported by the project MDM2016-0656 funded by MCIN/AEI/10.13039/501100011033. M.D. was supported by grant RYC-2017-22563 funded by MCIN/AEI/10.13039/501100011033 and by “ESF Investing in your future”. S.P. and M.H. were additionally supported by the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16 019/0000765 “Research Center for Informatics”. We thank Bedrich Himmel for assistance with the sound setup, Antonio Miranda and Andrej Kruzliak for data collection, and Lukas Rustler for video preparation.
    Peer Reviewed. Postprint (author's final draft).
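
    As an illustration of the kind of pipeline this abstract describes (a CNN classifying surface material from a single impact sound), the sketch below converts a recording to a log-mel spectrogram with librosa and scores it with a small PyTorch network. The spectrogram settings, network architecture, and file name are illustrative assumptions, not the authors' implementation.

        # Illustrative sketch only: a small CNN that classifies a single impact sound
        # into one of the 11 material classes named in the abstract. Spectrogram
        # settings and architecture are assumptions, not the authors' setup.
        import librosa
        import numpy as np
        import torch
        import torch.nn as nn

        MATERIALS = ["steel", "aluminium", "hard plastic", "soft plastic", "other plastic",
                     "ceramic", "wood", "paper/cardboard", "foam", "glass", "rubber"]

        def log_mel(path, sr=44100, n_mels=64):
            """Load an impact recording and return a fixed-size log-mel spectrogram tensor."""
            y, sr = librosa.load(path, sr=sr, mono=True)
            mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
            mel = librosa.power_to_db(mel, ref=np.max)
            mel = librosa.util.fix_length(mel, size=128, axis=1)   # pad/trim the time axis
            return torch.from_numpy(mel).float()[None, None]       # shape (1, 1, n_mels, 128)

        class MaterialCNN(nn.Module):
            def __init__(self, n_classes=len(MATERIALS)):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.AdaptiveAvgPool2d(1),
                )
                self.classifier = nn.Linear(32, n_classes)

            def forward(self, x):
                return self.classifier(self.features(x).flatten(1))

        # Usage: predict the material of one impact sound (hypothetical file name).
        # model = MaterialCNN()
        # logits = model(log_mel("impact.wav"))
        # print(MATERIALS[logits.argmax(dim=1).item()])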

    Swoosh! Rattle! Thump! -- Actions that Sound

    Full text link
    Truly intelligent agents need to capture the interplay of all their senses to build a rich physical understanding of their world. In robotics, we have seen tremendous progress in using visual and tactile perception; however, we have often ignored a key sense: sound. This is primarily due to the lack of data that captures the interplay of action and sound. In this work, we perform the first large-scale study of the interactions between sound and robotic action. To do this, we create the largest available sound-action-vision dataset, with 15,000 interactions on 60 objects, using our robotic platform Tilt-Bot. By tilting objects and allowing them to crash into the walls of a robotic tray, we collect rich four-channel audio information. Using this data, we explore the synergies between sound and action and present three key insights. First, sound is indicative of fine-grained object class information, e.g., sound can differentiate a metal screwdriver from a metal wrench. Second, sound also contains information about the causal effects of an action, i.e., given the sound produced, we can predict what action was applied to the object. Finally, object representations derived from audio embeddings are indicative of implicit physical properties. We demonstrate that on previously unseen objects, audio embeddings generated through interactions can predict forward models 24% better than passive visual embeddings. Project videos and data are at https://dhiraj100892.github.io/swoosh/
    Comment: To be presented at Robotics: Science and Systems 2020
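
    As a rough sketch of the audio-conditioned forward model the abstract evaluates: a network predicts an object's displacement from a per-object audio embedding together with the applied action. The embedding size, action encoding, and network layout below are assumptions, not the Tilt-Bot code.

        # Illustrative sketch only: a forward model conditioned on an audio embedding.
        # Dimensions and the MLP are assumptions made for this example.
        import torch
        import torch.nn as nn

        class AudioConditionedForwardModel(nn.Module):
            """Predict an object's displacement in the tray from (audio embedding, action)."""
            def __init__(self, embed_dim=128, action_dim=2, state_dim=2):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(embed_dim + action_dim, 256), nn.ReLU(),
                    nn.Linear(256, state_dim),
                )

            def forward(self, audio_embedding, action):
                return self.net(torch.cat([audio_embedding, action], dim=-1))

        # Usage: one training step against observed displacements (placeholder tensors).
        model = AudioConditionedForwardModel()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        audio_emb = torch.randn(32, 128)   # embeddings of the struck objects (placeholder)
        action = torch.randn(32, 2)        # applied tilt action (placeholder)
        target = torch.randn(32, 2)        # observed displacement (placeholder)
        loss = nn.functional.mse_loss(model(audio_emb, action), target)
        opt.zero_grad(); loss.backward(); opt.step()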

    Audio-Material Modeling and Reconstruction for Multimodal Interaction

    Get PDF
    Interactive virtual environments enable the creation of training simulations, games, and social applications. These virtual environments can create a sense of presence in the environment: a sensation that its user is truly in another location. To maintain presence, interactions with virtual objects should engage multiple senses. Furthermore, multisensory input should be consistent, e.g., a virtual bowl that visually appears plastic should also sound like plastic when dropped on the floor. In this dissertation, I propose methods to improve the perceptual realism of virtual object impact sounds and ensure consistency between those sounds and the input from other senses. Recreating the impact sound of a real-world object requires an accurate estimate of that object's material parameters. The material parameters that affect impact sound, collectively forming the audio-material, include the material damping parameters for a damping model. I propose and evaluate damping models and use them to estimate material damping parameters for real-world objects. I also consider how interaction with virtual objects can be made more consistent between the senses of sight, hearing, and touch. First, I present a method for modeling the damping behavior of impact sounds, using generalized proportional damping both to estimate more expressive material damping parameters from recorded impact sounds and to perform impact sound synthesis. Next, I present a method for estimating material damping parameters in the presence of confounding factors and with no knowledge of the object's shape. To accomplish this, a probabilistic damping model captures various external effects to produce robust damping parameter estimates. Next, I present a method for consistent multimodal interaction with textured surfaces. Texture maps serve as a single unified representation of mesoscopic detail for the purposes of visual rendering, sound synthesis, and rigid-body simulation. Finally, I present a method for geometry and material classification using multimodal audio-visual input. Using this method, a real-world scene can be scanned and virtually reconstructed while accurately modeling both the visual appearance and audio-material parameters of each object.
    Doctor of Philosophy
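
    For context, a minimal sketch of the classic modal synthesis model that this line of work builds on and generalizes: an impact sound as a sum of exponentially damped sinusoids, with per-mode damping given by Rayleigh (proportional) damping. The mode frequencies, gains, and (alpha, beta) values below are made-up examples, not estimates from the dissertation.

        # Illustrative sketch only: modal impact-sound synthesis with Rayleigh
        # (proportional) damping. All parameter values are placeholders.
        import numpy as np

        def synthesize_impact(freqs_hz, gains, alpha, beta, sr=44100, duration=1.0):
            """Sum exponentially damped sinusoids; per-mode damping d_i = (alpha + beta * w_i**2) / 2."""
            t = np.arange(int(sr * duration)) / sr
            out = np.zeros_like(t)
            for f, g in zip(freqs_hz, gains):
                w = 2.0 * np.pi * f                   # angular frequency of the mode
                d = 0.5 * (alpha + beta * w ** 2)     # Rayleigh damping coefficient
                out += g * np.exp(-d * t) * np.sin(w * t)
            return out / np.max(np.abs(out))          # normalize to [-1, 1]

        # Usage: a toy three-mode strike (values are placeholders, not fitted parameters).
        sound = synthesize_impact(freqs_hz=[820.0, 1710.0, 2950.0],
                                  gains=[1.0, 0.6, 0.3],
                                  alpha=4.0, beta=3e-7)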

    Shape and material from sound

    No full text
    Hearing an object falling onto the ground, humans can recover rich information including its rough shape, material, and falling height. In this paper, we build machines to approximate such competency. We first mimic human knowledge of the physical world by building an efficient, physics-based simulation engine. Then, we present an analysis-by-synthesis approach to infer properties of the falling object. We further accelerate the process by learning a mapping from a sound wave to object properties, and using the predicted values to initialize the inference. This mapping can be viewed as an approximation of human commonsense learned from past experience. Our model performs well on both synthetic audio clips and real recordings without requiring any annotated data. We conduct behavior studies to compare human responses with ours on estimating object shape, material, and falling height from sound. Our model achieves near-human performance.
    National Science Foundation (U.S.) (1212849); National Science Foundation (U.S.) (1447476); United States Office of Naval Research, Multidisciplinary University Research Initiative (N00014-16-1-2007); Toyota Research Institute; Samsung (Firm); Shell; National Science Foundation (U.S.) (STC Award CCF-1231216)
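
    A hedged sketch of the analysis-by-synthesis loop described in the abstract, with a learned predictor supplying the initial guess. The caller supplies the physics-based sound synthesizer, the audio featurizer, and the learned initializer; the simple stochastic local search used here is only a stand-in for the authors' inference procedure.

        # Illustrative sketch only: analysis-by-synthesis with a learned initialization.
        # simulate, featurize, and init_net are caller-supplied stand-ins for the
        # physics/audio engine, a spectrogram function, and the learned mapping.
        import numpy as np

        def infer_object_properties(observed_wave, simulate, featurize, init_net,
                                    n_iters=200, step=0.05, seed=0):
            rng = np.random.default_rng(seed)
            params = np.asarray(init_net(observed_wave), dtype=float)  # learned initial guess
            target = featurize(observed_wave)
            best_err = np.mean((featurize(simulate(params)) - target) ** 2)
            for _ in range(n_iters):                                   # stochastic local search
                candidate = params + step * rng.standard_normal(params.shape)
                err = np.mean((featurize(simulate(candidate)) - target) ** 2)
                if err < best_err:                                     # keep candidates whose synthesized
                    params, best_err = candidate, err                  # sound better matches the recording
            return params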

    Inferring Shape and Material from Sound

    No full text
    Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine of such competency, however, is very challenging. One possible solution is to rely on supervised learning, which requires a large-scale dataset containing sounds of various objects, with clean labels on their appearance, shape, and material. However, it is difficult and expensive to capture such a dataset. Another approach is to tackle the problem in an analysis-by-synthesis framework, where we iteratively update current estimates given a generative model. This, however, requires sophisticated generative models, which are too computationally expensive to support iterative inference. Finally, despite the popularity of deep learning methods in auditory perception tasks, most of them are derived from visual recognition tasks and may not be suitable for processing audio. To address these difficulties, we first present a novel, open-source pipeline that generates audio-visual data purely from 3D object shapes and their physical properties. Using this generative model, we are able to construct a synthetic audio-visual dataset, namely Sound-20K, for object perception tasks. We further demonstrate that the representation learned on synthetic audio-visual data can transfer to real-world scenarios. In addition, the generative model can be made efficient enough to support iterative inference, where we construct an analysis-by-synthesis framework that infers an object's shape and material from the sound of it falling on the ground.
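
    A minimal sketch of how a generative pipeline of this kind can enumerate labeled scene configurations (shape, material with physical properties, drop height) before rendering audio and images; ground-truth labels come for free because every scene is synthesized. The shape and material lists, property values, and height range are illustrative assumptions, and the actual sound/image rendering is left to the simulation engine.

        # Illustrative sketch only: enumerating labeled scenes for synthetic
        # audio-visual data generation. Values are placeholders.
        import itertools
        import json

        SHAPES = ["cube", "cylinder", "sphere", "mug"]   # stand-ins for 3D object shapes
        MATERIALS = {"steel": {"density": 7850, "youngs_modulus": 200e9},
                     "wood":  {"density": 700,  "youngs_modulus": 11e9},
                     "glass": {"density": 2500, "youngs_modulus": 70e9}}
        DROP_HEIGHTS_M = [0.3, 0.6, 1.0]

        scenes = [{"shape": s, "material": m, "physical_properties": props, "drop_height_m": h}
                  for (s, (m, props), h) in itertools.product(SHAPES, MATERIALS.items(), DROP_HEIGHTS_M)]

        with open("scene_manifest.json", "w") as f:      # each entry carries its ground-truth labels
            json.dump(scenes, f, indent=2)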