Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell,
and touch), vision and sound are basic sources through which humans understand
the world. Often correlated during natural events, these two modalities combine
to jointly affect human perception. In this paper, we pose the task of
generating sound given visual input. Such capabilities could help enable
applications in virtual reality (generating sound for virtual scenes
automatically) or provide additional accessibility to images or videos for
people with visual impairments. As a first step in this direction, we apply
learning-based methods to generate raw waveform samples given input video
frames. We evaluate our models on a dataset of videos containing a variety of
sounds (such as ambient sounds and sounds from people/animals). Our experiments
show that the generated sounds are fairly realistic and have good temporal
synchronization with the visual inputs.
Comment: Project page:
http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
Coding local and global binary visual features extracted from video sequences
Binary local features represent an effective alternative to real-valued
descriptors, leading to comparable results for many visual analysis tasks,
while being characterized by significantly lower computational complexity and
memory requirements. When dealing with large collections, a more compact
representation based on global features is often preferred, which can be
obtained from local features by means of, e.g., the Bag-of-Visual-Word (BoVW)
model. Several applications, including for example visual sensor networks and
mobile augmented reality, require visual features to be transmitted over a
bandwidth-limited network, thus calling for coding techniques that aim at
reducing the required bit budget, while attaining a target level of efficiency.
In this paper we investigate a coding scheme tailored to both local and global
binary features, which aims at exploiting both spatial and temporal redundancy
by means of intra- and inter-frame coding. In this respect, the proposed coding
scheme can be conveniently adopted to support the Analyze-Then-Compress (ATC)
paradigm. That is, visual features are extracted from the acquired content,
encoded at remote nodes, and finally transmitted to a central controller that
performs visual analysis. This is in contrast with the traditional approach, in
which visual content is acquired at a node, compressed and then sent to a
central unit for further processing, according to the Compress-Then-Analyze
(CTA) paradigm. In this paper we experimentally compare ATC and CTA by means of
rate-efficiency curves in the context of two different visual analysis tasks:
homography estimation and content-based retrieval. Our results show that the
novel ATC paradigm based on the proposed coding primitives can be competitive
with CTA, especially in bandwidth-limited scenarios.
Comment: submitted to IEEE Transactions on Image Processing
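The intra- and inter-frame idea in the abstract above can be illustrated with a minimal Python sketch. It is not the coding scheme from the paper: descriptors are stored as plain integers, the mode decision uses a hypothetical `max_residual_bits` threshold rather than an entropy-based rate estimate, and the function names are invented for illustration. Inter mode sends the index of the closest descriptor from the previous frame plus the XOR residual, which is sparse when the content changes little between frames.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def encode_descriptor(current, previous_frame, max_residual_bits):
    """Choose intra or inter mode for one descriptor (illustrative rule only).

    Inter mode references the closest descriptor of the previous frame and
    keeps only the XOR residual. Intra mode keeps the descriptor itself.
    """
    if previous_frame:
        ref_idx = min(
            range(len(previous_frame)),
            key=lambda i: hamming(current, previous_frame[i]),
        )
        residual = current ^ previous_frame[ref_idx]
        # Assumed rule: a sparse residual is taken to be cheaper to code.
        if bin(residual).count("1") <= max_residual_bits:
            return ("inter", ref_idx, residual)
    return ("intra", None, current)


def decode_descriptor(symbol, previous_frame):
    """Invert encode_descriptor given the same previous-frame descriptors."""
    mode, ref_idx, payload = symbol
    if mode == "intra":
        return payload
    return payload ^ previous_frame[ref_idx]
```

A real encoder would follow the mode decision with entropy coding of the residual, which this sketch leaves out.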
Objects that Sound
In this paper our objectives are, first, networks that can embed audio and
visual inputs into a common space that is suitable for cross-modal retrieval;
and second, a network that can localize the object that sounds in an image,
given the audio signal. We achieve both these objectives by training from
unlabelled video using only audio-visual correspondence (AVC) as the objective
function. This is a form of cross-modal self-supervision from video.
To this end, we design new network architectures that can be trained for
cross-modal retrieval and localizing the sound source in an image, by using the
AVC task. We make the following contributions: (i) show that audio and visual
embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and
between-mode retrieval; (ii) explore various architectures for the AVC task,
including those for the visual stream that ingest a single image, or multiple
images, or a single image and multi-frame optical flow; (iii) show that the
semantic object that sounds within an image can be localized (using only the
sound, no motion or flow information); and (iv) give a cautionary tale on how
to avoid undesirable shortcuts in the data preparation.
Comment: Appears in: European Conference on Computer Vision (ECCV) 201
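The audio-visual correspondence (AVC) objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the networks from the paper: a positive pair takes the frame and audio from the same moment of the same video, a negative pair takes them from different videos, and the correspondence score is assumed to be a logistic function of the Euclidean distance between the two embeddings. The names `sample_avc_pair`, `avc_loss`, `scale` and `bias` are hypothetical.

```python
import math
import random


def sample_avc_pair(videos, rng):
    """Draw one training pair from unlabelled video.

    `videos` is a list of videos, each a list of time-aligned (frame, audio)
    clips. Returns (frame, audio, label), where label 1 means "corresponds".
    """
    vid = rng.randrange(len(videos))
    t = rng.randrange(len(videos[vid]))
    frame = videos[vid][t][0]
    if rng.random() < 0.5:
        return frame, videos[vid][t][1], 1  # same video, same moment
    other = rng.choice([v for v in range(len(videos)) if v != vid])
    return frame, rng.choice(videos[other])[1], 0  # audio from another video


def avc_loss(visual_emb, audio_emb, label, scale=1.0, bias=0.0):
    """Binary cross-entropy on a distance-based correspondence score."""
    dist = math.dist(visual_emb, audio_emb)
    p = 1.0 / (1.0 + math.exp(-(bias - scale * dist)))  # P(correspond)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

Because the score depends only on the distance between embeddings, minimising this loss pushes corresponding audio and images close together in the shared space, which is the property that cross-modal retrieval relies on.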
Techniques for effective and efficient fire detection from social media images
Social media could provide valuable information to support decision making in
crisis management, such as in accidents, explosions and fires. However, much of
the data from social media are images, which are uploaded at a rate that makes
it impossible for human beings to analyze them. Despite the many works on image
analysis, there are no fire detection studies on social media. To fill this
gap, we propose the use and evaluation of a broad set of content-based image
retrieval and classification techniques for fire detection. Our main
contributions are: (i) the development of the Fast-Fire Detection method
(FFDnR), which combines feature extractors and evaluation functions to support
instance-based learning, (ii) the construction of an annotated set of images
with ground-truth depicting fire occurrences -- the FlickrFire dataset, and
(iii) the evaluation of 36 efficient image descriptors for fire detection.
Using real data from Flickr, our results showed that FFDnR was able to achieve
a precision for fire detection comparable to that of human annotators.
Therefore, our work shall provide a solid basis for further developments on
monitoring images from social media.
Comment: 12 pages, Proceedings of the International Conference on Enterprise
Information Systems. Specifically: Marcos Bedo, Gustavo Blanco, Willian
Oliveira, Mirela Cazzolato, Alceu Costa, Jose Rodrigues, Agma Traina, Caetano
Traina, 2015, Techniques for effective and efficient fire detection from
social media images, ICEIS, 34-4
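The pairing of a feature extractor with an evaluation (distance) function for instance-based learning can be illustrated with a small k-nearest-neighbour sketch in Python. It is not the FFDnR method itself: the red-channel histogram, the Manhattan distance and every function name below are assumptions chosen for illustration, standing in for whichever image descriptor and evaluation function are being compared.

```python
from collections import Counter


def red_histogram(pixels, bins=4):
    """Hypothetical feature extractor: normalised histogram of red values (0-255)."""
    hist = [0] * bins
    for r, _g, _b in pixels:
        hist[min(r * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [h / total for h in hist]


def manhattan(u, v):
    """Hypothetical evaluation function: L1 distance between feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def knn_classify(query, labelled, extract, distance, k=3):
    """Label `query` by majority vote among its k nearest labelled images.

    `labelled` is a list of (pixels, label) pairs. `extract` and `distance`
    are pluggable, so different descriptor and distance pairs can be compared.
    """
    q = extract(query)
    nearest = sorted(labelled, key=lambda item: distance(q, extract(item[0])))[:k]
    votes = Counter(label for _pixels, label in nearest)
    return votes.most_common(1)[0][0]
```

Swapping `red_histogram` or `manhattan` for another function changes the descriptor and distance pair without touching the classifier, which is the kind of comparison an evaluation over many image descriptors needs.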
Digital Image Access & Retrieval
The 33rd Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March of 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation, with the bulk of the conference focusing on indexing and retrieval.
published or submitted for publication