Generic 3D Representation via Pose Estimation and Matching
Though a large body of computer vision research has investigated developing
generic semantic representations, efforts towards developing a similar
representation for 3D have been limited. In this paper, we learn a generic 3D
representation through solving a set of foundational proxy 3D tasks:
object-centric camera pose estimation and wide baseline feature matching. Our
method is based upon the premise that by providing supervision over a set of
carefully selected foundational tasks, generalization to novel tasks and
abstraction capabilities can be achieved. We empirically show that the internal
representation of a multi-task ConvNet trained to solve the above core problems
generalizes to novel 3D tasks (e.g., scene layout estimation, object pose
estimation, surface normal estimation) without the need for fine-tuning and
shows traits of abstraction abilities (e.g., cross-modality pose estimation).
In the context of the core supervised tasks, we demonstrate our representation
achieves state-of-the-art wide baseline feature matching results without
requiring a priori rectification (unlike SIFT and the majority of learned
features). We also show 6DOF camera pose estimation given a pair of local image
patches. The accuracy on both supervised tasks is comparable to that of humans.
Finally, we contribute a large-scale dataset composed of object-centric street
view scenes along with point correspondences and camera pose information, and
conclude with a discussion on the learned representation and open research
questions. Comment: Published in ECCV16. See the project website
http://3drepresentation.stanford.edu/ and dataset website
https://github.com/amir32002/3D_Street_Vie
Corpus Analysis Tools for Computational Hook Discovery.
Compared to studies with symbolic music data, advances in music description from audio have overwhelmingly focused on ground truth reconstruction and maximizing prediction accuracy, with only a small fraction of studies using audio description to gain insight into musical data. We present a strategy for the corpus analysis of audio data that is optimized for interpretable results. The approach brings two previously unexplored concepts to the audio domain: audio bigram distributions, and the use of corpus-relative or 'second-order' descriptors. To test the real-world applicability of our method, we present an experiment in which we model song recognition data collected in a widely played music game. By using the proposed corpus analysis pipeline we are able to present a cognitively adequate analysis that allows a model interpretation in terms of the listening history and experience of our participants. We find that our corpus-based audio features are able to explain a comparable amount of variance to symbolic features for this task when used alone and that they can supplement symbolic features profitably when the two types of features are used in tandem. Finally, we highlight new insights into what makes music recognizable.
OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
We introduce OpenShape, a method for learning multi-modal joint
representations of text, image, and point clouds. We adopt the commonly used
multi-modal contrastive learning framework for representation alignment, but
with a specific focus on scaling up 3D representations to enable open-world 3D
shape understanding. To achieve this, we scale up training data by ensembling
multiple 3D datasets and propose several strategies to automatically filter and
enrich noisy text descriptions. We also explore and compare strategies for
scaling 3D backbone networks and introduce a novel hard negative mining module
for more efficient training. We evaluate OpenShape on zero-shot 3D
classification benchmarks and demonstrate its superior capabilities for
open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy
of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than
10% for existing methods. OpenShape also achieves an accuracy of 85.3% on
ModelNet40, outperforming previous zero-shot baseline methods by 20% and
performing on par with some fully-supervised methods. Furthermore, we show that
our learned embeddings encode a wide range of visual and semantic concepts
(e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D
and image-3D interactions. Due to their alignment with CLIP embeddings, our
learned shape representations can also be integrated with off-the-shelf
CLIP-based models for various applications, such as point cloud captioning and
point cloud-conditioned image generation. Comment: Project website: https://colin97.github.io/OpenShape
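Zero-shot classification in an aligned embedding space is commonly done by comparing a shape embedding with text embeddings of the category names and picking the most similar one. The sketch below illustrates that general idea only. The function names and the toy three-dimensional vectors are hypothetical and are not taken from OpenShape, whose embeddings come from trained encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_label(shape_emb, text_embs):
    """Return the category whose text embedding is most similar to shape_emb.

    text_embs maps a category name to its (hypothetical) text embedding.
    """
    return max(text_embs, key=lambda name: cosine(shape_emb, text_embs[name]))

# Toy embeddings, assumed for illustration only.
text_embs = {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]}
print(zero_shot_label([0.9, 0.1, 0.0], text_embs))
```

No category-specific training is involved at prediction time: adding a new category only requires a text embedding for its name.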
Information-Theoretic Measures of Predictability for Music Content Analysis.
PhD. This thesis is concerned with determining similarity in musical audio, for the purpose of applications
in music content analysis. With the aim of determining similarity, we consider the
problem of representing temporal structure in music. To represent temporal structure, we propose
to compute information-theoretic measures of predictability in sequences. We apply our
measures to track-wise representations obtained from musical audio; thereafter we consider the
obtained measures predictors of musical similarity. We demonstrate that our approach benefits
music content analysis tasks based on musical similarity.
For the intermediate-specificity task of cover song identification, we compare contrasting
discrete-valued and continuous-valued measures of pairwise predictability between sequences.
In the discrete case, we devise a method for computing the normalised compression distance
(NCD) which accounts for correlation between sequences. We observe that our measure improves
average performance over NCD, for sequential compression algorithms. In the continuous
case, we propose to compute information-based measures as statistics of the prediction error
between sequences. Evaluated using 300 Jazz standards and using the Million Song Dataset,
we observe that continuous-valued approaches outperform discrete-valued approaches. Further,
we demonstrate that continuous-valued measures of predictability may be combined to improve
performance with respect to baseline approaches. Using a filter-and-refine approach, we demonstrate
state-of-the-art performance using the Million Song Dataset.
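As a point of reference, the baseline normalised compression distance between two sequences x and y is (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed length. The following is a minimal sketch of that standard NCD, assuming zlib as the compressor and byte strings as stand-ins for quantised feature sequences. It does not implement the correlation-aware variant proposed in the thesis, and the function names are hypothetical.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Standard normalised compression distance between two byte sequences."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy symbol sequences standing in for quantised audio features (assumed data).
rng_a, rng_b = random.Random(1), random.Random(2)
seq_a = bytes(rng_a.randrange(256) for _ in range(2000))
seq_b = bytes(rng_b.randrange(256) for _ in range(2000))

print(ncd(seq_a, seq_a))  # near 0: the second copy is almost free to encode
print(ncd(seq_a, seq_b))  # near 1: the sequences share no structure
```

Lower values indicate that one sequence helps to compress the other, which is the sense in which NCD acts as a similarity measure.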
For the low-specificity tasks of similarity rating prediction and song year prediction, we propose
descriptors based on computing track-wise compression rates of quantised audio features,
using multiple temporal resolutions and quantisation granularities. We evaluate our descriptors
using a dataset of 15 500 track excerpts of Western popular music, for which we have 7 800
web-sourced pairwise similarity ratings. Combined with bag-of-features descriptors, we obtain
performance gains of 31.1% and 10.9% for similarity rating prediction and song year prediction.
For both tasks, analysis of selected descriptors reveals that representing features at multiple time
scales benefits prediction accuracy. This work was supported by a UK EPSRC DTA studentship.
A Location-Aware Middleware Framework for Collaborative Visual Information Discovery and Retrieval
This work addresses the problem of scalable, location-aware distributed indexing, which enables collaborative construction and maintenance of world-scale visual maps and models. Such maps and models could support numerous activities, including navigation, visual localization, persistent surveillance, structure from motion, and hazard or disaster detection. Current distributed approaches to mapping and modeling fail to incorporate global geospatial addressing and offer limited ability to customize search. Our solution is a peer-to-peer middleware framework based on XOR distance routing, which employs a Hilbert space-filling curve addressing scheme in a novel distributed geographic index. This allows for a universal addressing scheme supporting publish and search in dynamic environments while ensuring global availability of the model and scalability with respect to geographic size and number of users. The framework is evaluated using large-scale network simulations and a search application that supports visual navigation in real-world experiments.
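The two ingredients named in this abstract can be illustrated in isolation. The sketch below maps a grid cell to its position along a Hilbert curve using the well-known iterative conversion, then picks the peer whose identifier is closest to that key under the XOR metric, as in Kademlia-style routing. How the framework itself combines the two is not described here, so the grid size, function names, and node identifiers are assumptions made for illustration.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.

    n must be a power of two; x and y must lie in [0, n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the next level is in canonical orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def closest_node(key: int, node_ids: list[int]) -> int:
    """Node identifier with the smallest XOR distance to key."""
    return min(node_ids, key=lambda node_id: node_id ^ key)

# Hypothetical usage: publish data for cell (5, 2) of an 8 x 8 grid.
key = hilbert_index(8, 5, 2)
print(key, closest_node(key, [3, 17, 42, 60]))
```

Because consecutive Hilbert indices always belong to neighbouring cells, a contiguous range of keys covers a compact geographic region, which is what makes the curve attractive as an addressing scheme.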
Data-driven shape analysis and processing
Data-driven methods play an increasingly important role in discovering geometric, structural, and semantic relationships between shapes. In contrast to traditional approaches that process shapes in isolation from each other, data-driven methods aggregate information from 3D model collections to improve the analysis, modeling and editing of shapes. Through reviewing the literature, we provide an overview of the main concepts and components of these methods and discuss their application to classification, segmentation, matching, reconstruction, modeling and exploration, as well as scene analysis and synthesis. We conclude our report with ideas that can inspire future research in data-driven shape analysis and processing.