16 research outputs found
CrowdCam: Instantaneous Navigation of Crowd Images Using Angled Graph
We present a near real-time algorithm for interactively exploring a collectively captured moment without explicit 3D reconstruction. Our system favors immediacy and local coherency over global consistency. It is common to represent photos as vertices of a weighted graph, where edge weights measure similarity or distance between pairs of photos. We introduce Angled Graphs as a new data structure that organizes collections of photos in a way that enables the construction of visually smooth paths. Weighted angled graphs extend weighted graphs with angles and angle weights, which penalize turning along paths. As a result, locally straight paths can be computed by specifying a photo and a direction. The weighted angled graphs of photos used in this paper can be regarded as the result of discretizing the Riemannian geometry of the high-dimensional manifold of all possible photos. Ultimately, our system enables everyday people to take advantage of each other's perspectives in order to create on-the-spot spatiotemporal visual experiences similar to the popular bullet-time sequence. We believe that this type of application will greatly enhance shared human experiences, spanning from events as personal as parents watching their children's football game to highly publicized red carpet galas.
Funding: Swiss National Science Foundation; European Commission (ERC grant #210806 4DVideo, 7th Framework Programme (FP7/2007-2013))
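The locally straight paths described above can be pictured greedily: from the current photo and heading, pick the neighbor that minimizes edge weight plus a penalty on the turning angle. A minimal Python sketch, assuming photos have 2D embedding positions; the names (`turn_penalty`, `straight_path`) and the greedy strategy are illustrative, not the paper's actual algorithm:

```python
import math

def turn_penalty(prev_pos, cur_pos, next_pos, weight=1.0):
    """Angle weight: penalize the turning angle at cur_pos along the path
    (0 for going straight, pi for a full U-turn)."""
    ax, ay = cur_pos[0] - prev_pos[0], cur_pos[1] - prev_pos[1]
    bx, by = next_pos[0] - cur_pos[0], next_pos[1] - cur_pos[1]
    dot = ax * bx + ay * by
    na, nb = math.hypot(ax, ay), math.hypot(bx, by)
    cos_t = max(-1.0, min(1.0, dot / (na * nb)))
    return weight * math.acos(cos_t)

def straight_path(pos, edges, start, direction, steps):
    """Greedily extend a path: at each photo pick the unvisited neighbor
    minimizing edge weight + turn penalty relative to the current heading."""
    # Seed a virtual "previous" position behind `start` along `direction`,
    # so the first step already has a heading to compare against.
    prev = (pos[start][0] - direction[0], pos[start][1] - direction[1])
    path, cur = [start], start
    for _ in range(steps):
        best, best_cost = None, float("inf")
        for nbr, w in edges.get(cur, []):
            if nbr in path:
                continue
            cost = w + turn_penalty(prev, pos[cur], pos[nbr])
            if cost < best_cost:
                best, best_cost = nbr, cost
        if best is None:
            break
        prev, cur = pos[cur], best
        path.append(best)
    return path
```

With equal edge weights, the angle term is what makes the path keep heading in the requested direction instead of drifting toward an arbitrary similar photo.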
Matching and Predicting Street Level Images
The paradigm of matching images to a very large dataset has been used for numerous vision tasks and is a powerful one. If the image dataset is large enough, one can expect to find good matches of almost any image to the database, allowing label transfer [3, 15] and image editing or enhancement [6, 11]. Users of this approach will want to know how many images are required, and what features to use for finding semantically relevant matches. Furthermore, for navigation tasks or to exploit context, users will want to know the predictive quality of the dataset: can we predict the image that would be seen under changes in camera position?
We address these questions in detail for one category of images: street-level views. We have a dataset of images taken from an enumeration of positions and viewpoints within Pittsburgh. We evaluate how well we can match those images, using images from non-Pittsburgh cities, and how well we can predict the images that would be seen under changes in camera position. We compare performance for these tasks across eight different feature sets, finding a feature set that outperforms the others (HOG). A combination of all the features performs better in the prediction task than any individual feature. We used Amazon Mechanical Turk workers to rank the matches and predictions of different algorithm conditions by comparing each one to the selection of a random image. This approach can evaluate the efficacy of different feature sets and parameter settings for the matching paradigm with other image categories.
Funding: United States. Dept. of Defense (ARDA VACE); United States. National Geospatial-Intelligence Agency (NEGI-1582-04-0004); United States. National Geospatial-Intelligence Agency (MURI Grant N00014-06-1-0734); France. Agence nationale de la recherche (project HFIBMR (ANR-07-BLAN-0331-01)); Institut national de recherche en informatique et en automatique (France); Xerox Fellowship Program
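The matching paradigm can be illustrated with a toy nearest-neighbor search over gradient-based descriptors. The sketch below stands in for the paper's HOG features with a single global orientation histogram (real HOG uses dense spatial cells and block normalization); the function names are hypothetical:

```python
import numpy as np

def orientation_histogram(img, bins=8):
    """Toy HOG-like descriptor: one histogram of unsigned gradient
    orientations, weighted by gradient magnitude, L2-normalized.
    (A drastic simplification of HOG's spatial cell grid.)"""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # fold to [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def best_match(query, database):
    """Index of the database image whose descriptor is closest
    (Euclidean distance) to the query's descriptor."""
    q = orientation_histogram(query)
    dists = [np.linalg.norm(q - orientation_histogram(d)) for d in database]
    return int(np.argmin(dists))
```

Scaling this idea up — millions of database images, richer descriptors, approximate nearest-neighbor indexing — is what makes the "match to a very large dataset" paradigm practical.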
Discovering states and transformations in image collections
Objects in visual scenes come in a rich variety of transformed states. A few classes of transformation have been heavily studied in computer vision: mostly simple, parametric changes in color and geometry. However, transformations in the physical world occur in many more flavors, and they come with semantic meaning: e.g., bending, folding, aging, etc. The transformations an object can undergo tell us about its physical and functional properties. In this paper, we introduce a dataset of objects, scenes, and materials, each of which is found in a variety of transformed states. Given a novel collection of images, we show how to explain the collection in terms of the states and transformations it depicts. Our system works by generalizing across object classes: states and transformations learned on one set of objects are used to interpret the image collection for an entirely new object class.
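One way to picture the collection-level interpretation: given per-image scores from state classifiers trained on *other* object classes, label each image with its best-scoring state and summarize the collection as a distribution over states. A hypothetical sketch (the scoring model itself is assumed, not shown, and this is not the paper's actual inference procedure):

```python
import numpy as np

def explain_collection(scores, state_names):
    """scores: (num_images, num_states) matrix from state classifiers
    trained on other object classes. Returns a per-image state label
    and the collection's empirical distribution over states."""
    best = np.argmax(scores, axis=1)            # best state per image
    labels = [state_names[i] for i in best]
    counts = np.bincount(best, minlength=len(state_names))
    dist = counts / counts.sum()                # collection-level summary
    return labels, dict(zip(state_names, dist))
```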
SceneScape: Text-Driven Consistent Scene Generation
We present a method for text-driven perpetual view generation: synthesizing long-term videos of various scenes solely from an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically plausible scenes, we deploy online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to build a unified mesh representation of the scene, which is progressively extended as the video is generated. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.
Project page: https://scenescape.github.io
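The test-time consistency step can be caricatured as aligning each frame's predicted depth to the depth rendered from the accumulated scene. The paper fine-tunes the depth network itself; the sketch below substitutes a much cheaper closed-form scale-and-shift fit for the same consistency objective, with all names assumed:

```python
import numpy as np

def align_depth(pred, rendered, mask):
    """Fit a per-frame scale and shift that makes the predicted depth
    agree with the depth rendered from the accumulated scene mesh, in
    the least-squares sense over valid (mask=True) pixels.
    A stand-in for the paper's test-time network fine-tuning."""
    p = pred[mask].ravel()
    r = rendered[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)      # [depth, 1] design matrix
    (scale, shift), *_ = np.linalg.lstsq(A, r, rcond=None)
    return scale * pred + shift
```

Monocular depth predictions are only defined up to such an affine ambiguity per frame, which is why even this crude alignment removes a large part of the frame-to-frame inconsistency.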
CG2Real: Improving the Realism of Computer Generated Images using a Large Collection of Photographs
Computer Graphics (CG) has achieved a high level of realism, producing strikingly vivid images. This realism, however, comes at the cost of long and often expensive manual modeling, and most often humans can still distinguish between CG images and real images. We present a novel method to make CG images look more realistic that is simple and accessible to novice users. Our system uses a large collection of photographs gathered from online repositories. Given a CG image, we retrieve a small number of real images with similar global structure. We identify corresponding regions between the CG and real images using a novel mean-shift cosegmentation algorithm. The user can then automatically transfer color, tone, and texture from matching regions to the CG image. Our system only uses image processing operations and does not require a 3D model of the scene, making it fast and easy to integrate into digital content creation workflows. Results of a user study show that our improved CG images appear more realistic than the originals.
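The color-transfer step can be sketched with Reinhard-style statistics matching between corresponding regions. The abstract does not specify the exact transfer operator, so this per-channel mean/std match is an illustrative stand-in (the system also transfers tone and texture, not shown):

```python
import numpy as np

def transfer_color(cg_region, real_region):
    """Shift and scale each color channel of the CG region so its mean
    and standard deviation match the corresponding real region."""
    out = np.empty_like(cg_region, dtype=float)
    for c in range(cg_region.shape[-1]):
        src = cg_region[..., c].astype(float)
        ref = real_region[..., c].astype(float)
        s = src.std()
        # Center, rescale to the reference spread, re-center on the reference mean.
        scaled = (src - src.mean()) * (ref.std() / s if s > 0 else 1.0)
        out[..., c] = scaled + ref.mean()
    return out
```

In practice such transfers are usually done in a decorrelated color space (e.g., Lab) rather than raw RGB, and only between regions the cosegmentation has put in correspondence.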
Computational crowd camera : enabling remote-vision via sparse collective plenoptic sampling
Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (p. 61-63).
In this thesis, I present a near real-time algorithm for interactively exploring a collectively captured moment without explicit 3D reconstruction. This system favors immediacy and local coherency over global consistency. It is common to represent photos as vertices of a weighted graph, where edge weights measure similarity or distance between pairs of photos. I introduce Angled Graphs as a new data structure to organize collections of photos in a way that enables the construction of visually smooth paths. Weighted angled graphs extend weighted graphs with angles and angle weights, which penalize turning along paths. As a result, locally straight paths can be computed by specifying a photo and a direction. The weighted angled graphs of photos used in this thesis can be regarded as the result of discretizing the Riemannian geometry of the high-dimensional manifold of all possible photos. Ultimately, this system enables everyday people to take advantage of each other's perspectives in order to create on-the-spot spatiotemporal visual experiences similar to the popular bullet-time sequence. I believe that this type of application will greatly enhance shared human experiences, spanning from events as personal as parents watching their children's football game to highly publicized red carpet galas. In addition, security applications can greatly benefit from such a system by quickly making sense of a large collection of visual data.
by Aydın Arpa. S.M.