
    Street View Motion-from-Structure-from-Motion

    We describe a structure-from-motion framework that handles “generalized” cameras, such as moving rolling-shutter cameras, and works at an unprecedented scale—billions of images covering millions of linear kilometers of roads—by exploiting a good relative pose prior along vehicle paths. We exhibit a planet-scale, appearance-augmented point cloud constructed with our framework and demonstrate its practical use in correcting the pose of a street-level image collection.
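    The abstract does not spell out how the relative pose prior along vehicle paths enters the optimization. As a minimal sketch only, assuming a world-to-camera pose convention and simple angular and Euclidean error measures (none of which come from the paper), the Python snippet below turns a prior on the motion between two consecutive vehicle poses into a residual that a pose-graph style cost could penalize.

```python
import numpy as np

def relative_pose_residual(R_i, t_i, R_j, t_j, R_prior, t_prior):
    """Compare the estimated relative motion between two consecutive vehicle
    poses with a prior relative motion (e.g. from wheel odometry or GPS).

    Assumed convention: each pose maps world points into the camera frame,
    x_cam = R @ x_world + t. Returns (rotation error [rad], translation error).
    """
    # Estimated relative motion taking frame i into frame j.
    R_ij = R_j @ R_i.T
    t_ij = t_j - R_ij @ t_i
    # Rotation error: angle of the discrepancy rotation R_prior^T @ R_ij.
    cos_angle = (np.trace(R_prior.T @ R_ij) - 1.0) / 2.0
    rot_err = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    trans_err = np.linalg.norm(t_ij - t_prior)
    return rot_err, trans_err
```

    In a pose-graph or bundle-adjustment formulation, such residuals would be stacked over consecutive frames and weighted against the image-based terms; the weighting and robustification are not described in the abstract.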

    Merging Top-View Lidar Data With Street-View SFM Data To Enhance Urban Flood Simulation

    Top-view data obtained from LiDAR systems has long been used as topographic input data for urban flood modelling applications. This high-resolution input data has considerable potential to improve urban flood modelling predictions with more detail. However, the difficulty of employing top-view data is that it may miss some urban features, because this type of data cannot represent features that are hidden underneath other objects. These hidden features may play a substantial part in diverting floodwater flowing through, especially in complex urban areas. Recent advances in photogrammetry and computer vision techniques offer an opportunity to create high-resolution topographic data. Using a consumer digital camera, 2D digital photos can be taken from different viewpoints. The so-called Structure from Motion (SfM) technique can use these overlapping photos and reconstruct them into 3D point-cloud data with a high level of accuracy and resolution, using a cost-effective approach. In this work, we create street-view SfM point-cloud data obtained from street viewpoints. We also introduce a new multi-view approach by merging top-view LiDAR data with street-view SfM data. This new multi-view data can be used as topographic input data for a coupled 1D-2D model. When applying such new data, the flood simulation results can highlight some flood propagations much better than using the traditional top-view LiDAR data. Therefore, it has the potential to enhance the multi-view approach into practicable flood-modelling applications for present and future urbanizing areas.
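    As a loose illustration of the merging idea rather than the authors' pipeline, the sketch below grids two point clouds onto a common raster and fills cells that the top-view LiDAR cannot see with street-view SfM elevations. The grid layout, the lowest-point rule and the gap-filling rule are all assumptions made for the example.

```python
import numpy as np

def rasterize(points, x0, y0, cell, nx, ny):
    """Grid an (N, 3) point cloud of (x, y, z) into a DEM, keeping the lowest
    elevation seen in each cell as a simple ground-level estimate."""
    dem = np.full((ny, nx), np.nan)
    ix = ((points[:, 0] - x0) / cell).astype(int)
    iy = ((points[:, 1] - y0) / cell).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for cx, cy, z in zip(ix[inside], iy[inside], points[inside, 2]):
        if np.isnan(dem[cy, cx]) or z < dem[cy, cx]:
            dem[cy, cx] = z
    return dem

def merge_views(lidar_dem, sfm_dem):
    """Keep LiDAR elevations where available and fall back to street-view SfM
    elevations in cells the airborne sensor could not observe."""
    merged = lidar_dem.copy()
    gaps = np.isnan(merged) & ~np.isnan(sfm_dem)
    merged[gaps] = sfm_dem[gaps]
    return merged
```

    The merged raster could then serve as the topographic input of the 2D part of a coupled 1D-2D flood model; how the authors actually fuse the two datasets, and at what resolution, is not stated in the abstract.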

    Long-Term Person Re-Identification in the Wild

    University of Technology Sydney, Faculty of Engineering and Information Technology.
    Person re-identification (re-ID) has been attracting extensive research interest because of its indispensable role in applications such as surveillance security, criminal investigation and forensic reasoning. Existing works assume that pedestrians keep their clothes unchanged while passing across disjoint cameras within a short period. This narrows person re-ID to a short-term problem and leads to solutions based on appearance similarity. However, this assumption is not always true in practice. For example, pedestrians are highly likely to re-appear after a long period, such as several days. This emerging problem is termed long-term person re-ID (LT-reID). Depending on the types of sensors deployed, LT-reID is divided into two subtasks: person re-ID after a long-time gap (LTG-reID) and cross-camera-modality person re-ID (CCM-reID). LTG-reID uses only RGB cameras, while CCM-reID employs different types of sensors. Besides the challenges of classical person re-ID, CCM-reID faces an additional data-distribution discrepancy caused by the modality difference, and LTG-reID suffers severe within-person appearance inconsistency caused by clothing changes. These variations seriously degrade the performance of existing re-ID methods.
    To address the aforementioned problems, this thesis investigates LT-reID from four aspects: motion pattern mining, view bias mitigation, cross-modality matching and hybrid representation learning. Motion pattern mining aims to address LTG-reID by exploiting true motion information. To this end, a fine motion encoding method is proposed, which extracts motion patterns hierarchically by encoding trajectory-aligned descriptors with Fisher vectors in a spatially aligned pyramid. View bias mitigation targets narrowing the discrepancy caused by viewpoint differences. This thesis proposes two solutions: VN-GAN normalizes gaits from various views into a unified one, and VT-GAN achieves view transformation between gaits from any two views. Cross-modality matching aims to learn modality-invariant representations. To this end, this thesis proposes to asymmetrically project heterogeneous features across modalities onto a modality-agnostic space and simultaneously reconstruct the projected data using a shared dictionary in that space. Hybrid representation learning explores both subtle identity properties and motion patterns. To this end, a two-stream network is proposed: the space-time stream operates on image sequences to learn identity-related patterns, e.g., body geometric structure and movement, and the skeleton motion stream operates on normalized 3D skeleton sequences to learn motion patterns.
    Moreover, two datasets tailored for LTG-reID are presented: Motion-reID, collected by two real-world surveillance cameras, and CVID-reID, comprising tracklets clipped from street-shot videos of celebrities on the Internet. Both datasets include abundant within-person clothing variations, highly dynamic backgrounds and diverse camera viewpoints, which promote the development of LT-reID research.
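    The hybrid two-stream design described above can be pictured with a minimal PyTorch sketch. The layer sizes, the 17-joint skeleton format and the fusion by simple concatenation are illustrative assumptions, not the architecture used in the thesis.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    """Sketch of a two-stream model: a space-time stream over an image
    sequence and a skeleton-motion stream over normalized 3D joints,
    fused by concatenation for identity classification."""

    def __init__(self, feat_dim=256, num_ids=100, num_joints=17):
        super().__init__()
        # Space-time stream: a tiny per-frame CNN whose features are
        # averaged over time (a stand-in for a real video backbone).
        self.spacetime = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # Skeleton-motion stream: a GRU over flattened 3D joint coordinates.
        self.skeleton = nn.GRU(input_size=num_joints * 3,
                               hidden_size=feat_dim, batch_first=True)
        self.classifier = nn.Linear(2 * feat_dim, num_ids)

    def forward(self, frames, joints):
        # frames: (B, T, 3, H, W); joints: (B, T, num_joints * 3)
        b, t = frames.shape[:2]
        appearance = self.spacetime(frames.flatten(0, 1)).view(b, t, -1).mean(1)
        _, hidden = self.skeleton(joints)
        fused = torch.cat([appearance, hidden[-1]], dim=1)
        return self.classifier(fused), fused
```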

    Linear Global Translation Estimation with Feature Tracks

    This paper derives a novel linear position constraint for cameras seeing a common scene point, which leads to a direct linear method for global camera translation estimation. Unlike previous solutions, this method deals with collinear camera motion and weak image association at the same time. The final linear formulation does not involve the coordinates of scene points, which makes it efficient even for large-scale data. We solve the linear system under the $L_1$ norm, which makes our system more robust to outliers in essential matrices and feature correspondences. We evaluate this method on both sequentially captured images and unordered Internet images. The experiments demonstrate its strength in robustness, accuracy, and efficiency.
    Comment: Changes: 1. Adopt BMVC 2015 style; 2. Combine Sections 3 and 5; 3. Move "Evaluation on synthetic data" to the supplementary file; 4. Divide the subsection "Evaluation on general data" into "Experiment on sequential data" and "Experiment on unordered Internet data"; 5. Change Fig. 1 and Fig. 8; 6. Move Fig. 6 and Fig. 7 to the supplementary file; 7. Change some symbols; 8. Correct some typos.
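    The paper's linear constraint itself is not reproduced here; the sketch below only illustrates the robust solve step, minimizing the $L_1$ norm of the residual of a stacked linear system. Iteratively reweighted least squares is used as a hypothetical stand-in for whatever $L_1$ solver the authors employ.

```python
import numpy as np

def solve_l1(A, b, iters=50, eps=1e-6):
    """Approximately minimize ||A x - b||_1 via iteratively reweighted
    least squares (IRLS). A stacks one row per linear constraint on the
    unknown camera translations; b is the corresponding right-hand side."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]   # L2 initialization
    for _ in range(iters):
        # Weight each equation by 1/|residual| so the weighted quadratic
        # cost approximates the L1 cost; eps avoids division by zero.
        w = np.sqrt(1.0 / np.maximum(np.abs(A @ x - b), eps))
        x = np.linalg.lstsq(A * w[:, None], w * b, rcond=None)[0]
    return x
```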

    3D high definition video coding on a GPU-based heterogeneous system

    H.264/MVC is a standard for supporting the sensation of 3D, based on coding from 2 (stereo) to N views. H.264/MVC adopts many coding options inherited from single-view H.264/AVC, and thus its complexity is even higher, mainly because the number of processed views is larger. In this manuscript, we aim at an efficient parallelization of the most computationally intensive video encoding module for stereo sequences, namely inter prediction, and at its collaborative execution on a heterogeneous platform. The proposal is based on an efficient dynamic load balancing algorithm and on breaking encoding dependencies. Experimental results demonstrate the proposed algorithm's ability to reduce the encoding time for different stereo high-definition sequences. Speed-up values of up to 90× were obtained when compared with the reference encoder on the same platform. Moreover, the proposed algorithm also provides a more energy-efficient approach and hence requires less energy than the sequential reference algorithm.
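    The dynamic load balancing idea can be sketched at a high level. The Python loop below is a hypothetical illustration that splits each frame's macroblock rows between a GPU path and a CPU path and re-estimates the split from the measured throughput of the previous frame; the real encoder runs the two paths concurrently in native code, so this only conveys the balancing rule, not the implementation.

```python
import time

def encode_sequence(frames, gpu_encode, cpu_encode, split=0.5):
    """Hypothetical balancing loop. `frames` is a list of frames, each a
    list of macroblock rows; `gpu_encode`/`cpu_encode` are callables that
    process a slice of rows on the respective device."""
    for rows in frames:
        cut = int(len(rows) * split)
        t0 = time.perf_counter(); gpu_encode(rows[:cut])
        t_gpu = time.perf_counter() - t0
        t0 = time.perf_counter(); cpu_encode(rows[cut:])
        t_cpu = time.perf_counter() - t0
        # Give the next frame's larger share to whichever device was faster.
        gpu_rate = max(cut, 1) / max(t_gpu, 1e-9)
        cpu_rate = max(len(rows) - cut, 1) / max(t_cpu, 1e-9)
        split = gpu_rate / (gpu_rate + cpu_rate)
    return split
```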

    Generic 3D Representation via Pose Estimation and Matching

    Though a large body of computer vision research has investigated developing generic semantic representations, efforts towards developing a similar representation for 3D have been limited. In this paper, we learn a generic 3D representation through solving a set of foundational proxy 3D tasks: object-centric camera pose estimation and wide-baseline feature matching. Our method is based upon the premise that by providing supervision over a set of carefully selected foundational tasks, generalization to novel tasks and abstraction capabilities can be achieved. We empirically show that the internal representation of a multi-task ConvNet trained to solve the above core problems generalizes to novel 3D tasks (e.g., scene layout estimation, object pose estimation, surface normal estimation) without the need for fine-tuning and shows traits of abstraction abilities (e.g., cross-modality pose estimation). In the context of the core supervised tasks, we demonstrate that our representation achieves state-of-the-art wide-baseline feature matching results without requiring a priori rectification (unlike SIFT and the majority of learned features). We also show 6DOF camera pose estimation given a pair of local image patches. The accuracy of both supervised tasks is comparable to that of humans. Finally, we contribute a large-scale dataset composed of object-centric street view scenes along with point correspondences and camera pose information, and conclude with a discussion of the learned representation and open research questions.
    Comment: Published in ECCV16. See the project website http://3drepresentation.stanford.edu/ and dataset website https://github.com/amir32002/3D_Street_Vie
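    A shared-trunk, two-head network of the kind described can be sketched briefly in PyTorch. The trunk depth, feature dimensions and output parameterizations below are assumptions for illustration, not the network released with the paper.

```python
import torch
import torch.nn as nn

class PoseAndMatchNet(nn.Module):
    """Sketch of a multi-task ConvNet on a pair of image patches: a shared
    trunk feeds a 6-DOF relative pose head and a match/non-match head."""

    def __init__(self, feat_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.pose_head = nn.Linear(2 * feat_dim, 6)   # axis-angle + translation
        self.match_head = nn.Linear(2 * feat_dim, 2)  # match / non-match logits

    def forward(self, patch_a, patch_b):
        # Each patch: (B, 3, H, W); the trunk is shared (Siamese-style).
        f = torch.cat([self.trunk(patch_a), self.trunk(patch_b)], dim=1)
        return self.pose_head(f), self.match_head(f)
```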