Proposal Flow: Semantic Correspondences from Object Proposals
Finding image correspondences remains a challenging problem in the presence
of intra-class variations and large changes in scene layout. Semantic flow
methods are designed to handle images depicting different instances of the same
object or scene category. We introduce a novel approach to semantic flow,
dubbed proposal flow, that establishes reliable correspondences using object
proposals. Unlike prevailing semantic flow approaches that operate on pixels or
regularly sampled local regions, proposal flow benefits from the
characteristics of modern object proposals, which exhibit high repeatability at
multiple scales, and can take advantage of both local and geometric consistency
constraints among proposals. We also show that the corresponding sparse
proposal flow can effectively be transformed into a conventional dense flow
field. We introduce two new challenging datasets that can be used to evaluate
both general semantic flow techniques and region-based approaches such as
proposal flow. We use these benchmarks to compare different matching
algorithms, object proposals, and region features within proposal flow, to the
state of the art in semantic flow. This comparison, along with experiments on
standard datasets, demonstrates that proposal flow significantly outperforms
existing semantic flow methods in various settings.
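The core matching idea described above, appearance similarity between proposals combined with a geometric-consistency term, can be sketched in a few lines of numpy. This is a toy illustration, not the authors' implementation: the consensus-offset term below merely stands in for the local and geometric consistency constraints the abstract mentions.

```python
import numpy as np

def match_proposals(feat_a, feat_b, boxes_a, boxes_b, lam=0.5):
    """Toy proposal matcher: appearance similarity plus a simple
    geometric-consistency (consensus-offset) term."""
    # Appearance term: cosine similarity between L2-normalised features.
    fa = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    app = fa @ fb.T                                    # (n_a, n_b)

    # Geometry term: centres of (x1, y1, x2, y2) boxes; pairs whose
    # offset agrees with the median offset of the best appearance
    # matches score higher.
    ca = (boxes_a[:, :2] + boxes_a[:, 2:]) / 2
    cb = (boxes_b[:, :2] + boxes_b[:, 2:]) / 2
    off = cb[None, :, :] - ca[:, None, :]              # (n_a, n_b, 2)
    best = app.argmax(axis=1)
    consensus = np.median(off[np.arange(len(ca)), best], axis=0)
    geo = -np.linalg.norm(off - consensus, axis=2)     # penalise disagreement
    geo = geo / (np.abs(geo).max() + 1e-9)             # scale to [-1, 0]

    # Best match in image B for each proposal in image A.
    return (app + lam * geo).argmax(axis=1)
```

With proposals in image B that are a pure translation of those in image A, the matcher recovers the identity correspondence; the geometric term is what suppresses appearance-only mismatches when features are ambiguous.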
3D Object Class Detection in the Wild
Object class detection has been a synonym for 2D bounding box localization
for the longest time, fueled by the success of powerful statistical learning
techniques, combined with robust image representations. Only recently, there
has been a growing interest in revisiting the promise of computer vision from
the early days: to precisely delineate the contents of a visual scene, object
by object, in 3D. In this paper, we draw from recent advances in object
detection and 2D-3D object lifting in order to design an object class detector
that is particularly tailored towards 3D object class detection. Our 3D object
class detection method consists of several stages gradually enriching the
object detection output with object viewpoint, keypoints and 3D shape
estimates. Following careful design, each stage consistently improves performance, and the method achieves state-of-the-art results in simultaneous 2D bounding box and viewpoint estimation on the challenging Pascal3D+ dataset.
The Cityscapes Dataset for Semantic Urban Scene Understanding
Visual understanding of complex urban street scenes is an enabling factor for
a wide range of applications. Object detection has benefited enormously from
large-scale datasets, especially in the context of deep learning. For semantic
urban scene understanding, however, no current dataset adequately captures the
complexity of real-world urban scenes.
To address this, we introduce Cityscapes, a benchmark suite and large-scale
dataset to train and test approaches for pixel-level and instance-level
semantic labeling. Cityscapes comprises a large, diverse set of stereo
video sequences recorded in streets from 50 different cities. 5000 of these
images have high-quality pixel-level annotations; 20000 additional images have
coarse annotations to enable methods that leverage large volumes of
weakly-labeled data. Crucially, our effort exceeds previous attempts in terms
of dataset size, annotation richness, scene variability, and complexity. Our
accompanying empirical study provides an in-depth analysis of the dataset
characteristics, as well as a performance evaluation of several
state-of-the-art approaches based on our benchmark.
Lifting GIS Maps into Strong Geometric Context for Scene Understanding
Contextual information can have a substantial impact on the performance of
visual tasks such as semantic segmentation, object detection, and geometric
estimation. Data stored in Geographic Information Systems (GIS) offers a rich
source of contextual information that has been largely untapped by computer
vision. We propose to leverage such information for scene understanding by
combining GIS resources with large sets of unorganized photographs using
Structure from Motion (SfM) techniques. We present a pipeline to quickly
generate strong 3D geometric priors from 2D GIS data using SfM models aligned
with minimal user input. Given an image resectioned against this model, we
generate robust predictions of depth, surface normals, and semantic labels. We
show that the predicted geometry is substantially more accurate than that of
other single-image depth estimation methods. We then demonstrate the
utility of these contextual constraints for re-scoring pedestrian detections,
and use these GIS contextual features alongside object detection score maps to
improve a CRF-based semantic segmentation framework, boosting accuracy over
baseline models.
Fully-Automated Packaging Structure Recognition of Standardized Logistics Assets on Images
Within a logistics supply chain, a wide variety of transported goods must be handled, identified, and inspected at numerous nodes, which often requires considerable manual effort to recognize or verify a package's identity or its packaging structure. Such steps are necessary, for example, to check a delivery for completeness. We investigate the design and implementation of a method for fully automating the recognition of the packaging structure of logistics shipments. The goal of this method is, based on a single color image, to accurately localize one or more transport units and to recognize relevant characteristics such as the total number or the arrangement of the contained packages. We present a multi-component image-processing pipeline intended to solve this packaging structure recognition task.
Our first implementation of the method uses several deep learning models, specifically convolutional neural networks for instance segmentation, as well as image-processing methods and heuristic components. We use a custom dataset of real images from a logistics environment for training and evaluating our method. We show that our solution recognizes the correct packaging structure in about 85% of the test cases in our dataset, and that even higher accuracy is achieved when only the most common package types are considered.
For a selected image-recognition component of our algorithm, we compare the potential of less computationally intensive, purpose-built image-processing methods with the previously implemented deep learning approaches. From this investigation we conclude that the learning-based methods are better suited, which we attribute to their very good generalization ability.
Furthermore, we formulate the problem of object localization in images in terms of selected feature points, such as the corner points of logistics transport units. The goal is to localize objects more precisely than is possible with conventional bounding rectangles, while at the same time enforcing the object shape through known prior knowledge of the object geometry. We present a specific deep learning model, named TetraPackNet, that solves the described task for objects that can be described by four corner points. The resulting model is evaluated using general and application-specific metrics. We demonstrate the applicability of the solution in our image-recognition pipeline and argue its relevance to other use cases, such as license plate recognition.
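The motivation for a four-corner representation over a bounding rectangle can be made concrete with the shoelace formula: a rotated quadrilateral covers far less area than its axis-aligned box, so four predicted corners localize the object more tightly. This is a toy numpy illustration with made-up corner coordinates, not part of the TetraPackNet model itself.

```python
import numpy as np

def quad_area(corners):
    """Shoelace area of a polygon given (x, y) corners in order."""
    x, y = corners[:, 0], corners[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def bbox_area(corners):
    """Area of the axis-aligned bounding box around the same corners."""
    spans = corners.max(axis=0) - corners.min(axis=0)
    return float(spans[0] * spans[1])
```

For a package face rotated 45 degrees in the image plane, the quadrilateral covers only half the area of its bounding box; everything in the other half would be background that a rectangle-based detector cannot exclude.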
Milestones in Autonomous Driving and Intelligent Vehicles Part II: Perception and Planning
Growing interest in autonomous driving (AD) and intelligent vehicles (IVs) is
fueled by their promise for enhanced safety, efficiency, and economic benefits.
While previous surveys have captured progress in this field, a comprehensive
and forward-looking summary is needed. Our work fills this gap through three
distinct articles. The first part, a "Survey of Surveys" (SoS), outlines the
history, surveys, ethics, and future directions of AD and IV technologies. The
second part, "Milestones in Autonomous Driving and Intelligent Vehicles Part I:
Control, Computing System Design, Communication, HD Map, Testing, and Human
Behaviors" delves into the development of control, computing system,
communication, HD map, testing, and human behaviors in IVs. This part, the
third part, reviews perception and planning in the context of IVs. Aiming to
provide a comprehensive overview of the latest advancements in AD and IVs, this
work caters to both newcomers and seasoned researchers. By integrating the SoS
and Part I, we offer unique insights and strive to serve as a bridge between
past achievements and future possibilities in this dynamic field. (17 pages, 6 figures; IEEE Transactions on Systems, Man, and Cybernetics: Systems.)
Lane Line Detection and Object Scene Segmentation Using Otsu Thresholding and the Fast Hough Transform for Intelligent Vehicles in Complex Road Conditions
An Otsu-threshold- and Canny-edge-detection-based fast Hough transform (FHT) approach to lane detection was proposed to improve the accuracy of lane detection for autonomous vehicle driving. Autonomous vehicles have become very popular over the last two decades, and they can help avoid traffic accidents caused by human error; lane detection is one of the essential functions of a cutting-edge automobile system. This study proposes lane detection through improved (extended) Canny edge detection combined with a fast Hough transform. A Gaussian blur filter was used to smooth the image and reduce noise, improving edge detection accuracy. The Sobel operator, an edge detection operator, computed the gradient of the image intensity with a convolutional kernel to identify edges. These techniques were applied in the initial lane detection module to enhance the characteristics of the road lanes and make them easier to detect in the image. The Hough transform was then used to identify the lanes based on the mathematical relationship between the lanes and the vehicle: the image was converted into a polar coordinate system and searched for lines within a specific range of contrasting points, allowing the algorithm to distinguish the lanes from other features. The Hough transform also makes it possible to separate left and right lane markings; for traditional approaches to work effectively, a region of interest (ROI) must first be extracted. Least-squares fitting in this region was then used to track the lane. The proposed methodology was tested on several image sequences.
In experiments, the proposed system demonstrated high lane detection accuracy, showing that the method performed well in both inference speed and identification accuracy; it balances accuracy with real-time processing and can satisfy the requirements of lane recognition for lightweight automatic driving systems.
Object recognition using multi-view imaging
Most previous research in computer vision and image understanding has used
single-view imaging data, and many techniques have been developed for it.
Recently, with the rapid development and falling cost of multiple cameras, it
has become possible to use many more views for image processing tasks. This
thesis considers how to use such multiple images for target object recognition.
In this context, we present two algorithms for object recognition based on
scale-invariant feature points. The first is a single-view object recognition
method (SOR), which operates on single images and uses a chirality constraint
to reduce the recognition errors that arise when only a small number of
feature points are matched. The procedure is
extended in the second multi-view object recognition algorithm (MOR) which operates on
a multi-view image sequence and, by tracking feature points using a dynamic programming
method in the plenoptic domain subject to the epipolar constraint, is able to fuse feature
point matches from all the available images, resulting in more robust recognition.
We evaluated these algorithms on a number of datasets of real images capturing
both indoor and outdoor scenes. We demonstrate that MOR outperforms SOR,
particularly for noisy and low-resolution images, and that, combined with
segmentation techniques, it can also recognize partially occluded objects.
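The feature-point matching step that methods like SOR and MOR build on is typically a nearest-neighbour descriptor search with a distance-ratio test. The sketch below is a generic numpy version of that step only; the thesis's chirality and epipolar constraints are separate, additional filters applied on top of such raw matches.

```python
import numpy as np

def ratio_match(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour matching with a distance-ratio test: keep a
    match only when the best candidate is clearly closer than the
    second best, discarding ambiguous feature points."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    nn, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc_a))
    keep = d[rows, nn] < ratio * d[rows, second]
    return [(int(i), int(nn[i])) for i in rows[keep]]
```

A point equidistant from two candidates fails the ratio test and is dropped, which is exactly the kind of small-match-count failure mode the chirality constraint then addresses among the survivors.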