A vision system for mobile maritime surveillance platforms
Mobile surveillance systems play an important role in minimising security and safety threats in high-risk or hazardous environments. Providing a mobile marine surveillance platform with situational awareness of its environment is important for mission success. An essential part of situational awareness is the ability to detect and subsequently track potential target objects. Typically, the exact type of target object is unknown, so detection is addressed as the problem of finding parts of an image that stand out in relation to their surrounding regions or are atypical for the domain. In contrast to existing saliency methods, this thesis proposes a domain-specific visual attention approach for detecting potential regions of interest in maritime imagery. For this, low-level features that are indicative of maritime targets are identified. These features are then evaluated with respect to their local, regional, and global significance. Together with a domain-specific background segmentation technique, the features are combined in a Bayesian classifier to direct visual attention to potential target objects.
The maritime environment introduces challenges for the camera system: gusts, wind, swell, or waves can cause the platform to move drastically and unpredictably. Pan-tilt-zoom cameras, which are often utilised for surveillance tasks, can adjust their orientation to provide a stable view of the target. However, in rough maritime environments this requires high-speed and precise inputs. In contrast, omnidirectional cameras provide a full spherical view, which allows the acquisition and tracking of multiple targets at the same time, although each target then occupies only a small fraction of the overall view. This thesis proposes a novel, target-centric approach to image stabilisation. A virtual camera is extracted from the omnidirectional view for each target and is adjusted based on the measurements of an inertial measurement unit and an image feature tracker. The combination of these two techniques in a probabilistic framework allows for stabilisation of rotational and translational ego-motion. Furthermore, it has the specific advantage of being robust to loosely calibrated and synchronised hardware, since the fusion of tracking and stabilisation means that tracking uncertainty can be used to compensate for errors in calibration and synchronisation. This eliminates the need for tedious calibration phases and the adverse effects of assembly slippage over time.
Finally, this thesis combines the visual attention and omnidirectional stabilisation frameworks and proposes a multi-view tracking system that is capable of detecting potential target objects in the maritime domain. Although the visual attention framework performed well on benchmark datasets, the evaluation on real-world maritime imagery produced a high number of false positives. An investigation reveals that benchmark datasets are unintentionally influenced by human shot selection, which greatly simplifies the problem of visual attention. Despite the number of false positives, the tracking approach itself remains robust even when many false positives are tracked.
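To illustrate how low-level feature responses might be combined in a Bayesian classifier to form a target-likelihood map, the following Python sketch fuses binarised feature maps under a naive Bayes assumption. The feature names and likelihood tables are hypothetical placeholders, not the thesis' actual features or learned parameters.

```python
import numpy as np

def fuse_saliency_naive_bayes(feature_maps, likelihoods, prior=0.01):
    """Combine per-pixel feature responses into a posterior target map.

    feature_maps: dict name -> HxW array of binarised feature responses (0/1).
    likelihoods:  dict name -> (P(feature=1 | target), P(feature=1 | background)).
    prior:        P(target) before observing any feature.
    """
    shape = next(iter(feature_maps.values())).shape
    log_odds = np.full(shape, np.log(prior / (1.0 - prior)))
    for name, fmap in feature_maps.items():
        p_t, p_b = likelihoods[name]
        # Log-likelihood ratios for observed (1) and unobserved (0) responses.
        lr_on = np.log(p_t / p_b)
        lr_off = np.log((1.0 - p_t) / (1.0 - p_b))
        log_odds += np.where(fmap > 0, lr_on, lr_off)
    return 1.0 / (1.0 + np.exp(-log_odds))   # posterior P(target | features)

# Hypothetical maritime features: horizon-relative contrast and colour atypicality.
h, w = 4, 6
maps = {"contrast": np.random.rand(h, w) > 0.7, "colour": np.random.rand(h, w) > 0.8}
tables = {"contrast": (0.9, 0.2), "colour": (0.8, 0.1)}
posterior = fuse_saliency_naive_bayes(maps, tables)
```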
Automatic Food Intake Assessment Using Camera Phones
Obesity is becoming an epidemic in most developed countries. The fundamental cause of obesity and overweight is an energy imbalance between calories consumed and calories expended. Monitoring everyday food intake is therefore essential for obesity prevention and management. Existing dietary assessment methods usually require manual recording and recall of food types and portions, so the accuracy of the results largely depends on uncertain factors such as the user's memory, food knowledge, and portion estimation, and is often compromised. Accurate and convenient dietary assessment methods are still lacking and are needed by both the general population and the research community.
In this thesis, an automatic food intake assessment method using the cameras and inertial measurement units (IMUs) on smartphones was developed to help people foster a healthy lifestyle. With this method, users use their smartphones before and after a meal to capture images or videos of the meal. The smartphone then recognizes the food items, calculates the volume of food consumed, and provides the results to the user. The technical objective is to explore the feasibility of image-based food recognition and image-based volume estimation.
This thesis comprises five publications that address four specific goals of this work: (1) to develop a prototype system from existing methods in order to review the literature, identify its drawbacks, and explore the feasibility of developing novel methods; (2) based on the prototype system, to investigate new food classification methods that improve recognition accuracy to a level usable in field applications; (3) to design indexing methods for large-scale image databases that facilitate the development of new food image recognition and retrieval algorithms; and (4) to develop novel, convenient, and accurate food volume estimation methods using only smartphones with cameras and IMUs.
A prototype system was implemented to review existing methods. An image feature detector and descriptor were developed, and a nearest-neighbor classifier was implemented to classify food items. A credit card marker method was introduced for metric-scale 3D reconstruction and volume calculation.
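As a toy illustration of the nearest-neighbor classification step, the sketch below assigns a query food image the label of its closest training descriptor. The three-dimensional descriptors and class names are invented for the example; the thesis' actual feature detector and descriptor are more elaborate.

```python
import numpy as np

def nearest_neighbor_label(query_desc, gallery_descs, gallery_labels):
    """Assign the label of the closest gallery descriptor (1-NN, Euclidean).

    query_desc:     (D,) feature vector for the query food image.
    gallery_descs:  (N, D) feature vectors extracted from labelled training images.
    gallery_labels: length-N list of food class names.
    """
    dists = np.linalg.norm(gallery_descs - query_desc, axis=1)
    return gallery_labels[int(np.argmin(dists))]

# Hypothetical 3-D colour/texture descriptors for three training images.
gallery = np.array([[0.9, 0.1, 0.3], [0.2, 0.8, 0.5], [0.4, 0.4, 0.9]])
labels = ["rice", "broccoli", "beef"]
print(nearest_neighbor_label(np.array([0.85, 0.15, 0.35]), gallery, labels))  # -> "rice"
```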
To increase recognition accuracy, novel multi-view food recognition algorithms were developed to recognize regular-shaped food items. To further increase the accuracy and make the algorithm applicable to arbitrary food items, new food features and new classifiers were designed. The efficiency of the algorithm was increased by developing a novel image indexing method for large-scale image databases. Finally, the volume calculation was enhanced by reducing the marker and introducing IMUs. Sensor fusion techniques that combine measurements from cameras and IMUs were explored to infer the metric scale of the 3D model as well as to reduce noise from these sensors.
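One simple way camera and IMU measurements can be fused to recover metric scale is to compare the IMU-integrated displacement with the up-to-scale visual displacement over each keyframe interval. The sketch below shows this idea under strong simplifying assumptions (gravity-aligned frame, zero initial velocity per interval, no accelerometer bias); it is not the thesis' actual fusion algorithm.

```python
import numpy as np

def metric_scale_from_imu(visual_translations, accel_samples, dt,
                          gravity=np.array([0.0, 0.0, 9.81])):
    """Estimate the metric scale of an up-to-scale visual reconstruction.

    visual_translations: (K, 3) camera translations between keyframes, unit-free.
    accel_samples:       (K, M, 3) gravity-inclusive accelerometer readings per interval.
    dt:                  accelerometer sample period in seconds.

    The scale is the ratio of IMU-integrated metric displacement to visual
    displacement, averaged over all keyframe intervals.
    """
    ratios = []
    for t_vis, acc in zip(visual_translations, accel_samples):
        lin_acc = acc - gravity                      # remove gravity (assumes aligned frame)
        vel = np.cumsum(lin_acc, axis=0) * dt        # integrate acceleration -> velocity
        disp = np.sum(vel, axis=0) * dt              # integrate velocity -> displacement
        ratios.append(np.linalg.norm(disp) / np.linalg.norm(t_vis))
    return float(np.mean(ratios))
```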
RD-VIO: Robust Visual-Inertial Odometry for Mobile Augmented Reality in Dynamic Environments
It is typically challenging for visual or visual-inertial odometry systems to handle dynamic scenes and pure rotation. In this work, we design a novel visual-inertial odometry (VIO) system called RD-VIO to handle both of these problems. Firstly, we propose an IMU-PARSAC algorithm which can robustly detect and match keypoints in a two-stage process. In the first stage, landmarks are matched with new keypoints using visual and IMU measurements. We collect statistical information from the matching and then use it to guide the intra-keypoint matching in the second stage. Secondly, to handle the problem of pure rotation, we detect the motion type and adapt the deferred-triangulation technique during the data-association process. We turn the pure-rotational frames into special subframes. When solving the visual-inertial bundle adjustment, they provide additional constraints on the pure-rotational motion. We evaluate the proposed VIO system on public datasets. Experiments show that the proposed RD-VIO has obvious advantages over other methods in dynamic environments.
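A common way to detect a (nearly) pure-rotational motion type is to compensate the gyroscope-predicted rotation with an infinite homography and check whether any residual keypoint displacement (parallax) remains. The sketch below illustrates that test; the threshold and interfaces are illustrative assumptions and are not taken from the RD-VIO paper.

```python
import numpy as np

def is_pure_rotation(pts_prev, pts_curr, K, R_imu, parallax_thresh_px=1.0):
    """Flag a frame as (nearly) pure-rotational.

    pts_prev, pts_curr: (N, 2) matched pixel coordinates in the previous/current frame.
    K:                  (3, 3) camera intrinsic matrix.
    R_imu:              (3, 3) rotation of the current frame w.r.t. the previous one,
                        as predicted by gyroscope integration.

    Rotation is compensated with the infinite homography K @ R @ K^-1; the residual
    displacement (parallax) is produced only by translation, so a small median
    residual indicates pure rotation.
    """
    H = K @ R_imu @ np.linalg.inv(K)
    prev_h = np.hstack([pts_prev, np.ones((len(pts_prev), 1))])   # homogeneous coords
    warped = (H @ prev_h.T).T
    warped = warped[:, :2] / warped[:, 2:3]
    parallax = np.linalg.norm(pts_curr - warped, axis=1)
    return float(np.median(parallax)) < parallax_thresh_px
```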
Multimodal, Embodied and Location-Aware Interaction
This work demonstrates the development of mobile, location-aware, eyes-free applications which utilise multiple sensors to provide a continuous, rich and embodied interaction. We bring together ideas from the fields of gesture recognition, continuous multimodal interaction, probability theory and audio interfaces to design and develop location-aware applications and embodied interaction in both a small-scale, egocentric body-based case and a large-scale, exocentric 'world-based' case.
BodySpace is a gesture-based application which utilises multiple sensors and pattern recognition, enabling the human body to be used as the interface for an application. As an example, we describe the development of a gesture-controlled music player, which functions by placing the device at different parts of the body. We describe a new approach to the segmentation and recognition of gestures for this kind of application and show how simulated physical model-based interaction techniques and the use of real-world constraints can shape the gestural interaction.
GpsTunes is a mobile, multimodal navigation system equipped with inertial control that enables users to actively explore and navigate through an area in an augmented physical space, incorporating and displaying uncertainty resulting from inaccurate sensing and unknown user intention. The system propagates uncertainty appropriately via Monte Carlo sampling, and output is displayed both visually and in audio, with audio rendered via granular synthesis. We demonstrate the use of uncertain prediction in the real world and show that appropriate display of the full distribution of potential future user positions with respect to sites of interest can improve the quality of interaction over a simplistic interpretation of the sensed data. We show that this system enables eyes-free navigation around set trajectories or paths unfamiliar to the user for varying trajectory width and context. We demonstrate the possibility of creating a simulated model of user behaviour, which may be used to gain an insight into the user behaviour observed in our field trials. The extension of this application to provide a general mechanism for highly interactive context-aware applications via density exploration is also presented. AirMessages is an example application enabling users to take an embodied approach to scanning a local area to find messages left in their virtual environment.
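To make the Monte Carlo propagation of position uncertainty concrete, the sketch below samples plausible future user positions from a noisy GPS fix plus uncertain heading and speed, then evaluates the probability of reaching a site of interest. All parameter values and the Gaussian noise models are illustrative assumptions rather than the system's actual models.

```python
import numpy as np

def predicted_position_samples(gps_fix, gps_sigma, heading, speed,
                               heading_sigma, speed_sigma, horizon_s, n=1000,
                               rng=np.random.default_rng(0)):
    """Monte Carlo prediction of where the user may be in `horizon_s` seconds.

    gps_fix:  (2,) measured position in metres (local east/north frame).
    heading:  estimated walking direction in radians; speed in m/s.
    The sigmas model GPS noise and uncertainty about the user's intention.
    Returns (n, 2) samples; their spread can be rendered as audio/visual density.
    """
    pos = gps_fix + rng.normal(0.0, gps_sigma, size=(n, 2))
    th = rng.normal(heading, heading_sigma, size=n)
    sp = rng.normal(speed, speed_sigma, size=n)
    step = np.stack([np.cos(th), np.sin(th)], axis=1) * (sp * horizon_s)[:, None]
    return pos + step

# Probability that the user reaches within 5 m of a site of interest at (12, 12).
samples = predicted_position_samples(np.array([0.0, 0.0]), 5.0, np.pi / 4, 1.4, 0.3, 0.2, 10.0)
p_reach = np.mean(np.linalg.norm(samples - np.array([12.0, 12.0]), axis=1) < 5.0)
```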
Enriching remote labs with computer vision and drones
With technological advances, new learning technologies are being developed to contribute to a better learning experience. In particular, remote labs constitute an interesting and practical way to motivate today's students to learn. The student can, at any time and from anywhere, access the remote lab and do his lab work. Despite many advantages, remote technologies in education create a distance between the student and the teacher. Without the presence of a teacher, students can face difficulties if no appropriate interventions are taken to help them. In this thesis, we aim to enrich an existing remote electronic lab made for engineering students, called "LaboREM" (for remote Laboratory), in two ways. First, we enable the student to send high-level commands to a mini-drone available in the remote lab facility. The objective is to examine the front panels of electronic measurement instruments using the camera embedded on the drone. Furthermore, we allow remote student-teacher communication using the drone in case a teacher is present in the remote lab facility. Finally, the drone has to return home when the mission is over and land on a platform for automatic recharging of its batteries. Second, we propose an automatic system that estimates the affective state of the student (frustrated/confused/flow) in order to take appropriate interventions that ensure good learning outcomes. For example, if the student is having major difficulties, we can try to give him hints or reduce the difficulty level of the lab experiment. We propose to do this using visual cues (head pose estimation and facial expression analysis). Many pieces of evidence about the state of the student can be acquired; however, individually they are incomplete, sometimes inaccurate, and do not cover all aspects of the student's state. This is why we propose to fuse them using Dempster-Shafer theory, which allows the fusion of incomplete evidence.
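For readers unfamiliar with Dempster-Shafer fusion, the sketch below implements Dempster's rule of combination for two basic mass assignments over the affective-state frame {frustrated, confused, flow}. The example masses attributed to the head-pose and facial-expression cues are invented for illustration, not taken from the thesis.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two basic mass assignments.

    m1, m2: dicts mapping frozensets of hypotheses (subsets of the frame of
    discernment) to masses that each sum to 1. Returns the combined assignment.
    """
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb        # mass assigned to contradictory intersections
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Hypothetical evidence from head pose and facial expression.
frame = frozenset({"frustrated", "confused", "flow"})
m_pose = {frozenset({"confused", "frustrated"}): 0.6, frame: 0.4}
m_face = {frozenset({"frustrated"}): 0.5, frame: 0.5}
print(dempster_combine(m_pose, m_face))
```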
Object-Aware Tracking and Mapping
Reasoning about geometric properties of digital cameras and optical physics enabled researchers to build methods that localise cameras in 3D space from a video stream, while – often simultaneously – constructing a model of the environment. Related techniques have evolved substantially since the 1980s, leading to increasingly accurate estimations. Traditionally, however, the quality of results is strongly affected by the presence of moving objects, incomplete data, or difficult surfaces – i.e. surfaces that are not Lambertian or lack texture. One insight of this work is that these problems can be addressed by going beyond geometrical and optical constraints, in favour of object-level and semantic constraints. Incorporating specific types of prior knowledge in the inference process, such as motion or shape priors, leads to approaches with distinct advantages and disadvantages.
After introducing relevant concepts in Chapter 1 and Chapter 2, methods for building object-centric maps in dynamic environments using motion priors are investigated in Chapter 5. Chapter 6 addresses the same problem as Chapter 5, but presents an approach which relies on semantic priors rather than motion cues. To fully exploit semantic information, Chapter 7 discusses the conditioning of shape representations on prior knowledge and the practical application to monocular, object-aware reconstruction systems.
Inertial-aided Visual Perception of Geometry and Semantics
We describe components of a visual perception system that understands the geometry and semantics of a three-dimensional scene by utilizing monocular cameras and inertial measurement units (IMUs). The use of the two sensor modalities is motivated by the wide availability of camera-IMU sensor packages in mobile devices from phones to cars, and by their complementary sensing capabilities: IMUs can track the motion of the sensor platform accurately over a short period of time and provide a scaled and gravity-aligned global reference frame, while cameras can capture rich photometric signatures of the scene and provide relative motion constraints between images up to scale. We first show that visual 3D reconstruction can be improved by leveraging the global orientation frame, which is easily inferred from inertials. In the gravity-aligned global orientation frame, a shape prior can be imposed in depth prediction from a single image, where the normal vectors to surfaces of objects of certain classes tend to align with gravity or be orthogonal to it. Adding such a prior to baseline methods for monocular depth prediction yields improvements beyond the state of the art and illustrates the power of utilizing inertials in 3D reconstruction. The global reference provided by inertials is not only gravity-aligned but also scaled, which is exploited in depth completion: we describe a method to infer dense metric depth from camera motion and sparse depth as estimated using a visual-inertial odometry system. Unlike other scenarios using point clouds from lidar or structured-light sensors, we have a few hundred to a few thousand points, insufficient to inform the topology of the scene. Our method first constructs a piecewise planar scaffolding of the scene, and then uses it to infer dense depth using the image along with the sparse points. We use a predictive cross-modal criterion, akin to "self-supervision," measuring photometric consistency across time, forward-backward pose consistency, and geometric compatibility with the sparse point cloud. We also launch the first visual-inertial + depth dataset (dubbed "VOID"), which we hope will foster additional exploration into combining the complementary strengths of visual and inertial sensors. To compare our method to prior work, we adopt the unsupervised KITTI depth completion benchmark and show state-of-the-art performance on it.
In addition to dense geometry, the camera-IMU sensor package can also be used to recover the semantics of the scene. We present two methods to augment a point cloud map with class-labeled objects represented in the form of either scaled and oriented bounding boxes or CAD models. The tradeoff between the two shape representations lies in their generality and their capability to model detailed structures. While being more generic, 3D bounding boxes fail to model the details of the objects, whereas CAD models preserve the finest shape details but require more computation and are limited to previously seen objects. Nevertheless, both methods populate an unknown environment with 3D objects placed in a Euclidean reference frame inferred causally and online using monocular video along with inertial sensors. In addition, both methods include bottom-up and top-down components, whereby deep networks trained for detection provide likelihood scores for object hypotheses provided by a nonlinear filter, whose state serves as memory.
We test our methods on the KITTI and SceneNN datasets, and also introduce the VISMA dataset, which contains ground-truth poses, a point-cloud map, and object models, along with time-stamped inertial measurements. To reduce the drift of the visual-inertial SLAM system, a building block of all the visual perception systems we have built, we introduce an efficient loop closure detection approach based on the idea of hierarchical pooling of image descriptors. We also open-sourced a full-fledged SLAM system equipped with mapping and loop closure capabilities. The code is publicly available at https://github.com/ucla-vision/xivo.
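As a rough illustration of loop closure detection by pooling local image descriptors into a single global vector, the following sketch mean-pools descriptors over a small pyramid of splits and retrieves the most similar keyframe by cosine similarity. The pooling scheme, split counts, and threshold are illustrative assumptions, not the open-sourced system's actual implementation.

```python
import numpy as np

def pooled_descriptor(local_descs, levels=(1, 2)):
    """Build a global image descriptor by hierarchical pooling of local descriptors.

    local_descs: (N, D) local feature descriptors, assumed to follow a spatial
    ordering. For each pyramid level L, the descriptors are split into L chunks,
    mean-pooled per chunk, and the results concatenated.
    """
    parts = []
    for level in levels:
        for chunk in np.array_split(local_descs, level):
            parts.append(chunk.mean(axis=0))
    g = np.concatenate(parts)
    return g / (np.linalg.norm(g) + 1e-12)      # unit-normalise for cosine similarity

def best_loop_candidate(query, keyframe_descs, min_score=0.8):
    """Return (index, score) of the most similar keyframe, or None below threshold."""
    scores = keyframe_descs @ query              # cosine similarity on unit vectors
    idx = int(np.argmax(scores))
    return (idx, float(scores[idx])) if scores[idx] >= min_score else None
```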