    Dynamic Scene Understanding with Applications to Traffic Monitoring

    Many breakthroughs have been witnessed in the computer vision community in recent years, largely due to deep Convolutional Neural Networks (CNNs) and large-scale datasets. This thesis investigates dynamic scene understanding from images. The problem involves simultaneously solving several sub-tasks, including object detection, object recognition, and segmentation. Successfully completing these tasks enables us to interpret the objects of interest within a scene. Vision-based traffic monitoring is one of many fast-emerging areas in intelligent transportation systems (ITS). In the thesis, we focus on the following problems in traffic scene understanding: 1) How to detect and recognize all the objects of interest in street view images? 2) How to employ CNN features and semantic pixel labelling to boost the performance of pedestrian detection? 3) How to enhance the discriminative power of CNN representations to improve fine-grained car recognition? 4) How to learn an adaptive color space to represent vehicle images for vehicle color recognition? For the first task, we propose a single learning-based detection framework to detect three important classes of objects (traffic signs, cars, and cyclists). The proposed framework consists of a dense feature extractor and detectors for these three classes. The advantage of using one common framework is that detection is much faster, since the dense features need to be evaluated only once and are then shared by all detectors. The proposed framework introduces spatially pooled features as a part of aggregated channel features to enhance robustness to noise and image deformations. We also propose an object subcategorization scheme as a means of capturing the intra-class variation of objects.
To address the second problem, we show that by re-using the convolutional feature maps (CFMs) of a deep CNN model as visual features to train an ensemble of boosted decision forests, we are able to remarkably improve the performance of pedestrian detection without using specially designed learning algorithms. We also show that semantic pixel labelling can be simply combined with a pedestrian detector to further boost the detection performance. Fine-grained details of objects usually contain very discriminative information, which is crucial for fine-grained object recognition. Conventional pooling strategies (e.g. max-pooling, average-pooling) may discard these fine-grained details and hurt the recognition performance. To remedy this problem, we propose a spatially weighted pooling (swp) strategy which considerably improves the discriminative power of CNN representations. The swp pools the CNN features under the guidance of its learnt masks, which measure the importance of the spatial units in terms of discriminative power. In image color recognition, visual features are extracted from image pixels represented in one color space. The choice of the color space may influence the quality of the extracted features and affect the recognition performance. We propose a color transformation method that converts image pixels from the RGB space to a learnt space to improve the recognition performance. Moreover, we propose a ColorNet which optimizes the architecture of AlexNet and embeds a mini-CNN for color transformation for vehicle color recognition. Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
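The contrast between conventional pooling and the spatially weighted pooling described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the single-channel 2x2 feature map, and the hand-written mask are all assumptions made for illustration, and in the thesis the masks are learnt rather than fixed.

```python
def average_pool(feature_map):
    """Uniform pooling: every spatial unit contributes equally."""
    cells = [value for row in feature_map for value in row]
    return sum(cells) / len(cells)


def spatially_weighted_pool(feature_map, mask):
    """Pool one channel as a weighted sum over spatial units.

    `mask` holds one weight per spatial location. Here it is a fixed,
    hypothetical stand-in for a learnt mask.
    """
    if len(feature_map) != len(mask) or any(
        len(f_row) != len(m_row) for f_row, m_row in zip(feature_map, mask)
    ):
        raise ValueError("feature map and mask must have the same shape")
    return sum(
        f * m
        for f_row, m_row in zip(feature_map, mask)
        for f, m in zip(f_row, m_row)
    )


# A 2x2 activation map with one strongly discriminative location.
features = [[0.1, 0.2],
            [0.1, 4.0]]
# Hypothetical mask that emphasises the bottom-right unit.
mask = [[0.0, 0.0],
        [0.0, 1.0]]

print(average_pool(features))                   # the strong response is diluted
print(spatially_weighted_pool(features, mask))  # the strong response is kept
```

With a uniform mask whose weights sum to one, the weighted sum reduces to average pooling, which is why the mask can be read as a per-location importance score.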

    Diabetic foot ulcer classification using mapped binary patterns and convolutional neural networks

    Diabetic foot ulcer (DFU) is a major complication of diabetes and can lead to lower limb amputation if not treated early and properly. In addition to the traditional clinical approaches, research on automation using computer vision and machine learning methods has in recent years played an important role in DFU classification, achieving promising results. The most recent automatic approaches to DFU classification are based on convolutional neural networks (CNNs), using solely RGB images as input. In this paper, we present a CNN-based DFU classification method and show that feeding an appropriate feature (texture information) to the CNN model provides performance complementary to that of the standard RGB-based deep models on the DFU classification task, and that better performance can be obtained if both RGB images and their texture features are combined and used as input to the CNN. To this end, the proposed method consists of two main stages. The first stage extracts texture information from the RGB image using the mapped binary patterns technique. The resulting mapped image aids the second stage in recognizing DFU, as it contains texture information about the ulcer. The RGB and mapped binary patterns images are fed to the CNN either stacked as a tensor input or as a fused image, which is a linear combination of the RGB and mapped binary patterns images. The performance of the proposed approach was evaluated using two recently published DFU datasets: the Part-A dataset of healthy and unhealthy (DFU) cases [17] and the Part-B dataset of ischaemia and infection cases [18]. The results showed that the proposed methods outperformed the state-of-the-art CNN-based methods, with 0.981 (AUC) and 0.952 (F-measure) on the Part-A dataset, 0.995 (AUC) and 0.990 (F-measure) on the Part-B ischaemia dataset, and 0.820 (AUC) and 0.744 (F-measure) on the Part-B infection dataset
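The two input variants named in this abstract, a stacked tensor and a fused image formed by linear combination, can be sketched per pixel as follows. This is an illustrative sketch only: the function names, the blending weight `alpha`, and the assumption that the mapped binary patterns image has three channels like the RGB image are not taken from the paper.

```python
def fuse_pixel(rgb, mbp, alpha=0.5):
    """Linear combination of an RGB pixel and a mapped-pattern pixel.

    `alpha` is a hypothetical blending weight; the abstract does not
    state which coefficients are used.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * c + (1.0 - alpha) * m for c, m in zip(rgb, mbp))


def stack_pixel(rgb, mbp):
    """Concatenate channels so the network sees both images at once."""
    return tuple(rgb) + tuple(mbp)


def combine_images(rgb_image, mbp_image, pixel_op):
    """Apply a per-pixel operation to two images of identical size."""
    return [
        [pixel_op(rgb_px, mbp_px) for rgb_px, mbp_px in zip(rgb_row, mbp_row)]
        for rgb_row, mbp_row in zip(rgb_image, mbp_image)
    ]


# Toy 1x2 images; each pixel is a 3-tuple of channel values.
rgb_image = [[(100, 200, 50), (10, 20, 30)]]
mbp_image = [[(0, 100, 150), (30, 20, 10)]]

fused = combine_images(rgb_image, mbp_image, fuse_pixel)     # 3 channels per pixel
stacked = combine_images(rgb_image, mbp_image, stack_pixel)  # 6 channels per pixel
print(fused)
print(stacked)
```

The fused variant keeps the input at three channels, so a network expecting RGB input could consume it unchanged, whereas the stacked variant needs a first layer that accepts six channels.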

    Re-identifying people in the crowd

    Developing an automated surveillance system is of great interest for various reasons, including forensic and security applications. In the case of a network of surveillance cameras with non-overlapping fields of view, person detection and tracking alone are insufficient to track a subject of interest across the network. In this case, instances of a person captured in one camera view need to be retrieved from a gallery of different people in other camera views. This vision problem is commonly known as person re-identification (re-id). Cross-view instances of pedestrians exhibit varied levels of illumination, viewpoint, and pose variation, which makes the problem very challenging. Despite recent progress towards improving accuracy, existing systems suffer from low applicability to real-world scenarios. This is mainly caused by the need for large amounts of annotated data from pairwise camera views to be available for training. Given the difficulty of obtaining and annotating such data, this thesis aims to bring the person re-id problem a step closer to real-world deployment. In the first contribution, the single-shot protocol, where each individual is represented by a pair of images that need to be matched, is considered. Following the extensive annotation of four datasets for six attributes, an evaluation of the most widely used feature extraction schemes is conducted. The results reveal two high-performing descriptors among those evaluated, and show illumination variation to have the most impact on re-id accuracy. Motivated by the wide availability of videos from surveillance cameras and the additional visual and temporal information they provide, video-based person re-id is then investigated, and a supervised system is developed. This is achieved by improving and extending the best performing image-based person descriptor into three dimensions and combining it with distance metric learning.
The resulting system achieves state-of-the-art results on two widely used datasets. Given the cost and difficulty of obtaining labelled data from pairwise cameras in a network to train the model, an unsupervised video-based person re-id method is also developed. It is based on a set-based distance measure that leverages rank vectors to estimate the similarity scores between person tracklets. The proposed system outperforms other unsupervised methods by a large margin on two datasets while competing with deep learning methods on another large-scale dataset

    Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

    Vision Transformers (ViTs) have proven effective at solving 2D image understanding tasks by training on large-scale image datasets and, on a somewhat separate track, at modeling the 3D visual world, such as voxels or point clouds. However, despite the growing hope that transformers can become the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a pilot study, this paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we "inflate" the patch embedding and token sequence, accompanied by new positional encoding mechanisms designed to match the 3D data geometry. The resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs. Moreover, we note that pursuing a unified 2D-3D ViT design has practical relevance beyond scientific curiosity. Specifically, we demonstrate that Simple3D-Former naturally makes it possible to exploit the wealth of pre-trained weights from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to enhance 3D task performance "for free"

    Learning by correlation for computer vision applications: from Kernel methods to deep learning

    Learning to spot analogies and differences within/across visual categories is an arguably powerful approach in machine learning and pattern recognition which is directly inspired by human cognition. In this thesis, we investigate a variety of approaches which are primarily driven by correlation and tackle several computer vision applications

    Visual Place Recognition under Severe Viewpoint and Appearance Changes

    Over the last decade, the robotics and computer vision research communities have made extensive advances in long-term robotic vision. Visual localization is a constituent of this active research domain: the ability of an object to localize itself correctly while simultaneously mapping the environment is technically termed Simultaneous Localization and Mapping (SLAM). Visual Place Recognition (VPR), a core component of SLAM, is a well-known paradigm. In layman's terms, at a certain place within an environment, a robot needs to decide whether it is the same place it has experienced before. Visual Place Recognition utilizing Convolutional Neural Networks (CNNs) has made a major contribution in the last few years. However, image retrieval-based VPR becomes more challenging when the same places undergo strong viewpoint and seasonal changes. This thesis concentrates on improving the retrieval performance of VPR systems, generally targeting place correspondence. Despite the remarkable performance of state-of-the-art deep CNNs for VPR, their significant computation and memory overhead limits their practical deployment on resource-constrained mobile robots. This thesis investigates the utility of shallow CNNs for power-efficient VPR applications. The proposed VPR frameworks focus on novel image regions that can contribute to recognizing places under dubious environment and viewpoint variations. Employing challenging place recognition benchmark datasets, this thesis further illustrates and evaluates the robustness of shallow CNN-based regional features against viewpoint and appearance changes coupled with dynamic instances, such as pedestrians and vehicles. Finally, the presented computation-efficient and light-weight VPR methodologies show a boost in matching performance, in terms of Area under Precision-Recall curves (AUC-PR curves), over state-of-the-art deep neural network based place recognition and SLAM algorithms
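The area-under-precision-recall metric mentioned at the end of this abstract can be sketched as follows. This is a generic illustration, not the evaluation protocol of the thesis: the function names, the toy scores, and the choice of trapezoidal integration are assumptions, and tied scores are not given any special treatment.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) pairs from ranked match candidates.

    `scores` are similarity scores for candidate place matches and
    `labels` are 1 for a correct match, 0 otherwise. The decision
    threshold is swept from the highest score to the lowest.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one correct match is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = false_pos = 0
    for _, is_match in ranked:
        if is_match:
            true_pos += 1
        else:
            false_pos += 1
        points.append((true_pos / total_positives,
                       true_pos / (true_pos + false_pos)))
    return points


def area_under_pr(points):
    """Trapezoidal area under the curve, starting from recall 0."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


# Hypothetical scores for four query-to-reference candidates.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 1, 0, 0]
print(area_under_pr(precision_recall_points(scores, labels)))
```

A value close to 1 means correct place matches are consistently ranked above incorrect ones across all thresholds.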

    All-weather object recognition using radar and infrared sensing

    Autonomous cars are an emergent technology which has the capacity to change human lives. The current sensor systems which are most capable of perception are based on optical sensors. For example, deep neural networks show outstanding results in recognising objects when used to process data from cameras and Light Detection And Ranging (LiDAR) sensors. However, these sensors perform poorly under adverse weather conditions such as rain, fog, and snow due to the sensor wavelengths. This thesis explores new sensing developments based on long wave polarised infrared (IR) imagery and imaging radar to recognise objects. First, we developed a methodology based on Stokes parameters using polarised infrared data to recognise vehicles using deep neural networks. Second, we explored the potential of using only the power spectrum captured by low-THz radar sensors to perform object recognition in a controlled scenario. This latter work is based on a data-driven approach together with the development of a data augmentation method based on attenuation, range and speckle noise. Last, we created a new large-scale dataset in the "wild" with many different weather scenarios (sunny, overcast, night, fog, rain and snow), showing the robustness of radar for detecting vehicles in adverse weather. High resolution radar and polarised IR imagery, combined with a deep learning approach, are shown to be a potential alternative to current automotive sensing systems based on visible spectrum optical technology, as they are more robust in severe weather and adverse light conditions. UK Engineering and Physical Research Council, grant reference EP/N012402/
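For readers unfamiliar with the Stokes parameters mentioned in this abstract, the linear components can be computed from intensity measurements taken behind a linear polariser at several orientations. The sketch below uses the textbook formulation and assumes four orientations (0, 45, 90 and 135 degrees); the thesis does not state its acquisition setup in this abstract, so the function names and the input layout are assumptions.

```python
import math


def linear_stokes(i0, i45, i90, i135):
    """Linear Stokes parameters from four polariser orientations.

    Inputs are intensities of one pixel measured at 0, 45, 90 and
    135 degrees (an assumed setup, not taken from the thesis).
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal versus vertical
    s2 = i45 - i135                     # diagonal versus anti-diagonal
    return s0, s1, s2


def dolp_aolp(s0, s1, s2):
    """Degree and angle (radians) of linear polarisation."""
    if s0 <= 0:
        raise ValueError("total intensity must be positive")
    dolp = math.hypot(s1, s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


# Fully horizontally polarised light: everything passes at 0 degrees,
# nothing at 90 degrees, half at each diagonal.
s0, s1, s2 = linear_stokes(1.0, 0.5, 0.0, 0.5)
print(dolp_aolp(s0, s1, s2))
```

Per-pixel maps of these quantities are one plausible way to present polarised imagery to a neural network, since man-made surfaces such as vehicle bodies often polarise emitted and reflected radiation differently from natural backgrounds.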

    Tightly-coupled manipulation pipelines: Combining traditional pipelines and end-to-end learning

    Traditionally, robot manipulation tasks are solved by engineering solutions in a modular fashion, typically consisting of object detection, pose estimation, grasp planning, motion planning, and finally a control algorithm that executes the planned motion. This traditional approach to robot manipulation separates the hard problem of manipulation into several self-contained stages, which can be developed independently, and gives interpretable outputs at each stage of the pipeline. However, this approach comes with a plethora of issues, most notably its limited generalisability to a broad range of tasks; it is common that as tasks get more difficult, the systems become increasingly complex. To combat the flaws of these systems, recent trends have seen robots visually learning to predict actions and grasp locations directly from sensor input in an end-to-end manner using deep neural networks, without the need to explicitly model the in-between modules. This thesis investigates a sample of methods that fall somewhere on a spectrum from pipelined to fully end-to-end, which we believe to be more advantageous for developing a general manipulation system: one that could eventually be used in highly dynamic and unpredictable household environments. The investigation starts at the far end of the spectrum, where we explore learning an end-to-end controller in simulation and then transferring it to the real world by employing domain randomisation, and finishes at the other end, with a new pipeline in which the individual modules bear little resemblance to the "traditional" ones. The thesis concludes with a proposition of a new paradigm: Tightly-coupled Manipulation Pipelines (TMP).
Rather than learning all modules implicitly in one large, end-to-end network or, conversely, having individual, pre-defined modules that are developed independently, TMPs suggest taking the best of both worlds by tightly coupling actions to observations, whilst still maintaining structure via an undefined number of learned modules, which do not have to bear any resemblance to the modules seen in "traditional" systems.

    Dynamic deep learning for automatic facial expression recognition and its application in diagnosis of ADHD & ASD

    Neurodevelopmental conditions like Attention Deficit Hyperactivity Disorder (ADHD) and Autism Spectrum Disorder (ASD) impact a significant number of children and adults worldwide. Currently, the diagnosis of such conditions is carried out by experts, who employ standard questionnaires and look for certain behavioural markers through manual observation. Such methods are not only subjective, difficult to repeat, and costly but also extremely time consuming. However, the recent surge of research into automatic facial behaviour analysis and its varied applications could prove to be a potential way of tackling these diagnostic difficulties. Automatic facial expression recognition is one of the core components of this field, but it has always been challenging to perform accurately in an unconstrained environment. This thesis presents a dynamic deep learning framework for robust automatic facial expression recognition. It also proposes an approach for applying this method to facial behaviour analysis, which can help in the diagnosis of conditions like ADHD and ASD. The proposed facial expression algorithm uses a deep Convolutional Neural Network (CNN) to learn models of facial Action Units (AUs). It attempts to model three main distinguishing features of AUs, namely shape, appearance and short-term dynamics, jointly in a CNN. Appearance is modelled through local image regions relevant to each AU, shape is encoded using binary masks computed from automatically detected facial landmarks, and dynamics are encoded by using a short sequence of images as input to the CNN. In addition, the method employs Bi-directional Long Short-Term Memory (BLSTM) recurrent neural networks for modelling long-term dynamics. The proposed approach is evaluated on a number of databases, showing state-of-the-art performance for both AU detection and intensity estimation tasks.
The AU intensities estimated using this approach, along with other 3D face tracking data, are used for encoding facial behaviour. The encoded facial behaviour is used to learn models which can help in the detection of ADHD and ASD. This approach was evaluated on the KOMAA database, which was specially collected for this purpose. Experimental results show that facial behaviour encoded in this way provides high discriminative power for the classification of people with these conditions. It is shown that the proposed system is a potentially useful, objective and time-saving contribution to the clinical diagnosis of ADHD and ASD

    Automating Inspection of Tunnels With Photogrammetry and Deep Learning

    Asset management of large underground transportation infrastructure requires frequent and detailed inspections to assess its overall structural condition and to focus available funds where required. At the time of writing, the common approach to visual inspection is heavily manual and therefore slow, expensive, and highly subjective. This research evaluates the applicability of an automated pipeline for performing visual inspections of underground infrastructure for asset management purposes. It also analyses the benefits of using lightweight and low-cost hardware versus high-end technology. The aim is to increase automation in performing such tasks to overcome the main drawbacks of the traditional regime. It replaces the subjectivity, approximation and limited repeatability of manual inspection with objectivity and consistent accuracy. Moreover, it reduces the overall end-to-end time required for the inspection and the associated costs. This might translate to more frequent inspections for a given budget, resulting in increased service life of the infrastructure. Shorter inspections have social benefits as well: local communities can rely on safe transportation with minimal disruption of service. Last but not least, it drastically improves health and safety conditions for the inspection engineers, who need to spend less time in this hazardous environment. The proposed pipeline combines photogrammetric techniques for photo-realistic 3D reconstruction with machine learning-based defect detection algorithms. This approach makes it possible to detect and map visible defects on the tunnel's lining in a local coordinate system and provides the asset manager with a clear overview of the critical areas across the whole infrastructure.
The outcomes of the research show that the proposed pipeline largely outperforms human results in accuracy, both in three-dimensional mapping and in defect detection, pushing the benefit-cost ratio strongly in favour of the automated approach. Such outcomes will affect the way the construction industry approaches visual inspections and shift it towards automated strategies