12 research outputs found

    Region-Object Relation-Aware Dense Captioning via Transformer

    Dense captioning provides detailed captions of complex visual scenes. While a number of successes have been achieved in recent years, two broad limitations remain: 1) most existing methods adopt an encoder-decoder framework in which contextual information is sequentially encoded using long short-term memory (LSTM); however, the forget gate mechanism of LSTM makes it vulnerable when dealing with long sequences; and 2) the vast majority of prior work considers regions of interest (RoIs) equally important, thus failing to focus on more informative regions. The consequence is that the generated captions cannot highlight the important contents of the image, which does not seem natural. To overcome these limitations, in this article, we propose a novel end-to-end transformer-based dense image captioning architecture, termed the transformer-based dense captioner (TDC). TDC learns the mapping between images and their dense captions via a transformer, prioritizing more informative regions. To this end, we present a novel unit, named the region-object correlation score unit (ROCSU), to measure the importance of each region, taking into account the relationships between detected objects and the region, alongside the confidence scores of detected objects within the region. Extensive experimental results and ablation studies on standard dense-captioning datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
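
    The region-object correlation score unit can be pictured as weighting each caption region by how strongly it overlaps with confidently detected objects. A minimal sketch of such a scoring rule is given below; the function names, the IoU-based relation measure and the softmax weighting are illustrative assumptions, not the paper's exact formulation.

        import numpy as np

        def iou(box_a, box_b):
            """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
            x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
            x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
            area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
            return inter / (area_a + area_b - inter + 1e-8)

        def region_importance(regions, objects, scores):
            """Score each region by confidence-weighted overlap with detected objects."""
            raw = np.array([sum(s * iou(r, o) for o, s in zip(objects, scores))
                            for r in regions])
            weights = np.exp(raw - raw.max())
            return weights / weights.sum()  # softmax over regions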

    Self-supervised high dynamic range imaging: what can be learned from a single 8-bit video?

    Recently, deep learning-based methods for inverse tone mapping of standard dynamic range (SDR) images to obtain high dynamic range (HDR) images have become very popular. These methods manage to fill over-exposed areas convincingly, both in terms of detail and dynamic range. To be effective, deep learning-based methods need to learn from large datasets and transfer this knowledge to the network weights. In this work, we tackle the problem from a completely different perspective. What can we learn from a single 8-bit SDR video? With the presented self-supervised approach, we show that, in many cases, a single SDR video is sufficient to generate an HDR video of the same or better quality than other state-of-the-art methods.
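
    One way to read "self-supervised from a single 8-bit video" is that the network is penalised only where the 8-bit frames are reliable, leaving saturated regions for the network to fill in. The sketch below shows such a masked reconstruction loss; the saturation threshold, the inverse-gamma linearisation and the log-domain L1 form are assumptions for illustration, not the paper's objective.

        import torch

        def masked_hdr_loss(pred_hdr, sdr_frame, sat_threshold=0.95):
            """L1 loss on log radiance, applied only to well-exposed SDR pixels.

            pred_hdr:  (B, 3, H, W) predicted linear-radiance frame
            sdr_frame: (B, 3, H, W) input 8-bit frame scaled to [0, 1]
            """
            # Pixels near the top of the 8-bit range are treated as saturated.
            mask = (sdr_frame < sat_threshold).float()
            # Approximate linearisation of the SDR frame with an inverse gamma curve.
            target = sdr_frame.clamp(min=1e-4) ** 2.2
            diff = torch.abs(torch.log(pred_hdr.clamp(min=1e-4)) - torch.log(target))
            return (diff * mask).sum() / mask.sum().clamp(min=1.0)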

    Deep Synthesis of Cloud Lighting

    Current appearance models for the sky are able to represent clear-sky illumination to a high degree of accuracy. However, these models all lack a common feature of real skies: clouds. Clouds are an essential component for many applications that rely on realistic skies, such as image editing and synthesis. While clouds can be added to existing sky models through rendering, this is hard to achieve due to the difficulties of representing clouds and the complexities of volumetric light transport. In this work, an alternative approach is proposed whereby clouds are synthesized using a learned data-driven representation. This leverages a captured collection of High Dynamic Range cloudy-sky imagery and combines the dataset with clear-sky models to produce plausible cloud appearance from a coarse representation of cloud positions. The representation is artist controllable, allowing novel cloudscapes to be rapidly synthesized and used for lighting virtual environments.
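
    At its core this is a conditional image-to-image problem: a coarse map of cloud positions, together with a clear-sky image from an analytic sky model, is mapped to a cloudy HDR sky. The sketch below shows a hypothetical conditional generator interface; the plain convolutional stack and channel counts are assumptions, not the published network.

        import torch
        import torch.nn as nn

        class CloudSkyGenerator(nn.Module):
            """Maps a coarse cloud-coverage map plus a clear-sky image to a cloudy HDR sky."""

            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),  # 1 mask + 3 clear-sky channels
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 3, 3, padding=1),
                )

            def forward(self, cloud_mask, clear_sky):
                x = torch.cat([cloud_mask, clear_sky], dim=1)
                # exp keeps the output strictly positive, as HDR radiance should be
                return torch.exp(self.net(x))

        # usage: radiance = CloudSkyGenerator()(torch.rand(1, 1, 64, 128), torch.rand(1, 3, 64, 128))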

    HDR image-based deep learning approach for automatic detection of split defects on sheet metal stamping parts

    Sheet metal stamping is widely used for high-volume production. Despite its wide adoption, it can lead to defects in the manufactured components, making their quality unacceptable. Because of the variety of defects that can occur on the final product, human inspectors are frequently employed to detect them. However, they can be unreliable and costly, particularly at speeds that match the stamping rate. In this paper, we propose an automatic inspection framework for the stamping process based on computer vision and deep learning techniques. Its low cost, remote sensing capability and simple implementation mean that it can be easily deployed in an industrial setting. A particular focus of this research is to account for the harsh lighting conditions and the highly reflective nature of products found in manufacturing environments, which affect optical sensing techniques by making it difficult to capture the details of a scene. High dynamic range images can capture details of an environment under harsh lighting conditions and, in the context of this work, can capture the highly reflective metals found in sheet metal stamping manufacturing. Building on this imaging technique, we propose a framework that includes a deep learning model to detect defects in sheet metal stamping parts. To test the framework, sheet metal ‘Nakajima’ samples were pressed with an industrial stamping press. Optimally exposed images, exposure sequences, tone-mapped images and high dynamic range images of the samples were then used to train convolutional neural network-based detectors. Analysis of the resulting models showed that the high dynamic range image-based models achieved substantially higher accuracy and minimal false-positive predictions.
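
    The HDR inputs in such a pipeline are typically assembled from a bracketed exposure sequence before being fed to a detector. The sketch below shows a minimal merge-and-tone-map step with OpenCV; the Debevec calibration and Reinhard tone mapper are illustrative choices, not necessarily the operators used in the paper.

        import cv2
        import numpy as np

        def merge_exposures(image_paths, exposure_times):
            """Merge a bracketed exposure sequence into an HDR radiance map."""
            images = [cv2.imread(p) for p in image_paths]
            times = np.asarray(exposure_times, dtype=np.float32)
            # Recover the camera response curve, then merge to linear radiance.
            response = cv2.createCalibrateDebevec().process(images, times)
            return cv2.createMergeDebevec().process(images, times, response)

        def tone_map(hdr, gamma=2.2):
            """Produce an 8-bit tone-mapped view of the HDR image for an LDR detector."""
            ldr = cv2.createTonemapReinhard(gamma=gamma).process(hdr)
            return np.clip(ldr * 255, 0, 255).astype(np.uint8)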

    Deep learning for high dynamic range imaging

    High Dynamic Range (HDR) imaging enables us to capture, manipulate and reproduce real-world lighting with high fidelity. HDR displays are becoming more common; however, most content, from over 100 years of media, is Low Dynamic Range (LDR). Dynamic range expansion methods generate HDR from LDR content, for display on HDR screens and various other applications, attempting to recover information from the original HDR signal that was lost due to saturation or quantisation. Multiple methods have been proposed, addressing different aspects of the problem; however, most are model-driven and require adjustment of parameters, which may also vary depending on the content. This thesis proposes using Convolutional Neural Networks (CNNs) for fully data-driven, end-to-end dynamic range expansion. A novel multibranch CNN, termed ExpandNet, is proposed that processes LDR inputs at multiple scales without using upsampling layers, which have been observed to cause artefacts when used in other traditional CNNs, in particular the UNet architecture. ExpandNet is evaluated against traditional methods and other CNNs and is found to outperform all of them on multiple metrics. An investigation of the effect of upsampling layers on the spectrum of the network outputs is then presented, characterising their impact in the Fourier domain and providing a way to assess the structural biases of CNNs. A new upsampling module is then proposed, based on the Guided Image Filter, that provides spectrally consistent outputs when used in a UNet architecture, forming the Guided UNet (GUNet). The GUNet architecture is evaluated similarly to ExpandNet and is found to perform well, while executing faster and consuming less memory than ExpandNet. Finally, multiple methods using Generative Adversarial Networks (GANs) are proposed and evaluated that hallucinate content in badly exposed areas of the input while simultaneously expanding the range of the well-exposed areas. A single network configuration based on UNet, termed L-GAN, produces better results qualitatively and also performs well quantitatively compared to state-of-the-art methods such as ExpandNet and GUNet.
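
    ExpandNet's key architectural point is the absence of upsampling layers: a local branch and a dilation branch keep full resolution throughout, while a global branch collapses the image to a single feature vector that is broadcast back spatially and fused. The heavily reduced PyTorch sketch below only illustrates that branch structure; the channel counts and depths are assumptions, not the published configuration.

        import torch
        import torch.nn as nn

        class MiniExpandNet(nn.Module):
            """Simplified three-branch LDR-to-HDR network with no upsampling layers."""

            def __init__(self):
                super().__init__()
                # Local branch: small receptive field, full resolution.
                self.local = nn.Sequential(
                    nn.Conv2d(3, 32, 3, padding=1), nn.SELU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.SELU(),
                )
                # Dilation branch: larger receptive field, still full resolution.
                self.dilated = nn.Sequential(
                    nn.Conv2d(3, 32, 3, padding=2, dilation=2), nn.SELU(),
                    nn.Conv2d(32, 32, 3, padding=4, dilation=4), nn.SELU(),
                )
                # Global branch: image-wide context collapsed to one vector per image.
                self.glob = nn.Sequential(
                    nn.AdaptiveAvgPool2d(16),
                    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SELU(),
                    nn.AdaptiveAvgPool2d(1),
                )
                self.fuse = nn.Sequential(
                    nn.Conv2d(32 + 32 + 64, 64, 1), nn.SELU(),
                    nn.Conv2d(64, 3, 1), nn.Sigmoid(),
                )

            def forward(self, ldr):
                h, w = ldr.shape[-2:]
                g = self.glob(ldr).expand(-1, -1, h, w)  # broadcast the global vector
                return self.fuse(torch.cat([self.local(ldr), self.dilated(ldr), g], dim=1))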

    Recognising place under distinct weather variability: a comparison between end-to-end and metric learning approaches

    Autonomous driving requires robust and accurate real-time localisation information for navigation and trajectory planning. Although Global Navigation Satellite Systems (GNSS) are most frequently used in this application, they are unreliable within urban environments because of multipath and non-line-of-sight errors. Alternative solutions exist that exploit rich visual content from images, which can be matched against a stored representation, such as a map, to determine the vehicle's location. However, one major cause of reduced location accuracy is variation in environmental conditions between the images captured and those stored in the representation. We tackle this issue directly by collecting a simulated and a real-world dataset captured over a single route under multiple environmental conditions. We demonstrate the effectiveness of an end-to-end approach in recognising place and, by extension, determining vehicle location.
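
    The metric-learning half of such a comparison is commonly implemented with a triplet objective: images of the same place under different weather should embed close together, while images of different places are pushed apart. The sketch below is a minimal version of that idea; the small convolutional embedder and the margin value are illustrative assumptions, not the paper's models.

        import torch
        import torch.nn as nn

        class PlaceEmbedder(nn.Module):
            """Maps an image to a unit-length descriptor used for place matching."""

            def __init__(self, dim=128):
                super().__init__()
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
                    nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(64, dim),
                )

            def forward(self, x):
                return nn.functional.normalize(self.backbone(x), dim=1)

        embedder = PlaceEmbedder()
        criterion = nn.TripletMarginLoss(margin=0.2)
        # anchor/positive: same place under different weather; negative: a different place
        anchor, positive, negative = (torch.rand(8, 3, 128, 128) for _ in range(3))
        loss = criterion(embedder(anchor), embedder(positive), embedder(negative))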