159 research outputs found

    Robust Reflection Removal with Flash-only Cues in the Wild

    We propose a simple yet effective reflection-free cue for robust reflection removal from a pair of flash and ambient (no-flash) images. The reflection-free cue exploits a flash-only image obtained by subtracting the ambient image from the corresponding flash image in raw data space. The flash-only image is equivalent to an image taken in a dark environment with only the flash on. Since this flash-only image is visually reflection-free, it provides robust cues for inferring the reflection in the ambient image. Because the flash-only image usually contains artifacts, we further propose a dedicated model that exploits the reflection-free cue while avoiding these artifacts, which helps to estimate reflection and transmission accurately. Our experiments on real-world images with various types of reflection demonstrate the effectiveness of our model with reflection-free flash-only cues: it outperforms state-of-the-art reflection removal approaches by more than 5.23 dB in PSNR. We extend our approach to handheld photography to address the misalignment between the flash and no-flash pair. With misaligned training data and an alignment module, our aligned model outperforms our previous version by more than 3.19 dB in PSNR on a misaligned dataset. We also study using linear RGB images as training data. Our source code and dataset are publicly available at https://github.com/ChenyangLEI/flash-reflection-removal. Comment: extension of the CVPR 2021 paper [arXiv:2103.04273], submitted to TPAMI.
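
The core cue above is a per-pixel subtraction in the linear raw domain. A minimal sketch of that step, assuming both raw images are already normalised to a common exposure (the function and variable names are illustrative, not from the authors' released code):

```python
import numpy as np

def flash_only_image(flash_raw: np.ndarray, ambient_raw: np.ndarray) -> np.ndarray:
    """Subtract the ambient (no-flash) raw image from the flash raw image.

    Because raw values are linear in scene radiance, the difference keeps
    only the light contributed by the flash, which carries no reflection of
    the ambient scene. Negative values (sensor noise) are clipped to zero.
    """
    diff = flash_raw.astype(np.float64) - ambient_raw.astype(np.float64)
    return np.clip(diff, 0.0, None)

# Toy example: a pixel where the flash adds 0.3 of linear intensity.
flash = np.array([[0.8]])
ambient = np.array([[0.5]])
print(flash_only_image(flash, ambient))  # [[0.3]]
```

The subtraction must happen before any tone mapping: in gamma-corrected sRGB space the flash and ambient contributions no longer add linearly, and the difference would not be reflection-free.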

    Reasoning about Scene and Image Structure for Computer Vision

    The wide availability of cheap consumer cameras has democratized photography for novices and experts alike, with more than a trillion photographs taken each year. While many of these cameras---especially those on mobile phones---have inexpensive optics and make imperfect measurements, modern computational techniques allow the recovery of high-quality photographs as well as of scene attributes. In this dissertation, we explore algorithms that infer a wide variety of physical and visual properties of the world, including color, geometry and reflectance, from images taken by casual photographers in unconstrained settings. We focus specifically on neural network-based methods while incorporating domain knowledge about scene structure and the physics of image formation. We describe novel techniques to produce high-quality images in poor lighting environments, to train scene map estimators in the absence of ground-truth data, and to output both an understanding of the scene and its uncertainty given observed images. The key to inferring scene properties from casual photography is to exploit the internal structure of natural scenes and the expressive capacity of neural networks. We demonstrate that neural networks can identify the internal structure of scene maps, and that our prior understanding of natural scenes can shape the design, training and output representation of neural networks.

    Spray Formation and Cavitation of Fuel Injectors with Various Metal and Optical Nozzles

    This thesis addresses the need for a fundamental understanding of the mechanisms of fuel spray formation and mixture preparation in direct injection spark ignition (DISI) engines. Fuel injection systems for DISI engines undergo rapid developments in their design and performance; therefore, their spray breakup mechanisms under the physical conditions encountered in DISI engines over a range of operating conditions and injection strategies require continuous attention. In this context, there are sparse data in the literature on differences in spray formation between injectors conventionally drilled by spark erosion and the latest laser-drilled injector nozzles. A comparison was first carried out between the holes of spark-eroded and laser-drilled injectors of the same nominal type by analysing their in-nozzle geometry and surface roughness under an electron microscope. Then the differences in their spray characteristics under quiescent conditions, as well as in a motored optical engine, are discussed on the basis of high-speed imaging experiments and image processing methods. Specifically, the spray development mechanism was quantified by spray tip penetration and cone angle data under representative low-load and high-load engine operating conditions (0.5 bar and 1.0 bar absolute, respectively), as well as at low and high injector body temperatures (20 °C and 90 °C) to represent cold and warm engine-head conditions. Droplet sizing was also performed with the spark-eroded and laser-drilled injectors using Phase Doppler Anemometry in a quiescent chamber, and the analysis was extended to include flash-boiling conditions (120 °C) and other hydrocarbons and alcohols: iso-octane, ethanol and butanol. This thesis also presents the design and development of a real-size quartz optical nozzle, 200 µm in diameter, suitable for high-temperature applications and compatible with new fuels such as alcohols.
Mass flow of typical real multi-hole injectors was measured, and relevant fluid-mechanics dimensionless parameters were derived. Laser and mechanical drilling of the quartz nozzle holes were compared to each other. Abrasive flow machining of the optical nozzles was also performed and analysed by microscopy in comparison to the real injector. Results with a high-speed camera showed successful imaging of microscopic in-nozzle flow and cavitation phenomena, coupled to downstream spray formation, under a variety of conditions including high fuel temperature flash-boiling effects, although undesirable needle movement was an issue and a limitation.
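
The dimensionless parameters typically derived for such nozzle flows are the Reynolds, Weber and cavitation numbers. The thesis does not quote its values here, so the sketch below uses approximate, illustrative fluid properties for iso-octane and a 200 µm nozzle; the cavitation number follows one common (Nurick-style) definition:

```python
def reynolds(rho, u, d, mu):
    """Reynolds number: ratio of inertial to viscous forces."""
    return rho * u * d / mu

def weber(rho, u, d, sigma):
    """Weber number: ratio of inertial to surface-tension forces."""
    return rho * u * u * d / sigma

def cavitation_number(p_inj, p_back, p_vap):
    """One common cavitation number: CN = (p_inj - p_back) / (p_back - p_vap)."""
    return (p_inj - p_back) / (p_back - p_vap)

# Illustrative (approximate) values for iso-octane through a 200 um nozzle.
rho, mu, sigma = 690.0, 4.7e-4, 0.018   # kg/m^3, Pa*s, N/m
u, d = 100.0, 200e-6                     # m/s, m
print(f"Re = {reynolds(rho, u, d, mu):.0f}")
print(f"We = {weber(rho, u, d, sigma):.0f}")
```

Large Re and We of this order indicate turbulent, atomising injection; a higher cavitation number indicates a stronger propensity for in-nozzle cavitation of the kind imaged through the quartz nozzle.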

    Delving Deep into Fine-Grained Sketch-Based Image Retrieval.

    PhD Thesis. To see is to sketch. Since prehistoric times, people have used sketch-like petroglyphs as an effective communicative tool, one predating the appearance of language by tens of thousands of years. This is even truer nowadays: with the ubiquitous proliferation of touchscreen devices, sketching is possibly the only rendering mechanism readily available for all to express visual intentions. The intriguing free-hand property of human sketches, however, becomes a major obstacle in practical applications: humans are not faithful artists, and the sketches they draw are iconic abstractions of mental images that can quickly fall off the visual manifold of natural objects. Matching sketches discriminatively against their corresponding photos is known as fine-grained sketch-based image retrieval (FG-SBIR) and has drawn increasing interest due to its potential commercial adoption. This thesis delves deep into FG-SBIR by analysing the intrinsic unique traits of human sketches and leveraging that understanding to strengthen their links to matching photos under deep learning. More specifically, this thesis investigates and develops four methods for FG-SBIR, as follows. Chapter 3 describes a discriminative-generative hybrid method to better bridge the domain gap between photo and sketch. Existing FG-SBIR models learn a deep joint embedding space with discriminative losses only, pulling matching pairs of photos and sketches close and pushing mismatched pairs away, thus aligning the two domains only indirectly. To this end, we introduce a generative task of cross-domain image synthesis. Concretely, when an input photo is embedded in the joint space, the embedding vector is used as input to a generative model to synthesise the corresponding sketch.
This task enforces the learned embedding space to preserve all the domain-invariant information useful for cross-domain reconstruction, thus explicitly reducing the domain gap, in contrast to existing models. The approach achieves the first near-human performance on the largest FG-SBIR dataset to date, Sketchy. Chapter 4 presents a new way of modelling human sketches and shows how such modelling can be integrated into the existing FG-SBIR paradigm with promising performance. Instead of modelling the forward sketching pass, we attempt to invert it. We model this inversion by translating iconic free-hand sketches to contours that resemble more geometrically realistic projections of object boundaries, and separately factorise out the salient added details. This factorised re-representation enables more effective sketch-photo matching. Specifically, we propose a novel unsupervised image style transfer model based on enforcing a cyclic embedding consistency constraint. A deep four-way Siamese model is then formulated to exploit the synthesised contours by extracting distinct complementary detail features for FG-SBIR. Chapter 5 extends the practical applicability of FG-SBIR beyond its training categories. Existing models, while successful, require instance-level pairing within each coarse-grained category as annotated training data, leaving their ability to deal with out-of-sample data unknown. We identify cross-category generalisation for FG-SBIR as a domain generalisation problem and propose the first solution. Our key contribution is a novel unsupervised learning approach to model a universal manifold of prototypical visual sketch traits. This manifold can then be used to parameterise the learning of a sketch/photo representation. Model adaptation to novel categories then becomes automatic, via embedding the novel sketch in the manifold and updating the representation and retrieval function accordingly.
Chapter 6 challenges the ImageNet pre-training that has long been considered crucial by the FG-SBIR community owing to the lack of large sketch-photo paired datasets for FG-SBIR training, and proposes a self-supervised alternative for representation pre-training. Specifically, we consider the jigsaw puzzle game of recomposing images from shuffled parts. We identify two key facets of jigsaw task design that are required for effective performance. The first is formulating the puzzle in a mixed-modality fashion. The second is framing the optimisation as permutation matrix inference via Sinkhorn iterations, which is more effective than existing classifier instantiations of the jigsaw idea. We show for the first time that ImageNet classification is unnecessary as a pre-training strategy for FG-SBIR and confirm the efficacy of our jigsaw approach.
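
Sinkhorn iterations turn an arbitrary score matrix into an approximately doubly-stochastic matrix, i.e. a soft permutation, by alternating row and column normalisation. A minimal sketch of the general technique (not the thesis's implementation; the temperature `tau` and the example scores are illustrative):

```python
import numpy as np

def sinkhorn(scores: np.ndarray, n_iters: int = 50, tau: float = 0.1) -> np.ndarray:
    """Project a score matrix onto (approximately) doubly-stochastic matrices.

    Exponentiating scores/tau and alternately normalising rows and columns
    converges to a soft permutation; a lower tau gives a harder assignment.
    """
    p = np.exp(scores / tau)
    for _ in range(n_iters):
        p /= p.sum(axis=1, keepdims=True)  # row normalisation
        p /= p.sum(axis=0, keepdims=True)  # column normalisation
    return p

# A 3x3 score matrix whose best assignment is the identity permutation.
scores = np.array([[2.0, 0.1, 0.0],
                   [0.0, 2.0, 0.1],
                   [0.1, 0.0, 2.0]])
soft_perm = sinkhorn(scores)
print(soft_perm.argmax(axis=1))  # [0 1 2]
```

Because every step is differentiable, the soft permutation can sit inside a network and be trained end-to-end, which is what makes it attractive over a fixed-vocabulary permutation classifier for the jigsaw task.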

    Physics-based vision meets deep learning

    Physics-based vision explores computer vision and graphics problems by applying methods based upon physical models. Deep learning, by contrast, is a learning-based technique in which a substantial number of observations are used to train an expressive yet hard-to-interpret neural network model. In this thesis, we propose the concept of a model-based decoder: a non-learnable, differentiable neural layer designed according to a physics-based model. Constructing neural networks with such model-based decoders affords the model strong learning capability as well as the potential to respect the underlying physics. We start by developing a toolbox of differentiable photometric layers ported from classical photometric techniques. This enables us to perform the image formation process given geometry, illumination and a reflectance function. Applying these differentiable photometric layers to the training of a bidirectional reflectance distribution function (BRDF) estimation network, we show that the network can be trained in a self-supervised manner without knowledge of the ground-truth BRDFs. Next, in a more general setting, we solve inverse rendering problems in a self-supervised fashion by making use of model-based decoders. Here, an inverse rendering network decomposes a single image into a normal map, a diffuse albedo map and illumination. To achieve self-supervised training, we draw inspiration from multi-view stereo (MVS) and employ a Lambertian model and a cross-projection MVS model to generate model-based supervisory signals. Finally, we seek hybrids of a neural decoder and a model-based decoder on a pair of practical problems: image relighting, and fine-scale depth prediction with novel view synthesis. In contrast to using model-based decoders only to supervise training, the model-based decoder in our hybrid model serves to disentangle an intricate problem into a set of physically connected, solvable ones.
In practice, we develop a hybrid model that can estimate a fine-scale depth map and synthesise novel views from a single image, using a physical subnet to combine results from an inverse rendering network with a monocular depth prediction network. For neural image relighting, we propose another hybrid model that uses a Lambertian renderer to generate initial estimates of the relit image, followed by a neural renderer that corrects deficits in these initial renderings. We demonstrate that the model-based decoder can significantly improve the quality of results and relax the demands for labelled data.
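
A Lambertian renderer is the simplest example of such a model-based decoder: it is differentiable everywhere (except at the shading clamp) and has no learnable parameters. A minimal NumPy sketch of the underlying model, I = albedo * max(0, n . l), with illustrative names (the thesis implements this as a neural layer for a directional-light setting; real pipelines add more general illumination models):

```python
import numpy as np

def lambertian_render(albedo: np.ndarray, normals: np.ndarray,
                      light_dir: np.ndarray) -> np.ndarray:
    """Render shading under the Lambertian model: I = albedo * max(0, n . l).

    albedo:    (H, W, 3) diffuse albedo map
    normals:   (H, W, 3) unit surface normals
    light_dir: (3,) unit vector pointing towards the light
    """
    shading = np.clip(normals @ light_dir, 0.0, None)  # (H, W) cosine term
    return albedo * shading[..., None]

# One pixel facing the light directly: the rendered colour equals the albedo.
albedo = np.ones((1, 1, 3)) * np.array([0.5, 0.4, 0.3])
normals = np.zeros((1, 1, 3)); normals[..., 2] = 1.0
light = np.array([0.0, 0.0, 1.0])
print(lambertian_render(albedo, normals, light))
```

Because every operation is a matrix product, clamp or elementwise multiply, gradients flow from a reconstruction loss on the rendered image back into the network that predicted the normals and albedo, which is what enables the self-supervised training described above.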

    Clearing the Clouds: Extracting 3D information from amongst the noise

    Advancements permitting the rapid extraction of 3D point clouds from a variety of imaging modalities across the global landscape have provided a vast collection of high-fidelity digital surface models. This has created a situation with an unprecedented overabundance of 3D observations, which greatly outstrips our current capacity to manage and infer actionable information. While years of research have removed some of the manual analysis burden for many tasks, human analysis is still a cornerstone of 3D scene exploitation. This is especially true for complex tasks which necessitate comprehension of scale, texture and contextual learning. To ameliorate the interpretation burden and enable scientific discovery from this volume of data, new processing paradigms are necessary to keep pace. In this context, this dissertation advances fundamental and applied research in 3D point cloud pre-processing and deep learning from a variety of platforms. We show that common representations of 3D point data are often not ideal and sacrifice fidelity, context or scalability. First, ground-scanning terrestrial Light Detection And Ranging (LiDAR) models are shown to have an inherent statistical bias, and a state-of-the-art method is presented for correcting it while preserving data fidelity and maintaining semantic structure. This technique is assessed in the dense canopy of Micronesia, where our technique is the best at retaining high levels of detail under extreme down-sampling (< 1%). Airborne systems are then explored, with a method presented to pre-process data so as to preserve global contrast and semantic content for deep learners. This approach is validated on a building footprint detection task using airborne imagery captured in eastern Tennessee from the 3D Elevation Program (3DEP); our approach achieves significant accuracy improvements over traditional techniques.
Finally, topography data spanning the globe is used to assess past and present global land-cover change. Utilizing Shuttle Radar Topography Mission (SRTM) and Moderate Resolution Imaging Spectroradiometer (MODIS) data, paired with the airborne pre-processing technique described previously, a model for predicting land-cover change from topography observations is described. The culmination of these efforts has the potential to enhance the capabilities of automated 3D geospatial processing and substantially lighten the burden on analysts, with implications for our responses to global security, disaster response, climate change, structural design and extraplanetary exploration.
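
For context on the down-sampling being improved: the standard baseline is a voxel grid that replaces all points in each occupied cell with their centroid. This is not the dissertation's bias-correcting method, only a minimal sketch of the conventional operation it compares against:

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Down-sample a point cloud by averaging the points within each voxel.

    points: (N, 3) array of XYZ coordinates.
    Returns one centroid per occupied voxel.
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    centroids = np.zeros((counts.size, 3))
    for dim in range(3):  # per-axis mean via weighted bin counts
        centroids[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return centroids

# Two clusters of points, ~1 m apart, reduced to two centroids at 0.5 m voxels.
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                [1.0, 0.0, 0.0], [1.1, 0.0, 0.0]])
print(voxel_downsample(pts, 0.5).shape)  # (2, 3)
```

Averaging within fixed cells is exactly where fidelity is lost under extreme reduction: fine structure smaller than the voxel collapses to a single point, which motivates density-aware alternatives like the one assessed in Micronesia.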

    Multimedia Forensics

    This book is open access. Media forensics has never been more relevant to societal life. Not only does media content represent an ever-increasing share of the data traveling on the net and the preferred communication means for most users, it has also become an integral part of the most innovative applications in the digital information ecosystem serving various sectors of society, from entertainment to journalism to politics. Undoubtedly, advances in deep learning and computational imaging have contributed significantly to this outcome. The underlying technologies that drive this trend, however, also pose a profound challenge to establishing trust in what we see, hear and read, and make media content the preferred target of malicious attacks. In this new threat landscape, powered by innovative imaging technologies and sophisticated tools based on autoencoders and generative adversarial networks, this book fills an important gap. It presents a comprehensive review of state-of-the-art forensic capabilities relating to media attribution, integrity and authenticity verification, and counter-forensics. Its content is developed to provide practitioners, researchers, photo and video enthusiasts, and students with a holistic view of the field.

    Multi-Object Tracking System based on LiDAR and RADAR for Intelligent Vehicles applications

    This Final Degree Project aims to develop a 3D Multi-Object Tracking and Detection System based on the sensor fusion of LiDAR and RADAR for autonomous driving applications, built on traditional Machine Learning algorithms. The implementation is based on Python and ROS and complies with real-time requirements. In the object detection stage, the RANSAC plane segmentation algorithm is used, followed by the extraction of bounding boxes using DBSCAN. A late sensor fusion using 3D Intersection over Union and a BEV-SORT tracking system complete the proposed architecture. Bachelor's Degree in Industrial Electronics and Automation Engineering.
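
The late-fusion step associates LiDAR and RADAR detections by 3D Intersection over Union. For axis-aligned boxes (a simplification of the general oriented case used here only to illustrate the metric), IoU reduces to a product of per-axis overlaps:

```python
def iou_3d(box_a, box_b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    # Overlap length along each axis, clamped at zero when the boxes are disjoint.
    overlaps = [
        max(0.0, min(box_a[i + 3], box_b[i + 3]) - max(box_a[i], box_b[i]))
        for i in range(3)
    ]
    inter = overlaps[0] * overlaps[1] * overlaps[2]
    vol = lambda b: (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol(box_a) + vol(box_b) - inter
    return inter / union if union > 0 else 0.0

# Two unit cubes sharing half their volume along x: intersection 0.5, union 1.5.
a = (0, 0, 0, 1, 1, 1)
b = (0.5, 0, 0, 1.5, 1, 1)
print(iou_3d(a, b))  # 0.333...
```

In a fusion pipeline, pairs whose IoU exceeds a threshold are treated as the same object and merged before being handed to the tracker.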

    The multifocal visual evoked cortical potential in visual field mapping: a methodological study.

    The application of multifocal techniques to the visual evoked cortical potential permits objective electrophysiological mapping of the visual field. The multifocal visual evoked cortical potential (mfVECP) presents several technical challenges. Signals are small, are influenced by a number of sources of noise, and their waveforms vary both across the visual field and between subjects owing to the complex geometry of the visual cortex. Together these factors hamper the ability to distinguish between an mfVECP response from the healthy visual pathway and a response that is reduced or absent and therefore representative of pathology. This thesis presents a series of methodological investigations aimed at maximising the information available in the recorded electrophysiological response, thereby improving the performance of the mfVECP. A novel method of calculating the signal-to-noise ratio (SNR) of mfVECP waveform responses is introduced. A noise estimate unrelated to the response of the visual cortex to the visual stimulus is created: the physiological record is cross-correlated with m-sequences that are generated as part of the orthogonal set but are not used to control any stimulus region. This metric is compared to the approach of defining noise within a delayed time window and shows good correlation. ROC analysis indicates a small improvement in the ability to distinguish between physiological waveform responses and noise. Defining the signal window as 45-250 ms is recommended. Signal quality is improved by post-acquisition bandwidth filtering. A wide range of bandwidths is compared, and the greatest gains are seen with a 3-20 Hz bandpass applied after cross-correlation. Responses evoked when stimulation is delivered using a cathode ray tube (CRT) and a liquid crystal display (LCD) projector system are compared. The mode of stimulus delivery affects the waveshape of responses.
A significantly higher SNR is seen in waveforms evoked by an m = 16 bit m-sequence delivered by a CRT monitor. Differences for shorter m-sequences were not statistically significant. The area of the visual field which can usefully be tested is investigated by increasing the field of view of stimulation from 20° to 40° of radius in 10° increments. A field of view of 30° of radius is shown to provide stimulation of as much of the visual field as possible without losing signal quality. Stimulation rates of 12.5 to 75 Hz are compared. Slowing the stimulation rate produced increases in waveform amplitudes, latencies and SNR values. The best performance was achieved with 25 Hz stimulation, and it is shown that a six-minute recording stimulated at 25 Hz is superior to an eight-minute 75 Hz acquisition. An electrophysiology system has been created that is capable of providing multifocal stimulation, synchronising with the acquisition of data from a large number of electrodes, and performing cross-correlation. This is a powerful system which permits the interrogation of the dipoles evoked within the complex geometry of the visual cortex from a very large number of orientations, improving detection ability. The system has been used to compare the performance of 16 monopolar recording channels in detecting responses to stimulation throughout the visual field, and a selection of four electrodes which maximises the available information throughout the visual field has been made. It is shown that several combinations of four electrodes provide good responses throughout the visual field, but that it is important to have them distributed across both hemispheres and above and below Oz. This series of investigations has indicated methods of maximising the available information in mfVECP recordings and progresses the technique towards becoming a robust clinical tool.
A powerful multichannel multifocal electrophysiology system has been created, with the ability to simultaneously acquire data from a very large number of bipolar recording channels and thereby detect many small dipole responses to stimulation of many small areas of the visual field. This will be an invaluable tool in future investigations. Performance has been shown to improve when the presence or absence of a waveform is determined by the novel SNR metric, when data are filtered post-acquisition through a 3-20 Hz bandpass after cross-correlation, and when a CRT is used to deliver the stimulus. The field of view of stimulation can usefully be extended to a radius of 30° when a 60-region dartboard pattern is employed. Performance can be enhanced, while acquisition time is reduced by 25%, by using a 25 Hz stimulation rate instead of the frequently employed 75 Hz.
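
The noise metric above relies on spare members of the orthogonal m-sequence set. An m-sequence is conventionally generated with a linear feedback shift register; below is a minimal sketch of that standard construction (illustrative only, not the thesis's stimulation software), mapping bits to ±1 as used for cross-correlation:

```python
def m_sequence(taps, n_bits, seed=1):
    """Generate a maximal-length (m-)sequence with a Fibonacci LFSR.

    taps: 1-indexed feedback tap positions, e.g. (4, 3) for x^4 + x^3 + 1.
    Returns 2**n_bits - 1 values mapped to +1/-1 for cross-correlation.
    """
    state = seed
    seq = []
    for _ in range(2 ** n_bits - 1):
        out = (state >> (n_bits - 1)) & 1              # output the MSB
        seq.append(1 if out else -1)
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1         # XOR of the tap bits
        state = ((state << 1) | feedback) & ((1 << n_bits) - 1)
    return seq

seq = m_sequence((4, 3), 4)
print(len(seq), sum(seq))  # 15 1
```

The defining property exploited by multifocal analysis is the two-level autocorrelation of such a sequence: it correlates perfectly with itself at zero lag and almost not at all (-1 over the whole period) at any other lag, so each stimulus region's response can be recovered by cross-correlation with its own sequence.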