75 research outputs found

    An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement

    Video enhancement is a more challenging problem than still-image enhancement, mainly due to high computational cost, larger data volumes, and the difficulty of achieving consistency in the spatio-temporal domain. In practice, these challenges are often coupled with a lack of example pairs, which inhibits the application of supervised learning strategies. To address these challenges, we propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples. In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information. The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high-complexity networks. Our setting enables learning from unpaired videos in a cyclic adversarial manner, where the proposed recurrent units are employed in all architectures. Efficient training is accomplished by introducing a single discriminator that learns the joint distribution of the source and target domains simultaneously. The enhancement results demonstrate clear superiority of the proposed video enhancer over state-of-the-art methods in terms of visual quality, quantitative metrics, and inference speed. Notably, our video enhancer is capable of enhancing over 35 frames per second of FullHD video (1080x1920).

    Recent Advances in Image Restoration with Applications to Real World Problems

    In the past few decades, imaging hardware has improved tremendously in terms of resolution, enabling widespread use of images in diverse applications on Earth and in planetary missions. However, practical issues associated with image acquisition still affect image quality. Some of these issues, such as blurring, measurement noise, mosaicing artifacts, and low spatial or spectral resolution, can seriously affect the accuracy of these applications. This book intends to provide the reader with a glimpse of the latest developments and recent advances in image restoration, which include image super-resolution, image fusion to enhance spatial, spectral, and temporal resolution, and the generation of synthetic images using deep learning techniques. Some practical applications are also included.

    AI for time-resolved imaging: from fluorescence lifetime to single-pixel time of flight

    Time-resolved imaging is a field of optics which measures the arrival time of light on the camera. This thesis looks at two time-resolved imaging modalities: fluorescence lifetime imaging and time-of-flight measurement for depth imaging and ranging. Both of these applications require temporal accuracy on the order of pico- or nanosecond (10^-12 to 10^-9 s) scales. This demands special camera technology and optics that can sample light intensity extremely quickly, much faster than an ordinary video camera. However, such detectors can be very expensive compared to regular cameras while offering lower image quality. Further, information of interest is often hidden (encoded) in the raw temporal data. Therefore, computational imaging algorithms are used to enhance, analyse and extract information from time-resolved images. "A picture is worth a thousand words". This describes a fundamental blessing and curse of image analysis: images contain extreme amounts of data. Consequently, it is very difficult to design algorithms that encompass all the possible pixel permutations and combinations that can encode this information. Fortunately, the rise of AI and machine learning (ML) allows us to instead create algorithms in a data-driven way. This thesis demonstrates the application of ML to time-resolved imaging tasks, ranging from parameter estimation in noisy data and decoding of overlapping information, through super-resolution, to inferring 3D information from 1D (temporal) data.
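
To put the quoted time scales in perspective, the following rough sketch applies the standard round-trip time-of-flight relation d = c * t / 2 to show what a picosecond or nanosecond timing error means in depth. The function name and the sample values are illustrative assumptions and are not taken from the thesis.

```python
# Illustrative only: standard round-trip time-of-flight relation d = c * t / 2.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre


def depth_from_round_trip(t_seconds: float) -> float:
    """Distance to a target given the round-trip travel time of a light pulse."""
    return SPEED_OF_LIGHT_M_PER_S * t_seconds / 2.0


if __name__ == "__main__":
    # Hypothetical timing errors at the two ends of the quoted range.
    for label, dt in [("1 ps", 1e-12), ("1 ns", 1e-9)]:
        depth_mm = depth_from_round_trip(dt) * 1000.0
        print(f"timing error {label} -> depth error {depth_mm:.3f} mm")
```

A nanosecond of timing error corresponds to roughly 15 cm of depth, while a picosecond corresponds to well under a millimetre, which is why ranging at fine scales needs much faster sampling than an ordinary video camera provides.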

    Scalable Methodologies and Analyses for Modality Bias and Feature Exploitation in Language-Vision Multimodal Deep Learning

    Multimodal machine learning benchmarks have grown exponentially in both capability and popularity over the last decade. Language-vision question-answering tasks such as Visual Question Answering (VQA) and Video Question Answering (video-QA) have, thanks to their high difficulty, become a particularly popular means through which to develop and test new modelling designs and methodology for multimodal deep learning. The challenging nature of VQA and video-QA tasks leaves plenty of room for innovation at every component of the deep learning pipeline, from dataset to modelling methodology. Such circumstances are ideal for innovating in the space of language-vision multimodality. Furthermore, the wider field is currently undergoing an incredible period of growth and increasing interest. I therefore aim to contribute to multiple key components of the VQA and video-QA pipeline, but specifically in a manner such that my contributions remain relevant, 'scaling' with the revolutionary new benchmark models and datasets of the near future instead of being rendered obsolete by them. The work in this thesis: highlights and explores the disruptive and problematic presence of language bias in the popular TVQA video-QA dataset, and proposes a dataset-invariant method to identify subsets that respond to different modalities; thoroughly explores the suitability of bilinear pooling as a language-vision fusion technique in video-QA, offering experimental and theoretical insight, and highlighting the parallels in multimodal processing with neurological theories; explores the nascent visual equivalent of language modelling ('visual modelling') in order to boost the power of visual features; and proposes a dataset-invariant neurolinguistically-inspired labelling scheme for use in multimodal question-answering. I explore the positive and negative results that my experiments across this thesis yield.
I conclude by discussing the limitations of my contributions and proposing future directions of study in the areas I contribute to.

    Face Image and Video Analysis in Biometrics and Health Applications

    Computer Vision (CV) enables computers and systems to derive meaningful information from acquired visual inputs, such as images and videos, and make decisions based on the extracted information. Its goal is to acquire, process, analyze, and understand the information by developing a theoretical and algorithmic model. Biometrics are distinctive and measurable human characteristics used to label or describe individuals by combining computer vision with knowledge of human physiology (e.g., face, iris, fingerprint) and behavior (e.g., gait, gaze, voice). The face is one of the most informative biometric traits. Many studies have investigated the human face from the perspectives of various disciplines, ranging from computer vision and deep learning to neuroscience and biometrics. In this work, we analyze face characteristics from digital images and videos in the areas of morphing attack and defense, and autism diagnosis. For face morphing attack generation, we propose a transformer-based generative adversarial network to generate more visually realistic morphing attacks by combining different losses, such as face matching distance, facial landmark based loss, perceptual loss and pixel-wise mean square error. In the face morphing attack detection study, we design a fusion-based few-shot learning (FSL) method to learn discriminative features from face images for few-shot morphing attack detection (FS-MAD), and extend the current binary detection into multiclass classification, namely, few-shot morphing attack fingerprinting (FS-MAF). In the autism diagnosis study, we develop a discriminative few-shot learning method to analyze hour-long video data and explore the fusion of facial dynamics for facial trait classification of autism spectrum disorder (ASD) in three severity levels. The results show outstanding performance of the proposed fusion-based few-shot framework on the dataset.
In addition, we further explore the possibility of performing face micro-expression spotting and feature analysis on autism video data to classify ASD and control groups. The results indicate the effectiveness of subtle facial expression changes for autism diagnosis.

    Learning from limited labeled data - Zero-Shot and Few-Shot Learning

    Human beings have the remarkable ability to recognize novel visual concepts after observing only a few or zero examples of them. Deep learning, however, often requires a large amount of labeled data to achieve good performance. Labeled instances are expensive, difficult and even infeasible to obtain because the distribution of training instances among labels naturally exhibits a long tail. Therefore, it is of great interest to investigate how to learn efficiently from limited labeled data. This thesis concerns an important subfield of learning from limited labeled data, namely, low-shot learning. The setting assumes the availability of many labeled examples from known classes, and the goal is to learn novel classes from only a few (few-shot learning) or zero (zero-shot learning) training examples of them. To this end, we have developed a series of multi-modal learning approaches to facilitate the knowledge transfer from known classes to novel classes for a wide range of visual recognition tasks including image classification, semantic image segmentation and video action recognition. More specifically, this thesis makes the following main contributions. First, as there is no agreed-upon zero-shot image classification benchmark, we define a new benchmark by unifying both the evaluation protocols and data splits of publicly available datasets. Second, in order to tackle the labeled data scarcity, we propose feature generation frameworks that synthesize data in the visual feature space for novel classes. Third, we extend zero-shot learning and few-shot learning to the semantic segmentation task and propose a challenging benchmark for it. We show that incorporating semantic information into a semantic segmentation network is effective in segmenting novel classes.
Finally, we develop a better video representation for the few-shot video classification task and leverage weakly-labeled videos by an efficient retrieval method. Max Planck Institute Informatic

    Multi-frame reconstruction using super-resolution, inpainting, segmentation and codecs

    In this thesis, different aspects of video and light field reconstruction are considered, such as super-resolution, inpainting, segmentation and codecs. For this purpose, each of these strategies is analyzed based on a specific goal and a specific database. Accordingly, databases relevant to the film industry, sport videos, light fields and hyperspectral videos are used. This thesis is constructed around six related manuscripts, in which several approaches are proposed for multi-frame reconstruction. Initially, a novel multi-frame reconstruction strategy is proposed for light field super-resolution, in which graph-based regularization is applied along with edge-preserving filtering to improve the spatio-angular quality of the light field. Second, a novel video reconstruction method is proposed, built on compressive sensing (CS), Gaussian mixture models (GMM) and sparse 3D transform-domain block matching. The motivation of the proposed technique is to improve the visual quality of the video frames and decrease the reconstruction error in comparison with former video reconstruction methods. In the next approach, Student-t mixture models and edge-preserving filtering are applied for the purpose of video super-resolution. The Student-t mixture model has a heavy tail, which makes it robust and suitable as a video frame patch prior and rich in terms of log likelihood for information retrieval. In another approach, a hyperspectral video database is considered, and a Bayesian dictionary learning process is used for hyperspectral video super-resolution. To that end, a Beta process is used in Bayesian dictionary learning and a sparse coding is generated for hyperspectral video super-resolution.
The spatial super-resolution is followed by a spectral video restoration strategy, and the whole process leverages two learned dictionaries, the first trained for spatial super-resolution and the second for spectral restoration. Furthermore, in another approach, a novel framework is proposed for automatically replacing advertisement content in soccer videos using deep learning strategies. For this purpose, a UNET architecture (an image segmentation convolutional neural network technique) is applied for content segmentation and detection. Subsequently, after reconstructing the segmented content in the video frames (considering the apparent loss in detection), the unwanted content is replaced with new content using a homography mapping procedure. In addition, in another research work, a novel video compression framework is presented using autoencoder networks that encode and decode videos using less chroma information than luma information. For this purpose, instead of converting Y'CbCr 4:2:2/4:2:0 videos to and from RGB 4:4:4, the video is kept in Y'CbCr 4:2:2/4:2:0 and the luma and chroma channels are merged after the luma is downsampled to match the chroma size. The decoder performs the inverse operation. The performance of these models is evaluated using CPSNR, MS-SSIM, and VMAF metrics. The experiments reveal that, compared to video compression involving conversion to and from RGB 4:4:4, the proposed method increases the video quality by about 5.5% for Y'CbCr 4:2:2 and 8.3% for Y'CbCr 4:2:0 while reducing the amount of computation by nearly 37% for Y'CbCr 4:2:2 and 40% for Y'CbCr 4:2:0.
The thread that ties these approaches together is the reconstruction of video and light field frames under different problem conditions, such as loss of information, blur in the frames, residual noise after reconstruction, unwanted content, excessive data size and high computational overhead. In three of the proposed approaches, we use the Plug-and-Play ADMM model, for the first time in the reconstruction of videos and light fields, to address both information retrieval in the frames and noise/blur removal at the same time. In two of the proposed models, we apply sparse dictionary learning to reduce the data dimension and represent the data as an efficient linear combination of basis frame patches. Two of the proposed approaches were developed in collaboration with industry; in these, deep learning frameworks are used to handle large sets of features and to learn high-level features from the data.
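
The channel-merging step described for the autoencoder codec can be sketched as follows for the 4:2:0 case. This is a minimal illustration under stated assumptions: the abstract does not say how the luma is downsampled, so 2x2 block averaging is used here as a stand-in, and the function names and the list-of-lists plane representation are hypothetical.

```python
from typing import List

Plane = List[List[float]]  # one image plane as rows of sample values


def downsample_2x2(luma: Plane) -> Plane:
    """Halve luma in both directions by averaging each 2x2 block.

    Assumed method: the abstract does not specify the downsampling filter.
    """
    height, width = len(luma), len(luma[0])
    if height % 2 or width % 2:
        raise ValueError("luma dimensions must be even for 4:2:0")
    return [
        [
            (luma[y][x] + luma[y][x + 1] + luma[y + 1][x] + luma[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def merge_420(luma: Plane, cb: Plane, cr: Plane) -> List[Plane]:
    """Stack downsampled luma with the chroma planes as a 3-channel input."""
    small_luma = downsample_2x2(luma)
    same_shape = (
        len(small_luma) == len(cb) == len(cr)
        and len(small_luma[0]) == len(cb[0]) == len(cr[0])
    )
    if not same_shape:
        raise ValueError("chroma planes must be half the luma size in each dimension")
    return [small_luma, cb, cr]


if __name__ == "__main__":
    # Hypothetical 4x4 luma plane with 2x2 chroma planes (4:2:0 layout).
    y_plane = [[1, 3, 5, 7], [1, 3, 5, 7], [2, 2, 4, 4], [2, 2, 4, 4]]
    cb_plane = [[0.1, 0.2], [0.3, 0.4]]
    cr_plane = [[0.5, 0.6], [0.7, 0.8]]
    print(merge_420(y_plane, cb_plane, cr_plane)[0])
```

Keeping all three planes at chroma resolution means the network input has a quarter of the samples per channel of an RGB 4:4:4 frame, which is consistent with the reduced computation the abstract reports, although the actual network design is not described there.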

    Compressed-domain transcoding of H.264/AVC and SVC video streams

    Generic Object Detection and Segmentation for Real-World Environments
