291 research outputs found

    UPOCR: Towards Unified Pixel-Level OCR Interface

    Full text link
    In recent years, the optical character recognition (OCR) field has been proliferating with plentiful cutting-edge approaches for a wide spectrum of tasks. However, these approaches are task-specifically designed with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder. Learnable task prompts are introduced to push the general feature representations extracted by the encoder toward task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the generated and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code will be publicly available

    Image Evolution Analysis Through Forensic Techniques

    Get PDF

    Beyond the pixels: learning and utilising video compression features for localisation of digital tampering.

    Get PDF
    Video compression is pervasive in digital society. With rising usage of deep convolutional neural networks (CNNs) in the fields of computer vision, video analysis and video tampering detection, it is important to investigate how patterns invisible to human eyes may be influencing modern computer vision techniques and how they can be used advantageously. This work thoroughly explores how video compression influences accuracy of CNNs and shows how optimal performance is achieved when compression levels in the training set closely match those of the test set. A novel method is then developed, using CNNs, to derive compression features directly from the pixels of video frames. It is then shown that these features can be readily used to detect inauthentic video content with good accuracy across multiple different video tampering techniques. Moreover, the ability to explain these features allows predictions to be made about their effectiveness against future tampering methods. The problem is motivated with a novel investigation into recent video manipulation methods, which shows that there is a consistent drive to produce convincing, photorealistic, manipulated or synthetic video. Humans, blind to the presence of video tampering, are also blind to the type of tampering. New detection techniques are required and, in order to compensate for human limitations, they should be broadly applicable to multiple tampering types. This thesis details the steps necessary to develop and evaluate such techniques

    Active and passive approaches for image authentication

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    DEEP LEARNING FOR FORENSICS

    Get PDF
    The advent of media sharing platforms and the easy availability of advanced photo or video editing software have resulted in a large quantity of manipulated images and videos being shared on the internet. While the intent behind such manipulations varies widely, concerns on the spread of fake news and misinformation is growing. Therefore, detecting manipulation has become an emerging necessity. Different from traditional classification, semantic object detection or segmentation, manipulation detection/classification pays more attention to low-level tampering artifacts than to semantic content. The main challenges in this problem include (a) investigating features to reveal tampering artifacts, (b) developing generic models which are robust to a large scale of post-processing methods, (c) applying algorithms to higher resolution in real scenarios and (d) handling the new emerging manipulation techniques. In this dissertation, we propose approaches to tackling these challenges. Manipulation detection utilizes both low-level tamper artifacts and semantic contents, suggesting that richer features needed to be harnessed to reveal more evidence. To learn rich features, we propose a two-stream Faster R-CNN network and train it end-to-end to detect the tampered regions given a manipulated image. Experiments on four standard image manipulation datasets demonstrate that our two-stream framework outperforms each individual stream, and also achieves state-of-the-art performance compared to alternative methods with robustness to resizing and compression. Additionally, to extend manipulation detection from image to video, we introduce VIDNet, Video Inpainting Detection Network, which contains an encoder-decoder architecture with a quad-directional local attention module. To reveal artifacts encoded in compression, VIDNet additionally takes in Error Level Analysis (ELA) frames to augment RGB frames, producing multimodal features at different levels with an encoder. Besides, to improve the generalization of manipulation detection model, we introduce a manipulated image generation process that creates true positives using currently available datasets. Drawing from traditional work on image blending, we propose a novel generator for creating such examples. In addition, we also propose to further create examples that force the algorithm to focus on boundary artifacts during training. Extensive experimental results validate our proposal. Furthermore, to apply deep learning models to high resolution scenarios efficiently, we treat the problem as a mask refinement given a coarse low resolution prediction. We propose to convert the regions of interest into strip images and compute a boundary prediction in the strip domain. Extensive experiments on both the public and a newly created high resolution dataset strongly validate our approach. Finally, to handle new emerging manipulation techniques while preserving performance on learned manipulation, we investigate incremental learning. We propose a multi-model and multi-level knowledge distillation strategy to preserve performance on old categories while training on new categories. Experiments on standard incremental learning benchmarks show that our method improves the overall performance over standard distillation techniques

    Image Forensics in the Wild

    Get PDF

    An ensemble architecture for forgery detection and localization in digital images

    Get PDF
    Questa tesi presenta un approccio d'insieme unificato - "ensemble" - per il rilevamento e la localizzazione di contraffazioni in immagini digitali. Il focus della ricerca è su due delle più comuni ma efficaci tecniche di contraffazione: "copy-move" e "splicing". L'architettura proposta combina una serie di metodi di rilevamento e localizzazione di manipolazioni per ottenere prestazioni migliori rispetto a metodi utilizzati in modalità "standalone". I principali contributi di questo lavoro sono elencati di seguito. In primo luogo, nel Capitolo 1 e 2 viene presentata un'ampia rassegna dell'attuale stato dell'arte nel rilevamento di manipolazioni ("forgery"), con particolare attenzione agli approcci basati sul deep learning. Un'importante intuizione che ne deriva è la seguente: questi approcci, sebbene promettenti, non possono essere facilmente confrontati in termini di performance perché tipicamente vengono valutati su dataset personalizzati a causa della mancanza di dati annotati con precisione. Inoltre, spesso questi dati non sono resi disponibili pubblicamente. Abbiamo poi progettato un algoritmo di rilevamento di manipolazioni copy-move basato su "keypoint", descritto nel capitolo 3. Rispetto a esistenti approcci simili, abbiamo aggiunto una fase di clustering basato su densità spaziale per filtrare le corrispondenze rumorose dei keypoint. I risultati hanno dimostrato che questo metodo funziona bene su due dataset di riferimento e supera uno dei metodi più citati in letteratura. Nel Capitolo 4 viene proposta una nuova architettura per predire la direzione della luce 3D in una data immagine. Questo approccio sfrutta l'idea di combinare un metodo "data-driven" con un modello di illuminazione fisica, consentendo così di ottenere prestazioni migliori. Al fine di sopperire al problema della scarsità di dati per l'addestramento di architetture di deep learning altamente parametrizzate, in particolare per il compito di scomposizione intrinseca delle immagini, abbiamo sviluppato due algoritmi di generazione dei dati. Questi sono stati utilizzati per produrre due dataset - uno sintetico e uno di immagini reali - con lo scopo di addestrare e valutare il nostro approccio. Il modello di stima della direzione della luce proposto è stato sfruttato in un nuovo approccio di rilevamento di manipolazioni di tipo splicing, discusso nel Capitolo 5, in cui le incoerenze nella direzione della luce tra le diverse regioni dell'immagine vengono utilizzate per evidenziare potenziali attacchi splicing. L'approccio ensemble proposto è descritto nell'ultimo capitolo. Questo include un modulo "FusionForgery" che combina gli output dei metodi "base" proposti in precedenza e assegna un'etichetta binaria (forged vs. original). Nel caso l'immagine sia identificata come contraffatta, il nostro metodo cerca anche di specializzare ulteriormente la decisione tra attacchi splicing o copy-move. In questo secondo caso, viene eseguito anche un tentativo di ricostruire le regioni "sorgente" utilizzate nell'attacco copy-move. Le prestazioni dell'approccio proposto sono state valutate addestrandolo e testandolo su un dataset sintetico, generato da noi, comprendente sia attacchi copy-move che di tipo splicing. L'approccio ensemble supera tutti i singoli metodi "base" in termini di prestazioni, dimostrando la validità della strategia proposta.This thesis presents a unified ensemble approach for forgery detection and localization in digital images. The focus of the research is on two of the most common but effective forgery techniques: copy-move and splicing. The ensemble architecture combines a set of forgery detection and localization methods in order to achieve improved performance with respect to standalone approaches. The main contributions of this work are listed in the following. First, an extensive review of the current state of the art in forgery detection, with a focus on deep learning-based approaches is presented in Chapter 1 and 2. An important insight that is derived is the following: these approaches, although promising, cannot be easily compared in terms of performance because they are typically evaluated on custom datasets due to the lack of precisely annotated data. Also, they are often not publicly available. We then designed a keypoint-based copy-move detection algorithm, which is described in Chapter 3. Compared to previous existing keypoints-based approaches, we added a density-based clustering step to filter out noisy keypoints matches. This method has been demonstrated to perform well on two benchmark datasets and outperforms one of the most cited state-of-the-art methods. In Chapter 4 a novel architecture is proposed to predict the 3D light direction of the light in a given image. This approach leverages the idea of combining, in a data-driven method, a physical illumination model that allows for improved regression performance. In order to fill in the gap of data scarcity for training highly-parameterized deep learning architectures, especially for the task of intrinsic image decomposition, we developed two data generation algorithms that were used to produce two datasets - one synthetic and one of real images - to train and evaluate our approach. The proposed light direction estimation model has then been employed to design a novel splicing detection approach, discussed in Chapter 5, in which light direction inconsistencies between different regions in the image are used to highlight potential splicing attacks. The proposed ensemble scheme for forgery detection is described in the last chapter. It includes a "FusionForgery" module that combines the outputs of the different previously proposed "base" methods and assigns a binary label (forged vs. pristine) to the input image. In the case of forgery prediction, our method also tries to further specialize the decision between splicing and copy-move attacks. If the image is predicted as copy-moved, an attempt to reconstruct the source regions used in the copy-move attack is also done. The performance of the proposed approach has been assessed by training and testing it on a synthetic dataset, generated by us, comprising both copy-move and splicing attacks. The ensemble approach outperforms all of the individual "base" methods, demonstrating the validity of the proposed strategy

    Image and Video Forensics

    Get PDF
    Nowadays, images and videos have become the main modalities of information being exchanged in everyday life, and their pervasiveness has led the image forensics community to question their reliability, integrity, confidentiality, and security. Multimedia contents are generated in many different ways through the use of consumer electronics and high-quality digital imaging devices, such as smartphones, digital cameras, tablets, and wearable and IoT devices. The ever-increasing convenience of image acquisition has facilitated instant distribution and sharing of digital images on digital social platforms, determining a great amount of exchange data. Moreover, the pervasiveness of powerful image editing tools has allowed the manipulation of digital images for malicious or criminal ends, up to the creation of synthesized images and videos with the use of deep learning techniques. In response to these threats, the multimedia forensics community has produced major research efforts regarding the identification of the source and the detection of manipulation. In all cases (e.g., forensic investigations, fake news debunking, information warfare, and cyberattacks) where images and videos serve as critical evidence, forensic technologies that help to determine the origin, authenticity, and integrity of multimedia content can become essential tools. This book aims to collect a diverse and complementary set of articles that demonstrate new developments and applications in image and video forensics to tackle new and serious challenges to ensure media authenticity
    corecore