3 research outputs found

    Image and Video Analytics for Document Processing and Event Recognition

    The proliferation of handheld devices with cameras is among the many changes of the past several decades that have affected the document image analysis community, providing a far less constrained document imaging experience than traditional non-portable flatbed scanners. Although these devices offer more flexibility in capturing, users now face numerous environmental challenges, including 1) a limited field of view that keeps users from acquiring a high-quality image of a large source in a single frame, 2) light reflections on glossy surfaces that result in saturated regions, and 3) crumpled or non-planar documents that cannot be captured effectively from a single pose. Another change is the application of deep neural networks, such as deep convolutional neural networks (CNNs), to text analysis, which is showing unprecedented performance over classical approaches. Beginning with their success in character recognition, CNNs have shown their strength in many tasks in document analysis as well as computer vision. Researchers have explored the potential applicability of CNNs to tasks such as text detection and segmentation, and have been quite successful. These networks, trained to perform single tasks, have recently evolved to handle multiple tasks. This introduces several important challenges, including imposing multiple tasks on a single network architecture and integrating multiple architectures with different tasks. In this dissertation, we make contributions in both of these areas.

    First, we propose a novel Graphcut-based document image mosaicking method which seeks to overcome the known limitations of previous approaches. Our method does not require any prior knowledge of the content of the document images, making it more widely applicable and robust. Information about the geometrical disposition between the overlapping images is exploited to minimize the errors at the boundary regions. We incorporate a sharpness measure which induces cut generation in a way that results in the mosaic including the sharpest pixels. Our method is shown to outperform previous methods, both quantitatively and qualitatively.

    Second, we address the problem of removing highlight regions caused by light sources reflecting off glossy surfaces in indoor environments. We devise an efficient method to detect and remove the highlights from the target scene by jointly estimating separate homographies for the target scene and the highlights. Our method is based on the observation that, given two images captured from different viewpoints, the displacement of the target scene differs from that of the highlight regions. We show the effectiveness of our method in removing highlight reflections by comparing it with related state-of-the-art methods. Unlike previous methods, ours can handle saturated and relatively large highlights which completely obscure the content underneath.
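    The displacement-based intuition can be illustrated with a simplified single-homography baseline: register the second view onto the first and fill saturated highlight pixels from the warped view. This is only a minimal sketch of the underlying idea, not the dissertation's joint two-homography estimation; it assumes OpenCV, two overlapping color views of the same scene, and illustrative names and thresholds throughout.

```python
import cv2
import numpy as np

def fill_highlights(img1, img2, sat_thresh=240):
    """Replace saturated highlight pixels in img1 with content from img2.

    Simplified sketch: one scene homography is estimated from ORB matches;
    the actual method jointly estimates separate homographies for the
    scene and for the highlight layer.
    """
    g1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(g1, None)
    k2, d2 = orb.detectAndCompute(g2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Because highlights move differently between viewpoints, RANSAC tends
    # to treat them as outliers, so H follows the document surface.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    warped = cv2.warpPerspective(img2, H, (img1.shape[1], img1.shape[0]))
    # Saturated pixels in img1 that are clean in the warped view are
    # replaced by the warped content.
    mask = (img1.max(axis=2) >= sat_thresh) & (warped.max(axis=2) < sat_thresh)
    out = img1.copy()
    out[mask] = warped[mask]
    return out
```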
    Third, we address the problem of selecting instances of a planar object in a video or set of images based on an evaluation of their "frontalness". We introduce the idea of "evaluating the frontalness" by computing how closely the object's surface normal aligns with the optical axis of the camera. The unique and novel aspect of our method is that, unlike previous planar object pose estimation methods, it does not require a frontal reference image. The intuition is that a true frontal image can be used to reproduce other non-frontal images by perspective projection, while non-frontal images have only a limited ability to do so. We show that comparing 'frontal' and 'non-frontal' images can be extended to comparing 'more frontal' and 'less frontal' images. Based on this observation, our method estimates the relative frontalness of an image by exploiting the objective space error. We also propose the use of a K-invariant space to evaluate frontalness even when the camera intrinsic parameters are unknown (e.g., for images and videos from the web). Our method improves the accuracy over a baseline method.

    Lastly, we address the problem of integrating multiple deep neural networks (specifically CNNs) with different architectures and different tasks into a unified framework. To demonstrate the end-to-end integration of networks with different tasks and different architectures, we select event recognition and object detection. One of the novel aspects of our approach is that it is the first attempt to exploit the power of deep convolutional neural networks to directly integrate relevant object information into a unified network to improve event recognition performance. Our architecture allows the sharing of the convolutional layers and a fully connected layer, which effectively integrates event recognition with rigid and non-rigid object detection.
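    As a rough illustration of sharing layers across tasks, the sketch below builds a toy two-head network in PyTorch: one head classifies the event, the other produces per-image object scores. This is an assumed minimal layout, not the dissertation's actual architecture; layer sizes, head designs, and the loss weighting are placeholders.

```python
import torch
import torch.nn as nn

class SharedEventObjectNet(nn.Module):
    """Toy multi-task CNN: a shared convolutional trunk and a shared
    fully connected layer feed both an event-recognition head and a
    (greatly simplified) object-scoring head."""

    def __init__(self, n_events=10, n_objects=20):
        super().__init__()
        self.backbone = nn.Sequential(           # shared convolutional layers
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.shared_fc = nn.Linear(64, 128)      # shared fully connected layer
        self.event_head = nn.Linear(128, n_events)
        self.object_head = nn.Linear(128, n_objects)

    def forward(self, x):
        f = self.pool(self.backbone(x)).flatten(1)
        f = torch.relu(self.shared_fc(f))
        return self.event_head(f), self.object_head(f)

# Joint training sketch: one backbone, two task losses backpropagated together.
net = SharedEventObjectNet()
imgs = torch.randn(4, 3, 224, 224)
event_logits, object_logits = net(imgs)
loss = (nn.functional.cross_entropy(event_logits, torch.randint(0, 10, (4,)))
        + nn.functional.binary_cross_entropy_with_logits(
              object_logits, torch.randint(0, 2, (4, 20)).float()))
loss.backward()
```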

    Analysis of affine motion-compensated prediction and its application in aerial video coding

    Motion-compensated prediction is used in video coding standards like High Efficiency Video Coding (HEVC) as one key element of data compression. Commonly, a purely translational motion model is employed. In order to also cover non-translational motion types like rotation or scaling (zoom), as contained in aerial video sequences such as those captured from unmanned aerial vehicles, an affine motion model can be applied. In this work, a model for affine motion-compensated prediction in video coding is derived by extending a model of purely translational motion-compensated prediction. Using rate-distortion theory and the displacement estimation error caused by inaccurate affine motion parameter estimation, the minimum bit rate required for encoding the prediction error is determined. In this model, the six affine transformation parameters are assumed to be affected by statistically independent estimation errors, each following a zero-mean Gaussian probability density function (pdf). The joint pdf of the estimation errors is derived and transformed into the pdf of the location-dependent displacement estimation error in the image. The latter is then related to the minimum bit rate required for encoding the prediction error. Analogous to the derivation for the fully affine motion model, a four-parameter simplified affine model is investigated. It is of particular interest since such a model is considered for the upcoming video coding standard Versatile Video Coding (VVC), the successor to HEVC. As the simplified affine motion model is able to describe most motions contained in aerial surveillance videos, its application in video coding is justified. Both models provide valuable information about the minimum bit rate for encoding the prediction error as a function of the affine estimation accuracy.
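    For concreteness, the four-parameter simplified affine model restricts the full six-parameter affine transform to rotation, uniform scaling, and translation. The sketch below, assuming NumPy and purely illustrative names, shows how such a model maps pixel positions; it mirrors the general form of the simplified model considered for VVC, not necessarily the thesis's exact parameterization.

```python
import numpy as np

def simplified_affine(points, a, b, tx, ty):
    """Four-parameter simplified affine motion model.

    Writing a = s*cos(phi) and b = s*sin(phi), the model covers rotation
    (phi), uniform scaling (s), and translation (tx, ty); a fully affine
    model would use six independent parameters instead.
    """
    A = np.array([[a, -b],
                  [b,  a]])
    return points @ A.T + np.array([tx, ty])

# Example: rotate by 5 degrees, scale by 1.02, shift by (3, -1).
phi, s = np.deg2rad(5.0), 1.02
pts = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
warped = simplified_affine(pts, s * np.cos(phi), s * np.sin(phi), 3.0, -1.0)
print(warped)
```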
    Although the bit rate in motion-compensated prediction can be considerably reduced by using a motion model that is able to describe the motion types occurring in the scene, the total video bit rate may remain quite high, depending on the motion estimation accuracy. Thus, using aerial surveillance sequences as an example, a codec-independent region-of-interest (ROI) based aerial video coding system is proposed that exploits the characteristics of such sequences. Assuming the captured scene to be planar, one frame can be projected into another using global motion compensation, so only newly emerging areas have to be encoded. At the decoder, all new areas are registered into a so-called mosaic, from which reconstructed frames are extracted and concatenated into a video sequence. To also preserve moving objects in the reconstructed video, local motion is detected and encoded in addition to the new areas. The proposed ROI coding system was evaluated at very low and low bit rates between 100 and 5000 kbit/s for aerial sequences of HD resolution. It is able to reduce the bit rate by 90% compared to common HEVC coding of similar quality. Subjective tests confirm that the overall image quality of the ROI coding system exceeds that of a common HEVC encoder, especially at very low bit rates below 1 Mbit/s.

    To prevent discontinuities introduced by inaccurate global motion estimation, as may be caused by radial lens distortion, a fully automatic in-loop radial distortion compensation is proposed. For this purpose, an unknown radial distortion compensation parameter, constant over a group of frames, is estimated jointly with the global motion. This parameter is optimized to minimize the distortions of the projections of frames into the mosaic. With this approach, the global motion compensation is improved by 0.27 dB, and discontinuities in the frames extracted from the mosaic are diminished. As an additional benefit, the generation of long-term mosaics becomes possible, constructed from more than 1500 aerial frames with unknown radial lens distortion and without any calibration or manual lens distortion compensation.
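    The joint estimation can be illustrated with a crude one-dimensional search: for each candidate distortion parameter, undistort matched points with a single-parameter division model, fit a homography, and keep the parameter minimizing the transfer error. This sketch assumes OpenCV, the division model, and illustrative names; the thesis's in-loop method instead optimizes the parameter against the projections into the mosaic itself.

```python
import cv2
import numpy as np

def undistort_division(pts, k, center):
    """Single-parameter division model: x_u = c + (x_d - c) / (1 + k*r^2)."""
    d = pts - center
    r2 = (d ** 2).sum(axis=1, keepdims=True)
    return center + d / (1.0 + k * r2)

def estimate_k(pts_a, pts_b, center, ks=np.linspace(-1e-6, 1e-6, 41)):
    """Grid-search a radial distortion parameter jointly with the global
    motion (homography) over matched points from two frames. The k range
    assumes pixel coordinates at roughly HD scale."""
    best = (0.0, np.inf)
    for k in ks:
        ua = undistort_division(pts_a, k, center)
        ub = undistort_division(pts_b, k, center)
        H, _ = cv2.findHomography(ua, ub, cv2.RANSAC, 3.0)
        if H is None:
            continue
        proj = cv2.perspectiveTransform(ua.reshape(-1, 1, 2), H).reshape(-1, 2)
        err = np.sqrt(((proj - ub) ** 2).sum(axis=1)).mean()
        if err < best[1]:
            best = (k, err)
    return best  # (distortion parameter, mean transfer error in pixels)
```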

    Sharpness-aware document image mosaicing using graphcuts

    There are numerous types of documents which are difficult to scan or capture in a single pass due to their physical size or the size of their content. One possible solution that has been proposed is mosaicing multiple overlapping images to capture the complete document. In this paper, we present a novel Graphcut-based document image mosaicing method which seeks to overcome the known limitations of the previous approaches. First, our method does not require any prior knowledge of the content of the given document images, making it more widely applicable and robust. Second, information regarding the geometrical disposition between the overlapping images is exploited to minimize the errors at the boundary regions. Third, our method incorporates a sharpness measure which induces cut generation in a way that results in the mosaic including the sharpest pixels. Our method is shown to outperform previous methods, both quantitatively and qualitatively. Index Terms — document, image mosaicing, panorama, Graphcut
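    To make the sharpness-aware cut concrete, the sketch below computes a vertical seam through the overlap of two aligned images with dynamic programming, where the per-pixel cost combines the disagreement between the images with a sharpness term (Laplacian energy) that steers the seam toward pixels where neither side's sharp content would be lost. This is a simplified 1-D stand-in for the paper's 2-D graph cut, under assumed cost weights; all names are illustrative.

```python
import numpy as np
from scipy.ndimage import laplace

def sharpness_seam(overlap_a, overlap_b, alpha=0.5):
    """Find a vertical seam through two aligned overlap regions (H x W
    float grayscale). Left of the seam comes from image A, right from
    image B. The cost is low where the images agree and where they are
    about equally sharp, so the cut avoids splitting sharp content."""
    diff = np.abs(overlap_a - overlap_b)          # photometric agreement
    sharp_a = np.abs(laplace(overlap_a))          # local sharpness of A
    sharp_b = np.abs(laplace(overlap_b))          # local sharpness of B
    cost = diff + alpha * np.abs(sharp_a - sharp_b)
    h, w = cost.shape
    acc = cost.copy()
    for y in range(1, h):                         # DP accumulation, top-down
        left = np.roll(acc[y - 1], 1);  left[0] = np.inf
        right = np.roll(acc[y - 1], -1); right[-1] = np.inf
        acc[y] += np.minimum(np.minimum(left, acc[y - 1]), right)
    seam = np.empty(h, dtype=int)                 # backtrack the cheapest path
    seam[-1] = int(np.argmin(acc[-1]))
    for y in range(h - 2, -1, -1):
        x = seam[y + 1]
        lo, hi = max(0, x - 1), min(w, x + 2)
        seam[y] = lo + int(np.argmin(acc[y, lo:hi]))
    return seam  # seam[y] = column of the cut in row y
```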