    Methods for text segmentation from scene images

    Camera-captured scene/born-digital image analysis helps in the development of vision for robots to read text, transliterate or translate text, navigate and retrieve search results. However, text in such images does not follow any standard layout, and its location within the image is random in nature. In addition, motion blur, non-uniform illumination, skew, occlusion and scale-based degradations increase the complexity of locating and recognizing text in a scene/born-digital image. The OTCYMIST method is proposed to segment text from born-digital images. This method won first place in ICDAR 2011 and third place in ICDAR 2013 for its performance on the text segmentation task in the robust reading competitions for the born-digital image data set. Here, Otsu's binarization and Canny edge detection are carried out separately on the three colour planes of the image. Connected components (CCs) obtained from the segmented image are pruned based on thresholds applied to their area and aspect ratio. CCs with sufficient edge pixels are retained. The centroids of the individual CCs are used as nodes of a graph, and a minimum spanning tree is built over these nodes. Long edges of the minimum spanning tree are broken, and a pairwise height ratio is used to remove likely non-text components. CCs are then grouped based on their proximity in the horizontal direction to generate bounding boxes (BBs) of text strings. Overlapping BBs are removed using an overlap area threshold; non-overlapping and minimally overlapping BBs are retained for text segmentation. These BBs are split vertically to localize text at the word level.

    A word cropped from a document image can easily be recognized using a traditional optical character recognition (OCR) engine. However, recognizing a word obtained by manually cropping a scene/born-digital image is not trivial, and existing OCR engines do not handle such scene word images effectively. Our intention is to first segment the word image and then pass it to existing OCR engines for recognition. This is advantageous in two respects: it avoids building a character classifier from scratch and reduces the word recognition task to a word segmentation task. Here, we propose three bottom-up approaches to segment a cropped word image; they choose different features at the initial stage of segmentation. A power-law transform (PLT) was applied to the pixels of the grayscale born-digital images to non-linearly enhance the histogram. The recognition rate achieved on born-digital word images is 82.9%, which is 20% higher than the top-performing entry (61.5%) in the ICDAR 2011 robust reading competition. Using PLT, the recognition rates are 82.7% and 64.6% for born-digital and scene images of the ICDAR 2013 robust reading competition, respectively. In addition, we applied PLT to colour planes such as red, green, blue, intensity and lightness, varying the gamma value. We call this technique nonlinear enhancement and selection of plane (NESP) for optimal segmentation, which is an improvement over PLT. NESP chooses a particular plane with a proper gamma value based on the Fisher discrimination factor. The recognition rate is 72.8% for scene images of the ICDAR 2011 robust reading competition, which is 30% higher than the best entry (41.2%). Using NESP, the recognition rates are 81.7% and 65.9% for born-digital and scene images of the ICDAR 2013 robust reading competition, respectively.
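    The NESP idea described above can be illustrated with a short sketch. This is not the thesis implementation: the function names, the candidate gamma values and the use of OpenCV's Otsu thresholding are assumptions made only to show how a plane and gamma value might be selected by a Fisher discrimination factor.

```python
# Minimal sketch of NESP-style plane selection (illustrative only).
# Assumes OpenCV and NumPy; names and gamma values are hypothetical.
import cv2
import numpy as np

def fisher_discrimination(plane, mask):
    """Fisher discrimination factor between the two segmented classes."""
    fg, bg = plane[mask], plane[~mask]
    if fg.size == 0 or bg.size == 0:
        return 0.0
    return (fg.mean() - bg.mean()) ** 2 / (fg.var() + bg.var() + 1e-9)

def nesp_segment(bgr_image, gammas=(0.5, 1.0, 1.5, 2.0, 2.5)):
    """Pick the colour plane and gamma giving the best class separation."""
    b, g, r = cv2.split(bgr_image)
    intensity = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    lightness = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)[:, :, 0]
    best_binary, best_score = None, -1.0
    for plane in (r, g, b, intensity, lightness):
        norm = plane.astype(np.float32) / 255.0
        for gamma in gammas:
            # Power-law (gamma) transform to non-linearly stretch the histogram.
            enhanced = np.uint8(255 * norm ** gamma)
            _, binary = cv2.threshold(enhanced, 0, 255,
                                      cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            score = fisher_discrimination(enhanced, binary > 0)
            if score > best_score:
                best_binary, best_score = binary, score
    return best_binary
```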
Another technique, midline analysis and propagation of segmentation (MAPS), has also been proposed for word segmentation. Here, the middle-row pixels of the grayscale image are segmented first, and the statistics of the segmented pixels are used to assign text and non-text labels to the rest of the image pixels using a min-cut method. A Gaussian model is fitted to the segmented middle-row pixels before the remaining pixels are assigned. In the MAPS method, we assume that the middle-row pixels are the least affected by any of the degradations. This assumption is validated by the good word recognition rate of 71.7% on the ICDAR 2011 robust reading competition scene images. Using MAPS, the recognition rates are 83.8% and 66.0% for born-digital and scene images of the ICDAR 2013 robust reading competition, respectively. The best reported result for ICDAR 2003 word images is 61.1%, obtained using custom lexicons containing the list of test words. In contrast, NESP and MAPS achieve 66.2% and 64.5% on ICDAR 2003 word images without using any lexicon. Using a similar custom lexicon, the recognition rates for ICDAR 2003 word images rise to 74.9% and 74.2% for the NESP and MAPS methods, respectively. We manually segmented word images and recognized them using OCR to benchmark the maximum possible recognition rate for each database. The recognition rates of the proposed methods and the benchmark results are reported on the seven publicly available word image data sets and compared with the results reported in the literature. We have also designed a classifier to recognize Kannada characters and words from the Chars74k data set and our own image collection, respectively. The discrete cosine transform (DCT) and block DCT are used as features to train separate classifiers. Kannada words are segmented using the same techniques (MAPS and NESP) and further segmented into groups of components, since a Kannada character may be represented by a single component or a group of components in an image. The recognition rate on Kannada words is reported for different features with and without the use of a lexicon. The recognition performance obtained for Kannada character recognition (11.4%) is three times the best performance (3.5%) reported in the literature. This thesis has dealt with the principal aspects of camera-captured scene/born-digital text image analysis: text localization, text segmentation and word recognition. We have benchmarked the recognition rates of five word image data sets. We also conducted a multi-script robust reading competition as part of ICDAR 2013, which aimed to determine whether text localization and segmentation methods are capable of handling any text, independent of the script.
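A minimal sketch of the MAPS idea follows. The thesis assigns the remaining labels with a min-cut formulation; the sketch below replaces that step with a simple per-pixel Gaussian likelihood comparison, and the function name and the decision of which class corresponds to text are assumptions made for illustration.

```python
# Illustrative sketch of MAPS: segment the middle row, fit a 1-D Gaussian to
# each of its two classes, then label the remaining pixels by likelihood.
# The thesis uses min-cut for the final labelling; maximum likelihood is used
# here only for brevity. Which class is "text" would be decided afterwards.
import cv2
import numpy as np

def maps_segment(gray):
    mid = gray[gray.shape[0] // 2, :].astype(np.float32)
    # Otsu threshold computed on the middle row only.
    thr, _ = cv2.threshold(mid.astype(np.uint8).reshape(1, -1), 0, 255,
                           cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    dark, bright = mid[mid <= thr], mid[mid > thr]
    # Gaussian statistics (mean, std) of the two middle-row classes.
    mu_d, sd_d = dark.mean(), dark.std() + 1e-6
    mu_b, sd_b = bright.mean(), bright.std() + 1e-6
    g = gray.astype(np.float32)
    log_p_dark = -((g - mu_d) ** 2) / (2 * sd_d ** 2) - np.log(sd_d)
    log_p_bright = -((g - mu_b) ** 2) / (2 * sd_b ** 2) - np.log(sd_b)
    # Binary map: pixels better explained by the "dark" middle-row class.
    return (log_p_dark > log_p_bright).astype(np.uint8) * 255
```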

    Cracks in the Glass: The Emergence of a New Image Typology from the Spatio-temporal Schisms of the 'Filmic' Virtual Reality Panorama

    Virtual Reality Panoramas have fascinated me for some time; their interactive nature affords a spectatorial engagement not evident within other forms of painting or digital imagery. This interactivity is not generally linear as is evident in animation or film, nor is the engagement with the image reduced to the physical or visual border of the image, as its limit is never visible to the viewer in its entirety. Further, the time taken to interact and navigate across the Virtual Reality panorama’s surface is not reflected or recorded within the observed image. The procedural construction of the Virtual Reality panorama creates an a-temporal image event that denies the durée of its own index and creation. This is particularly evident in the cinematic experiments conducted by Jeffrey Shaw in the 1990s that ‘spatialised’ time and image through the fusion of the formal typology of the Panorama together with the cinematic moving-image, creating a new kind of image technology. The incorporation of the space enclosed by the panorama’s drum into the conception and execution of the cinematic event reveals an interesting conceptual paradox. Space and time infinitely and autonomously repeat upon each other as the linear trajectory of the singular cinematic shot is interrupted by a ‘time schism’ on the surface of the panorama. This paper explores what this conceptual paradox means for the evolution of emerging image-technologies and how Shaw’s ‘mixed-reality’ installation reveals a wholly new image typology that presents techniques and concepts through which to record, interrogate, and represent time and space in Architecture.

    Moveable worlds/digital scenographies

    This is the author's accepted manuscript. The final published article is available from the link below. Copyright @ Intellect Ltd 2010. The mixed reality choreographic installation UKIYO explored in this article reflects an interest in scenographic practices that connect physical space to virtual worlds and explore how performers can move between material and immaterial spaces. The spatial design for UKIYO is inspired by Japanese hanamichi and western fashion runways, emphasizing the research production company's commitment to various creative crossovers between movement languages, innovative wearable design for interactive performance, acoustic and electronic sound processing, and digital image objects that have a plastic as well as an immaterial/virtual dimension. The work integrates various forms of making art in order to visualize things that are not in themselves visual, or which connect visual and kinaesthetic/tactile/auditory experiences. The ‘Moveable Worlds’ in this essay are also reflections of the narrative spaces, subtexts and auditory relationships in the mutating matrix of an installation-space inviting the audience to move around and follow its sensorial experiences, drawn near to the bodies of the dancers. Brunel University, the British Council, and the Japan Foundation

    Indexing of fictional video content for event detection and summarisation

    This paper presents an approach to movie video indexing that utilises audiovisual analysis to detect important and meaningful temporal video segments, which we term events. We consider three event classes, corresponding to dialogues, action sequences, and montages, where the latter also includes musical sequences. These three event classes are intuitive for a viewer to understand and recognise whilst accounting for over 90% of the content of most movies. To detect events we leverage traditional filmmaking principles and map these to a set of computable low-level audiovisual features. Finite state machines (FSMs) are used to detect when temporal sequences of specific features occur. A set of heuristics, again inspired by filmmaking conventions, is then applied to the output of multiple FSMs to detect the required events. A movie search system, named MovieBrowser, built upon this approach is also described. The overall approach is evaluated against a ground truth of over twenty-three hours of movie content drawn from various genres and consistently obtains high precision and recall for all event classes. A user experiment designed to evaluate the usefulness of an event-based structure for both searching and browsing movie archives is also described, and the results indicate the usefulness of the proposed approach.
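    As a rough illustration of the FSM idea, the sketch below detects runs of speech-dominated shots as dialogue candidates. The feature label "speech", the minimum run length and the state logic are assumptions made for illustration and are not the actual MovieBrowser rules.

```python
# Toy finite state machine over shot-level audio labels: a run of at least
# `min_len` consecutive "speech" shots is reported as a dialogue candidate.
def detect_dialogue_runs(shot_labels, min_len=3):
    events, start, state = [], None, "idle"
    for i, label in enumerate(shot_labels):
        speech = (label == "speech")
        if state == "idle" and speech:
            state, start = "in_dialogue", i       # enter dialogue state
        elif state == "in_dialogue" and not speech:
            if i - start >= min_len:              # close a long-enough run
                events.append((start, i - 1))
            state = "idle"
    if state == "in_dialogue" and len(shot_labels) - start >= min_len:
        events.append((start, len(shot_labels) - 1))
    return events

# Example: hypothetical shot labels produced by low-level feature analysis.
labels = ["music", "speech", "speech", "speech", "speech", "music", "speech"]
print(detect_dialogue_runs(labels))  # [(1, 4)]
```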

    From pixel to mesh: accurate and straightforward 3D documentation of cultural heritage from the Cres/Lošinj archipelago

    Most people like 3D visualizations. Whether it is in movies, holograms or games, 3D (literally) adds an extra dimension to conventional pictures. However, 3D data and their visualizations can also have scientific archaeological benefits: they are crucial in removing relief distortions from photographs, facilitate the interpretation of an object or simply support the aspiration to document archaeology as exhaustively as possible. Since archaeology is essentially a spatial discipline, the recording of the spatial data component is in most cases of the utmost importance for scientific archaeological research. For complex sites and precious artefacts, this can be a difficult, time-consuming and very expensive operation. In this contribution, it is shown how a straightforward and cost-effective hardware and software combination is used to accurately document and inventory some of the cultural heritage of the Cres/Lošinj archipelago in three or four dimensions. First, standard photographs are acquired from the site or object under study. Secondly, the resulting image collection is processed with some recent advances in computer technology and so-called Structure from Motion (SfM) algorithms, which are known for their ability to reconstruct a sparse point cloud of scenes that were imaged by a series of overlapping photographs. When complemented by multi-view stereo matching algorithms, detailed 3D models can be built from such photo collections in a fully automated way. Moreover, the software packages implementing these tools are available for free or at very low cost. Using a mixture of archaeological case studies, it will be shown that these computer vision applications produce excellent results from archaeological imagery with little effort. Besides serving the purpose of a pleasing 3D visualization for virtual display or publications, the 3D output additionally makes it possible to extract accurate metric information about the archaeology under study (from single artefacts to entire landscapes).
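    The sketch below illustrates the core Structure-from-Motion step on just two overlapping photographs using OpenCV: match local features, recover the relative camera pose from the essential matrix, and triangulate a sparse point cloud. The intrinsic matrix K, the SIFT and ratio-test choices, and the two-view restriction are simplifying assumptions; the packages referred to above additionally handle many views, bundle adjustment and dense multi-view stereo.

```python
# Two-view SfM sketch: sparse point cloud from a pair of overlapping photos.
# Assumes known camera intrinsics K (3x3) and OpenCV with SIFT available.
import cv2
import numpy as np

def sparse_two_view(img1_path, img2_path, K):
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(img1, None)
    k2, d2 = sift.detectAndCompute(img2, None)
    # Ratio-test matching of SIFT descriptors.
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(d1, d2, k=2)
            if m.distance < 0.75 * n.distance]
    pts1 = np.float32([k1[m.queryIdx].pt for m in good])
    pts2 = np.float32([k2[m.trainIdx].pt for m in good])
    # Relative pose from the essential matrix, then triangulation.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (pts4d[:3] / pts4d[3]).T  # N x 3 sparse point cloud
```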

    An investigation of the facsimile camera response to object motion

    A general analytical model of the facsimile camera response to object motion is derived as an initial step toward characterizing the resulting image degradation. This model expresses the spatial convolution of a time-varying object radiance distribution and camera point-spread function for each picture element in the image. Time variations in these two functions during each convolution account for blurring of small image detail, and variations between, as well as during, successive convolutions account for geometric image distortions. If the object moves beyond the angular extent of several picture elements while it is being imaged, then geometric distortion tends to dominate blurring as the primary cause of image degradation. The extent of distortion depends not only on object size and velocity but also on the direction of object motion, and is therefore difficult to classify in a general sense.
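    One plausible written-out form of such a per-pixel model is sketched below; it is not the paper's exact formulation. The exposure interval \Delta t and the motion terms v_x(t), v_y(t) are assumptions used only to indicate where blur (variation during each integral) and geometric distortion (variation between picture elements) enter.

```latex
% Sketch of the imaging model: the response g_i of picture element i is the
% time integral, over its exposure interval, of the moving object radiance L
% convolved spatially with the camera point-spread function h.
\[
  g_i \;=\; \int_{t_i}^{t_i+\Delta t}\!\!\iint
      L\bigl(x - v_x(t),\, y - v_y(t),\, t\bigr)\,
      h\bigl(x_i - x,\, y_i - y\bigr)\, dx\, dy\, dt
\]
```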
