15 research outputs found

    Methods for text segmentation from scene images

    Get PDF
    Camera-captured scene/born-digital image analysis helps in developing vision for robots to read, transliterate or translate text, navigate, and retrieve search results. However, text in such images does not follow any standard layout, and its location within the image is essentially random. In addition, motion blur, non-uniform illumination, skew, occlusion and scale-based degradations increase the complexity of locating and recognizing text in a scene/born-digital image. The OTCYMIST method is proposed to segment text from born-digital images. This method won first place in ICDAR 2011 and placed third in ICDAR 2013 on the text segmentation task of the robust reading competitions for the born-digital image data set. Here, Otsu's binarization and Canny edge detection are carried out separately on the three colour planes of the image. Connected components (CCs) obtained from the segmented image are pruned using thresholds on their area and aspect ratio, and CCs with sufficient edge pixels are retained. The centroids of the individual CCs are used as the nodes of a graph, over which a minimum spanning tree is built. Long edges of the minimum spanning tree are broken, and the pairwise height ratio is used to remove likely non-text components. CCs are then grouped based on their horizontal proximity to generate bounding boxes (BBs) of text strings. Overlapping BBs are removed using an overlap area threshold; non-overlapping and minimally overlapping BBs are retained for text segmentation. These BBs are split vertically to localize text at the word level. A word cropped from a document image can easily be recognized using a traditional optical character recognition (OCR) engine. However, recognizing a word manually cropped from a scene/born-digital image is not trivial, since existing OCR engines do not handle such word images effectively. Our intention is to first segment the word image and then pass it to an existing OCR engine for recognition. This is advantageous in two respects: it avoids building a character classifier from scratch, and it reduces the word recognition task to a word segmentation task. Here, we propose three bottom-up approaches to segment a cropped word image, each choosing different features at the initial stage of segmentation. The power-law transform (PLT) was applied to the pixels of grayscale born-digital images to non-linearly enhance the histogram. The recognition rate achieved on born-digital word images is 82.9%, about 20 percentage points higher than the top-performing entry (61.5%) in the ICDAR 2011 robust reading competition. Using PLT, the recognition rates are 82.7% and 64.6% for born-digital and scene images of the ICDAR 2013 robust reading competition, respectively. In addition, we applied PLT to colour planes such as red, green, blue, intensity and lightness, varying the gamma value. We call this technique nonlinear enhancement and selection of plane (NESP) for optimal segmentation; it is an improvement over PLT. NESP chooses a particular plane with a proper gamma value based on the Fisher discrimination factor. The recognition rate is 72.8% for scene images of the ICDAR 2011 robust reading competition, more than 30 percentage points higher than the best entry (41.2%). Using NESP, the recognition rates are 81.7% and 65.9% for born-digital and scene images of the ICDAR 2013 robust reading competition, respectively.
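
    The core of NESP lends itself to a compact illustration. Below is a minimal Python/OpenCV sketch of the power-law transform applied to several colour planes, with the plane and gamma chosen by a Fisher-style discrimination factor computed from an Otsu split; the exact plane set, gamma grid and Fisher formulation used in the thesis are not reproduced here, so treat the names and constants as assumptions.

```python
import numpy as np
import cv2

def fisher_criterion(plane: np.ndarray) -> float:
    """Fisher-style discrimination factor after an Otsu split: squared
    separation of the two class means over the sum of class variances."""
    t, _ = cv2.threshold(plane, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    fg, bg = plane[plane > t].astype(float), plane[plane <= t].astype(float)
    if fg.size == 0 or bg.size == 0:
        return 0.0
    return (fg.mean() - bg.mean()) ** 2 / (fg.var() + bg.var() + 1e-9)

def nesp_select(img_bgr: np.ndarray, gammas=(0.5, 1.0, 1.5, 2.0)):
    """Apply the power-law transform plane**gamma to each candidate plane and
    return the (plane name, gamma, transformed plane) maximizing the Fisher
    criterion. The plane set and gamma grid are illustrative choices."""
    b, g, r = cv2.split(img_bgr)
    planes = {"red": r, "green": g, "blue": b,
              "intensity": cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY),
              "lightness": cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)[:, :, 0]}
    best, best_score = None, -1.0
    for name, plane in planes.items():
        x = plane.astype(np.float64) / 255.0
        for gamma in gammas:
            enhanced = np.uint8(255 * np.power(x, gamma))  # power-law transform
            score = fisher_criterion(enhanced)
            if score > best_score:
                best, best_score = (name, gamma, enhanced), score
    return best
```
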
Another technique, midline analysis and propagation of segmentation (MAPS), has also been proposed for word segmentation. Here, the middle-row pixels of the grayscale image are segmented first, and the statistics of the segmented pixels are used to assign text and non-text labels to the rest of the image pixels using a min-cut method. A Gaussian model is fitted to the middle-row segmented pixels before the other pixels are assigned. The MAPS method assumes that the middle-row pixels are the least affected by any of the degradations. This assumption is validated by the good word recognition rate of 71.7% on scene images of the ICDAR 2011 robust reading competition. Using MAPS, the recognition rates are 83.8% and 66.0% for born-digital and scene images of the ICDAR 2013 robust reading competition, respectively. The best reported result for ICDAR 2003 word images is 61.1%, obtained using custom lexicons containing the list of test words. In contrast, NESP and MAPS achieve 66.2% and 64.5% on ICDAR 2003 word images without using any lexicon. With a similar custom lexicon, the recognition rates on ICDAR 2003 word images rise to 74.9% and 74.2% for the NESP and MAPS methods, respectively. We manually segmented word images and recognized them using OCR to benchmark the maximum possible recognition rate for each database. The recognition rates of the proposed methods and the benchmark results are reported on seven publicly available word image data sets and compared with results reported in the literature. We have also designed a classifier to recognize Kannada characters and words from the Chars74k data set and our own image collection, respectively. The discrete cosine transform (DCT) and block DCT are used as features to train separate classifiers. Kannada words are segmented using the same techniques (MAPS and NESP) and further segmented into groups of components, since a Kannada character may be represented by a single component or a group of components in an image. The recognition rate on Kannada words is reported for different features, with and without the use of a lexicon. The recognition performance obtained for Kannada character recognition (11.4%) is three times the best performance (3.5%) reported in the literature. This thesis has dealt with the principal aspects of camera-captured scene/born-digital text image analysis: text localization, text segmentation and word recognition. We have benchmarked the recognition rates of five word image data sets. We also conducted a multi-script robust reading competition as part of ICDAR 2013, aimed at determining whether the text localization and segmentation methods were capable of handling any text, independent of the script.
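
    The MAPS idea can likewise be sketched compactly. The fragment below Otsu-segments the middle row, fits a Gaussian to each resulting class and labels the remaining pixels by maximum likelihood; the published method propagates labels with a min-cut formulation, for which this simplified per-pixel likelihood test is only a stand-in.

```python
import numpy as np
import cv2

def maps_segment(gray: np.ndarray) -> np.ndarray:
    """Simplified MAPS: Otsu-segment the middle row of an 8-bit grayscale
    word image, fit a Gaussian to each of the two middle-row classes, then
    label every image pixel by maximum likelihood. (The published method
    propagates labels with a min-cut instead of this per-pixel test.)"""
    h = gray.shape[0]
    mid = gray[h // 2 : h // 2 + 1, :]              # middle row, kept 2-D for OpenCV
    t, _ = cv2.threshold(mid, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    vals = mid.ravel().astype(float)
    c0, c1 = vals[vals <= t], vals[vals > t]        # assumes both classes present
    m0, s0 = c0.mean(), c0.std() + 1e-6             # Gaussian for class 0
    m1, s1 = c1.mean(), c1.std() + 1e-6             # Gaussian for class 1
    x = gray.astype(float)
    ll0 = -np.log(s0) - 0.5 * ((x - m0) / s0) ** 2  # log-likelihood under class 0
    ll1 = -np.log(s1) - 0.5 * ((x - m1) / s1) ** 2  # log-likelihood under class 1
    return np.where(ll1 > ll0, 255, 0).astype(np.uint8)
```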

    Why my photos look sideways or upside down? Detecting Canonical Orientation of Images using Convolutional Neural Networks

    Full text link
    Image orientation detection requires high-level scene understanding. Humans use object recognition and contextual scene information to correctly orient images. In the literature, the problem of image orientation detection is mostly tackled using low-level vision features, while some approaches incorporate a few easily detectable semantic cues to gain minor improvements. The vast amount of semantic content in images makes orientation detection challenging, so there is a large semantic gap between existing methods and human performance. Moreover, existing methods report highly discrepant detection rates, mainly due to large differences in datasets and the limited variety of test images used for evaluation. In this work, for the first time, we leverage the power of deep learning and adapt pre-trained convolutional neural networks, using the largest training dataset to date, to the image orientation detection task. An extensive evaluation of our model on different public datasets shows that it generalizes remarkably well, correctly orienting a large set of unconstrained images; it also significantly outperforms the state of the art and achieves accuracy very close to that of humans.
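
    As a rough illustration of the approach described (not the authors' exact architecture or hyperparameters), the following PyTorch sketch adapts a pre-trained CNN to a four-way canonical-orientation classifier; the ResNet-18 backbone, learning rate and classifier head are illustrative assumptions, assuming a recent torchvision.

```python
import torch
import torch.nn as nn
from torchvision import models

# Adapt a pre-trained CNN (ResNet-18 here, an illustrative choice) to predict
# one of four canonical orientations: 0, 90, 180 or 270 degrees.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 4)   # new 4-way classifier head

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step; labels in {0,1,2,3} index the four rotations.
    Training pairs come essentially for free by rotating upright photos."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```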

    Automated classification of cricket pitch frames in cricket video

    Get PDF
    The automated detection of the cricket pitch in a video recording of a cricket match is a fundamental step in content-based indexing and summarization of cricket videos. In this paper, we propose visual-content-based algorithms to automate the extraction of video frames with the cricket pitch in focus. As a preprocessing step, we first select a subset of frames with a view of the cricket field, of which the cricket pitch forms a part. This filtering process reduces the search space by eliminating frames that contain a view of the audience, close-up shots of specific players, advertisements, and so on. The subset of frames containing the cricket field is then subject to statistical modeling of the grayscale (brightness) histogram (SMoG). Since SMoG does not utilize color or domain-specific information, such as the region of the frame where the pitch is expected to be located, we propose an alternative algorithm, component quantization based region of interest extraction (CQRE), for the extraction of pitch frames. Experimental results demonstrate that, regardless of the quality of the input, successive application of the two methods outperforms either one applied exclusively. The SMoG-CQRE combination for pitch frame classification yields an average accuracy of 98.6% in the best case (a high-resolution video with good contrast) and 87.9% in the worst case (a low-resolution video with poor contrast). Since the extraction of pitch frames forms the first step in analyzing the important events in a match, we also present a post-processing step, viz., an algorithm to detect players in the extracted pitch frames.
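
    The field-view preprocessing step can be illustrated with a short sketch. The fragment below keeps a frame only if a large fraction of its pixels falls in a grass-green hue band; the hue band and threshold are illustrative assumptions, not values from the paper, and the SMoG/CQRE stages themselves are not reproduced.

```python
import numpy as np
import cv2

def is_field_frame(frame_bgr: np.ndarray, green_ratio: float = 0.4) -> bool:
    """Keep a frame only if a large share of its pixels lies in a grass-green
    hue band; a stand-in for the field-view preprocessing filter. The hue
    band and the 0.4 threshold are illustrative, not values from the paper."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))  # OpenCV hue is 0-179
    return float(np.count_nonzero(mask)) / mask.size >= green_ratio
```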

    Image complexity based fMRI-BOLD visual network categorization across visual datasets using topological descriptors and deep-hybrid learning

    Full text link
    This study proposes a new approach that investigates differences in the topological characteristics of visual networks constructed from fMRI BOLD time-series corresponding to the COCO, ImageNet, and SUN visual datasets. The publicly available BOLD5000 dataset is utilized, which contains fMRI scans recorded while viewing 5254 images of diverse complexities. The objective of this study is to examine how network topology differs in response to distinct visual stimuli from these visual datasets. To achieve this, 0- and 1-dimensional persistence diagrams are computed for each visual network representing COCO, ImageNet, and SUN. To extract suitable features from the topological persistence diagrams, K-means clustering is executed. The extracted K-means cluster features are fed to a novel deep-hybrid model that yields accuracy in the range of 90%-95% in classifying these visual networks. For understanding vision, this type of visual network categorization across visual datasets is important, as it captures differences in BOLD signals while perceiving images with different contexts and complexities. Furthermore, the distinctive topological patterns of the visual network associated with each dataset, as revealed by this study, could potentially lead to the development of future neuroimaging biomarkers for diagnosing visual processing disorders such as visual agnosia or prosopagnosia, and for tracking changes in visual cognition over time.
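
    A compact sketch of the feature-extraction stage described above, under stated assumptions: persistence diagrams are computed here with the ripser package (one possible tool; the study's own tooling is not specified in the abstract), and each diagram is summarized by K-means cluster centers to yield a fixed-length vector for a downstream classifier. The input point cloud and k are placeholders.

```python
import numpy as np
from ripser import ripser            # persistent homology (one possible tool)
from sklearn.cluster import KMeans

def topological_features(point_cloud: np.ndarray, k: int = 8) -> np.ndarray:
    """Summarize the 0- and 1-dimensional persistence diagrams of a point
    cloud by K-means cluster centers, giving a fixed-length feature vector.
    Here `point_cloud` stands in for points derived from BOLD time-series."""
    dgms = ripser(point_cloud, maxdim=1)["dgms"]   # [H0 diagram, H1 diagram]
    feats = []
    for dgm in dgms:
        pts = dgm[np.isfinite(dgm).all(axis=1)]    # drop infinite-death points
        if len(pts) < k:                           # pad sparse diagrams
            pts = np.vstack([pts, np.zeros((k - len(pts), 2))])
        centers = KMeans(n_clusters=k, n_init=10).fit(pts).cluster_centers_
        feats.append(np.sort(centers, axis=0).ravel())  # order-invariant summary
    return np.concatenate(feats)                   # fed to the classifier
```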

    Weighted bi-prediction for light field image coding

    Get PDF
    Light field imaging based on a single-tier camera equipped with a microlens array (also known as integral, holoscopic, and plenoptic imaging) has recently risen as a practical and prospective approach for future visual applications and services. However, successfully deploying actual light field imaging applications and services will require developing adequate coding solutions to efficiently handle the massive amount of data involved in these systems. In this context, self-similarity compensated prediction is a non-local spatial prediction scheme based on block matching that has been shown to achieve high efficiency for light field image coding based on the High Efficiency Video Coding (HEVC) standard. As previously shown by the authors, this is possible by simply averaging two predictor blocks that are jointly estimated from a causal search window in the current frame itself, referred to as self-similarity bi-prediction. However, theoretical analyses of motion-compensated bi-prediction have suggested that further rate-distortion performance improvements are still possible by adaptively estimating the weighting coefficients of the two predictor blocks. Therefore, this paper presents a comprehensive study of the rate-distortion performance of HEVC-based light field image coding when using different sets of weighting coefficients for self-similarity bi-prediction. Experimental results demonstrate that it is possible to extend the previous theoretical conclusions to light field image coding and show that the proposed adaptive weighting coefficient selection leads to up to 5% bit savings compared to the previous self-similarity bi-prediction scheme.
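
    A toy version of adaptive weighting makes the idea concrete. The sketch below combines two predictor blocks with candidate weight pairs and keeps the pair minimizing the sum of squared differences against the block being coded; the candidate set is an assumption, and a real encoder would also fold the cost of signalling the chosen pair into the rate-distortion decision.

```python
import numpy as np

# Candidate weight pairs for combining the two predictor blocks; the actual
# coefficient sets studied in the paper are not reproduced here.
WEIGHT_PAIRS = [(0.5, 0.5), (0.25, 0.75), (0.75, 0.25),
                (0.375, 0.625), (0.625, 0.375)]

def weighted_bi_prediction(p0: np.ndarray, p1: np.ndarray, target: np.ndarray):
    """Return the weight pair (and prediction) minimizing the sum of squared
    differences between w0*p0 + w1*p1 and the block being coded."""
    best_pair, best_pred, best_cost = None, None, np.inf
    tgt = target.astype(np.float64)
    for w0, w1 in WEIGHT_PAIRS:
        pred = w0 * p0.astype(np.float64) + w1 * p1.astype(np.float64)
        cost = float(np.sum((tgt - pred) ** 2))  # SSD distortion
        if cost < best_cost:
            best_pair, best_pred, best_cost = (w0, w1), pred, cost
    return best_pair, best_pred
```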

    Light field coding with field of view scalability and exemplar-based inter-layer prediction

    Get PDF
    Light field imaging based on microlens arrays (a.k.a. holoscopic, plenoptic, and integral imaging) has recently risen as a feasible and prospective technology for future image and video applications. However, deploying actual light field applications will require identifying more powerful representations and coding solutions that support emerging manipulation and interaction functionalities. In this context, this paper proposes a novel scalable coding solution that supports a new type of scalability, referred to as field-of-view scalability. The proposed scalable coding solution comprises a base layer compliant with the High Efficiency Video Coding (HEVC) standard, complemented by one or more enhancement layers that progressively allow richer versions of the same light field content in terms of content manipulation and interaction possibilities. In addition, to achieve high compression performance in the enhancement layers, novel exemplar-based inter-layer coding tools are also proposed, namely: 1) a direct prediction based on exemplar texture samples from lower layers and 2) an inter-layer compensated prediction using a reference picture that is built relying on an exemplar-based algorithm for texture synthesis. Experimental results demonstrate the advantages of the proposed scalable coding solution in catering to users with different preferences/requirements in terms of interaction functionalities, while providing better rate-distortion performance (independently of the optical setup used for acquisition) compared to HEVC and other scalable light field coding solutions in the literature.
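
    The first of the proposed inter-layer tools, prediction from exemplar texture samples in a lower layer, can be caricatured as a brute-force exemplar search. The sketch below scans the decoded base layer for the patch that best matches the current block and returns it as the predictor; note that a real codec would match a causal template rather than the block itself, since the decoder does not know the block being predicted, so this is a deliberately simplified stand-in.

```python
import numpy as np

def exemplar_predict(block: np.ndarray, base_layer: np.ndarray, step: int = 1):
    """Brute-force exemplar search: scan the decoded base layer for the patch
    that best matches `block` under SSD and return it as the predictor.
    Simplified stand-in; not the paper's actual coding tool."""
    bh, bw = block.shape
    tgt = block.astype(np.float64)
    best, best_cost = None, np.inf
    for y in range(0, base_layer.shape[0] - bh + 1, step):
        for x in range(0, base_layer.shape[1] - bw + 1, step):
            cand = base_layer[y:y + bh, x:x + bw].astype(np.float64)
            cost = np.sum((cand - tgt) ** 2)   # sum of squared differences
            if cost < best_cost:
                best, best_cost = cand, cost
    return best                                # exemplar used as predictor
```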

    NIAS Annual Report 2022-2023

    Get PDF

    Applications of Advanced Imaging Methods: Macro-Scale Studies of Woven Composites and Micro-Scale Measurements on Heated IC Packages

    Get PDF
    As a representative advanced imaging technique, the digital image correlation (DIC) method has been well established and is widely used for deformation measurements in experimental mechanics. This methodology, in both 2D and 3D forms, provides qualitative and quantitative information regarding a specimen's non-uniform deformation response. Its full-field capabilities and non-contacting approach are especially advantageous when applied to heterogeneous material systems such as fiber-reinforced composites and integrated chip (IC) packages. To increase understanding of damage evolution in advanced composite material systems, a series of large-deflection bending-compression experiments and model predictions have been performed for a woven glass-epoxy composite material system. Stereo digital image correlation has been integrated with a compression-bending mechanical loading system to simultaneously quantify full-field deformations along the length of the specimen. Specifically, the integrated system is employed to experimentally study the highly non-uniform full-field strain fields on both the compression and tension surfaces of the heterogeneous specimen undergoing compression-bending loading. Theoretical developments employing both small and large deformation models are performed. Results show (a) that Euler-Bernoulli beam theory for small deformations is adequate to describe the shape and deformations when the axial and transverse displacements are quite small; (b) that a modified Drucker's equation effectively extends the theoretical predictions to the large-deformation region, providing an accurate estimate of the buckling load, the post-buckling axial load-axial displacement response of the specimen, and the axial strain along the beam centerline, even in the presence of the observed anticlastic (double) specimen curvature near mid-length for all fiber angles (which is not modeled); and (c), for the first time, that the effective stress σeff and effective strain εeff are linearly related on both the compression and tension surfaces of a bending-compression specimen in the range 0 ≤ εeff < 0.005 as the specimen undergoes combined bending-compression loading. In addition, computational studies are consistent with the experimental σeff-εeff results on both surfaces. In a separate set of studies, SEM-based imaging at high magnification is used with 2D-DIC to measure thermal deformations at the nano-scale on cross-sections of IC packages, to improve understanding of the highly heterogeneous nature of the deformations in IC chips. Full-field thermal deformation experiments on different materials within an IC chip cross-section have been successfully performed for areas from 50x50 μm² down to 10x10 μm² and at temperatures from room temperature (RT) to ≈200°C, using images obtained with a Zeiss Ultraplus thermal field emission SEM. Initially, polishing methods for heterogeneous electronic packages containing silicon, Cu bump, WPR layer, substrate and FLI (first-level interconnect) were evaluated with the goal of achieving sub-micron surface flatness. Studies have shown that a surface flatness of 700 nm is achievable, though this level is unacceptable when using e-beam photolithography for nanoscale patterning. Fortunately, a novel self-assembly technique was identified and used to obtain a dense, randomly isotropic, high-contrast pattern over the surface of the entire heterogeneous region of an IC package for SEM imaging and DIC.
Experiments performed on baseline materials for temperatures in the range 25°C to 200°C demonstrate that the complete process is effective for quantifying the thermal coefficient of expansion of nickel, aluminum and brass. The experiments on IC cross-sections were performed when viewing 25μm x 25μm areas and correcting image distortions using software developed at USC. The results clearly show the heterogeneous nature of the specimen surface and the non-uniform strain field across the complex material constituents for temperatures ranging from RT to 200°C. Experimental results confirm that the method is capable of measuring local thermal expansion in selected regions, improving our understanding of these heterogeneous material systems under controlled thermal-environmental conditions.
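
    Since DIC is central to both studies, a minimal sketch may help. The fragment below tracks a single square subset from a reference image into a deformed image by normalized cross-correlation, returning its integer-pixel displacement; full DIC adds subpixel interpolation and subset shape functions, which are omitted, and the subset and search sizes are illustrative assumptions.

```python
import numpy as np
import cv2

def dic_displacement(ref: np.ndarray, deformed: np.ndarray, y: int, x: int,
                     subset: int = 21, search: int = 30):
    """Track one square subset centered at (y, x) of an 8-bit grayscale
    reference image into the deformed image by normalized cross-correlation,
    returning its integer-pixel displacement (dy, dx). Assumes (y, x) lies
    far enough from the borders for the subset to fit."""
    half = subset // 2
    tmpl = ref[y - half:y + half + 1, x - half:x + half + 1]
    y0 = max(0, y - half - search)
    y1 = min(deformed.shape[0], y + half + search + 1)
    x0 = max(0, x - half - search)
    x1 = min(deformed.shape[1], x + half + search + 1)
    window = deformed[y0:y1, x0:x1]                    # local search window
    scores = cv2.matchTemplate(window, tmpl, cv2.TM_CCOEFF_NORMED)
    dy, dx = np.unravel_index(np.argmax(scores), scores.shape)
    best_cy, best_cx = y0 + dy + half, x0 + dx + half  # matched subset center
    return best_cy - y, best_cx - x
```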