10 research outputs found
Subword-based Stochastic Segment Modeling for Offline Arabic Handwriting Recognition
In this paper, we describe several experiments in which we use a stochastic segment model (SSM) to improve offline handwriting recognition (OHR) performance. We use the SSM to re-rank (re-score) multiple decoder hypotheses. Then, a probabilistic multi-class SVM is trained to model stochastic segments obtained from force aligning transcriptions with the underlying image. We extract multiple features from the stochastic segments that are sensitive to larger context span to train the SVM. Our experiments show that using confidence scores from the trained SVM within the SSM framework can significantly improve OHR performance. We also show that OHR performance can be improved by using a combination of character-based and parts-of-Arabic-words (PAW)-based SSMs
Recommended from our members
Automatic Identification of Errors in Arabic Handwriting Recognition
Arabic handwriting recognition (HR) is a challenging problem due to Arabic's connected letter forms, consonantal diacritics and rich morphology. In this paper we isolate the task of identification of erroneous words in HR from the task of producing corrections for these words. We consider a variety of linguistic (morphological and syntactic) and non-linguistic features to automatically identify these errors. We also consider a learning curve varying in two dimensions: number of segments and number of n-best hypotheses to train on. We additionally evaluate the performance on different test sets with different degrees of errors in them. Our best approach achieves a roughly ~20% absolute increase in F-score over a simple but reasonable baseline. A detailed error analysis shows that linguistic features, such as lemma models, help improve HR-error detection precisely where we expect them to: semantically inconsistent error words
Recommended from our members
Arabic text recognition of printed manuscripts. Efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing.
Arabic text recognition was not researched as thoroughly as other natural languages. The need for automatic Arabic text recognition is clear. In addition to the traditional applications like postal address reading, check verification in banks, and office automation, there is a large interest in searching scanned documents that are available on the internet and for searching handwritten manuscripts. Other possible applications are building digital libraries, recognizing text on digitized maps, recognizing vehicle license plates, using it as first phase in text readers for visually impaired people and understanding filled forms.
This research work aims to contribute to the current research in the field of optical character recognition (OCR) of printed Arabic text by developing novel techniques and schemes to advance the performance of the state of the art Arabic OCR systems.
Statistical and analytical analysis for Arabic Text was carried out to estimate the probabilities of occurrences of Arabic character for use with Hidden Markov models (HMM) and other techniques.
Since there is no publicly available dataset for printed Arabic text for recognition purposes it was decided to create one. In addition, a minimal Arabic script is proposed. The proposed script contains all basic shapes of Arabic letters. The script provides efficient representation for Arabic text in terms of effort and time.
Based on the success of using HMM for speech and text recognition, the use of HMM for the automatic recognition of Arabic text was investigated. The HMM technique adapts to noise and font variations and does not require word or character segmentation of Arabic line images.
In the feature extraction phase, experiments were conducted with a number of different features to investigate their suitability for HMM. Finally, a novel set of features, which resulted in high recognition rates for different fonts, was selected.
The developed techniques do not need word or character segmentation before the classification phase as segmentation is a byproduct of recognition. This seems to be the most advantageous feature of using HMM for Arabic text as segmentation tends to produce errors which are usually propagated to the classification phase.
Eight different Arabic fonts were used in the classification phase. The recognition rates were in the range from 98% to 99.9% depending on the used fonts. As far as we know, these are new results in their context. Moreover, the proposed technique could be used for other languages. A proof-of-concept experiment was conducted on English characters with a recognition rate of 98.9% using the same HMM setup. The same techniques where conducted on Bangla characters with a recognition rate above 95%.
Moreover, the recognition of printed Arabic text with multi-fonts was also conducted using the same technique. Fonts were categorized into different groups. New high recognition results were achieved.
To enhance the recognition rate further, a post-processing module was developed to correct the OCR output through character level post-processing and word level post-processing. The use of this module increased the accuracy of the recognition rate by more than 1%.King Fahd University of Petroleum and Minerals (KFUPM
Scene and Video Understanding
There have been significant improvements in the accuracy of scene understanding due to a shift from recognizing objects ``in isolation'' to context based recognition systems. Such systems improve recognition rates by augmenting appearance based models of individual objects with contextual information based on pairwise relationships between objects. These pairwise relations incorporate common sense world knowledge such as co-occurrences and spatial arrangements of objects, temporal consistency, scene layout, etc. However, these relations, even though consistent in the 3D world, change due to viewpoint of the scene. In this thesis, we investigate incorporating contextual information from three different perspectives for scene and video understanding (a) ``what'' contextual relations are useful and ``how'' they should be incorporated into Markov network during inference, (b) jointly solving the segmentation and recognition problem using a multiple segmentation framework based on contextual information in conjunction with appearance matching, and (c) proposing a discriminative spatio-temporal patch based representation for videos which incorporates contextual information for video understanding.
Our work departs from traditional view of incorporating context into scene understanding where a fixed model for context is learned. We argue that context is scene dependent and propose a data-driven approach to predict the importance of relationships and construct a Markov network for image analysis based on statistical models of global and local image features. Since all contextual information is not equally important, we also address the related problem of predicting the feature weights associated with each edge of a Markov network for evaluation of context. We then address the problem of fixed segmentation while modeling context by using a multiple segmentation framework and formulating the problem as ``a jigsaw puzzle''. We formulate the labeling problem as segment selection from a pool of segments (jigsaws), assigning each selected segment a class label. Previous multiple segmentation approaches used local appearance matching to select segments in a greedy manner. In contrast, our approach is based on a cost function that combines contextual information with appearance matching. A relaxed form of the cost function is minimized using an efficient quadratic programming solver.
Lastly, we propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the video. What define these spatiotemporal patches are their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos. We propose a cost function that incorporates co-occurrence statistics and temporal context along with appearance matching to select subset of these patches for label transfer. Furthermore, these patches can be used as a discriminative vocabulary for action classification
Massachusetts Domestic and Foreign Corporations Subject to an Excise: For the Use of Assessors (2004)
International audienc
Bbn Viser Trecvid 2011 Multimedia Event Detection System
We describe the Raytheon BBN (BBN) VISER system that is designed to detect events of interest in multimedia data. We also present a comprehensive analysis of the different modules of that system in the context of the MED 2011 task. The VISER system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual co-occurrence patterns in videos. For the low-level features, we rigorously analyzed several coding and pooling strategies, and also used state-of-the-art spatio-temporal pooling strategies to model relationships between different features. The system also uses high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Furthermore, the VISER system exploits multimodal information by analyzing available spoken and videotext content using BBN\u27s state-of-the-art Byblos automatic speech recognition (ASR) and video text recognition systems. These diverse streams of information are combined into a single, fixed dimensional vector for each video. We explored two different combination strategies: early fusion and late fusion. Early fusion was implemented through a fast kernel-based fusion framework and late fusion was performed using both Bayesian model combination (BAYCOM) as well as an innovative a weighted-average framework. Consistent with the previous MED\u2710 evaluation, low-level visual features exhibit strong performance and form the basis of our system. However, high-level information from speech, video-text, and object detection provide consistent and significant performance improvements. Overall, BBN\u27s VISER system exhibited the best performance among all the submitted systems with an average ANDC score of 0.46 across the 10 MED\u2711 test events when the threshold was optimized for the NDC score, and \u3c30% missed detection rate when the threshold was optimized to minimize missed detections at 6% false alarm rate. Description of Submitted Runs BBNVISER-LLFeat: Uses a combination of 6 high-performing, multimodal, and complementary low-level features, namely, appearance, color, motion based, MFCC, and audio energy. We combine these low-level features using an early fusion strategy. The threshold is estimated to minimize the NDC score. BBNVISER-Fusion1: Combines several sub-systems, each based on some combination of low-level features, ASR, video text OCR, and other high-level concepts using a late-fusion, Bayesian model combination strategy. The threshold is estimated to minimize the NDC score. BBNVISER-Fusion2: Combines same set of subsystems as BBNVISER-Fusion1. Instead of BAYCOM, it uses a novel weighted average fusion strategy. The fusion weights (for each sub-system) are estimated for each video automatically at runtime. BBNVISER-Fusion3: Combines all the sub-systems used in BBNVISER-Fusion3 with separate end-to-end systems from Columbia and UCF. In all, 18 sub-systems were combined using weighted average fusion. The threshold is estimated to minimize the probability of missed detection in the neighborhood of ALADDIN\u27s Year 1 false alarm rate ceiling