A Framework for Devanagari Script-based Captcha
Human Interactive Proofs (HIPs) are automatic reverse Turing tests designed
to distinguish between various groups of users. A Completely Automated Public
Turing test to tell Computers and Humans Apart (CAPTCHA) is a HIP system that
distinguishes between humans and malicious computer programs. Many CAPTCHAs
have been proposed in the literature, including text-graphical-based,
audio-based, puzzle-based and mathematical-question-based schemes. The design
and implementation of CAPTCHAs fall in the realm of Artificial Intelligence. We
aim to utilize CAPTCHAs as a tool to improve the security of Internet-based
applications. In this paper we present a framework for a text-based CAPTCHA
based on Devanagari script, which can exploit the difference in reading
proficiency between humans and computer programs. We chose a Devanagari
script-based CAPTCHA because the script is used by a large number of Indian
languages, including Hindi, which is the third most spoken language. There is
potential for an exponential rise in the number of applications developed in
that script, and a Devanagari CAPTCHA would make it easy to secure such
Indian-language applications.
Comment: 10 pages, 8 Figures, CCSEA 2011 - First International Conference,
Chennai, July 15-17, 201
Adaptive Algorithms for Automated Processing of Document Images
Large-scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles or letters, and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections.
We first propose, implement and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance-transform-based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach lies in how it determines the best approximation to the clutter-content boundary in the presence of text-like structures.
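To illustrate the role a distance transform can play in such a technique, the following is a minimal sketch, not the method from the thesis. It assumes a binary image stored as a list of rows (1 = foreground), uses the city-block metric, treats everything outside the image as background, and uses a hypothetical thickness threshold to flag pixels that lie deeper inside a blob than a text stroke normally would. The names `distance_transform` and `thick_pixels` are illustrative only.

```python
def distance_transform(grid):
    """City-block distance from each foreground pixel to the nearest background pixel."""
    rows, cols = len(grid), len(grid[0])
    far = rows + cols  # larger than any distance that can occur in the image
    dist = [[far if px else 0 for px in row] for row in grid]
    # Forward pass: propagate distances from the top and left neighbours.
    for r in range(rows):
        for c in range(cols):
            if dist[r][c]:
                up = dist[r - 1][c] if r > 0 else 0
                left = dist[r][c - 1] if c > 0 else 0
                dist[r][c] = min(dist[r][c], up + 1, left + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if dist[r][c]:
                down = dist[r + 1][c] if r < rows - 1 else 0
                right = dist[r][c + 1] if c < cols - 1 else 0
                dist[r][c] = min(dist[r][c], down + 1, right + 1)
    return dist


def thick_pixels(dist, threshold):
    """Pixels at least `threshold` steps inside the foreground (assumed clutter cores)."""
    return {(r, c) for r, row in enumerate(dist)
            for c, d in enumerate(row) if d >= threshold}


# Toy page: a 1-pixel-wide stroke (text-like) and a 5x5 blob (clutter-like).
grid = [[0] * 12 for _ in range(7)]
for r in range(1, 6):
    grid[r][1] = 1
    for c in range(5, 10):
        grid[r][c] = 1

dist = distance_transform(grid)
cores = thick_pixels(dist, threshold=2)  # threshold is an assumed tuning value
```

In this toy page the thin stroke never reaches the threshold, so only the interior of the blob is flagged. A real system would still need the adaptive boundary estimation the abstract describes to decide where clutter ends and attached text begins.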
Second, we describe a page segmentation technique called Voronoi++ for complex layouts which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multi-lingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum [O'Gorman1993] based angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features.
Finally, our research proposes a generic model to segment and recognize characters for any complex syllabic or non-syllabic script, using font models. This concept is based on the fact that font files contain all the information necessary to render text and thus provide a model for how to decompose it. Instead of relying on script-specific routines, this work is a step towards a generic character segmentation and recognition scheme for both Latin and non-Latin scripts.
Arabic text recognition of printed manuscripts. Efficient recognition of off-line printed Arabic text using Hidden Markov Models, Bigram Statistical Language Model, and post-processing.
Arabic text recognition has not been researched as thoroughly as text recognition for other languages, yet the need for it is clear. In addition to traditional applications such as postal address reading, check verification in banks, and office automation, there is considerable interest in searching scanned documents that are available on the internet and in searching handwritten manuscripts. Other possible applications are building digital libraries, recognizing text on digitized maps, recognizing vehicle license plates, serving as the first phase in text readers for visually impaired people, and understanding filled forms.
This research work aims to contribute to the field of optical character recognition (OCR) of printed Arabic text by developing novel techniques and schemes that advance the performance of state-of-the-art Arabic OCR systems.
A statistical analysis of Arabic text was carried out to estimate the probabilities of occurrence of Arabic characters for use with Hidden Markov Models (HMM) and other techniques.
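As a rough illustration of how character occurrence statistics of this kind can be estimated, the sketch below computes maximum-likelihood character bigram probabilities from a word list. It is not the procedure used in the thesis: the three-word corpus, the absence of smoothing, and the function name `bigram_probabilities` are all assumptions made for the example.

```python
from collections import Counter

# Arabic letters written as escapes to avoid right-to-left rendering issues.
KAF, TEH, ALEF, BEH = "\u0643", "\u062A", "\u0627", "\u0628"


def bigram_probabilities(words):
    """Estimate P(next character | current character) by relative frequency."""
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        for current, following in zip(word, word[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy corpus (assumed): three short Arabic words built from four letters.
corpus = [KAF + TEH + BEH, KAF + TEH + ALEF + BEH, BEH + ALEF + BEH]
probs = bigram_probabilities(corpus)
```

With this corpus, kaf is always followed by teh, while teh is followed by beh or alef equally often. A practical language model would be trained on a large corpus and would smooth the estimates so that unseen pairs do not receive zero probability.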
Since no publicly available dataset of printed Arabic text exists for recognition purposes, one was created. In addition, a minimal Arabic script is proposed. The proposed script contains all basic shapes of Arabic letters and provides an efficient representation of Arabic text in terms of effort and time.
Based on the success of using HMM for speech and text recognition, the use of HMM for the automatic recognition of Arabic text was investigated. The HMM technique adapts to noise and font variations and does not require word or character segmentation of Arabic line images.
In the feature extraction phase, experiments were conducted with a number of different features to investigate their suitability for HMM. Finally, a novel set of features, which resulted in high recognition rates for different fonts, was selected.
The developed techniques do not need word or character segmentation before the classification phase as segmentation is a byproduct of recognition. This seems to be the most advantageous feature of using HMM for Arabic text as segmentation tends to produce errors which are usually propagated to the classification phase.
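The idea that segmentation falls out of recognition can be sketched with a toy Viterbi decoder. This is a simplified illustration under stated assumptions, not the thesis's system: it uses one hidden state per character label ("A" and "B"), two quantized frame features ("thin" and "wide"), and made-up probabilities. A real line recognizer would typically use multi-state character models, which are also needed to separate two adjacent occurrences of the same character.

```python
import math
from itertools import groupby


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence, computed in log space."""
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans_p[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def segments(path):
    """Collapse a frame-level path into (label, start_frame, end_frame_exclusive)."""
    out, start = [], 0
    for label, run in groupby(path):
        length = len(list(run))
        out.append((label, start, start + length))
        start += length
    return out


# Hypothetical model parameters, chosen only for the example.
states = ["A", "B"]
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.3, "B": 0.7}}
emit_p = {"A": {"thin": 0.9, "wide": 0.1}, "B": {"thin": 0.2, "wide": 0.8}}

frames = ["thin", "thin", "thin", "wide", "wide", "wide"]
path = viterbi(frames, states, start_p, trans_p, emit_p)
boundaries = segments(path)
```

The decoder labels every frame of the unsegmented sequence, and the points where the label changes give the character boundaries, so no separate segmentation step runs before classification.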
Eight different Arabic fonts were used in the classification phase. The recognition rates ranged from 98% to 99.9%, depending on the font used. As far as we know, these are new results in their context. Moreover, the proposed technique could be used for other languages. A proof-of-concept experiment was conducted on English characters with a recognition rate of 98.9% using the same HMM setup. The same techniques were applied to Bangla characters with a recognition rate above 95%.
Moreover, multi-font recognition of printed Arabic text was also carried out using the same technique. Fonts were categorized into different groups. New high recognition results were achieved.
To enhance the recognition rate further, a post-processing module was developed to correct the OCR output through character-level and word-level post-processing. The use of this module increased the recognition rate by more than 1%.
King Fahd University of Petroleum and Minerals (KFUPM)
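The abstract does not describe how the post-processing module works. One common form of word-level correction, shown here purely as an assumed sketch, replaces an out-of-lexicon OCR word with the closest lexicon entry by edit distance. The Latin toy lexicon, the `max_distance` cutoff, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def correct_word(word, lexicon, max_distance=1):
    """Return the nearest lexicon word if it is close enough, else the input word."""
    if word in lexicon:
        return word
    best = min(lexicon, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_distance else word


lexicon = ["arabic", "recognition", "text"]  # toy lexicon (assumed)
fixed = correct_word("recogmition", lexicon)
unchanged = correct_word("xyz", lexicon)
```

Here a single substituted character is repaired, while a word far from every lexicon entry is left alone rather than forced onto a wrong match.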
Radical Immersions: Navigating between virtual/physical environments and information bubbles - DRHA 2019 Conference Proceedings
This publication consists of the peer-reviewed papers and posters presented at the DRHA 2019 Conference "Radical Immersions: navigating between virtual / physical environments and information bubbles".
The conference was held at Watermans Arts Centre, London (8-10 September 2019).
For further information: http://www.2019.drha.uk
Smoking and Second Hand Smoking in Adolescents with Chronic Kidney Disease: A Report from the Chronic Kidney Disease in Children (CKiD) Cohort Study
The goal of this study was to determine the prevalence of smoking and second hand smoking [SHS] in adolescents with CKD and their relationship to baseline parameters at enrollment in CKiD, an observational cohort study of 600 children (aged 1-16 yrs) with Schwartz estimated GFR of 30-90 ml/min/1.73 m². 239 adolescents had self-report survey data on smoking and SHS exposure: 21 [9%] subjects had “ever” smoked a cigarette. Among them, 4 were current and 17 were former smokers. Hypertension was more prevalent in those who had “ever” smoked a cigarette (42%) compared to non-smokers (9%), p<0.01. Among 218 non-smokers, 130 (59%) were male and 142 (65%) were Caucasian; 60 (28%) reported SHS exposure compared to 158 (72%) with no exposure. Non-smoker adolescents with SHS exposure were compared to those without SHS exposure. There were no racial, age, or gender differences between the two groups. Baseline creatinine, diastolic hypertension, C reactive protein, lipid profile, GFR and hemoglobin were not statistically different. A significantly higher protein to creatinine ratio (0.90 vs. 0.53, p<0.01) was observed in those exposed to SHS compared to those not exposed. Exposed adolescents were heavier than non-exposed adolescents (85th percentile vs. 55th percentile for BMI, p<0.01). Uncontrolled casual systolic hypertension was twice as prevalent among those exposed to SHS (16%) compared to those not exposed to SHS (7%), though the difference was not statistically significant (p=0.07). Adjusted multivariate regression analysis [OR (95% CI)] showed that increased protein to creatinine ratio [1.34 (1.03, 1.75)] and higher BMI [1.14 (1.02, 1.29)] were independently associated with exposure to SHS among non-smoker adolescents. These results reveal that among adolescents with CKD, cigarette use is low and SHS is highly prevalent.
The association of smoking with hypertension and of SHS with increased proteinuria suggests a possible role of these factors in CKD progression and cardiovascular outcomes.