Search CORE

1,526 research outputs found

Open Data Platform for Knowledge Access in Plant Health Domain : VESPA Mining

Author: Andro Mathieu
Corbière Roselyne
Phan Tien T.
Turenne Nicolas
Publication date: 01/01/2015

Important data are locked in ancient literature. It would be uneconomical to produce these data again today, or to extract them without the help of text mining technologies. Vespa is a text mining project whose aim is to extract data on pest and crop interactions, to model and predict attacks on crops, and to reduce the use of pesticides. A few previous attempts have proposed access to agricultural information. Another original aspect of our work is that documents are parsed in a way that depends on the document architecture.

arXiv.org e-Print Archive

CiteSeerX

ProdInra

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL-Rennes 1

HAL - UPEC / UPEM

A study of feature extraction for Arabic calligraphy characters recognition

Author: Chaker Ilham
Errebiai Chaimae
Zarghili Arsalane
Zoizou Abdelhay
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/02/2024

Optical character recognition (OCR) is a widely used pattern recognition system. However, research on ancient Arabic writing recognition has suffered from a lack of interest for decades, despite the availability of thousands of historical documents. One of the reasons for this lack of interest is the absence of a standard dataset, which is fundamental for building and evaluating an OCR system. In 2022, we published a database of ancient Arabic words as the only public dataset of characters written in Al-Mojawhar Moroccan calligraphy. Such a database therefore needs to be studied and evaluated. In this paper, we explored the proposed database and investigated the recognition of Al-Mojawhar Arabic characters. We studied feature extraction using the most popular descriptors in Arabic OCR. The studied descriptors were combined with different machine learning classifiers to build recognition models and verify their performance. To compare learned and handcrafted features on the proposed dataset, we proposed a deep convolutional neural network for character recognition. Given the complexity of the character shapes, the results obtained were very promising, especially with the convolutional neural network model, which gave the highest accuracy score.

Institute of Advanced Engineering and Science

Unravelling the voice of Willem Frederik Hermans: an oral history indexing case study

Author: Huijbregts Marijn
Jong Franciska de
Ordelman Roeland
Publication venue: University of Twente, Centre for Telematics and Information Technology (CTIT)
Publication date: 01/01/2009

University of Twente Research Information

Recognition of compound characters in Kannada language

Author: Rangarajan Lalitha
Tumkur Narasimhaiah Sridevi
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/01/2014

Recognition of degraded printed compound Kannada characters is a challenging research problem. It has been verified experimentally that noise removal is an essential preprocessing step. Two methods are proposed for the degraded Kannada character recognition problem. Method 1 uses the conventional histogram of oriented gradients (HOG) feature extraction for character recognition. The extracted features are transformed and reduced using principal component analysis (PCA), and classification is then performed. Various classifiers are tested. Simple compound character classification is satisfactory (more than 98% accuracy) with this method. However, the method does not perform well on the other two compound types. Method 2 is a deep convolutional neural network (CNN) model for classification. This outperforms HOG features and classification. The highest classification accuracy found is 98.8% for simple compound character classification. The performance of the deep CNN is far better for the other two compound types. The deep CNN also turns out to be better for pooled character classes.
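As a rough illustration of the HOG idea mentioned in this abstract, the sketch below computes a gradient-orientation histogram for a single image cell in plain Python. It is a simplified assumption-laden example, not the authors' pipeline: the function name `cell_hog`, the 9-bin unsigned layout, the central-difference gradients, and the per-cell L2 normalisation are all choices made here for illustration, and the PCA and classifier stages are omitted.

```python
import math

def cell_hog(cell, bins=9):
    """Orientation histogram for one cell of grayscale values (a list of rows).

    Hypothetical helper: gradients use central differences on interior pixels,
    orientations are unsigned (0 to 180 degrees), and each pixel votes into a
    single bin with its gradient magnitude (no interpolation between bins).
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    # L2-normalise so the descriptor does not depend on overall contrast.
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

# A vertical edge (dark left, bright right) puts all gradient energy in the
# bin that contains 0 degrees; a horizontal edge lands in the bin around 90.
vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
horizontal_edge = [[0] * 6 for _ in range(3)] + [[255] * 6 for _ in range(3)]
vertical_descriptor = cell_hog(vertical_edge)
horizontal_descriptor = cell_hog(horizontal_edge)
```

A full HOG descriptor would typically concatenate many such cell histograms with block-level normalisation before any dimensionality reduction or classification.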

University of Mysore - Digital Repository of Research, Innovation and Scholarship (ePrints@UoM)

ZENODO

Institute of Advanced Engineering and Science

Binarization of Ancient Document Images based on Multipeak Histogram Assumption

Author: Arnia Fitri
Munadi Khairul
Publication venue: Universitas Ahmad Dahlan
Publication date: 01/09/2017

In document binarization, text is segmented from the background. This is an important step, since the binarization outcome determines the success rate of optical character recognition (OCR). In ancient documents, which are commonly noisy, binarization becomes more difficult. The noise can reduce binarization performance, and thus the OCR rate. This paper proposes a new binarization approach based on the assumption that the histograms of noisy documents contain multiple peaks. The proposed method comprises three steps: histogram calculation, histogram smoothing, and the use of the histogram to track the first valley and determine the binarization threshold. In our simulations we used a set of ancient Jawi document images with natural noise. This set is composed of 24 document tiles containing two noise types: show-through and uneven background. To measure performance, we designed and implemented a point compilation scheme. On average, the proposed method performed better than the Otsu method, with a total point score of 7.5 for the former and 4.5 for the latter. Our results show that as long as the histogram fulfills the multipeak assumption, the proposed method can perform satisfactorily.
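The three steps named in this abstract (histogram calculation, smoothing, first-valley tracking) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the smoothing filter or the scan direction, so the moving-average window, the dark-to-bright scan, the midpoint rule for flat valleys, and all function names here are hypothetical.

```python
def gray_histogram(pixels, levels=256):
    """Step 1: count how many pixels fall on each gray level."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts

def smooth(hist, window=5):
    """Step 2: moving-average smoothing (assumed filter; window is a guess)."""
    half = window // 2
    out = []
    for i in range(len(hist)):
        lo, hi = max(0, i - half), min(len(hist), i + half + 1)
        out.append(sum(hist[lo:hi]) / (hi - lo))
    return out

def first_valley(hist):
    """Step 3: scan from dark to bright, pass the first peak, and return the
    middle of the first valley floor that follows it."""
    n, i = len(hist), 0
    while i + 1 < n and hist[i + 1] >= hist[i]:   # climb to the first peak
        i += 1
    while i + 1 < n and hist[i + 1] <= hist[i]:   # descend to the valley floor
        i += 1
    if i == n - 1:
        raise ValueError("no valley found: multipeak assumption not met")
    end = start = i
    while start > 0 and hist[start - 1] == hist[end]:  # flat floor, walk back
        start -= 1
    return (start + end) // 2

def binarize(pixels, threshold):
    """Map pixels at or below the threshold to text (0), others to 255."""
    return [0 if p <= threshold else 255 for p in pixels]

# Synthetic tile: dark text, mid-gray show-through, bright background.
pixels = [25] * 100 + [120] * 300 + [220] * 1000
threshold = first_valley(smooth(gray_histogram(pixels)))
result = binarize([25, 120, 220], threshold)
```

In this synthetic example the threshold falls between the text peak and the show-through peak, so show-through is classified as background. With a histogram that has only one peak, `first_valley` raises an error, mirroring the abstract's caveat that the method relies on the multipeak assumption.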

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System