1,526 research outputs found

    Open Data Platform for Knowledge Access in Plant Health Domain : VESPA Mining

    Important data are locked in ancient literature. It would be uneconomic to produce these data again today, or to extract them without the help of text mining technologies. Vespa is a text mining project that aims to extract data on pest and crop interactions, to model and predict attacks on crops, and to reduce the use of pesticides. A few earlier attempts have proposed access to agricultural information. Another original aspect of our work is that documents are parsed in a way that depends on the document architecture

    A study of feature extraction for Arabic calligraphy characters recognition

    Optical character recognition (OCR) is one of the most widely used pattern recognition systems. However, research on ancient Arabic writing recognition has suffered from a lack of interest for decades, despite the availability of thousands of historical documents. One reason for this lack of interest is the absence of a standard dataset, which is fundamental for building and evaluating an OCR system. In 2022, we published a database of ancient Arabic words as the only public dataset of characters written in Al-Mojawhar Moroccan calligraphy. Such a database therefore needs to be studied and evaluated. In this paper, we explored the proposed database and investigated the recognition of Al-Mojawhar Arabic characters. We studied feature extraction using the most popular descriptors in Arabic OCR. The studied descriptors were combined with different machine learning classifiers to build recognition models and verify their performance. To compare learned and handcrafted features on the proposed dataset, we proposed a deep convolutional neural network for character recognition. Given the complexity of the character shapes, the results obtained were very promising, especially with the convolutional neural network model, which gave the highest accuracy score

    Unravelling the voice of Willem Frederik Hermans: an oral history indexing case study

    Recognition of compound characters in Kannada language

    Recognition of degraded printed compound Kannada characters is a challenging research problem. It has been verified experimentally that noise removal is an essential preprocessing step. Two methods are proposed for the degraded Kannada character recognition problem. Method 1 uses the conventional histogram of oriented gradients (HOG) feature extraction for character recognition. The extracted features are transformed and reduced using principal component analysis (PCA), and classification is then performed. Various classifiers are tried. Simple compound character classification is satisfactory (more than 98% accuracy) with this method. However, the method does not perform well on the other two compound types. Method 2 is a deep convolutional neural network (CNN) model for classification. This outperforms HOG features and classification. The highest classification accuracy found is 98.8%, for simple compound character classification. The performance of the deep CNN is far better for the other two compound types. The deep CNN also turns out to be better for pooled character classes

    Binarization of Ancient Document Images based on Multipeak Histogram Assumption

    In document binarization, text is segmented from the background. This is an important step, since the binarization outcome determines the success rate of optical character recognition (OCR). In ancient documents, which are commonly noisy, binarization becomes more difficult. The noise can reduce binarization performance, and thus the OCR rate. This paper proposes a new binarization approach based on the assumption that the histograms of noisy documents contain multiple peaks. The proposed method comprises three steps: histogram calculation, histogram smoothing, and the use of the histogram to track the first valley and determine the binarization threshold. In our simulations we used a set of Jawi ancient document images with natural noise. This set is composed of 24 document tiles containing two noise types: show-through and uneven background. To measure performance, we designed and implemented a point compilation scheme. On average, the proposed method performed better than the Otsu method, with a total point score of 7.5 for the former and 4.5 for the latter. Our results show that as long as the histogram fulfills the multipeak assumption, the proposed method can perform satisfactorily.
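
    The three steps named in this abstract (histogram calculation, histogram smoothing, first-valley tracking) can be illustrated with a minimal sketch. This is not the authors' implementation: the moving-average smoothing, the window radius, the scan direction (from the dark end, assuming text is darker than the background), and all function names are assumptions made for illustration only.

```python
def histogram(pixels, levels=256):
    """Count how many pixels fall on each grey level (0 = black)."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts


def smooth(hist, radius=2):
    """Moving-average smoothing; the window is clipped at both ends.

    The smoothing filter and radius are assumptions, not taken from the paper.
    """
    n = len(hist)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(hist[lo:hi]) / (hi - lo))
    return out


def first_valley(hist):
    """Scan from the dark end: climb to the first peak, then descend.

    Returns the last index of the descent, i.e. the point just before the
    histogram starts rising towards the next peak. A flat floor is walked
    through to its far end.
    """
    i, n = 0, len(hist)
    while i + 1 < n and hist[i + 1] >= hist[i]:   # climb to the first peak
        i += 1
    while i + 1 < n and hist[i + 1] <= hist[i]:   # descend into the valley
        i += 1
    return i


def binarize(pixels, threshold):
    """Mark pixels at or below the threshold as text (1), the rest as background (0)."""
    return [1 if p <= threshold else 0 for p in pixels]


# Synthetic three-peak example: dark text, mid-grey show-through, light background.
sample = ([40] * 50 + [39] * 30 + [41] * 30
          + [120] * 40 + [119] * 20 + [121] * 20
          + [200] * 100 + [199] * 60 + [201] * 60)
threshold = first_valley(smooth(histogram(sample)))
mask = binarize(sample, threshold)
```

    In this toy input the threshold lands between the text peak and the show-through peak, so show-through pixels are treated as background. A single global Otsu threshold is the baseline the abstract compares against; it is not implemented here.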