
244 research outputs found

Modeling and training options for handwritten Arabic text recognition

Author: Ahmad Irfan
Publication date: 01/01/2016

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung


A high level approach to Arabic sentence recognition

Author: Krayem AG
Publication date: 01/09/2013

The aim of this work is to develop a sentence recognition system inspired by the human reading process. Cognitive studies have observed that humans tend to read a word as a whole: the reader considers the global word shape and uses contextual knowledge to infer and discriminate a word from other possible words. The sentence recognition system is fully integrated: a word-level recogniser (the baseline system) combined with a linguistic-knowledge post-processing module. The baseline system is a holistic word-based recognition approach, framed as a probabilistic ranking task, whose output is multiple recognition hypotheses (an N-best word lattice). The basic unit is the word rather than the character; the system does not rely on any segmentation or require baseline detection. The linguistic knowledge used to re-rank the output of the baseline system is standard n-gram Statistical Language Models (SLMs); the candidates are re-ranked using the phrase perplexity score. The system is an OCR system based on HMMs built with the HTK Toolkit. The baseline system uses global transformation features extracted from binary word images. The feature extraction technique is the block-based Discrete Cosine Transform (DCT) applied to the whole word image, with feature vectors extracted from non-overlapping sub-blocks of 8x8 pixels. The HMMs applied to the task are mono-model discrete one-dimensional HMMs (Bakis model). A balanced database of actual scanned and synthetic word images has been constructed to ensure an even distribution of word samples. The Arabic words are typewritten in five fonts at a size of 14 points in a plain style. The statistical language models and lexicon words are extracted from the Holy Qur'an. The systems are applied to word images with no overlap between the training and testing datasets. The actual scanned database is used to evaluate the word recogniser. The synthetic database provides a large amount of data for reliable training of sentence recognition systems. The word recogniser is evaluated in mono-font and multi-font contexts. The two types of word recogniser achieve final recognition accuracies of 99.30% and 73.47% in mono-font and multi-font contexts, respectively. The average accuracy achieved by the sentence recogniser is 67.24%, improved to 78.35% on average when using 5-gram post-processing. The complexity and accuracy of the post-processing module were evaluated, and the 4-gram model was found to be more suitable than the 5-gram: it is much faster, at an average improvement of 76.89%.
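The re-ranking step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes whole-sentence hypotheses rather than an N-best word lattice, uses a bigram model instead of the 4-gram and 5-gram models reported, and applies add-one smoothing, which the abstract does not specify. The function names `train_bigram`, `perplexity`, and `rerank` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenised sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def perplexity(tokens, unigrams, bigrams):
    """Perplexity of one hypothesis under an add-one smoothed bigram model.

    Add-one smoothing is an assumption made for this sketch only.
    """
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    # Normalise by the number of predicted tokens (words plus end marker).
    return math.exp(-log_prob / (len(padded) - 1))


def rerank(hypotheses, unigrams, bigrams):
    """Sort recogniser hypotheses so the lowest-perplexity one comes first."""
    return sorted(hypotheses, key=lambda h: perplexity(h, unigrams, bigrams))


# Toy corpus and hypotheses, invented for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
unigrams, bigrams = train_bigram(corpus)
hypotheses = [["cat", "the", "sat"], ["the", "cat", "sat"]]
best = rerank(hypotheses, unigrams, bigrams)[0]
```

In this toy setting the hypothesis whose word order matches the training text receives the lower perplexity and is ranked first; a higher-order model follows the same pattern with longer contexts.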

Nottingham Trent Institutional Repository (IRep)

Using multiclass classification algorithms to improve text categorization tool: NLoN

Author: Xu J. (Jianwen)
Publication venue: University of Oulu
Publication date: 12/08/2021

Natural language processing (NLP) and machine learning techniques have been widely used in the mining software repositories (MSR) field in recent years. Separating natural language from source code is a pre-processing step needed in both the NLP and MSR domains for better data quality. This paper presents the design and implementation of a multi-class classification approach based on the existing open-source R package Natural Language or Not (NLoN). It also reviews the existing literature on MSR and NLP. The review classifies the information sources and approaches of MSR in detail, and also focuses on the text representation and classification tasks of NLP. In addition, the design and implementation methods of the original paper are briefly introduced. Regarding the research methodology, since the research goal is technology-oriented, i.e., to improve the design and implementation of existing technologies, this work adopts the design science research methodology and describes how the methodology was applied. This research implements an open-source Python library, NLoN-PY. The library is hosted on GitHub, and users can also directly use the tools published to PyPI. Since NLoN has achieved comparable performance on two-class classification tasks with the Lasso regression model, this study evaluated other multi-class classification algorithms, i.e., Naive Bayes, k-Nearest Neighbours, and Support Vector Machine. Using 10-fold cross-validation, the expanded classifier achieved an AUC of 0.901 for the 5-class classification task and an AUC of 0.92 for the 2-class task. Although the design of this study did not show a significant performance improvement compared to the original design, the impact of unbalanced data distribution on performance was detected, and the categories of the classification problem were refined in the process. These findings on the multi-class classification design can provide a research foundation or direction for future research.
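The 10-fold cross-validation protocol mentioned in this abstract can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the NLoN-PY code: folds are contiguous and unshuffled, the score is accuracy rather than the AUC the abstract reports, and the helper names `k_fold_indices`, `cross_val_accuracy`, and `majority_baseline` are hypothetical.

```python
from collections import Counter


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_val_accuracy(samples, labels, train_fn, k=10):
    """Average accuracy over k folds; train_fn returns a predict callable."""
    correct = 0
    for train, test in k_fold_indices(len(samples), k):
        predict = train_fn([samples[i] for i in train],
                           [labels[i] for i in train])
        correct += sum(predict(samples[i]) == labels[i] for i in test)
    return correct / len(samples)


def majority_baseline(train_samples, train_labels):
    """Trivial stand-in classifier: always predict the most frequent label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda sample: most_common


folds = list(k_fold_indices(23, k=10))
accuracy = cross_val_accuracy(list(range(20)), ["nl"] * 20, majority_baseline)
```

With imbalanced classes, which the abstract identifies as affecting performance, a stratified split that preserves label proportions in each fold would usually be preferable to the contiguous split used here.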

University of Oulu Repository - Jultika

Markov models for offline handwriting recognition: a survey

Author: A. Brakensiek
A. El-Yacoubi
A. Kundu
A. Vinciarelli
A. Viterbi
A.H.R. Ko
A.P. Dempster
A.W. Senior
E. Bocchieri
G.A. Fink
Gernot A. Fink
H. Bunke
H. Fujisawa
H. Xue
J. Cai
J. Coetzer
J.A. Pittman
L. Baum
L. Likforman-Sulem
L.M. Lorigo
M. Wienecke
N. Arica
O.D. Trier
P. Natarajan
P.D. Gader
R. Davis
R. Nopsuwanchai
R. Plamondon
R.M. Bozinovic
R.O. Duda
S. Günter
S. Madhvanath
S. Young
S.F. Chen
T. Steinherz
Thomas Plötz
U.V. Marti
W. Cho
X.D. Huang
Y. Li
Publication venue: Springer Science and Business Media LLC

Crossref

Machine learning for ancient languages: a survey

Author: Androutsopoulos Ion
Assael Yannis
Bodel John
Dyer Chris
Freitas Nando de
Pavlopoulos John
Prag Jonathan
Senior Andrew
Sommerschield Thea
Stefanak Vanessa
Publication venue: MIT Press
Publication date: 10/08/2023

Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the field of humanities, similarly to how microscopes and telescopes have contributed to the realm of science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, highlighting promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning

Oxford University Research Archive

Large Scale Subject Category Classification of Scholarly Papers with Deep Attentive Neural Networks

Author: Giles C Lee
Kandimalla Bharath
Rohatgi Shaurya
Wu Jian
Publication date: 27/07/2020

Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category information can be used for building faceted search for digital library search engines. This can significantly assist users in narrowing down their search space of relevant documents. Unfortunately, many academic papers do not have such information as part of their metadata. Existing methods for solving this task usually focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using 9 million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves micro-F1 measure of 0.76 with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TF-IDF outperforms character and sentence level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.

Comment: submitted to "Frontiers Mining Scientific Papers Volume II: Knowledge Discovery and Data Exploitation"
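The attention layer that follows the recurrent networks in this abstract can be illustrated with a minimal attention-pooling sketch. The exact formulation used in the paper is not given in the abstract, so the dot-product scoring against a single context vector below is an assumption, and the function name `attention_pool` is hypothetical. In a trained network the hidden states would come from the recurrent layers and the context vector would be learned.

```python
import math


def attention_pool(hidden_states, context):
    """Collapse a sequence of hidden vectors into one vector.

    Each hidden state is scored by its dot product with `context`,
    the scores are turned into weights with a softmax, and the
    result is the weighted sum of the hidden states.
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context))
              for h in hidden_states]
    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states))
              for d in range(dim)]
    return pooled, weights


# Two toy hidden states, invented for illustration.
states = [[1.0, 0.0], [0.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool(states, [0.0, 0.0])
focused_pooled, focused_weights = attention_pool(states, [10.0, 0.0])
```

A zero context vector yields uniform weights, so the pooled vector is the plain average of the states; a context vector aligned with the first state shifts nearly all the weight onto it.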

arXiv.org e-Print Archive

Old Dominion University

Development of Part-of-Arabic Word Corpus for Handwriting Text Recognition


KFUPM ePrints