Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts

Author: Majid Nishatul
Publication venue: 'IUScholarWorks'
Publication date: 01/08/2020

This dissertation presents a flexible and robust offline handwriting recognition system which is tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging unsolved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make Bangla a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably, but can also be used for almost any alphabetic writing system. The framework has been rigorously tested for Bangla, and experiments on the Korean script, whose two-dimensional arrangement of characters makes it a challenge to recognize, demonstrate how it can be adapted to other scripts. The base of this design is a character spotting network which detects the location of different script elements (such as characters and diacritics) in an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla, and it achieves a Character Recognition Accuracy (CRA) of 94.8%. This is also one of the most flexible architectures ever presented. Recognition of Korean was achieved with a 91.2% CRA. In addition, a powerful technique of autonomous tagging was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and the autonomous tagging brings the entire offline recognition problem very close to a singular solution. Additionally, a database named the Boise State Bangla Handwriting Dataset was developed. This is one of the richest offline datasets currently available for Bangla, and it has been made publicly accessible to accelerate research progress.
Many other tools were developed and experiments were conducted to validate this framework more rigorously by evaluating the method against external datasets (CMATERdb 1.1.1, Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology and the outcome of this research moves the field significantly ahead.
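
The step in which a transcript is formed from detected classes and their locations can be illustrated with a minimal Python sketch. The abstract does not give the dissertation's actual ordering rule, so everything here is an assumption made for illustration: the `Detection` type, the `form_transcript` function, the confidence threshold, and the simple left-to-right ordering by horizontal centre (which ignores the vertical placement of diacritics and the two-dimensional layout of Korean).

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One spotted script element (hypothetical structure, not from the source)."""
    label: str       # predicted class, e.g. a character or a diacritic
    x_center: float  # horizontal centre of the detected box, in pixels
    score: float     # detector confidence in [0, 1]


def form_transcript(detections: List[Detection], min_score: float = 0.5) -> str:
    """Build a transcript by reading confident detections left to right."""
    # Discard detections below the (assumed) confidence threshold.
    kept = [d for d in detections if d.score >= min_score]
    # Order the remaining elements by horizontal position.
    kept.sort(key=lambda d: d.x_center)
    # Concatenate the class labels in reading order.
    return "".join(d.label for d in kept)


word = [
    Detection("b", 30.0, 0.9),
    Detection("a", 10.0, 0.8),
    Detection("x", 20.0, 0.2),  # low confidence, dropped
]
print(form_transcript(word))
```

Because no lexicon is consulted, the output depends only on the detections themselves, which is the sense in which such a pipeline is lexicon-free.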

Character Recognition

Publication venue: 'IntechOpen'
Publication date: 20/04/2021

Character recognition is one of the most widely used pattern recognition technologies in practical applications. This book presents recent advances that are relevant to character recognition, from technical topics such as image processing, feature extraction or classification, to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.

FEATURE EXTRACTION AND CLASSIFICATION THROUGH ENTROPY MEASURES

Author: M. Aktaruzzaman
Publication venue: Università degli Studi di Milano
Publication date: 13/03/2015

Entropy is a universal concept that represents the uncertainty of a series of random events. The notion of “entropy” is understood differently in different disciplines. In physics, it represents the thermodynamical state variable; in statistics it measures the degree of disorder. On the other hand, in computer science, it is used as a powerful tool for measuring the regularity (or complexity) in signals or time series. In this work, we have studied entropy-based features in the context of signal processing. The purpose of feature extraction is to select the relevant features from an entity. The type of features depends on the signal characteristics and classification purpose. Many real-world signals are nonlinear and nonstationary, and they contain information that cannot be described by time and frequency domain parameters; instead, they might be described well by entropy. However, in practice, estimation of entropy suffers from some limitations and is highly dependent on series length. To reduce this dependence, we have proposed parametric estimation of various entropy indices and have derived analytical expressions (when possible) as well. Then we have studied the feasibility of parametric estimations of entropy measures on both synthetic and real signals. The entropy-based features have finally been employed for classification problems related to clinical applications, activity recognition, and handwritten character recognition. Thus, from a methodological point of view, our study deals with feature extraction, machine learning, and classification methods. Different versions of entropy measures are found in the literature for signal analysis. Among them, approximate entropy (ApEn) and sample entropy (SampEn), followed by corrected conditional entropy (CcEn), are the most used for physiological signal analysis. Recently, entropy features have also been used for image segmentation.
A related measure of entropy is Lempel-Ziv complexity (LZC), which measures the complexity of a time series, signal, or sequence. The estimation of LZC also relies on the series length. In particular, in this study, analytical expressions have been derived for ApEn, SampEn, and CcEn of auto-regressive (AR) models. It should be mentioned that AR models have been employed for maximum entropy spectral estimation for many years. The feasibility of parametric estimates of these entropy measures has been studied on both synthetic series and real data. In the feasibility study, the agreement between numerical estimates of entropy and estimates obtained through a certain number of realizations of the AR model using Monte Carlo simulations has been observed. This agreement or disagreement provides information about nonlinearity, nonstationarity, or non-Gaussianity present in the series. In some classification problems, the probability of agreement or disagreement has proved to be one of the most relevant features. After the feasibility study of the parametric entropy estimates, the entropy and related measures have been applied in heart rate and arterial blood pressure variability analysis. The use of entropy and related features has proved more relevant in developing sleep classification, handwritten character recognition, and physical activity recognition systems. The novel methods for feature extraction researched in this thesis give good classification or recognition accuracy, in many cases superior to the features reported in the literature of the application domains concerned, even at lower computational cost.
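
As an illustration of the kind of numerical entropy estimate the abstract refers to, here is a minimal Python sketch of sample entropy (SampEn) in its commonly used form, SampEn(m, r) = -ln(A/B), where B counts pairs of length-m templates within tolerance r (Chebyshev distance, self-matches excluded) and A counts the same for length m + 1. This is not the thesis's parametric estimator; the function name `sample_entropy`, the default values, and the treatment of `r` as an absolute tolerance (rather than a fraction of the standard deviation) are assumptions.

```python
import math
from typing import Sequence


def sample_entropy(series: Sequence[float], m: int = 2, r: float = 0.2) -> float:
    """Numerical SampEn(m, r) of a series; r is an absolute tolerance (assumed)."""
    n = len(series)

    def count_matches(length: int) -> int:
        # Use the first n - m starting points for both template lengths
        # so that the two counts are directly comparable.
        templates = [series[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):  # i < j excludes self-matches
                # Chebyshev distance: largest element-wise difference.
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    if a == 0 or b == 0:
        return float("inf")   # undefined in this case; reported as infinity here
    return -math.log(a / b)


# A perfectly regular series has no new information per sample.
print(sample_entropy([0.0, 1.0] * 10, m=2, r=0.1))
```

The pair counts shrink quickly for short series, which is one way to see the dependence on series length that the abstract mentions.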

Sinhala and Tamil : a case of contact-induced restructuring

Author: Thampoe Harold Dharmasenan
Publication venue: Newcastle University
Publication date: 01/01/2017

PhD Thesis
The dissertation presents a comparative synchronic study of the morphosyntactic features of modern spoken Sinhala and Tamil, the two main languages of Sri Lanka. The main motivation of the research is that Sinhala and Tamil, two languages of diverse origins, belonging to the New Indo-Aryan (NIA) and Dravidian families respectively, share a wide spectrum of morphosyntactic features. Sinhala has long been isolated from the other NIA languages and has co-existed with Tamil in Sri Lanka ever since both reached Sri Lanka from India. This coexistence, it is believed, led to what is known as the contact-induced restructuring that Sinhala morphosyntax has undergone on the model of Tamil, while retaining its NIA lexicon. Moreover, as languages of South Asia, the two languages share the areal features of this region. The research seeks to address the following questions: (i) What features do the two languages share and what features do they not share?; (ii) Are the features that they share areal features of the region or those diffused into one another owing to contact?; (iii) If the features that they share are due to contact, has diffusion taken place unidirectionally or bidirectionally?; and (iv) Does contact have any role to play with respect to features that they do not share? The claim that this research intends to substantiate is that Sinhala has undergone morphosyntactic restructuring on the model of Tamil. The research, therefore, attempts to answer another question: (v) Can the morphosyntactic restructuring that Sinhala has undergone be explained in syntactic terms? The morphosyntactic features of the two languages are analyzed at macro- and micro-levels.
At the macro-level, a wide range of morphosyntactic features of Tamil and Sinhala, and those of seven other languages of the region, are compared with a view to determining the origins of these features and showing the large-scale morphosyntactic convergence between Sinhala and Tamil and the divergence between Sinhala and other NIA languages. At the micro-level, the dissertation analyzes in detail two morphosyntactic phenomena, namely null arguments and focus constructions. It examines whether subject/verb agreement, which is different across the two languages, plays a role in the licensing of null arguments in each language. It also examines the nature of the changes Sinhala morphosyntax has undergone because of the two kinds of Tamil focus constructions that Sinhala has replicated. It is hoped that this dissertation will make a significant contribution to the knowledge and understanding of the morphosyntax of the two languages, the effects of language contact on morphosyntax, and more generally, the nature of linguistic variation.
Scholarship Programme of the Higher Education for the Twenty First Century (HETC) Project, Ministry of Higher Education, Sri Lanka

Newcastle University eTheses

North East Indian linguistics 6

Author: North East Indian Linguistics Society
Publication venue: Asia-Pacific Linguistics, School of Culture, History and Language, College of Asia and the Pacific, The Australian National University
Publication date: 10/12/2015

The papers for this volume were initially presented at the sixth and seventh meetings of the North East Indian Linguistics Society, held in Guwahati, India, in 2011 and 2012. As with previous conferences, these meetings were held at the Don Bosco Institute in Guwahati, Assam, and hosted in collaboration with Gauhati University. The present collection of papers is testament to the ongoing interest in North East India and the continued success and growth of the community of North East Indian linguists. As in previous volumes, all the papers here were reviewed by leading international specialists in the relevant subfields. This volume, in particular, highlights the recent research of many scholars from the region. Out of eleven contributions, eight are from North East Indian scholars themselves. This book therefore shines a bright light on the work being done by North East Indian linguists on the languages of their own region. The remaining contributions are authored by international scholars from Australia, Singapore, Germany/USA, and Nepal.

The Australian National University

Gurmukhi printing types: an historical analysis of British design, development, and distribution in the nineteenth and twentieth centuries

Author: Afshar Sahar
Publication date: 31/01/2023

This thesis focuses on the role of British entities involved in the founding and development of printing in the Gurmukhi script, from the inception of printing in this writing system with movable type in 1800, until the beginnings of the digital era in the twentieth century. It traces the material production of Gurmukhi printing types under the changing technologies during this time frame and considers the impacts of various technological limitations on the appearance of the script when printed. Furthermore, it identifies the intent and objectives of those producing founts in a script foreign to them, and considers their approaches for overcoming various cultural, social, and economic obstacles, to determine how successful they were in realising their aims for printing in this writing system. Finally, it presents a comparative analysis of the founts developed during this period to highlight key typographic developments in the printing of Gurmukhi by the individuals and companies under consideration, and determines significant design decisions that influenced and informed subsequent developments. The research draws on largely unexplored primary resources housed in various archives across Britain, which provide a window into the practices and networks of the British type founders under consideration, shedding light on the establishment, organisation, and development of these actors’ operations, the modus operandi, and the networks that enabled and sustained it. This work aims to document a substantial gap in the history of Gurmukhi typographic development and printing, and to serve as a contribution to the interrelated fields of typography, printing history, and culture alike.

A Microparametric Approach to the Head-Initial/Head-Final Parameter

Author: Cinque Guglielmo
Publication date: 01/01/2017

The fact that even the most rigid head-final and head-initial languages show inconsistencies and, more crucially, that the very languages which come closest to the ideal types (the “rigid” SOV and the VOS languages) are apparently a minority among the languages of the world, makes it plausible to explore the possibility of a microparametric approach for what is often taken to be one of the prototypical examples of a macroparameter, the ‘head-initial/head-final parameter’. From this perspective, the features responsible for the different types of movement of the constituents of the unique structure of Merge from which all canonical orders derive are determined by lexical specifications of different generality: from those present on a single lexical item, to those present on lexical items belonging to a specific subclass of a certain category, or to every subclass of a certain category, or to every subclass of two or more, or all, categories, (always) with certain exceptions.

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

Author: AI4Bharat; AK Raghavan; Chitale Pranjal A.; Dabre Raj; Doddapaneni Sumanth; Gala Jay; Gumma Varun; Khapra Mitesh M.; Kumar Aswanth; Kumar Pratyush; Kunchukuttan Anoop; Nawale Janki; Puduppully Ratish; Raghavan Vivek; Sujatha Anupama
Publication date: 17/06/2023

India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India and referred to as scheduled languages, are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2

arXiv.org e-Print Archive

International Journal of Interpreter Education, Volume 14, Issue 1

Publication venue: Clemson University Libraries
Publication date: 01/07/2022