9 research outputs found

    Gurmukhi printing types: an historical analysis of British design, development, and distribution in the nineteenth and twentieth centuries

    This thesis focuses on the role of British entities involved in the founding and development of printing in the Gurmukhi script, from the inception of printing in this writing system with movable type in 1800 until the beginnings of the digital era in the twentieth century. It traces the material production of Gurmukhi printing types under the changing technologies of this time frame and considers the impact of various technological limitations on the appearance of the script when printed. Furthermore, it identifies the intent and objectives of those producing founts in a script foreign to them, and considers their approaches to overcoming various cultural, social, and economic obstacles, to determine how successful they were in realising their aims for printing in this writing system. Finally, it presents a comparative analysis of the founts developed during this period to highlight key typographic developments in the printing of Gurmukhi by the individuals and companies under consideration, and identifies significant design decisions that influenced and informed subsequent developments. The research draws on largely unexplored primary resources housed in various archives across Britain that provide a window into the practices and networks of the British type founders under consideration, shedding light on the establishment, organisation, and development of these actors’ operations, their modus operandi, and the networks that enabled and sustained them. This work aims to fill a substantial gap in the history of Gurmukhi typographic development and printing, and to serve as a contribution to the interrelated fields of typography, printing history, and culture alike.

    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    India has a rich linguistic landscape, with languages from four major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given this linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmark covering all these languages and containing content relevant to India, and (iii) no existing translation model supporting all 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required to enable wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpus for Indic languages. BPCC contains a total of 230M bitext pairs, of which 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2.

    Methodology and algorithms for Urdu language processing in a conversational agent

    This thesis presents the research and development of a novel text-based, goal-oriented conversational agent (CA) for the Urdu language called UMAIR (Urdu Machine for Artificially Intelligent Recourse). A CA is a computer program that emulates a human in order to facilitate a conversation with the user. The aim is to investigate the Urdu language and its lexical and grammatical features in order to design a novel engine that handles the unique features of Urdu. The weakness of current CA engines is that they are not suited to implementation in other languages whose grammar rules and structure differ entirely from English. Historically, CAs, including the design of scripting engines, scripting methodologies, resources, and implementation procedures, have been developed for the most part in English and other Western languages (e.g. German and Spanish). The development of an Urdu conversational agent has therefore required the research and development of a new CA framework incorporating methodologies and components to overcome the unique features of Urdu, such as free word order, inconsistent use of space, diacritical marks, and spelling. The new CA framework was utilised to implement UMAIR. UMAIR is a customer service agent for the National Database and Registration Authority (NADRA), designed to answer user queries related to ID card and passport applications. UMAIR answers queries in this domain through discourse with the user, leading the conversation with questions and offering appropriate advice, with the intention of steering the discourse towards a pre-determined goal. The research and development of UMAIR led to the creation of several novel CA components, namely a new rule-based Urdu CA engine that combines pattern matching and sentence/string similarity techniques, along with new algorithms to process user utterances.
Furthermore, a CA evaluation framework has been researched and tested, addressing the gap in research on evaluating natural language systems in general. Empirical end-user evaluation has validated the new algorithms and components implemented in UMAIR. The results show that UMAIR is effective as an Urdu CA, with the majority of conversations reaching the conversational goal. Moreover, the results also revealed that the components of the framework work well to mitigate the challenges of free word order and inconsistent word segmentation.
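The combination of exact pattern matching with a string-similarity fallback can be illustrated with a minimal sketch. This is not the thesis's actual engine: the rules, the romanised-Urdu patterns, the similarity measure, and the threshold are all invented for this example.

```python
from difflib import SequenceMatcher

# Illustrative only: a toy rule base mapping scripted utterance patterns
# to canned responses, in the spirit of the engine described above.
RULES = {
    "shanakhti card ki fees kya hai": "ID-card fee response",
    "passport ki darkhwast kaise dain": "passport application response",
}

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] based on longest matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()

def respond(utterance: str, threshold: float = 0.6) -> str:
    utterance = utterance.strip().lower()
    if utterance in RULES:                       # exact pattern match
        return RULES[utterance]
    # fall back to the most similar scripted pattern, which tolerates
    # inconsistent spelling of the kind the thesis highlights
    best_pattern, best_score = max(
        ((p, similarity(utterance, p)) for p in RULES),
        key=lambda pair: pair[1],
    )
    if best_score >= threshold:
        return RULES[best_pattern]
    return "clarification question"              # steer dialogue to the goal

print(respond("shanakhti card ki fees kya he"))  # near-miss spelling still matches
```

A real engine would, of course, normalise segmentation and diacritics before matching; the fallback here only shows why similarity scoring helps when exact patterns fail.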

    Minimally-supervised Methods for Arabic Named Entity Recognition

    Named Entity Recognition (NER) has attracted much attention over the past twenty years as a main task of Information Extraction. The current dominant techniques for addressing NER are supervised methods that can achieve high performance but require new manually annotated data for every new domain and/or genre. Our work focuses on approaches that make it possible to tackle new domains with minimal human intervention in order to identify Named Entities (NEs) in Arabic text. Specifically, we investigate two minimally-supervised methods: semi-supervised learning and distant learning. Our semi-supervised algorithm for identifying NEs does not require annotated training data or gazetteers. It only requires, for each NE type, a seed list of a few instances to initiate the learning process. Novel aspects of our algorithm include (i) a new way to produce and generalise the extraction patterns, (ii) a new filtering criterion to remove noisy patterns, and (iii) a comparison of two ranking measures for determining the most reliable candidate NEs. Next, we present our methodology for exploiting Wikipedia's structure to automatically develop an Arabic NE-annotated corpus. A novel mechanism is introduced, based on the high coverage of Wikipedia, to address two challenges particular to tagging NEs in Arabic text: rich morphology and the absence of capitalisation. Neither technique has yet achieved performance levels comparable to those of supervised methods: semi-supervised algorithms tend to have high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. Therefore, we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We used a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis.
According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best minimally-supervised classifier.
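The seed-based bootstrapping loop can be sketched in miniature: seed entities yield context patterns, and reliable patterns yield new candidate entities. This is a deliberately simplified illustration, not the thesis's pattern generalisation or filtering; the toy corpus, window, and reliability cut-off are invented, and it uses an English capitalisation cue precisely because Arabic lacks one (which, as noted above, is why the thesis turns to Wikipedia structure instead).

```python
from collections import Counter

# Toy corpus and PERSON seeds, invented for this sketch.
CORPUS = [
    "president Obama visited Cairo last week",
    "president Macron met reporters in Paris",
    "the city of Cairo hosted the summit",
    "president Merkel spoke in Berlin today",
]
SEEDS = {"Obama", "Macron"}

def extract_patterns(corpus, seeds):
    """Collect left-context words that immediately precede a seed entity."""
    patterns = Counter()
    for sent in corpus:
        tokens = sent.split()
        for i, tok in enumerate(tokens[1:], start=1):
            if tok in seeds:
                patterns[tokens[i - 1]] += 1
    return patterns

def apply_patterns(corpus, patterns, min_count=2):
    """Use patterns seen >= min_count times to harvest new candidates."""
    reliable = {p for p, c in patterns.items() if c >= min_count}
    candidates = set()
    for sent in corpus:
        tokens = sent.split()
        for i, tok in enumerate(tokens[1:], start=1):
            # isupper-initial is an English-only cue, used here for readability
            if tokens[i - 1] in reliable and tok[0].isupper():
                candidates.add(tok)
    return candidates

patterns = extract_patterns(CORPUS, SEEDS)       # e.g. {'president': 2}
new_entities = apply_patterns(CORPUS, patterns)  # now also contains 'Merkel'
```

Iterating these two steps, with noise filtering between rounds, is what gives semi-supervised NER its characteristic high-precision, low-recall behaviour described above.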

    Digital writing technologies in higher education : theory, research, and practice

    This open access book serves as a comprehensive guide to digital writing technology, featuring contributions from over 20 renowned researchers from various disciplines around the world. The book is designed to provide a state-of-the-art synthesis of the developments in digital writing in higher education, making it an essential resource for anyone interested in this rapidly evolving field. In the first part of the book, the authors offer an overview of the impact that digitalization has had on writing, covering more than 25 key technological innovations and their implications for writing practices and pedagogical uses. Drawing on these chapters, the second part of the book explores the theoretical underpinnings of digital writing technology such as writing and learning, writing quality, formulation support, writing and thinking, and writing processes. The authors provide insightful analysis of the impact of these developments and offer valuable insights into the future of writing. Overall, this book provides a cohesive and consistent theoretical view of the new realities of digital writing, complementing existing literature on the digitalization of writing. It is an essential resource for scholars, educators, and practitioners interested in the intersection of technology and writing.

    Towards a Generic Framework for the Development of Unicode Based Digital Sindhi Dictionaries

    Dictionaries are the essence of any language, providing a vital linguistic resource for language learners, researchers, and scholars. This paper focuses on the methodology and techniques used in developing the software architecture for UBSESD (a Unicode-Based Sindhi-to-English and English-to-Sindhi Dictionary). The proposed system provides an accurate solution for the construction and representation of Unicode-based Sindhi characters in a dictionary, implementing a hash structure algorithm and a custom Java object as its internal data structure, saved in a file. The system provides facilities for the insertion, deletion, and editing of Sindhi records. Through this framework, any type of Sindhi-to-English and English-to-Sindhi dictionary (belonging to different domains of knowledge, e.g. engineering, medicine, computing, or biology) can be developed easily, with accurate representation of Unicode characters in a font-independent manner.
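The core design, a hash structure over Unicode headwords persisted to a file, with insert, edit, and delete operations, can be sketched as follows. The paper implements this in Java; this Python sketch only mirrors the idea, and the sample entry, file name, and method names are invented.

```python
import json
from pathlib import Path

class SindhiDictionary:
    """Minimal file-backed, hash-based bilingual dictionary sketch."""

    def __init__(self, path="sindhi_dict.json"):
        self.path = Path(path)
        # hash structure: Sindhi headword (Unicode) -> English gloss;
        # storing code points, not glyphs, is what makes it font-independent
        self.entries = (
            json.loads(self.path.read_text(encoding="utf-8"))
            if self.path.exists() else {}
        )

    def insert(self, sindhi: str, english: str):
        self.entries[sindhi] = english

    def edit(self, sindhi: str, english: str):
        if sindhi not in self.entries:
            raise KeyError(sindhi)
        self.entries[sindhi] = english

    def delete(self, sindhi: str):
        self.entries.pop(sindhi, None)

    def lookup(self, sindhi: str):
        return self.entries.get(sindhi)

    def reverse_lookup(self, english: str):
        # English -> Sindhi direction, derived from the same hash structure
        return [s for s, e in self.entries.items() if e == english]

    def save(self):
        self.path.write_text(
            json.dumps(self.entries, ensure_ascii=False, indent=2),
            encoding="utf-8",
        )

d = SindhiDictionary()
d.insert("پاڻي", "water")   # Sindhi headword stored as Unicode code points
```

Because lookup hashes the Unicode string itself, average-case access is O(1) regardless of which font later renders the headword, which is the property the paper emphasises.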

    Unsupervised learning for text-to-speech synthesis

    This thesis introduces a general method for incorporating the distributional analysis of textual and linguistic objects into text-to-speech (TTS) conversion systems. Conventional TTS conversion uses intermediate layers of representation to bridge the gap between text and speech. Collecting the annotated data needed to produce these intermediate layers is a far from trivial task, possibly prohibitively so for languages in which no such resources exist. Distributional analysis, in contrast, proceeds in an unsupervised manner, and so enables the creation of systems using textual data that are not annotated. The method therefore aids the building of systems for languages in which conventional linguistic resources are scarce, but is not restricted to these languages. The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects. This space is then partitioned during the training of acoustic models for synthesis, so that the models generalise over objects' surface forms in a way that is acoustically relevant. The method is applied to three levels of textual analysis: to the characterisation of sub-syllabic units, word units, and utterances. Entire systems for three languages (English, Finnish, and Romanian) are built with no reliance on manually labelled data or language-specific expertise. Results of a subjective evaluation are presented.
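The pipeline of distributional analysis into a continuous space followed by partitioning can be sketched at toy scale: count co-occurrences over raw text, reduce them with SVD to continuous-valued points, then split that space. This is only an illustration of the general idea, not the thesis's method; the corpus, dimensionality, and the median split (standing in for the acoustically driven splits learned during acoustic-model training) are invented.

```python
import numpy as np

# Toy corpus, invented for this sketch; a real system would analyse
# characters, words, and utterances over much larger unannotated text.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# symmetric neighbour co-occurrence counts (window of 1)
C = np.zeros((len(vocab), len(vocab)))
for a, b in zip(corpus, corpus[1:]):
    C[idx[a], idx[b]] += 1
    C[idx[b], idx[a]] += 1

# continuous-valued representation: scale the top-2 left singular vectors
U, S, _ = np.linalg.svd(C)
embeddings = U[:, :2] * S[:2]        # each word -> a point in R^2

# crude partition of the continuous space; in the thesis the partitioning
# is driven by what is acoustically relevant, not by a fixed median
split = np.median(embeddings[:, 0])
cluster = {w: int(embeddings[idx[w], 0] > split) for w in vocab}
```

The key contrast with hand-built intermediate layers is that no label ever enters this pipeline: the space comes from text statistics alone, and only the later partitioning consults the acoustics.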