
    Optimizing digital archiving: An artificial intelligence approach for OCR error correction

    Get PDF
    Project Work presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. This thesis addresses the knowledge gap around effective ways to correct OCR errors, and the importance of training datasets of adequate size and quality for efficient OCR recognition of digital documents. The main goal is to examine the trade-offs among three dimensions of sourcing data: input size, performance, and time efficiency, and to propose a new design that includes a machine translation model to automate the correction of errors caused by OCR scanning. The study implemented various LSTM models, with different thresholds, to recover errors generated by OCR systems. Although the results did not surpass the performance of existing OCR systems, owing to dataset size limitations, a step forward was achieved: a relationship between performance and input size was established, providing meaningful insights for the optimisation of future digital archiving systems. This dissertation introduces a new approach to OCR problems and their implementation considerations that can be followed to optimise the efficiency and results of digital archive systems.
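The thesis's LSTM correction models are not reproduced in the abstract; as a much simpler illustration of the general idea of post-OCR error correction, a dictionary-based baseline replaces each OCR token with its nearest vocabulary word by edit distance (all names and data below are hypothetical):

```python
# Minimal dictionary-based OCR post-correction sketch (an illustrative
# baseline, not the thesis's LSTM approach). Each OCR token is replaced
# by the closest word in a known vocabulary, measured by edit distance.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(token: str, vocabulary: list, max_dist: int = 2) -> str:
    """Return the nearest vocabulary word within max_dist, else the token unchanged."""
    best, best_d = token, max_dist + 1
    for word in vocabulary:
        d = edit_distance(token, word)
        if d < best_d:
            best, best_d = word, d
    return best

vocab = ["digital", "archive", "optimisation"]
print(correct("digita1", vocab))  # → digital
```

A sequence-to-sequence model generalises this idea by learning the correction mapping from data instead of relying on a fixed vocabulary and distance threshold.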

    Embedding Based Link Prediction for Knowledge Graph Completion

    Get PDF
    Knowledge Graphs (KGs) are the most widely used representation of structured information about a particular domain, consisting of billions of facts in the form of entities (nodes) and relations (edges) between them. KGs also encapsulate the semantic type information of the entities. The last two decades have witnessed constant growth of KGs in various domains such as government, scholarly data, biomedical domains, etc. KGs have been used in machine-learning-based applications such as entity linking, question answering, recommender systems, etc. Open KGs are mostly heuristically created, automatically generated from heterogeneous resources such as text, images, etc., or human-curated. However, these KGs are often incomplete, i.e., there are missing links between the entities and missing links between the entities and their corresponding entity types. This thesis addresses these two challenges of link prediction for Knowledge Graph Completion (KGC): (i) general link prediction in KGs, which includes head and tail prediction and triple classification, and (ii) entity type prediction. Most graph mining algorithms are of provably high complexity, deterring their usage in KG-based applications. In recent years, KG embeddings have been trained to represent the entities and relations in the KG in a low-dimensional vector space that preserves the graph structure. In most published works, such as the translational models, convolutional models, semantic matching models, etc., the triple information is used to generate the latent representation of the entities and relations. In this dissertation, it is argued that contextual information about the entities, obtained from random walks and from textual entity descriptions, is the key to improving the latent representation of the entities for KGC. The experimental results show that the knowledge obtained from the context of the entities supports this hypothesis.
Several methods have been proposed for KGC, and their effectiveness is shown empirically in this thesis. Firstly, a novel multi-hop attentive KG embedding model, MADLINK, is proposed for link prediction. It considers the contextual information of the entities by using random walks as well as textual entity descriptions. Secondly, a novel architecture exploiting the information contained in a pre-trained contextual Neural Language Model (NLM) is proposed for triple classification. Thirdly, the limitations of the current state-of-the-art (SoTA) entity type prediction models are analysed, and a novel entity typing model, CAT2Type, is proposed that exploits Wikipedia categories, one of the most under-exploited features of KGs. This model can also be used to predict the missing types of unseen entities, i.e., newly added entities in the KG. Finally, another novel architecture, GRAND, is proposed to predict missing entity types in KGs using multi-label, multi-class, and hierarchical classification, leveraging different strategic graph walks in the KGs. Extensive experiments and ablation studies show that all the proposed models outperform the current SoTA models and set new baselines for KGC. The proposed models establish that NLMs and the contextual information of the entities in the KGs, together with the different neural network architectures, benefit KGC. The promising results and observations open up interesting avenues for future research involving the application of the proposed models to domain-specific KGs such as scholarly data and biomedical data. Furthermore, the link prediction model can be exploited as a base model for the entity alignment task, as it considers the neighbourhood information of the entities.
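The translational embedding models mentioned above can be illustrated with a minimal TransE-style scorer (a generic sketch of the family of models, not MADLINK itself; the toy embeddings are hypothetical): a triple (head, relation, tail) is plausible when the vector head + relation lies close to tail.

```python
# Minimal TransE-style triple scoring, a generic illustration of
# translational KG embeddings (not the MADLINK model from the thesis).
# A triple (head, relation, tail) scores well when head + relation ≈ tail.

def score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail; higher is more plausible."""
    return -sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)) ** 0.5

# Toy 3-dimensional embeddings (hypothetical values for illustration).
entities = {
    "Berlin":  [0.9, 0.1, 0.0],
    "Germany": [1.0, 0.1, 0.5],
    "Paris":   [0.0, 0.8, 0.1],
}
relations = {"capital_of": [0.1, 0.0, 0.5]}

# "Berlin capital_of Germany" should outscore "Paris capital_of Germany".
s_true = score(entities["Berlin"], relations["capital_of"], entities["Germany"])
s_false = score(entities["Paris"], relations["capital_of"], entities["Germany"])
print(s_true > s_false)  # → True
```

In training, such embeddings are learned by ranking observed triples above corrupted ones; link prediction then ranks candidate tails (or heads) by this score.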

    Scallop: A Language for Neurosymbolic Programming

    Full text link
    We present Scallop, a language which combines the benefits of deep learning and logical reasoning. Scallop enables users to write a wide range of neurosymbolic applications and to train them in a data- and compute-efficient manner. It achieves these goals through three key features: 1) a flexible symbolic representation based on the relational data model; 2) a declarative logic programming language based on Datalog that supports recursion, aggregation, and negation; and 3) a framework for automatic and efficient differentiable reasoning based on the theory of provenance semirings. We evaluate Scallop on a suite of eight neurosymbolic applications from the literature. Our evaluation demonstrates that Scallop is capable of expressing algorithmic reasoning in diverse and challenging AI tasks, provides a succinct interface for machine learning programmers to integrate logical domain knowledge, and yields solutions that are comparable or superior to state-of-the-art models in terms of accuracy. Furthermore, Scallop's solutions outperform these models in aspects such as runtime and data efficiency, interpretability, and generalizability.
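The recursive Datalog reasoning Scallop supports can be sketched as a naive bottom-up fixpoint computation, shown here for transitive closure in plain Python (a toy evaluator that omits Scallop's provenance semirings and differentiability entirely):

```python
# Naive bottom-up evaluation of a recursive Datalog rule, as a toy
# illustration of the relational reasoning Scallop supports. This sketch
# omits Scallop's provenance semirings and differentiable machinery.
#
# Rule: path(X, Y) :- edge(X, Y).
#       path(X, Z) :- path(X, Y), edge(Y, Z).

def transitive_closure(edges):
    """Apply the rules repeatedly until a fixpoint: the fact set stops growing."""
    path = set(edges)
    while True:
        new = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if new <= path:
            return path
        path |= new

edges = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(edges)))
# → [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

In a differentiable setting, each derived fact would additionally carry a provenance value (e.g. a probability) combined across derivations by semiring operations, which is what makes end-to-end training possible.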

    Automated retrieval and analysis of published biomedical literature through natural language processing for clinical applications

    Get PDF
    The size of the existing academic literature corpus and the incredible rate of new publication present both a great need and a great opportunity to harness computational approaches to data and knowledge extraction across all research fields. Elements of this challenge can be met by developments in automated retrieval of electronic documents, document classification, and knowledge extraction. In this thesis, I detail studies of these processes in three related chapters. Although the focus of each chapter is distinct, they contribute to my aim of developing a generalisable pipeline for clinical applications of Natural Language Processing to the academic literature. In chapter one, I describe the development of “Cadmus”, an open-source system developed in Python to generate corpora of biomedical text from the published literature. Cadmus comprises three main steps: search query and metadata collection, document retrieval, and parsing of the retrieved text. I present an example of full-text retrieval for a corpus of over two hundred thousand articles using a gene-based search query, with quality-control metrics for the retrieval process and a high-level illustration of the utility of full text over metadata for each article. For a corpus of 204,043 articles, the retrieval rate was 85.2% with institutional subscription access and 54.4% without. Chapter two details the development of a custom-built Naïve Bayes supervised machine learning document classifier. This binary classifier is based on calculating the relative enrichment of biomedical terms between two classes of documents in a training set. The classifier is trained and tested on a manually classified set of over 8,000 abstract and full-text articles to identify articles containing human phenotype descriptions. 10-fold cross-validation of the model showed a recall of 85%, specificity of 99%, precision of 0.76, an F1 score of 0.82, and accuracy of 90%.
Chapter three illustrates the clinical applications of automated retrieval, processing, and classification by considering the published literature on paediatric COVID-19. Case reports and similar articles were classified into “severe” and “non-severe” classes, and term enrichment was evaluated to find biomarkers associated with, or predictive of, severe paediatric COVID-19. Time-series analysis was employed to illustrate emerging disease entities, such as the Multisystem Inflammatory Syndrome in Children (MIS-C), and to consider unrecognised trends through literature-based discovery.
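The classifier described in chapter two is not reproduced here, but its core idea, scoring documents by the relative enrichment of terms between two training classes, can be sketched as follows (a minimal illustration with hypothetical toy documents, not the thesis's actual implementation):

```python
# Minimal Naive Bayes-style binary document classifier based on the
# relative enrichment of terms between two training classes (a sketch of
# the approach described in chapter two, using hypothetical toy data).
import math
from collections import Counter

def train(docs_a, docs_b):
    """Per-term log-likelihood ratios log P(term|A) - log P(term|B),
    with add-one smoothing over the shared vocabulary."""
    counts_a = Counter(t for d in docs_a for t in d.split())
    counts_b = Counter(t for d in docs_b for t in d.split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    return {t: math.log((counts_a[t] + 1) / total_a)
             - math.log((counts_b[t] + 1) / total_b) for t in vocab}

def classify(doc, weights):
    """Positive summed score → class A, otherwise class B."""
    s = sum(weights.get(t, 0.0) for t in doc.split())
    return "A" if s > 0 else "B"

phenotype_docs = ["patient presented with seizures and microcephaly",
                  "proband showed short stature and seizures"]
other_docs = ["protein structure was resolved by crystallography",
              "the enzyme kinetics were measured in vitro"]

weights = train(phenotype_docs, other_docs)
print(classify("child with seizures and short stature", weights))  # → A
```

The same per-term log-ratio weights double as an enrichment measure: the highest-weighted terms are those most over-represented in one class, which is essentially how chapter three surfaces candidate biomarkers for severe disease.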

    Syntactic Generation of Research Thesis Sketches Across Disciplines Using Formal Grammars

    Get PDF
    As part of the prerequisites for granting a degree in higher education institutions, postgraduate students normally carry out research, which they report in the form of theses or dissertations. Studies have shown that students tend to struggle with writing a research thesis across all disciplines because they do not fully comprehend what constitutes one. This project proposes the syntactic generation of research thesis sketches across disciplines using formal grammars. Sketching is a synthesis technique that enables users to provide high-level intuitions about a synthesis problem while leaving low-level details to synthesis tools. This work extends sketching to document generation for research thesis documents. Context-free grammar rules were designed and implemented for this task. A link to 10,000 generated thesis sketches was presented.
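The paper's actual grammar is not given in the abstract; generation from context-free rules can nonetheless be sketched with a tiny recursive expander (the rules below are hypothetical stand-ins):

```python
# Tiny random generator from context-free grammar rules, illustrating the
# kind of syntactic thesis-sketch generation described above. The grammar
# below is a hypothetical stand-in, not the paper's actual rule set.
import random

grammar = {
    "THESIS":  [["TITLE", "CHAPTER", "CHAPTER"]],
    "TITLE":   [["A Study of", "TOPIC"], ["On", "TOPIC"]],
    "CHAPTER": [["Chapter:", "SECTION"]],
    "SECTION": [["Introduction"], ["Methodology"], ["Results"]],
    "TOPIC":   [["Formal Grammars"], ["Document Generation"]],
}

def generate(symbol, rng):
    """Recursively expand a symbol; strings with no rule are terminals."""
    if symbol not in grammar:
        return symbol
    production = rng.choice(grammar[symbol])
    return " ".join(generate(s, rng) for s in production)

print(generate("THESIS", random.Random(0)))
```

Each call draws one production per nonterminal, so repeated calls with different seeds yield distinct, always-grammatical sketches, which is how a rule set of modest size can produce 10,000 different documents.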

    Sport team leadership coaching and captaincy in elite level rugby union football

    Get PDF
    A wide range of literature exists on coaching but it is concerned predominantly with the high school and college levels, is based upon athlete or coach perceptions, or is confined to observations of training or competition. As leaders of sports teams, coaches and captains have rarely been studied at the highest level of national or international sports competition. In the present study, the team leadership roles of the coach and captain in elite rugby union football in New Zealand were examined using participant observation and other qualitative research methods. Elite was defined as New Zealand rugby’s highest internal level of competition: (a) the national provincial championships and (b) international test matches of the national team, the All Blacks. The study explored the roles of the elite rugby coach and captain in vivo in a wide variety of team situations. It was felt that this could provide first-hand information on particular team leader behaviours, on what a coach and captain actually do, and how they are perceived by those around them. The main objective, however, was to use grounded theory techniques to create a model of elite rugby team leadership that might guide developmental programmes on such leadership. The research phases undertaken were those of participant observation with a Provincial Team for five matches, a survey of provincial teams’ coaches and captains on their leadership associated with actual matches, three years’ participant observation with the All Blacks (including observation in eight test match weeks), multiple perspectives on elite team leadership from past rugby test players in New Zealand and overseas, and interviews with national team leaders in sports other than rugby. Participant observation, interviews, questionnaires and document analysis generated data from the research settings. These data were considered in terms of symbolic interactionism and subjected to a grounded theory process. 
This led to a set of elite rugby team leadership categories and properties which, in turn, generated a comprehensive set of theoretical propositions. The propositions became the basis for a model of elite rugby team leadership. This model was then considered as the basis for a programme to develop elite rugby team leaders. Significant aspects of the research findings which have not featured in previous research literature included the coach’s vision, team culture, the centrality of the game plan, match-week build-up, the importance of the captain’s playing example, the coach’s ability to utilise teaching precepts, the coach’s personal qualities, and the need to develop and evaluate team leaders. The model, and the developmental programme principles emanating from it, are seen as relevant for developing elite-level leaders in team sports other than rugby.

    SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation

    Full text link
    Document layout analysis is a well-known problem in the document research community and has been explored extensively, yielding a multitude of solutions ranging from text mining and recognition to graph-based representation, visual feature extraction, etc. However, most existing works have ignored a crucial fact: the scarcity of labeled data. With growing internet connectivity in personal life, an enormous number of documents have become available in the public domain, making data annotation a tedious task. We address this challenge using self-supervision and, unlike the few existing self-supervised document segmentation approaches that use text mining and textual labels, we use a completely vision-based approach in pre-training, without any ground-truth label or its derivative. Instead, we generate pseudo-layouts from the document images to pre-train an image encoder to learn document object representation and localization in a self-supervised framework before fine-tuning it with an object detection model. We show that our pipeline sets a new benchmark in this context and performs at par with the existing methods and the supervised counterparts, if not outperforming them. The code is made publicly available at: https://github.com/MaitySubhajit/SelfDocSeg. Comment: Accepted at the 17th International Conference on Document Analysis and Recognition (ICDAR 2023).
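The abstract does not detail how pseudo-layouts are generated; one simple, generic way to derive layout boxes from a binarized page, used here purely as an illustrative stand-in for that step, is connected-component analysis:

```python
# Connected-component labelling on a tiny binarized "page", as a generic
# illustration of deriving pseudo-layout boxes from a document image
# (a stand-in sketch, not SelfDocSeg's actual pseudo-layout procedure).
from collections import deque

def layout_boxes(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected
    foreground (1) regions in a binary grid, in scan order."""
    rows, cols = len(grid), len(grid[0])
    seen, boxes = set(), []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and (r, c) not in seen:
                queue, box = deque([(r, c)]), [r, c, r, c]
                seen.add((r, c))
                while queue:  # BFS over the component, growing its box
                    y, x = queue.popleft()
                    box = [min(box[0], y), min(box[1], x),
                           max(box[2], y), max(box[3], x)]
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                boxes.append(tuple(box))
    return boxes

page = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1]]
print(layout_boxes(page))  # → [(0, 0, 1, 1), (2, 3, 3, 3)]
```

Boxes obtained this way can serve as free pseudo-labels: an encoder pre-trained to localize them needs no human annotation, which is the labeled-data bottleneck the paper targets.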

    Novel Heuristic Recurrent Neural Network Framework to Handle Automatic Telugu Text Categorization from Handwritten Text Image

    Get PDF
    In the near future, the digitization and processing of current paper documents will play an important role in the creation of a paperless environment. Deep learning techniques for handwriting recognition have been studied extensively by various researchers, and deep neural networks can now be trained quickly thanks to large datasets and other algorithmic advancements. Various methods for extracting text from handwritten manuscripts have been developed in the literature, including neural network approaches such as convolutional neural networks (CNN), recurrent neural networks (RNN), and long short-term memory (LSTM) networks for extracting features from handwritten Telugu text images. To identify handwritten Telugu script automatically and efficiently, while eliminating noise and other semantic artefacts present in Telugu text, this paper proposes a Novel Heuristic Advanced Neural Network based Telugu Text Categorization Model (NHANNTCM) built on a sequence-to-sequence feature extraction procedure. The proposed approach extracts features using an RNN and represents Telugu text in sequence-to-sequence format; the advanced neural network then performs both encoding and decoding to identify and explore visual features from sequences of Telugu text in the input data. The classification accuracy rates for Telugu words, Telugu numerals, Telugu characters, Telugu sentences, and corresponding Telugu sentences were 99.66%, 93.63%, 91.36%, 99.05%, and 97.73%, respectively. Experimental evaluation revealed considerable performance in applications such as private information protection, security defense, and personal handwriting signature identification.