13,695 research outputs found

    A study on creating a custom South Sotho spellchecking and correcting software desktop application

    Thesis (B. Tech.) - Central University of Technology, Free State, 200

    Enhanced Integrated Scoring for Cleaning Dirty Texts

    An increasing number of approaches to ontology engineering from text are turning to online sources such as company intranets and the World Wide Web. Despite this rise, little work addresses the preprocessing and cleaning of dirty texts from online sources. This paper presents an enhancement of Integrated Scoring for Spelling error correction, Abbreviation expansion and Case restoration (ISSAC). ISSAC is implemented as part of the text preprocessing phase of an ontology engineering system. New evaluations of the enhanced ISSAC on 700 chat records show an improved accuracy of 98%, compared to 96.5% with basic ISSAC and 71% with Aspell. Comment: More information is available at http://explorer.csse.uwa.edu.au/reference
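
    The abstract does not give ISSAC's scoring formula, but the general idea of an integrated score can be illustrated with a small sketch: each candidate replacement for a dirty token is ranked by a weighted combination of evidence sources (spelling similarity, abbreviation-table lookup, and context support). The weights, feature functions, and toy tables below are illustrative assumptions, not the published method.

        from difflib import SequenceMatcher

        ABBREVIATIONS = {"u": "you", "thx": "thanks"}    # toy abbreviation table
        LEXICON = {"you", "thanks", "hello", "meeting"}  # toy dictionary

        def similarity(a, b):
            # String similarity as a stand-in for an edit-distance score.
            return SequenceMatcher(None, a, b).ratio()

        def integrated_score(token, candidate, prev_word, bigram_counts):
            spell = similarity(token.lower(), candidate)            # spelling evidence
            abbrev = 1.0 if ABBREVIATIONS.get(token.lower()) == candidate else 0.0
            context = bigram_counts.get((prev_word, candidate), 0)  # context evidence
            return 0.5 * spell + 0.3 * abbrev + 0.2 * min(context / 10.0, 1.0)

        def correct(token, prev_word, bigram_counts):
            # Rank every known word and abbreviation expansion by its integrated score.
            candidates = LEXICON | set(ABBREVIATIONS.values())
            return max(candidates,
                       key=lambda c: integrated_score(token, c, prev_word, bigram_counts))

        print(correct("thx", "ok", {("ok", "thanks"): 7}))  # -> thanks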

    Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System

    In this paper, we describe a new and original approach to the post-processing step of an OCR system. The approach is based on a new spelling-correction method that automatically corrects misspelled words produced by the character recognition step on scanned documents, combining ontologies and bigram codes to create a robust system able to automatically resolve the anomalies of classical approaches. The proposed approach is a hybrid method spread over two stages: the first is character recognition using the ontological model, and the second is word recognition based on a spelling-correction approach that uses bigram codification to detect and correct errors. Spelling errors are broadly classified into two categories, namely non-word errors and real-word errors. In this paper, we are interested only in the detection and correction of non-word errors, because this is the only type of error produced by an OCR system. In addition, an online external resource such as WordNet is used to improve performance
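
    The abstract does not spell out the bigram codification scheme; a minimal sketch of the underlying idea, under the assumption that candidates for an OCR non-word are ranked by character-bigram overlap with dictionary words (Dice coefficient), might look as follows. The lexicon and example are toy assumptions.

        def char_bigrams(word):
            w = f"#{word.lower()}#"  # pad so first and last characters count
            return {w[i:i + 2] for i in range(len(w) - 1)}

        def dice(a, b):
            # Dice coefficient over bigram sets: 2|A∩B| / (|A| + |B|).
            return 2 * len(a & b) / (len(a) + len(b))

        def rank_candidates(nonword, lexicon):
            bg = char_bigrams(nonword)
            return sorted(lexicon, key=lambda w: dice(bg, char_bigrams(w)), reverse=True)

        LEXICON = ["recognition", "recommendation", "restoration"]
        print(rank_candidates("recognltion", LEXICON)[0])  # OCR confused 'i' with 'l'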

    Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach

    Spell checking is the process of finding misspelled words and possibly correcting them. Most modern commercial spell checkers use a straightforward approach to finding misspellings: a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check the correctness of words in their context, which is the problem of real-word spelling errors. To address this issue, state-of-the-art approaches use context features at a fixed n-gram size (i.e. trigrams), which reduces the effectiveness of the model due to the limited feature set. In this paper, we address this problem by adopting sentence-level n-gram features for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (raw text), which makes the model flexible enough to be adapted to any natural language. For demonstration purposes we adopt the under-resourced languages Amharic, Afaan Oromo and Tigrigna. The model was evaluated in terms of recall, precision and F-measure, and compared against the literature (i.e. fixed n-gram context features) to assess whether the technique performs as well. The experimental results indicate that the proposed model with sentence-level n-gram context features achieves better results: for real-word error detection and correction it achieves an average F-measure of 90.03%, 85.95% and 84.24% for Amharic, Afaan Oromo and Tigrigna, respectively. Keywords: sentence-level n-gram, real-word spelling error, spell checker, unsupervised corpus-based spell checker. DOI: 10.7176/JIEA/10-4-02. Publication date: September 30th 202
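
    As a rough illustration of the corpus-based idea, the sketch below flags a real-word error when a confusion-set alternative has stronger n-gram context support, using counts learned from raw text. The paper uses all n-gram orders up to sentence level; for brevity this sketch scores bigrams only, and the confusion set is a toy assumption.

        from collections import Counter

        def train_bigrams(raw_text):
            # Learn bigram counts from an unsupervised corpus (raw text).
            counts = Counter()
            for sent in raw_text.lower().split("."):
                toks = sent.split()
                counts.update(zip(toks, toks[1:]))
            return counts

        def score(words, i, cand, bigrams):
            # Context support for placing `cand` at position i.
            left = bigrams[(words[i - 1], cand)] if i > 0 else 0
            right = bigrams[(cand, words[i + 1])] if i < len(words) - 1 else 0
            return left + right

        CONFUSIONS = {"piece": {"peace"}, "peace": {"piece"}}  # toy confusion set

        def detect_and_correct(sentence, bigrams):
            words = sentence.lower().split()
            for i, w in enumerate(words):
                for alt in CONFUSIONS.get(w, ()):
                    if score(words, i, alt, bigrams) > score(words, i, w, bigrams):
                        words[i] = alt  # real-word error found and corrected
            return " ".join(words)

        corpus = "they signed a peace treaty. a piece of cake."
        bg = train_bigrams(corpus)
        print(detect_and_correct("they signed a piece treaty", bg))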

    StyleCheck: An Automated Stylistic Analysis Tool

    StyleCheck is a user-friendly tool with multiple functions designed to aid in the production of quality writing. Its features include stylistic analysis (at both document-wide and individual-sentence scales) and spelling and grammar checking, as well as suggested replacements for all types of errors. In addition, StyleCheck can identify the famous author (from a limited corpus) whose style is most similar to the user's. The source code for StyleCheck is available online at: https://github.com/alexpwelton/StyleChec
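
    The abstract does not say how author similarity is computed; a common stylometric baseline is cosine similarity over function-word frequencies, sketched below under that assumption. The function-word list and corpus are illustrative.

        import math
        from collections import Counter

        FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]

        def profile(text):
            # Relative frequencies of function words, a standard style signal.
            counts = Counter(text.lower().split())
            total = max(sum(counts[w] for w in FUNCTION_WORDS), 1)
            return [counts[w] / total for w in FUNCTION_WORDS]

        def cosine(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
            return dot / norm if norm else 0.0

        def most_similar_author(user_text, corpus):
            # corpus maps author name -> sample of that author's writing.
            p = profile(user_text)
            return max(corpus, key=lambda a: cosine(p, profile(corpus[a])))

        corpus = {"A": "the cat sat on the mat and it purred",
                  "B": "to be or not to be that is the question"}
        print(most_similar_author("the dog sat on the rug and it slept", corpus))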

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique for extracting text-based information from scanned images has been developed. The extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis, and converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time and cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and supply chain procurement processes. This thesis uses a cross-disciplinary approach involving computer science and operational management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.

    The expert system developed in this thesis focuses on two distinct areas of research: text/object detection and text extraction. For text/object detection, the Faster R-CNN model was analysed. While this model yields outstanding object-detection results, its performance is poor when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network implemented with the help of the Faster R-CNN model and a discriminator based on PatchGAN; its output is text data with bounding boxes. For text extraction from the bounding boxes, a novel data extraction framework was designed, consisting of XML processing (when an existing OCR engine is used), bounding-box pre-processing, text clean-up, OCR error correction, spell checking, type checking, pattern-based matching, and finally a learning mechanism for automating future data extraction. Fields that the system extracts successfully are provided in key-value format.

    The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is then used to extract relevant data. While this methodology is robust, the companies surveyed were not satisfied with its accuracy and sought new, optimised solutions. To confirm the results, the engines were used to return XML-based files with the identified text and metadata. The output XML data was then fed into the new system for information extraction. This system uses the existing OCR engine alongside a novel, self-adaptive, learning-based OCR engine based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London with expertise in reducing its clients' procurement costs. This data was fed into the system to obtain a deeper level of spend classification and categorisation, which helped the company reduce its reliance on human effort and allowed for greater efficiency compared with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools.

    The intention behind the development of this novel methodology was twofold: first, to develop and test a novel solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. The newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimising SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information
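
    The thesis abstract names the post-OCR steps (text clean-up, spell check, type check, pattern-based matching) without giving their implementation; the sketch below illustrates how the pattern-matching and type-check stages could turn cleaned bounding-box text into the key-value output described. The field names and regular expressions are illustrative assumptions.

        import re
        from datetime import datetime

        PATTERNS = {
            "invoice_number": re.compile(r"invoice\s*(?:no\.?|number)[:\s]*([A-Z0-9-]+)", re.I),
            "invoice_date":   re.compile(r"date[:\s]*(\d{2}/\d{2}/\d{4})", re.I),
            "total":          re.compile(r"total[:\s]*\$?([\d,]+\.\d{2})", re.I),
        }

        def type_check(field, value):
            # Reject matches whose value does not parse as the expected type.
            try:
                if field == "invoice_date":
                    datetime.strptime(value, "%d/%m/%Y")
                elif field == "total":
                    float(value.replace(",", ""))
                return True
            except ValueError:
                return False

        def extract_fields(box_texts):
            fields = {}
            for text in box_texts:  # one cleaned string per bounding box
                for field, pattern in PATTERNS.items():
                    m = pattern.search(text)
                    if m and field not in fields and type_check(field, m.group(1)):
                        fields[field] = m.group(1)  # key-value output
            return fields

        boxes = ["Invoice No: INV-2041", "Date: 03/11/2021", "TOTAL: $1,249.50"]
        print(extract_fields(boxes))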
    • …