64 research outputs found
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bonding boxes. For text extraction from the bounding box, a novel data extraction framework consisting of various processes including XML processing in case of existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automatizing future data extraction was designed. Whichever fields the system can extract successfully are provided in key-value format.The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the systemās methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using excel sheets and Business Intelligence (BI) tools.The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information
A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models
Word representation has always been an important research area in the history
of natural language processing (NLP). Understanding such complex text data is
imperative, given that it is rich in information and can be used widely across
various applications. In this survey, we explore different word representation
models and its power of expression, from the classical to modern-day
state-of-the-art word representation language models (LMS). We describe a
variety of text representation methods, and model designs have blossomed in the
context of NLP, including SOTA LMs. These models can transform large volumes of
text into effective vector representations capturing the same semantic
information. Further, such representations can be utilized by various machine
learning (ML) algorithms for a variety of NLP related tasks. In the end, this
survey briefly discusses the commonly used ML and DL based classifiers,
evaluation metrics and the applications of these word embeddings in different
NLP tasks
Strategies for Handling Out-of-Vocabulary Words in Automatic Speech Recognition
Nowadays, most ASR (automatic speech recognition) systems deployed in industry are closed-vocabulary systems, meaning we have a limited vocabulary of words the system can recognize, and where pronunciations are provided to the system. Words out of this vocabulary are called out-of-vocabulary (OOV) words, for which either pronunciations or both spellings and pronunciations are not known to the system. The basic motivations of developing strategies to handle OOV words are: First, in the training phase, missing or wrong pronunciations of words in training data results in poor acoustic models. Second, in the test phase, words out of the vocabulary cannot be recognized at all, and mis-recognition of OOV words may affect recognition performance of its in-vocabulary neighbors as well. Therefore, this dissertation is dedicated to exploring strategies of handling OOV words in closed-vocabulary ASR.
First, we investigate dealing with OOV words in ASR training data, by introducing an acoustic-data driven pronunciation learning framework using a likelihood-reduction based criterion for selecting pronunciation candidates from multiple sources, i.e. standard grapheme-to-phoneme algorithms (G2P) and phonetic decoding, in a greedy fashion. This framework effectively expands a small hand-crafted pronunciation lexicon to cover OOV words, for which the learned pronunciations have higher quality than approaches using G2P alone or using other baseline pruning criteria. Furthermore, applying the proposed framework to generate alternative pronunciations for in-vocabulary (IV) words improves both recognition performance on relevant words and overall acoustic model performance.
Second, we investigate dealing with OOV words in ASR test data, i.e. OOV detection and recovery. We first conduct a comparative study of a hybrid lexical model (HLM) approach for OOV detection, and several baseline approaches, with the conclusion that the HLM approach outperforms others in both OOV detection and first pass OOV recovery performance. Next, we introduce a grammar-decoding framework for efficient second pass OOV recovery, showing that with properly designed schemes of estimating OOV unigram probabilities, the framework significantly improves OOV recovery and overall decoding performance compared to first pass decoding.
Finally we propose an open-vocabulary word-level recurrent neural network language model (RNNLM) re-scoring framework, making it possible to re-score lattices containing recovered OOVs using a single word-level RNNLM, that was ignorant of OOVs when it was trained. Above all, the whole OOV recovery pipeline shows the potential of a highly efficient open-vocabulary word-level ASR decoding framework, tightly integrated into a standard WFST decoding pipeline
Robust Text Correction for Grammar and Fluency
Grammar is one of the most important properties of natural language. It is a set of structural (i.e., syntactic and morphological) rules that are shared among native speakers in order to engage smooth communication. Automated grammatical error correction (GEC) is a natural language processing (NLP) application, which aims to correct grammatical errors in a given source sentence by computational models.
Since the data-driven statistical methods began in 1990s and early 2000s, the GEC com- munity has worked on establishing a common framework for its evaluation (i.e., dataset and metric for benchmarking) in order to compare GEC modelsā performance quantitatively. A series of shared tasks since early 2010s is a good example of this.
In the first half of this thesis, I propose character-level and token-level error correction algorithms. For the character-level error correction, I introduce a semi-character recurrent neural network, which is motivated by a finding in psycholinguistics, called the Cmabrigde Uinervtisy (Cambridge University) effect or typoglycemia. For word-level error correc- tion, I propose an error-repair dependency parsing algorithm for ungrammatical texts. The algorithm can parse sentences and correct grammatical errors simultaneously.
However, it is important to note that grammatical errors are not usually limited to mor- phological or syntactic errors. For example, collocational errors such as *quick/fast food and *fast/quick meal are not fully explained by only syntactic rules. This is another im- portant property of natural language, called fluency (or acceptability). Fluency is a level of mastery that goes beyond knowledge of how to follow the rules, and includes know- ing when they can be broken or flouted. In fact, the GEC community has also extended the scope of error types from closed class errors (e.g., noun numbers, verb forms) to the fluency-oriented errors.
The second half of this thesis investigates GEC while considering fluency as well as grammaticality. When it comes to āwhole-sentenceā correction, by extending the scope of errors considering fluency as well as grammaticality, the GEC community has overlooked the reliability and validity of the task scheme (i.e., evaluation metric and dataset for bench- marking). Thus, I reassess the goals of GEC as a āwhole-sentenceā rewriting task while considering fluency. Following the fluency-oriented GEC framework, I introduce a new benchmark corpus that is more diverse in various aspects such as proficiency, topics, and learnersā native languages.
Based on the fluency-oriented metric and dataset, I propose a new āwhole-sentenceā error correction model with neural reinforcement learning. Unlike conventional maximum likelihood estimation (MLE), the model directly optimizes toward an objective that consid- ers a sentence-level, task-specific evaluation metric. I demonstrate that the proposed model outperforms MLE in human and automated evaluation metrics.
Finally, I conclude the thesis and outline ideas and suggestions for future GEC research
Computer analysis of children's non-native English speech for language learning and assessment
Children's ASR appears to be more challenging than adults' and it's even more diļ¬cult when it comes to non-native children's speech. This research investigates diļ¬erent techniques to compensate for the eļ¬ects of non-native and children on the performance of ASR systems. The study mainly utilises hybrid DNN-HMM systems with conventional DNNs, LSTMs and more advanced TDNN models. This work uses the CALL-ST corpus and TLT-school corpus to study children's non-native English speech.
Initially, data augmentation was explored on the CALL-ST corpus to address the lack of data problem using the AMI corpus and PF-STAR German corpus. Feature selection, acoustic model adaptation and selection were also investigated on CALL-ST. More aspects of the ASR system, including pronunciation modelling, acoustic modelling, language modelling and system fusion, were explored on the TLT-school corpus as this corpus has a bigger amount of data. Then, the relationships between the CALL-ST and TLT-school corpora were studied and utilised to improve ASR performance.
The other part of the present work is text processing for non-native children's English speech. We focused on providing accept/reject feedback to learners based on the text generated by the ASR system from learners' spoken responses. A rule-based and a machine learning-based system were proposed for making the judgement, several aspects of the systems were evaluated. The inļ¬uence of the ASR system on the text processing system was explored
Recommended from our members
Adapting Automatic Summarization to New Sources of Information
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs in within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages ā documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create and train on large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disdisfluent translations of non-English documents, and it generalizes across languages
Artificial Intelligence and Cognitive Computing
Artificial intelligence (AI) is a subject garnering increasing attention in both academia and the industry today. The understanding is that AI-enhanced methods and techniques create a variety of opportunities related to improving basic and advanced business functions, including production processes, logistics, financial management and others. As this collection demonstrates, AI-enhanced tools and methods tend to offer more precise results in the fields of engineering, financial accounting, tourism, air-pollution management and many more. The objective of this collection is to bring these topics together to offer the reader a useful primer on how AI-enhanced tools and applications can be of use in todayās world. In the context of the frequently fearful, skeptical and emotion-laden debates on AI and its value added, this volume promotes a positive perspective on AI and its impact on society. AI is a part of a broader ecosystem of sophisticated tools, techniques and technologies, and therefore, it is not immune to developments in that ecosystem. It is thus imperative that inter- and multidisciplinary research on AI and its ecosystem is encouraged. This collection contributes to that
- ā¦