66 research outputs found
Text Classification: A Review, Empirical, and Experimental Evaluation
The explosive and widespread growth of data necessitates the use of text
classification to extract crucial information from vast amounts of data.
Consequently, there has been a surge of research in both classical and deep
learning text classification methods. Despite the numerous methods proposed in
the literature, there is still a pressing need for a comprehensive and
up-to-date survey. Existing survey papers categorize algorithms for text
classification into broad classes, which can lead to the misclassification of
unrelated algorithms and incorrect assessments of their qualities and behaviors
using the same metrics. To address these limitations, our paper introduces a
novel methodological taxonomy that classifies algorithms hierarchically into
fine-grained classes and specific techniques. The taxonomy includes methodology
categories, methodology techniques, and methodology sub-techniques. Our study
is the first survey to utilize this methodological taxonomy for classifying
algorithms for text classification. Furthermore, our study also conducts
empirical evaluation and experimental comparisons and rankings of different
algorithms that employ the same specific sub-technique, different
sub-techniques within the same technique, different techniques within the same
category, and categorie
Information Access Using Neural Networks For Diverse Domains And Sources
The ever-increasing volume of web-based documents poses a challenge in efficiently accessing specialized knowledge from domain-specific sources, requiring a profound understanding of the domain and substantial comprehension effort. Although natural language technologies, such as information retrieval and machine reading comprehension systems, offer rapid and accurate information retrieval, their performance in specific domains is hindered by training on general domain datasets. Creating domain-specific training datasets, while effective, is time-consuming, expensive, and heavily reliant on domain experts. This thesis presents a comprehensive exploration of efficient technologies to address the challenge of information access in specific domains, focusing on retrieval-based systems encompassing question answering and ranking.
We begin with a comprehensive introduction to the information access system. We demonstrate the structure of an information access system through a typical open-domain question-answering task. We outline its two major components, the retrieval and reader models, and the design choices for each. We focus on three main points: 1) the design of the connection between the two components; 2) the trade-off associated with the retrieval model and the best frontier in practice; 3) a data augmentation method to adapt the reader model, trained initially on closed-domain datasets, to effectively answer questions in the retrieval-based setting.
Subsequently, we discuss various methods enabling system adaptation to specific domains. Transfer learning techniques are presented, including generation as data augmentation, further pre-training, and progressive domain-clustered training. We also present a novel zero-shot re-ranking method inspired by the compression-based distance. We summarize the conclusions and findings gathered from the experiments.
Moreover, the exploration extends to retrieval-based systems beyond textual corpora. We explored the search system for an e-commerce database, wherein natural language queries are combined with user preference data to facilitate the retrieval of relevant products. To address the challenges of the retrieval-based e-commerce ranking system, including noisy labels and cold start problems, we enhanced model training through cascaded training and adversarial sample weighting. Another scenario we investigated is the search system in the math domain, characterized by the unique role of formulas and distinct features compared to textual searches. We tackle the math-related search problem by combining neural ranking models with structurally optimized algorithms.
Finally, we summarize the research findings and future research directions.
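As a rough illustration of the compression-based distance mentioned in the abstract above, the following Python sketch computes the normalized compression distance (NCD) with zlib and uses it to re-rank candidate passages without any training. The abstract does not describe the thesis's actual re-ranking method; the function names, the choice of zlib as the compressor, and the example texts are all assumptions made for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the compressed size. Lower values mean the two strings
    share more compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def rerank(query: str, candidates: list[str]) -> list[str]:
    """Sort candidates from most to least similar to the query by NCD."""
    return sorted(candidates, key=lambda doc: ncd(query, doc))


# Hypothetical example texts, not taken from the thesis.
query = "how do neural ranking models retrieve relevant passages"
candidates = [
    "the quick brown fox jumps over the lazy dog by the river",
    "neural ranking models retrieve relevant passages for a query",
]
ranked = rerank(query, candidates)
```

Because the score depends only on a general-purpose compressor, such a re-ranker needs no in-domain training data, which is what makes the idea attractive in a zero-shot setting. On very short strings the compressor's fixed overhead dominates, so scores are only meaningful relative to one another.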
Representation Learning for Texts and Graphs: A Unified Perspective on Efficiency, Multimodality, and Adaptability
[...] This thesis is situated between natural language processing and graph representation learning and investigates selected connections. First, we introduce matrix embeddings as an efficient text representation sensitive to word order. [...] Experiments with ten linguistic probing tasks, 11 supervised, and five unsupervised downstream tasks reveal that vector and matrix embeddings have complementary strengths and that a jointly trained hybrid model outperforms both. Second, a popular pretrained language model, BERT, is distilled into matrix embeddings. [...] The results on the GLUE benchmark show that these models are competitive with other recent contextualized language models while being more efficient in time and space. Third, we compare three model types for text classification: bag-of-words, sequence-, and graph-based models. Experiments on five datasets show that, surprisingly, a wide multilayer perceptron on top of a bag-of-words representation is competitive with recent graph-based approaches, questioning the necessity of graphs synthesized from the text. [...] Fourth, we investigate the connection between text and graph data in document-based recommender systems for citations and subject labels. Experiments on six datasets show that the title as side information improves the performance of autoencoder models. [...] We find that the meaning of item co-occurrence is crucial for the choice of input modalities and an appropriate model. Fifth, we introduce a generic framework for lifelong learning on evolving graphs in which new nodes, edges, and classes appear over time. [...] The results show that by reusing previous parameters in incremental training, it is possible to employ smaller history sizes with only a slight decrease in accuracy compared to training with complete history. Moreover, weighting the binary cross-entropy loss function is crucial to mitigate the problem of class imbalance when detecting newly emerging classes. [...
Domain adaptation for neural machine translation
The development of deep learning techniques has allowed Neural Machine Translation (NMT) models to become extremely powerful, given sufficient training data and training time. However, such translation models struggle when translating text of a specific domain. A domain may consist of text on a well-defined topic, or text of unknown provenance with an identifiable vocabulary distribution, or language with some other stylometric feature. While NMT models can achieve good translation performance on domain-specific data via simple tuning on a representative training corpus, such data-centric approaches have negative side-effects. These include over-fitting, brittleness, and 'catastrophic forgetting' of previous training examples.
In this thesis we instead explore more robust approaches to domain adaptation for NMT. We consider the case where a system is adapted to a specified domain of interest, but may also need to accommodate new language, or domain-mismatched sentences. We explore techniques relating to data selection and curriculum, model parameter adaptation procedure, and inference procedure. We show that iterative fine-tuning can achieve strong performance over multiple related domains, and that Elastic Weight Consolidation can be used to mitigate catastrophic forgetting in NMT domain adaptation across multiple sequential domains. We develop a robust variant of Minimum Risk Training which allows more beneficial use of small, highly domain-specific tuning sets than simple cross-entropy fine-tuning, and can mitigate exposure bias resulting from domain over-fitting. We extend Bayesian Interpolation inference schemes to Neural Machine Translation, allowing adaptive weighting of NMT ensembles to translate text from an unknown domain.
Finally, we demonstrate the benefit of multi-domain adaptation approaches for other lines of NMT research. We show that NMT systems using multiple forms of data representation can benefit from multi-domain inference approaches. We also demonstrate a series of domain adaptation approaches to mitigating the effects of gender bias in machine translation.
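The abstract above names Elastic Weight Consolidation (EWC) as a way to mitigate catastrophic forgetting. As a minimal sketch, assuming the commonly used quadratic-penalty formulation with a diagonal Fisher information estimate, the regularised loss can be written as follows. The function names, the flat parameter lists, and the `strength` hyperparameter are illustrative assumptions, not details taken from the thesis.

```python
from typing import Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float,
) -> float:
    """Quadratic EWC penalty: (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    `anchor_params` are the parameters learned on the earlier domain and
    `fisher_diag` holds the per-parameter importance estimates.
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def ewc_loss(
    task_loss: float,
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float = 1.0,
) -> float:
    """New-domain loss plus the penalty for drifting from the old solution."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, strength)
```

Parameters with a large Fisher value are held close to their earlier values, while unimportant ones remain free to adapt to the new domain.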
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.
The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model comprises a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding box, a novel data extraction framework was designed, consisting of various processes including XML processing when an existing OCR engine is used, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format.
The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools.
The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations
The study of low-dimensional, noisy manifolds embedded in a higher dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed, in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at fixed time, to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov
Learning meaning representations for text generation with deep generative models
This thesis explores conditioning a language generation model with auxiliary variables. By doing so, we hope to be able to better control the output of the language generator. We explore several kinds of auxiliary variables in this thesis, from unstructured continuous, to discrete, to structured discrete auxiliary variables, and evaluate their advantages and disadvantages. We consider three primary axes of variation: how interpretable the auxiliary variables are, how much control they provide over the generated text, and whether the variables can be induced from unlabelled data. The latter consideration is particularly interesting: if we can show that induced latent variables correspond to the semantics of the generated utterance, then by manipulating the variables, we have fine-grained control over the meaning of the generated utterance, thereby learning simple meaning representations for text generation.
We investigate three language generation tasks: open domain conversational response generation, sentence generation from a semantic topic, and generating surface form realisations of meaning representations. We use a different type of auxiliary variable for each task, describe the reasons for choosing that type of variable, and critically discuss how much the task benefited from an auxiliary variable decomposition. All of the models that we use combine a high-level graphical model with a neural language model text generator. The graphical model lets us specify the structure of the text generating process, while the neural text generator can learn how to generate fluent text from a large corpus of examples. We aim to show the utility of such deep generative models of text for text generation in the following work.
WiFi-Based Human Activity Recognition Using Attention-Based BiLSTM
Recently, significant efforts have been made to explore human activity recognition (HAR) techniques that use information gathered by existing indoor wireless infrastructures through WiFi signals without requiring the monitored subject to carry a dedicated device. The key intuition is that different activities introduce different multi-paths in WiFi signals and generate different patterns in the time series of channel state information (CSI). In this paper, we propose and evaluate a full pipeline for a CSI-based human activity recognition framework for 12 activities in three different spatial environments using two deep learning models: ABiLSTM and CNN-ABiLSTM. Evaluation experiments have demonstrated that the proposed models outperform state-of-the-art models. The experiments also show that the proposed models can be applied to other environments with different configurations, albeit with some caveats. The proposed ABiLSTM model achieves overall accuracies of 94.03%, 91.96%, and 92.59% across the three target environments, while the proposed CNN-ABiLSTM model reaches accuracies of 98.54%, 94.25%, and 95.09% across those same environments.
Artificial Intelligence for Multimedia Signal Processing
Artificial intelligence technologies are actively applied to broadcasting and multimedia processing. Extensive research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years there have been attempts to improve image, video, speech, and other data compression efficiency in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. This book collects topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing, including computer vision, speech/sound/text processing, and content analysis/information mining.