    Global-local word embedding for text classification

    Only humans can understand the actual meaning that underlies natural written language, whereas machines can form semantic relationships only after humans have provided the parameters needed to model that meaning. To enable computer models to access the underlying meaning in written language, accurate and sufficient document representation is crucial. Recent word embedding approaches have drawn much attention in text mining research. One of the main benefits of such approaches is that they use global corpora to generate pre-trained word vectors. Although very effective, these approaches have a disadvantage: relying solely on pre-trained word vectors may neglect the local context and increase word ambiguity. In this thesis, four new document representation approaches are introduced to mitigate the risk of word ambiguity and inject local context into globally pre-trained word vectors. The proposed approaches, which are frameworks for document representation that use word embedding learning features for the task of text classification, are: Content Tree Word Embedding; Composed Maximum Spanning Content Tree; Embedding-based Word Clustering; and Autoencoder-based Word Embedding. The results show an improvement in the F-score measure for a document classification task on the IMDB Movie Reviews, Hate Speech Identification, 20 Newsgroups, Reuters-21578, and AG News benchmark datasets, in comparison with three deep learning-based word embedding approaches (GloVe, Word2Vec, and fastText) and two other document representations (LSA and random word embedding).
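
The abstract reports results as an F-score without stating which variant or averaging scheme was used. As a rough illustration only, the sketch below assumes the balanced F1 variant for a single positive class; the function name `f1_score` and the `positive` parameter are hypothetical and not taken from the thesis.

```python
def f1_score(y_true, y_pred, positive=1):
    """Balanced F-score (F1) for one class: harmonic mean of precision and recall.

    Assumption: the thesis may use a different variant or averaging scheme.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For a multi-class benchmark such as 20 Newsgroups, one common option is to compute this per class and average the results, though the abstract does not say whether that was done.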

    Hyperbolic Interaction Model For Hierarchical Multi-Label Classification

    Unlike traditional classification tasks, which assume that labels are mutually exclusive, hierarchical multi-label classification (HMLC) aims to assign multiple labels to every instance, with the labels organized under hierarchical relations. Besides the labels, since linguistic ontologies are intrinsic hierarchies, the conceptual relations between words can also form hierarchical structures. Thus it can be a challenge to learn mappings from word hierarchies to label hierarchies. We propose to model the word and label hierarchies by embedding them jointly in the hyperbolic space. The main reason is that the tree-likeness of the hyperbolic space matches the complexity of symbolic data with hierarchical structures. A new Hyperbolic Interaction Model (HyperIM) is designed to learn label-aware document representations and make predictions for HMLC. Extensive experiments are conducted on three benchmark datasets. The results demonstrate that the new model can realistically capture the complex data structures and further improve performance for HMLC compared with state-of-the-art methods. To facilitate future research, our code is publicly available.
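
To make the "tree-likeness" of hyperbolic space concrete, the sketch below computes the geodesic distance in the Poincaré ball, one standard model of hyperbolic space. The abstract does not say which model HyperIM uses, so the choice of the Poincaré ball and the function name `poincare_distance` are assumptions made for illustration, not the authors' implementation.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincaré ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

Distances grow rapidly as points approach the boundary of the ball, which is why a tree with many leaves can be placed with low distortion: general concepts sit near the origin and specific ones near the edge.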