    ClassiNet -- Predicting Missing Features for Short-Text Classification

    The fundamental problem in short-text classification is feature sparseness -- the lack of feature overlap between a trained model and a test instance to be classified. We propose ClassiNet -- a network of classifiers trained to predict missing features in a given instance -- to overcome the feature sparseness problem. Using a set of unlabeled training instances, we first learn binary classifiers as feature predictors that predict whether a particular feature occurs in a given instance. Next, each feature predictor is represented as a vertex v_i in the ClassiNet, with a one-to-one correspondence between feature predictors and vertices. The weight of the directed edge e_ij connecting a vertex v_i to a vertex v_j represents the conditional probability that, given v_i occurs in an instance, v_j also occurs in the same instance. We show that ClassiNets generalize word co-occurrence graphs by considering implicit co-occurrences between features. We extract numerous features from the trained ClassiNet to overcome feature sparseness. In particular, for a given instance x, we find similar features from the ClassiNet that did not appear in x and append those features to the representation of x. Moreover, we propose a method based on graph propagation to find features that are indirectly related to a given short text. We evaluate ClassiNets on several benchmark datasets for short-text classification. Our experimental results show that by using ClassiNet we obtain statistically significant improvements in accuracy on short-text classification tasks, without using any external resources such as thesauri for finding related features.
    Comment: Accepted to ACM TKD
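
    As a rough illustration of the idea described in the abstract, the sketch below trains one binary feature predictor per feature on unlabeled bag-of-words instances and uses those predictors to append likely-missing features to a sparse test instance. It is a minimal sketch assuming scikit-learn and NumPy; the function names, the probability threshold, and the toy data are illustrative assumptions, not the paper's implementation, and the edge weights and graph-propagation step of ClassiNet are omitted.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def train_feature_predictors(X):
            """For each feature j, train a binary classifier that predicts whether
            feature j occurs in an instance, using the remaining features as input."""
            predictors = {}
            n_features = X.shape[1]
            for j in range(n_features):
                y = (X[:, j] > 0).astype(int)        # target: does feature j occur?
                mask = np.arange(n_features) != j    # exclude feature j itself
                if y.min() == y.max():               # skip features that never vary
                    continue
                clf = LogisticRegression(max_iter=1000).fit(X[:, mask], y)
                predictors[j] = (clf, mask)
            return predictors

        def expand_instance(x, predictors, threshold=0.8):
            """Append features predicted as likely present but absent from the sparse instance x."""
            x_expanded = x.copy()
            for j, (clf, mask) in predictors.items():
                if x[j] == 0:
                    p = clf.predict_proba(x[mask].reshape(1, -1))[0, 1]
                    if p >= threshold:
                        x_expanded[j] = p            # soft count for the predicted missing feature
            return x_expanded

        # Toy usage: five unlabeled instances over four features
        X_unlabeled = np.array([[1, 1, 0, 0],
                                [1, 1, 1, 0],
                                [0, 1, 1, 1],
                                [1, 0, 1, 1],
                                [0, 0, 1, 1]], dtype=float)
        predictors = train_feature_predictors(X_unlabeled)
        x_test = np.array([1, 1, 0, 0], dtype=float)  # sparse test instance
        print(expand_instance(x_test, predictors))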

    TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information

    With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles' content and carry valuable information for document classification and categorization. However, the shortness, data sparseness, limited word occurrences, and inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms to these short texts, making their classification a challenging task. This study first explores the performance of our earlier method, TextNetTopics, on short text. Second, we propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized in topics of words with the document-topic distribution extracted by a topic model, to alleviate the data-sparseness problem when classifying short texts. We evaluate the proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.
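
    The general recipe described above, augmenting lexical features with a document-topic distribution before classification, can be sketched as below. This is a minimal sketch assuming scikit-learn, with LatentDirichletAllocation as a stand-in for the nine topic models evaluated in the paper; the toy titles, labels, and number of topics are assumptions for illustration, and this is not the TextNetTopics Pro code, which additionally performs topic-based feature selection.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.linear_model import LogisticRegression

        # Hypothetical short titles and binary labels (1 = liver-injury related)
        titles = [
            "deep learning for protein structure prediction",
            "graph neural networks for molecule property prediction",
            "liver injury biomarkers in drug safety studies",
            "drug induced hepatotoxicity case report",
        ]
        labels = [0, 0, 1, 1]

        # Lexical features: bag-of-words counts over the short titles
        vectorizer = CountVectorizer()
        X_lex = vectorizer.fit_transform(titles)

        # Semantic features: document-topic distribution from a topic model
        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        X_topics = lda.fit_transform(X_lex)

        # Combine both feature views and train a classifier
        X_combined = np.hstack([X_lex.toarray(), X_topics])
        clf = LogisticRegression(max_iter=1000).fit(X_combined, labels)
        print(clf.predict(X_combined))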
