Search CORE

16 research outputs found

Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Author: He Haibo
Kay Steven
Tang Bo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 08/02/2016
Field of study

Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on the Information Theory, which aims to rank the features with their discriminative capacity for classification. We first revisit two information measures: Kullback-Leibler divergence and Jeffreys divergence for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination (

MD

) and

MD-\chi^2

methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data Engineering. 14 pages, 5 figure

arXiv.org e-Print Archive

Crossref

DigitalCommons@URI

Automated COVID-19 Dialogue System Using a New Deep Learning Network

Author: Alhussayni Khaldoon H.
Alshamery Eman S.
Publication venue: 'International University of Sarajevo'
Publication date: 13/04/2021
Field of study

During the coronavirus disease 2019 (COVID-19) pandemic outbreak, it is necessary to apply social distancing measurements and search for an alternative to physical contact due to the spread of viral infection. The interest in task-oriented dialogue systems has grown remarkably in healthcare, using natural language in the dialogue between patients and doctors. However, the doctor’s advice is implicit and unclear in most conversations, and the patient may also be nervous when describing symptoms or may have difficulty describing them. Therefore, the patient’s description of symptoms is insufficient for a diagnosis by doctors. This study aims to provide suitable medical advice based on the patients’ symptoms during the conversation between doctors and patients by proposing a new deep learning method for automated medical dialogue systems. The model is based on an encoder and two stages of learning to make reliable decisions. The encoder extracts important words using text normalization, resulting in two vectors: symptom vectors and doctor utterance vectors. The symptom vectors are represented as a weighted bag-of-words feature. The first stage is used to cluster the patients’ utterances by applying Hopfield network while considering the semantic similarity, whereas the second stage extracts an implicit label as a template of advice using clustering. Additionally, the external evaluation model used the applied feedforward neural network classification algorithm using labels obtained in the second stage. The CovidDialog-English dataset is used to evaluate the model. The experimental results indicate the high performance of the feedforward neural network with an F1-score of 0.972 and presents a comparison of three clusters using the k-nearest neighbours and naïve Bayes-based models

Periodicals of Engineering and Natural Sciences (PEN - International University of Sarajevo)

Pre Processing Techniques for Arabic Documents Clustering

Author: Alhanjouri Mohammed A.
Publication venue: 'Vandana Publications'
Publication date: 01/01/2017
Field of study

Clustering of text documents is an important technique for documents retrieval. It aims to organize documents into meaningful groups or clusters. Preprocessing text plays a main role in enhancing clustering process of Arabic documents. This research examines and compares text preprocessing techniques in Arabic document clustering. It also studies effectiveness of text preprocessing techniques: term pruning, term weighting using (TF-IDF), morphological analysis techniques using (root-based stemming, light stemming, and raw text), and normalization. Experimental work examined the effect of clustering algorithms using a most widely used partitional algorithm, K-means, compared with other clustering partitional algorithm, Expectation Maximization (EM) algorithm. Comparison between the effect of both Euclidean Distance and Manhattan similarity measurement function was attempted in order to produce best results in document clustering. Results were investigated by measuring evaluation of clustered documents in many cases of preprocessing techniques. Experimental results show that evaluation of document clustering can be enhanced by implementing term weighting (TF-IDF) and term pruning with small value for minimum term frequency. In morphological analysis, light stemming, is found more appropriate than root-based stemming and raw text. Normalization, also improved clustering process of Arabic documents, and evaluation is enhanced

Institutional Repository of the Islamic University of Gaza

Automatic 3D modeling and reconstruction of cultural heritage sites from Twitter images

Author: Doulamis Anastasios
Doulamis Nikolaos
Makantasis Konstantinos
Protopapadakis Eftychios
Voulodimos Athanasios
Publication venue: MDPI AG
Publication date: 01/01/2020
Field of study

This paper presents an approach for leveraging the abundance of images posted on social media like Twitter for large scale 3D reconstruction of cultural heritage landmarks. Twitter allows users to post short messages, including photos, describing a plethora of activities or events, e.g., tweets are used by travelers on vacation, capturing images from various cultural heritage assets. As such, a great number of images are available online, able to drive a successful 3D reconstruction process. However, reconstruction of any asset, based on images mined from Twitter, presents several challenges. There are three main steps that have to be considered: (i) tweets’ content identification, (ii) image retrieval and filtering, and (iii) 3D reconstruction. The proposed approach first extracts key events from unstructured tweet messages and then identifies cultural activities and landmarks. The second stage is the application of a content-based filtering method so that only a small but representative portion of cultural images are selected to support fast 3D reconstruction. The proposed methods are experimentally evaluated using real-world data and comparisons verify the effectiveness of the proposed scheme.peer-reviewe

OAR@UM

Role of semantic indexing for text classification.

Author: Sani Sadiq
Publication venue
Publication date: 30/09/2014
Field of study

The Vector Space Model (VSM) of text representation suffers a number of limitations for text classification. Firstly, the VSM is based on the Bag-Of-Words (BOW) assumption where terms from the indexing vocabulary are treated independently of one another. However, the expressiveness of natural language means that lexically different terms often have related or even identical meanings. Thus, failure to take into account the semantic relatedness between terms means that document similarity is not properly captured in the VSM. To address this problem, semantic indexing approaches have been proposed for modelling the semantic relatedness between terms in document representations. Accordingly, in this thesis, we empirically review the impact of semantic indexing on text classification. This empirical review allows us to answer one important question: how beneficial is semantic indexing to text classification performance. We also carry out a detailed analysis of the semantic indexing process which allows us to identify reasons why semantic indexing may lead to poor text classification performance. Based on our findings, we propose a semantic indexing framework called Relevance Weighted Semantic Indexing (RWSI) that addresses the limitations identified in our analysis. RWSI uses relevance weights of terms to improve the semantic indexing of documents. A second problem with the VSM is the lack of supervision in the process of creating document representations. This arises from the fact that the VSM was originally designed for unsupervised document retrieval. An important feature of effective document representations is the ability to discriminate between relevant and non-relevant documents. For text classification, relevance information is explicitly available in the form of document class labels. Thus, more effective document vectors can be derived in a supervised manner by taking advantage of available class knowledge. Accordingly, we investigate approaches for utilising class knowledge for supervised indexing of documents. Firstly, we demonstrate how the RWSI framework can be utilised for assigning supervised weights to terms for supervised document indexing. Secondly, we present an approach called Supervised Sub-Spacing (S3) for supervised semantic indexing of documents. A further limitation of the standard VSM is that an indexing vocabulary that consists only of terms from the document collection is used for document representation. This is based on the assumption that terms alone are sufficient to model the meaning of text documents. However for certain classification tasks, terms are insufficient to adequately model the semantics needed for accurate document classification. A solution is to index documents using semantically rich concepts. Accordingly, we present an event extraction framework called Rule-Based Event Extractor (RUBEE) for identifying and utilising event information for concept-based indexing of incident reports. We also demonstrate how certain attributes of these events e.g. negation, can be taken into consideration to distinguish between documents that describe the occurrence of an event, and those that mention the non-occurrence of that event

Open Access Institutional Repository at Robert Gordon University

Recommended from our members

High performance latent dirichlet allocation for text mining

Author: Liu Zelong
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2013
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Latent Dirichlet Allocation (LDA), a total probability generative model, is a three-tier Bayesian model. LDA computes the latent topic structure of the data and obtains the significant information of documents. However, traditional LDA has several limitations in practical applications. LDA cannot be directly used in classification because it is a non-supervised learning model. It needs to be embedded into appropriate classification algorithms. LDA is a generative model as it normally generates the latent topics in the categories where the target documents do not belong to, producing the deviation in computation and reducing the classification accuracy. The number of topics in LDA influences the learning process of model parameters greatly. Noise samples in the training data also affect the final text classification result. And, the quality of LDA based classifiers depends on the quality of the training samples to a great extent. Although parallel LDA algorithms are proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method which combines the LDA model and Support Vector Machine (SVM) classification algorithm for an improved accuracy in classification when reducing the dimension of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme to process noise data. When the noise ratio is large in the training data set, the noise reduction scheme can always produce a high level of accuracy in classification. Finally, the thesis parallelizes LDA using the MapReduce model which is the de facto computing standard in supporting data intensive applications. A genetic algorithm based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space

Brunel University Research Archive