3 research outputs found

    Text representation using canonical data model

    The development of digital technology and the World Wide Web has led to an increase in digital documents used for purposes such as publishing, which in turn has raised awareness of the need for effective techniques to support the search and retrieval of text. Text representation plays a crucial role in representing text in a meaningful way. The clarity of a representation depends closely on the choice of text representation method. Traditional text representation methods, such as term frequency-inverse document frequency (TF-IDF), ignore the relationships between words and their meanings in documents. As a result, the sparsity and semantic problems that are predominant in textual documents remain unresolved. This research reduces the sparsity and semantic problems by proposing a Canonical Data Model (CDM) for text representation. CDM is constructed through an accumulation of syntactic and semantic analysis. Documents from the 20 Newsgroups dataset were used to test the validity of CDM for text representation. The text documents go through a number of preprocessing steps and syntactic parsing in order to identify the sentence structure, and then the TF-IDF method is used to represent the text through CDM. The findings showed that CDM represents text efficiently, based on model validation through language experts' review and the percentage scores of the similarity measurement methods.
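    The limitation of plain TF-IDF described in this abstract can be illustrated with a minimal sketch. This is not the CDM method; it is an ordinary bag-of-words TF-IDF written from scratch. The toy sentences, the whitespace tokenizer, and the names `tf_idf` and `zero_fraction` are assumptions made for the illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, rows) of plain TF-IDF weights, one row per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        rows.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if counts[t] else 0.0
            for t in vocab
        ])
    return vocab, rows


# Hypothetical toy corpus: the first two sentences mean nearly the same thing.
docs = [
    "the car is fast",
    "the automobile is quick",
    "the bank approved the loan",
]
vocab, matrix = tf_idf(docs)

# Share of zero entries in the document-term matrix (a rough sparsity measure).
zero_fraction = sum(w == 0.0 for row in matrix for w in row) / (len(vocab) * len(docs))
print(f"vocabulary size: {len(vocab)}, zero entries: {zero_fraction:.0%}")
```

    In this sketch "car" and "automobile" occupy unrelated dimensions, so the first two sentences share no weighted content term even though they are near-synonymous, and most matrix entries are zero even for three short sentences.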

    An enhanced sequential exception technique for semantic-based text anomaly detection

    The detection of semantic-based text anomaly is an interesting research area which has gained considerable attention from the data mining community. Text anomaly detection identifies information that deviates from the general information contained in documents. Text data are characterized by problems related to ambiguity, high dimensionality, sparsity and text representation. If these challenges are not properly resolved, the identification of semantic-based text anomaly will be less accurate. This study proposes an Enhanced Sequential Exception Technique (ESET) to detect semantic-based text anomaly by achieving five objectives: (1) to modify the Sequential Exception Technique (SET) for processing unstructured text; (2) to optimize Cosine Similarity for identifying similar and dissimilar text data; (3) to hybridize the modified SET with Latent Semantic Analysis (LSA); (4) to integrate the Lesk and Selectional Preference algorithms for disambiguating senses and identifying the canonical form of text; and (5) to represent semantic-based text anomaly using First Order Logic (FOL) and Concept Network Graph (CNG). ESET performs text anomaly detection by employing optimized Cosine Similarity, hybridizing LSA with the modified SET, and integrating it with Word Sense Disambiguation algorithms, specifically Lesk and Selectional Preference. FOL and CNG are then proposed to represent the detected semantic-based text anomaly. To demonstrate the feasibility of the technique, experiments were run on four selected datasets, namely NIPS data, ENRON, Daily Koss blog, and 20Newsgroups. The experimental evaluation revealed that ESET significantly improved the accuracy of detecting semantic-based text anomaly in documents. Compared with existing measures, ESET outperformed the benchmarked methods, with improved F1-scores on all datasets: NIPS data 0.75, ENRON 0.82, Daily Koss blog 0.93, and 20Newsgroups 0.97. The results generated by ESET have proven to be significant and support the growing notion of semantic-based text anomaly that is increasingly evident in the existing literature. Practically, this study contributes to topic modelling and concept coherence for the purposes of visualizing information, knowledge sharing and optimized decision making.
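    The idea of using cosine similarity to separate similar from dissimilar text can be sketched as follows. This is not ESET and does not reproduce its optimized similarity measure, LSA step, or word sense disambiguation; it uses plain cosine similarity on raw term counts and flags the document with the lowest mean similarity to the rest. The toy sentences and the names `cosine` and `most_deviating` are assumptions made for the illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_deviating(docs):
    """Return (index, scores); scores[i] is doc i's mean similarity to the others."""
    if len(docs) < 2:
        raise ValueError("need at least two documents to compare")
    vectors = [Counter(d.lower().split()) for d in docs]  # naive tokenizer
    scores = []
    for i, v in enumerate(vectors):
        others = [cosine(v, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(others) / len(others))
    # The candidate anomaly is the document least similar to the rest on average.
    idx = min(range(len(docs)), key=scores.__getitem__)
    return idx, scores


# Hypothetical toy corpus: three finance sentences and one off-topic sentence.
docs = [
    "stocks fell as markets reacted to rate news",
    "markets rallied after the rate news lifted stocks",
    "rate cuts pushed stocks and markets higher",
    "the recipe needs two eggs and fresh basil",
]
idx, scores = most_deviating(docs)
print(idx, [round(s, 3) for s in scores])
```

    Because the comparison is purely lexical, a sketch like this would miss anomalies that differ in meaning but share vocabulary, which is the gap the abstract's semantic components (LSA and word sense disambiguation) are meant to address.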

    A Text Representation Method Based on Harmonic Series

    No full text