13 research outputs found

    Survey on Enhancing Clustering Output using Side Information by the Textual Extraction Mechanism

    In text mining, many operations are based on the statistical analysis of a term, word, or phrase. Clustering is a popular technique for automatically organizing a large collection of text; it is also used for text classification. Many text mining applications contain side information alongside the text documents, in the form of web documents, user-access web logs, and links attached to text files. This side information can help clustering, but using it is sometimes risky because it may add noise to the procedure. A better text mining technique is therefore needed to improve the quality of presentation. In this paper, we use different algorithms to enhance clustering quality with document-based, sentence-based, corpus-based, and combined-approach concept analysis, so as to maximize the benefits of using side information.

    Concept Based Labeling of Text Documents Using Support Vector Machine

    Classification plays a vital role in many information management and retrieval tasks. Text classification uses labeled training data to learn a classification system and then automatically classifies the remaining text using the learned system. Classification proceeds through several stages: text processing, feature extraction, feature vector construction, and final classification. The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, corpus-based concept analysis, and a concept-based similarity measure. The proposed model can efficiently find significant matching concepts between documents, according to the semantics of their sentences. The similarity between documents is calculated based on a similarity measure. We then analyze the terms that contribute to the sentence semantics at the sentence, document, and corpus levels, rather than relying on the traditional analysis of the document only. With the extracted feature vector for each new document, the Support Vector Machine (SVM) algorithm is applied for document classification. The approach enhances text classification accuracy.
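
As a rough illustration of what a document-to-document similarity score looks like, the sketch below computes plain cosine similarity over term-frequency vectors. This is a deliberately simpler stand-in, not the concept-based measure the abstract describes; the function names `term_frequencies` and `cosine_similarity` are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two term-frequency vectors, in [0.0, 1.0].

    Assumption: a bag-of-words baseline, not the paper's concept-based measure.
    """
    tf_a = term_frequencies(doc_a)
    tf_b = term_frequencies(doc_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

A score like this could serve as one feature among many; a concept-based measure would additionally weight terms by their role in sentence semantics, which this baseline ignores.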

    A survey on Data Extraction and Data Duplication Detection

    Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information because of the high dimensionality of the data. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. When dealing with a collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. This paper reviews the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey covers existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data, so as to deliver fine-grained knowledge to the user.
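
The duplicate-filtering step mentioned above can be sketched with a common baseline: Jaccard similarity over word shingles. This is an assumed technique for illustration only, not one the survey is known to endorse; the helper names, the shingle size `k=3`, and the `threshold=0.8` cutoff are all hypothetical choices.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a lowercased text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts get a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep the first occurrence of each document; drop later near-duplicates.

    Assumption: pairwise comparison, which is quadratic in the number of
    documents and only suitable for small collections.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

For example, `drop_near_duplicates(["Text mining is fun.", "text mining is fun", "Clustering groups similar documents together"])` keeps the first and third strings, because the second differs from the first only in case and punctuation.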

    Dynamic Document Annotation for Efficient Data Retrieval

    Document annotation is considered one of the most popular methods, in which the metadata present in a document is used to search for documents in a large text document database. A few application domains, such as scientific networks and blogs, share large amounts of information, usually as unstructured text documents. Manual annotation of each document becomes a tedious task. Annotations facilitate the task of finding the document topic and help the reader quickly overview and understand the document. Dynamic document annotation provides a solution to this type of problem. Dynamic annotation of documents is generally considered a semi-supervised learning task. The documents are dynamically assigned to one of a set of predefined classes based on the features extracted from their textual content. This paper presents a survey of the Collaborative Adaptive Data Sharing platform (CADS) for document annotation and of the use of query workload to direct the annotation process. A key novelty of CADS is that it learns over time the most important data attributes of the application, and uses this knowledge to guide data insertion and querying.

    A Novel Approach for Text Classification

    Text Classification (TC) is the process of associating text documents with the classes considered most appropriate, thereby distinguishing topics such as particle physics from optical physics. Much research has been done in this field, but there is still a need to categorize a collection of text documents into mutually exclusive categories by extracting concepts or features using the supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into predefined Category Groups (CG) by training and preparing them at the sentence, document, and integrated corpora levels, along with feature reduction and ambiguity removal at each level, to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) against the corresponding categories or classes. The model uses a Support Vector Machine Classifier (SVMC) to classify the training data patterns into two groups, i.e., +1 and -1, thereby producing accurate and correct results. The proposed model works efficiently and effectively, with great performance and high-accuracy results.


    Keywords: comparative summarization, concept segmentation, query based summar

    Text Classification: A Review, Empirical, and Experimental Evaluation

    The explosive and widespread growth of data necessitates the use of text classification to extract crucial information from vast amounts of data. Consequently, there has been a surge of research in both classical and deep learning text classification methods. Despite the numerous methods proposed in the literature, there is still a pressing need for a comprehensive and up-to-date survey. Existing survey papers categorize algorithms for text classification into broad classes, which can lead to the misclassification of unrelated algorithms and incorrect assessments of their qualities and behaviors using the same metrics. To address these limitations, our paper introduces a novel methodological taxonomy that classifies algorithms hierarchically into fine-grained classes and specific techniques. The taxonomy includes methodology categories, methodology techniques, and methodology sub-techniques. Our study is the first survey to utilize this methodological taxonomy for classifying algorithms for text classification. Furthermore, our study also conducts empirical evaluation and experimental comparisons and rankings of different algorithms that employ the same specific sub-technique, different sub-techniques within the same technique, different techniques within the same category, and categorie