
    Text Clustering and Classification Techniques using Data Mining

    Text classification is the task of automatically sorting a set of documents into categories from a predefined set. It is a data mining technique used to predict group membership for data instances within a given dataset, classifying data into different classes subject to some constraints. Instead of traditional feature selection techniques used for text document classification. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Automated text categorization and class prediction is important for text categorization to reduce the feature size and to speed up the learning process of classifiers.
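
    The point about Naive Bayes needing no iterative parameter estimation can be illustrated with a minimal sketch. It assumes a multinomial model with add-one (Laplace) smoothing and whitespace-tokenized toy documents; the function names `train_nb` and `predict_nb` and the sample data are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Train by counting only: one pass over (tokens, label) pairs, no iteration."""
    class_docs = Counter()              # label -> number of training documents
    word_counts = defaultdict(Counter)  # label -> term -> count
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # terms never seen in training are ignored
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (assumed for illustration only).
training = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_nb(training)
print(predict_nb(model, "cheap watches now".split()))
```

    Training is a single counting pass, which is why the approach scales to large collections.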

    Efficient Hybrid Machine Learning Algorithm for text Classification

    Text Mining and Text Classification are the most important and challenging tasks. Deriving high-quality and relevant information from text is Text Mining, and categorizing the text documents is done using Text Classification. The real challenge in these areas is to address problems like handling large text corpora, similarity of words in text documents, and association of text documents with a subset of class categories. The feature extraction and classification of such text documents require an efficient machine learning algorithm which performs automatic text classification. The major drawback encountered in text classification and retrieval is determining whether a text is pertinent to the query. This work focuses on text classification using data mining techniques. A hybrid algorithm is proposed for classifying the text. The proposed algorithm combines the concepts of KNN, SVM and NB. The results obtained support the proposed hybrid algorithm in text classification.
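
    The abstract does not say how KNN, SVM and NB are combined. One common way to combine several classifiers is majority voting, sketched below as an assumption rather than as the paper's method. The three stand-in classifiers are hypothetical keyword rules that only mark where trained KNN, SVM and NB models would plug in.

```python
from collections import Counter


def majority_vote(classifiers, document):
    """Return the label predicted by the most classifiers.

    Ties are resolved in favour of the earliest classifier in the list.
    """
    votes = [clf(document) for clf in classifiers]
    counts = Counter(votes)
    top = max(counts.values())
    for vote in votes:
        if counts[vote] == top:
            return vote


# Hypothetical stand-ins for trained KNN, SVM and NB models.
def knn_like(doc):
    return "sports" if "match" in doc else "politics"


def svm_like(doc):
    return "sports" if "goal" in doc else "politics"


def nb_like(doc):
    return "sports" if "team" in doc else "politics"


print(majority_vote([knn_like, svm_like, nb_like], "the team won the match"))
```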

    Classifying Web Exploits with Topic Modeling

    This short empirical paper investigates how well topic modeling and database meta-data characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is obtained in the empirical experiment. Text mining and topic modeling are a significant boost factor behind this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, and also provides a few scholarly observations about the potential for semi-automatic classification of exploits in the existing tracking infrastructures. Comment: Proceedings of the 2017 28th International Workshop on Database and Expert Systems Applications (DEXA). http://ieeexplore.ieee.org/abstract/document/8049693

    Scalable Text Mining with Sparse Generative Models

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
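
    The idea of sparse inference over an inverted index can be sketched as follows. This is a simplified illustration, not the thesis's formulation: it assumes a multinomial Naive Bayes model with add-one smoothing, for which the score of a class splits into a cheap per-class base term plus a correction of log(count + 1) for each (term, class) pair that actually occurs. Only those pairs are visited, through the term's posting list. The names `build_index` and `classify_sparse` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build per-class totals and an inverted index: term -> {label: count}."""
    class_docs = Counter()    # label -> number of training documents
    class_tokens = Counter()  # label -> total number of tokens
    postings = defaultdict(Counter)
    for tokens, label in docs:
        class_docs[label] += 1
        class_tokens[label] += len(tokens)
        for tok in tokens:
            postings[tok][label] += 1
    return class_docs, class_tokens, postings


def classify_sparse(index, tokens):
    """Score classes, touching only the postings of terms in the document."""
    class_docs, class_tokens, postings = index
    n_docs = sum(class_docs.values())
    vocab_size = len(postings)
    known = [t for t in tokens if t in postings]  # drop out-of-vocabulary terms

    # Base score: log prior plus the smoothed "unseen term" probability
    # for every known token, as if no term had ever occurred in the class.
    scores = {
        label: math.log(class_docs[label] / n_docs)
        - len(known) * math.log(class_tokens[label] + vocab_size)
        for label in class_docs
    }
    # Sparse correction: only classes listed in a term's postings are updated.
    for tok in known:
        for label, count in postings[tok].items():
            scores[label] += math.log(count + 1)
    return max(scores, key=scores.get)


# Toy training data (assumed for illustration only).
index = build_index([
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
])
print(classify_sparse(index, "cheap watches now".split()))
```

    The correction loop costs time proportional to the number of postings visited rather than to the number of classes times the document length, which is where the saving comes from when most terms occur in few classes. The base term in this sketch is still computed once per class.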

    Sentiment Analysis of Public Responses on Indonesia Government Using Naïve Bayes and Support Vector Machine

    Many people are interested in knowing how the public views President Joko Widodo's administration. Text Mining analysis can be one way to collect and analyze text data about Joko Widodo's administration and extract relevant information from the data. Data was obtained by collecting tweets about Joko Widodo's government in 2022 on Twitter using Netlytic. The Text Mining analysis of Joko Widodo's government was then carried out using Naive Bayes (NVB) classification and Support Vector Machine (SVM). This classification can be used to predict sentiment or public views of the government based on the tweets collected. Based on a case study of the classification results for President Joko Widodo using Naive Bayesian classification, we obtained a precision value of 79%, a recall value of 91% and a precision value of 82%. Using SVM, we obtained 85% precision, 95% recall, and 83% precision. Given the high accuracy, recall, and precision, it can be said that SVM classification is more accurate than NVB.
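
    For reference, the metrics named in this abstract can be computed from predicted and true labels as in the sketch below. The function name `binary_metrics` and the sample labels are hypothetical; the example does not reproduce the paper's figures.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Compute precision, recall and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of true positives that were found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Hypothetical sentiment labels for five tweets.
truth = ["positive", "positive", "positive", "negative", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative"]
print(binary_metrics(truth, predicted))
```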

    Semantic Learning and Web Image Mining with Image Recognition and Classification

    Image mining is more than just an extension of data mining to the image domain. Web image mining is a technique commonly used to extract knowledge directly from images on the WWW. Since the main targets of conventional Web mining are numerical and textual data, Web mining for image data is in demand. There is a huge amount of image data as well as text data on the Web. However, mining image data from the Web has received less attention than mining text data, since treating the semantics of images is much more difficult. This paper proposes a novel image recognition and image classification technique using a large number of images automatically gathered from the Web as learning images. For classification, the system uses image-feature-based search as exploited in content-based image retrieval (CBIR), which does not restrict target images unlike conventional image recognition methods, and a support vector machine (SVM), which is one of the most efficient and widely used statistical methods for generic image classification that fits the learning tasks. The experiments show that the proposed system outperforms some existing search systems.