
    Blog Analysis with Fuzzy TFIDF

    Blogs have become increasingly popular because they allow anyone to share personal diaries, opinions, and comments on the World Wide Web. Many blogs contain valuable information, but extracting it from a large number of blog comments is a difficult task. The goal is to analyze a large number of blog comments by clustering them into smaller groups by their similarity, based on keyword relevance. The TF-IDF weight has been used to classify documents by measuring the appearance frequency of each keyword in a document, but it is not effective at differentiating semantic similarities between words. Applying fuzzy semantics to TF-IDF yields fuzzy TF-IDF, which has the ability to rank semantic relevancy. By adopting fuzzy TF-IDF and fuzzy semantics to extend the Vector Space Model to a fuzzy VSM, the fuzzy VSM can effectively explore hidden relationships between blog comments. Therefore, the fuzzy VSM can cluster a large number of blog comments into a small number of groups based on document similarity and semantic relevancy.
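    The abstract does not spell out the fuzzy operators used; the sketch below assumes a max-product composition over a given term-similarity relation (the `sim` table and the toy comments are illustrative, not from the paper):

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF weights for each tokenized blog comment."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(n / df[t])
             for t, c in Counter(d).items()} for d in docs]

def fuzzy_tfidf(docs, sim):
    """Fuzzy TF-IDF: a term also inherits weight from semantically
    related terms, scaled by a fuzzy membership degree in [0, 1]
    (max-product composition). `sim` maps (a, b) -> degree."""
    plain = tfidf(docs)
    vocab = {t for w in plain for t in w}
    out = []
    for w in plain:
        fw = dict(w)
        for t in vocab:
            for u, x in w.items():
                mu = sim.get((t, u), 1.0 if t == u else 0.0)
                fw[t] = max(fw.get(t, 0.0), mu * x)
        out.append({t: v for t, v in fw.items() if v > 0})
    return out

# toy comments: "car" and "auto" are near-synonyms
docs = [["car", "engine"], ["auto", "repair"]]
sim = {("car", "auto"): 0.8, ("auto", "car"): 0.8}
fw = fuzzy_tfidf(docs, sim)
```

    Here the weight of "auto" partially transfers to "car", so two comments using different words for the same concept end up closer in the fuzzy VSM.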

    Classification of metamorphic virus using n-grams signatures

    A metamorphic virus has the capability to change, translate, and rewrite its own code once it infects a system, in order to bypass detection. The computer system can then be seriously damaged by this undetected metamorphic virus, so it is vital to design a classification model that can detect it. This paper focuses on the detection of metamorphic viruses using the Term Frequency-Inverse Document Frequency (TF-IDF) technique. The research was conducted on a second-generation virus dataset. The first step is a classification model that clusters the metamorphic viruses using TF-IDF; the resulting virus clusters are then evaluated with a Naïve Bayes algorithm, using accuracy as the performance metric. The virus classes and features are extracted from bi-grams of assembly language. The results show that the proposed model was able to classify metamorphic viruses using TF-IDF with an optimal number of virus classes and an average accuracy of 94.2%.
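    A minimal sketch of the feature-extraction step described above: bi-grams over an opcode sequence, weighted with TF-IDF (the opcode traces are made up; the paper's Naïve Bayes evaluation would then run on the resulting matrix):

```python
import math
from collections import Counter

def bigrams(opcodes):
    """Bi-gram features over an assembly opcode sequence."""
    return [f"{a} {b}" for a, b in zip(opcodes, opcodes[1:])]

def tfidf_matrix(samples):
    """TF-IDF weight of each bi-gram in each virus sample."""
    grams = [bigrams(s) for s in samples]
    n = len(grams)
    df = Counter(g for gs in grams for g in set(gs))
    return [{g: (c / len(gs)) * math.log(n / df[g])
             for g, c in Counter(gs).items()} for gs in grams]

# made-up opcode traces standing in for two metamorphic variants
samples = [["mov", "push", "call", "ret"],
           ["mov", "push", "call", "jmp"]]
m = tfidf_matrix(samples)
```

    Bi-grams shared by every variant get zero weight (idf = 0), while variant-specific bi-grams such as `call ret` keep positive weight, so the vectors emphasize what distinguishes the variants.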

    Comparison of two methods on vector space model for trust in social commerce

    Information retrieval (IR) is the study of searching for information in documents within web pages. The user describes an information need with comments or reviews consisting of a number of words, and discovering the weight of a query term helps to decide the significance of a query. Estimating term significance is a basic part of most information retrieval approaches, and it is commonly done through term frequency-inverse document frequency (TF-IDF). An improved TF-IDF method has also been used to retrieve information from web documents. This paper presents a comparison of the TF-IDF method and the improved TF-IDF method for information retrieval. Cosine similarity is calculated for both methods, and the results are compared against a desired threshold value. The TF-IDF method extracts more relevant documents than the improved TF-IDF method.
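    The retrieval pipeline compared above can be sketched as plain TF-IDF vectors scored against a query by cosine similarity and cut at a threshold (the corpus, query, and threshold value are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vector per tokenized document (smoothed idf)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(1 + n / df[t])
             for t, c in Counter(d).items()} for d in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["retrieval", "of", "web", "documents"],
        ["web", "information", "retrieval"],
        ["cooking", "pasta", "recipes"]]
query = ["web", "retrieval"]

vecs = tfidf_vectors(docs + [query])   # weight the query like a document
qv = vecs[-1]
scores = [cosine(qv, v) for v in vecs[:-1]]
relevant = [i for i, s in enumerate(scores) if s >= 0.3]  # threshold cut
```

    The two web-retrieval documents clear the threshold while the unrelated one scores zero; an "improved" TF-IDF variant would plug in here by changing only `tfidf_vectors`.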

    Fisher's exact test explains a popular metric in information retrieval

    Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a term of interest in one out of many documents. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the one-tailed version of Fisher's exact test, also known as the hypergeometric test, corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument that suggests the tf-idf variant approximates the negative logarithm of the one-tailed Fisher's exact test P-value (i.e., a hypergeometric distribution tail probability). The Fisher's exact test interpretation of this common tf-idf variant furnishes the working statistician with a ready explanation of tf-idf's long-established effectiveness. (26 pages, 4 figures, 1 table, minor revision)
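    The hypergeometric tail probability at the heart of the argument can be computed directly with `math.comb`; the toy counts below are assumptions chosen only to illustrate the quantities involved, not the paper's data or its exact tf-idf variant:

```python
import math

def hypergeom_sf(x, M, n, K):
    """One-tailed Fisher's exact test P-value P[X >= x], where X counts
    successes when drawing K items without replacement from a population
    of M items containing n successes (a hypergeometric tail)."""
    return sum(math.comb(n, k) * math.comb(M - n, K - k)
               for k in range(x, min(n, K) + 1)) / math.comb(M, K)

# assumed toy counts: a corpus of M tokens, n of them the term of
# interest, a document of K tokens containing the term x times
M, n, K, x = 10_000, 50, 200, 8
p = hypergeom_sf(x, M, n, K)
neg_log_p = -math.log(p)

# a tf-idf-style score on the same counts (the paper's argument concerns
# a specific tf-idf variant; this only shows the quantities being compared)
tf_idf = x * math.log(M / n)
```

    The expected count here is K*n/M = 1, so observing the term 8 times is extreme; both `neg_log_p` and `tf_idf` grow as the observed count becomes more surprising.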

    Classification of Closely Related Indonesian Article based on Latent Semantic Analysis

    Latent Semantic Analysis (LSA) is one of the most popular methods used in classification tasks. LSA is used to extract and represent the contextual-usage meaning of words in a document. Commonly, TF-IDF is used to build the term-document matrix, i.e. to generate the features, before applying Singular Value Decomposition (SVD) in LSA. Based on an initial experiment, the TF-IDF feature in LSA did not perform well at classifying similar texts, such as between aqidah and ibadah articles or between national and regional articles. This may be due to TF-IDF's inability to capture semantic information in the articles. Addressing this issue, this study contributes the use of semantic word vectors (word2vec) as the word representation in text classification, which are then processed by the LSA method. The study reveals that the results obtained are better than with the previous method. In national and regional article classification, using the word2vec feature in LSA increased the F-score from 74% (LSA with TF-IDF) to 75% (LSA with word2vec) and the accuracy from 74% to 78%, while in aqidah and ibadah article classification the F-score improved from 63% (LSA with TF-IDF) to 73% (LSA with word2vec) and the accuracy improved significantly from 49% to 72%.
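    A rough sketch of the word2vec-plus-LSA pipeline, assuming pretrained word vectors are given (the tiny hand-made vectors, documents, and labels below are illustrative, not the paper's data):

```python
import numpy as np

# assumed toy word vectors; in practice these come from a pretrained
# word2vec model
wv = {"pray": [1.0, 0.1], "faith": [0.9, 0.2],
      "city": [0.1, 1.0], "mayor": [0.2, 0.9]}

def doc_vec(tokens):
    """Document vector = mean of its word2vec vectors (a common way to
    feed word embeddings into a document-level method)."""
    return np.mean([wv[t] for t in tokens if t in wv], axis=0)

docs = [["pray", "faith"], ["faith", "pray", "pray"],
        ["city", "mayor"], ["mayor", "city", "city"]]
labels = ["ibadah", "ibadah", "regional", "regional"]

X = np.array([doc_vec(d) for d in docs])     # documents x dimensions
# LSA step: SVD of the document matrix; with these tiny 2-d vectors the
# truncation is exact, but on real data k would be well below the size
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = U[:, :k] * s[:k]                         # documents in latent space

def predict(tokens):
    """Label of the nearest labelled document in the latent space."""
    z = doc_vec(tokens) @ Vt[:k].T
    return labels[int(np.argmin(np.linalg.norm(Z - z, axis=1)))]
```

    Because near-synonyms like "pray" and "faith" already share directions in the embedding space, the latent representation separates the article classes even when they share few exact words, which is what TF-IDF features miss.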

    A Comparative Study of Document Representation Methods

    Document representation learning is crucial for downstream machine learning tasks such as document classification. Recent neural network approaches such as Doc2Vec and its variants are popular. Comparisons with traditional representation methods such as TF-IDF are not very conclusive, for several reasons: Doc2Vec has many hyper-parameters, resulting in performance fluctuation, and the traditional methods still have room for improvement. More importantly, document length and data size affect the results. This thesis conducts a comparative study of these methods and proposes to improve the TF-IDF weighting with mutual information (MI). We find that Doc2Vec works well only for short documents, and only when the data size (the number of documents) is large. For long documents and small data sizes, MI performs better. The experiments are conducted extensively on 11 data sets covering a variety of combinations of document length and data size. In addition, we study the relationship between TF-IDF and MI weighting. We find that their correlation is high overall (the Pearson correlation coefficient is over 0.9 on all the data sets used in this thesis). For medium-frequency words, the MI weighting is always smaller than the TF-IDF weighting. However, for rare words and popular words, MI diverges from TF-IDF greatly: the MI weighting is higher than TF-IDF for popular words but lower for rare words.
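    One plausible reading of the MI weighting and of its comparison against TF-IDF, sketched on a made-up corpus (the thesis does not give its exact MI formulation; pointwise mutual information between a term and a document is assumed here):

```python
import math
from collections import Counter

# assumed toy corpus; the thesis uses 11 real data sets
docs = [["apple", "fruit", "apple"], ["fruit", "salad", "apple"],
        ["car", "engine", "car"], ["engine", "oil", "car"]]

total = sum(len(d) for d in docs)
n_docs = len(docs)
cf = Counter(t for d in docs for t in d)        # corpus frequency
df = Counter(t for d in docs for t in set(d))   # document frequency

def tf_idf(t, d):
    return (d.count(t) / len(d)) * math.log(n_docs / df[t])

def mi(t, d):
    """Pointwise mutual information between term t and document d --
    one plausible reading of the thesis's MI weighting."""
    p_td = d.count(t) / total
    return math.log(p_td / ((cf[t] / total) * (len(d) / total)))

pairs = [(tf_idf(t, d), mi(t, d)) for d in docs for t in set(d)]

def pearson(xy):
    """Pearson correlation between the two weightings."""
    n = len(xy)
    mx = sum(x for x, _ in xy) / n
    my = sum(y for _, y in xy) / n
    cov = sum((x - mx) * (y - my) for x, y in xy)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in xy))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in xy))
    return cov / (sx * sy)

r = pearson(pairs)
```

    Even on this tiny corpus the two weightings correlate strongly, mirroring the thesis's observation that they agree overall while diverging at the frequency extremes.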

    Calculating the Similarity of Indonesian Sentences Using Latent Semantic Indexing Based on KBBI

    Calculating the semantic similarity between sentences is a long-discussed issue in the field of language processing, and semantic analysis plays an important role in research related to text analysis. In this study, the data used are the definitions of words with the same meaning taken from the Indonesian dictionary (KBBI). The authors present two methodologies for semantic similarity: the traditional TF-IDF algorithm, and an LSI method using TF-IDF weighted by the distribution of definition terms. Each definition is compared against all other definitions, analyzing which definitions have a high similarity value along with the accuracy. For the weights and vectors generated by the TF-IDF algorithm in each method, the steps used to reduce the vector dimension greatly affect the resulting similarity values. The experiments show that adding distribution terms yields greater weights and influences the similarity values, and that the LSI method strengthens the TF-IDF algorithm in determining sentence similarity, also taking the character of the KBBI into account. The results of this study give an accuracy of 75.9% for semantic similarity using traditional TF-IDF and 80% for the LSI method with term-distribution-weighted TF-IDF.
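    The LSI step can be sketched as a truncated SVD of the TF-IDF term-by-definition matrix, with similarity measured between definitions in the latent space (the toy glosses below are illustrative, not from the KBBI):

```python
import math
from collections import Counter
import numpy as np

# assumed toy dictionary glosses (tokenized); the real data would be
# KBBI definitions of same-meaning words
defs = [["vehicle", "with", "wheels"], ["vehicle", "with", "engine"],
        ["feeling", "of", "joy"]]

n = len(defs)
vocab = sorted({t for d in defs for t in d})
df = Counter(t for d in defs for t in set(d))

# term-by-definition TF-IDF matrix
A = np.array([[(d.count(t) / len(d)) * math.log(1 + n / df[t])
               for d in defs] for t in vocab])

# LSI: truncated SVD maps each definition into a k-dim latent space
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
D = Vt[:k].T * s[:k]            # one row per definition

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_01 = cos(D[0], D[1])        # two "vehicle" glosses
sim_02 = cos(D[0], D[2])        # unrelated glosses
```

    Truncating the SVD merges the two vehicle glosses into one latent direction, so their similarity rises above what their raw word overlap alone would give.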

    The accessibility dimension for structured document retrieval

    Structured document retrieval aims at retrieving the document components that best satisfy a query, instead of merely retrieving pre-defined document units. This paper reports on an investigation of a tf-idf-acc approach, where tf and idf are the classical term frequency and inverse document frequency, and acc is a new parameter, called accessibility, that captures the structure of documents. The tf-idf-acc approach is defined using a probabilistic relational algebra. To investigate the retrieval quality and estimate the acc values, we developed a method that automatically constructs diverse test collections of structured documents from a standard test collection, with which experiments were carried out. The analysis of the experiments provides estimates of the acc values.
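    A minimal sketch of tf-idf-acc scoring, assuming accessibility values are already estimated (the component tree and acc values below are made up; the paper derives acc via a probabilistic relational algebra):

```python
import math
from collections import Counter

# a structured document: components with an accessibility value acc,
# assumed here to decay with depth in the component tree
components = [
    {"path": "/article",           "acc": 1.0,
     "text": ["xml", "retrieval", "overview"]},
    {"path": "/article/sec1",      "acc": 0.8,
     "text": ["xml", "query", "languages"]},
    {"path": "/article/sec1/par1", "acc": 0.6,
     "text": ["query", "evaluation", "cost"]},
]

n = len(components)
df = Counter(t for c in components for t in set(c["text"]))

def score(component, term):
    """tf-idf-acc: classical tf-idf scaled by the component's
    accessibility."""
    tf = component["text"].count(term) / len(component["text"])
    idf = math.log(1 + n / df[term])
    return tf * idf * component["acc"]

ranked = sorted(components, key=lambda c: score(c, "query"),
                reverse=True)
```

    With equal tf-idf for "query" in the section and the paragraph, the acc factor breaks the tie in favour of the more accessible component, which is the point of the extra parameter.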