    Keyword-based web mining with automatic clustering is a document search method that classifies documents by their keywords. Clustering is performed with the centroid linkage hierarchical method (CLHM) on the keyword counts of each document. Clustering usually requires the number of clusters to be set in advance; in some cases, however, the user cannot determine how many clusters should be built. This paper therefore applies the Valley tracing method as a constraint that tracks the movement of variance at each cluster formation step and analyzes its pattern to cluster automatically. The document data come from a text mining process applied to the documents. On 424 documents, the study shows that clustering with the CLHM algorithm can generally be used to classify documents with the number of clusters determined automatically.
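
A minimal sketch of centroid-linkage agglomerative clustering with an automatic stopping point may help make the idea concrete. It is not the paper's Valley tracing method: as a stand-in, it cuts the hierarchy just before the largest jump in merge distance. The function names (`centroid_linkage`, `auto_cut`) and the toy points are assumptions made for illustration.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def centroid_linkage(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns one (merge_distance, clusters_after_merge) pair per merge step.
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        history.append((d, [list(c) for c in clusters]))
    return history

def auto_cut(points):
    """Stand-in heuristic (not Valley tracing): stop before the largest
    jump in merge distance and return the clustering at that step."""
    history = centroid_linkage(points)
    dists = [d for d, _ in history]
    jumps = [dists[k + 1] - dists[k] for k in range(len(dists) - 1)]
    if not jumps:
        return [list(points)]
    k = max(range(len(jumps)), key=jumps.__getitem__)
    return history[k][1]

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = auto_cut(points)
```

On these toy points the hierarchy is cut at two clusters, one per group of nearby points. In a document setting each point would be a keyword-count vector rather than a 2-D coordinate.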

    An Attribute-based Statistic Model for Privacy Impact Assessment

    Personally Identifiable Information (PII) includes any information that can be used to distinguish or trace an individual’s identity such as name, social security number, date and place of birth, mother’s maiden name, or biometric records. It also includes other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information. PII is often the target of attacks, and loss of PII could result in identity theft. According to the U.S. Department of Justice, the average number of U.S. identity fraud victims annually is 11,571,900 [1]. The total financial loss attributed to identity theft in 2013 was 21 billion dollars, compared to 13.2 billion total loss in 2010 [1].

    KNN with TF-IDF based Framework for Text Categorization

    KNN is a very popular algorithm for text classification. This paper presents the possibility of using the KNN algorithm with the TF-IDF method, together with a framework for text classification. The framework enables classification according to various parameters, as well as measurement and analysis of results. Evaluation of the framework focused on the speed and quality of classification. The test results showed the strengths and weaknesses of the algorithm, providing guidance for the further development of similar frameworks.
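
As a rough illustration of the combination named here, the sketch below classifies a text by majority vote among its k most similar training documents, using TF-IDF weights and cosine similarity. It is not the paper's framework: the whitespace tokenizer, the smoothed IDF formula, the function names, and the toy data are all assumptions.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """L2-normalized TF-IDF vector; terms unseen in training are dropped."""
    vec = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def tfidf_vectors(token_lists):
    """Fit smoothed IDF weights (an assumed variant) and vectorize the documents."""
    n = len(token_lists)
    df = Counter(t for toks in token_lists for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [weigh(toks, idf) for toks in token_lists], idf

def knn_predict(train_texts, train_labels, text, k=3):
    """Label of the majority among the k most cosine-similar training texts."""
    vectors, idf = tfidf_vectors([t.lower().split() for t in train_texts])
    query = weigh(text.lower().split(), idf)
    # Vectors are unit length, so the dot product equals cosine similarity.
    sims = [
        (sum(w * v.get(t, 0.0) for t, w in query.items()), label)
        for v, label in zip(vectors, train_labels)
    ]
    sims.sort(key=lambda pair: -pair[0])
    votes = Counter(label for _, label in sims[:k])
    return votes.most_common(1)[0][0]

train_texts = [
    "goal match team score",
    "team wins football match",
    "stock market price falls",
    "market shares price rise",
    "football league goal",
]
train_labels = ["sport", "sport", "finance", "finance", "sport"]
sport_guess = knn_predict(train_texts, train_labels, "football match goal")
finance_guess = knn_predict(train_texts, train_labels, "stock price")
```

The value of `k` and the similarity measure are the kind of parameters such a framework would expose for experimentation.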

    System of Information Feedback on Archive Using Term Frequency-Inverse Document Frequency and Vector Space Model Methods

    Archives are one example of important documents. They are stored systematically in order to simplify their storage and retrieval. Information retrieval is the process of retrieving relevant documents while not retrieving irrelevant ones, and doing so requires a method. The Term Frequency-Inverse Document Frequency and Vector Space Model methods can find relevant documents according to their level of closeness or similarity. In addition, applying the Nazief-Adriani stemming algorithm can improve retrieval performance by transforming the words in a document or text to their base form. The system then indexes the documents to simplify and speed up the search process. Relevance is determined by calculating the similarity between the existing documents and the query, each represented in a certain form. The system then sorts the retrieved documents by their level of relevance to the query.
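
The ranking step described above can be sketched as follows: documents and the query become TF-IDF vectors, and documents are sorted by cosine similarity to the query. This is a minimal sketch under stated assumptions, not the system from the paper: stemming is omitted (no Nazief-Adriani implementation is included), the tokenizer is a plain whitespace split, and the `log(n / df) + 1` IDF variant and function names are illustrative choices.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace split; a real system would also stem here."""
    return text.lower().split()

def build_index(docs):
    """Return one TF-IDF vector per document plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Indices of documents with nonzero similarity, most relevant first."""
    vectors, idf = build_index(docs)
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0]

docs = [
    "archive storage and retrieval",
    "sentiment analysis of blogs",
    "archive retrieval system",
]
order = rank("archive retrieval", docs)
```

Here the third document ranks above the first because it is shorter, so the two query terms make up a larger share of its vector; the second document shares no terms with the query and is not returned.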

    Development and Study of a Domain-Independent Text Sentiment Classifier

    The paper presents a method for constructing a sentiment classifier for two and three classes (positive and negative; positive, neutral, and negative texts). It also presents experimental results showing the high accuracy of the proposed method regardless of the domain to which a text belongs. The effectiveness of the method is confirmed by experiments on the sentiment-annotated blog collection from the ROMIP 2012 seminar. The following metrics were used to evaluate the classifier: precision, recall, accuracy, and F-measure. The F-measure of the proposed method for classification into 2 classes is up to 93%. In addition to the ROMIP 2012 blog collection, a collection of news and a collection of short texts from social networks were used in the experiments.

    Data Driven Creation of Sentiment Dictionaries for Corporate Credit Risk Analysis

    It has been shown that German-language user-generated content can improve corporate credit risk assessment when sentiment analysis is applied. However, these approaches have so far relied on human coders. In order to automate the analysis, we construct 20 domain-dependent sentiment dictionaries based on parts of a manually classified corpus from Twitter. We then apply the dictionaries to the remaining part of the corpus and rank them by accuracy. Results from McNemar’s tests indicate that the three best dictionaries do not differ significantly, whereas the first and the fourth dictionary in the ranking do. In addition, a general German-language dictionary is inferior to the constructed dictionaries. The results emphasize the importance of domain-dependent dictionaries in German-language sentiment analysis for future research. Furthermore, practitioners can use the dictionaries to create an additional indicator for corporate credit risk assessment.
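
For readers unfamiliar with McNemar's test, the sketch below compares two classifiers (here, two dictionaries) evaluated on the same items. It uses the exact binomial form of the test; the abstract does not say which variant the authors used, so that choice, the function name, and the toy data are assumptions.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired correctness flags.

    Only discordant pairs matter: items one classifier got right
    and the other got wrong.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree
    # Under the null hypothesis, discordant pairs split 50/50 (Binomial(n, 0.5)).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy data: dictionary A is right on 8 items where B is wrong, never the reverse.
a_flags = [True] * 10
b_flags = [False] * 8 + [True] * 2
p_value = mcnemar_exact(a_flags, b_flags)
```

A small p-value suggests the two dictionaries differ in accuracy beyond what chance disagreement would explain; when discordant pairs split evenly, the p-value is 1.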

    k-TVT: a flexible and effective method for early depression detection

    The increasing use of social media allows valuable information to be extracted in order to prevent some risks early. Such is the case with using blogs for the early detection of people showing signs of depression. To address this problem, we describe k-temporal variation of terms (k-TVT), a method that uses the variation of vocabulary across time steps as the concept space for representing documents. An interesting feature of this approach is that a parameter (the k value) can be set according to the level of urgency (earliness) required to detect the at-risk (depressed) cases. Results on the early detection of depression data set from eRisk 2017 seem to confirm the robustness of k-TVT for different urgency levels, using SVM as the classifier. In addition, some recent results on an extension of this collection appear to confirm the effectiveness of k-TVT as one of the state-of-the-art methods for early depression detection. XVI Workshop Bases de Datos y Minería de Datos. Red de Universidades con Carreras en Informátic