Hierarchical classification for multiple, distributed web databases
The proliferation of online information resources increases the importance of effective and efficient distributed searching. Our research aims to provide an alternative hierarchical categorization and search capability based on a Bayesian network learning algorithm. Our proposed approach, grounded in automatic textual analysis of the subject content of online web databases, addresses the database selection problem by first classifying web databases into a hierarchy of topic categories. The reported experimental results demonstrate that this classification approach not only effectively reduces the class search space but also significantly improves classification accuracy.
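The hierarchical selection idea can be illustrated with a small sketch: classify a query's text into a top-level topic category first, then search only the databases filed under that category, which shrinks the search space to one branch of the hierarchy. This is a simplified stand-in (a multinomial Naive Bayes over word counts rather than the paper's Bayesian network learner), and the category names, documents, and database names below are invented for illustration.

```python
import math
from collections import Counter

def train_nb(labelled_docs):
    """Multinomial Naive Bayes: per-class word counts with add-one smoothing."""
    counts, totals, vocab = {}, Counter(), set()
    for label, text in labelled_docs:
        c = counts.setdefault(label, Counter())
        for w in text.lower().split():
            c[w] += 1
            totals[label] += 1
            vocab.add(w)
    return counts, totals, vocab

def classify(model, text):
    """Return the class with the highest smoothed log-likelihood."""
    counts, totals, vocab = model
    best, best_lp = None, float("-inf")
    for label in counts:
        lp = sum(
            math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
            for w in text.lower().split()
        )
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Hypothetical topic hierarchy: database descriptions grouped by category.
category_docs = [
    ("medicine", "heart disease clinical trials patient treatment"),
    ("computing", "software source code programming compiler"),
]
db_by_category = {
    "medicine": ["MedlineDB", "ClinicalTrialsDB"],
    "computing": ["SourceForgeDB", "ACMDigitalDB"],
}

model = train_nb(category_docs)
query = "patient treatment for heart disease"
cat = classify(model, query)
candidates = db_by_category[cat]  # search restricted to one topic branch
```

A real system would recurse down the hierarchy, classifying within the chosen category until individual databases are ranked.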
Predicting Good Configurations for GitHub and Stack Overflow Topic Models
Software repositories contain large amounts of textual data, ranging from
source code comments and issue descriptions to questions, answers, and comments
on Stack Overflow. To make sense of this textual data, topic modelling is
frequently used as a text-mining tool for the discovery of hidden semantic
structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used
topic model that aims to explain the structure of a corpus by grouping texts.
LDA requires multiple parameters to work well, and there are only rough and
sometimes conflicting guidelines available on how these parameters should be
set. In this paper, we contribute (i) a broad study of parameters to arrive at
good local optima for GitHub and Stack Overflow text corpora, (ii) an
a-posteriori characterisation of text corpora related to eight programming
languages, and (iii) an analysis of corpus feature importance via per-corpus
LDA configuration. We find that (1) popular rules of thumb for topic modelling
parameter configuration are not applicable to the corpora used in our
experiments, (2) corpora sampled from GitHub and Stack Overflow have different
characteristics and require different configurations to achieve good model fit,
and (3) we can predict good configurations for unseen corpora reliably. These
findings support researchers and practitioners in efficiently determining
suitable configurations for topic modelling when analysing textual data
contained in software repositories.

Comment: to appear as a full paper at MSR 2019, the 16th International Conference on Mining Software Repositories
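LDA's sensitivity to its parameters, the number of topics K, the document-topic prior alpha, and the topic-word prior beta, can be made concrete with a minimal collapsed Gibbs sampler. This is a generic textbook sampler for illustration, not the tooling used in the study, and the toy corpus is invented.

```python
import random

def lda_gibbs(docs, K, alpha, beta, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    # z[d][n]: current topic assignment of the n-th token of document d
    z = [[rng.randrange(K) for _ in d] for d in docs]
    ndk = [[0] * K for _ in docs]      # doc-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # tokens per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k, v = z[d][n], wid[w]
                ndk[d][k] -= 1; nkw[k][v] -= 1; nk[k] -= 1
                # Full conditional: alpha and beta smooth the count ratios
                weights = [(ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                           for j in range(K)]
                r = rng.random() * sum(weights)
                for j, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = j
                        break
                z[d][n] = k
                ndk[d][k] += 1; nkw[k][v] += 1; nk[k] += 1
    # Document-topic distributions (the quantity model-fit metrics depend on)
    theta = [[(ndk[d][j] + alpha) / (len(docs[d]) + K * alpha) for j in range(K)]
             for d in range(len(docs))]
    return theta, vocab

docs = [["apple", "banana", "apple"], ["code", "bug", "code"], ["apple", "code"]]
theta, vocab = lda_gibbs(docs, K=2, alpha=0.5, beta=0.1)
```

A parameter study like the paper's would wrap this in a grid over (K, alpha, beta) and score each configuration with a fit measure such as perplexity or topic coherence.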
Using online linear classifiers to filter spam Emails
The performance of two online linear classifiers, the Perceptron and Littlestone's Winnow, is explored on two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study performance for varying numbers of features, along with three feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. Both online classifiers are also demonstrated to perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexity of these two classifiers is very low, and they are easily updated adaptively. They outperform most published results while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
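The two update rules compared above differ in how they respond to mistakes: the Perceptron adjusts weights additively, while Winnow adjusts them multiplicatively (promotion/demotion), which is what makes both cheap to update online as new mail arrives. A minimal sketch over binary word-presence feature vectors; the feature names and message stream are invented toy data.

```python
def perceptron_update(w, x, y, lr=1.0):
    """Additive mistake-driven update; y in {-1, +1}, x a binary feature vector."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1 if score >= 0 else -1
    if pred != y:
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def winnow_update(w, x, y, theta, alpha=2.0):
    """Multiplicative update (Littlestone's Winnow); y in {0, 1}."""
    pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
    if pred == 0 and y == 1:      # false negative: promote active features
        w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
    elif pred == 1 and y == 0:    # false positive: demote active features
        w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w

# Toy stream: features = presence of the words ["free", "winner", "meeting"]
n = 3
w_p = [0.0] * n
w_w = [1.0] * n  # Winnow starts from uniform positive weights
stream = [([1, 1, 0], 1), ([0, 0, 1], 0), ([1, 0, 0], 1)]
for x, is_spam in stream:
    w_p = perceptron_update(w_p, x, 1 if is_spam else -1)
    w_w = winnow_update(w_w, x, is_spam, theta=n / 2)
```

Both updates touch only the features present in the current message, so the per-example cost is linear in the number of active features, consistent with the low complexity the abstract emphasises.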
Extracting Business Intelligence from Online Product Reviews: An Experiment of Automatic Rule-Induction
Online product reviews are a major source of business intelligence (BI) that helps managers and market researchers make important decisions on product development and promotion. However, the large volume of online product review data creates significant information overload problems, making it difficult to analyze users' concerns. In this paper, we employ a design science paradigm to develop a new framework for designing BI systems that correlate the textual content and the numerical ratings of online product reviews. Based on the framework, we developed a prototype for extracting the relationship between user ratings and the textual comments posted on Amazon.com's Web site. Two data mining algorithms were implemented to automatically extract decision rules that guide the understanding of this relationship. We report on experimental results of using the prototype to extract rules from online reviews of three products and discuss the managerial implications.
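The rule-extraction step can be sketched generically: treat the words of each review as features and the star rating (binned into high/low) as the class, then keep word-to-class associations whose support and confidence clear a threshold. This is a simple association-rule stand-in for the paper's two rule-induction algorithms, which are not named in the abstract; the reviews and thresholds below are invented.

```python
from collections import defaultdict

def induce_rules(reviews, min_support=2, min_confidence=0.8):
    """reviews: list of (text, rating_class); returns rules {word: rating_class}."""
    stats = defaultdict(lambda: defaultdict(int))  # word -> class -> review count
    for text, label in reviews:
        for w in set(text.lower().split()):       # count each word once per review
            stats[w][label] += 1
    rules = {}
    for w, by_label in stats.items():
        total = sum(by_label.values())
        label, count = max(by_label.items(), key=lambda kv: kv[1])
        if count >= min_support and count / total >= min_confidence:
            rules[w] = label  # rule: "review mentions w" => rating_class
    return rules

# Invented reviews with ratings pre-binned into "low"/"high"
reviews = [
    ("battery died fast terrible", "low"),
    ("terrible screen broke", "low"),
    ("great battery great sound", "high"),
    ("great camera love it", "high"),
]
rules = induce_rules(reviews)
```

Words that appear in both high- and low-rated reviews (here "battery") fail the confidence test and produce no rule, which is the sense in which the extracted rules correlate text with ratings.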
A New Approach of Rough Set Theory for Feature Selection and Bayes Net Classifier Applied on Heart Disease Dataset
In this paper a new approach to rough set feature selection is proposed. Feature selection is used for several reasons: (a) to decrease prediction time, (b) a feature may be unavailable, and (c) the presence of a feature may degrade prediction accuracy. Rough set theory is used to select the most significant features, and the proposed method, which operates on an encoding of the original data, is applied to a heart disease dataset. The main problem is to predict whether a patient has heart disease from the given features; this is challenging because the decision cannot be determined directly. The modified rough set algorithm yields the most important attributes for prediction by discarding unnecessary and uninformative features. A Bayes net is used as the classification method, with 10-fold cross-validation for evaluation. The correctly classified instances were 82.17%, 83.49%, and 74.58% when using the full attribute set, 12 attributes, and 7 attributes, respectively. When the traditional rough set algorithm was applied, the correctly classified instances were 58.41% and 81.51% when using 2 and 12 attributes, respectively.
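The rough-set selection step can be sketched with the standard dependency-degree criterion: a subset of attributes R "determines" an object if every object sharing its R-values has the same decision, and the greedy QuickReduct-style loop below adds whichever attribute most increases that dependency degree gamma. This is the textbook formulation, not necessarily the paper's modified algorithm, and the discretised records are invented toys, not the heart disease data.

```python
from collections import defaultdict

def gamma(rows, decisions, attrs):
    """Dependency degree: fraction of rows whose attrs-values determine the decision."""
    groups = defaultdict(set)  # attrs-value tuple -> set of decisions seen
    for row, d in zip(rows, decisions):
        groups[tuple(row[a] for a in attrs)].add(d)
    consistent = sum(
        1 for row, d in zip(rows, decisions)
        if len(groups[tuple(row[a] for a in attrs)]) == 1
    )
    return consistent / len(rows)

def quick_reduct(rows, decisions):
    """Greedy forward selection until the reduct determines decisions as well
    as the full attribute set does."""
    all_attrs = list(range(len(rows[0])))
    target = gamma(rows, decisions, all_attrs)
    reduct = []
    while gamma(rows, decisions, reduct) < target:
        best = max((a for a in all_attrs if a not in reduct),
                   key=lambda a: gamma(rows, decisions, reduct + [a]))
        reduct.append(best)
    return reduct

# Invented discretised records: (chest_pain, high_bp, age_band) -> disease?
rows = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1), (1, 1, 1)]
decisions = [1, 1, 0, 0, 1]
reduct = quick_reduct(rows, decisions)
```

In this toy table the first attribute alone already separates the two decision classes, so the reduct has length one; the selected attribute indices would then be the feature subset handed to the Bayes net classifier.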