9,428 research outputs found

    Predicting Good Configurations for GitHub and Stack Overflow Topic Models

    Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can reliably predict good configurations for unseen corpora. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories. (Comment: to appear as a full paper at MSR 2019, the 16th International Conference on Mining Software Repositories.)
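    The abstract's central point — that LDA's model fit depends heavily on hyperparameters such as the number of topics and the Dirichlet priors — can be illustrated with a minimal collapsed Gibbs sampler. This is a plain-Python sketch, not the tooling used in the paper; the toy corpus and the `alpha`/`beta` values are illustrative assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha, beta, n_iter=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token lists. alpha: doc-topic prior, beta: topic-word prior.
    Returns per-document topic counts (rows sum to document length)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    # z[d][i]: current topic assignment of token i in document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional: p(k|rest) ∝ (ndk+alpha)*(nkw+beta)/(nk+V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk

docs = [["java", "code", "bug"], ["java", "code"],
        ["spam", "mail", "filter"], ["spam", "mail"]]
topic_counts = lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01)
```

    Grid-searching `n_topics`, `alpha`, and `beta` against a held-out fit metric on such a sampler is the kind of per-corpus configuration search the paper studies at scale.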

    Using online linear classifiers to filter spam Emails

    The performance of two online linear classifiers, the Perceptron and Littlestone's Winnow, is explored for two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF), and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. Both online classifiers also perform much better than a standard Naïve Bayes method. The theoretical and implementation complexity of these two classifiers is very low, and they are easily updated adaptively. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
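    The two online updates the study compares are simple enough to sketch directly: an additive, mistake-driven Perceptron over sparse bag-of-words vectors, and a multiplicative Winnow update over binary features. This is a plain-Python sketch; the toy vocabulary, learning rate, threshold, and promotion factor are illustrative assumptions, not the paper's experimental setup.

```python
def perceptron_update(w, x, y, lr=1.0):
    """One Perceptron step. x: sparse dict feature->value, y in {-1, +1}.
    Weights change only on a mistake (y * score <= 0)."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    if y * score <= 0:
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + lr * y * v
    return w

def winnow_update(w, x, y, theta, promotion=2.0):
    """One Winnow step. x: set of active features, y in {0, 1}.
    Predict 1 if the summed weights reach threshold theta; on a mistake,
    multiplicatively promote (y=1) or demote (y=0) the active weights."""
    score = sum(w.get(f, 1.0) for f in x)
    pred = 1 if score >= theta else 0
    if pred != y:
        factor = promotion if y == 1 else 1.0 / promotion
        for f in x:
            w[f] = w.get(f, 1.0) * factor
    return w

# Toy online training run: "viagra"/"free" mark spam, "hello"/"meeting" mark ham.
w = {}
for _ in range(3):
    w = perceptron_update(w, {"viagra": 1.0, "free": 1.0}, +1)
    w = perceptron_update(w, {"hello": 1.0, "meeting": 1.0}, -1)
```

    Both updates touch weights only on a mistake, which is what makes these classifiers so cheap to train and to adapt incrementally, as the abstract notes.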

    Extracting Business Intelligence from Online Product Reviews: An Experiment of Automatic Rule-Induction

    Online product reviews are a major source of business intelligence (BI) that helps managers and market researchers make important decisions on product development and promotion. However, the large volume of online product review data creates significant information overload problems, making it difficult to analyze users' concerns. In this paper, we employ a design science paradigm to develop a new framework for designing BI systems that correlate the textual content and the numerical ratings of online product reviews. Based on the framework, we developed a prototype for extracting the relationship between user ratings and the textual comments posted on Amazon.com's Web site. Two data mining algorithms were implemented to automatically extract decision rules that guide the understanding of this relationship. We report on experimental results of using the prototype to extract rules from online reviews of three products and discuss the managerial implications.
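    As a toy version of the rule-induction step such a prototype performs, the OneR-style learner below induces a single word-presence rule that best separates rating classes. This is a hedged plain-Python sketch: the paper does not name its two algorithms, and the example reviews and ratings are invented.

```python
from collections import Counter

def one_rule(reviews, ratings):
    """OneR-style induction over word presence: for each word, predict the
    majority rating among reviews that contain it (and among those that
    don't), then keep the word whose rule makes the fewest training errors.
    reviews: list of token lists; ratings: parallel list of rating labels.
    Returns (word, {"present": label, "absent": label}, error_count)."""
    best = None
    for word in {w for r in reviews for w in r}:
        with_w = [y for r, y in zip(reviews, ratings) if word in r]
        without = [y for r, y in zip(reviews, ratings) if word not in r]
        rule, errors = {}, 0
        for cond, ys in (("present", with_w), ("absent", without)):
            if ys:
                majority, count = Counter(ys).most_common(1)[0]
                rule[cond] = majority
                errors += len(ys) - count   # training errors under this rule
        if best is None or errors < best[2]:
            best = (word, rule, errors)
    return best

# Invented example: reviews mentioning "broken" get low ratings.
reviews = [["broken", "screen"], ["broken", "battery"],
           ["great", "screen"], ["nice", "battery"]]
ratings = ["low", "low", "high", "high"]
word, rule, errors = one_rule(reviews, ratings)
```

    A real system would rank many such rules (or grow multi-condition ones), but even this single-attribute form shows how a textual feature can be tied to a numerical rating class.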

    A New Approach of Rough Set Theory for Feature Selection and Bayes Net Classifier Applied on Heart Disease Dataset

    In this paper, a new approach to rough set feature selection is proposed. Feature selection is used for several reasons: (a) it reduces prediction time, (b) a feature may be unavailable, and (c) the presence of an irrelevant feature can degrade prediction accuracy. Rough set theory is used to select the most significant features, and the proposed approach is applied to a heart disease dataset. The main problem is to predict whether a patient has heart disease from the given features; this is challenging because the decision cannot be determined directly. The proposed method relies on encoding the original data, and the rough set algorithm is modified to select the most important attributes for prediction by discarding unnecessary and uninformative features. A Bayes net is used as the classifier, evaluated with 10-fold cross-validation. The correctly classified instances were 82.17, 83.49, and 74.58 when using the full attribute set, 12 attributes, and 7 attributes, respectively. With the traditional rough set algorithm, the correctly classified instances were 58.41 and 81.51 when using 2 and 12 attributes, respectively.
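    The core rough set machinery this abstract relies on — indiscernibility classes, the positive region, and attribute reduction — can be sketched in a few lines. Below is a simplified greedy reduct search in plain Python; the toy heart-disease rows and attribute names are illustrative assumptions, not the paper's dataset.

```python
from itertools import groupby

def partition(rows, attrs, ):
    """Indiscernibility classes: group row indices by their values on attrs."""
    key = lambda i: tuple(rows[i][a] for a in attrs)
    idx = sorted(range(len(rows)), key=key)
    return [set(g) for _, g in groupby(idx, key=key)]

def positive_region(rows, attrs, decision):
    """Rows whose indiscernibility class is consistent on the decision."""
    pos = set()
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            pos |= block
    return pos

def greedy_reduct(rows, attrs, decision):
    """Greedily add the attribute that most enlarges the positive region,
    until it matches the positive region of the full attribute set."""
    full = positive_region(rows, attrs, decision)
    chosen = []
    while positive_region(rows, chosen, decision) != full:
        best = max((a for a in attrs if a not in chosen),
                   key=lambda a: len(positive_region(rows, chosen + [a], decision)))
        chosen.append(best)
    return chosen

# Toy rows: blood pressure alone already determines the decision here.
rows = [
    {"bp": "high", "chol": "high", "age": "old",   "disease": "yes"},
    {"bp": "high", "chol": "low",  "age": "old",   "disease": "yes"},
    {"bp": "low",  "chol": "high", "age": "old",   "disease": "no"},
    {"bp": "low",  "chol": "low",  "age": "young", "disease": "no"},
]
reduct = greedy_reduct(rows, ["bp", "chol", "age"], "disease")
```

    On real data the reduct is rarely a single attribute; the paper's 12- and 7-attribute subsets correspond to exactly this kind of reduced attribute set fed into the downstream Bayes net classifier.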