115,679 research outputs found

    Dropout Sampling for Robust Object Detection in Open-Set Conditions

    Dropout Variational Inference, or Dropout Sampling, has been recently proposed as an approximation technique for Bayesian Deep Learning and evaluated for image classification and regression tasks. This paper investigates the utility of Dropout Sampling for object detection for the first time. We demonstrate how label uncertainty can be extracted from a state-of-the-art object detection system via Dropout Sampling. We evaluate this approach on a large synthetic dataset of 30,000 images, and a real-world dataset captured by a mobile robot in a versatile campus environment. We show that this uncertainty can be utilized to increase object detection performance under the open-set conditions that are typically encountered in robotic vision. A Dropout Sampling network is shown to achieve a 12.3% increase in recall (for the same precision score as a standard network) and a 15.1% increase in precision (for the same recall score as the standard network). Comment: to appear in IEEE International Conference on Robotics and Automation 2018 (ICRA 2018).
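
    A minimal sketch in Python of the underlying idea, Monte Carlo dropout sampling, is shown below. It uses a toy classifier rather than the paper's SSD-based detector, and all layer sizes and names are illustrative assumptions: dropout stays active at inference, and the mean and spread of the softmax scores over repeated stochastic forward passes give the label estimate and its uncertainty.

        import torch
        import torch.nn as nn

        class DropoutClassifier(nn.Module):
            """Toy classifier with a dropout layer (illustrative sizes)."""
            def __init__(self, in_dim=128, n_classes=10, p=0.5):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(p),
                    nn.Linear(256, n_classes),
                )

            def forward(self, x):
                return self.net(x)

        def mc_dropout_predict(model, x, n_samples=20):
            """Average softmax scores over n_samples stochastic passes."""
            model.eval()
            for m in model.modules():          # re-enable dropout only
                if isinstance(m, nn.Dropout):
                    m.train()
            with torch.no_grad():
                probs = torch.stack([torch.softmax(model(x), dim=-1)
                                     for _ in range(n_samples)])
            return probs.mean(dim=0), probs.std(dim=0)  # score, uncertainty

        model = DropoutClassifier()
        mean_p, std_p = mc_dropout_predict(model, torch.randn(4, 128))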

    COMPARISON OF ANN METHOD AND LOGISTIC REGRESSION METHOD ON SINGLE NUCLEOTIDE POLYMORPHISM GENETIC DATA

    This study aims to evaluate the classification performance of the ANN method on Asthma genetic data from the R package SNPassoc. The SNP genotypes were transformed using codominant coding: for an A/C locus the genotypes AA, AC, and CC were scored 0, 0.5, and 1, respectively, and for a C/T locus the genotypes CC, CT, and TT were likewise scored 0, 0.5, and 1, with the genotype that comes first alphabetically receiving the lowest score. The average accuracy, precision, recall, and F1 score of the neural network method were determined for test-data proportions of 10%, 20%, 30%, and 40%, each repeated B = 1000 times, and the results were compared with the logistic regression method. With 20% test data, the ANN method achieved accuracy, precision, recall, and F1 scores of 0.7756, 0.7844, 0.9844, and 0.8728, respectively. When the pooled Asthma genetic data from all countries were used, the logistic regression method gave higher average accuracy, precision, and F1 scores than the ANN method, but the ANN method gave higher average recall. When a separate analysis was performed for each country, the ANN method gave higher accuracy, precision, recall, and F1 scores than the logistic regression method.
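
    A minimal sketch of the codominant scoring described above, assuming genotypes arrive as two-letter strings in a pandas DataFrame; the homozygote that comes first alphabetically gets 0, the heterozygote 0.5, and the remaining homozygote 1. Column names and example data are illustrative, not taken from the SNPassoc Asthma data.

        import pandas as pd

        def score_genotypes(column: pd.Series) -> pd.Series:
            """Map a SNP column such as AA/AC/CC (or CC/CT/TT) to 0 / 0.5 / 1."""
            levels = sorted(column.dropna().unique())   # alphabetical order
            return column.map(dict(zip(levels, [0.0, 0.5, 1.0])))

        snps = pd.DataFrame({"snp1": ["AA", "AC", "CC", "AC"],
                             "snp2": ["CC", "CT", "TT", "CT"]})
        print(snps.apply(score_genotypes))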

    Predicting Diabetes and Its Types Using Data Mining

    The research problem lies in predicting diabetes, specifically using data mining to predict type 1 and type 2 diabetes. Data mining and analysis have become a widespread field of study in recent times and can be applied in various domains, where these methods extract previously unspecified data elements. The researcher studies the possibility of using data mining to predict diabetes of the first and second types and determines the appropriate prediction method, following the descriptive and analytical approach. Among the models commonly used for prediction, the decision tree and linear regression were chosen and compared on accuracy, precision, recall, and F-measure using RapidMiner. The researcher used the Pima Indians Diabetes data, which contain 769 records and 9 attributes. Executing the linear regression algorithm inside RapidMiner gave accuracy = 76.09%, precision = 79.14%, recall = 86.00%, and F-measure = 82.43%; the decision tree gave accuracy = 70.87%, precision = 71.28%, recall = 92.67%, and F-measure = 80.58%. Comparing these results, linear regression is better than the decision tree at predicting the type of diabetes. Keywords: data mining, RapidMiner, decision tree, linear regression
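
    A minimal sketch of the same kind of comparison outside RapidMiner: two classifiers scored on accuracy, precision, recall, and F-measure. scikit-learn's LogisticRegression stands in for the abstract's regression model (an assumption), and synthetic data stands in for the Pima Indians Diabetes records.

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import (accuracy_score, precision_score,
                                     recall_score, f1_score)

        # Synthetic stand-in: 769 records, 8 predictors, binary outcome.
        X, y = make_classification(n_samples=769, n_features=8, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        for name, clf in [("regression", LogisticRegression(max_iter=1000)),
                          ("decision tree", DecisionTreeClassifier(random_state=0))]:
            y_hat = clf.fit(X_tr, y_tr).predict(X_te)
            print(name,
                  f"acc={accuracy_score(y_te, y_hat):.4f}",
                  f"prec={precision_score(y_te, y_hat):.4f}",
                  f"rec={recall_score(y_te, y_hat):.4f}",
                  f"F={f1_score(y_te, y_hat):.4f}")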

    Selective Regression Testing based on Big Data: Comparing Feature Extraction Techniques

    Regression testing is a necessary activity in continuous integration (CI) since it provides confidence that modified parts of the system are correct at each integration cycle. CI provides large volumes of data which can be used to support regression testing activities. By using machine learning, patterns about faulty changes in the modified program can be induced, allowing test orchestrators to make inferences about test cases that need to be executed at each CI cycle. However, one challenge in using learning models lies in finding a suitable way of characterizing source code changes while preserving important information. In this paper, we empirically evaluate the effect of three feature extraction (FE) algorithms on the performance of an existing ML-based selective regression testing technique. We designed and performed an experiment to empirically investigate the effect of Bag of Words (BoW), Word Embeddings (WE), and content-based feature extraction (CBF). We used stratified cross-validation on the space of features generated by the three FE techniques and evaluated the performance of three machine learning models using the precision and recall metrics. The results from this experiment showed a significant difference between the models' precision and recall scores, suggesting that the BoW-fed model outperforms the other two models with respect to precision, whereas the CBF-fed model outperforms the rest with respect to recall.
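
    A minimal sketch of one arm of such an experiment: Bag-of-Words features over textual change descriptions, evaluated with stratified cross-validation on precision and recall. The toy change descriptions, labels, and classifier choice are illustrative assumptions, not the paper's setup.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import StratifiedKFold, cross_validate
        from sklearn.pipeline import make_pipeline

        changes = ["fix null check in parser", "refactor logging module",
                   "update parser grammar", "rename logging fields",
                   "fix off by one in parser", "tweak log format"]
        faulty = [1, 0, 1, 0, 1, 0]   # did the change lead to a test failure?

        pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
        scores = cross_validate(pipe, changes, faulty,
                                cv=StratifiedKFold(n_splits=2),
                                scoring=["precision", "recall"])
        print(scores["test_precision"].mean(), scores["test_recall"].mean())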

    Student performance prediction based on data mining classification techniques

    The process of predicting student performance has become a crucial factor in the academic environment and plays a significant role in producing quality graduates. Several statistical and machine learning algorithms have been proposed for analyzing, predicting, and classifying student performance. However, these classification algorithms still pose issues in terms of classification performance. This paper presents a method to predict student performance using Iterative Dichotomiser 3 (ID3), C4.5, and Classification and Regression Tree (CART). The experiment was performed on the Waikato Environment for Knowledge Analysis (Weka). The experimental results showed that ID3 achieved an accuracy of 95.9%, specificity of 95.9%, precision of 95.9%, recall of 95.9%, F-measure of 95.9%, and incorrectly classified instances of 3.83. C4.5 gave an accuracy of 98.3%, specificity of 98.3%, precision of 98.4%, recall of 98.3%, F-measure of 98.3%, and incorrectly classified instances of 1.70. CART showed an accuracy of 98.3%, specificity of 98.3%, precision of 98.4%, recall of 98.3%, F-measure of 98.3%, and incorrectly classified instances of 1.70. The time taken to build the model was 0.05 seconds for ID3, 0.03 seconds for C4.5, and 0.58 seconds for CART. The experimental results revealed that C4.5 outperforms the other classifiers and requires a reasonable amount of time to build the model. Keywords: student performance, ID3, C4.5, CART, classification, educational data mining
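
    For illustration, a minimal sketch outside Weka: scikit-learn has no ID3, C4.5, or CART implementations under those names, so an entropy-based tree stands in for the ID3/C4.5 family and a gini-based tree for CART (an assumption), with the model build time measured as in the abstract. The built-in iris data stands in for the student records.

        import time
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import classification_report

        X, y = load_iris(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        for crit in ("entropy", "gini"):
            clf = DecisionTreeClassifier(criterion=crit, random_state=0)
            t0 = time.perf_counter()
            clf.fit(X_tr, y_tr)
            print(f"{crit}: model built in {time.perf_counter() - t0:.3f}s")
            print(classification_report(y_te, clf.predict(X_te)))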

    Assessing the predictive ability of the Suicide Crisis Inventory for near-term suicidal behavior using machine learning approaches

    OBJECTIVE: This study explores the prediction of near-term suicidal behavior using machine learning (ML) analyses of the Suicide Crisis Inventory (SCI), which measures the Suicide Crisis Syndrome, a presuicidal mental state. METHODS: SCI data were collected from high-risk psychiatric inpatients (N = 591) grouped based on their short-term suicidal behavior, that is, those who attempted suicide between intake and 1-month follow-up dates (N = 20) and those who did not (N = 571). Data were analyzed using three predictive algorithms (logistic regression, random forest, and gradient boosting) and three sampling approaches (split sample, synthetic minority oversampling technique [SMOTE], and enhanced bootstrap). RESULTS: The enhanced bootstrap approach considerably outperformed the other sampling approaches, with the random forest (98.0% precision; 33.9% recall; 71.0% area under the precision-recall curve [AUPRC]; and 87.8% area under the receiver operating characteristic curve [AUROC]) and gradient boosting (94.0% precision; 48.9% recall; 70.5% AUPRC; and 89.4% AUROC) algorithms performing best in predicting positive cases of near-term suicidal behavior on this dataset. CONCLUSIONS: ML can be useful in analyzing data from psychometric scales, such as the SCI, and for predicting near-term suicidal behavior. However, in cases such as the current analysis where the data are highly imbalanced, the optimal method of measuring performance must be carefully considered and selected.
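
    A minimal sketch of the evaluation side of such a study: on data as imbalanced as this sample (roughly 20 positives against 571 negatives), AUPRC and AUROC are computed from a random forest's predicted probabilities. The synthetic data and model settings are illustrative; the paper's sampling schemes (SMOTE and enhanced bootstrap) are not reproduced here.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import average_precision_score, roc_auc_score

        # Roughly 591 cases with a ~3% positive class, mirroring the imbalance.
        X, y = make_classification(n_samples=591, n_features=20,
                                   weights=[571 / 591], random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

        rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
        p = rf.predict_proba(X_te)[:, 1]
        print(f"AUPRC={average_precision_score(y_te, p):.3f} "
              f"AUROC={roc_auc_score(y_te, p):.3f}")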

    Exploring Public Sentiment: A Sentiment Analysis of GST Discourse on Twitter using Supervised Machine Learning Classifiers

    India's introduction of the Goods and Services Tax (GST) was a key economic move that generated heated debate. Social media channels offered a widely used forum for people to express their views on the GST, providing insightful data for gauging public mood and guiding subsequent revisions. The sentiment of 5629 GST-related tweets, obtained using the Twitter Developer API, was assessed using the VADER lexicon. Tf-idf features were used for text vectorization, with 80% of the data going toward training and the remaining 20% toward testing. In this study, six well-known classifiers (Ridge Classifier, Logistic Regression, Linear SVC, Perceptron, Decision Tree, and K-Nearest Neighbor) were thoroughly compared to evaluate their performance in a range of circumstances. The performance measurements included accuracy, precision, recall, F-score, and training and testing times. The study introduced novel pre-processing methods, examined the training and testing times, and concluded that the Ridge Classifier outperformed the others in terms of accuracy, precision, and efficiency.
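
    A minimal sketch of the pipeline described above: tweets with pre-assigned sentiment labels are vectorized with tf-idf, split 80/20, and a Ridge Classifier is timed on training and prediction. The toy tweets and labels are illustrative, and the VADER labeling step is assumed to have already happened.

        import time
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import RidgeClassifier
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import classification_report

        tweets = ["gst made filing simpler", "gst rates are confusing",
                  "neutral update on gst portal", "great move with gst",
                  "gst hurts small traders", "no opinion on the new slabs"]
        labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

        X_tr, X_te, y_tr, y_te = train_test_split(tweets, labels,
                                                  test_size=0.2, random_state=0)
        vec = TfidfVectorizer()
        X_tr_v, X_te_v = vec.fit_transform(X_tr), vec.transform(X_te)

        clf = RidgeClassifier()
        t0 = time.perf_counter(); clf.fit(X_tr_v, y_tr)
        t1 = time.perf_counter(); y_hat = clf.predict(X_te_v)
        t2 = time.perf_counter()
        print(f"train {t1 - t0:.4f}s, test {t2 - t1:.4f}s")
        print(classification_report(y_te, y_hat, zero_division=0))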

    Application of Machine Learning in Cancer Research

    This dissertation revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. This work is distinguished from past experiments by the size of the training data, the unbalanced distribution of data across minority and majority classes, and modified data cleaning procedures. The experiments also follow the principles of TIDY data and reproducible research. To fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression. Of particular interest were strategies to improve the recall level for the minority class, as the cost of misclassification is prohibitive. One of the main contributions of this work is showing that logistic regression with the proper predictors and class weights gives the highest precision/recall level for the minority class. In regression modeling with a large number of predictors, correlation among predictors is quite common, and the estimated model coefficients might not be very reliable. In these situations, the Variance Inflation Factor (VIF) and the Generalized Variance Inflation Factor (GVIF) are used to overcome the correlation problem. Our experiments are based on the Surveillance, Epidemiology, and End Results (SEER) database for the problem of survivability prediction. Some of the specific contributions of this thesis are:
    · a detailed process for data cleaning and binary classification of 338,596 breast cancer patients;
    · a computational approach for omitting predictors (including categorical predictors) based on VIF and GVIF;
    · various applications of the Synthetic Minority Over-sampling Technique (SMOTE) to increase precision and recall;
    · an application of Edited Nearest Neighbor to obtain the highest F1-measure.
    In addition, this work provides precise algorithms and code for determining class membership and executing competing methods; see the sketch of the VIF step below. The code can facilitate the reproduction and extension of our work by other researchers.
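
    A minimal sketch of the VIF screening step mentioned above: statsmodels' variance_inflation_factor is computed for each predictor, and columns above a common rule-of-thumb cutoff of 10 (an assumption, not the dissertation's threshold) are flagged for omission. The synthetic correlated predictors stand in for the SEER variables.

        import numpy as np
        import pandas as pd
        from statsmodels.stats.outliers_influence import variance_inflation_factor
        from statsmodels.tools import add_constant

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=500)
        df = pd.DataFrame({"x1": x1,
                           "x2": 0.95 * x1 + rng.normal(scale=0.1, size=500),
                           "x3": rng.normal(size=500)})

        X = add_constant(df)
        vifs = {col: variance_inflation_factor(X.values, i)
                for i, col in enumerate(X.columns) if col != "const"}
        print(vifs)                                   # x1 and x2 are inflated
        print("flag for omission:", [c for c, v in vifs.items() if v > 10])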

    Applying CHAID for logistic regression diagnostics and classification accuracy improvement

    In this study, a CHAID-based approach to detecting classification accuracy heterogeneity across segments of observations is proposed. This helps to solve some important problems facing a model-builder: 1. How to automatically detect segments in which the model significantly underperforms? 2. How to incorporate knowledge about classification accuracy heterogeneity across segments when partitioning observations, in order to achieve better predictive accuracy? The approach was applied to churn data from the UCI Repository of Machine Learning Databases. By splitting the dataset into 4 parts based on the decision tree and building a separate logistic regression scoring model for each segment, we increased accuracy by more than 7 percentage points on the test sample. Significant increases in recall and precision were also observed. It was shown that different segments may have completely different churn predictors; therefore, such a partitioning gives a better insight into the factors influencing customer behavior. Keywords: CHAID, logistic regression, churn prediction, performance improvement, segmentwise prediction, decision tree, classification tree
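
    A minimal sketch of segmentwise scoring in this spirit: scikit-learn has no CHAID, so a shallow decision tree stands in as the segmenter (an assumption), the data are carved into four leaf segments, and a separate logistic regression is fitted per segment, mirroring the four-part split described above.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

        # Stand-in for CHAID: a tree restricted to four leaves defines segments.
        segmenter = DecisionTreeClassifier(max_leaf_nodes=4, min_samples_leaf=200,
                                           random_state=0).fit(X, y)
        leaf = segmenter.apply(X)                     # leaf id = segment id

        models = {s: LogisticRegression(max_iter=1000).fit(X[leaf == s],
                                                           y[leaf == s])
                  for s in np.unique(leaf)}

        def predict(X_new):
            leaves = segmenter.apply(X_new)
            return np.array([models[s].predict(row.reshape(1, -1))[0]
                             for s, row in zip(leaves, X_new)])

        print((predict(X) == y).mean())   # in-sample accuracy of stitched model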