
    CLASSIFICATION OF ANDROID MALWARE USING THE CATBOOST ALGORITHM

    Introduced in 2008, Android became a popular open-source project due to its customizability and low hardware requirements. Mid-2021 statistics from StatCounter GlobalStats show that Android dominates the mobile operating system market with a 72.74% share. Despite its popularity, Android has become a target for malware attacks in the context of cybercrime. This motivated the present research, which aims to identify and classify continuously evolving Android malware by applying machine learning, in particular the CatBoost method. This method was chosen based on its effectiveness in previous research, where it has been shown to provide high accuracy. The performance evaluation compares CatBoost against methods from previous studies, including KNN (K-Nearest Neighbors), SVM (Support Vector Machine), LR (Logistic Regression), RF (Random Forest), ET (Extra Trees), XG (XGBoost), AB (AdaBoost), and BG (Bagging), using common metrics such as validation accuracy, detection accuracy, and F1-score. The results show that CatBoost achieved a validation accuracy of 96.66%, a detection accuracy of 96.87%, and an F1-score of 96.81%, placing it in a competitive position against most other methods, with the exception of RF (Random Forest). CatBoost's consistent performance in this comparison shows its potential as an effective solution for Android malware detection and classification.
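    A minimal sketch of the kind of evaluation pipeline the abstract describes might look as follows; the synthetic data is a hypothetical stand-in for real Android malware features (e.g. permissions or API-call counts), not the authors' dataset:

```python
# Sketch of CatBoost-based binary classification with the metrics the
# paper reports (accuracy and F1-score). Synthetic data is used here as
# a placeholder for an Android malware feature matrix.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for extracted malware/benign app features.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = CatBoostClassifier(iterations=500, learning_rate=0.1,
                           depth=6, verbose=False)
model.fit(X_train, y_train, eval_set=(X_test, y_test))

y_pred = model.predict(X_test)
print("detection accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
```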

    An analysis of feature relevance in the classification of astronomical transients with machine learning methods

    The exploitation of present and future synoptic (multi-band and multi-epoch) surveys requires extensive use of automatic methods for data processing and data interpretation. In this work, using data extracted from the Catalina Real-Time Transient Survey (CRTS), we investigate the classification performance of several well-tested methods: Random Forest, MLPQNA (Multi Layer Perceptron with Quasi Newton Algorithm), and K-Nearest Neighbors, paying special attention to the feature selection phase. To this end, several classification experiments were performed: identification of cataclysmic variables, separation between galactic and extragalactic objects, and identification of supernovae. (Comment: Accepted by MNRAS, 11 figures, 18 pages)
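    As an illustration of the feature-relevance step highlighted in the abstract (not the paper's actual pipeline; synthetic data stands in for the CRTS light-curve features), one of the compared methods, Random Forest, provides impurity-based importances that can rank features directly:

```python
# Sketch of feature-relevance ranking with a Random Forest, one of the
# classifiers the paper compares. Synthetic data replaces the CRTS
# features used in the actual study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Rank features by mean decrease in impurity and report the top five.
ranking = np.argsort(rf.feature_importances_)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: importance {rf.feature_importances_[idx]:.3f}")
```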

    MEBoost: Mixing Estimators with Boosting for Imbalanced Data Classification

    The class imbalance problem has been a challenging research problem in the fields of machine learning and data mining, as most real-life datasets are imbalanced. Several existing machine learning algorithms try to maximize classification accuracy by correctly identifying majority class samples while ignoring the minority class. However, the minority class instances usually represent a higher interest than the majority class. Recently, several cost-sensitive methods, ensemble models, and sampling techniques have been used in the literature to classify imbalanced datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve performance on imbalanced datasets. MEBoost is an alternative to existing techniques such as SMOTEBoost, RUSBoost, AdaBoost, etc. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets against state-of-the-art ensemble methods such as SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, and DataBoost. Experimental results show a significant improvement over the other methods, and it can be concluded that MEBoost is an effective and promising algorithm for dealing with imbalanced datasets. A Python version of the code is available at: https://github.com/farshidrayhanuiu/ (Comment: SKIMA-2017)
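    A simplified sketch of the mixing idea follows: alternating two different weak-learner families inside an AdaBoost-style reweighting loop. This is an illustration under assumed details (decision stumps alternating with extra-tree stumps, exponential reweighting), not the authors' exact MEBoost algorithm:

```python
# Illustrative "mixed estimators with boosting" loop: an AdaBoost-style
# reweighting scheme that alternates two weak-learner families.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

def mixed_boost(X, y, n_rounds=10):
    """Boosting loop alternating two weak learners; y assumed in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # uniform sample weights
    learners, alphas = [], []
    for t in range(n_rounds):
        # Alternate between a decision stump and an extra-tree stump.
        cls = DecisionTreeClassifier if t % 2 == 0 else ExtraTreeClassifier
        h = cls(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))            # weighted training error
        if err >= 0.5:                           # no better than chance
            continue
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def boosted_predict(learners, alphas, X):
    score = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(score)

# Demo on a synthetic imbalanced dataset (roughly a 90/10 class split).
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
y = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
learners, alphas = mixed_boost(X, y)
print("train error:", np.mean(boosted_predict(learners, alphas, X) != y))
```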

    On PAC-Bayesian Bounds for Random Forests

    Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the effect of the errors of individual classifiers averaging out when the majority vote is taken. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach, based on PAC-Bayesian C-bounds, takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set, which comes at the cost of a smaller training set, gave better performance guarantees but worse performance in most experiments.
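    The first approach lends itself to a quick empirical check. The sketch below (an illustration on synthetic data, not the paper's bound computation) estimates the Gibbs error and the majority-vote error from out-of-bag predictions of a hand-rolled bagged forest, and prints the first-order relation of twice the Gibbs error:

```python
# Empirical sketch of the first-order relation the abstract describes:
# the majority-vote error is bounded by twice the Gibbs error (the
# average error of a single randomly drawn ensemble member), estimated
# here from out-of-bag (OOB) predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
n, n_trees = len(y), 100

votes = np.zeros((n_trees, n))           # per-tree predictions on all points
oob = np.zeros((n_trees, n), dtype=bool)  # which points each tree never saw
for t in range(n_trees):
    idx = rng.integers(0, n, n)           # bootstrap sample (with replacement)
    oob[t] = ~np.isin(np.arange(n), idx)
    votes[t] = DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X)

# Gibbs error: average per-tree error on that tree's own OOB points.
gibbs = np.mean([np.mean(votes[t, oob[t]] != y[oob[t]])
                 for t in range(n_trees)])

# Majority-vote OOB error: each point is voted on only by the trees that
# did not see it during training.
mv_pred = np.array([np.round(np.mean(votes[oob[:, i], i]))
                    if oob[:, i].any() else 0 for i in range(n)])
mv = np.mean(mv_pred != y)

print(f"Gibbs OOB error: {gibbs:.3f}, majority-vote OOB error: {mv:.3f}")
print(f"first-order bound (2 * Gibbs): {2 * gibbs:.3f}")
```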