ANDROID MALWARE CLASSIFICATION USING THE CATBOOST ALGORITHM
In 2008, Android was introduced as a popular open-source project owing to its customizability and low hardware requirements. Mid-2021 statistics from StatCounter GlobalStats show that Android dominates the mobile operating system market with a 72.74% share. Despite its popularity, Android has become a target for malware attacks in the context of cybercrime. This problem prompted this research, which aims to identify and classify continuously evolving Android malware by applying machine learning, specifically the CatBoost method. This method was chosen based on its effectiveness in previous research, where it has been shown to provide high accuracy. The performance evaluation compares CatBoost against several methods from previous work, including KNN (K-Nearest Neighbors), SVM (Support Vector Machine), LR (Logistic Regression), RF (Random Forest), ET (Extra Trees), XG (XGBoost), AB (AdaBoost), and BG (Bagging), using common metrics such as Validation Accuracy, Detection Accuracy, and F1-Score. The results show that CatBoost achieved a Validation Accuracy of 96.66%, a Detection Accuracy of 96.87%, and an F1-Score of 96.81%, placing it in a competitive position with most other methods, except RF (Random Forest). CatBoost's consistent performance in this comparison shows its potential as an effective and reliable solution for Android malware detection and classification.
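The abstract does not reproduce the evaluation pipeline itself; a minimal sketch of how such a comparison is typically run, assuming a binary malware/benign label and a synthetic stand-in for the real feature matrix, might look like this (for brevity the sketch reuses the held-out split as the training-time eval set):

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an Android feature matrix (e.g. permission flags);
# the paper's actual dataset is not reproduced here.
X, y = make_classification(n_samples=2000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# eval_set yields a validation score during training (the abstract's
# "Validation Accuracy"); the held-out split plays the role of the
# "Detection Accuracy" measurement.
model = CatBoostClassifier(iterations=300, random_seed=42, verbose=False)
model.fit(X_train, y_train, eval_set=(X_test, y_test))

y_pred = model.predict(X_test)
print("Detection Accuracy:", accuracy_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
```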
An analysis of feature relevance in the classification of astronomical transients with machine learning methods
The exploitation of present and future synoptic (multi-band and multi-epoch)
surveys requires extensive use of automatic methods for data processing and
data interpretation. In this work, using data extracted from the Catalina Real
Time Transient Survey (CRTS), we investigate the classification performance of
some well tested methods: Random Forest, MLPQNA (Multi Layer Perceptron with
Quasi Newton Algorithm) and K-Nearest Neighbors, paying special attention to
the feature selection phase. To this end, several classification experiments were performed: identification of cataclysmic variables, separation between galactic and extra-galactic objects, and identification of supernovae. Comment: Accepted by MNRAS, 11 figures, 18 pages
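The abstract emphasizes the feature selection phase; one common way to run it with a Random Forest, ranking features by impurity-based importances, is sketched below on synthetic placeholder data (the real CRTS light-curve features are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for features extracted from CRTS light curves.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Rank features by the forest's impurity-based importances and keep the
# strongest ones -- one standard realization of a feature-selection phase.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
top_k = ranking[:10]
X_selected = X[:, top_k]
print("Selected feature indices:", top_k)
```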
MEBoost: Mixing Estimators with Boosting for Imbalanced Data Classification
The class imbalance problem has been a challenging research problem in the fields of machine learning and data mining, as most real-life datasets are imbalanced. Several existing machine learning algorithms try to maximize classification accuracy by correctly identifying majority class samples while ignoring the minority class. However, the minority class instances are usually of higher interest than the majority class. Recently, several cost-sensitive methods, ensemble models, and sampling techniques have been used in the literature to classify imbalanced datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve performance on imbalanced datasets. MEBoost is an alternative to existing techniques such as SMOTEBoost, RUSBoost, and AdaBoost. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets against state-of-the-art ensemble methods such as SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, and DataBoost. Experimental results show a significant improvement over the other methods, and it can be concluded that MEBoost is an effective and promising algorithm for dealing with imbalanced datasets. The Python version of the code is available here: https://github.com/farshidrayhanuiu/ Comment: SKIMA-201
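The abstract does not spell out MEBoost's mixing rule; a minimal sketch of the general idea, alternating two weak-learner types inside an AdaBoost-style reweighting loop, is given below. Decision stumps and Gaussian naive Bayes are assumptions for illustration, not the paper's stated choice of learners:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (90% / 10% class split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)
y_pm = np.where(y == 1, 1, -1)             # labels in {-1, +1}

w = np.full(len(y), 1 / len(y))            # sample weights
learners, alphas = [], []
for t in range(20):
    # Alternate between the two weak-learner types (the "mixing" idea).
    base = DecisionTreeClassifier(max_depth=1) if t % 2 == 0 else GaussianNB()
    base.fit(X, y, sample_weight=w)
    pred = np.where(base.predict(X) == 1, 1, -1)
    err = np.clip(np.sum(w * (pred != y_pm)), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # AdaBoost learner weight
    w = w * np.exp(-alpha * y_pm * pred)   # upweight misclassified samples
    w /= w.sum()
    learners.append(base)
    alphas.append(alpha)

# Weighted-vote prediction of the mixed ensemble.
score = sum(a * np.where(m.predict(X) == 1, 1, -1)
            for a, m in zip(alphas, learners))
print("Training accuracy:", np.mean(np.sign(score) == y_pm))
```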
On PAC-Bayesian Bounds for Random Forests
Existing guarantees in terms of rigorous upper bounds on the generalization
error for the original random forest algorithm, one of the most frequently used
machine learning methods, are unsatisfying. We discuss and evaluate various
PAC-Bayesian approaches to derive such bounds. The bounds do not require
additional hold-out data, because the out-of-bag samples from the bagging in
the training process can be exploited. A random forest predicts by taking a
majority vote of an ensemble of decision trees. The first approach is to bound
the error of the vote by twice the error of the corresponding Gibbs classifier
(classifying with a single member of the ensemble selected at random). However,
this approach does not account for the effect of the errors of individual
classifiers averaging out when taking the majority vote. This effect provides a
significant boost in performance when the errors are independent or negatively
correlated, but when the correlations are strong the advantage from taking the
majority vote is small. The second approach, based on PAC-Bayesian C-bounds,
takes dependencies between ensemble members into account, but it requires
estimating correlations between the errors of the individual classifiers. When
the correlations are high or the estimation is poor, the bounds degrade. In our
experiments, we compute generalization bounds for random forests on various
benchmark data sets. Because the individual decision trees already perform
well, their predictions are highly correlated and the C-bounds do not lead to
satisfactory results. For the same reason, the bounds based on the analysis of
Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set, which comes at the cost of a smaller training set, gave better performance guarantees but worse predictive performance in most experiments.
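In notation standard for this literature (assumed here, not quoted from the paper), the first approach combines the elementary factor-of-two relation between the majority vote and the Gibbs classifier with a PAC-Bayesian bound on the Gibbs risk, for example the PAC-Bayes-kl inequality:

```latex
% Standard PAC-Bayesian notation (assumed): \rho is the posterior over
% ensemble members (uniform for a random forest), \pi a prior, n the sample
% size, and kl the binary KL divergence.
% First step: the majority vote loses at most a factor of two over Gibbs.
\[
  L(\mathrm{MV}_\rho) \le 2\, L(\mathrm{G}_\rho)
\]
% Second step: bound the Gibbs risk via PAC-Bayes-kl, which holds with
% probability at least 1 - \delta over the draw of the sample.
\[
  \mathrm{kl}\!\left(\hat{L}(\mathrm{G}_\rho) \,\middle\|\, L(\mathrm{G}_\rho)\right)
  \le \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{n}
\]
```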
