ANDROID MALWARE CLASSIFICATION USING THE CATBOOST ALGORITHM
In 2008, Android was introduced as a popular open-source project owing to its customizability and low hardware requirements. Mid-2021 statistics from StatCounter GlobalStats show that Android dominates the mobile operating system market with a 72.74% share. Despite its popularity, Android has become a target for malware attacks in the context of cybercrime. This problem prompted this research, which aims to identify and classify continuously evolving Android malware by applying machine learning, specifically the CatBoost method. This method was chosen based on its effectiveness in previous research, where it has been shown to provide high accuracy. The performance evaluation compares CatBoost against several methods from previous work, including KNN (K-Nearest Neighbors), SVM (Support Vector Machine), LR (Logistic Regression), RF (Random Forest), ET (Extra Trees), XG (XGBoost), AB (AdaBoost), and BG (Bagging), using common metrics such as Validation Accuracy, Detection Accuracy, and F1-Score. The results show that CatBoost achieved a Validation Accuracy of 96.66%, a Detection Accuracy of 96.87%, and an F1-Score of 96.81%, placing it in a competitive position with most other methods, except RF (Random Forest). CatBoost's consistent performance in this comparison shows its potential as an effective and reliable solution for Android malware detection and classification.
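The abstract does not reproduce the evaluation pipeline itself; a minimal sketch of how such a comparison is typically run, assuming a binary malware/benign label and a synthetic stand-in for the real feature matrix, might look like this (for brevity the sketch reuses the held-out split as the training-time eval set):

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an Android feature matrix (e.g. permission flags);
# the paper's actual dataset is not reproduced here.
X, y = make_classification(n_samples=2000, n_features=50, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# eval_set yields a validation score during training (the abstract's
# "Validation Accuracy"); the held-out split plays the role of the
# "Detection Accuracy" measurement.
model = CatBoostClassifier(iterations=300, random_seed=42, verbose=False)
model.fit(X_train, y_train, eval_set=(X_test, y_test))

y_pred = model.predict(X_test)
print("Detection Accuracy:", accuracy_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred))
```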
An analysis of feature relevance in the classification of astronomical transients with machine learning methods
The exploitation of present and future synoptic (multi-band and multi-epoch)
surveys requires extensive use of automatic methods for data processing and
data interpretation. In this work, using data extracted from the Catalina Real
Time Transient Survey (CRTS), we investigate the classification performance of
some well tested methods: Random Forest, MLPQNA (Multi Layer Perceptron with
Quasi Newton Algorithm) and K-Nearest Neighbors, paying special attention to
the feature selection phase. To this end, several classification experiments were performed: identification of cataclysmic variables, separation between galactic and extra-galactic objects, and identification of supernovae. Comment: Accepted by MNRAS, 11 figures, 18 pages
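The abstract emphasizes the feature selection phase; one common way to run it with a Random Forest, ranking features by impurity-based importances, is sketched below on synthetic placeholder data (the real CRTS light-curve features are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for features extracted from CRTS light curves.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Rank features by the forest's impurity-based importances and keep the
# strongest ones -- one standard realization of a feature-selection phase.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
top_k = ranking[:10]
X_selected = X[:, top_k]
print("Selected feature indices:", top_k)
```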
MEBoost: Mixing Estimators with Boosting for Imbalanced Data Classification
The class imbalance problem has been a challenging research problem in the fields of machine learning and data mining, as most real-life datasets are imbalanced. Several existing machine learning algorithms try to maximize classification accuracy by correctly identifying majority class samples while ignoring the minority class. However, the minority class instances are usually of higher interest than the majority class. Recently, several cost-sensitive methods, ensemble models, and sampling techniques have been used in the literature to classify imbalanced datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve performance on imbalanced datasets. MEBoost is an alternative to existing techniques such as SMOTEBoost, RUSBoost, and AdaBoost. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets against state-of-the-art ensemble methods such as SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, and DataBoost. Experimental results show a significant improvement over the other methods, and it can be concluded that MEBoost is an effective and promising algorithm for dealing with imbalanced datasets. The Python version of the code is available here: https://github.com/farshidrayhanuiu/ Comment: SKIMA-201
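The abstract does not spell out MEBoost's mixing rule; a minimal sketch of the general idea, alternating two weak-learner types inside an AdaBoost-style reweighting loop, is given below. Decision stumps and Gaussian naive Bayes are assumptions for illustration, not the paper's stated choice of learners:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (90% / 10% class split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)
y_pm = np.where(y == 1, 1, -1)             # labels in {-1, +1}

w = np.full(len(y), 1 / len(y))            # sample weights
learners, alphas = [], []
for t in range(20):
    # Alternate between the two weak-learner types (the "mixing" idea).
    base = DecisionTreeClassifier(max_depth=1) if t % 2 == 0 else GaussianNB()
    base.fit(X, y, sample_weight=w)
    pred = np.where(base.predict(X) == 1, 1, -1)
    err = np.clip(np.sum(w * (pred != y_pm)), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # AdaBoost learner weight
    w = w * np.exp(-alpha * y_pm * pred)   # upweight misclassified samples
    w /= w.sum()
    learners.append(base)
    alphas.append(alpha)

# Weighted-vote prediction of the mixed ensemble.
score = sum(a * np.where(m.predict(X) == 1, 1, -1)
            for a, m in zip(alphas, learners))
print("Training accuracy:", np.mean(np.sign(score) == y_pm))
```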
On PAC-Bayesian Bounds for Random Forests
Existing guarantees in terms of rigorous upper bounds on the generalization
error for the original random forest algorithm, one of the most frequently used
machine learning methods, are unsatisfying. We discuss and evaluate various
PAC-Bayesian approaches to derive such bounds. The bounds do not require
additional hold-out data, because the out-of-bag samples from the bagging in
the training process can be exploited. A random forest predicts by taking a
majority vote of an ensemble of decision trees. The first approach is to bound
the error of the vote by twice the error of the corresponding Gibbs classifier
(classifying with a single member of the ensemble selected at random). However,
this approach does not account for the effect of the errors of individual
classifiers averaging out when taking the majority vote. This effect provides a
significant boost in performance when the errors are independent or negatively
correlated, but when the correlations are strong the advantage from taking the
majority vote is small. The second approach, based on PAC-Bayesian C-bounds,
takes dependencies between ensemble members into account, but it requires
estimating correlations between the errors of the individual classifiers. When
the correlations are high or the estimation is poor, the bounds degrade. In our
experiments, we compute generalization bounds for random forests on various
benchmark data sets. Because the individual decision trees already perform
well, their predictions are highly correlated and the C-bounds do not lead to
satisfactory results. For the same reason, the bounds based on the analysis of
Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set, which comes at the cost of a smaller training set, gave better performance guarantees but worse predictive performance in most experiments.
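In notation standard for this literature (assumed here, not quoted from the paper), the first approach combines the elementary factor-of-two relation between the majority vote and the Gibbs classifier with a PAC-Bayesian bound on the Gibbs risk, for example the PAC-Bayes-kl inequality:

```latex
% Standard PAC-Bayesian notation (assumed): \rho is the posterior over
% ensemble members (uniform for a random forest), \pi a prior, n the sample
% size, and kl the binary KL divergence.
% First step: the majority vote loses at most a factor of two over Gibbs.
\[
  L(\mathrm{MV}_\rho) \le 2\, L(\mathrm{G}_\rho)
\]
% Second step: bound the Gibbs risk via PAC-Bayes-kl, which holds with
% probability at least 1 - \delta over the draw of the sample.
\[
  \mathrm{kl}\!\left(\hat{L}(\mathrm{G}_\rho) \,\middle\|\, L(\mathrm{G}_\rho)\right)
  \le \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{n}
\]
```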
