Feature Selection to Enhance Phishing Website Detection Based On URL Using Machine Learning Techniques
The detection of phishing websites based on machine learning has gained much attention due to its ability to detect newly generated phishing URLs. To detect phishing websites, most techniques combine URLs, web page content, and external features. However, extracting web page content and external features is time-consuming, requires large computing power, and is not suitable for resource-constrained devices. To overcome this problem, this study applies feature selection techniques based on the URL alone to improve the detection process. The methodology consists of seven stages: data preparation, preprocessing, splitting the dataset into training and validation sets, feature selection, 10-fold cross-validation, model validation, and finally performance evaluation. Two public datasets were used to validate the method. TreeSHAP and Information Gain were used to rank features and select the top 10, 15, and 20. These features are fed into three machine learning classifiers: Naïve Bayes, Random Forest, and XGBoost. Their performance is evaluated based on accuracy, precision, and recall. As a result, the features ranked by TreeSHAP contributed most to improving detection accuracy. The highest accuracy of 98.59 percent was achieved by XGBoost on the first dataset with 15 features. For the second dataset, the highest accuracy was 90.21 percent, using 20 features and Random Forest. As for Naïve Bayes, the highest accuracy recorded was 98.49 percent, using the first dataset.
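The ranking-and-selection pipeline described in this abstract (rank features, keep the top k, train a classifier, evaluate with 10-fold cross-validation) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the paper's phishing URL datasets; Information Gain is approximated here by `mutual_info_classif`, and a single Random Forest stands in for the three classifiers compared in the study.

```python
# Minimal sketch of the top-k feature-selection pipeline: rank features by
# Information Gain (mutual information), keep the top k, then train and
# cross-validate a Random Forest on the reduced feature set. Synthetic data
# stands in for the phishing URL datasets used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

# Rank all features by Information Gain and keep the top k = 10.
scores = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(scores)[::-1][:10]
X_sel = X[:, top_k]

# 10-fold cross-validated accuracy on the reduced feature set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X_sel, y, cv=10, scoring="accuracy").mean()
print(f"top-10 features: {sorted(top_k.tolist())}")
print(f"10-fold CV accuracy: {acc:.3f}")
```

Swapping `mutual_info_classif` for mean absolute TreeSHAP values computed on a fitted tree model gives the study's other ranking; the surrounding pipeline is unchanged.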
How reliable are SHAP values when trying to explain machine learning models?
With the rapid growth and application of machine learning and artificial intelligence models that cannot be understood by humans, there is a growing movement calling for increased interpretability. Numerous methods attempt to explain these models, and they vary drastically in how they evaluate them. In this paper, we investigate a local post-hoc method called SHAP. SHAP utilizes Shapley values from game theory to attribute an importance value to each input of a model at each data point. Shapley values can require significant computation time, especially as the number of inputs increases. To shorten the computation time, samples are used as the background datasets. In this paper, we investigate the variation in the Shapley values calculated for numerous background samples. We test multiple different SHAP explainers, or calculation methods, for tree and logistic models. In most of our datasets, the explainers based on the same model tend to return very similar results. We find that KernelSHAP for the logistic model tends to perform best, yielding the smallest variance across background datasets of all the explainers for both models.
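The background-sample dependence this abstract studies can be reproduced in miniature. The sketch below brute-forces exact marginal Shapley values of a small handmade model against two different background samples drawn from the same data and compares the results. It is a NumPy-only toy, not the SHAP library's explainers; the model, data, and function names are all illustrative assumptions.

```python
# Exact (brute-force) marginal Shapley values computed against two different
# background samples, to show how attributions vary with the background.
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)

def shapley(f, x, background):
    """Exact marginal Shapley values of f at x w.r.t. a background sample."""
    n = x.shape[0]
    def v(S):
        # Value of coalition S: average of f over the background with the
        # features in S replaced by x's values (marginal expectation).
        Z = background.copy()
        Z[:, list(S)] = x[list(S)]
        return f(Z).mean()
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

# A simple nonlinear model of 4 features.
f = lambda Z: Z[:, 0] * Z[:, 1] + np.tanh(Z[:, 2]) - 0.5 * Z[:, 3]
X = rng.normal(size=(400, 4))
x = X[0]

# Two disjoint background samples of 50 rows each.
phi_a = shapley(f, x, X[100:150])
phi_b = shapley(f, x, X[150:200])
print("background A:", np.round(phi_a, 3))
print("background B:", np.round(phi_b, 3))
print("max abs difference:", np.abs(phi_a - phi_b).max())
```

Both attribution vectors satisfy the efficiency property with respect to their own background (they sum to the prediction minus that background's mean prediction), yet they differ feature by feature, which is exactly the variation the paper quantifies.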
On marginal feature attributions of tree-based models
Due to their power and ease of use, tree-based machine learning models have become very popular. To interpret these models, local feature attributions based on marginal expectations, e.g. marginal (interventional) Shapley, Owen, or Banzhaf values, may be employed. Such feature attribution methods are true to the model and implementation invariant, i.e. dependent only on the input-output function of the model. By taking advantage of the internal structure of tree-based models, we prove that their marginal Shapley values, or more generally marginal feature attributions obtained from a linear game value, are simple (piecewise-constant) functions with respect to a certain finite partition of the input space determined by the trained model. The same is true for feature attributions obtained from the famous TreeSHAP algorithm. Nevertheless, we show that the "path-dependent" TreeSHAP is not implementation invariant by presenting two (statistically similar) decision trees computing the exact same function for which the algorithm yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the fact that marginal feature attributions are simple functions can potentially be utilized to compute them. An important observation, showcased by experiments with the XGBoost, LightGBM, and CatBoost libraries, is that only a portion of all features appears in any given tree of the ensemble; thus the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. In particular, in the case of CatBoost models, the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit this symmetry to derive an explicit formula with improved complexity for marginal Shapley (and Banzhaf and Owen) values, expressed only in terms of the internal parameters of the CatBoost model.
Comment: 48 pages, 7 figures
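The complexity observation above has a simple concrete form: if a tree's output depends on only a few of its inputs, the marginal Shapley values of every unused feature vanish, so attribution only needs to consider the features that actually appear in the tree. The sketch below checks this by brute force on a toy oblivious (symmetric) tree; the brute-force computation stands in for TreeSHAP-style algorithms and the tree is an illustrative assumption, not a CatBoost model.

```python
# A depth-2 "oblivious" tree over 6 inputs that only ever splits on
# features 0 and 2 (one shared split per level, as in symmetric trees).
# Brute-force marginal Shapley values show the other features attribute 0.
import itertools
import math
import numpy as np

def marginal_shapley(f, x, background):
    n = x.shape[0]
    def v(S):
        Z = background.copy()
        Z[:, list(S)] = x[list(S)]
        return f(Z).mean()
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

def tree(Z):
    left = Z[:, 0] < 0.0
    lower = Z[:, 2] < 0.5
    return np.select([left & lower, left & ~lower, ~left & lower],
                     [1.0, 2.0, 3.0], default=4.0)

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 6))
phi = marginal_shapley(tree, X[0], X)
print(np.round(phi, 6))
# Features 1, 3, 4, 5 never appear in the tree, so their attributions are 0.
```

Since the coalition value v(S) is unchanged by adding a feature the tree never reads, every marginal contribution of such a feature is exactly zero, which is why the per-tree computation can be restricted to the features on the tree's splits.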
A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies
When conducting user studies to ascertain the usefulness of model explanations in aiding human decision-making, it is important to use real-world use cases, data, and users. However, this process can be resource-intensive, allowing only a limited number of explanation methods to be evaluated. Simulated user evaluations (SimEvals), which use machine learning models as a proxy for human users, have been proposed as an intermediate step to select promising explanation methods. In this work, we conduct the first SimEvals on a real-world use case to evaluate whether explanations can better support ML-assisted decision-making in e-commerce fraud detection. We study whether SimEvals can corroborate findings from a user study conducted in this fraud detection context. In particular, we find that SimEvals suggest that all considered explainers are equally performant, and none beat a baseline without explanations -- this matches the conclusions of the original user study. Such correspondences between our results and the original user study provide initial evidence in favor of using SimEvals before running user studies. We also explore the use of SimEvals as a cheap proxy to explore an alternative user study set-up. We hope that this work motivates further study of when and how SimEvals should be used to aid in the design of real-world evaluations.
Comment: 9 pages, 2 figures
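The SimEval idea above, a machine learning model acting as a proxy "user" that makes the downstream decision with and without the candidate explanation, can be sketched in a few lines. Everything below is a synthetic stand-in (toy data, a fabricated "explanation" signal, a logistic proxy user), not the paper's fraud-detection setup; it only illustrates the mechanics of comparing proxy performance across information conditions.

```python
# Toy SimEval: train a proxy "user" model on the prediction score alone,
# then on the score plus an explanation feature, and compare accuracies.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Ground-truth decision depends on a hidden factor the model score misses.
hidden = rng.normal(size=n)
score = rng.normal(size=n)
y = (score + hidden > 0).astype(int)

# A synthetic "explanation" that happens to expose the hidden factor.
explanation = hidden + 0.1 * rng.normal(size=n)

def proxy_accuracy(features):
    Xtr, Xte, ytr, yte = train_test_split(features, y, random_state=0)
    return LogisticRegression().fit(Xtr, ytr).score(Xte, yte)

acc_base = proxy_accuracy(score.reshape(-1, 1))
acc_expl = proxy_accuracy(np.column_stack([score, explanation]))
print(f"proxy user, score only:        {acc_base:.3f}")
print(f"proxy user, score+explanation: {acc_expl:.3f}")
```

In this toy, the explanation carries decision-relevant information, so the proxy improves; in the paper's study, the analogous comparison found no improvement over the no-explanation baseline, for both simulated and human users.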
mSHAP: SHAP Values for Two-Part Models
Two-part models are important to, and used throughout, insurance and actuarial science. Since insurance is required for registering a car, obtaining a mortgage, and participating in certain businesses, it is especially important that the models which price insurance policies are fair and non-discriminatory. Black box models can make it very difficult to know which covariates are influencing the results. SHAP values enable interpretation of various black box models, but little progress has been made on two-part models. In this paper, we propose mSHAP (or multiplicative SHAP), a method for computing SHAP values of two-part models using the SHAP values of the individual models. This method allows the predictions of two-part models to be explained at an individual observation level. After developing mSHAP, we perform an in-depth simulation study. Although the KernelSHAP algorithm is also capable of computing approximate SHAP values for a two-part model, a comparison with our method demonstrates that mSHAP is exponentially faster. Ultimately, we apply mSHAP to a two-part ratemaking model for personal auto property damage insurance coverage. Additionally, an R package (mshap) is available to easily implement the method in a wide variety of applications.
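The difficulty mSHAP addresses can be seen directly: for a product model f = f1 · f2, naively combining the parts' Shapley values does not yield valid attributions for the product. The sketch below brute-forces the exact Shapley values of a toy two-part (frequency × severity) model and contrasts them with a naive elementwise combination; the naive rule and all model forms here are illustrative assumptions, not mSHAP's actual combination formula, which is given in the paper.

```python
# Exact Shapley values of a product model vs. a naive combination of the
# parts' Shapley values. The naive combination fails the efficiency check.
import itertools
import math
import numpy as np

def exact_shapley(f, x, background):
    n = x.shape[0]
    def v(S):
        Z = background.copy()
        Z[:, list(S)] = x[list(S)]
        return f(Z).mean()
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 3))
x = X[0]

frequency = lambda Z: 1.0 / (1.0 + np.exp(-(Z[:, 0] + 0.5 * Z[:, 1])))
severity = lambda Z: np.exp(0.3 * Z[:, 1] - 0.2 * Z[:, 2])
two_part = lambda Z: frequency(Z) * severity(Z)

exact = exact_shapley(two_part, x, X)
pred = two_part(x[None, :])[0]
baseline = two_part(X).mean()

# A naive combination: scale each part's attributions by the other's mean.
mu1, mu2 = frequency(X).mean(), severity(X).mean()
naive = exact_shapley(frequency, x, X) * mu2 + exact_shapley(severity, x, X) * mu1

print("exact sum vs pred - baseline:", exact.sum(), pred - baseline)
print("naive sum vs pred - baseline:", naive.sum(), pred - baseline)
```

The exact attributions sum to the prediction minus the background mean; the naive combination drops the cross terms between the two parts and misses that target, which is why a dedicated combination rule like mSHAP (or an expensive direct computation) is needed.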
Explainability and Fairness in Machine Learning
Over the years, the focus in prediction modeling has been set on improving accuracy, at the cost of losing any comprehension of how the model predicts. As a consequence, the ability to verify that the model behaves correctly has also been lost. Moreover, because the recipients of the predictions have no information about how ethical or fair the model is when predicting, people become reluctant to use such models. Therefore, in recent years, research has aimed to explain these predictions in order to make them intelligible to humans, using techniques such as LIME and SHAP, which expose in an interpretable way what happens behind a prediction. This work addresses this issue and reviews the recent literature on the topic.
Universidad de Sevilla. Grado en Física y Matemáticas
Interpretable Time Series Clustering Using Local Explanations
This study focuses on exploring the use of local interpretability methods for explaining time series clustering models. Many state-of-the-art clustering models are not directly explainable. To provide explanations for these clustering algorithms, we train classification models to estimate the cluster labels. Then, we use interpretability methods to explain the decisions of the classification models. The explanations are used to obtain insights into the clustering models. We perform a detailed numerical study to test the proposed approach on multiple datasets, clustering models, and classification models. The analysis of the results shows that the proposed approach can be used to explain time series clustering models, specifically when the underlying classification model is accurate. Lastly, we provide a detailed analysis of the results, discussing how our approach can be used in a real-life scenario.
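The cluster-then-classify-then-explain pipeline above can be sketched end to end with scikit-learn. This is a minimal toy: synthetic series replace the study's datasets, and a Random Forest's impurity importances stand in for the local interpretability methods the study actually uses.

```python
# Pipeline sketch: (1) cluster time series, (2) train a surrogate classifier
# to reproduce the cluster labels, (3) explain the classifier to learn which
# timesteps drive the clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)

# Two groups of synthetic series that differ only in the second half.
flat = rng.normal(0, 0.3, size=(60, 50))
ramp = flat + np.where(t > 0.5, 2.0, 0.0)
X = np.vstack([flat, ramp])

# Step 1: cluster the series (each timestep is a feature).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: train a surrogate classifier to predict the cluster labels.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print("surrogate accuracy:", clf.score(X, labels))

# Step 3: the classifier's explanations reveal which timesteps matter --
# here the importance mass should sit in the second half of the series.
imp = clf.feature_importances_
print("importance in first half: ", imp[t <= 0.5].sum().round(3))
print("importance in second half:", imp[t > 0.5].sum().round(3))
```

As the abstract notes, the explanations are only trustworthy when the surrogate classifier reproduces the cluster labels accurately, which is why checking the surrogate's accuracy (step 2) comes before reading the explanations.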
Confident Feature Ranking
Interpretation of feature importance values often relies on the relative order of the features rather than on the values themselves; this order is referred to as a ranking. However, the ranking may be unstable due to the small sample sizes used in calculating the importance values. We propose that post-hoc importance methods produce a ranking together with simultaneous confidence intervals for the rankings. Based on pairwise comparisons of the feature importance values, our method is guaranteed to include the "true" (infinite-sample) ranking with high probability and allows for the selection of top-k sets.
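The construction above can be sketched in simplified form: given repeated estimates of each feature's importance, pairwise tests decide which orderings are confident, and each feature receives a rank interval rather than a single rank. This is only the basic idea under assumed inputs (synthetic importance estimates, plain paired t-tests); the calibrated procedure with the simultaneous coverage guarantee is the paper's contribution.

```python
# Rank intervals from pairwise comparisons of repeated importance estimates:
# a feature's lowest possible rank counts the features confidently above it,
# and its highest possible rank discounts the features confidently below it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic importance estimates (e.g. across bootstrap resamples):
# 5 features, 30 estimates each. Features 3 and 4 are nearly tied.
means = np.array([5.0, 3.0, 1.0, 0.55, 0.5])
samples = means + rng.normal(0, 0.3, size=(30, 5))

def rank_intervals(samples, alpha=0.05):
    n = samples.shape[1]
    lo = np.ones(n, dtype=int)        # rank 1 = most important
    hi = np.full(n, n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Paired t-test on the difference of importance estimates.
            t, p = stats.ttest_rel(samples[:, j], samples[:, i])
            if p < alpha and t > 0:   # j confidently more important than i
                lo[i] += 1
            if p < alpha and t < 0:   # j confidently less important than i
                hi[i] -= 1
    return lo, hi

lo, hi = rank_intervals(samples)
for i, (a, b) in enumerate(zip(lo, hi)):
    print(f"feature {i}: rank in [{a}, {b}]")
```

Clearly separated features collapse to a single rank, while the two nearly tied features share a wider interval, which is the honest answer the single-point ranking hides.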