
    Feature Selection to Enhance Phishing Website Detection Based On URL Using Machine Learning Techniques

    The detection of phishing websites based on machine learning has gained much attention due to its ability to detect newly generated phishing URLs. To detect phishing websites, most techniques combine URLs, web page content, and external features. However, extracting web page content and external features is time-consuming, requires substantial computing power, and is not suitable for resource-constrained devices. To overcome this problem, this study applies URL-based feature selection techniques to improve the detection process. The methodology consists of seven stages: data preparation, preprocessing, splitting the dataset into training and validation sets, feature selection, 10-fold cross-validation, model validation, and performance evaluation. Two public datasets were used to validate the method. TreeSHAP and Information Gain were used to rank the features and select the top 10, 15, and 20. These features were fed into three machine learning classifiers: Naïve Bayes, Random Forest, and XGBoost, whose performance was evaluated in terms of accuracy, precision, and recall. The features ranked by TreeSHAP contributed most to improving detection accuracy. The highest accuracy, 98.59 percent, was achieved by XGBoost on the first dataset with 15 features. On the second dataset, the highest accuracy was 90.21 percent, achieved by Random Forest with 20 features. For Naïve Bayes, the highest recorded accuracy was 98.49 percent, on the first dataset.
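
    A minimal sketch of the kind of pipeline described above, assuming Python with shap, xgboost, and scikit-learn: rank features by mean absolute TreeSHAP value, keep the top k, and score Naïve Bayes, Random Forest, and XGBoost with 10-fold cross-validation. The synthetic data and all hyperparameters are placeholders for the two public phishing datasets, which are not specified here.

```python
# Hedged sketch, not the paper's exact code: TreeSHAP-based feature ranking
# followed by top-k selection and 10-fold cross-validated classification.
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a URL feature dataset.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=12, random_state=0)

# Fit a tree model once and rank features by mean |SHAP| (in the paper this
# ranking would be computed on the training split only).
ranker = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss").fit(X, y)
shap_values = shap.TreeExplainer(ranker).shap_values(X)
importance = np.abs(shap_values).mean(axis=0)

for k in (10, 15, 20):
    top_k = np.argsort(importance)[::-1][:k]
    X_k = X[:, top_k]
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in [("NaiveBayes", GaussianNB()),
                      ("RandomForest", RandomForestClassifier(n_estimators=200, random_state=0)),
                      ("XGBoost", xgb.XGBClassifier(n_estimators=200, eval_metric="logloss"))]:
        acc = cross_val_score(clf, X_k, y, cv=cv, scoring="accuracy").mean()
        print(f"top-{k:2d} features | {name:12s} accuracy = {acc:.4f}")
```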

    How reliable are SHAP values when trying to explain machine learning models

    With the rapid growth and application of machine learning and artificial intelligence models that cannot be understood by humans, there is a growing movement calling for an increase in interpretability. Numerous methods attempt to explain these models, and they vary drastically in how they evaluate them. In this paper, we investigate a local post-hoc method called SHAP. SHAP uses Shapley values from game theory to attribute an importance value to each input of a model at each data point. Shapley values can require significant computation time, especially as the number of inputs increases. To shorten the computation time, samples are used as the background datasets. In this paper, we investigate the variation in the Shapley values calculated for numerous background samples. We test multiple SHAP explainers, or calculation methods, for tree and logistic models. In most of our datasets, explainers based on the same model tend to return very similar results. We find that KernelSHAP for the logistic model tends to perform best, leading to the smallest variance across background datasets of all the explainers for both models.
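
    A hedged sketch of the reliability experiment the abstract describes: explain the same test points with KernelSHAP several times, each time drawing a different random background sample, and measure how much the attributions vary. The dataset, sample sizes, and number of repeats below are illustrative choices, not the paper's setup.

```python
# Hedged sketch: variation of KernelSHAP attributions across background samples.
import numpy as np
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

f = lambda data: model.predict_proba(data)[:, 1]   # scalar output per row
points = X_test[:5]                                # the points being explained
rng = np.random.default_rng(0)

runs = []
for _ in range(10):                                # 10 different background samples
    background = X_train[rng.choice(len(X_train), size=100, replace=False)]
    explainer = shap.KernelExplainer(f, background)
    runs.append(explainer.shap_values(points, nsamples=200))

runs = np.stack(runs)                              # (repeats, points, features)
print("per-feature std across background samples:")
print(runs.std(axis=0).mean(axis=0))               # averaged over the explained points
```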

    On marginal feature attributions of tree-based models

    Due to their power and ease of use, tree-based machine learning models have become very popular. To interpret these models, local feature attributions based on marginal expectations, e.g., marginal (interventional) Shapley, Owen, or Banzhaf values, may be employed. Such feature attribution methods are true to the model and implementation invariant, i.e., dependent only on the input-output function of the model. By taking advantage of the internal structure of tree-based models, we prove that their marginal Shapley values, or more generally marginal feature attributions obtained from a linear game value, are simple (piecewise-constant) functions with respect to a certain finite partition of the input space determined by the trained model. The same is true for feature attributions obtained from the well-known TreeSHAP algorithm. Nevertheless, we show that the "path-dependent" TreeSHAP is not implementation invariant by presenting two (statistically similar) decision trees computing the exact same function for which the algorithm yields different rankings of features, whereas the marginal Shapley values coincide. Furthermore, we discuss how the fact that marginal feature attributions are simple functions can potentially be utilized to compute them. An important observation, showcased by experiments with the XGBoost, LightGBM, and CatBoost libraries, is that only a portion of all features appears in any given tree of the ensemble; thus the complexity of computing marginal Shapley (or Owen or Banzhaf) feature attributions may be reduced. In particular, in the case of CatBoost models, the trees are oblivious (symmetric) and the number of features in each of them is no larger than the depth. We exploit this symmetry to derive an explicit formula, with improved complexity and expressed only in terms of the internal parameters of the CatBoost model, for marginal Shapley (and Banzhaf and Owen) values. (48 pages, 7 figures.)
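
    The distinction between marginal (interventional) and path-dependent TreeSHAP can be reproduced directly with the shap library's feature_perturbation argument. The sketch below is illustrative only; the model and data are toy stand-ins rather than the counterexample trees constructed in the paper.

```python
# Hedged illustration: marginal ("interventional") vs. "path-dependent" TreeSHAP.
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss").fit(X, y)

background = X[:200]
marginal = shap.TreeExplainer(model, data=background,
                              feature_perturbation="interventional")
path_dep = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")

sv_marginal = marginal.shap_values(X[:50])
sv_path = path_dep.shap_values(X[:50])

# The two attributions generally differ; only the marginal version is
# implementation invariant in the sense discussed above.
print(np.abs(sv_marginal - sv_path).mean())
```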

    A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies

    When conducting user studies to ascertain the usefulness of model explanations in aiding human decision-making, it is important to use real-world use cases, data, and users. However, this process can be resource-intensive, allowing only a limited number of explanation methods to be evaluated. Simulated user evaluations (SimEvals), which use machine learning models as a proxy for human users, have been proposed as an intermediate step to select promising explanation methods. In this work, we conduct the first SimEvals on a real-world use case to evaluate whether explanations can better support ML-assisted decision-making in e-commerce fraud detection. We study whether SimEvals can corroborate findings from a user study conducted in this fraud detection context. In particular, we find that SimEvals suggest that all considered explainers are equally performant, and none beat a baseline without explanations; this matches the conclusions of the original user study. Such correspondences between our results and the original user study provide initial evidence in favor of using SimEvals before running user studies. We also explore the use of SimEvals as a cheap proxy for exploring an alternative user study set-up. We hope that this work motivates further study of when and how SimEvals should be used to aid in the design of real-world evaluations. (9 pages, 2 figures.)
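
    A simplified, hedged sketch of the SimEvals idea: train a machine learning "agent" to make the downstream decision from the task model's output alone versus from that output plus an explanation, then compare the two accuracies. The fraud task, features, and explainer below are stand-ins, not the study's actual setup.

```python
# Hedged sketch: an ML agent as a proxy user, with and without explanations.
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=15, n_informative=8, random_state=1)
X_model, X_agent, y_model, y_agent = train_test_split(X, y, test_size=0.5, random_state=1)

# The "task model" whose explanations are being evaluated.
task_model = xgb.XGBClassifier(n_estimators=150, eval_metric="logloss").fit(X_model, y_model)
scores = task_model.predict_proba(X_agent)[:, [1]]
explanations = shap.TreeExplainer(task_model).shap_values(X_agent)

def agent_accuracy(inputs, labels):
    """Train a simulated user on half the data, report accuracy on the rest."""
    a_train, a_test, l_train, l_test = train_test_split(inputs, labels, test_size=0.5, random_state=2)
    agent = LogisticRegression(max_iter=2000).fit(a_train, l_train)
    return agent.score(a_test, l_test)

print("agent without explanations:", agent_accuracy(scores, y_agent))
print("agent with explanations:   ", agent_accuracy(np.hstack([scores, explanations]), y_agent))
```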

    mSHAP: SHAP Values for Two-Part Models

    Two-part models are important to and used throughout insurance and actuarial science. Since insurance is required for registering a car, obtaining a mortgage, and participating in certain businesses, it is especially important that the models which price insurance policies are fair and non-discriminatory. Black-box models can make it very difficult to know which covariates are influencing the results. SHAP values enable the interpretation of various black-box models, but little progress has been made on two-part models. In this paper, we propose mSHAP (multiplicative SHAP), a method for computing SHAP values of two-part models from the SHAP values of the individual models. This method allows the predictions of two-part models to be explained at the level of individual observations. After developing mSHAP, we perform an in-depth simulation study. Although the KernelSHAP algorithm is also capable of computing approximate SHAP values for a two-part model, a comparison with our method demonstrates that mSHAP is exponentially faster. Ultimately, we apply mSHAP to a two-part ratemaking model for personal auto property damage insurance coverage. An R package (mshap) is available to easily implement the method in a wide variety of applications.
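
    A hedged sketch of the idea behind mSHAP: combine the SHAP values of the two parts of a model whose prediction is their product into attributions for that product. The symmetric split of the cross terms below is one simple allocation that preserves local accuracy; the paper (and the R package mshap) defines the exact weighting, which may differ.

```python
# Hedged sketch, NOT the exact mSHAP allocation: combine per-model SHAP values
# for a product model y = f1(x) * f2(x) while keeping local accuracy.
import numpy as np

def combine_two_part_shap(phi1, phi2, mu1, mu2):
    """phi1, phi2: (n, p) SHAP values of the two models; mu1, mu2: their expected values.

    Returns (n, p) attributions for the product model and its baseline mu1 * mu2.
    """
    s1 = phi1.sum(axis=1, keepdims=True)   # total deviation of model 1 per row
    s2 = phi2.sum(axis=1, keepdims=True)
    combined = phi1 * mu2 + phi2 * mu1 + 0.5 * (phi1 * s2 + phi2 * s1)
    return combined, mu1 * mu2

# Tiny check that baseline + attributions reproduce the product prediction.
rng = np.random.default_rng(0)
phi1, phi2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
mu1, mu2 = 0.2, 150.0
combined, base = combine_two_part_shap(phi1, phi2, mu1, mu2)
product = (mu1 + phi1.sum(axis=1)) * (mu2 + phi2.sum(axis=1))
print(np.allclose(base + combined.sum(axis=1), product))   # True
```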

    Explainability and Fairness in Machine Learning

    Over the years, the focus in prediction models has been on improving their accuracy, at the cost of losing any comprehension of how the model makes its predictions. As a consequence, the ability to know whether the model's behavior is correct has also been lost. Moreover, because the recipients of the predictions have no information about how ethical or fair the model is, people become reluctant to use such models. In recent years, therefore, research has been carried out on how to explain these predictions and make them intelligible to humans, using techniques such as LIME and SHAP, which expose in a human-interpretable way what happens behind a prediction. This work addresses this issue and reviews recent literature on the topic. (Universidad de Sevilla, Grado en Física y Matemática.)
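
    Since the review centers on LIME and SHAP as post-hoc explanation techniques, the following is a minimal, hedged usage sketch of both on a generic tabular classifier; the dataset and model are placeholders, not the ones discussed in the work.

```python
# Hedged sketch: explaining one prediction of a tabular classifier with SHAP and LIME.
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# SHAP: additive feature attributions for one prediction.
shap_values = shap.TreeExplainer(model).shap_values(data.data[:1])

# LIME: a local surrogate model fitted around the same instance.
lime_explainer = LimeTabularExplainer(data.data, feature_names=list(data.feature_names),
                                      class_names=list(data.target_names), mode="classification")
lime_explanation = lime_explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(lime_explanation.as_list())
```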

    Interpretable Time Series Clustering Using Local Explanations

    This study explores the use of local interpretability methods for explaining time series clustering models. Many state-of-the-art clustering models are not directly explainable. To provide explanations for these clustering algorithms, we train classification models to estimate the cluster labels; we then use interpretability methods to explain the decisions of the classification models. The explanations are used to obtain insights into the clustering models. We perform a detailed numerical study to test the proposed approach on multiple datasets, clustering models, and classification models. The analysis of the results shows that the proposed approach can be used to explain time series clustering models, specifically when the underlying classification model is accurate. Lastly, we provide a detailed analysis of the results, discussing how our approach can be used in a real-life scenario.
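
    A hedged sketch of the surrogate approach described above: cluster the series, train a classifier to reproduce the cluster labels, then explain that classifier with a local attribution method (SHAP here). The toy data, the clustering algorithm, and the classifier are illustrative choices, not the study's experimental setup.

```python
# Hedged sketch: explain a time series clustering via a surrogate classifier.
import numpy as np
import shap
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# 300 toy series of length 50: two regimes with opposite trends plus noise.
trend = np.where(rng.random(300) < 0.5, 0.05, -0.05)[:, None]
series = trend * np.arange(50) + rng.normal(scale=0.5, size=(300, 50))

cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(series)

# Surrogate classifier trained to estimate the cluster labels from the raw series.
surrogate = RandomForestClassifier(n_estimators=200, random_state=0).fit(series, cluster_labels)
print("surrogate agreement with clustering:", surrogate.score(series, cluster_labels))

# Local explanations of the surrogate: which time steps drive cluster membership.
shap_values = shap.TreeExplainer(surrogate).shap_values(series[:10])
```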

    Confident Feature Ranking

    Interpretation of feature importance values often relies on the relative order of the features rather than on the values themselves, referred to as ranking. However, the order may be unstable due to the small sample sizes used in calculating the importance values. We propose that post-hoc importance methods produce a ranking together with simultaneous confidence intervals for the ranks. Based on pairwise comparisons of the feature importance values, our method is guaranteed to include the "true" (infinite-sample) ranking with high probability and allows for selecting top-k sets.
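
    A hedged sketch of the pairwise-comparison idea: resample the importance values, compare every pair of features, and turn the comparisons into a rank interval per feature. The paper's procedure additionally guarantees simultaneous coverage of the true ranking; the plain Bonferroni correction below is a stand-in, not the method itself.

```python
# Hedged sketch: rank intervals for features from resampled importance values.
import numpy as np
from scipy import stats

def rank_intervals(importance_samples, alpha=0.05):
    """importance_samples: (n_repeats, n_features) resampled importance values."""
    n, p = importance_samples.shape
    corrected = alpha / (p * (p - 1) / 2)              # naive Bonferroni over all pairs
    lower = np.ones(p, dtype=int)                      # rank 1 = most important
    upper = np.full(p, p, dtype=int)
    for i in range(p):
        for j in range(p):
            if i == j:
                continue
            diff = importance_samples[:, j] - importance_samples[:, i]
            res = stats.ttest_1samp(diff, 0.0)
            if res.pvalue < corrected and diff.mean() > 0:
                lower[i] += 1                          # j is confidently more important than i
            if res.pvalue < corrected and diff.mean() < 0:
                upper[i] -= 1                          # j is confidently less important than i
    return lower, upper

samples = np.random.default_rng(0).normal(loc=[3.0, 2.9, 1.0, 0.1], scale=0.3, size=(50, 4))
print(rank_intervals(samples))
```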