Tuning Word2vec for Large Scale Recommendation Systems
Word2vec is a powerful machine learning tool that emerged from Natural Language Processing (NLP) and is now applied in multiple domains, including recommender systems, forecasting, and network analysis. As Word2vec is often used off the shelf, we address the question of whether the default hyperparameters are suitable for recommender systems. The answer is emphatically no. In this paper, we first elucidate the importance of hyperparameter optimization and show that unconstrained optimization yields an average 221% improvement in hit rate over the default parameters. However, unconstrained optimization leads to hyperparameter settings that are very expensive and not feasible for large scale recommendation tasks. To this end, we demonstrate 138% average improvement in hit rate with a runtime budget-constrained hyperparameter optimization. Furthermore, to make hyperparameter optimization applicable for large scale recommendation problems where the target dataset is too large to search over, we investigate generalizing hyperparameter settings from samples. We show that applying constrained hyperparameter optimization using only a 10% sample of the data still yields a 91% average improvement in hit rate over the default parameters when applied to the full datasets. Finally, we apply hyperparameters learned using our method of constrained optimization on a sample to the Who To Follow recommendation service at Twitter and are able to increase follow rates by 15%.
Comment: 11 pages, 4 figures, Fourteenth ACM Conference on Recommender Systems
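The budget-constrained search described above can be illustrated with a short sketch. Everything here is a hypothetical stand-in, not the authors' implementation: the search space, the cost model, and the stub evaluation. In practice the evaluation step would train Word2vec on a 10% sample and measure hit rate on held-out interactions.

```python
import random

# Illustrative search space over common Word2vec hyperparameters.
SPACE = {
    "vector_size": [32, 64, 128],
    "window": [3, 5, 10],
    "negative": [5, 10, 20],
    "alpha": [0.01, 0.025, 0.05],
}

def estimated_cost(cfg):
    # Stand-in runtime model: larger vectors and more negative samples
    # make training more expensive.
    return cfg["vector_size"] * cfg["negative"] / 1000.0

def evaluate_hit_rate(cfg):
    # Placeholder for "train on a 10% sample, measure hit rate";
    # a deterministic stub so the sketch is runnable.
    return 0.1 + 0.001 * cfg["vector_size"] - 0.002 * cfg["window"]

def constrained_random_search(budget, trials=20, seed=0):
    # Random search that simply skips configurations whose estimated
    # runtime exceeds the budget, keeping the best feasible hit rate.
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        if estimated_cost(cfg) > budget:
            continue
        score = evaluate_hit_rate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

The key design point carried over from the abstract is that feasibility is enforced inside the search loop, so the returned configuration is usable at production scale.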
Modeling the Field Value Variations and Field Interactions Simultaneously for Fraud Detection
With the explosive growth of e-commerce, online transaction fraud has become
one of the biggest challenges for e-commerce platforms. The historical
behaviors of users provide rich information for digging into the users' fraud
risk. While considerable efforts have been made in this direction, a
long-standing challenge is how to effectively exploit internal user information
and provide explainable prediction results. In fact, the value variations of the same field across different events and the interactions of different fields within one event have proven to be strong indicators of fraudulent behaviors.
In this paper, we propose the Dual Importance-aware Factorization Machines (DIFM), which exploits the internal field information in users' behavior sequences from dual perspectives, i.e., field value variations and field interactions simultaneously, for fraud detection. The proposed model is deployed in the risk management system of one of the world's largest e-commerce platforms, which utilizes it to provide real-time transaction fraud detection.
Experimental results on real industrial data from different regions in the
platform clearly demonstrate that our model achieves significant improvements
compared with various state-of-the-art baseline models. Moreover, the DIFM
could also give an insight into the explanation of the prediction results from
dual perspectives.
Comment: 11 pages, 4 figures
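As an illustration of the two signals this abstract names, here is a minimal sketch. The function names, the weighting scheme, and the toy inputs are hypothetical simplifications, not the DIFM architecture itself:

```python
def field_value_variation(events, field):
    # Fraction of consecutive events in which a field's value changes,
    # e.g. a shipping address that flips on every order is suspicious.
    values = [e[field] for e in events]
    changes = sum(1 for a, b in zip(values, values[1:]) if a != b)
    return changes / max(len(values) - 1, 1)

def fm_interactions(embeddings, weights):
    # Importance-weighted, factorization-machine style pairwise score
    # inside one event: sum over field pairs of w_i * w_j * <v_i, v_j>.
    score = 0.0
    fields = list(embeddings)
    for i in range(len(fields)):
        for j in range(i + 1, len(fields)):
            vi, vj = embeddings[fields[i]], embeddings[fields[j]]
            dot = sum(a * b for a, b in zip(vi, vj))
            score += weights[fields[i]] * weights[fields[j]] * dot
    return score
```

A model in this family would combine both quantities into a single fraud score; here they are kept separate to mirror the "dual perspectives" framing.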
Probabilistic XGBoost Threshold Classification with Autoencoder for Credit Card Fraud Detection
Because legitimate transactions vastly outnumber fraudulent ones, the resulting imbalanced data makes fraud detection a challenging task for which an effective solution is hard to find. In this study, an autoencoder with probabilistic threshold shifting of XGBoost (AE-XGB) for credit card fraud detection is designed. Initially, AE-XGB employs an autoencoder, the prevalent dimensionality reduction technique, to extract data features from the latent space representation. Then the reconstructed lower-dimensional features are passed to eXtreme Gradient Boosting (XGBoost), an ensemble boosting algorithm, with a probabilistic threshold to classify the data as fraudulent or legitimate. In addition to AE-XGB, other existing ensemble algorithms such as Adaptive Boosting (AdaBoost), Gradient Boosting Machine (GBM), Random Forest, Categorical Boosting (CatBoost), LightGBM and XGBoost are compared with optimal and default thresholds. To validate the methodology, we used the IEEE-CIS fraud detection dataset for our experiment. The class imbalance and high dimensionality of the dataset reduce model performance, hence the data is preprocessed before training. To evaluate the performance of the model, evaluation indicators such as precision, recall, F1-score, G-mean and the Matthews Correlation Coefficient (MCC) are computed. The findings revealed that the proposed AE-XGB model is effective in handling imbalanced data and is able to detect fraudulent transactions among incoming new transactions with a recall of 90.4% and an F1-score of 90.5%.
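The threshold-shifting step can be sketched independently of the autoencoder and XGBoost stages: given validation-set fraud probabilities from any classifier, sweep candidate thresholds and keep the one with the best F1 instead of the default 0.5. This is a generic sketch of the idea, not the study's exact procedure:

```python
def f1_at_threshold(probs, labels, t):
    # F1 when everything with predicted probability >= t is flagged as fraud.
    tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels):
    # Sweep the predicted probabilities as candidate thresholds and keep
    # the one with the highest validation F1, rather than fixing 0.5.
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))
```

On imbalanced data the optimal threshold is often far from 0.5, which is why the abstract compares optimal and default thresholds across all the ensemble methods.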
Handling class imbalance in online transaction fraud detection
With the rise of internet facilities, a greater number of people have started making online transactions at an exponential rate in recent years, as the online transaction system has eliminated the need to go to the bank physically for every transaction. However, fraud cases have also increased, causing loss of money to consumers. Hence, an effective fraud detection system that can detect fraudulent transactions automatically in real time is the need of the hour. Generally, genuine transactions far outnumber fraudulent ones, which leads to the class imbalance problem. In this research work, an online transaction fraud detection system using deep learning has been proposed which can handle the class imbalance problem by applying algorithm-level methods that modify the learning of the model to focus more on the minority class, i.e., fraudulent transactions. A novel loss function named Weighted Hard-Reduced Focal Loss (WH-RFL) has been proposed which achieves the maximum fraud detection rate, i.e., True Positive Rate (TPR), at the cost of misclassifying a few genuine transactions, as a high TPR is preferred over a high True Negative Rate (TNR) in a fraud detection system; this has been demonstrated using three publicly available imbalanced transactional datasets. Also, thresholding has been applied to optimize the decision threshold using cross-validation to detect the maximum number of frauds, and the experimental results demonstrate that selecting the right thresholding method with deep learning yields better results.
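WH-RFL's exact form is defined in the paper; as background, here is a sketch of the standard class-weighted focal loss that losses of this family build on. The alpha and gamma values are illustrative, not the paper's settings:

```python
import math

def weighted_focal_loss(p, y, alpha=0.75, gamma=2.0):
    # Class-weighted focal loss for one example: p is the predicted fraud
    # probability, y the true label (1 = fraud). The (1 - p_t)^gamma
    # factor down-weights easy examples so training focuses on hard ones,
    # and alpha up-weights the minority (fraud) class.
    p_t = p if y == 1 else 1 - p
    weight = alpha if y == 1 else 1 - alpha
    return -weight * (1 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

With gamma = 0 this reduces to weighted cross-entropy; raising gamma is what shifts the model's attention toward hard minority-class examples, which is the algorithm-level imbalance handling the abstract describes.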
A new feature engineering framework for financial cyber fraud detection using machine learning and deep learning
As online payment systems advance, total losses via online banking in the United Kingdom have increased, because fraudulent techniques have also progressed and exploit advanced technology. Traditional fraud detection models using only raw transaction data cannot cope with emerging new and innovative schemes to deceive financial institutions. Many studies published by both academic and commercial organisations introduce new fraud detection models using various machine learning algorithms; however, financial fraud losses via online banking have continued to increase. This thesis takes a holistic view of feature engineering for classification and of machine learning (ML) and deep learning (DL) algorithms for fraud detection, to understand their capabilities and how each algorithm handles input data. It then proposes a new feature engineering framework that can produce the most effective feature set for any ML or DL algorithm by bringing both feature engineering and feature selection methods into a single framework. The framework consists of two main components: feature creation and feature selection. The purpose of the feature creation component is to create many effective feature candidates by feature aggregation and transformation based on customer behaviour. The purpose of feature selection is to evaluate all features and to drop irrelevant features and highly correlated features from the dataset. In the experiment, I demonstrated the effect of the new feature engineering framework by using real-life banking transaction data provided by a private European bank and evaluating the performance of the built fraud detection models in an appropriate way. Machine learning and deep learning models perform at their best when the feature sets created by the new framework are applied, with higher scores on all evaluation metrics compared to the scores of the models built with the original dataset.
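The two framework components can be sketched in miniature. The aggregations and the correlation-based drop rule below are illustrative stand-ins, not the thesis's actual feature set or selection criteria:

```python
def aggregate_customer_features(transactions):
    # Feature creation: behaviour-based aggregation per customer,
    # e.g. transaction count, total spend and mean amount.
    out = {}
    for t in transactions:
        s = out.setdefault(t["customer"], {"count": 0, "total": 0.0})
        s["count"] += 1
        s["total"] += t["amount"]
    for s in out.values():
        s["mean"] = s["total"] / s["count"]
    return out

def drop_correlated(features, pairs_corr, threshold=0.95):
    # Feature selection: drop the later feature of any pair whose
    # absolute correlation exceeds the threshold.
    keep = list(features)
    for (a, b), r in pairs_corr.items():
        if abs(r) > threshold and a in keep and b in keep:
            keep.remove(b)
    return keep
```

The design point the thesis argues for is that creation and selection belong in one pipeline, so that many candidate features can be generated aggressively and then pruned before any model sees them.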
Analyzing Granger causality in climate data with time series classification methods
Attribution studies in climate science aim to scientifically ascertain the influence of climatic variations on natural or anthropogenic factors. Many of those studies adopt the concept of Granger causality to infer statistical cause-effect relationships, while utilizing traditional autoregressive models. In this article, we investigate the potential of state-of-the-art time series classification techniques to enhance causal inference in climate science. We conduct a comparative experimental study of different types of algorithms on a large test suite that comprises a unique collection of datasets from the area of climate-vegetation dynamics. The results indicate that specialized time series classification methods are able to improve existing inference procedures. Substantial differences are observed among the tested methods.
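The traditional autoregressive Granger test that the abstract contrasts with classification methods can be sketched with least squares: regress y on its own lags with and without the candidate cause's lags, and a large drop in residual error when the extra lags are added is the causal signal. A minimal sketch assuming NumPy; this is the classical procedure, not the article's classification-based approach:

```python
import numpy as np

def granger_sse(y, x=None, p=1):
    # Fit y_t on p lags of y (and optionally p lags of x) by least
    # squares; return the sum of squared residuals. A large SSE drop
    # when x's lags are added is the Granger-causality signal.
    y = np.asarray(y, dtype=float)
    n = len(y)
    cols = [np.ones(n - p)]
    for k in range(1, p + 1):
        cols.append(y[p - k:n - k])
    if x is not None:
        x = np.asarray(x, dtype=float)
        for k in range(1, p + 1):
            cols.append(x[p - k:n - k])
    X = np.column_stack(cols)
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return float(resid @ resid)
```

In practice the restricted and unrestricted SSEs feed an F-test; the sketch stops at the residual comparison, which is the quantity the test is built on.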