167 research outputs found

    A Data-Driven Approach to Predict the Success of Bank Telemarketing

    We propose a data mining (DM) approach to predict the success of telemarketing calls for selling bank long-term deposits. A Portuguese retail bank was addressed, with data collected from 2008 to 2013, thus including the effects of the recent financial crisis. We analyzed a large set of 150 features related to bank client, product and social-economic attributes. A semi-automatic feature selection was explored in the modeling phase, performed with data prior to July 2012, which allowed the selection of a reduced set of 22 features. We also compared four DM models: logistic regression, decision trees (DT), neural network (NN) and support vector machine. Using two metrics, the area of the receiver operating characteristic curve (AUC) and the area of the LIFT cumulative curve (ALIFT), the four models were tested in an evaluation phase using the most recent data (after July 2012) and a rolling window scheme. The NN presented the best results (AUC = 0.8 and ALIFT = 0.7), allowing 79% of the subscribers to be reached by selecting the better-classified half of the clients. Two knowledge extraction methods, a sensitivity analysis and a DT, were then applied to the NN model and revealed several key attributes (e.g., Euribor rate, direction of the call and bank agent experience). Such knowledge extraction confirmed the obtained model as credible and valuable for telemarketing campaign managers.
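    The evaluation described above combines AUC with the area under the cumulative LIFT curve (ALIFT). The sketch below is only a minimal illustration of that kind of scoring, not the authors' implementation: it fits a small neural network on synthetic, imbalanced data standing in for the bank dataset and computes both metrics (the `alift` helper is a hypothetical name).

```python
# Minimal sketch (not the authors' implementation): fit a small neural network
# and score it with AUC and an approximation of the area under the cumulative
# LIFT curve (ALIFT). Synthetic, imbalanced data stands in for the bank data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=22, n_informative=8,
                           weights=[0.88, 0.12], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(6,), max_iter=1000, random_state=0))
model.fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

def alift(y_true, s):
    """Approximate area under the cumulative LIFT (gains) curve: the fraction
    of all subscribers captured in the top-ranked x% of clients, averaged
    over all cut-offs x."""
    order = np.argsort(-s)
    gains = np.cumsum(np.asarray(y_true)[order]) / np.sum(y_true)
    return gains.mean()

print("AUC  :", round(roc_auc_score(y_te, scores), 3))
print("ALIFT:", round(alift(y_te, scores), 3))
```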

    Feature selection strategies for improving data-driven decision support in bank telemarketing

    The usage of data mining techniques to unveil previously undiscovered knowledge has been applied in past years to a wide number of domains, including banking and marketing. Raw data is the basic ingredient for successfully detecting interesting patterns. A key aspect of raw data manipulation is feature engineering, which concerns the correct characterization or selection of relevant features (or variables) that conceal relations with the target goal. This study is particularly focused on feature engineering, aiming to unfold the features that best characterize the problem of selling long-term bank deposits through telemarketing campaigns. For the experimental setup, a case study from a Portuguese bank, covering the 2008-2013 period and encompassing the recent global financial crisis, was addressed. To assess the relevance of the problem, a novel literature analysis using text mining and the latent Dirichlet allocation algorithm was conducted, confirming the existence of a research gap for bank telemarketing. Starting from a dataset containing typical telemarketing contacts and client information, the research followed three different and complementary strategies: first, by enriching the dataset with social and economic context features; then, by including customer lifetime value related features; finally, by applying a divide-and-conquer strategy for splitting the problem into smaller fractions, leading to optimized sub-problems. Each of the three approaches improved previous results in terms of model metrics related to prediction performance. The relevance of the proposed features was evaluated, confirming the obtained models as credible and valuable for telemarketing campaign managers.
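    One of the steps above is a literature scan with the latent Dirichlet allocation algorithm. The following sketch shows, under the assumption of a small illustrative list of abstract strings (not the real corpus), how such a topic analysis can be set up with scikit-learn.

```python
# Sketch of an LDA topic scan over paper abstracts; the three strings below
# are illustrative placeholders, not the corpus used in the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "data mining for bank telemarketing success prediction",
    "customer churn prediction with neural networks",
    "feature selection for direct marketing campaigns",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_terms}")
```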

    Using customer lifetime value and neural networks to improve the prediction of bank deposit subscription in telemarketing campaigns

    Customer lifetime value (LTV) enables the use of client characteristics, such as recency, frequency and monetary value, to describe the value of a client through time in terms of profitability. We present the concept of LTV applied to telemarketing for improving return on investment, using a recent (2008 to 2013) and real case study of bank campaigns to sell long-term deposits. The goal was to benefit from past contact history to extract additional knowledge. A total of twelve LTV input variables were tested, under a forward selection method and using a realistic rolling window scheme, highlighting the validity of five new LTV features. The results achieved by our LTV data-driven approach using neural networks allowed an improvement of up to 4 percentage points in the cumulative LIFT curve for targeting deposit subscribers when compared with a baseline model (with no history data). Explanatory knowledge was also extracted from the proposed model, revealing two highly relevant LTV features: the result of the previous campaign to sell the same product and the frequency of past client successes. The obtained results are particularly valuable for contact center companies, which can improve predictive performance without having to ask the companies they serve for more information.
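    The LTV features mentioned above are built from past contact history (recency, frequency and past outcomes). A minimal sketch of that kind of feature derivation is shown below; the table layout and column names (`client_id`, `contact_date`, `outcome`) are assumptions, not the paper's schema.

```python
# Sketch (hypothetical column names) of deriving LTV-style features --
# recency, frequency and past-success counts -- from a contact history table.
import pandas as pd

contacts = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "contact_date": pd.to_datetime(
        ["2012-05-10", "2012-11-02", "2012-03-01", "2012-06-15", "2013-01-20"]),
    "outcome": ["failure", "success", "failure", "failure", "success"],
})
ref_date = pd.Timestamp("2013-02-01")   # start of the evaluation window

ltv = contacts.groupby("client_id").agg(
    frequency=("contact_date", "size"),
    recency_days=("contact_date", lambda d: (ref_date - d.max()).days),
    past_successes=("outcome", lambda o: (o == "success").sum()),
    last_outcome=("outcome", "last"),
)
print(ltv)
```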

    Mutual information and sensitivity analysis for feature selection in customer targeting: a comparative study

    Feature selection is a highly relevant task in any data-driven knowledge discovery project. The present research focuses on analysing the advantages and disadvantages of using mutual information (MI) and data-based sensitivity analysis (DSA) for feature selection in classification problems, by applying both to a bank telemarketing case. A logistic regression model is built on the tuned set of features identified by each of the two techniques as the most influential on the success of a telemarketing contact, totalling 13 features for MI and 9 for DSA. The latter performs better for lower values of false positives, while the former is slightly better for a higher false-positive ratio. Thus, MI becomes a better choice if the intention is to slightly reduce the cost of contacts without risking the loss of a high number of successes. However, DSA achieved good prediction results with fewer features.
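    As a rough illustration of the MI side of this comparison (DSA is a model-based sensitivity analysis and is not reproduced here), the sketch below ranks features with mutual information on synthetic data and scores logistic regression on subsets of 13 and 9 features, mirroring the feature counts reported above.

```python
# Sketch: mutual-information feature ranking plus a logistic regression
# scored on two subset sizes; data are synthetic, not the bank case study.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)

mi = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(mi).sort_values(ascending=False)
print("top features by MI:", list(ranking.index[:9]))

for k in (13, 9):  # feature-set sizes mirroring the study's MI / DSA counts
    selector = SelectKBest(mutual_info_classif, k=k)
    X_k = selector.fit_transform(X, y)
    auc = cross_val_score(LogisticRegression(max_iter=1000), X_k, y,
                          scoring="roc_auc", cv=5)
    print(f"k={k}: mean AUC = {auc.mean():.3f}")
```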

    Application of SMOTE to Handle Imbalance Class in Deposit Classification Using the Extreme Gradient Boosting Algorithm

    Deposits are one of the main products and funding sources for banks, so increasing deposit marketing is very important. However, telemarketing as a form of deposit marketing is neither effective nor efficient when it requires calling every customer with deposit offers. Identifying potential deposit customers is therefore necessary so that telemarketing becomes more effective and efficient by targeting the right customers, improving bank marketing performance with the ultimate goal of increasing the bank's funding sources. To identify such customers, data mining is applied to the UCI Bank Marketing Dataset from a Portuguese banking institution, which consists of 45,211 records with 17 attributes. The classification algorithm used is Extreme Gradient Boosting (XGBoost), which is suitable for large datasets. The data exhibit a high class imbalance, with "yes" and "no" percentages of 11.7% and 88.3%, respectively. The proposed solution, focused on addressing the class imbalance in the bank marketing dataset, therefore combines the Synthetic Minority Over-sampling Technique (SMOTE) with XGBoost. XGBoost alone achieved an accuracy of 0.91016, precision of 0.79476, recall of 0.72928, F1-score of 0.56198, ROC area of 0.93831, and AUCPR of 0.63886. After SMOTE was applied, accuracy was 0.91072, precision 0.78883, recall 0.75588, F1-score 0.59153, ROC area 0.93723, and AUCPR 0.63733. The results show that XGBoost with SMOTE can outperform other algorithms such as K-Nearest Neighbor, Random Forest, Logistic Regression, Artificial Neural Network, Naïve Bayes, and Support Vector Machine in terms of accuracy. This study contributes to the development of effective machine learning models that can serve as a support system for information technology experts in the finance and banking industries to identify potential customers interested in subscribing to deposits, thereby increasing bank funding sources.
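    A minimal sketch of the described pipeline is given below: SMOTE applied to the training split only, followed by an XGBoost classifier. It assumes the UCI file bank-full.csv is available locally; the hyperparameters are illustrative choices, not the paper's tuned values.

```python
# Sketch (not the paper's exact pipeline): SMOTE on the training split only,
# then an XGBoost classifier, on the UCI Bank Marketing data ("y" is the target).
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("bank-full.csv", sep=";")                 # UCI Bank Marketing file
X = pd.get_dummies(df.drop(columns=["y"]), dtype=float)    # one-hot encode categoricals
y = (df["y"] == "yes").astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)  # oversample minority class in training data only

model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                      eval_metric="logloss", random_state=42)
model.fit(X_res, y_res)

proba = model.predict_proba(X_te)[:, 1]
print(classification_report(y_te, model.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, proba))
```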

    Identifying Long-Term Deposit Customers: A Machine Learning Approach

    The majority of revenue in the banking sector is usually generated from customers' long-term deposits, so it is important for banks to understand customer characteristics in order to increase product sales. To aid this, marketing strategies are employed to target potential customers and let them interact with the banks directly, generating a large amount of data on customer characteristics and demographics. In recent years, it has been shown that various data analysis, feature selection and machine learning techniques can be employed to analyze customer characteristics as well as the variables that significantly impact customer decisions. These methods can be used to identify consumers in different categories and to predict whether a customer will subscribe to a long-term deposit, allowing the marketing strategy to be more successful. In this study, we take an R programming approach to analyze financial transaction data and gain insight into how business processes can be improved, using data mining techniques to find interesting trends and make more data-driven decisions. We use statistical analyses such as Exploratory Data Analysis (EDA), Principal Component Analysis (PCA), Factor Analysis and correlations on the given data set. The study's goal is to use at least three typical classification algorithms among Logistic Regression, Random Forest, Support Vector Machine and K-nearest neighbors, and to build predictive models of customers signing up for long-term deposits. The best accuracy, 90.64%, was obtained with Logistic Regression, with a sensitivity of 99.05%. Results were analyzed using the accuracy, sensitivity, and specificity scores of these algorithms.
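    The study itself used R; for consistency with the other sketches in this listing, the snippet below shows the equivalent evaluation step in Python: accuracy, sensitivity and specificity derived from a logistic regression confusion matrix on synthetic, imbalanced data.

```python
# Sketch of the evaluation step only (the study used R): accuracy, sensitivity
# and specificity from a logistic-regression confusion matrix on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=15, weights=[0.88, 0.12], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)     # true positive rate
specificity = tn / (tn + fp)     # true negative rate
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```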

    An Enhanced Spectral Clustering Algorithm with S-Distance

    This work was partially supported by the project "Prediction of diseases through computer assisted diagnosis system using images captured by minimally-invasive and non-invasive modalities", Computer Science and Engineering, PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur, India (ID: SPARCMHRD-231); by the project "Smart Solutions in Ubiquitous Computing Environments", Grant Agency of Excellence, University of Hradec Kralove, Faculty of Informatics and Management, Czech Republic (ID: UHK-FIM-GE-2204/2021); and by Universiti Teknologi Malaysia (UTM) under Research University Grant Vot-20H04, Malaysia Research University Network (MRUN) Vot 4L876 and the Fundamental Research Grant Scheme (FRGS) Vot 5F073, supported by the Ministry of Education Malaysia. Calculating and monitoring customer churn metrics is important for companies to retain customers and earn more profit. In this study, a churn prediction framework is developed using a modified spectral clustering (SC) algorithm. The similarity measure plays an imperative role in clustering for predicting churn with better accuracy when analyzing industrial data. The linear Euclidean distance in traditional SC is replaced by the non-linear S-distance (Sd), which is deduced from the concept of S-divergence (SD). Several characteristics of Sd are discussed in this work. Experiments are conducted to evaluate the proposed clustering algorithm on four synthetic, eight UCI, two industrial, and one telecommunications database related to customer churn. Three existing clustering algorithms (k-means, density-based spatial clustering of applications with noise, and conventional SC) are also implemented on the above-mentioned 15 databases. The empirical outcomes show that the proposed clustering algorithm beats the three existing algorithms in terms of Jaccard index, F-score, recall, precision and accuracy. Finally, the significance of the clustering results is tested with Wilcoxon's signed-rank test, Wilcoxon's rank-sum test, and sign tests. The comparative study shows that the outcomes of the proposed algorithm are interesting, especially in the case of clusters of arbitrary shape.
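    The key idea above is swapping the Euclidean distance inside spectral clustering for a non-linear distance. The sketch below only illustrates the mechanism, namely feeding a precomputed affinity matrix to spectral clustering; the `placeholder_distance` function is a stand-in, since the paper's S-distance formula is not reproduced here.

```python
# Structural sketch: spectral clustering over a precomputed affinity matrix,
# the hook through which a non-Euclidean distance such as Sd could be plugged in.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import pairwise_distances

X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

def placeholder_distance(a, b):
    # stand-in for the S-distance derived from the S-divergence (formula not shown in the abstract)
    return np.sqrt(np.sum((a - b) ** 2))

D = pairwise_distances(X, metric=placeholder_distance)
sigma = np.median(D)
affinity = np.exp(-(D ** 2) / (2 * sigma ** 2))    # Gaussian kernel over the chosen distance

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(np.bincount(labels))
```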

    A Study of Feature Selection Methods for Classification

    This project implements several feature selection techniques applied to improve automatic classification performance. Several classification problems from a publicly available database repository of biophysical data are analyzed. The problems may cover the following medical subjects: arrhythmia, breast cancer, heart failure, and hepatitis C virus. The data consist of features extracted from electrocardiographic (ECG) signals, electroencephalographic (EEG) signals, medical images, anamnesis, etc. In some cases, a preprocessing step may be required to deal with normalization, artifact removal, and missing data. The objective of feature selection methods is to obtain a sorted list of the full set of features according to some defined criteria. From this ranking, a smaller set of features can be obtained that improves classification performance and avoids possible overfitting of the trained classification model. In this project, a comparison of feature selection methods is made, including Relief-based and sequential feature selection (SFS) methods. As classifiers, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and support vector machines (SVM), among others, are considered. The quality of the classification results is evaluated using different ranges of the feature ranking list and several indices such as accuracy, balanced accuracy, and the confusion matrix. In addition, the computational cost of the different classification cases is estimated. Liu, C. (2022). A Study of Feature Selection Methods for Classification. Universitat Politècnica de València. http://hdl.handle.net/10251/182468
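    As a small illustration of the sequential feature selection (SFS) part of this comparison (Relief-based ranking would require an additional package and is not shown), the sketch below wraps an LDA classifier in scikit-learn's SequentialFeatureSelector on the breast cancer dataset, one of the medical subjects listed above; the choice of eight features is arbitrary.

```python
# Sketch of forward sequential feature selection wrapping an LDA classifier;
# the breast cancer dataset and the eight-feature target are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
lda = LinearDiscriminantAnalysis()

sfs = SequentialFeatureSelector(lda, n_features_to_select=8, direction="forward", cv=5)
sfs.fit(X, y)
X_sel = sfs.transform(X)

full = cross_val_score(lda, X, y, cv=5, scoring="balanced_accuracy").mean()
sel = cross_val_score(lda, X_sel, y, cv=5, scoring="balanced_accuracy").mean()
print(f"balanced accuracy: all features={full:.3f}, selected 8={sel:.3f}")
```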

    Balancing Data Protection and Model Accuracy: An Investigation of Protection Methods on Machine Learning Model Performance for a Bank Marketing Dataset

    The practice of sharing customer data among companies for marketing purposes is becoming increasingly common. However, sharing customer-level data poses potential risks and serious problems for businesses, such as substantial declines in brand value, erosion of customer trust, loss of competitive advantage, and the imposition of legal penalties (Schneider et al. 2017). These may eventually lead to financial loss and reputation damage for the companies. With growing awareness of the value of personal information, more companies and customers are concerned about protecting data privacy. In this paper, we use marketing data from a Portuguese bank to explore methods for balancing prediction accuracy and customer data privacy using various machine learning and data privacy techniques. The dataset includes observations from 45,211 respondents, and the observation period runs from May 2008 to November 2010. Our goal is to find a method that enables third parties to share data with the bank while safeguarding customer privacy and maintaining accuracy in predicting customer behaviour. We tested several machine learning models (Logistic Regression, Random Forest, and a feedforward Neural Network) on the original data and then chose Random Forest, which gave the best prediction performance, as the model with which to proceed. After applying two different data privacy methods (sampling and random noise) to the original data, we found that the Random Forest model achieves accuracy levels very close to those obtained before applying the privacy methods. In doing so, we demonstrate a method for companies to protect customer data privacy without sacrificing predictive accuracy. The results of this study have significant implications for companies that seek to share customer data while maintaining high levels of privacy and accuracy.
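    A minimal sketch of the two protection methods described above, random subsampling and additive random noise, is given below; it uses synthetic data rather than the bank dataset, and the sample fraction and noise scale are illustrative choices, not the thesis settings.

```python
# Sketch (not the thesis code): subsample rows and add Gaussian noise to the
# numeric features, then compare Random Forest accuracy with and without protection.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=12, weights=[0.88, 0.12], random_state=7)
df = pd.DataFrame(X)
df["y"] = y
train, test = train_test_split(df, test_size=0.3, stratify=df["y"], random_state=7)

def protect(data, frac=0.7, noise_scale=0.1, seed=7):
    """Subsample the rows, then add Gaussian noise to the numeric features."""
    rng = np.random.default_rng(seed)
    out = data.sample(frac=frac, random_state=seed).copy()
    num_cols = out.columns.drop("y")
    out[num_cols] += rng.normal(0, noise_scale * out[num_cols].std(), out[num_cols].shape)
    return out

for name, train_set in [("original", train), ("protected", protect(train))]:
    rf = RandomForestClassifier(n_estimators=300, random_state=7)
    rf.fit(train_set.drop(columns=["y"]), train_set["y"])
    acc = accuracy_score(test["y"], rf.predict(test.drop(columns=["y"])))
    print(f"{name}: accuracy = {acc:.3f}")
```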