2,406 research outputs found

    Finding the True Frequent Itemsets

    Full text link
    Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction $\theta$ of a transactional dataset $\mathcal{D}$. Often though, the ultimate goal of mining $\mathcal{D}$ is not an analysis of the dataset \emph{per se}, but the understanding of the underlying process that generated it. Specifically, in many applications $\mathcal{D}$ is a collection of samples obtained from an unknown probability distribution $\pi$ on transactions, and by extracting the FIs in $\mathcal{D}$ one attempts to infer itemsets that are frequently (i.e., with probability at least $\theta$) generated by $\pi$, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of \emph{false positives}, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold $\hat{\theta}$ such that the collection of itemsets with frequency at least $\hat{\theta}$ in $\mathcal{D}$ contains only TFIs with probability at least $1-\delta$, for some user-specified $\delta$. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positives. We also experimentally compare our method with the direct mining of $\mathcal{D}$ at frequency $\theta$ and with techniques based on widely used standard bounds (i.e., the Chernoff bounds) for the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis. Comment: 13 pages; extended version of work that appeared in the SIAM International Conference on Data Mining, 2014
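
    The paper derives $\hat{\theta}$ from the empirical VC-dimension of the problem; the sketch below only illustrates the overall shape of the approach. Here `vc_epsilon` (with its constant `c` and the VC-dimension bound `d`) is an illustrative stand-in for the paper's actual deviation bound, and `frequent_itemsets` is a naive miner included only to make the example self-contained.

```python
import math
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, theta, max_len=3):
    """Naive FI miner: itemsets of size <= max_len whose relative
    frequency in `transactions` is at least theta."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {s: c / n for s, c in counts.items() if c / n >= theta}

def vc_epsilon(n, d, delta, c=0.5):
    """Standard VC-style uniform deviation bound (constant c assumed):
    with probability >= 1 - delta, every itemset's true frequency under
    pi is within eps of its empirical frequency in the n samples, where
    d upper-bounds the (empirical) VC-dimension of the range space."""
    return math.sqrt((c / n) * (d + math.log(1.0 / delta)))

def true_frequent_itemsets(transactions, theta, d, delta):
    # Raise the mining threshold so that, under the bound above, every
    # itemset reported at frequency >= theta_hat is a TFI w.p. >= 1 - delta.
    eps = vc_epsilon(len(transactions), d, delta)
    theta_hat = theta + eps
    return frequent_itemsets(transactions, theta_hat)
```

    Since the deviation $\varepsilon$ shrinks as $O(\sqrt{1/n})$, the inflated threshold $\hat{\theta} = \theta + \varepsilon$ sits only slightly above $\theta$ on realistically large datasets, trading a few missed TFIs for the no-false-positives guarantee.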

    A framework for personalized dynamic cross-selling in e-commerce retailing

    Get PDF
    Cross-selling and product bundling are prevalent strategies in the retail sector. Instead of static bundling offers, i.e., giving the same offer to everyone, personalized dynamic cross-selling generates targeted bundle offers and can help maximize revenues and profits. In resolving the two basic problems of dynamic cross-selling, which involve selecting the right complementary products and optimizing the discount, the issue of computational complexity becomes central as the customer base and the length of the product list grow. Traditional recommender systems are built upon simple collaborative filtering techniques, which exploit the informational cues gained from users in the form of product ratings and rating differences across users. The retail setting differs in that there are only records of transactions (in period X, customer Y purchased product Z). Instead of a range of explicit rating scores, transactions form binary datasets: 1 = purchased, 0 = not purchased. This makes it a one-class collaborative filtering (OCCF) problem. Notwithstanding the wider application domains of such an OCCF problem, very little work has been done in the retail setting. This research addresses this gap by developing an effective framework for dynamic cross-selling for online retailing. In the first part of the research, we propose an effective yet intuitive approach to integrating temporal information about a product's lifecycle (i.e., the non-stationary nature of the sales history) as a weight component in latent-factor-based OCCF models, improving the quality of personalized product recommendations. To remain scalable for the large product catalogs and sparse transactions typical of online retailing, the approach relies on the product catalog hierarchy and on segments (rather than individual SKUs) for collaborative filtering. In the second part of the work, we propose effective bundle discount policies, which estimate a specific customer's interest in potential cross-selling products (identified using the proposed OCCF methods) and calibrate the discount to strike an effective balance between the probability of offer acceptance and the size of the discount. We also developed a highly effective simulation platform that generates e-retailer transactions under various settings, which we use to test and validate the proposed methods. To the best of our knowledge, this is the first study to address real-time personalized dynamic cross-selling with discounting. The proposed techniques are applicable to cross-selling, up-selling, and personalized and targeted selling within the e-retail business domain. Through extensive analysis of various market scenario setups, we also provide a number of managerial insights on the performance of cross-selling strategies.
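
    As a rough illustration of the first part, the sketch below trains a latent-factor OCCF model by SGD with a per-interaction temporal weight. The exponential half-life decay, the negative-sampling scheme, and all hyperparameters are illustrative assumptions, not the dissertation's actual weight component.

```python
import numpy as np

def weighted_occf_mf(interactions, n_users, n_items, k=16, epochs=20,
                     lr=0.05, reg=0.01, half_life=90.0, t_now=365.0):
    """Minimal latent-factor OCCF sketch. `interactions` is a list of
    (user, item, t_purchase) tuples; purchases are implicit positives
    (target 1) whose weight decays with age to model the lifecycle."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, k))   # item (or segment) factors
    for _ in range(epochs):
        for u, i, t in interactions:
            w = 0.5 ** ((t_now - t) / half_life)  # recent purchases weigh more
            err = 1.0 - P[u] @ Q[i]
            pu = P[u].copy()
            P[u] += lr * (w * err * Q[i] - reg * P[u])
            Q[i] += lr * (w * err * pu - reg * Q[i])
        # sample unobserved (user, item) pairs as low-confidence negatives
        for _ in range(len(interactions)):
            u, i = rng.integers(n_users), rng.integers(n_items)
            err = 0.0 - P[u] @ Q[i]
            pu = P[u].copy()
            P[u] += lr * (0.1 * err * Q[i] - reg * P[u])
            Q[i] += lr * (0.1 * err * pu - reg * Q[i])
    return P, Q

# toy usage: 3 users, 4 items, purchases with day-of-purchase timestamps
events = [(0, 1, 300.0), (0, 2, 350.0), (1, 1, 100.0), (2, 3, 360.0)]
P, Q = weighted_occf_mf(events, n_users=3, n_items=4)
print((P[0] @ Q.T).round(2))  # cross-sell scores for user 0 over all items
```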

    Predictive Modelling of Retail Banking Transactions for Credit Scoring, Cross-Selling and Payment Pattern Discovery

    Get PDF
    Evaluating transactional payment behaviour offers a competitive advantage in the modern payment ecosystem, not only for confirming good credit applicants and unlocking the cross-selling potential between the product and service portfolios of financial institutions, but also for ruling out bad credit applicants directly from transactional payment streams. As a diagnostic test for analysing payment behaviour, I have used a hybrid approach combining supervised and unsupervised learning algorithms to discover behavioural patterns. Supervised learning algorithms can compute a range of credit scores and cross-sell candidates, although the applied methods discover only limited behavioural patterns across the payment streams. Moreover, the performance of the applied supervised learning algorithms varies across the different data models, and their optimisation is inversely related to the pre-processed dataset. The research experiments conducted suggest that the Two-Class Decision Forest is an effective algorithm for determining both the cross-sell candidates and the creditworthiness of customers. In addition, a deep-learning model using neural networks has been considered, providing a meaningful interpretation of future payment behaviour through categorised payment transactions, in particular by offering additional deep insights through graph-based visualisations. However, the research shows that unsupervised learning algorithms play a central role in evaluating the transactional payment behaviour of customers: discovering associations through market basket analysis of previous payment transactions, finding frequent transaction categories, and deriving interesting rules among transaction categories occurring on the same payment stream. The research also reveals that transactional payment behaviour analysis is multifaceted in the financial industry, serving both to assess promotion candidates and to classify bad credit applicants within the entire customer base. The developed predictive models can also be used to estimate the credit risk of any credit applicant based on his/her transactional payment behaviour profile, combined with deep insights from the categorised payment transaction analysis. The study provides a full review of the performance characteristics of the different developed data models. The demonstrated data science approach is thus a possible demonstration of how machine learning models can be turned into cost-sensitive data models.
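
    For the market basket analysis step, a minimal sketch using the mlxtend library (an assumed tool choice; the thesis does not prescribe one) shows how frequent transaction categories and association rules can be mined from categorised payment streams:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical categorised payment streams: each row is the set of
# transaction categories a customer used in one period.
streams = [
    ["groceries", "fuel", "insurance"],
    ["groceries", "dining", "travel"],
    ["groceries", "fuel", "dining"],
    ["insurance", "travel", "fuel"],
]

# One-hot encode the streams, then mine frequent category sets and
# derive association rules between them.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(streams).transform(streams), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```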

    Frequent itemset mining on multiprocessor systems

    Get PDF
    Frequent itemset mining is an important building block in many data mining applications like market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow up to hundreds of gigabytes or even terabytes of data. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which however (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting the available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads as well as other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining twofold: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their costs by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the size of the intermediate data by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit the available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they further scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms for mining other types of itemsets.
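
    As an illustration of hot spot (4), the classic scalar two-pointer intersection of sorted transaction-id lists is sketched below. This baseline loop is the building block that the thesis accelerates with vector instructions and threads; the sketch shows only the underlying operation, not the parallel implementation.

```python
def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted tid-lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# The support of an itemset {x, y} is the length of the intersection of
# the transaction-id lists of x and y (as in Eclat-style miners).
tids_x = [1, 3, 4, 7, 9, 12]
tids_y = [2, 3, 7, 8, 12, 15]
print(intersect_sorted(tids_x, tids_y))  # [3, 7, 12] -> support 3
```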

    Scalable and Weakly Supervised Bank Transaction Classification

    Full text link
    This paper aims to categorize bank transactions using weak supervision, natural language processing, and deep neural network techniques. Our approach minimizes the reliance on expensive and difficult-to-obtain manual annotations by leveraging heuristics and domain knowledge to train accurate transaction classifiers. We present an effective and scalable end-to-end data pipeline, including data preprocessing, transaction text embedding, anchoring, label generation, and discriminative neural network training, together with an overview of the system architecture. We demonstrate the effectiveness of our method by showing that it outperforms existing market-leading solutions, achieves accurate categorization, and can be quickly extended to novel and composite use cases. This can in turn unlock many financial applications such as financial health reporting and credit risk assessment.
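
    A minimal sketch of the weak-supervision idea follows, assuming hypothetical keyword heuristics as labeling functions and a scikit-learn classifier standing in for the paper's embedding-plus-neural-network pipeline; none of these rules or names come from the paper.

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ABSTAIN = None

# Keyword heuristics playing the role of labeling functions; a real
# pipeline would use many such domain rules plus text embeddings.
def lf_groceries(text):
    return "groceries" if any(w in text.lower() for w in ("market", "grocer")) else ABSTAIN

def lf_transport(text):
    return "transport" if any(w in text.lower() for w in ("uber", "metro", "fuel")) else ABSTAIN

def weak_label(text, lfs=(lf_groceries, lf_transport)):
    # Majority vote over non-abstaining labeling functions.
    votes = [y for lf in lfs if (y := lf(text)) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

raw = ["WHOLE FOODS MARKET #123", "UBER TRIP 4XZ", "SHELL FUEL 0981", "CITY GROCER LLC"]
labeled = [(t, y) for t in raw if (y := weak_label(t)) is not ABSTAIN]

# Discriminative model trained on the weak labels, so it can generalize
# beyond the exact keywords the heuristics fire on.
texts, ys = zip(*labeled)
clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
                    LogisticRegression(max_iter=1000))
clf.fit(texts, ys)
print(clf.predict(["SAFEWAY MARKET 55", "METRO CARD RELOAD"]))
```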

    Combining Sequential and Aggregated Data for Churn Prediction in Casual Freemium Games

    Full text link
    In freemium games, the revenue from a player comes from in-app purchases and the advertisements to which that player is exposed. The longer a player keeps playing the game, the higher the chances that he or she will generate revenue within the game. In this scenario, it is extremely important to promptly detect when a player is about to quit playing (churn) in order to react and attempt to retain the player within the game, thus prolonging his or her game lifetime. In this article we investigate how to improve the current state of the art in churn prediction by combining sequential and aggregate data using different neural network architectures. The results of the comparative analysis show that combining the two data types yields an improvement in prediction accuracy over predictors based on purely sequential or purely aggregated data.
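
    One plausible way to combine the two data types, sketched below in PyTorch with illustrative sizes, is to encode the per-day activity sequence with an LSTM and concatenate its final hidden state with the aggregate features before a small prediction head; the article compares several architectures, and this is not the authors' exact model.

```python
import torch
import torch.nn as nn

class ChurnNet(nn.Module):
    """LSTM encodes the activity sequence; its last hidden state is
    fused with aggregate features and fed to an MLP head that outputs
    the churn probability."""
    def __init__(self, seq_features=8, agg_features=12, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(seq_features, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + agg_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, seq, agg):
        _, (h_n, _) = self.lstm(seq)          # h_n: (1, batch, hidden)
        z = torch.cat([h_n[-1], agg], dim=1)  # fuse sequential + aggregated
        return torch.sigmoid(self.head(z)).squeeze(1)

model = ChurnNet()
seq = torch.randn(4, 30, 8)    # 4 players, 30 days, 8 activity features/day
agg = torch.randn(4, 12)       # 4 players, 12 aggregate features
print(model(seq, agg))         # churn probabilities in (0, 1)
```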

    A Comprehensive Survey of Data Mining-based Fraud Detection Research

    Full text link
    This survey paper categorises, compares, and summarises almost all published technical and review articles on automated fraud detection within the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers many more technical articles and is the only one, to the best of our knowledge, which proposes alternative data and solutions from related domains. Comment: 14 pages