
    Feature Selection in Large Scale Data Stream for Credit Card Fraud Detection

    There is growing interest in learning accurate models from large-scale data streams. Focusing on time-oriented variation, we propose a method that contracts time-series data in a data stream into compact features. In addition, the proposed method combines several simple contraction methods with the original features. In our experiments we use a real data stream of credit card transactions, which is large-scale and difficult to classify. The experiments show that the proposed method improves classification performance, depending on the training data. However, the method still lacks generality, so we plan to improve it by selecting a suitable combination of contraction method and feature for each feature.
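    As a hypothetical illustration of what "contracting" a transaction stream into fixed-size features might look like (the function name, window size, and summary statistics below are our own assumptions, not the paper's actual method), a sliding window can compress recent history into summaries that sit alongside each original feature:

```python
from collections import deque

def contract_window(amounts, window=5):
    """Contract a stream of transaction amounts into fixed-size
    summary features over a sliding window. A hypothetical sketch
    of time-series contraction, not the paper's proposed method."""
    buf = deque(maxlen=window)  # keeps only the most recent `window` values
    features = []
    for a in amounts:
        buf.append(a)
        features.append({
            "amount": a,                    # original feature, kept as-is
            "win_mean": sum(buf) / len(buf),  # contraction 1: moving average
            "win_max": max(buf),              # contraction 2: moving maximum
            "win_count": len(buf),            # contraction 3: window fill level
        })
    return features
```

    Combining several such simple contractions with the raw feature gives a classifier a fixed-width view of time-oriented variation without storing the full history.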

    Cost-aware Generalized α-investing for Multiple Hypothesis Testing

    Full text link
    We consider the problem of sequential multiple hypothesis testing with nontrivial data collection costs. This problem appears, for example, when conducting biological experiments to identify differentially expressed genes of a disease process. This work builds on the generalized α-investing framework, which enables control of the false discovery rate in a sequential testing setting. We make a theoretical analysis of the long-term asymptotic behavior of α-wealth which motivates a consideration of sample size in the α-investing decision rule. Posing the testing process as a game with nature, we construct a decision rule that optimizes the expected α-wealth reward (ERO) and provides an optimal sample size for each test. Empirical results show that a cost-aware ERO decision rule correctly rejects more false null hypotheses than other methods for n = 1, where n is the sample size. When the sample size is not fixed, cost-aware ERO uses a prior on the null hypothesis to adaptively allocate the sample budget to each test. We extend cost-aware ERO investing to finite-horizon testing, which enables the decision rule to allocate samples in a non-myopic manner. Finally, empirical tests on real data sets from biological experiments show that cost-aware ERO balances the allocation of samples to an individual test against the allocation of samples across multiple tests.
    Comment: 26 pages, 5 figures, 8 tables
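    For readers unfamiliar with the underlying framework, a minimal sketch of plain α-investing (in the spirit of Foster and Stine's rule, not the paper's cost-aware ERO variant; the policy of betting half the current wealth on each test is an arbitrary choice for illustration) shows how a wealth budget governs sequential testing:

```python
def alpha_investing(p_values, w0=0.05, payout=0.05):
    """Simplified alpha-investing for sequential hypothesis testing.
    A rejection earns a payout of wealth; a failed test costs wealth,
    which controls the false discovery rate in expectation. This is a
    toy illustration, not the paper's cost-aware ERO decision rule."""
    wealth = w0
    decisions = []
    for p in p_values:
        alpha_j = wealth / 2.0  # spending policy: bet half the current wealth
        reject = p <= alpha_j
        if reject:
            wealth += payout                          # reward for a discovery
        else:
            wealth -= alpha_j / (1.0 - alpha_j)       # cost of the failed test
        decisions.append(reject)
        if wealth <= 0:
            break  # wealth exhausted: no further tests allowed
    return decisions
```

    The paper's contribution can be read as replacing the fixed spending policy above with one that also optimizes how many samples to collect for each test.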

    Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

    In marketing we are often confronted with a continuous stream of responses to marketing messages. Such streaming data provide invaluable information regarding message effectiveness and segmentation. However, streaming data are hard to analyze using conventional methods: their high volume, and the fact that they are continuously augmented, mean that it takes considerable time to analyze them. We propose a method for estimating a finite mixture of logistic regression models which can be used to cluster customers based on a continuous stream of responses. This method, which we coin oFMLR, allows segments to be identified in data streams or extremely large static datasets. In contrast to black-box algorithms, oFMLR provides model estimates that are directly interpretable. We first introduce oFMLR, explaining in passing general topics such as online estimation and the EM algorithm, making this paper a high-level overview of possible methods of dealing with large data streams in marketing practice. Next, we discuss model convergence, identifiability, and relations to alternative, Bayesian, methods; we also identify more general issues that arise from dealing with continuously augmented data sets. Finally, we introduce the oFMLR [R] package and evaluate the method by numerical simulation and by analyzing a large customer clickstream dataset.
    Comment: 1 figure. Working paper including [R] package
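    A single online EM update for a mixture of logistic regressions can be sketched as follows (a toy one-dimensional, two-component version written for illustration; the oFMLR package's actual estimator and update schedule differ):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ofmlr_step(x, y, weights, mix, lr=0.1):
    """One online EM update for a two-component mixture of logistic
    regressions: a minimal sketch in the spirit of oFMLR, not the
    package's implementation. x: scalar feature, y: 0/1 response,
    weights: per-component coefficients, mix: mixing proportions."""
    # E-step: responsibility of each component for this observation
    liks = []
    for pi_k, w_k in zip(mix, weights):
        p = sigmoid(w_k * x)
        liks.append(pi_k * (p if y == 1 else 1.0 - p))
    total = sum(liks)
    resp = [l / total for l in liks]
    # M-step: responsibility-weighted SGD step on each component's
    # coefficient, plus a running update of the mixing proportions
    for k, (r, w_k) in enumerate(zip(resp, weights)):
        grad = (y - sigmoid(w_k * x)) * x   # logistic log-likelihood gradient
        weights[k] = w_k + lr * r * grad
        mix[k] = mix[k] + lr * (r - mix[k])
    return weights, mix
```

    Because each observation is folded into the estimates and then discarded, memory use stays constant no matter how long the response stream runs, which is the property that makes this style of estimator attractive for clickstream data.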