16 research outputs found
Feature Selection in Large Scale Data Stream for Credit Card Fraud Detection
There is increasing interest in learning accurate models from large-scale data streams. In this paper, focusing on time-oriented variation, we propose a method for contracting time-series data in a data stream. In addition, the proposed method combines several simple contraction methods with the original features. In our experiments we use a real data stream of credit card transactions, because it is large-scale and difficult to classify. The experiments show that the proposed method improves classification performance, depending on the training data. However, the method still lacks generality. We therefore plan to improve its generality by selecting a suitable combination of contraction method and feature for each feature in our method.
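The abstract above does not specify its contraction methods, so as a purely illustrative sketch, here is one common way to "contract" a transaction stream: summarize a sliding window of recent values into a few aggregate features that a classifier can consume alongside the original features. The function names and window size are assumptions, not the paper's method.

```python
import statistics

def contract_window(amounts):
    # Contract a window of transaction amounts into simple summary features.
    # (Illustrative aggregates only; the paper's contraction methods differ.)
    return {
        "count": len(amounts),
        "mean": statistics.mean(amounts),
        "max": max(amounts),
        "stdev": statistics.pstdev(amounts),
    }

def sliding_features(stream, window=5):
    # Yield contracted features over a sliding window of the stream,
    # so a model sees a fixed-size summary instead of the raw series.
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)
        if len(buf) == window:
            yield contract_window(buf)

# A spike like 500 among small amounts shows up in the max/stdev features.
feats = list(sliding_features([10, 12, 500, 11, 9, 13, 14], window=5))
```

Windowed aggregates like these are one way such contracted features could expose fraud-like anomalies while keeping the per-example representation small.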
Cost-aware Generalized α-investing for Multiple Hypothesis Testing
We consider the problem of sequential multiple hypothesis testing with
nontrivial data collection costs. This problem appears, for example, when
conducting biological experiments to identify differentially expressed genes in
a disease process. This work builds on the generalized α-investing
framework, which enables control of the false discovery rate in a sequential
testing setting. We provide a theoretical analysis of the long-term asymptotic
behavior of α-wealth, which motivates a consideration of sample size in
the α-investing decision rule. Posing the testing process as a game with
nature, we construct a decision rule that optimizes the expected
α-wealth reward (ERO) and provides an optimal sample size for each test.
Empirical results show that a cost-aware ERO decision rule correctly rejects
more false null hypotheses than other methods when the sample size is fixed.
When the sample size is not fixed, cost-aware ERO uses a prior on the null
hypothesis to adaptively allocate the sample budget to each test. We extend
cost-aware ERO investing to finite-horizon testing, which enables the decision
rule to allocate samples in a non-myopic manner. Finally, empirical tests on
real data sets from biological experiments show that cost-aware ERO balances
the allocation of samples to an individual test against the allocation of
samples across multiple tests.
Comment: 26 pages, 5 figures, 8 tables
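To make the α-wealth idea behind this abstract concrete, here is a minimal sketch of plain α-investing in the Foster–Stine style, not the paper's cost-aware ERO rule: each test wagers a fraction of the current α-wealth, pays the wager on a non-rejection, and earns a payout on a rejection. The spending rule (half the current wealth) and payout value are assumptions chosen for illustration.

```python
def alpha_invest(p_values, w0=0.05, payout=0.05):
    # Minimal α-investing sketch (illustrative, not the cost-aware ERO rule):
    # spend part of the α-wealth on each test, earn a payout on rejection.
    wealth = w0
    rejections = []
    for j, p in enumerate(p_values):
        if wealth <= 0:
            break  # wealth exhausted: no further tests can be made
        alpha_j = wealth / 2          # simple spending rule (assumption)
        level = alpha_j / (1 + alpha_j)  # effective testing level
        if p <= level:
            rejections.append(j)
            wealth += payout          # reward for a rejection
        else:
            wealth -= alpha_j         # pay the wager on a non-rejection
    return rejections
```

A strong early rejection replenishes wealth and lets later tests be run at higher levels; a run of non-rejections drains wealth, which is exactly the dynamic whose long-term behavior the paper analyzes.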
Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream
In marketing we are often confronted with a continuous stream of responses to
marketing messages. Such streaming data provide invaluable information
regarding message effectiveness and segmentation. However, streaming data are
hard to analyze using conventional methods: their high volume and the fact that
they are continuously augmented means that it takes considerable time to
analyze them. We propose a method for estimating a finite mixture of logistic
regression models which can be used to cluster customers based on a continuous
stream of responses. This method, which we coin oFMLR, allows segments to be
identified in data streams or extremely large static datasets. Contrary to
black box algorithms, oFMLR provides model estimates that are directly
interpretable. We first introduce oFMLR, explaining in passing general topics
such as online estimation and the EM algorithm, making this paper a high-level
overview of possible methods of dealing with large data streams in marketing
practice. Next, we discuss model convergence, identifiability, and relations to
alternative Bayesian methods; we also identify more general issues that arise
from dealing with continuously augmented data sets. Finally, we introduce the
oFMLR [R] package and evaluate the method by numerical simulation and by
analyzing a large customer clickstream dataset.
Comment: 1 figure. Working paper including [R] package
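To illustrate the kind of online estimation this abstract refers to, here is a simplified single-observation update for a mixture of logistic regressions: an E-step computing each component's responsibility for the new observation, followed by a responsibility-weighted stochastic gradient M-step. This is a sketch of the general idea, not the oFMLR algorithm or its [R] implementation; the function name, learning rate, and mixing-proportion update are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_mixture_step(x, y, weights, mix, lr=0.1):
    # One online EM-style update for a mixture of logistic regressions
    # (a simplified sketch, not the exact oFMLR update).
    # weights: per-component coefficient vectors; mix: mixing proportions.

    # E-step: posterior responsibility of each component for (x, y)
    liks = []
    for w in weights:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        liks.append(p if y == 1 else 1 - p)
    total = sum(m * l for m, l in zip(mix, liks))
    resp = [m * l / total for m, l in zip(mix, liks)]

    # M-step: responsibility-weighted stochastic gradient ascent
    # on each component's logistic log-likelihood
    for k, w in enumerate(weights):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):
            w[i] += lr * resp[k] * (y - p) * x[i]

    # nudge mixing proportions toward the responsibilities
    for k in range(len(mix)):
        mix[k] += lr * (resp[k] - mix[k])
    return weights, mix

weights, mix = online_mixture_step([1.0], 1, [[0.0], [0.0]], [0.5, 0.5])
```

Because each response updates the model in constant time and is then discarded, an update of this shape is what makes estimation on a continuous, ever-growing stream feasible at all.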