16 research outputs found
Feature Selection in Large Scale Data Stream for Credit Card Fraud Detection
There is increasing interest in learning accurate models from large-scale data streams. In this paper, focusing on time-oriented variation, we propose a method for contracting time-series data in a data stream. In addition, the proposed method combines several simple contraction methods with the original features. In our experiments we use a real data stream of credit card transactions, because it is large-scale and difficult to classify. The experiments show that the proposed method improves classification performance, depending on the training data. However, the method still lacks generality. We therefore plan to improve its generality by selecting a suitable combination of contraction method and feature for each feature in our method.
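The abstract above does not specify its contraction methods, so as a purely illustrative sketch, here is one common way to "contract" a transaction stream: summarize a sliding window of recent values into a few aggregate features that a classifier can consume alongside the original features. The function names and window size are assumptions, not the paper's method.

```python
import statistics

def contract_window(amounts):
    # Contract a window of transaction amounts into simple summary features.
    # (Illustrative aggregates only; the paper's contraction methods differ.)
    return {
        "count": len(amounts),
        "mean": statistics.mean(amounts),
        "max": max(amounts),
        "stdev": statistics.pstdev(amounts),
    }

def sliding_features(stream, window=5):
    # Yield contracted features over a sliding window of the stream,
    # so a model sees a fixed-size summary instead of the raw series.
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) > window:
            buf.pop(0)
        if len(buf) == window:
            yield contract_window(buf)

# A spike like 500 among small amounts shows up in the max/stdev features.
feats = list(sliding_features([10, 12, 500, 11, 9, 13, 14], window=5))
```

Windowed aggregates like these are one way such contracted features could expose fraud-like anomalies while keeping the per-example representation small.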
Cost-aware Generalized α-investing for Multiple Hypothesis Testing
We consider the problem of sequential multiple hypothesis testing with
nontrivial data collection costs. This problem appears, for example, when
conducting biological experiments to identify differentially expressed genes in
a disease process. This work builds on the generalized α-investing
framework, which enables control of the false discovery rate in a sequential
testing setting. We provide a theoretical analysis of the long-term asymptotic
behavior of α-wealth, which motivates a consideration of sample size in
the α-investing decision rule. Posing the testing process as a game with
nature, we construct a decision rule that optimizes the expected
α-wealth reward (ERO) and provides an optimal sample size for each test.
Empirical results show that a cost-aware ERO decision rule correctly rejects
more false null hypotheses than other methods when the sample size is fixed.
When the sample size is not fixed, cost-aware ERO uses a prior on the null
hypothesis to adaptively allocate the sample budget to each test. We extend
cost-aware ERO investing to finite-horizon testing, which enables the decision
rule to allocate samples in a non-myopic manner. Finally, empirical tests on
real data sets from biological experiments show that cost-aware ERO balances
the allocation of samples to an individual test against the allocation of
samples across multiple tests.
Comment: 26 pages, 5 figures, 8 tables
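To make the α-wealth idea behind this abstract concrete, here is a minimal sketch of plain α-investing in the Foster–Stine style, not the paper's cost-aware ERO rule: each test wagers a fraction of the current α-wealth, pays the wager on a non-rejection, and earns a payout on a rejection. The spending rule (half the current wealth) and payout value are assumptions chosen for illustration.

```python
def alpha_invest(p_values, w0=0.05, payout=0.05):
    # Minimal α-investing sketch (illustrative, not the cost-aware ERO rule):
    # spend part of the α-wealth on each test, earn a payout on rejection.
    wealth = w0
    rejections = []
    for j, p in enumerate(p_values):
        if wealth <= 0:
            break  # wealth exhausted: no further tests can be made
        alpha_j = wealth / 2          # simple spending rule (assumption)
        level = alpha_j / (1 + alpha_j)  # effective testing level
        if p <= level:
            rejections.append(j)
            wealth += payout          # reward for a rejection
        else:
            wealth -= alpha_j         # pay the wager on a non-rejection
    return rejections
```

A strong early rejection replenishes wealth and lets later tests be run at higher levels; a run of non-rejections drains wealth, which is exactly the dynamic whose long-term behavior the paper analyzes.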
Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream
In marketing we are often confronted with a continuous stream of responses to
marketing messages. Such streaming data provide invaluable information
regarding message effectiveness and segmentation. However, streaming data are
hard to analyze using conventional methods: their high volume and the fact that
they are continuously augmented means that it takes considerable time to
analyze them. We propose a method for estimating a finite mixture of logistic
regression models which can be used to cluster customers based on a continuous
stream of responses. This method, which we coin oFMLR, allows segments to be
identified in data streams or extremely large static datasets. Contrary to
black box algorithms, oFMLR provides model estimates that are directly
interpretable. We first introduce oFMLR, explaining in passing general topics
such as online estimation and the EM algorithm, making this paper a high-level
overview of possible methods of dealing with large data streams in marketing
practice. Next, we discuss model convergence, identifiability, and relations to
alternative Bayesian methods; we also identify more general issues that arise
from dealing with continuously augmented data sets. Finally, we introduce the
oFMLR [R] package and evaluate the method by numerical simulation and by
analyzing a large customer clickstream dataset.
Comment: 1 figure. Working paper including [R] package
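To illustrate the kind of online estimation this abstract refers to, here is a simplified single-observation update for a mixture of logistic regressions: an E-step computing each component's responsibility for the new observation, followed by a responsibility-weighted stochastic gradient M-step. This is a sketch of the general idea, not the oFMLR algorithm or its [R] implementation; the function name, learning rate, and mixing-proportion update are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_mixture_step(x, y, weights, mix, lr=0.1):
    # One online EM-style update for a mixture of logistic regressions
    # (a simplified sketch, not the exact oFMLR update).
    # weights: per-component coefficient vectors; mix: mixing proportions.

    # E-step: posterior responsibility of each component for (x, y)
    liks = []
    for w in weights:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        liks.append(p if y == 1 else 1 - p)
    total = sum(m * l for m, l in zip(mix, liks))
    resp = [m * l / total for m, l in zip(mix, liks)]

    # M-step: responsibility-weighted stochastic gradient ascent
    # on each component's logistic log-likelihood
    for k, w in enumerate(weights):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i in range(len(w)):
            w[i] += lr * resp[k] * (y - p) * x[i]

    # nudge mixing proportions toward the responsibilities
    for k in range(len(mix)):
        mix[k] += lr * (resp[k] - mix[k])
    return weights, mix

weights, mix = online_mixture_step([1.0], 1, [[0.0], [0.0]], [0.5, 0.5])
```

Because each response updates the model in constant time and is then discarded, an update of this shape is what makes estimation on a continuous, ever-growing stream feasible at all.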