
    Analyzing the Social Structure and Dynamics of E-mail and Spam in Massive Backbone Internet Traffic

    E-mail is probably the most popular application on the Internet, with everyday business and personal communications dependent on it. Spam, or unsolicited e-mail, has been estimated to cost businesses significant amounts of money. However, our understanding of the network-level behavior of legitimate e-mail traffic, and of how it differs from spam traffic, is limited. In this study, we passively captured SMTP packets from a 10 Gbit/s Internet backbone link to construct a social network of e-mail users based on the e-mails they exchanged. The focus of this paper is on the graph metrics that indicate various structural properties of e-mail networks and on how these metrics evolve over time. The study also examines the differences in the structural and temporal characteristics of spam and non-spam networks. Our analysis of the collected data shows several differences between the behavior of spam and legitimate e-mail traffic. These differences can help us understand the behavior of spammers and statistically model spam traffic at the network level, in order to complement current spam detection techniques. Comment: 15 pages, 20 figures, technical report
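
As a rough illustration of the graph construction described in this abstract, the sketch below builds a directed graph from (sender, recipient) pairs and computes a few basic structural measures. The abstract does not say which graph metrics the authors use, so in-degree, out-degree, and density are assumptions chosen for illustration. The addresses and messages are toy data invented here, and the function names are hypothetical.

```python
from collections import defaultdict


def build_email_graph(messages):
    """Build a directed graph with one edge per distinct (sender, recipient) pair."""
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for sender, recipient in messages:
        out_edges[sender].add(recipient)
        in_edges[recipient].add(sender)
    nodes = set(out_edges) | set(in_edges)
    return nodes, out_edges, in_edges


def degree_summary(nodes, out_edges, in_edges):
    """Return out-degree and in-degree per node, plus the directed graph density."""
    out_degree = {n: len(out_edges.get(n, ())) for n in nodes}
    in_degree = {n: len(in_edges.get(n, ())) for n in nodes}
    edge_count = sum(out_degree.values())
    n = len(nodes)
    # Density: fraction of the n * (n - 1) possible directed edges that exist.
    density = edge_count / (n * (n - 1)) if n > 1 else 0.0
    return out_degree, in_degree, density


# Toy (sender, recipient) pairs; all addresses are made up for this sketch.
messages = [
    ("alice@example.org", "bob@example.org"),
    ("bob@example.org", "alice@example.org"),
    ("alice@example.org", "carol@example.org"),
    ("bulk@example.net", "alice@example.org"),
    ("bulk@example.net", "bob@example.org"),
    ("bulk@example.net", "carol@example.org"),
    ("alice@example.org", "bob@example.org"),  # repeated pair, stored once
]

nodes, out_edges, in_edges = build_email_graph(messages)
out_degree, in_degree, density = degree_summary(nodes, out_edges, in_edges)
```

Recomputing such measures over successive time windows of captured traffic would be one way to follow how they evolve, which is the temporal aspect the abstract mentions.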

    Computationally Efficient Nonparametric Importance Sampling

    The variance reduction achieved by importance sampling depends strongly on the choice of the importance sampling distribution. A good choice is often hard to achieve, especially for high-dimensional integration problems. Nonparametric estimation of the optimal importance sampling distribution (known as nonparametric importance sampling) is a reasonable alternative to parametric approaches. In this article, nonparametric variants of both the self-normalized and the unnormalized importance sampling estimator are proposed and investigated. A common criticism of nonparametric importance sampling is its increased computational burden compared to parametric methods. We solve this problem to a large degree by using the linear blend frequency polygon estimator instead of a kernel estimator. Mean square error convergence properties are investigated, leading to recommendations for the efficient application of nonparametric importance sampling. In particular, we show that nonparametric importance sampling asymptotically attains the optimal importance sampling variance. The efficiency of nonparametric importance sampling algorithms relies heavily on the computational efficiency of the employed nonparametric estimator. The linear blend frequency polygon outperforms kernel estimators on certain criteria, such as efficient sampling and evaluation. Furthermore, it is compatible with the inversion method for sample generation, which allows our algorithms to be combined with other variance reduction techniques such as stratified sampling. Empirical evidence for the usefulness of the suggested algorithms is obtained from three benchmark integration problems. As an application, we estimate the distribution of the queue length of a spam filter queueing system based on real data. Comment: 29 pages, 7 figures
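
To make the two estimators named in this abstract concrete, here is a minimal sketch of unnormalized and self-normalized importance sampling. It deliberately uses a fixed parametric proposal (a wider normal distribution), not the linear blend frequency polygon estimator the article proposes. The target, the proposal, the integrand, and all function names are assumptions chosen for illustration.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling(f, n, proposal_std=2.0, seed=0):
    """Estimate E_p[f(X)] for p = N(0, 1) using draws from q = N(0, proposal_std^2).

    Returns (unnormalized estimate, self-normalized estimate).
    """
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_sum = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, proposal_std)
        # Importance weight: target density over proposal density.
        w = normal_pdf(x) / normal_pdf(x, 0.0, proposal_std)
        weighted_sum += w * f(x)
        weight_sum += w
    unnormalized = weighted_sum / n              # divides by the sample size
    self_normalized = weighted_sum / weight_sum  # divides by the sum of weights
    return unnormalized, self_normalized


def square(x):
    return x * x


# E[X^2] under N(0, 1) is exactly 1, so both estimates should be close to 1.
unnorm, selfnorm = importance_sampling(square, 20000)
```

The self-normalized form only needs the target density up to a constant factor, since that constant cancels in the ratio; the unnormalized form requires the fully normalized density. In both cases the quality of the estimate depends on how well the proposal matches the integrand, which is the choice the article addresses with a nonparametric estimate.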

    Estimating labels from label proportions

    Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.
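
The problem setup can be illustrated with a deliberately simple moment-based sketch; it is not claimed to be the estimator presented in the paper. It assumes binary labels, one-dimensional features, and two training sets ("bags") with different known positive fractions. Each bag mean is a mixture of the two class means, so the class means can be recovered by solving a 2x2 linear system. The test set is then labelled using its own known proportion. All names and the synthetic data are hypothetical.

```python
import random


def class_means_from_bags(bag_a, pi_a, bag_b, pi_b):
    """Recover the two class means from two bags with known positive fractions.

    Uses mean(bag) = pi * mu_pos + (1 - pi) * mu_neg for each bag.
    """
    mean_a = sum(bag_a) / len(bag_a)
    mean_b = sum(bag_b) / len(bag_b)
    det = pi_a - pi_b  # non-zero only if the bags have different proportions
    mu_pos = (mean_a * (1 - pi_b) - mean_b * (1 - pi_a)) / det
    mu_neg = (pi_a * mean_b - pi_b * mean_a) / det
    return mu_pos, mu_neg


def label_with_proportion(points, pi, mu_pos, mu_neg):
    """Label as positive the round(pi * n) points that look most positive."""
    n_pos = round(pi * len(points))

    def score(i):
        # Larger when the point is nearer the positive mean than the negative one.
        return abs(points[i] - mu_neg) - abs(points[i] - mu_pos)

    order = sorted(range(len(points)), key=score, reverse=True)
    labels = [0] * len(points)
    for i in order[:n_pos]:
        labels[i] = 1
    return labels


def make_bag(rng, n, pi):
    """Synthetic bag: positives drawn from N(2, 0.5^2), negatives from N(-2, 0.5^2)."""
    n_pos = round(pi * n)
    pairs = [(rng.gauss(2.0, 0.5), 1) for _ in range(n_pos)]
    pairs += [(rng.gauss(-2.0, 0.5), 0) for _ in range(n - n_pos)]
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


rng = random.Random(0)
bag_a, _ = make_bag(rng, 200, 0.8)  # training labels are discarded
bag_b, _ = make_bag(rng, 200, 0.3)
test_x, test_y = make_bag(rng, 100, 0.5)  # true labels kept only for checking

mu_pos, mu_neg = class_means_from_bags(bag_a, 0.8, bag_b, 0.3)
predicted = label_with_proportion(test_x, 0.5, mu_pos, mu_neg)
accuracy = sum(p == y for p, y in zip(predicted, test_y)) / len(test_y)
```

The sketch works here because the synthetic classes are well separated; it says nothing about the consistency guarantees the abstract refers to.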