
    Parameter estimation for sums of correlated gamma random variables. Application to anomaly detection in Internet Traffic

    A new family of distributions, constructed by summing correlated gamma random variables, is studied. First, a simple closed-form expression for their density is derived. Second, the three parameters characterizing such a density are estimated using the maximum likelihood (ML) principle. Numerical simulations are conducted to compare the performance of the ML estimator against that of the conventional method-of-moments estimator. Finally, a multiresolution multivariate gamma-based modeling of Internet traffic illustrates the potential interest of the proposed distributions for the detection of anomalies. Aggregated time series of IP packet counts are split into adjacent non-overlapping time blocks. The distributions of these series are modeled by the proposed multivariate gamma-based distributions over a collection of different aggregation levels. The anomaly detection strategy is based on tracking changes over time in the corresponding multiresolution parameters.
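
The ML-versus-moments comparison can be illustrated on ordinary univariate gamma samples; the sketch below, using NumPy and SciPy, is only that kind of illustration and does not reproduce the paper's multivariate correlated-gamma density or its multiresolution traffic model. The sample size and parameter values are arbitrary assumptions.

```python
# Sketch: compare maximum-likelihood and method-of-moments estimators of
# gamma parameters on synthetic data. Univariate illustration only; the
# paper's multivariate correlated-gamma density is not reproduced here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
shape_true, scale_true = 2.5, 1.8          # hypothetical parameter values
x = rng.gamma(shape_true, scale_true, size=5000)

# Maximum-likelihood fit (location fixed at 0, as for count-like data)
shape_ml, _, scale_ml = stats.gamma.fit(x, floc=0)

# Method-of-moments fit: shape = mean^2 / var, scale = var / mean
m, v = x.mean(), x.var()
shape_mom, scale_mom = m**2 / v, v / m

print(f"ML:      shape={shape_ml:.3f}  scale={scale_ml:.3f}")
print(f"Moments: shape={shape_mom:.3f}  scale={scale_mom:.3f}")
```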

    Adaptive algorithms for identifying large flows in IP traffic

    We propose in this paper an on-line algorithm based on Bloom filters for identifying large flows in IP traffic (a.k.a. elephants). Because of the large number of small flows, the hash tables of these algorithms have to be regularly refreshed. Recognizing that the periodic erasure scheme usually used in the technical literature turns out to be quite inefficient when using real traffic traces over a long period of time, we introduce a simple adaptive scheme that closely follows the variations of traffic. When tested against real traffic traces, the proposed on-line algorithm performs well in the sense that the detection ratio of long flows over a long time period is quite high. Beyond the identification of elephants, the same class of algorithms is applied to the closely related problem of detecting anomalies in IP traffic, e.g., SYN floods caused by attacks. An algorithm for detecting SYN and volume flood anomalies in Internet traffic is designed. Experiments show that an anomaly is detected in less than one minute and the targeted destinations are identified at the same time.
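
A minimal sketch of the general idea, not the paper's algorithm: a counting Bloom filter accumulates per-flow packet counts, flags a flow as an elephant once its minimum counter crosses a threshold, and shrinks its counters when the filter becomes too full rather than erasing them on a fixed schedule. The class name, sizes, thresholds and the halving rule are illustrative assumptions.

```python
# Illustrative counting-Bloom-filter elephant detector with an adaptive
# refresh rule (counters are halved when the filter fills up, instead of
# being erased on a fixed schedule). Parameters are arbitrary assumptions.
import hashlib

class ElephantDetector:
    def __init__(self, size=1 << 16, num_hashes=3,
                 elephant_threshold=20, max_fill_ratio=0.5):
        self.size = size
        self.num_hashes = num_hashes
        self.threshold = elephant_threshold
        self.max_fill_ratio = max_fill_ratio
        self.counters = [0] * size
        self.nonzero = 0                      # tracked to avoid full scans

    def _indexes(self, flow_id: str):
        # Derive num_hashes counter positions from salted BLAKE2b digests.
        for salt in range(self.num_hashes):
            h = hashlib.blake2b(f"{salt}:{flow_id}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.size

    def add_packet(self, flow_id: str) -> bool:
        """Record one packet; return True if the flow looks like an elephant."""
        idx = list(self._indexes(flow_id))
        for j in idx:
            if self.counters[j] == 0:
                self.nonzero += 1
            self.counters[j] += 1
        estimate = min(self.counters[j] for j in idx)

        # Adaptive refresh: shrink all counters when the filter gets too full.
        if self.nonzero / self.size > self.max_fill_ratio:
            self.counters = [c // 2 for c in self.counters]
            self.nonzero = sum(1 for c in self.counters if c > 0)

        return estimate >= self.threshold

# Example: a flow sending many packets is eventually reported.
det = ElephantDetector()
for _ in range(25):
    is_elephant = det.add_packet("10.0.0.1:443->10.0.0.2:5555/tcp")
print("elephant detected:", is_elephant)
```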

    Outlier Identification in Spatio-Temporal Processes

    This dissertation answers some of the statistical challenges arising in spatio-temporal data from Internet traffic, electricity grids and climate models. It begins with methodological contributions to the problem of anomaly detection in communication networks. Using electricity consumption patterns for the University of Michigan campus, the well-known spatial prediction method kriging has been adapted to identify false data injections into the system. Events like Distributed Denial of Service (DDoS), botnet/malware attacks and port scanning call for methods that can identify unusual activity in Internet traffic patterns. Storing information on the entire network, though feasible, cannot be done at the time scale at which data arrives. In this work, hashing techniques that produce summary statistics for the network have been used. The hashed data so obtained preserves the heavy-tailed nature of traffic payloads, thereby providing a platform for the application of extreme value theory (EVT) to identify heavy hitters in volumetric attacks. These EVT-based methods require the estimation of the tail index of a heavy-tailed distribution. The traditional estimator (Hill, 1975) of the tail index tends to be biased in the presence of outliers. To circumvent this issue, a trimmed version of the classic Hill estimator has been proposed and studied from a theoretical perspective. For the Pareto domain of attraction, the optimality and asymptotic normality of the estimator have been established. Additionally, a data-driven strategy to detect the number of extreme outliers in heavy-tailed data has also been presented. The dissertation concludes with the statistical formulation of m-year return levels of extreme climatic events (heat/cold waves). The Generalized Pareto distribution (GPD) serves as a good fit for modeling peaks over threshold of a distribution. Allowing the parameters of the GPD to vary as a function of covariates such as time of the year, El Niño and location in the US, extremes of the areal impact of heat waves have been well modeled and inferred. PhD dissertation, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145789/1/shrijita_1.pd
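
The effect of trimming can be shown on a toy example. The sketch below implements the classic Hill estimator and a naive trimmed variant that simply discards the k0 largest observations before estimation; this is not the optimally weighted trimmed Hill estimator developed in the dissertation. The sample sizes, the choice of k and the injected outliers are arbitrary assumptions.

```python
# Sketch: classic Hill estimator of the tail index plus a naive "trimmed"
# variant. The dissertation's trimmed Hill estimator uses a specific optimal
# weighting that is NOT reproduced here; this only illustrates why trimming
# helps when the largest order statistics are contaminated by outliers.
import numpy as np

def hill(x, k):
    """Classic Hill estimator (estimates 1/alpha) from the k largest points."""
    xs = np.sort(x)[::-1]                    # descending order statistics
    return np.mean(np.log(xs[:k] / xs[k]))

def trimmed_hill(x, k, k0):
    """Naive trimmed variant: drop the k0 largest points, then apply Hill."""
    xs = np.sort(x)[::-1][k0:]
    return np.mean(np.log(xs[:k] / xs[k]))

rng = np.random.default_rng(1)
alpha = 2.0
clean = rng.pareto(alpha, size=10_000) + 1.0              # Pareto(alpha) sample
contaminated = np.concatenate([clean, [1e6, 5e5, 2e5]])   # a few huge outliers

k = 200
print("tail index, clean data:        %.2f" % (1 / hill(clean, k)))
print("tail index, with outliers:     %.2f" % (1 / hill(contaminated, k)))
print("trimmed (k0=3), with outliers: %.2f" % (1 / trimmed_hill(contaminated, k, k0=3)))
```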

    The 8th International Conference on Time Series and Forecasting

    The aim of ITISE 2022 is to create a friendly environment that could lead to the establishment or strengthening of scientific collaborations and exchanges among attendees. Therefore, ITISE 2022 is soliciting high-quality original research papers (including significant works-in-progress) on any aspect of time series analysis and forecasting, in order to motivate the generation and use of new knowledge, computational techniques and methods for forecasting in a wide range of fields.

    Exploring the topical structure of short text through probability models : from tasks to fundamentals

    Recent technological advances have radically changed the way we communicate. Today's communication has become ubiquitous and has fostered the need for information that is easier to create, spread and consume. As a consequence, we have experienced the shortening of text messages in mediums ranging from electronic mail and instant messaging to microblogging. Moreover, the ubiquity and fast-paced nature of these mediums have promoted their use for previously unthinkable tasks. For instance, reporting real-world events was classically carried out by news reporters, but nowadays most interesting events are first disclosed on social networks like Twitter by eyewitnesses through short text messages. As a result, the exploitation of the thematic content of short text has captured the interest of both research and industry. Topic models are a type of probability model that has traditionally been used to explore this thematic content, a.k.a. topics, in regular text. The most popular topic models fall into the sub-class of LVMs (Latent Variable Models), which include several latent variables at the corpus, document and word levels to summarise the topics at each level. However, classical LVM-based topic models struggle to learn semantically meaningful topics in short text because the lack of co-occurring words within a document hampers the estimation of the local latent variables at the document level. To overcome this limitation, pooling and hierarchical Bayesian strategies that leverage contextual information have been essential to improve the quality of topics in short text. In this thesis, we study the problem of learning semantically meaningful and predictive representations of text in two distinct phases:
    • In the first phase, Part I, we investigate the use of LVM-based topic models for the specific task of event detection in Twitter. In this situation, the use of contextual information to pool tweets together comes naturally. Thus, we first extend an existing clustering algorithm for event detection to use the topics learned from pooled tweets. Then, we propose a probability model that integrates topic modelling and clustering to enable the flow of information between both components.
    • In the second phase, Part II and Part III, we challenge the use of local latent variables in LVMs, especially when the context of short messages is not available. First of all, we study the evaluation of the generalization capabilities of LVMs like PFA (Poisson Factor Analysis) and propose unbiased estimation methods to approximate it. With the most accurate method, we compare the generalization of chordal models without latent variables to that of PFA topic models in short and regular text collections.
    In summary, we demonstrate that by integrating clustering and topic modelling, the performance of event detection techniques in Twitter is improved thanks to the interaction between both components. Moreover, we develop several unbiased likelihood estimation methods for assessing the generalization of PFA and we empirically validate their accuracy on different document collections. Finally, we show that we can learn chordal models without latent variables in text through Chordalysis, and that they can be a competitive alternative to classical topic models, especially in short text.
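
As an illustration of the pooling idea only (not of the thesis's PFA or chordal models), the sketch below groups a few hypothetical tweets by hashtag into longer pseudo-documents and fits a plain LDA model with scikit-learn; LDA is used here purely as an accessible stand-in for an LVM-based topic model, and the example texts are invented.

```python
# Sketch: hashtag pooling of short messages before topic modelling.
# Plain LDA is used as a stand-in; the thesis works with PFA and chordal
# models, which are not reproduced here.
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical (tweet, hashtag) pairs; the hashtag acts as pooling context.
tweets = [
    ("earthquake felt downtown, buildings shaking", "#quake"),
    ("strong tremor reported near the coast", "#quake"),
    ("new phone camera is amazing in low light", "#gadgets"),
    ("battery life on this phone is impressive", "#gadgets"),
]

# Pool tweets sharing a hashtag into one longer pseudo-document.
pools = defaultdict(list)
for text, tag in tweets:
    pools[tag].append(text)
docs = [" ".join(texts) for texts in pools.values()]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = comp.argsort()[::-1][:4]
    print(f"topic {t}:", ", ".join(terms[i] for i in top))
```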

    PREDICTING INTERNET TRAFFIC BURSTS USING EXTREME VALUE THEORY

    Computer networks play an important role in the life of today's organizations and people. These interconnected devices share a common medium and tend to compete for it. Quality of Service (QoS) comes into play to define what level of service users get, so accurately defining QoS metrics is important. Bursts and serious deteriorations are omnipresent in the Internet and are considered an important aspect of it. This thesis examines bursts and serious deteriorations in Internet traffic and applies Extreme Value Theory (EVT) to their prediction and modelling. EVT itself is a field of statistics that has been applied in fields like hydrology and finance, with only a recent introduction to the field of telecommunications. Model fitting is based on real traces from the Bellcore laboratory along with some simulated traces based on fractional Gaussian noise and linear fractional alpha-stable motion. QoS traces from the University of Napoli are also used in the prediction stage. Three methods from EVT are successfully applied to the burst prediction problem: the Block Maxima (BM) method, the Peaks Over Threshold (POT) method, and the R-Largest Order Statistics (RLOS) method. Bursts in Internet traffic are predicted using these three methods, and a clear methodology is developed for the burst prediction problem. New QoS metrics are suggested based on Return Level and Return Period, so that robust QoS metrics can be defined. In turn, a superior QoS will be obtained that would support mission-critical applications.
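
The peaks-over-threshold step can be sketched with SciPy as below: a Generalized Pareto distribution is fitted to exceedances over a high threshold, and a return level is computed from the standard POT return-level formula. The synthetic heavy-tailed sample, threshold choice and return period are arbitrary assumptions standing in for the real traces used in the thesis.

```python
# Sketch of the peaks-over-threshold (POT) approach: fit a Generalized Pareto
# distribution (GPD) to exceedances over a high threshold and compute a
# return level. Synthetic heavy-tailed data stands in for real traffic traces.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
traffic = rng.pareto(1.8, size=50_000) + 1.0     # stand-in for byte counts

u = np.quantile(traffic, 0.99)                   # high threshold
exceed = traffic[traffic > u] - u                # excesses over the threshold
xi, _, sigma = stats.genpareto.fit(exceed, floc=0)

# m-observation return level: the burst size exceeded on average once every
# m observations, with zeta_u = P(X > u) estimated empirically.
zeta_u = exceed.size / traffic.size
m = 10_000
if abs(xi) > 1e-6:
    return_level = u + (sigma / xi) * ((m * zeta_u) ** xi - 1.0)
else:
    return_level = u + sigma * np.log(m * zeta_u)

print(f"threshold u = {u:.2f}, xi = {xi:.3f}, sigma = {sigma:.3f}")
print(f"{m}-observation return level ≈ {return_level:.2f}")
```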

    Fluorescence-based high-resolution tracking of nanoparticles
