
    GreedyDual-Join: Locality-Aware Buffer Management for Approximate Join Processing Over Data Streams

    We investigate adaptive buffer management techniques for approximate evaluation of sliding window joins over multiple data streams. In many applications, data stream processing systems have limited memory or must handle very high-speed data streams. In both cases, computing the exact join results between these streams may not be feasible, mainly because the buffers used to compute the joins hold far fewer tuples than the sliding windows themselves, so a stream buffer management policy is required. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ), an adaptive and locality-aware technique for managing these buffers. GDJ exploits temporal correlations (at both long and short time scales), which we found to be prevalent in many real data streams. Our algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report results of an experimental study using both synthetic and real-world data sets; they demonstrate the superiority and flexibility of our approach compared to other recently proposed techniques.
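
    A minimal sketch of how a GreedyDual-style replacement policy for a join buffer could look. The class and method names, the per-tuple `benefit` estimate, and the aging rule shown are illustrative assumptions, not the authors' actual GDJ formulation.

```python
import heapq

class GreedyDualJoinBuffer:
    """Illustrative GreedyDual-style buffer: tuples with the lowest
    (aging offset + expected join benefit) priority are evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0.0   # inflation value L, raised on each eviction
        self.heap = []      # (priority, key) min-heap, with lazy deletion
        self.live = {}      # key -> (tuple, current priority)

    def insert(self, key, tup, benefit):
        # benefit: estimated expected number of matches for this tuple,
        # e.g. derived from recently observed match rates (assumption).
        if len(self.live) >= self.capacity:
            self._evict()
        prio = self.offset + benefit
        self.live[key] = (tup, prio)
        heapq.heappush(self.heap, (prio, key))

    def _evict(self):
        while self.heap:
            prio, key = heapq.heappop(self.heap)
            if key in self.live and self.live[key][1] == prio:
                del self.live[key]
                self.offset = prio   # classic GreedyDual aging step
                return

    def touch(self, key, benefit):
        # On a successful match, refresh the tuple's priority.
        if key in self.live:
            tup, _ = self.live[key]
            prio = self.offset + benefit
            self.live[key] = (tup, prio)
            heapq.heappush(self.heap, (prio, key))
```

    The design choice illustrated is the standard GreedyDual idea: recently useful items keep high priorities, while the rising offset gradually ages out items that stop producing matches.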

    Emergence of good conduct, scaling and Zipf laws in human behavioral sequences in an online world

    We study behavioral action sequences of players in a massive multiplayer online game. In their virtual life, players use eight basic actions which allow them to interact with each other: communication, trade, establishing or breaking friendships and enmities, attack, and punishment. We measure the probabilities for these actions conditional on previously taken and received actions and find a dramatic increase of negative behavior immediately after receiving negative actions. Similarly, positive behavior is intensified by receiving positive actions. We observe a tendency towards anti-persistence in communication sequences. Classifying actions as positive (good) and negative (bad) allows us to define binary 'world lines' of the lives of individuals. Positive and negative actions are persistent and occur in clusters, indicated by large scaling exponents alpha ~ 0.87 of the mean square displacement of the world lines. For all eight action types we find strong signs of high levels of repetitiveness, especially for negative actions. We partition behavioral sequences into segments of length n (behavioral 'words' and 'motifs') and study their statistical properties. We find two approximate power laws in the word ranking distribution, one with an exponent of kappa-1 for ranks up to 100, and another with a lower exponent for higher ranks. The Shannon n-tuple redundancy yields large values and increases with word length, further underscoring the non-trivial statistical properties of behavioral sequences. On the collective, societal level the time series of particular actions per day can be understood by a simple mean-reverting log-normal model.
    Comment: 6 pages, 5 figures
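
    A small sketch of the world-line construction and the mean-square-displacement scaling fit described above. The +1/-1 action coding, the function names, and the convention that the log-log slope of the MSD is the reported exponent are my assumptions; the paper's exact convention may differ.

```python
import numpy as np

def world_line(actions):
    """Map a player's action sequence (+1 good, -1 bad) to a cumulative 'world line'."""
    return np.cumsum(actions)

def msd_exponent(action_sequences, max_lag=200):
    """Ensemble-averaged mean square displacement of world lines and its
    log-log slope. Whether this slope or half of it corresponds to the
    paper's alpha depends on their convention (assumption)."""
    lags = np.arange(1, max_lag + 1)
    msd = np.zeros(len(lags))
    for seq in action_sequences:
        x = world_line(np.asarray(seq, dtype=float))
        for i, tau in enumerate(lags):
            if tau < len(x):
                msd[i] += np.mean((x[tau:] - x[:-tau]) ** 2)
    msd /= len(action_sequences)
    slope, _ = np.polyfit(np.log(lags), np.log(msd), 1)
    return slope

# Toy usage with random +/-1 sequences (an uncorrelated walk gives MSD slope ~ 1):
rng = np.random.default_rng(0)
seqs = [rng.choice([-1, 1], size=1000) for _ in range(50)]
print(msd_exponent(seqs))
```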

    Generalized (m,k)-Zipf law for fractional Brownian motion-like time series with or without effect of an additional linear trend

    We have translated fractional Brownian motion (FBM) signals into a text based on two "letters", as if the signal fluctuations corresponded to a constant-stepsize random walk. We have applied the Zipf method to extract the ζ' exponent relating word frequency to word rank on a log-log plot. We have studied the variation of the Zipf exponent(s) giving the frequency of occurrence of words of length m < 8 made of these two letters: ζ' varies as a power law in terms of m. We have also examined how the ζ' exponent of the Zipf law is influenced by a linear trend and the effect of its slope. We can distinguish finite-size effects, and results depending on whether the starting FBM is persistent or not, i.e. depending on the FBM Hurst exponent H. It then seems numerically established that the Zipf exponent of a persistent signal is more influenced by the trend than that of an antipersistent signal. It appears that the conjectured law ζ' = |2H - 1| only holds near H = 0.5. We have also introduced considerations based on the notion of a time-dependent Zipf law along the signal.
    Comment: 24 pages, 12 figures; to appear in Int. J. Modern Phys
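
    A rough sketch of the translation-and-ranking procedure under my own assumptions: increments are binarized into two letters, overlapping words of length m are counted, and a Zipf exponent is read off a log-log rank-frequency fit. The binarization rule and function names are illustrative, and an ordinary random walk stands in for a true FBM generator.

```python
import numpy as np
from collections import Counter

def to_letters(signal):
    """Translate a signal into a two-letter text: 'u' if the increment is
    up (or zero), 'd' if it is down -- one plausible binarization (assumption)."""
    steps = np.diff(signal)
    return ''.join('u' if s >= 0 else 'd' for s in steps)

def zipf_exponent(text, m):
    """Rank-frequency Zipf exponent for overlapping words of length m."""
    words = [text[i:i + m] for i in range(len(text) - m + 1)]
    freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope   # exponent of the power-law decay of frequency with rank

# Toy usage on an ordinary random walk (H = 0.5) standing in for an FBM signal:
rng = np.random.default_rng(1)
signal = np.cumsum(rng.standard_normal(20000))
for m in range(2, 8):
    print(m, round(zipf_exponent(to_letters(signal), m), 3))
```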

    Maximum likelihood estimation for constrained parameters of multinomial distributions - Application to Zipf-Mandelbrot models

    A numerical maximum likelihood (ML) estimation procedure is developed for the constrained parameters of multinomial distributions. The main difficulty in computing the likelihood function is the precise and fast determination of the multinomial coefficients. To this end, the coefficients are rewritten as a telescoping product. The presented method is applied to the ML estimation of the Zipf-Mandelbrot (ZM) distribution, which provides a true model in many real-life cases. The examples discussed arise from ecological and medical observations. Based on the estimates, the hypothesis that the data are ZM distributed is tested using a chi-square test. The computer code of the presented procedure is available from the author on request.
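
    A generic sketch of a Zipf-Mandelbrot maximum-likelihood fit followed by a chi-square test, to make the pipeline concrete. It does not reproduce the paper's telescoping-product computation of the multinomial coefficients or its constraint handling; the `gammaln`-based coefficient, the starting values, the bounds, and all function names are my assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import chisquare

def zm_probs(n_categories, s, q):
    """Zipf-Mandelbrot probabilities p(k) proportional to 1/(k + q)^s."""
    k = np.arange(1, n_categories + 1)
    w = (k + q) ** (-s)
    return w / w.sum()

def neg_log_likelihood(params, counts):
    s, q = params
    p = zm_probs(len(counts), s, q)
    # The multinomial coefficient is constant in (s, q); include it via
    # gammaln so the reported likelihood value is complete.
    log_coeff = gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()
    return -(log_coeff + (counts * np.log(p)).sum())

def fit_zm(counts):
    res = minimize(neg_log_likelihood, x0=[1.0, 1.0], args=(counts,),
                   bounds=[(0.01, 10.0), (0.0, 100.0)])
    return res.x

# Toy usage: simulate ZM-distributed counts, refit, chi-square goodness of fit
# (ddof=2 accounts for the two estimated parameters).
rng = np.random.default_rng(2)
true_p = zm_probs(30, s=1.3, q=2.0)
counts = rng.multinomial(5000, true_p).astype(float)
s_hat, q_hat = fit_zm(counts)
print(s_hat, q_hat)
print(chisquare(counts, f_exp=counts.sum() * zm_probs(30, s_hat, q_hat), ddof=2))
```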

    Network traffic data analysis

    The desire to conceptualize network traffic in an operational communication network is common to many types of network research studies. In this research, real traffic traces collected over trans-Pacific backbone links (the MAWI repository, which provides publicly available anonymized traces) are analyzed to study the underlying traffic patterns. All data analysis and visualization is carried out using Matlab (Matlab is a trademark of The Mathworks, Inc.). At the packet level, we first measure parameters such as the distribution of packet lengths and the distribution of protocol types, and then fit analytical models to them. Next, the concept of a flow is introduced and flow-based analysis is carried out. We consider flow-related parameters such as top ports seen, flow duration, the distribution of flow lengths, and the number of flows under different timeout values, and provide analytical models to fit the flow lengths. Further, we study the amount of data flowing between source-destination pairs. Finally, we focus on TCP-specific aspects of the captured traces, such as retransmissions and packet round-trip times. From the results obtained, we infer the Zipf-type distribution of the number of flows, the heavy-tailedness of flow sizes, and the contribution of well-known ports at the packet and flow level. Our study helps a network analyst further their knowledge of the traffic and optimize network resources while performing efficient traffic engineering.
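
    The timeout-based flow grouping mentioned above can be sketched as follows. The paper's analysis was done in Matlab; this Python sketch, with a hypothetical packet-record format, only illustrates how packets are aggregated into 5-tuple flows that are split whenever the idle gap exceeds the timeout.

```python
from collections import defaultdict

def packets_to_flows(packets, timeout=60.0):
    """Group packets into flows keyed by the 5-tuple; a gap larger than
    `timeout` seconds starts a new flow. `packets` are dicts with keys
    'ts', 'src', 'dst', 'sport', 'dport', 'proto', 'length' (illustrative format)."""
    last_seen = {}
    flow_id = defaultdict(int)   # 5-tuple -> index of its current flow
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        if key in last_seen and pkt["ts"] - last_seen[key] > timeout:
            flow_id[key] += 1    # idle gap exceeded: start a new flow
        last_seen[key] = pkt["ts"]
        f = flows[(key, flow_id[key])]
        f["packets"] += 1
        f["bytes"] += pkt["length"]
    return flows

# Toy usage: the third packet arrives after the timeout and opens a second flow.
pkts = [
    {"ts": 0.0, "src": "a", "dst": "b", "sport": 1234, "dport": 80, "proto": 6, "length": 60},
    {"ts": 0.5, "src": "a", "dst": "b", "sport": 1234, "dport": 80, "proto": 6, "length": 1500},
    {"ts": 120.0, "src": "a", "dst": "b", "sport": 1234, "dport": 80, "proto": 6, "length": 60},
]
print(packets_to_flows(pkts))
```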

    The efficiency of individual optimization in the conditions of competitive growth

    The paper discusses the statistical properties of a multi-agent-based model of competitive growth. Each agent is described by a growth (or decay) rule for its virtual "mass", with the rate affected by interaction with other agents. The interaction depends on a strategy vector and the mutual distance between agents, and both are subject to the agent's individual optimization process. Steady-state simulations yield phase diagrams with high- and low-competition phases (HCP and LCP, respectively) separated by a critical point. Particular attention is paid to indicators of power-law behavior of the mass distributions near the critical regime. In this regime the study reveals a remarkable anomaly in the optimization efficiency.
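
    A toy illustration, not the authors' model, of the general kind of dynamics described: multiplicative growth of agent masses with rates suppressed by distance-dependent competitive pressure. The positions, the distance kernel, the noise term, and the parameter values are all assumptions made only to show the structure of such a simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(n_agents=200, steps=2000, base_rate=0.01, competition=0.5):
    """Toy competitive-growth dynamics (not the authors' model): each agent's
    'mass' grows multiplicatively, with its rate reduced by interaction with
    nearby agents; interaction strength decays with distance."""
    pos = rng.uniform(0, 1, size=(n_agents, 2))   # fixed positions (assumption)
    mass = np.ones(n_agents)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                # no self-interaction
    influence = 1.0 / (1.0 + dist)                # assumed distance kernel
    for _ in range(steps):
        pressure = influence @ mass               # competitive pressure on each agent
        rate = base_rate * (1.0 - competition * pressure / pressure.max())
        mass *= np.exp(rate + 0.01 * rng.standard_normal(n_agents))
    return mass

mass = simulate()
# Inspect the tail of the mass distribution, e.g. via a rank plot on log-log axes.
print(np.sort(mass)[-10:])
```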

    Two halves of a meaningful text are statistically different

    Which statistical features distinguish a meaningful text (possibly written in an unknown system) from a meaningless set of symbols? Here we answer this question by comparing features of the first half of a text to those of its second half. This comparison can uncover hidden effects, because the two halves share the same values of many parameters (style, genre, etc.). We found that the first half has more distinct words and more rare words than the second half. Also, words in the first half are distributed less homogeneously over the text, in the sense of the difference between the frequency and the inverse spatial period. These differences hold for a significant majority of the several hundred relatively short texts we studied. The statistical significance is confirmed via the Wilcoxon test. The differences disappear after a random permutation of words, which destroys the linear structure of the text. The differences reveal a temporal asymmetry in meaningful texts, which is confirmed by showing that texts are much better compressible in their natural order (i.e. along the narrative) than in word-inverted form. We conjecture that these results connect the semantic organization of a text (defined by the flow of its narrative) to its statistical features.
    Comment: 15 pages and 14 tables
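
    A small sketch, under my own assumptions, of how such a half-by-half comparison could be computed: distinct words and hapax legomena per half, plus compressed size in natural versus word-inverted order. Defining "rare word" as a hapax within a half and using zlib as the compressor are stand-in choices; the paper's exact definitions are not reproduced here.

```python
import zlib

def half_statistics(text):
    """Compare the first and second half of a text: number of distinct words,
    number of hapax legomena (words occurring once within the half), and the
    compressed size of the whole text in natural vs word-reversed order."""
    words = text.lower().split()
    mid = len(words) // 2
    halves = [words[:mid], words[mid:]]
    stats = []
    for h in halves:
        counts = {}
        for w in h:
            counts[w] = counts.get(w, 0) + 1
        stats.append({
            "distinct": len(counts),
            "hapax": sum(1 for c in counts.values() if c == 1),
        })
    natural = len(zlib.compress(" ".join(words).encode()))
    inverted = len(zlib.compress(" ".join(reversed(words)).encode()))
    return stats, natural, inverted

# Toy usage (real comparisons would use full-length texts):
sample = "the quick brown fox jumps over the lazy dog " * 20
print(half_statistics(sample))
```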