238 research outputs found
GreedyDual-Join: Locality-Aware Buffer Management for Approximate Join Processing Over Data Streams
We investigate adaptive buffer management techniques for approximate evaluation of sliding window joins over multiple data streams. In many applications, data stream processing systems have limited memory or have to deal with very high speed data streams. In both cases, computing the exact results of joins between these streams may not be feasible, mainly because the buffers used to compute the joins contain far fewer tuples than the sliding windows do. A stream buffer management policy is therefore needed. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ), an adaptive and locality-aware buffering technique for managing these buffers. GDJ exploits the temporal correlations (at both long and short time scales), which we found to be prevalent in many real data streams. We note that our algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report results of an experimental study using both synthetic and real-world data sets. Our results demonstrate the superiority and flexibility of our approach compared with other recently proposed techniques.
Emergence of good conduct, scaling and Zipf laws in human behavioral sequences in an online world
We study behavioral action sequences of players in a massive multiplayer
online game. In their virtual life players use eight basic actions which allow
them to interact with each other. These actions are communication, trade,
establishing or breaking friendships and enmities, attack, and punishment. We
measure the probabilities for these actions conditional on previously taken and
received actions and find a dramatic increase of negative behavior immediately
after receiving negative actions. Similarly, positive behavior is intensified
by receiving positive actions. We observe a tendency towards anti-persistence
in communication sequences. Classifying actions as positive (good) and negative
(bad) allows us to define binary 'world lines' of lives of individuals.
Positive and negative actions are persistent and occur in clusters, indicated
by large scaling exponents alpha~0.87 of the mean square displacement of the
world lines. For all eight action types we find strong signs for high levels of
repetitiveness, especially for negative actions. We partition behavioral
sequences into segments of length n (behavioral 'words' and 'motifs') and study
their statistical properties. We find two approximate power laws in the word
ranking distribution, one with an exponent of kappa-1 for the ranks up to 100,
and another with a lower exponent for higher ranks. The Shannon n-tuple
redundancy yields large values and increases in terms of word length, further
underscoring the non-trivial statistical properties of behavioral sequences. On
the collective, societal level the timeseries of particular actions per day can
be understood by a simple mean-reverting log-normal model. Comment: 6 pages, 5 figures
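The segmentation of action sequences into behavioral words can be illustrated with a minimal Python sketch. It assumes non-overlapping segments and single-letter action codes; the function name `ranked_words` and the sample sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter

def ranked_words(sequence, n):
    """Split a sequence into non-overlapping words of length n
    and return (word, count) pairs ordered from most to least frequent."""
    words = [tuple(sequence[i:i + n])
             for i in range(0, len(sequence) - n + 1, n)]
    return Counter(words).most_common()

# Hypothetical action codes: C = communication, T = trade, A = attack.
actions = list("CCTCCACCTCCA")
ranking = ranked_words(actions, 3)
```

Plotting the counts in `ranking` against their rank on log-log axes would give the kind of word ranking distribution the abstract refers to.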
Generalized (m,k)-Zipf law for fractional Brownian motion-like time series with or without effect of an additional linear trend
We have translated fractional Brownian motion (FBM) signals into a text based
on two "letters", as if the signal fluctuations correspond to a constant
stepsize random walk. We have applied the Zipf method to extract the
exponent relating the word frequency and its rank on a log-log plot. We have
studied the variation of the Zipf exponent(s) giving the relationship between
the frequency of occurrence of words of length made of such two letters:
is varying as a power law in terms of . We have also searched how
the exponent of the Zipf law is influenced by a linear trend and the
resulting effect of its slope. We can distinguish finite size effects, and
results depending on whether the starting FBM is persistent or not, i.e. depending
on the FBM Hurst exponent. It thus seems numerically proven that the Zipf
exponent of a persistent signal is more influenced by the trend than that of an
antipersistent signal. It appears that the conjectured law
only holds near . We have also introduced considerations based on the
notion of a time dependent Zipf law along the signal. Comment: 24 pages, 12 figures; to appear in Int. J. Modern Phys
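The translation of a signal into a two-letter text and the rank-frequency fit can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: words are read with a sliding window, a step that does not rise is coded as "d", and the exponent is taken from an ordinary least-squares line on the log-log rank-frequency data. The function names are hypothetical.

```python
import math
from collections import Counter

def to_two_letter_text(signal):
    """Code each step of the signal as 'u' (up) or 'd' (down or flat)."""
    return "".join("u" if b > a else "d" for a, b in zip(signal, signal[1:]))

def zipf_exponent(text, m):
    """Estimate the Zipf exponent for words of length m as minus the
    least-squares slope of log(frequency) against log(rank)."""
    counts = Counter(text[i:i + m] for i in range(len(text) - m + 1))
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope

text = to_two_letter_text([0, 1, 0, 1, 2])
```

Repeating the fit for several word lengths on a long series would show how the estimated exponent changes with word length.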
Maximum likelihood estimation for constrained parameters of multinomial distributions - Application to Zipf-Mandelbrot models
A numerical maximum likelihood (ML) estimation procedure is developed for the constrained parameters of multinomial distributions. The main difficulty involved in computing the likelihood function is the precise and fast determination of the multinomial coefficients. For this purpose the coefficients are rewritten as a telescopic product. The presented method is applied to the ML estimation of the Zipf–Mandelbrot (ZM) distribution, which provides a true model in many real-life cases. The examples discussed arise from ecological and medical observations. Based on the estimates, the hypothesis that the data are ZM distributed is tested using a chi-square test. The computer code of the presented procedure is available from the author on request.
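One way to picture the idea of a multinomial coefficient written as a product is the sketch below, which builds the coefficient from successive binomial factors. Whether this matches the telescopic product used in the paper is an assumption. The Zipf-Mandelbrot form, with probabilities proportional to (r + q)^(-s) for ranks r = 1..k, is the usual parametrization and is likewise assumed here; all function names are hypothetical.

```python
import math

def multinomial_coefficient(counts):
    """N! / (n_1! ... n_k!) built as a product of binomial coefficients."""
    total = 0
    coeff = 1
    for n in counts:
        total += n
        coeff *= math.comb(total, n)  # choose positions for the next category
    return coeff

def zm_probabilities(k, q, s):
    """Zipf-Mandelbrot probabilities for ranks 1..k (assumed parametrization)."""
    weights = [(r + q) ** (-s) for r in range(1, k + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

def zm_log_likelihood(counts, q, s):
    """Multinomial log-likelihood of rank-ordered counts under the ZM model."""
    probs = zm_probabilities(len(counts), q, s)
    log_coeff = math.log(multinomial_coefficient(counts))
    return log_coeff + sum(n * math.log(p) for n, p in zip(counts, probs))
```

A numerical optimizer could then maximize `zm_log_likelihood` over q and s subject to whatever parameter constraints apply.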
Network traffic data analysis
Characterizing the traffic in an existing communication network is a component of many types of network research studies. In this research, real traffic traces collected over trans-Pacific backbone links (the MAWI repository, providing publicly available anonymized traces) are analyzed to study the underlying traffic patterns. All data analysis and visualization is carried out using Matlab (Matlab is a trademark of The Mathworks, Inc.). At the packet level, we first measure parameters such as the distribution of packet lengths and the distribution of protocol types, and then fit analytical models. Next, the concept of a flow is introduced and flow-based analysis is carried out. We consider flow-related parameters such as the top ports seen, the duration of the flow, the distribution of flow lengths, and the number of flows with different timeout values, and provide analytical models to fit the flow lengths. Further, we study the amount of data flowing between source-destination pairs. Finally, we focus on TCP-specific aspects of the captured traces such as retransmissions and packet round-trip times. From the results obtained, we infer the Zipf-type nature of the distribution of the number of flows, the heavy-tailedness of flow sizes, and the contribution of well-known ports at the packet and flow level. Our study helps a network analyst to further their knowledge of the network and to optimize network resources while performing efficient traffic engineering.
The efficiency of individual optimization in the conditions of competitive growth
The paper discusses statistical properties of a multi-agent based
model of competitive growth. Each agent is described by a growth (or
decay) rule for its virtual "mass", with the rate affected by the interaction
with other agents. The interaction depends on the strategy vector and the mutual
distance between agents, both of which are subject to the agent's individual
optimization process. Steady-state simulations yield phase diagrams with
high and low competition phases (HCP and LCP, respectively) separated by a
critical point. Particular attention is paid to the indicators of
power-law behavior of the mass distributions with respect to the critical
regime. In this regime the study reveals a remarkable anomaly in the
optimization efficiency.
Two halves of a meaningful text are statistically different
Which statistical features distinguish a meaningful text (possibly written in
an unknown system) from a meaningless set of symbols? Here we answer this
question by comparing features of the first half of a text to its second half.
This comparison can uncover hidden effects, because the halves have the same
values of many parameters (style, genre, etc.). We found that the first
half has more different words and more rare words than the second half. Also,
words in the first half are distributed less homogeneously over the text in the
sense of the difference between the frequency and the inverse spatial
period. These differences hold for the significant majority of several hundred
relatively short texts we studied. The statistical significance is confirmed
via the Wilcoxon test. Differences disappear after random permutation of words
that destroys the linear structure of the text. The differences reveal a
temporal asymmetry in meaningful texts, which is confirmed by showing that
texts compress much better in their natural order (i.e. along the
narrative) than in the word-inverted form. We conjecture that these results
connect the semantic organization of a text (defined by the flow of its
narrative) to its statistical features. Comment: 15 pages and 14 tables
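The comparison of the two halves can be illustrated with a small Python sketch that counts distinct and rare words in each half. It assumes whitespace tokenization and takes "rare" to mean a word occurring exactly once within its half; both choices, along with the function name and the toy sentence, are illustrative and not the paper's definitions.

```python
from collections import Counter

def half_statistics(words):
    """Return a (distinct, rare) pair of word counts for each half of a text."""
    mid = len(words) // 2
    stats = []
    for half in (words[:mid], words[mid:]):
        counts = Counter(half)
        distinct = len(counts)
        rare = sum(1 for c in counts.values() if c == 1)  # occurs once in this half
        stats.append((distinct, rare))
    return stats

text = "the cat saw a dog and the dog saw the cat"
first, second = half_statistics(text.split())
```

Applying the same function to a randomly permuted word list would give a baseline in which the linear structure of the text has been destroyed.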