Gap bootstrap methods for massive data sets with an application to transportation engineering
In this paper we describe two bootstrap methods for massive data sets. Naive
applications of common resampling methodology are often impractical for massive
data sets due to computational burden and due to complex patterns of
inhomogeneity. In contrast, the proposed methods exploit certain structural
properties of a large class of massive data sets to break up the original
problem into a set of simpler subproblems, solve each subproblem separately
where the data exhibit approximate uniformity and where computational
complexity can be reduced to a manageable level, and then combine the results
through certain analytical considerations. The validity of the proposed methods
is proved and their finite sample properties are studied through a moderately
large simulation study. The methodology is illustrated with a real data example
from Transportation Engineering, which motivated the development of the
proposed methods.

Comment: Published at http://dx.doi.org/10.1214/12-AOAS587 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
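The combination step in the paper is specific to its setting, but the general divide, resample, and recombine pattern it builds on can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration (not the authors' gap bootstrap): the data are split into approximately homogeneous blocks, the statistic is bootstrapped within each block, and the block-level variance estimates are combined under an assumed independence across blocks.

```python
import numpy as np

def block_bootstrap_se(data, blocks, stat=np.mean, B=500, seed=0):
    """Divide-resample-combine sketch: bootstrap a statistic within each
    block and combine block-level variances, assuming independent blocks.
    Illustrative pattern only, not the gap bootstrap of the paper."""
    rng = np.random.default_rng(seed)
    block_vars = []
    for idx in blocks:                      # each block is an index array
        x = data[idx]
        reps = np.empty(B)
        for b in range(B):                  # ordinary bootstrap inside the block
            reps[b] = stat(rng.choice(x, size=x.size, replace=True))
        block_vars.append(reps.var(ddof=1))
    # combine: variance of a weighted mean of independent block statistics
    w = np.array([idx.size for idx in blocks], dtype=float)
    w /= w.sum()
    return float(np.sqrt(np.sum(w**2 * np.array(block_vars))))

# usage: three blocks of inhomogeneous data
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 1000),
                       rng.normal(5, 3, 1000),
                       rng.normal(-2, 0.5, 1000)])
blocks = [np.arange(0, 1000), np.arange(1000, 2000), np.arange(2000, 3000)]
print(block_bootstrap_se(data, blocks))
```

Splitting the resampling across blocks keeps each bootstrap loop small enough to run independently (and in parallel), which is the computational point the abstract makes about massive data sets.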
Network Tomography: Identifiability and Fourier Domain Estimation
The statistical problem for network tomography is to infer the distribution of $\mathbf{X}$, with mutually independent components, from a measurement model $\mathbf{Y} = A\mathbf{X}$, where $A$ is a given binary matrix representing the routing topology of the network under consideration. The challenge is that the dimension of $\mathbf{X}$ is much larger than that of $\mathbf{Y}$, and thus the problem is often called ill-posed. This paper studies some statistical aspects of network tomography. We first address the identifiability issue and prove that the distribution of $\mathbf{X}$ is identifiable up to a shift parameter under mild conditions. We then use a mixture model of characteristic functions to derive a fast algorithm for estimating the distribution of $\mathbf{X}$ based on the generalized method of moments. Through extensive model simulation and real Internet trace driven simulation, the proposed approach is shown to compare favorably with previous methods that use simple discretization for inferring link delays in a heterogeneous network.

Comment: 21 pages
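To make the measurement model concrete, the following Python sketch (a hypothetical toy, not the paper's estimator) builds a small binary routing matrix $A$, simulates independent link delays $\mathbf{X}$, and forms path measurements $\mathbf{Y} = A\mathbf{X}$. It also checks the Fourier-domain identity that the characteristic function of a path delay factorizes over the links on that path, which is the property a mixture-of-characteristic-functions approach exploits.

```python
import numpy as np

rng = np.random.default_rng(1)

# binary routing matrix: 2 observed paths over 3 unobserved links (dim Y < dim X)
A = np.array([[1, 1, 0],    # path 1 uses links 1 and 2
              [0, 1, 1]])   # path 2 uses links 2 and 3

n = 100_000
# independent link delays with different (heterogeneous) distributions
X = np.column_stack([rng.exponential(1.0, n),
                     rng.gamma(2.0, 0.5, n),
                     rng.exponential(3.0, n)])
Y = X @ A.T                  # path measurements Y = A X, one row per sample

# Fourier-domain check at frequency t: E[exp(itY_j)] = prod_k E[exp(itX_k)]^{A_jk}
t = 0.7
cf_links = np.exp(1j * t * X).mean(axis=0)        # empirical link CFs
cf_paths = np.exp(1j * t * Y).mean(axis=0)        # empirical path CFs
cf_paths_from_links = np.prod(cf_links ** A, axis=1)
print(np.abs(cf_paths - cf_paths_from_links))     # small, up to Monte Carlo error
```

Because each path characteristic function is a product of the link characteristic functions selected by $A$, estimation can proceed link by link in the Fourier domain rather than by discretizing the delay distributions.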
Graph Sample and Hold: A Framework for Big-Graph Analytics
Sampling is a standard approach in big-graph analytics; the goal is to
efficiently estimate the graph properties by consulting a sample of the whole
population. A perfect sample is assumed to mirror every property of the whole
population. Unfortunately, such a perfect sample is hard to collect in complex
populations such as graphs (e.g., web graphs, social networks), where an
underlying network connects the units of the population. Therefore, a good
sample will be representative in the sense that graph properties of interest
can be estimated with a known degree of accuracy. While previous work focused
particularly on sampling schemes used to estimate certain graph properties
(e.g., triangle count), much less is known about the case where we need to estimate
various graph properties with the same sampling scheme. In this paper, we
propose a generic stream sampling framework for big-graph analytics, called
Graph Sample and Hold (gSH). To begin, the proposed framework samples from
massive graphs sequentially in a single pass, one edge at a time, while
maintaining a small state. We then show how to produce unbiased estimators for
various graph properties from the sample. Given that the graph analysis
algorithms will run on a sample instead of the whole population, the runtime
complexity of these algorithms is kept under control. Moreover, given that the
estimators of graph properties are unbiased, the approximation error is kept
under control. Finally, we show the performance of the proposed framework (gSH)
on various types of graphs, such as social graphs, among others.
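As a rough illustration of the sample-and-hold idea described above, the sketch below (a simplified reading of the framework, with parameter names chosen here for illustration) processes an edge stream in one pass: an edge whose endpoints are already held in the sample is kept with probability q, otherwise it is sampled with probability p, and the recorded admission probabilities yield a Horvitz-Thompson style unbiased estimate of the total edge count.

```python
import random

def graph_sample_and_hold(edge_stream, p=0.1, q=0.8, seed=42):
    """One-pass edge-stream sampling in the sample-and-hold style.

    Each sampled edge stores the probability with which it was admitted,
    so sums of 1/prob give Horvitz-Thompson style unbiased estimators.
    A simplified sketch of the gSH framework, not the paper's exact scheme.
    """
    rng = random.Random(seed)
    held_nodes = set()          # small state: endpoints of sampled edges
    sample = {}                 # edge -> probability it was admitted with
    for (u, v) in edge_stream:
        # "hold": edges touching already-sampled nodes get a higher probability
        prob = q if (u in held_nodes or v in held_nodes) else p
        if rng.random() < prob:
            sample[(u, v)] = prob
            held_nodes.update((u, v))
    return sample

def estimate_edge_count(sample):
    # Horvitz-Thompson estimator: each sampled edge is weighted by 1/prob
    return sum(1.0 / prob for prob in sample.values())

# usage on a toy stream
stream = [(i, (i * 7 + 3) % 50) for i in range(10_000)]
s = graph_sample_and_hold(stream)
print(len(s), estimate_edge_count(s))   # small sample, estimate near 10,000
```

The estimate is unbiased because, conditional on the state when an edge arrives, its admission indicator has mean equal to the recorded probability; other graph properties can be estimated from the same sample by weighting the relevant sampled substructures analogously.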
Evaluating the impact of traffic sampling in network analysis
Master's dissertation in Informatics Engineering (Dissertação de mestrado integrado em Engenharia Informática).

The sampling of network traffic is a very effective method for understanding the behaviour and flow of a network, and it is essential for building network management tools that control Service Level Agreements (SLAs) and Quality of Service (QoS), and that support traffic engineering and the planning of both the capacity and the security of the network.
With the exponential growth in traffic caused by the ever-increasing number of devices connected to the Internet, it becomes increasingly harder and more expensive to understand the behaviour of a network by analysing the total volume of traffic. The use of sampling techniques, or selective analysis, which consists in selecting a small subset of packets in order to estimate the expected behaviour of a network, therefore becomes essential.
Even though these techniques drastically reduce the amount of data to be analysed, the fact that the sampling tasks have to be performed on the network equipment can have a significant impact on the performance of that equipment and can reduce the accuracy with which the network state is estimated.
This dissertation evaluates the impact of selective analysis of network traffic, both in terms of how well the network state can be estimated and in terms of statistical properties, such as self-similarity and Long-Range Dependence (LRD), that exist in the original network traffic, allowing a better understanding of the behaviour of sampled network traffic.
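As a concrete illustration of the kind of selective analysis discussed above, the sketch below (a generic example, not tied to the dissertation's specific techniques) applies simple systematic 1-in-N packet sampling to a stream of packet sizes and scales the sampled volume back up to estimate the total traffic volume.

```python
import random

def systematic_sample(packet_sizes, n=100):
    """Systematic count-based sampling: keep every n-th packet."""
    return packet_sizes[::n]

def estimate_total_volume(sampled_sizes, n=100):
    # each sampled packet represents n packets of the original stream
    return n * sum(sampled_sizes)

# usage on a synthetic stream of packet sizes (bytes)
random.seed(0)
stream = [random.choice([64, 576, 1500]) for _ in range(1_000_000)]
sample = systematic_sample(stream, n=100)
print(len(sample), estimate_total_volume(sample, n=100), sum(stream))
```

Scaling by the sampling rate recovers aggregate quantities such as total volume reasonably well, but, as the abstract notes, properties such as self-similarity and LRD may be distorted by the sampling process itself.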
- …