
    Gap bootstrap methods for massive data sets with an application to transportation engineering

    In this paper we describe two bootstrap methods for massive data sets. Naive applications of common resampling methodology are often impractical for massive data sets due to the computational burden and complex patterns of inhomogeneity. In contrast, the proposed methods exploit certain structural properties of a large class of massive data sets to break the original problem into a set of simpler subproblems, solve each subproblem separately where the data exhibit approximate uniformity and where computational complexity can be reduced to a manageable level, and then combine the results through certain analytical considerations. The validity of the proposed methods is proved and their finite-sample properties are studied through a moderately large simulation study. The methodology is illustrated with a real data example from Transportation Engineering, which motivated the development of the proposed methods. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/12-AOAS587 by the Institute of Mathematical Statistics (http://www.imstat.org).
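    The divide / resample-within-subproblem / combine idea in this abstract can be made concrete with a short sketch. The Python fragment below is a minimal illustration only, assuming the simplest possible setting (i.i.d. data split into equal subgroups, with the sample mean as the statistic); the paper's gap bootstrap exploits more structure than this plain subgroup scheme, and the function name `subgroup_bootstrap_se` is hypothetical.

```python
# A minimal sketch, NOT the paper's gap bootstrap: resample within each
# subgroup independently, then combine the variances analytically.
import numpy as np

rng = np.random.default_rng(0)

def subgroup_bootstrap_se(data, n_groups=10, n_boot=200):
    """Standard error of the overall mean via per-subgroup resampling."""
    n = data.size
    var_sum = 0.0
    for g in np.array_split(data, n_groups):
        # Resampling stays inside the subgroup: cheap and parallelizable.
        boots = rng.choice(g, size=(n_boot, g.size), replace=True).mean(axis=1)
        # The overall mean is sum_k (n_k/n) * mean_k, so its variance is
        # the weighted sum of subgroup-mean variances.
        var_sum += (g.size / n) ** 2 * boots.var(ddof=1)
    return np.sqrt(var_sum)

data = rng.exponential(scale=2.0, size=100_000)
print(subgroup_bootstrap_se(data))   # theory: 2 / sqrt(1e5) ~ 0.0063
```

    Each subproblem touches only its own slice of the data, which is what keeps the computation manageable when the full data set cannot be resampled as a whole.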

    Network Tomography: Identifiability and Fourier Domain Estimation

    The statistical problem in network tomography is to infer the distribution of $\mathbf{X}$, a vector with mutually independent components, from the measurement model $\mathbf{Y}=A\mathbf{X}$, where $A$ is a given binary matrix representing the routing topology of the network under consideration. The challenge is that the dimension of $\mathbf{X}$ is much larger than that of $\mathbf{Y}$, so the problem is often called ill-posed. This paper studies some statistical aspects of network tomography. We first address the identifiability issue and prove that the distribution of $\mathbf{X}$ is identifiable up to a shift parameter under mild conditions. We then use a mixture model of characteristic functions to derive a fast algorithm for estimating the distribution of $\mathbf{X}$ based on the generalized method of moments. Through extensive model simulation and real Internet-trace-driven simulation, the proposed approach is shown to compare favorably with previous methods that use simple discretization for inferring link delays in a heterogeneous network. Comment: 21 pages
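    The measurement model and the characteristic-function idea can be illustrated with a toy sketch. The Python fragment below assumes a hypothetical two-path, three-link tree with exponential link delays and fits the link rates by matching empirical and model characteristic functions on a frequency grid; it is not the paper's mixture-model GMM algorithm, only a sketch of the general approach.

```python
# A toy sketch of CF-based estimation for Y = A X (hypothetical network).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Routing matrix of a 2-path, 3-link tree: rows = paths, columns = links.
A = np.array([[1, 1, 0],
              [1, 0, 1]])

true_rates = np.array([1.0, 2.0, 0.5])
X = rng.exponential(1.0 / true_rates, size=(100_000, 3))  # link delays
Y = X @ A.T                                               # observed path delays

t = np.linspace(0.2, 2.0, 15)                    # frequencies to match
emp = np.exp(1j * Y[:, :, None] * t).mean(axis=0)  # empirical CFs, (2, 15)

def model_cf(rates):
    # CF of Exponential(rate): rate / (rate - i t); independence of the
    # links makes each path CF the product of its link CFs.
    link = rates[:, None] / (rates[:, None] - 1j * t)
    return np.stack([link[A[i].astype(bool)].prod(axis=0) for i in range(2)])

def loss(log_rates):
    return np.abs(emp - model_cf(np.exp(log_rates))).sum()

fit = minimize(loss, x0=np.zeros(3), method="Nelder-Mead")
print(np.exp(fit.x))   # close to true_rates, up to local optima
```

    The dimension mismatch is visible even here: two observed path-delay coordinates must identify three link-delay distributions, which is only possible because of the independence structure the paper exploits.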

    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). The proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
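    A minimal sketch of the one-pass, small-state sampling idea and the inverse-probability (Horvitz-Thompson) estimators it enables is given below. It implements only the Bernoulli special case, where every arriving edge is held with a fixed probability p; gSH itself lets the sampling probability depend on the edges already held, which this sketch does not attempt.

```python
# A minimal sketch: Bernoulli edge sampling in one pass, with
# Horvitz-Thompson estimators for edge and triangle counts.
import random
from collections import defaultdict

def sample_stream(edge_stream, p, seed=0):
    rng = random.Random(seed)
    held = []
    for e in edge_stream:        # single pass, one edge at a time
        if rng.random() < p:
            held.append(e)       # small state: only the sampled edges
    return held

def ht_estimates(held, p):
    m_hat = len(held) / p        # each edge survives with probability p
    adj = defaultdict(set)
    for u, v in held:
        adj[u].add(v)
        adj[v].add(u)
    # Each fully-sampled triangle is counted once per edge, hence // 3;
    # a triangle survives only if all 3 of its edges were held: prob p**3.
    tri = sum(len(adj[u] & adj[v]) for u, v in held) // 3
    return m_hat, tri / p**3

# Hypothetical usage on a toy stream (complete graph on 60 vertices).
edges = [(i, j) for i in range(60) for j in range(i + 1, 60)]
m_hat, t_hat = ht_estimates(sample_stream(edges, p=0.3), p=0.3)
print(m_hat, t_hat)   # true values: 1770 edges, 34220 triangles
```

    Dividing each sampled count by its inclusion probability is what makes the estimators unbiased, regardless of which graph property is being estimated.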

    Evaluating the impact of traffic sampling in network analysis

    Integrated master's dissertation in Informatics Engineering. The sampling of network traffic is a very effective method for understanding the behaviour and flow of a network, and is essential for building network management tools that support Service Level Agreements (SLAs), Quality of Service (QoS), traffic engineering, and the planning of both the capacity and the safety of the network. With the exponential rise in the amount of traffic caused by the growing number of devices connected to the Internet, it becomes increasingly harder and more expensive to understand the behaviour of a network through analysis of the total volume of traffic. The use of sampling techniques, or selective analysis, which consists in selecting a small number of packets in order to estimate the expected behaviour of a network, thus becomes essential. Even though these techniques drastically reduce the amount of data to be analyzed, the fact that the sampling tasks have to be performed in the network equipment can have a significant impact on the performance of those devices and reduce the accuracy of the estimation of the network state. This dissertation evaluates the impact of selective analysis of network traffic, both in terms of performance in estimating the network state and in terms of statistical properties, such as self-similarity and Long-Range Dependence (LRD), that exist in the original traffic, allowing a better understanding of the behaviour of sampled network traffic.
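    As a concrete illustration of selective analysis, the sketch below implements systematic 1-in-N packet sampling over a hypothetical trace of packet sizes and rescales the sampled volume to estimate the total. It is a toy under stated assumptions, not tied to any specific technique evaluated in the dissertation.

```python
# A minimal sketch of systematic 1-in-N packet sampling and rescaling.
import random

def one_in_n(packets, n, phase=0):
    """Keep every n-th packet starting at `phase` (systematic sampling)."""
    return packets[phase::n]

# Hypothetical trace: per-packet sizes in bytes.
random.seed(0)
trace = [random.choice([64, 576, 1500]) for _ in range(100_000)]

n = 100
sample = one_in_n(trace, n)
est_total = n * sum(sample)     # scale the sampled volume back up
print(est_total, sum(trace))    # estimate vs. ground truth
```

    The trade-off the dissertation studies appears even in this toy: larger N means less work on the network equipment but a noisier estimate, and properties such as LRD may be distorted in the sampled series.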