
    Automatically Selecting Parameters for Graph-Based Clustering

    Data streams present a number of challenges caused by changes in stream concepts over time. In this thesis we present a novel method for detecting concept drift within data streams by analysing geometric features of the RepStream clustering algorithm. Further, we present novel methods for automatically adjusting critical input parameters over time and for generating self-organising nearest-neighbour graphs, improving robustness and decreasing the need for domain-specific knowledge in the face of stream evolution.
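
    The thesis's RepStream-based detector is not reproduced here; as a hedged stand-in, the sketch below monitors one geometric statistic of a sliding window (mean nearest-neighbour distance) and raises an alarm when it leaves a tolerance band, which is the general shape of geometry-driven drift detection. All names and thresholds (window_size, history, k) are illustrative assumptions.

        # Simplified, illustrative drift detector: track a geometric statistic of
        # the recent window and flag points where it deviates sharply from its
        # recent history. RepStream's actual graph-based features are not used.
        from collections import deque
        import numpy as np

        def mean_nn_distance(window: np.ndarray) -> float:
            """Mean distance from each point to its nearest neighbour."""
            diff = window[:, None, :] - window[None, :, :]
            d = np.sqrt((diff ** 2).sum(-1))
            np.fill_diagonal(d, np.inf)
            return float(d.min(axis=1).mean())

        def detect_drift(stream, window_size=200, history=30, k=3.0):
            """Yield stream indices where the statistic leaves the k-sigma band
            of its recent history (a crude concept-drift alarm)."""
            window = deque(maxlen=window_size)
            stats = deque(maxlen=history)
            for i, x in enumerate(stream):
                window.append(x)
                if len(window) < window_size:
                    continue
                s = mean_nn_distance(np.asarray(window))
                if len(stats) == history:
                    mu, sigma = np.mean(stats), np.std(stats) + 1e-12
                    if abs(s - mu) > k * sigma:
                        yield i          # drift suspected at this point
                        stats.clear()    # rebuild the baseline after an alarm
                stats.append(s)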

    Trajectory Data Analysis on a Big Data Platform

    The evolution of geolocation technologies has led, in recent years, to the generation of a large quantity of spatial data. Among these, trajectory data are particularly relevant: spatio-temporal data that represent the path of an object through space as a function of time. Such data are of fundamental importance in many application domains, such as security, transportation, and ecology. Technological development, however, is radically transforming spatial data: the volumes, the heterogeneity, and the rate at which they arrive exceed the capabilities of traditional technologies. These new characteristics therefore make the adoption of innovative technologies necessary. This thesis describes the study and development of an application for analysing trajectory data on a Big Data platform. The application can extract a range of information linked to the people the data refer to, such as their home, workplace, or frequently visited places. The main contribution concerns the algorithms for clustering the spatio-temporal data: the grid- and density-based clustering algorithm used proved more efficient than traditional density-based algorithms while remaining equally effective. Moreover, the application demonstrates the validity and efficiency of Big Data technologies applied to large volumes of spatial data.
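
    As a rough illustration of grid- and density-based clustering (the thesis's actual algorithm and its Big Data implementation are not reproduced), the sketch below snaps GPS points to grid cells, keeps cells containing at least min_pts points, and merges adjacent dense cells into clusters; frequent places such as home or work could then be read off the largest clusters. cell_size and min_pts are illustrative parameters.

        # Toy grid/density clustering: dense cells = cells with >= min_pts points;
        # clusters = connected components of dense cells (8-neighbourhood).
        import math
        from collections import Counter, deque

        def grid_density_clusters(points, cell_size=0.001, min_pts=20):
            """points: iterable of (lat, lon). Returns dict cell -> cluster id."""
            counts = Counter((math.floor(lat / cell_size), math.floor(lon / cell_size))
                             for lat, lon in points)
            dense = {c for c, n in counts.items() if n >= min_pts}
            labels, next_id = {}, 0
            for cell in dense:
                if cell in labels:
                    continue
                labels[cell] = next_id
                queue = deque([cell])            # flood-fill one component
                while queue:
                    cx, cy = queue.popleft()
                    for dx in (-1, 0, 1):
                        for dy in (-1, 0, 1):
                            nb = (cx + dx, cy + dy)
                            if nb in dense and nb not in labels:
                                labels[nb] = next_id
                                queue.append(nb)
                next_id += 1
            return labels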

    Communication Efficient Algorithms for Generating Massive Networks

    Massive complex systems are prevalent throughout all of our lives, from biological systems such as the human genome to technological networks such as Facebook or Twitter. Rapid advances in technology allow us to gather more and more data connected to these systems. Analyzing and extracting this huge amount of information is a crucial task for a variety of scientific disciplines. A common abstraction for handling complex systems is networks (graphs) made up of entities and their relationships. For example, we can represent wireless ad hoc networks in terms of nodes and their connections with each other. We then identify the nodes as vertices and their connections as edges between the vertices. This abstraction allows us to develop algorithms that are independent of the underlying domain. Designing algorithms for massive networks is a challenging task that requires thorough analysis and experimental evaluation. A major hurdle for this task is the scarcity of publicly available large-scale datasets. To approach this issue, we can make use of network generators [21]. These generators allow us to produce synthetic instances that exhibit properties found in many real-world networks. In this thesis we develop a set of novel graph generators that focus on scalability. In particular, we cover the classic Erdős–Rényi model, random geometric graphs, and random hyperbolic graphs. These models represent different real-world systems, from the aforementioned wireless ad hoc networks [40] to social networks [44]. We ensure scalability by making use of pseudorandomization via hash functions and redundant computations. The resulting network generators are communication agnostic, i.e., they require no communication. This allows us to generate massive instances of up to 2^43 vertices and 2^47 edges in less than 22 minutes on 32,768 processors. In addition to proving theoretical bounds for each generator, we perform an extensive experimental evaluation. We cover both their sequential performance and their scaling behavior. We are able to show that our algorithms are competitive with state-of-the-art implementations found in network analysis libraries. Additionally, our generators exhibit near-optimal scaling behavior for large instances. Finally, we show that pseudorandomization has little to no measurable impact on the quality of our generated instances.
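
    The pseudorandomization idea lends itself to a short sketch: derive the random decision for every potential edge {u, v} from a hash of (seed, u, v), so any processor can recompute any part of the graph with no communication. The sketch below does this for the Erdős–Rényi G(n, p) model; it is a naive O(n^2) illustration, not the efficient sampling the thesis develops, and the function names are assumptions.

        # Each edge decision is a deterministic function of (seed, u, v), so all
        # processors agree on the graph without exchanging a single message.
        import hashlib

        def edge_exists(seed: int, u: int, v: int, p: float) -> bool:
            """Deterministic pseudo-random Bernoulli(p) draw for edge {u, v}."""
            a, b = (u, v) if u < v else (v, u)        # order-independent key
            h = hashlib.sha256(f"{seed}:{a}:{b}".encode()).digest()
            r = int.from_bytes(h[:8], "big") / 2**64  # uniform in [0, 1)
            return r < p

        def local_edges(seed, n, p, rank, num_procs):
            """Edges incident to this processor's vertex range; consistent with
            every other processor's view of the same G(n, p) instance."""
            lo, hi = rank * n // num_procs, (rank + 1) * n // num_procs
            return [(u, v) for u in range(lo, hi)
                           for v in range(u + 1, n)
                           if edge_exists(seed, u, v, p)]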

    Sub-model aggregation for scalable eigenvector spatial filtering: Application to spatially varying coefficient modeling

    This study proposes a method for aggregating/synthesizing global and local sub-models for fast and flexible spatial regression modeling. Eigenvector spatial filtering (ESF) was used to model spatially varying coefficients and spatial dependence in the residuals by sub-model, while the generalized product-of-experts method was used to aggregate these sub-models. The major advantages of the proposed method are as follows: (i) it is highly scalable for large samples in terms of both accuracy and computational efficiency; (ii) it is easily implemented by first estimating the sub-models independently and then aggregating/averaging them; and (iii) likelihood-based inference is available because the marginal likelihood is available in closed form. The accuracy and computational efficiency of the proposed method are confirmed using Monte Carlo simulation experiments. The method is then applied to residential land price analysis in Japan. The results demonstrate its usefulness for improving the interpretability of spatially varying coefficients. The proposed method is implemented in the R package spmoran (version 0.3.0 or later).
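
    The generalized product-of-experts aggregation step admits a compact sketch: each sub-model i contributes a Gaussian prediction N(mu_i, var_i) at a test point, and the aggregate precision is the beta-weighted sum of the experts' precisions. The uniform weights below are an assumption for illustration, not the paper's choice, and the sketch omits the ESF sub-models themselves.

        # gPoE aggregation: precision-weighted combination of Gaussian experts.
        import numpy as np

        def gpoe_aggregate(means, variances, betas=None):
            """means, variances: arrays of shape (n_experts, n_points).
            Returns the aggregated predictive mean and variance per point."""
            means, variances = np.asarray(means), np.asarray(variances)
            if betas is None:                       # assumed uniform weights
                betas = np.full(means.shape[0], 1.0 / means.shape[0])
            prec = (betas[:, None] / variances).sum(axis=0)
            mean = (betas[:, None] * means / variances).sum(axis=0) / prec
            return mean, 1.0 / prec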

    A Global-Local Approximation Framework for Large-Scale Gaussian Process Modeling

    In this work, we propose a novel framework for large-scale Gaussian process (GP) modeling. In contrast to the purely global and purely local approximations proposed in the literature to address the computational bottleneck of exact GP modeling, we employ a combined global-local approach in building the approximation. Our framework uses a subset-of-data approach in which the subset is the union of a set of global points, designed to capture the global trend in the data, and a set of local points specific to a given testing location, which capture the local trend around that location. The correlation function is likewise modeled as a combination of a global and a local kernel. The performance of our framework, which we refer to as TwinGP, is on par with or better than that of state-of-the-art GP modeling methods at a fraction of their computational cost.
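
    The subset-of-data idea behind TwinGP can be sketched directly: predict at a test point from an exact GP fit on the union of a fixed set of global points and the k nearest training points to that test point. The sketch below uses a plain RBF kernel in place of the paper's combined global/local kernel, and all names and parameter values are illustrative.

        # Global-local subset-of-data GP prediction (simplified TwinGP flavour).
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        def global_local_gp_predict(X, y, x_star, n_global=100, n_local=30, seed=0):
            rng = np.random.default_rng(seed)
            g = rng.choice(len(X), size=min(n_global, len(X)), replace=False)
            dist = np.linalg.norm(X - x_star, axis=1)
            loc = np.argsort(dist)[:n_local]          # local neighbourhood
            idx = np.union1d(g, loc)                  # global + local subset
            gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
            gp.fit(X[idx], y[idx])
            return gp.predict(x_star.reshape(1, -1), return_std=True)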

    Bayesian optimisation for likelihood-free cosmological inference

    Many cosmological models have only a finite number of parameters of interest, but a very expensive data-generating process and an intractable likelihood function. We address the problem of performing likelihood-free Bayesian inference from such black-box simulation-based models, under the constraint of a very limited simulation budget (typically a few thousand simulations). To do so, we adopt an approach based on the likelihood of an alternative parametric model. Conventional approaches to approximate Bayesian computation, such as likelihood-free rejection sampling, are impractical for the considered problem, due to the lack of knowledge about how the parameters affect the discrepancy between observed and simulated data. As a response, we make use of a strategy previously developed in the machine learning literature (Bayesian optimisation for likelihood-free inference, BOLFI), which combines Gaussian process regression of the discrepancy, to build a surrogate surface, with Bayesian optimisation to actively acquire training data. We extend the method by deriving an acquisition function tailored to minimising the expected uncertainty in the approximate posterior density, in the parametric approach. The resulting algorithm is applied to the problems of summarising Gaussian signals and inferring cosmological parameters from the Joint Lightcurve Analysis supernovae data. We show that the number of required simulations is reduced by several orders of magnitude, and that the proposed acquisition function produces more accurate posterior approximations than common strategies.
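
    The BOLFI loop described above reduces to a short schematic: fit a GP surrogate to the simulated discrepancies seen so far, pick the next parameter with an acquisition rule, simulate there, and repeat. The lower-confidence-bound rule below is a common stand-in, not the paper's tailored acquisition, and every name and budget is an illustrative assumption.

        # Schematic BOLFI-style loop: GP surrogate of the discrepancy plus an
        # acquisition rule that trades off low predicted discrepancy against
        # surrogate uncertainty.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        def bolfi_loop(simulate_discrepancy, bounds, n_init=10, n_iter=50, seed=0):
            """simulate_discrepancy(theta) -> scalar distance between simulated
            and observed summaries; bounds = (low, high) parameter arrays."""
            rng = np.random.default_rng(seed)
            low, high = map(np.asarray, bounds)
            thetas = rng.uniform(low, high, size=(n_init, low.size))
            ds = np.array([simulate_discrepancy(t) for t in thetas])
            for _ in range(n_iter):
                gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
                gp.fit(thetas, ds)
                cand = rng.uniform(low, high, size=(2048, low.size))
                mu, sd = gp.predict(cand, return_std=True)
                t_next = cand[np.argmin(mu - 2.0 * sd)]   # optimistic minimum
                thetas = np.vstack([thetas, t_next])
                ds = np.append(ds, simulate_discrepancy(t_next))
            return thetas, ds      # surrogate training set for the posterior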