296 research outputs found
Automatically Selecting Parameters for Graph-Based Clustering
Data streams present a number of challenges caused by change in stream concepts over time. In this thesis we present a novel method for detecting concept drift within data streams by analysing geometric features of the RepStream clustering algorithm. We further present novel methods for automatically adjusting critical input parameters over time and for generating self-organising nearest-neighbour graphs, improving robustness and reducing the need for domain-specific knowledge in the face of stream evolution.
Trajectory Data Analysis on a Big Data Platform
In recent years, the evolution of geolocation technologies has led to the generation of large amounts of spatial data. Among these, trajectory data are particularly relevant: spatio-temporal data representing the path of an object through space as a function of time. Such data are of fundamental importance in many application domains, such as security, transportation, and ecology. Technological development, however, is radically transforming spatial data: their volume, heterogeneity, and arrival rate exceed the capabilities of traditional technologies, making the adoption of innovative technologies necessary. This thesis describes the study and development of an application for trajectory data analysis on a Big Data platform. The application can extract information, such as home location, workplace, and frequently visited places, about the people the data refer to. The main contribution concerns the algorithms for clustering spatio-temporal data: the grid- and density-based clustering algorithm used proved more efficient than traditional density-based algorithms while remaining equally effective. Moreover, the application demonstrates the validity and efficiency of Big Data technologies applied to large volumes of spatial data.
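The grid- and density-based clustering idea can be sketched as follows (an illustrative toy in Python, not the thesis's actual algorithm; the names `grid_density_cluster`, `cell_size`, and `min_pts` are assumptions): points are binned into grid cells, cells holding at least `min_pts` points are kept as dense, and adjacent dense cells are merged into clusters. This avoids the pairwise distance computations of classic density-based methods such as DBSCAN, which is the source of the efficiency gain.

```python
from collections import defaultdict, deque

def grid_density_cluster(points, cell_size, min_pts):
    """Cluster 2-D points by binning them into a grid, keeping cells
    with at least min_pts points, and merging adjacent dense cells."""
    cells = defaultdict(list)
    for p in points:
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells[key].append(p)
    dense = {k for k, pts in cells.items() if len(pts) >= min_pts}

    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        # flood-fill over the 8-neighbourhood of dense cells
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            component.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(component)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.2),   # dense blob A
       (5.0, 5.0), (5.1, 5.05), (5.05, 5.1),   # dense blob B
       (9.9, 0.1)]                              # isolated noise point
print(len(grid_density_cluster(pts, cell_size=1.0, min_pts=2)))  # prints 2
```

Each point is touched once for binning and once during the flood fill, so the cost is linear in the number of points plus the number of dense cells.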
Communication Efficient Algorithms for Generating Massive Networks
Massive complex systems are prevalent throughout our lives, from biological
systems such as the human genome to technological networks such as Facebook or Twitter.
Rapid advances in technology allow us to gather more and more data that is connected to
these systems. Analyzing and extracting this huge amount of information is a crucial task
for a variety of scientific disciplines.
A common abstraction for handling complex systems is the network (graph), made up of
entities and their relationships. For example, we can represent a wireless ad hoc network in
terms of nodes and their connections with each other. We then identify the nodes as vertices
and their connections as edges between the vertices. This abstraction allows us to develop
algorithms that are independent of the underlying domain.
Designing algorithms for massive networks is a challenging task that requires thorough
analysis and experimental evaluation. A major hurdle for this task is the scarcity of publicly
available large-scale datasets. To approach this issue, we can make use of network generators
[21]. These generators allow us to produce synthetic instances that exhibit properties
found in many real-world networks.
In this thesis we develop a set of novel graph generators that have a focus on scalability.
In particular, we cover the classic Erdős–Rényi model, random geometric graphs and
random hyperbolic graphs. These models represent different real-world systems, from the
aforementioned wireless ad hoc networks [40] to social networks [44]. We ensure scalability
by making use of pseudorandomization via hash functions and redundant computations.
The resulting network generators are communication agnostic, i.e. they require no communication.
This allows us to generate massive instances of up to 2^43 vertices and 2^47 edges
in less than 22 minutes on 32,768 processors.
In addition to proving theoretical bounds for each generator, we perform an extensive
experimental evaluation. We cover both their sequential performance and their scaling
behavior. We are able to show that our algorithms are competitive with state-of-the-art
implementations found in network analysis libraries. Additionally, our generators exhibit
near-optimal scaling behavior for large instances. Finally, we show that pseudorandomization
has little to no measurable impact on the quality of our generated instances.
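The pseudorandomization idea, deriving every random decision from a hash of an edge's endpoints so that any processor can recompute it locally without communication, can be sketched for the Erdős–Rényi G(n, p) model (an illustrative Python toy, not the thesis's generators; all names are assumptions):

```python
import hashlib

def edge_exists(u, v, p, seed=42):
    """Decide pseudorandomly whether edge {u, v} is present in G(n, p).
    The decision depends only on (u, v, seed), so any processor can
    recompute it locally, with no communication."""
    if u > v:
        u, v = v, u
    digest = hashlib.sha256(f"{seed}:{u}:{v}".encode()).digest()
    # map the first 8 digest bytes to a uniform float in [0, 1)
    x = int.from_bytes(digest[:8], "big") / 2**64
    return x < p

def local_edges(n, p, rank, num_procs, seed=42):
    """Each processor emits only the edges whose smaller endpoint
    falls in its own vertex range; the ranges partition [0, n)."""
    lo = rank * n // num_procs
    hi = (rank + 1) * n // num_procs
    return [(u, v) for u in range(lo, hi)
            for v in range(u + 1, n)
            if edge_exists(u, v, p, seed)]

# two "processors" jointly generate G(100, 0.05) with no shared state
edges = local_edges(100, 0.05, 0, 2) + local_edges(100, 0.05, 1, 2)
```

Because every edge decision is a pure function of the endpoints and the seed, the edge sets produced by different ranks are disjoint, reproducible, and consistent, which is exactly what makes the generators communication agnostic.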
Sub-model aggregation for scalable eigenvector spatial filtering: Application to spatially varying coefficient modeling
This study proposes a method for aggregating/synthesizing global and local
sub-models for fast and flexible spatial regression modeling. Eigenvector
spatial filtering (ESF) was used to model spatially varying coefficients and
spatial dependence in the residuals by sub-model, while the generalized
product-of-experts method was used to aggregate these sub-models. The major
advantages of the proposed method are as follows: (i) it is highly scalable for
large samples in terms of accuracy and computational efficiency; (ii) it is
easily implemented by estimating sub-models independently first and
aggregating/averaging them thereafter; and (iii) likelihood-based inference is
available because the marginal likelihood is available in closed-form. The
accuracy and computational efficiency of the proposed method are confirmed
using Monte Carlo simulation experiments. This method was then applied to
residential land price analysis in Japan. The results demonstrate the
usefulness of this method for improving the interpretability of spatially
varying coefficients. The proposed method is implemented in the R package
spmoran (version 0.3.0 or later).
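The generalized product-of-experts aggregation step can be illustrated for Gaussian sub-model predictions (a minimal sketch, not the spmoran implementation; the function name and the weights are assumptions): each expert contributes its precision-weighted mean, scaled by a weight beta_k, and the aggregate precision is the beta-weighted sum of the expert precisions.

```python
import numpy as np

def gpoe_aggregate(means, variances, betas=None):
    """Generalised product-of-experts: combine K Gaussian sub-model
    predictions (mean_k, var_k) with weights beta_k into one Gaussian.
    The aggregate precision is the beta-weighted sum of expert precisions."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if betas is None:
        betas = np.full(len(means), 1.0 / len(means))  # uniform weights
    betas = np.asarray(betas, dtype=float)
    precision = np.sum(betas / variances)
    mean = np.sum(betas * means / variances) / precision
    return mean, 1.0 / precision

# two equally weighted experts: the aggregate lands between them
m, v = gpoe_aggregate([1.0, 3.0], [0.5, 0.5], betas=[0.5, 0.5])
```

Because each sub-model is estimated independently and only its predictive mean and variance enter the aggregation, the scheme parallelises trivially, which is the source of the scalability claimed above.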
A Global-Local Approximation Framework for Large-Scale Gaussian Process Modeling
In this work, we propose a novel framework for large-scale Gaussian process
(GP) modeling. In contrast to the global and local approximations proposed in the
literature to address the computational bottleneck of exact GP modeling, we
employ a combined global-local approach in building the approximation. Our
framework uses a subset-of-data approach where the subset is a union of a set
of global points designed to capture the global trend in the data, and a set of
local points specific to a given testing location to capture the local trend
around the testing location. The correlation function is also modeled as a
combination of a global and a local kernel. The performance of our framework,
which we refer to as TwinGP, is on par with or better than state-of-the-art GP
modeling methods at a fraction of their computational cost.
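The subset-of-data idea behind the framework can be sketched as follows (an illustrative Python toy, not the TwinGP code; all names and parameter choices are assumptions): the training subset for a test point is the union of a fixed global subset and the test point's nearest neighbours, and a standard GP posterior mean is computed on that subset only.

```python
import numpy as np

def rbf(a, b, ls):
    """Squared-exponential kernel between row-vector sets a and b."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ls**2)

def global_local_gp_predict(X, y, x_star, n_global=20, n_local=10,
                            ls=1.0, noise=1e-4, seed=0):
    """Predict at x_star from a GP fitted on a subset of (X, y):
    a fixed random 'global' subset capturing the overall trend plus
    the n_local nearest neighbours of x_star for the local trend."""
    rng = np.random.default_rng(seed)
    global_idx = rng.choice(len(X), size=min(n_global, len(X)), replace=False)
    local_idx = np.argsort(np.sum((X - x_star) ** 2, axis=1))[:n_local]
    idx = np.unique(np.concatenate([global_idx, local_idx]))
    Xs, ys = X[idx], y[idx]
    K = rbf(Xs, Xs, ls) + noise * np.eye(len(idx))
    k_star = rbf(x_star[None, :], Xs, ls)[0]
    return k_star @ np.linalg.solve(K, ys)  # GP posterior mean

X = np.linspace(0.0, 10.0, 200)[:, None]
y = np.sin(X[:, 0])
pred = global_local_gp_predict(X, y, np.array([3.0]))
```

The cubic cost of the linear solve applies only to the small subset (here at most 30 points), not to the full dataset, which is where the computational savings over exact GP modeling come from.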
Bayesian optimisation for likelihood-free cosmological inference
Many cosmological models have only a finite number of parameters of interest,
but a very expensive data-generating process and an intractable likelihood
function. We address the problem of performing likelihood-free Bayesian
inference from such black-box simulation-based models, under the constraint of
a very limited simulation budget (typically a few thousand). To do so, we adopt
an approach based on the likelihood of an alternative parametric model.
Conventional approaches to approximate Bayesian computation such as
likelihood-free rejection sampling are impractical for the considered problem,
due to the lack of knowledge about how the parameters affect the discrepancy
between observed and simulated data. As a response, we make use of a strategy
previously developed in the machine learning literature (Bayesian optimisation
for likelihood-free inference, BOLFI), which combines Gaussian process
regression of the discrepancy to build a surrogate surface with Bayesian
optimisation to actively acquire training data. We extend the method by
deriving an acquisition function tailored for the purpose of minimising the
expected uncertainty in the approximate posterior density, in the parametric
approach. The resulting algorithm is applied to the problems of summarising
Gaussian signals and inferring cosmological parameters from the Joint
Lightcurve Analysis supernovae data. We show that the number of required
simulations is reduced by several orders of magnitude, and that the proposed
acquisition function produces more accurate posterior approximations, as
compared to common strategies.
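The BOLFI strategy the paper builds on can be sketched in a few lines (an illustrative one-dimensional toy with a simple lower-confidence-bound acquisition, not the paper's tailored acquisition function; all names are assumptions): fit a GP to the discrepancy between simulated and observed data, then choose the next simulation where the GP's lower confidence bound is smallest.

```python
import numpy as np

def bolfi_sketch(simulator, observed, bounds, n_init=5, n_iter=20,
                 ls=0.3, noise=1e-4, beta=2.0, seed=0):
    """Fit a GP to the discrepancy |simulator(theta) - observed| and
    acquire new parameters by minimising the lower confidence bound
    mu - beta * sd over a candidate grid."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    thetas = list(rng.uniform(lo, hi, n_init))
    discs = [abs(simulator(t) - observed) for t in thetas]

    def rbf(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

    grid = np.linspace(lo, hi, 200)
    for _ in range(n_iter):
        T, D = np.array(thetas), np.array(discs)
        K = rbf(T, T) + noise * np.eye(len(T))
        Ks = rbf(grid, T)
        mu = Ks @ np.linalg.solve(K, D)                  # GP posterior mean
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
        lcb = mu - beta * np.sqrt(np.maximum(var, 0.0))  # acquisition value
        t_next = grid[np.argmin(lcb)]                    # next simulation
        thetas.append(t_next)
        discs.append(abs(simulator(t_next) - observed))
    return thetas[int(np.argmin(discs))]  # best parameter found

# toy "black-box" simulator with true parameter 1.5
best = bolfi_sketch(lambda t: t**2, observed=2.25, bounds=(0.0, 3.0))
```

The acquisition trades off exploitation (low predicted discrepancy mu) against exploration (high predictive uncertainty), so the simulation budget concentrates where it is most informative; the paper's contribution is an acquisition function that instead targets the expected uncertainty of the approximate posterior itself.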