Transforming Graph Representations for Statistical Relational Learning
Relational data representations have become an increasingly important topic
due to the recent proliferation of network datasets (e.g., social, biological,
information networks) and a corresponding increase in the application of
statistical relational learning (SRL) algorithms to these domains. In this
article, we examine a range of representation issues for graph-based relational
data. Since the choice of relational data representation for the nodes, links,
and features can dramatically affect the capabilities of SRL algorithms, we
survey approaches and opportunities for relational representation
transformation designed to improve the performance of these algorithms. This
leads us to introduce an intuitive taxonomy for data representation
transformations in relational domains that incorporates link transformation and
node transformation as symmetric representation tasks. In particular, the
transformation tasks for both nodes and links include (i) predicting their
existence, (ii) predicting their label or type, (iii) estimating their weight
or importance, and (iv) systematically constructing their relevant features. We
motivate our taxonomy through detailed examples and use it to survey and
compare competing approaches for each of these tasks. We also discuss general
conditions for transforming links, nodes, and features. Finally, we highlight
challenges that remain to be addressed.
Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning
Many problems in sequential decision making and stochastic control often have
natural multiscale structure: sub-tasks are assembled together to accomplish
complex goals. Systematically inferring and leveraging hierarchical structure,
particularly beyond a single level of abstraction, has remained a longstanding
challenge. We describe a fast multiscale procedure for repeatedly compressing,
or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of
sub-problems at different scales is automatically determined. Coarsened MDPs
are themselves independent, deterministic MDPs, and may be solved using
existing algorithms. The multiscale representation delivered by this procedure
decouples sub-tasks from each other and can lead to substantial improvements in
convergence rates both locally within sub-problems and globally across
sub-problems, yielding significant computational savings. A second fundamental
aspect of this work is that these multiscale decompositions yield new transfer
opportunities across different problems, where solutions of sub-tasks at
different levels of the hierarchy may be amenable to transfer to new problems.
Localized transfer of policies and potential operators at arbitrary scales is
emphasized. Finally, we demonstrate compression and transfer in a collection of
illustrative domains, including examples involving discrete and continuous
state spaces.
Comment: 86 pages, 15 figures
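The abstract above notes that coarsened MDPs may be solved using existing algorithms. As a point of reference only, here is a minimal sketch of one such standard algorithm, value iteration, on a small deterministic chain. The function names, the array layout `P[a, s, s2]` / `R[a, s]`, and the toy chain are assumptions made for illustration; this is not the multiscale procedure described in the paper.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Solve a finite MDP by value iteration.

    P[a, s, s2] is the probability of moving from s to s2 under action a.
    R[a, s] is the expected immediate reward for taking action a in s.
    Returns the value function and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)      # Q[a, s]: one-step lookahead
        V_new = Q.max(axis=0)        # Bellman optimality backup
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=0)


def toy_chain():
    """Hypothetical 3-state chain: action 0 stays, action 1 moves right.

    State 2 is absorbing; moving from state 1 into state 2 pays reward 1.
    """
    P = np.zeros((2, 3, 3))
    P[0] = np.eye(3)                 # action 0: stay in place
    P[1, 0, 1] = 1.0                 # action 1: step right
    P[1, 1, 2] = 1.0
    P[1, 2, 2] = 1.0
    R = np.zeros((2, 3))
    R[1, 1] = 1.0
    return P, R


if __name__ == "__main__":
    P, R = toy_chain()
    V, policy = value_iteration(P, R, gamma=0.9)
    print(V, policy)
```

Because every transition here has probability 0 or 1, the chain is a deterministic MDP in the sense the abstract uses for the coarsened problems.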
Diffusion, methods and applications
Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: June 2014.
Big Data, an important problem nowadays, can be understood in terms of a very large number of
patterns, a very large pattern dimension or, often, both. In this thesis, we will concentrate on the
high dimensionality issue, applying manifold learning techniques for visualizing and analyzing
such patterns.
The core technique will be Diffusion Maps (DM) and its Anisotropic Diffusion (AD) version,
introduced by Ronald R. Coifman and his school at Yale University, and of which we will give
a complete, systematic, compact and self-contained treatment. This will be done after a brief
survey of previous manifold learning methods.
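For orientation, the basic Diffusion Maps construction can be sketched as follows: build a Gaussian similarity matrix, normalize it into a Markov matrix, and use its leading nontrivial eigenvectors, scaled by their eigenvalues, as embedding coordinates. This is a minimal sketch and not the thesis's implementation; the function name, the `epsilon` bandwidth parameter, and the choice of no density normalization are assumptions made for illustration.

```python
import numpy as np


def diffusion_map(X, epsilon, n_components=2, t=1):
    """Basic diffusion map embedding of the rows of X.

    Assumes a Gaussian kernel exp(-||x - y||^2 / epsilon) and no density
    normalization. Returns the embedding and all eigenvalues (descending).
    """
    # Pairwise squared Euclidean distances.
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    K = np.exp(-D2 / epsilon)        # similarity matrix
    d = K.sum(axis=1)                # node degrees

    # The Markov matrix P = D^-1 K is similar to the symmetric matrix
    # A = D^-1/2 K D^-1/2, so a symmetric eigensolver can be used.
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # Right eigenvectors of P are recovered as D^-1/2 v.
    psi = vecs / np.sqrt(d)[:, None]

    # Drop the trivial pair (eigenvalue 1, constant eigenvector).
    keep = slice(1, n_components + 1)
    return psi[:, keep] * vals[keep] ** t, vals


if __name__ == "__main__":
    # Points on a straight segment embedded in 3D (illustrative data).
    s = np.linspace(0.0, 1.0, 30)[:, None]
    X = s * np.array([[1.0, 2.0, 2.0]]) / 3.0
    emb, vals = diffusion_map(X, epsilon=0.05)
    print(emb.shape, vals[:3])
```

The eigendecomposition of the full similarity matrix is the step whose cost grows quickly with the number of patterns, which is the first computational challenge the thesis addresses.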
The algorithmic contributions of the thesis will be centered on two computational challenges of
diffusion methods: the potentially high cost of the similarity matrix eigenanalysis that is needed
to define the diffusion embedding coordinates, and the difficulty of computing this embedding
over new patterns not available for the initial eigenanalysis. With respect to the first issue, we
will show how the AD setup can be used to skip it when looking for local models. In this case,
local patterns will be selected through a k-Nearest Neighbors search using a properly defined
local Mahalanobis distance, which enables neighbors to be found over the latent variable space
underlying the AD model while working directly with the observable patterns, thus
avoiding the potentially costly similarity matrix eigenanalysis.
The second proposed algorithm, which we will call Auto-adaptative Laplacian Pyramids (ALP),
focuses on the out-of-sample embedding extension and consists of a modification of the classical
Laplacian Pyramids (LP) method. In this new algorithm the LP iterations will be combined with
an estimate of the Leave One Out CV error, which makes it possible to define directly
during training a criterion to estimate the optimal stopping point of this iterative algorithm.
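The idea of stopping a Laplacian Pyramids iteration with a Leave One Out estimate can be sketched as follows. The abstract does not say how the estimate is computed; the sketch assumes one simple option, namely zeroing the kernel diagonal during training so that no point contributes to its own estimate, which makes the training error behave like a LOO error. The function names, the halving schedule for the kernel width, and the `sigma0` and `max_levels` parameters are likewise illustrative assumptions.

```python
import numpy as np


def _sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A and B."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)


def _row_normalize(K):
    """Turn kernel rows into averaging weights (guarding empty rows)."""
    return K / np.maximum(K.sum(axis=1, keepdims=True), 1e-300)


def alp_fit(X, y, sigma0, max_levels=15):
    """Laplacian-Pyramids-style fit with a LOO-based stopping level.

    Assumption: the kernel diagonal is zeroed at every level, so the
    per-level training error is read as a Leave One Out estimate.
    """
    D2 = _sq_dists(X, X)
    approx = np.zeros_like(y, dtype=float)
    residuals, errors = [], []
    for level in range(max_levels):
        sigma = sigma0 / 2.0 ** level      # finer kernel at each level
        K = np.exp(-D2 / sigma ** 2)
        np.fill_diagonal(K, 0.0)           # leave each point out
        d = y - approx                     # residual left to explain
        residuals.append(d)
        approx = approx + _row_normalize(K) @ d
        errors.append(float(np.mean((y - approx) ** 2)))
    n_levels = int(np.argmin(errors)) + 1  # stop at the minimum LOO error
    return {"X": X, "sigma0": sigma0,
            "residuals": residuals[:n_levels], "errors": errors}


def alp_predict(model, X_new):
    """Out-of-sample extension: sum the smoothed residuals of kept levels."""
    D2 = _sq_dists(X_new, model["X"])
    out = np.zeros(X_new.shape[0])
    for level, d in enumerate(model["residuals"]):
        sigma = model["sigma0"] / 2.0 ** level
        out += _row_normalize(np.exp(-D2 / sigma ** 2)) @ d
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.linspace(0.0, 2.0 * np.pi, 60)[:, None]
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)
    model = alp_fit(X, y, sigma0=4.0, max_levels=12)
    print(len(model["residuals"]), alp_predict(model, np.array([[1.0]])))
```

The appeal of this kind of criterion is that the stopping level is chosen within a single training pass, without a separate validation loop.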
This thesis will also present several application contributions to important problems in renewable
energy and medical imaging. More precisely, we will show how DM is a good method
for dimensionality reduction of meteorological weather predictions, providing tools to visualize
and describe these data, as well as to cluster them in order to define local models.
In turn, we will apply our AD-based localized search method first to find the location in the
human body of CT scan images and then to predict wind energy ramps on both individual farms
and over the whole of Spain. We will see that, in both cases, our results improve on the current
state of the art methods.
Finally, we will compare our ALP proposal with the well-known Nyström method as well as
with LP on two high-dimensional problems, the time compression of meteorological data and
the analysis of meteorological variables relevant in daily radiation forecasts. In both cases we
will show that ALP compares favorably with the other approaches for out-of-sample extension
problems.
Manifold Alignment Aware Ants: a Markovian process for manifold extraction
Dimensionality reduction and clustering are often used as preliminary steps for many complex machine learning tasks. The presence of noise and outliers can deteriorate the performance of such preprocessing and therefore impair the subsequent analysis tremendously. In manifold learning, several studies indicate solutions for removing background noise or noise close to the structure when the density is substantially higher than that exhibited by the noise. However, in many applications, including astronomical datasets, the density varies alongside manifolds that are buried in a noisy background. We propose a novel method to extract manifolds in the presence of noise based on the idea of Ant colony optimization. In contrast to the existing random walk solutions, our technique captures points which are locally aligned with major directions of the manifold. Moreover, we empirically show that the biologically inspired formulation of ant pheromone reinforces this behavior, enabling it to recover multiple manifolds embedded in extremely noisy data clouds. The algorithm's performance is demonstrated in comparison to the state-of-the-art approaches, such as Markov Chain, LLPD, and Disperse, on several synthetic and real astronomical datasets stemming from an N-body simulation of a cosmological volume.
Statistical shape analysis for bio-structures: local shape modelling, techniques and applications
A Statistical Shape Model (SSM) is a statistical representation of a shape obtained
from data to study variation in shapes. Work on shape modelling is constrained by
many unsolved problems, for instance, difficulties in modelling local versus global
variation. SSMs have been successfully applied in medical image applications such
as the analysis of brain anatomy. Since brain structure is so complex and varies
across subjects, methods to identify morphological variability can be useful for
diagnosis and treatment.
The main objective of this research is to generate and develop a statistical shape
model to analyse local variation in shapes. Within this particular context, this
work addresses the question of what are the local elements that need to be identified for effective shape analysis. Here, the proposed method is based on a Point
Distribution Model and uses a combination of other well known techniques: Fractal
analysis; Markov Chain Monte Carlo methods; and the Curvature Scale Space
representation for the problem of contour localisation. Similarly, Diffusion Maps
are employed as a spectral shape clustering tool to identify sets of local partitions
useful in the shape analysis. Additionally, a novel Hierarchical Shape Analysis
method based on the Gaussian and Laplacian pyramids is explained and used to
compare the featured Local Shape Model.
Experimental results on a number of real contours such as animal, leaf and brain
white matter outlines are presented to demonstrate the effectiveness of the
proposed model. These results show that local shape models are efficient in modelling
the statistical variation of shape of biological structures. Particularly, the
development of this model provides an approach to the analysis of brain images
and brain morphometrics. Likewise, the model can be adapted to the problem of
content based image retrieval, where global and local shape similarity needs to be
measured.
A Computational Framework for Learning from Complex Data: Formulations, Algorithms, and Applications
Many real-world processes change dynamically over time. As a consequence, the complex data generated by these processes also evolve smoothly. For example, in computational biology, expression data matrices evolve, since gene expression controls are deployed sequentially during development in many biological processes. Investigations into spatial and temporal gene expression dynamics are essential for understanding the regulatory biology governing development. In this dissertation, I mainly focus on two types of complex data: genome-wide spatial gene expression patterns in the model organism fruit fly and Allen Brain Atlas mouse brain data. I provide a framework to explore spatiotemporal regulation of gene expression during development. I develop an evolutionary co-clustering formulation to identify co-expressed domains and the associated genes simultaneously over different temporal stages using a mesh-generation pipeline. I also propose to employ deep convolutional neural networks as a multi-layer feature extractor to generate generic representations for gene expression pattern in situ hybridization (ISH) images. Furthermore, I employ a multi-task learning method to fine-tune the pre-trained models with labeled ISH images. My proposed computational methods are evaluated using synthetic data sets and real biological data sets, including gene expression data from the fruit fly BDGP data sets and the Allen Developing Mouse Brain Atlas, in comparison with existing baseline methods. Experimental results indicate that the proposed representations, formulations, and methods are efficient and effective in annotating and analyzing large-scale biological data sets.