Modeling Relational Data via Latent Factor Blockmodel
In this paper we address the problem of modeling relational data, which
appear in many applications such as social network analysis, recommender
systems, and bioinformatics. Previous studies either adopt latent-feature-based
models while disregarding local structure in the network, or focus exclusively
on capturing the local structure of objects via latent blockmodels without
coupling it with the latent characteristics of objects. To combine the
benefits of the previous work, we propose a novel model that can simultaneously
incorporate the effect of latent features and covariates if any, as well as the
effect of latent structure that may exist in the data. To achieve this, we
model the relation graph as a function of both latent feature factors and
latent cluster memberships of objects to collectively discover globally
predictive intrinsic properties of objects and capture latent block structure
in the network to improve prediction performance. We also develop an
optimization transfer algorithm based on the generalized EM-style strategy to
learn the latent factors. We demonstrate the efficacy of the proposed model on
link prediction and cluster analysis tasks; extensive experiments on synthetic
data and several real-world datasets suggest that the proposed LFBM model
outperforms other state-of-the-art approaches on the evaluated tasks.

Comment: 10 pages, 12 figures
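The abstract describes modeling the relation graph as a function of both latent feature factors and latent cluster memberships. As a rough illustrative sketch (not the authors' exact parameterization; all variable names and dimensions here are hypothetical), the probability of a link between objects i and j could combine a feature affinity term with a cluster-level block effect through a logistic link:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 6, 3, 2                        # objects, latent-feature dims, clusters
U = rng.normal(size=(n, d))              # latent feature factors, one row per object
Z = np.eye(k)[rng.integers(k, size=n)]   # one-hot latent cluster memberships
B = rng.normal(size=(k, k))              # block-interaction matrix between clusters

def link_probability(i, j):
    """P(edge i -> j): feature affinity plus block effect, logistic link."""
    score = U[i] @ U[j] + Z[i] @ B @ Z[j]
    return 1.0 / (1.0 + np.exp(-score))

P = np.array([[link_probability(i, j) for j in range(n)] for i in range(n)])
```

Learning in such a model would alternate between updating the continuous factors U and the discrete memberships Z, which is where an EM-style optimization-transfer scheme like the one the abstract mentions would come in.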
Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce
The Web abounds with dyadic data that keeps growing by the second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool for extracting the underlying structure is matrix factorization, whose fame was further boosted by the Netflix challenge [26]. When we tried to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. In this paper we therefore report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a matrix with tens of millions of rows, hundreds of millions of columns, and billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.
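The paper's contribution is the MapReduce partitioning and data-locality scheme; the serial computation being distributed is standard NMF. A minimal sketch of the classic Lee–Seung multiplicative updates (a small dense toy problem here, whereas the paper targets huge sparse matrices) shows the per-iteration matrix products that such a system must parallelize:

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((20, 15))        # nonnegative dyadic data matrix (toy scale)
r = 4                           # factorization rank

W = rng.random((20, r)) + 1e-3  # nonnegative init
H = rng.random((r, 15)) + 1e-3
eps = 1e-9                      # guard against division by zero

for _ in range(200):
    # Lee-Seung multiplicative updates for min ||V - WH||_F^2 s.t. W, H >= 0;
    # each update keeps factors nonnegative and never increases the loss.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The products `W.T @ V` and `V @ H.T` are the expensive terms at Web scale, and partitioning them over the nonzero cells of V is exactly what determines data locality in a MapReduce implementation.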
Probabilistic Structured Models for Plant Trait Analysis
University of Minnesota Ph.D. dissertation. March 2017. Major: Communication Sciences and Disorders. Advisor: Arindam Banerjee. 1 computer file (PDF); xii, 171 pages.

Many fields in modern science and engineering, such as ecology, computational biology, astronomy, signal processing, climate science, brain imaging, and natural language processing, involve collecting data sets in which the dimensionality of the data p exceeds the sample size n. Since it is usually impossible to obtain consistent procedures unless p < n, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse structured graphical models, low-rank matrices, and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with a regularization function that encourages the assumed structure. Of particular interest is structure learning of graphical models in the high-dimensional setting. Most statistical analyses of graphical model estimation assume that the data are fully observed and that the data points are sampled from the same distribution, and they provide sample complexity and convergence rates by considering a single graphical structure for all observations. In this thesis, we extend these results to estimate the structure of graphical models when the data are partially observed or sampled from multiple distributions. First, we consider the problem of estimating the change in the dependency structure between two p-dimensional models, based on samples drawn from two graphical models. The change is assumed to be structured, e.g., sparse, block sparse, or node-perturbed sparse, such that it can be characterized by a suitable (atomic) norm.
We present and analyze a norm-regularized estimator for directly estimating the change in structure, without having to estimate the structures of the individual graphical models. Next, we consider the problem of estimating the sparse structure of Gaussian copula distributions (corresponding to nonparanormal distributions) using samples with missing values. We prove that our proposed estimators consistently estimate the nonparanormal correlation matrix, where the convergence rate depends on the probability of missing values. In the second part of the thesis, we consider the matrix completion problem. Low-rank matrix completion methods have been successful in a variety of settings such as recommendation systems. However, most existing matrix completion methods only provide a point estimate of the missing entries and do not characterize the uncertainty of the predictions. First, we show that the posterior distribution in latent factor models, such as probabilistic matrix factorization, when marginalized over one latent factor, is the Matrix Generalized Inverse Gaussian (MGIG) distribution. We show that the MGIG is unimodal and that the mode can be obtained by solving an Algebraic Riccati Equation. This characterization leads to a novel Collapsed Monte Carlo inference algorithm for such latent factor models. Next, we propose a Bayesian Hierarchical Probabilistic Matrix Factorization (BHPMF) model to 1) incorporate hierarchical side information and 2) provide uncertainty-quantified predictions. The former yields significant performance improvements in plant trait prediction, a key problem in ecology, by leveraging the taxonomic hierarchy of the plant kingdom. The latter helps identify low-confidence predictions, which can in turn be used to guide field work for data collection efforts. Finally, we consider applications of probabilistic structured models to plant trait analysis. We apply the BHPMF model to fill the gaps in the TRY database.
The BHPMF model is the state-of-the-art model for plant trait prediction and is gaining increasing visibility and usage in plant trait analysis. We have submitted an R package for BHPMF to CRAN. Next, we apply the Gaussian graphical model structure estimators to obtain trait-trait interactions. We study the trait-trait interaction structure across climate zones and plant growth forms, and uncover the dependence of traits on climate and vegetation.
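The abstract builds on probabilistic matrix factorization for filling gaps in trait matrices. As a minimal sketch of the point-estimate core of such models (plain alternating ridge regression on a synthetic low-rank matrix, not the thesis's Bayesian hierarchical BHPMF; all dimensions and the regularizer are hypothetical), the MAP factors under Gaussian priors can be computed as:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic low-rank "trait" matrix with a mask M of observed entries.
n, m, r = 30, 20, 3
truth = rng.normal(size=(n, r)) @ rng.normal(size=(r, m))
M = rng.random((n, m)) < 0.6          # ~60% of entries observed
X = np.where(M, truth, 0.0)

lam = 0.1                              # ridge regularizer (Gaussian prior)
U = rng.normal(size=(n, r))
V = rng.normal(size=(m, r))

for _ in range(30):
    # Alternately solve a regularized least-squares problem for each row
    # factor, using only that row's (or column's) observed entries --
    # the MAP point estimate in probabilistic matrix factorization.
    for i in range(n):
        Vi = V[M[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(r), Vi.T @ X[i, M[i]])
    for j in range(m):
        Uj = U[M[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(r), Uj.T @ X[M[:, j], j])

rmse_missing = np.sqrt(np.mean((U @ V.T - truth)[~M] ** 2))
```

The thesis goes beyond this point estimate: it characterizes the marginalized posterior (the MGIG result) and adds the taxonomic hierarchy and uncertainty quantification that plain factorization lacks.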
Probabilistic Models of Student Learning and Forgetting
This thesis uses statistical machine learning techniques to construct predictive models of human learning and to improve human learning by discovering optimal teaching methodologies. In Chapters 2 and 3, I present and evaluate models for predicting the changing memory strength of material being studied over time. The models combine a psychological theory of memory with Bayesian methods for inferring individual differences. In Chapter 4, I develop methods for delivering efficient, systematic, personalized review using the statistical models. Results are presented from three large semester-long experiments with middle school students which demonstrate how this "big data" approach to education yields substantial gains in the long-term retention of course material. In Chapter 5, I focus on optimizing various aspects of instruction for populations of students. This involves a novel experimental paradigm which combines Bayesian nonparametric modeling techniques and probabilistic generative models of student performance. In Chapters 6 and 7, I present supporting laboratory behavioral studies and theoretical analyses. These include an examination of the relationship between study format and the testing effect, and a parsimonious theoretical account of long-term recency effects.
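The combination the abstract describes, a psychological forgetting model plus Bayesian inference of individual differences, can be sketched in miniature (this is a generic exponential-forgetting toy with a hypothetical prior and made-up data, not the thesis's actual model):

```python
import numpy as np

def recall_probability(t, decay):
    """Exponential forgetting: P(recall at lag t days) = exp(-decay * t)."""
    return np.exp(-decay * t)

# Hypothetical study lags (days) and binary recall outcomes for one student.
lags = np.array([1.0, 3.0, 7.0, 14.0])
observed = np.array([1, 1, 0, 0])

# Grid-search MAP estimate of this student's decay rate under an Exp(1) prior,
# which encodes a weak preference for slow forgetting.
decays = np.linspace(0.01, 1.0, 200)

def log_posterior(d):
    p = np.clip(recall_probability(lags, d), 1e-6, 1 - 1e-6)
    loglik = np.sum(observed * np.log(p) + (1 - observed) * np.log(1 - p))
    return loglik - d            # log Exp(1) prior, up to a constant

best_decay = decays[int(np.argmax([log_posterior(d) for d in decays]))]
```

A personalized review scheduler would then use each student's inferred decay rate to decide when an item's predicted recall probability has dropped enough to warrant review.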
A Recommendation System for Preconditioned Iterative Solvers
Solving linear systems of equations is an integral part of most scientific simulations. In
recent years, there has been a considerable interest in large scale scientific simulation of
complex physical processes. Iterative solvers are usually preferred for solving linear systems
of such magnitude due to their lower computational requirements. Currently, computational
scientists have access to a multitude of iterative solver options available as
"plug-and-play" components in various problem-solving environments. Choosing the right solver
configuration from the available choices is critical for ensuring convergence and achieving
good performance, especially for large complex matrices. However, identifying the
"best" preconditioned iterative solver and parameters is challenging even for an expert due
to issues such as the lack of a unified theoretical model, complexity of the solver configuration
space, and multiple selection criteria. Therefore, it is desirable to have principled
practitioner-centric strategies for identifying solver configuration(s) for solving large linear
systems.
The current dissertation presents a general practitioner-centric framework for (a) problem
independent retrospective analysis, and (b) problem-specific predictive modeling of
performance data. Our retrospective performance analysis methodology introduces new
metrics such as area under performance-profile curve and conditional variance-based finetuning
score that facilitate a robust comparative performance evaluation as well as parameter
sensitivity analysis. We present results using this analysis approach on a number of popular
preconditioned iterative solvers available in packages such as PETSc, Trilinos, Hypre, ILUPACK, and WSMP. The predictive modeling of performance data is an integral part
of our multi-stage approach for solver recommendation. The key novelty of our approach
lies in our modular learning-based formulation that comprises three subproblems: (a)
solvability modeling, (b) performance modeling, and (c) performance optimization, which
provides the flexibility to effectively target challenges such as software failure and multiobjective
optimization. Our choice of a "solver trial" instance space represented in terms
of the characteristics of the corresponding "linear system", "solver configuration" and their
interactions, leads to a scalable and elegant formulation. Empirical evaluation of our approach
on performance datasets associated with fairly large groups of solver configurations
demonstrates that one can obtain high quality recommendations that are close to the ideal
choices.
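The area-under-performance-profile-curve metric mentioned above builds on Dolan–Moré performance profiles, where each solve time is normalized by the best time for that problem. A minimal sketch (solve times and the τ range are made up; failed solves are marked with infinity) of computing the profiles and their areas:

```python
import numpy as np

# Hypothetical solve times (seconds): rows = linear systems,
# columns = solver configurations; np.inf marks a failed solve.
times = np.array([
    [1.0, 2.0, np.inf],
    [3.0, 1.5, 4.0],
    [2.0, 2.0, 2.5],
    [5.0, 4.0, 3.0],
])

best = times.min(axis=1, keepdims=True)
ratios = times / best                  # Dolan-More performance ratios

taus = np.linspace(1.0, 4.0, 300)

def profile(s):
    """rho_s(tau): fraction of problems solver s solves within tau x best time."""
    return np.array([(ratios[:, s] <= t).mean() for t in taus])

# Approximate area under each profile over [1, 4]:
# a larger area means the configuration is more efficient and robust overall.
auc = [profile(s).mean() * (taus[-1] - taus[0]) for s in range(times.shape[1])]
```

Summarizing a whole profile by its area gives a single robust score per solver configuration, which is what makes the metric usable inside a recommendation pipeline; a failure (infinite ratio) simply caps that solver's profile below 1.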