7 research outputs found

    Subjectively interesting connecting trees

    Get PDF

    Modeling Relational Data via Latent Factor Blockmodel

    Full text link
    In this paper we address the problem of modeling relational data, which appear in many applications such as social network analysis, recommender systems and bioinformatics. Previous studies either consider latent feature based models but disregarding local structure in the network, or focus exclusively on capturing local structure of objects based on latent blockmodels without coupling with latent characteristics of objects. To combine the benefits of the previous work, we propose a novel model that can simultaneously incorporate the effect of latent features and covariates if any, as well as the effect of latent structure that may exist in the data. To achieve this, we model the relation graph as a function of both latent feature factors and latent cluster memberships of objects to collectively discover globally predictive intrinsic properties of objects and capture latent block structure in the network to improve prediction performance. We also develop an optimization transfer algorithm based on the generalized EM-style strategy to learn the latent factors. We prove the efficacy of our proposed model through the link prediction task and cluster analysis task, and extensive experiments on the synthetic data and several real world datasets suggest that our proposed LFBM model outperforms the other state of the art approaches in the evaluated tasks.Comment: 10 pages, 12 figure

    Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

    Full text link
    The Web abounds with dyadic data that keeps increasing by every single second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in extracting the underlying structure is the matrix factorization, whose fame was further boosted in the Netflix challenge [26]. When we were trying to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scal-ability of available tools. We therefore in this paper report our efforts on scaling up the nonnegative matrix factoriza-tion (NMF) technique. We show that by carefully partition-ing the data and arranging the computations to maximize data locality and parallelism, factorizing a tens of millions by hundreds of millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result ef-fectively assures practitioners of the scalability of NMF on Web-scale dyadic data

    Mining and modeling graphs using patterns and priors

    No full text

    Probabilistic Structured Models for Plant Trait Analysis

    Get PDF
    University of Minnesota Ph.D. dissertation. March 2017. Major: Communication Sciences and Disorders. Advisor: Arindam Banerjee. 1 computer file (PDF); xii, 171 pages.Many fields in modern science and engineering such as ecology, computational biology, astronomy, signal processing, climate science, brain imaging, natural language processing, and many more involve collecting data sets in which the dimensionality of the data p exceeds the sample size n. Since it is usually impossible to obtain consistent procedures unless p < n, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse structured graphical models, low-rank matrices, and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. Of particular interest are structure learning of graphical models in high dimensional setting. The majority of statistical analysis of graphical model estimations assume that all the data are fully observed and the data points are sampled from the same distribution and provide the sample complexity and convergence rate by considering only one graphical structure for all the observations. In this thesis, we extend the above results to estimate the structure of graphical models where the data is partially observed or the data is sampled from multiple distributions. First, we consider the problem of estimating change in the dependency structure of two p-dimensional models, based on samples drawn from two graphical models. The change is assumed to be structured, e.g., sparse, block sparse, node-perturbed sparse, etc., such that it can be characterized by a suitable (atomic) norm. We present and analyze a norm-regularized estimator for directly estimating the change in structure, without having to estimate the structures of the individual graphical models. Next, we consider the problem of estimating sparse structure of Gaussian copula distributions (corresponding to non-paranormal distributions) using samples with missing values. We prove that our proposed estimators consistently estimate the non-paranormal correlation matrix where the convergence rate depends on the probability of missing values. In the second part of thesis, we consider matrix completion problem. Low-rank matrix completion methods have been successful in a variety of settings such as recommendation systems. However, most of the existing matrix completion methods only provide a point estimate of missing entries, and do not characterize uncertainties of the predictions. First, we illustrate that the the posterior distribution in latent factor models, such as probabilistic matrix factorization, when marginalized over one latent factor has the Matrix Generalized Inverse Gaussian (MGIG) distribution. We show that the MGIG is unimodal, and the mode can be obtained by solving an Algebraic Riccati Equation equation. The characterization leads to a novel Collapsed Monte Carlo inference algorithm for such latent factor models. Next, we propose a Bayesian hierarchical probabilistic matrix factorization (BHPMF) model to 1) incorporate hierarchical side information, and 2) provide uncertainty quantified predictions. The former yields significant performance improvements in the problem of plant trait prediction, a key problem in ecology, by leveraging the taxonomic hierarchy in the plant kingdom. The latter is helpful in identifying predictions of low confidence which can in turn be used to guide field work for data collection efforts. Finally, we consider applications of probabilistic structured models to plant trait analysis. We apply BHPMF model to fill the gaps in TRY database. The BHPMF model is the-state-of-the-art model for plant trait prediction and is getting increasing visibility and usage in the plant trait analysis. We have submitted a R package for BHPMF to CRAN. Next, we apply the Gaussian graphical model structure estimators to obtain the trait-trait interactions. We study the trait-trait interactions structure at different climate zones and among different plant growth forms and uncover the dependence of traits on climate and on vegetation

    A Recommendation System for Preconditioned Iterative Solvers

    Get PDF
    Solving linear systems of equations is an integral part of most scientific simulations. In recent years, there has been a considerable interest in large scale scientific simulation of complex physical processes. Iterative solvers are usually preferred for solving linear systems of such magnitude due to their lower computational requirements. Currently, computational scientists have access to a multitude of iterative solver options available as "plug-and- play" components in various problem solving environments. Choosing the right solver configuration from the available choices is critical for ensuring convergence and achieving good performance, especially for large complex matrices. However, identifying the "best" preconditioned iterative solver and parameters is challenging even for an expert due to issues such as the lack of a unified theoretical model, complexity of the solver configuration space, and multiple selection criteria. Therefore, it is desirable to have principled practitioner-centric strategies for identifying solver configuration(s) for solving large linear systems. The current dissertation presents a general practitioner-centric framework for (a) problem independent retrospective analysis, and (b) problem-specific predictive modeling of performance data. Our retrospective performance analysis methodology introduces new metrics such as area under performance-profile curve and conditional variance-based finetuning score that facilitate a robust comparative performance evaluation as well as parameter sensitivity analysis. We present results using this analysis approach on a number of popular preconditioned iterative solvers available in packages such as PETSc, Trilinos, Hypre, ILUPACK, and WSMP. The predictive modeling of performance data is an integral part of our multi-stage approach for solver recommendation. The key novelty of our approach lies in our modular learning based formulation that comprises of three sub problems: (a) solvability modeling, (b) performance modeling, and (c) performance optimization, which provides the flexibility to effectively target challenges such as software failure and multiobjective optimization. Our choice of a "solver trial" instance space represented in terms of the characteristics of the corresponding "linear system", "solver configuration" and their interactions, leads to a scalable and elegant formulation. Empirical evaluation of our approach on performance datasets associated with fairly large groups of solver configurations demonstrates that one can obtain high quality recommendations that are close to the ideal choices
    corecore