1,540 research outputs found

    Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

    Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provably optimal recovery by the algorithm is shown analytically for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed. Comment: 13 figures, 35 references
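    As a rough illustration of the idea of combining a Laplacian eigenspace with finite mixture modeling, the sketch below embeds graph vertices using the bottom eigenvectors of the normalized Laplacian and fits a Gaussian mixture there to obtain fuzzy memberships. The Gaussian mixture choice, variable names, and toy graph are illustrative assumptions rather than the paper's exact algorithm.

```python
# Sketch: fuzzy graph partitioning via a mixture model fit in the Laplacian
# eigenspace (assumed setup, not the paper's exact method).
import numpy as np
from scipy.linalg import eigh
from sklearn.mixture import GaussianMixture

def laplacian_mixture_memberships(A, k):
    """Return an (n, k) matrix of soft cluster memberships for adjacency A."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    # Eigenvectors of the k smallest eigenvalues span the low-frequency subspace.
    _, U = eigh(L, subset_by_index=[0, k - 1])
    gmm = GaussianMixture(n_components=k, covariance_type="full").fit(U)
    return gmm.predict_proba(U)   # fuzzy / probabilistic memberships

# Example: two triangles joined by a single bridge edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
print(laplacian_mixture_memberships(A, 2).round(2))
```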

    Factor Analysis of Data Matrices: New Theoretical and Computational Aspects With Applications

    The classical fitting problem in exploratory factor analysis (EFA) is to find estimates for the factor loadings matrix and the matrix of unique factor variances which give the best fit to the sample covariance or correlation matrix with respect to some goodness-of-fit criterion. Predicted factor scores can be obtained as a function of these estimates and the data. In this thesis, the EFA model is considered as a specific data matrix decomposition with fixed unknown matrix parameters. Fitting the EFA model directly to the data yields simultaneous solutions for both loadings and factor scores. Several new algorithms are introduced for the least squares and weighted least squares estimation of all EFA model unknowns. The numerical procedures are based on the singular value decomposition, facilitate the estimation of both common and unique factor scores, and work equally well when the number of variables exceeds the number of available observations. Like EFA, noisy independent component analysis (ICA) is a technique for reduction of the data dimensionality in which the interrelationships among the observed variables are explained in terms of a much smaller number of latent factors. The key difference between EFA and noisy ICA is that in the latter model the common factors are assumed to be both independent and non-normal. In contrast to EFA, there is no rotational indeterminacy in noisy ICA. In this thesis, noisy ICA is viewed as a method of factor rotation in EFA. Starting from an initial EFA solution, an orthogonal rotation matrix is sought that minimizes the dependence between the common factors. The idea of rotating the scores towards independence is also employed in three-mode factor analysis to analyze data sets having a three-way structure. The new theoretical and computational aspects contained in this thesis are illustrated by means of several examples with real and artificial data.
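    A minimal sketch of fitting a factor model directly to the data matrix via the singular value decomposition, recovering loadings and common factor scores simultaneously. The column standardization, the scaling constraint on the scores, and the simulated data are assumptions for illustration, not the thesis's weighted least squares or unique-factor procedures.

```python
# Sketch: direct least-squares fit of Z ~ F @ L.T using the SVD (assumed setup).
import numpy as np

def efa_least_squares(X, k):
    """Least-squares loadings and common factor scores for k factors."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    F = np.sqrt(n) * U[:, :k]                # factor scores with F.T @ F = n * I
    L = Vt[:k].T * (s[:k] / np.sqrt(n))      # factor loadings
    return F, L

# Toy data: 2 latent factors driving 6 observed variables plus noise.
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 2))
X = scores @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(200, 6))
F, L = efa_least_squares(X, 2)
print(np.allclose(F.T @ F / 200, np.eye(2)), L.shape)
```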

    Distinguishing cause from effect using observational data: methods and benchmarks

    The discovery of causal relationships from purely observational data is a fundamental problem in science. The most elementary form of such a causal discovery problem is to decide whether X causes Y or, alternatively, Y causes X, given joint observations of two variables X, Y. An example is to decide whether altitude causes temperature, or vice versa, given only joint measurements of both variables. Even under the simplifying assumptions of no confounding, no feedback loops, and no selection bias, such bivariate causal discovery problems are challenging. Nevertheless, several approaches for addressing those problems have been proposed in recent years. We review two families of such methods: Additive Noise Methods (ANM) and Information Geometric Causal Inference (IGCI). We present the benchmark CauseEffectPairs that consists of data for 100 different cause-effect pairs selected from 37 datasets from various domains (e.g., meteorology, biology, medicine, engineering, economics) and motivate our decisions regarding the "ground truth" causal directions of all pairs. We evaluate the performance of several bivariate causal discovery methods on these real-world benchmark data and additionally on artificially simulated data. Our empirical results on real-world data indicate that certain methods are indeed able to distinguish cause from effect using only purely observational data, although more benchmark data would be needed to obtain statistically significant conclusions. One of the best performing methods overall is the additive-noise method originally proposed by Hoyer et al. (2009), which obtains an accuracy of 63 ± 10% and an AUC of 0.74 ± 0.05 on the real-world benchmark. As the main theoretical contribution of this work we prove the consistency of that method. Comment: 101 pages, second revision submitted to Journal of Machine Learning Research
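    The additive-noise idea can be sketched as follows: regress each variable on the other and prefer the direction whose residuals look more independent of the putative cause. The kNN regressor, Gaussian-kernel HSIC dependence score, and simulated pair below are illustrative choices, not the exact estimators benchmarked in the paper.

```python
# Sketch of the additive-noise principle for bivariate causal discovery
# (assumed regressor and dependence measure).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def _gram(x):
    d2 = (x[:, None] - x[None, :]) ** 2
    sigma2 = np.median(d2[d2 > 0]) + 1e-12          # median heuristic bandwidth
    return np.exp(-d2 / sigma2)

def hsic(x, y):
    """Biased HSIC estimate with Gaussian kernels; larger means more dependent."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(_gram(x) @ H @ _gram(y) @ H) / (n - 1) ** 2

def anm_direction(x, y):
    """Return 'X->Y' or 'Y->X' by comparing residual dependence in each direction."""
    def residual_dependence(cause, effect):
        fit = KNeighborsRegressor(n_neighbors=10).fit(cause[:, None], effect)
        return hsic(cause, effect - fit.predict(cause[:, None]))
    return "X->Y" if residual_dependence(x, y) < residual_dependence(y, x) else "Y->X"

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 400)
y = x ** 3 + rng.normal(0, 1, 400)                  # ground truth: X causes Y
print(anm_direction(x, y))
```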

    Optimizing Data Selection for Contact Prediction in Proteins

    Proteins are essential to life across all organisms. They act as enzymes, antibodies, transporters of molecules, and structural elements, among other important roles. Their ability to interact with specific molecules in a selective manner is what makes them important. Understanding their interactions can provide many advantages in fields such as drug design and metabolic engineering. Current methods of predicting protein interaction attempt to geometrically fit the structures of two proteins together by generating a large number of potential configurations and then discriminating the correct pose from the remaining ones. Given the large search space, approaches to reduce the complexity are often employed. Identifying a contact point between the pairing proteins is a good constraining factor. If at least one contact can be predicted among a small set of possibilities (e.g. 100), the search space will be significantly reduced. Using structural and evolutionary information of the interacting proteins, a machine learning predictor can be developed for this task. Such evolutionary measures are computed over a substantial number of homologous sequences, which can be filtered and ordered in many different ways. As a result, a machine learning solution was developed that focused on measuring the effects that differing homolog arrangements can have on the final prediction.
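    One widely used evolutionary signal of the kind described above is column covariation in a multiple sequence alignment of homologs. The sketch below scores candidate contacts by plain mutual information between alignment columns; the miniature alignment and the absence of corrections (e.g., APC) or structural features are simplifying assumptions for illustration.

```python
# Sketch: mutual information between MSA columns as a coevolution feature
# (toy alignment, no corrections; illustrative only).
import numpy as np
from collections import Counter

def column_mutual_information(msa, i, j):
    """MI between alignment columns i and j (msa: list of equal-length strings)."""
    col_i = [seq[i] for seq in msa]
    col_j = [seq[j] for seq in msa]
    n = len(msa)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in p_ij.items():
        mi += (c / n) * np.log(c * n / (p_i[a] * p_j[b]))
    return mi

msa = ["ACDEK", "ACDEK", "AGDQK", "AGDQK", "ACDEK", "AGNQK"]
# Columns 1 and 3 co-vary (C<->E, G<->Q); column 0 is fully conserved.
print(round(column_mutual_information(msa, 1, 3), 3),
      round(column_mutual_information(msa, 0, 3), 3))
```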

    Applications of information theory in filtering and sensor management

    “A classical sensor tasking methodology is analyzed in the context of generating sensor schedules for monitoring resident space objects (RSOs). This approach, namely maximizing the expected Kullback-Leibler divergence in a measurement update, is evaluated from a probabilistic perspective to determine the accuracy of the conventional approach. In this investigation, a new divergence-based approach is proposed to circumvent the myopic nature of the measure, forecasting the potential information contribution to a time of interest and leveraging the system dynamics and measurement model to do so. The forecasted objective exploits properties of a batch measurement update to frequently exhibit faster optimization times when compared to an accumulation of the conventional myopic employment. The forecasting approach additionally affords the ability to emphasize tracking performance at the point in time to which the information is mapped. The forecasted divergence is lifted into the multitarget domain and combined with a collision entropy objective. The addition of the collision consideration assists the tasking policy in avoiding scenarios in which determining the origin of a measurement is difficult, ameliorating issues when executing the sensor schedule. The properties of the divergence-based and collision entropy-based objectives are explored to determine appropriate optimization schemes that can enable their use in real-time application. It is demonstrated through a single-target tasking simulation that the forecasted measure successfully outperforms traditional approaches with regard to tracking performance at the forecasted time. This simulation is followed by a multitarget tasking scenario in which different optimization strategies are analyzed, illustrating the feasibility of the proposed tasking policy and evaluating the solution from both schedule quality and runtime perspectives”--Abstract, page iii
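    A minimal sketch of the underlying information measure: the Kullback-Leibler divergence between a Kalman posterior and its prior for a candidate measurement, which could be used to rank sensor tasks. The linear measurement model, noise values, and single-measurement setting are assumptions for illustration and do not reproduce the thesis's forecasted, multitarget formulation.

```python
# Sketch: KL divergence of a Kalman measurement update as an information-gain
# score for sensor tasking (assumed linear-Gaussian toy example).
import numpy as np

def gaussian_kl(mu0, P0, mu1, P1):
    """KL( N(mu0, P0) || N(mu1, P1) ) for multivariate Gaussians."""
    k = len(mu0)
    P1_inv = np.linalg.inv(P1)
    d = mu1 - mu0
    return 0.5 * (np.trace(P1_inv @ P0) + d @ P1_inv @ d - k
                  + np.log(np.linalg.det(P1) / np.linalg.det(P0)))

def kl_information_gain(x, P, H, R, z):
    """Kalman measurement update, returning KL(posterior || prior)."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x_post = x + K @ (z - H @ x)
    P_post = (np.eye(len(x)) - K @ H) @ P
    return gaussian_kl(x_post, P_post, x, P)

# Toy single-target example: position-velocity state, position-only sensor.
x = np.array([0.0, 1.0])
P = np.diag([4.0, 1.0])
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
z = np.array([0.8])
print(kl_information_gain(x, P, H, R, z))
```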