
    Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection

    We study the problem of selecting a subset of k random variables from a large set, in order to obtain the best linear prediction of another variable of interest. This problem can be viewed in the context of both feature selection and sparse approximation. We analyze the performance of widely used greedy heuristics, using insights from the maximization of submodular functions and spectral analysis. We introduce the submodularity ratio as a key quantity to help understand why greedy algorithms perform well even when the variables are highly correlated. Using our techniques, we obtain the strongest known approximation guarantees for this problem, both in terms of the submodularity ratio and the smallest k-sparse eigenvalue of the covariance matrix. We further demonstrate the wide applicability of our techniques by analyzing greedy algorithms for the dictionary selection problem, and significantly improve the previously known guarantees. Our theoretical analysis is complemented by experiments on real-world and synthetic data sets; the experiments show that the submodularity ratio is a stronger predictor of the performance of greedy algorithms than other spectral parameters.
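
    As a concrete illustration of the greedy heuristic analyzed in this paper, the following is a minimal sketch of forward regression in Python: at each step it adds the variable that most improves the linear fit. The data are assumed centered, and the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np

def greedy_forward_selection(X, y, k):
    """Greedily pick k columns of X for linear prediction of y.

    Assumes X and y are centered, so that R^2 = 1 - RSS / (y @ y).
    """
    n, d = X.shape
    selected, remaining = [], set(range(d))
    for _ in range(k):
        best_j, best_r2 = None, -np.inf
        for j in remaining:
            cols = selected + [j]
            # Least-squares fit on the candidate subset.
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ beta
            r2 = 1.0 - (resid @ resid) / (y @ y)
            if r2 > best_r2:
                best_j, best_r2 = j, r2
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

    The paper's submodularity-ratio guarantee bounds how far the R^2 reached by such a greedy loop can fall below that of the best k-subset.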

    Cross-validation of stagewise mixed-model analysis of Swedish variety trials with winter wheat and spring barley

    In cultivar testing, linear mixed models have been used routinely to analyze multienvironment trials. A single‐stage analysis is considered the gold standard, whereas two‐stage analysis produces similar results when a fully efficient weighting method is used, namely when the full variance–covariance matrix of the estimated means from Stage 1 is forwarded to Stage 2. However, in practice, this may be hard to do, and a diagonal approximation is often used. We conducted a cross‐validation with data from Swedish cultivar trials on winter wheat (Triticum aestivum L.) and spring barley (Hordeum vulgare L.) to assess the performance of single‐stage and two‐stage analyses. The fully efficient method and two diagonal approximation methods were used for weighting in the two‐stage analyses. In Sweden, cultivar recommendation is delineated by zones (regions), not individual locations. We demonstrate the use of best linear unbiased prediction (BLUP) for cultivar effects per zone, which exploits correlations between zones and thus allows information to be borrowed across zones. Complex variance–covariance structures were applied to allow for heterogeneity of cultivar × zone variance. The single‐stage analysis and the three weighted two‐stage analyses all performed similarly. Loss of information caused by a diagonal approximation of the variance–covariance matrix of adjusted means from Stage 1 was negligible. As expected, BLUP outperformed best linear unbiased estimation. Complex variance–covariance structures were dispensable. To our knowledge, this study is the first to use cross‐validation for comparing single‐stage analyses with stagewise analyses.
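
    The diagonal-approximation weighting at the heart of this comparison can be illustrated with a minimal two-stage sketch. The column names (trial, cultivar, yield) are hypothetical, and plain means stand in for the mixed-model adjusted means; the actual analysis fits linear mixed models with zone effects and BLUP, which this sketch does not attempt.

```python
import numpy as np
import pandas as pd

def stage_one(df):
    """Per-(trial, cultivar) means and their sampling variances.

    Plain means stand in here for the mixed-model adjusted means of Stage 1.
    """
    g = df.groupby(["trial", "cultivar"])["yield"]
    out = pd.DataFrame({"mean": g.mean(), "var": g.var(ddof=1) / g.count()})
    return out.reset_index()

def stage_two(stage1):
    """Weighted least squares for cultivar effects.

    Weights are the inverse Stage-1 variances: the diagonal approximation
    to the full variance-covariance matrix of the adjusted means.
    """
    X = pd.get_dummies(stage1["cultivar"]).to_numpy(float)
    y = stage1["mean"].to_numpy(float)
    w = 1.0 / stage1["var"].to_numpy(float)
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)  # one effect per cultivar
```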

    Low-dimensional Representation of Error Covariance

    Ensemble and reduced-rank approaches to prediction and assimilation rely on low-dimensional approximations of the estimation error covariances. Here stability properties of the forecast/analysis cycle for linear, time-independent systems are used to identify factors that cause the steady-state analysis error covariance to admit a low-dimensional representation. A useful measure of forecast/analysis cycle stability is the bound matrix, a function of the dynamics, observation operator and assimilation method. Upper and lower estimates for the steady-state analysis error covariance matrix eigenvalues are derived from the bound matrix. The estimates generalize to time-dependent systems. If much of the steady-state analysis error variance is due to a few dominant modes, the leading eigenvectors of the bound matrix approximate those of the steady-state analysis error covariance matrix. The analytical results are illustrated in two numerical examples where the Kalman filter is carried to steady state. The first example uses the dynamics of a generalized advection equation exhibiting nonmodal transient growth. Failure to observe growing modes leads to increased steady-state analysis error variances. Leading eigenvectors of the steady-state analysis error covariance matrix are well approximated by leading eigenvectors of the bound matrix. The second example uses the dynamics of a damped baroclinic wave model. The leading eigenvectors of a lowest-order approximation of the bound matrix are shown to approximate well the leading eigenvectors of the steady-state analysis error covariance matrix.
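
    A minimal sketch of the numerical setup described here: iterate the linear, time-invariant Kalman filter covariance recursion to steady state and then inspect the leading eigenvectors of the analysis error covariance. The matrices M, H, Q and R are assumed given; the function is illustrative, not the paper's code.

```python
import numpy as np

def steady_state_analysis_covariance(M, H, Q, R, tol=1e-10, max_iter=100000):
    """Iterate the forecast/analysis covariance cycle to a fixed point."""
    n = M.shape[0]
    Pa = np.eye(n)
    for _ in range(max_iter):
        Pf = M @ Pa @ M.T + Q                # forecast error covariance
        S = H @ Pf @ H.T + R                 # innovation covariance
        K = np.linalg.solve(S, H @ Pf).T     # Kalman gain (S is symmetric)
        Pa_new = (np.eye(n) - K @ H) @ Pf    # analysis error covariance
        if np.linalg.norm(Pa_new - Pa) < tol:
            return Pa_new
        Pa = Pa_new
    return Pa

# If a few modes dominate the steady-state variance, Pa admits a
# low-dimensional representation; its leading eigenvectors follow from
# w, V = np.linalg.eigh(Pa), taking the columns with the largest w.
```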

    Impact of individual nodes in Boolean network dynamics

    Boolean networks serve as discrete models of regulation and signaling in biological cells. Identifying the key controllers of such processes is important for understanding the dynamical systems and planning further analysis. Here we quantify the dynamical impact of a node as the probability of damage spreading after switching the node's state. We find that the leading eigenvector of the adjacency matrix is a good predictor of dynamical impact in the case of long-term spreading. This so-called eigenvector centrality is also a good proxy measure of the influence a node's initial state has on the attractor the system eventually arrives at. Quality of prediction is further improved when eigenvector centrality is based on the weighted matrix of activities rather than the unweighted adjacency matrix. Simulations are performed with ensembles of random Boolean networks and a Boolean model of signaling in fibroblasts. The findings are supported by analytic arguments from a linear approximation of damage spreading.
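
    The damage-spreading measure can be sketched directly: flip one node's state and track the Hamming distance between the perturbed and unperturbed trajectories. The sketch below (hypothetical interface, random truth tables) does this for a random Boolean network; the measured damage would then be compared against eigenvector centrality.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_boolean_network(n, k):
    """Each node reads k random inputs through a random truth table."""
    inputs = np.array([rng.choice(n, size=k, replace=False) for _ in range(n)])
    tables = rng.integers(0, 2, size=(n, 2 ** k))
    return inputs, tables

def step(state, inputs, tables):
    """Synchronous update: encode each node's input bits as a table index."""
    idx = np.zeros(len(state), dtype=int)
    for b in range(inputs.shape[1]):
        idx = (idx << 1) | state[inputs[:, b]]
    return tables[np.arange(len(state)), idx]

def damage(inputs, tables, node, t=50, trials=20):
    """Mean Hamming distance t steps after flipping `node` in a random state."""
    n = tables.shape[0]
    total = 0.0
    for _ in range(trials):
        a = rng.integers(0, 2, size=n)
        b = a.copy()
        b[node] ^= 1
        for _ in range(t):
            a, b = step(a, inputs, tables), step(b, inputs, tables)
        total += np.sum(a != b)
    return total / trials

# Comparison target: the leading eigenvector of the network's adjacency
# matrix (eigenvector centrality), as discussed in the abstract.
```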

    Halo abundances within the cosmic web

    We investigate the dependence of the mass function of dark-matter haloes on their environment within the cosmic web of large-scale structure. A dependence of the halo mass function on large-scale mean density is a standard element of cosmological theory, allowing mass-dependent biasing to be understood via the peak-background split. On the assumption of a Gaussian density field, this analysis can be extended to ask how the mass function depends on the geometrical environment: clusters, filaments, sheets and voids, as classified via the tidal tensor (the Hessian matrix of the gravitational potential). In linear theory, the problem can be solved exactly, and the result is attractively simple: the conditional mass function has no explicit dependence on the local tidal field, and is a function only of the local density on the filtering scale used to define the tidal tensor. There is nevertheless a strong implicit predicted dependence on geometrical environment, because the local density couples statistically to the derivatives of the potential. We compute the predictions of this model and study the limits of their validity by comparing them to results deduced empirically from N-body simulations. We have verified that, to a good approximation, the abundance of haloes in different environments depends only on their densities, and not on their tidal structure. In this sense, the relative differences we find between halo abundances in different environments with the same density are smaller than 13%. Furthermore, for sufficiently large filtering scales, the agreement with the theoretical prediction is good, although there are important deviations from the Gaussian prediction at small, non-linear scales. We discuss how to obtain improved predictions in this regime, using the 'effective-universe' approach.
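
    For concreteness, here is a minimal sketch of the tidal-tensor classification of geometrical environments on a periodic grid, following the standard T-web recipe (the threshold lambda_th and the unit convention are illustrative, not the paper's exact choices): the potential is obtained from the Poisson equation in Fourier space, and each cell is labelled by the number of Hessian eigenvalues above the threshold.

```python
import numpy as np

def tweb_classify(delta, box_size, lambda_th=0.0):
    """Label each grid cell by its tidal environment.

    Returns the number of tidal-tensor eigenvalues above lambda_th:
    0 = void, 1 = sheet, 2 = filament, 3 = cluster.
    """
    n = delta.shape[0]
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                 # avoid 0/0; the zero mode is dropped below
    delta_k = np.fft.fftn(delta)
    kvec = (kx, ky, kz)
    T = np.empty(delta.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            # Hessian of the potential: T_ij(k) = (k_i k_j / k^2) delta(k),
            # in units where the Poisson equation reads lap(phi) = delta.
            Tij_k = kvec[i] * kvec[j] / k2 * delta_k
            Tij_k[0, 0, 0] = 0.0
            T[..., i, j] = np.fft.ifftn(Tij_k).real
    eigvals = np.linalg.eigvalsh(T)   # ascending, per grid cell
    return np.sum(eigvals > lambda_th, axis=-1)
```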

    Modified Linear Projection for Large Spatial Data Sets

    Recent developments in engineering techniques for spatial data collection such as geographic information systems have resulted in an increasing need for methods to analyze large spatial data sets. These sorts of data sets can be found in various fields of the natural and social sciences. However, model fitting and spatial prediction using these large spatial data sets are impractically time-consuming, because of the necessary matrix inversions. Various methods have been developed to deal with this problem, including a reduced rank approach and a sparse matrix approximation. In this paper, we propose a modification to an existing reduced rank approach to capture both the large- and small-scale spatial variations effectively. We have used simulated examples and an empirical data analysis to demonstrate that our proposed approach consistently performs well when compared with other methods. In particular, the performance of our new method does not depend on the dependence properties of the spatial covariance functions.
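
    The reduced rank approach being modified here can be sketched as follows (the exponential covariance, the knot-based construction, and the parameter names are illustrative, not the paper's exact specification): approximating the covariance through m knot locations replaces the n x n matrix inversion with m x m systems.

```python
import numpy as np

def expcov(a, b, range_=1.0, sill=1.0):
    """Exponential covariance between two sets of locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-d / range_)

def reduced_rank_predict(X, y, knots, Xnew, nugget=1e-2):
    """Kriging-style prediction under the low-rank model
    C ~= C_nk C_kk^{-1} C_kn, solved via m x m systems (Woodbury identity).
    """
    m = knots.shape[0]
    C_kk = expcov(knots, knots) + 1e-10 * np.eye(m)  # jitter for stability
    C_nk = expcov(X, knots)
    C_sk = expcov(Xnew, knots)
    # alpha = (C_nk C_kk^{-1} C_kn + nugget * I)^{-1} y, via Woodbury:
    A = nugget * C_kk + C_nk.T @ C_nk
    alpha = (y - C_nk @ np.linalg.solve(A, C_nk.T @ y)) / nugget
    return C_sk @ np.linalg.solve(C_kk, C_nk.T @ alpha)
```

    The proposed modification augments such a projection to recover the small-scale variation the pure low-rank term misses; this sketch shows only the plain reduced-rank part.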

    Sharp analysis of low-rank kernel matrix approximations

    We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations n, i.e., O(n^2). Low-rank approximations of the kernel matrix are often considered as they allow the reduction of running time complexities to O(p^2 n), where p is the rank of the approximation. The practicality of such methods thus depends on the required rank p. In this paper, we show that in the context of kernel ridge regression, for approximations based on a random subset of columns of the original kernel matrix, the rank p may be chosen to be linear in the degrees of freedom associated with the problem, a quantity which is classically used in the statistical analysis of such methods, and is often seen as the implicit number of parameters of non-parametric estimators. This result enables simple algorithms that have sub-quadratic running time complexity, but provably exhibit the same predictive performance as existing algorithms, for any given problem instance, and not only for worst-case situations.
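
    A minimal sketch of the column-sampling scheme analyzed here: kernel ridge regression restricted to a random subset of p columns of the kernel matrix, at O(p^2 n) cost. The Gaussian kernel and the regularization choices are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def nystrom_krr_fit(X, y, p, lam=1e-3, seed=0):
    """Kernel ridge regression on p random kernel columns.

    Solves (K_np^T K_np + lam * n * K_pp) coef = K_np^T y, which costs
    O(p^2 n + p^3) instead of the O(n^2)-plus of the exact solver.
    """
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=p, replace=False)   # random column subset
    K_np = gaussian_kernel(X, X[idx])            # n x p
    K_pp = gaussian_kernel(X[idx], X[idx])       # p x p
    A = K_np.T @ K_np + lam * n * K_pp
    coef = np.linalg.solve(A + 1e-10 * np.eye(p), K_np.T @ y)
    return X[idx], coef

def nystrom_krr_predict(Xnew, landmarks, coef):
    return gaussian_kernel(Xnew, landmarks) @ coef
```

    The paper's result says that p need only scale with the degrees of freedom of the problem for this approximation to match the predictive performance of the exact solver.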