Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection
We study the problem of selecting a subset of k random variables from a large
set, in order to obtain the best linear prediction of another variable of
interest. This problem can be viewed in the context of both feature selection
and sparse approximation. We analyze the performance of widely used greedy
heuristics, using insights from the maximization of submodular functions and
spectral analysis. We introduce the submodularity ratio as a key quantity to
help understand why greedy algorithms perform well even when the variables are
highly correlated. Using our techniques, we obtain the strongest known
approximation guarantees for this problem, both in terms of the submodularity
ratio and the smallest k-sparse eigenvalue of the covariance matrix. We further
demonstrate the wide applicability of our techniques by analyzing greedy
algorithms for the dictionary selection problem, and significantly improve the
previously known guarantees. Our theoretical analysis is complemented by
experiments on real-world and synthetic data sets; the experiments show that
the submodularity ratio is a stronger predictor of the performance of greedy
algorithms than other spectral parameters
Cross-validation of stagewise mixed-model analysis of Swedish variety trials with winter wheat and spring barley
In cultivar testing, linear mixed models have been used routinely to analyze multienvironment trials. A single-stage analysis is considered the gold standard, whereas two-stage analysis produces similar results when a fully efficient weighting method is used, namely when the full variance-covariance matrix of the estimated means from Stage 1 is forwarded to Stage 2. However, in practice, this may be hard to do and a diagonal approximation is often used. We conducted a cross-validation with data from Swedish cultivar trials on winter wheat (Triticum aestivum L.) and spring barley (Hordeum vulgare L.) to assess the performance of single-stage and two-stage analyses. The fully efficient method and two diagonal approximation methods were used for weighting in the two-stage analyses. In Sweden, cultivar recommendation is delineated by zones (regions), not individual locations. We demonstrate the use of best linear unbiased prediction (BLUP) for cultivar effects per zone, which exploits correlations between zones and thus allows information to be borrowed across zones. Complex variance-covariance structures were applied to allow for heterogeneity of cultivar × zone variance. The single-stage analysis and the three weighted two-stage analyses all performed similarly. Loss of information caused by a diagonal approximation of the variance-covariance matrix of adjusted means from Stage 1 was negligible. As expected, BLUP outperformed best linear unbiased estimation. Complex variance-covariance structures were dispensable. To our knowledge, this study is the first to use cross-validation for comparing single-stage analyses with stagewise analyses.
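The shrinkage behind BLUP can be illustrated on a deliberately simplified model, a balanced one-way random-effects model rather than the paper's multi-environment mixed model: each cultivar's observed mean is pulled toward the grand mean by a factor determined by the variance components, which is why BLUP outperforms unshrunken estimation.

```python
import numpy as np

# Deliberately simplified illustration (a balanced one-way random-effects
# model, not the paper's multi-environment mixed model): BLUP shrinks each
# cultivar's observed mean toward the grand mean by the factor
# sigma2_g / (sigma2_g + sigma2_e / n_reps).
def blup(cultivar_means, n_reps, sigma2_g, sigma2_e):
    mu = cultivar_means.mean()                          # grand mean
    shrink = sigma2_g / (sigma2_g + sigma2_e / n_reps)  # shrinkage factor
    return mu + shrink * (cultivar_means - mu)

means = np.array([5.0, 6.0, 7.0])  # observed cultivar means (illustrative)
pred = blup(means, n_reps=4, sigma2_g=0.5, sigma2_e=1.0)
print(pred)                        # predictions pulled toward the grand mean 6.0
```

With more replicates (larger n_reps) or a larger genetic variance, the shrinkage factor approaches 1 and BLUP approaches the unshrunken means; the zone-level BLUP in the paper adds borrowing of information across correlated zones on top of this.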
Low-dimensional Representation of Error Covariance
Ensemble and reduced-rank approaches to prediction and assimilation rely on low-dimensional approximations of the estimation error covariances. Here stability properties of the forecast/analysis cycle for linear, time-independent systems are used to identify factors that cause the steady-state analysis error covariance to admit a low-dimensional representation. A useful measure of forecast/analysis cycle stability is the bound matrix, a function of the dynamics, observation operator and assimilation method. Upper and lower estimates for the steady-state analysis error covariance matrix eigenvalues are derived from the bound matrix. The estimates generalize to time-dependent systems. If much of the steady-state analysis error variance is due to a few dominant modes, the leading eigenvectors of the bound matrix approximate those of the steady-state analysis error covariance matrix. The analytical results are illustrated in two numerical examples where the Kalman filter is carried to steady state. The first example uses the dynamics of a generalized advection equation exhibiting nonmodal transient growth. Failure to observe growing modes leads to increased steady-state analysis error variances. Leading eigenvectors of the steady-state analysis error covariance matrix are well approximated by leading eigenvectors of the bound matrix. The second example uses the dynamics of a damped baroclinic wave model. The leading eigenvectors of a lowest-order approximation of the bound matrix are shown to approximate well the leading eigenvectors of the steady-state analysis error covariance matrix.
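A minimal numerical sketch of the forecast/analysis cycle, using an assumed toy system rather than the paper's advection or baroclinic wave models: iterate the Kalman filter covariance recursion for a linear time-invariant system to steady state and inspect the eigenvalue spectrum of the analysis error covariance.

```python
import numpy as np

# Toy sketch (assumed system, not the paper's models): iterate the Kalman
# filter covariance recursion to steady state for a stable linear system
# with partial observations, then inspect the analysis error spectrum.
n = 6
M = 0.9 * np.eye(n)          # stable, time-independent dynamics
M[0, 1] = 0.5                # a nonnormal coupling term
Q = 0.01 * np.eye(n)         # model error covariance
H = np.eye(2, n)             # observe only the first two components
R = 0.1 * np.eye(2)          # observation error covariance

Pa = np.eye(n)               # initial analysis error covariance
for _ in range(500):
    Pf = M @ Pa @ M.T + Q                           # forecast step
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)  # Kalman gain
    Pa = (np.eye(n) - K @ H) @ Pf                   # analysis step

evals = np.linalg.eigvalsh(Pa)[::-1]                # descending eigenvalues
print(evals / evals.sum())   # fraction of steady-state variance per mode
```

In this toy case the unobserved components retain the largest steady-state variances, echoing the paper's point that failing to observe growing or persistent modes inflates the analysis error along those directions.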
Impact of individual nodes in Boolean network dynamics
Boolean networks serve as discrete models of regulation and signaling in
biological cells. Identifying the key controllers of such processes is
important for understanding the dynamical systems and planning further
analysis. Here we quantify the dynamical impact of a node as the probability of
damage spreading after switching the node's state. We find that the leading
eigenvector of the adjacency matrix is a good predictor of dynamical impact in
the case of long-term spreading. This so-called eigenvector centrality is also
a good proxy measure of the influence a node's initial state has on the
attractor the system eventually arrives at. Quality of prediction is further
improved when eigenvector centrality is based on the weighted matrix of
activities rather than the unweighted adjacency matrix. Simulations are
performed with ensembles of random Boolean networks and a Boolean model of
signaling in fibroblasts. The findings are supported by analytic arguments from
a linear approximation of damage spreading.
Comment: 6 pages, 3 figures, 3 tables
Halo abundances within the cosmic web
We investigate the dependence of the mass function of dark-matter haloes on
their environment within the cosmic web of large-scale structure. A dependence
of the halo mass function on large-scale mean density is a standard element of
cosmological theory, allowing mass-dependent biasing to be understood via the
peak-background split. On the assumption of a Gaussian density field, this
analysis can be extended to ask how the mass function depends on the
geometrical environment: clusters, filaments, sheets and voids, as classified
via the tidal tensor (the Hessian matrix of the gravitational potential). In
linear theory, the problem can be solved exactly, and the result is
attractively simple: the conditional mass function has no explicit dependence
on the local tidal field, and is a function only of the local density on the
filtering scale used to define the tidal tensor. There is nevertheless a strong
implicit predicted dependence on geometrical environment, because the local
density couples statistically to the derivatives of the potential. We compute
the predictions of this model and study the limits of their validity by
comparing them to results deduced empirically from N-body simulations. We
have verified that, to a good approximation, the abundance of haloes in
different environments depends only on their densities, and not on their tidal
structure. In this sense we find relative differences between halo abundances
in different environments with the same density which are smaller than 13%.
Furthermore, for sufficiently large filtering scales, the agreement with the
theoretical prediction is good, although there are important deviations from
the Gaussian prediction at small, non-linear scales. We discuss how to obtain
improved predictions in this regime, using the 'effective-universe' approach.
Comment: 14 pages, 6 figures. Revision matching journal version
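The tidal-tensor classification into clusters, filaments, sheets and voids can be sketched as follows; the eigenvalue-threshold convention used here is a common illustrative choice, not necessarily the paper's exact definition.

```python
import numpy as np

# Sketch of tidal-tensor environment classification (illustrative threshold
# convention, not necessarily the paper's exact definition): count eigenvalues
# of the Hessian of the gravitational potential that exceed lam_th.
def classify_environment(T, lam_th=0.0):
    labels = ["void", "sheet", "filament", "cluster"]
    n_collapsing = int(np.sum(np.linalg.eigvalsh(T) > lam_th))
    return labels[n_collapsing]

T = np.diag([0.8, 0.3, -0.2])    # two collapsing axes, one expanding
print(classify_environment(T))   # -> filament
```

The paper's key result is then that, at fixed filtering scale, the conditional halo mass function depends only on the local density and not on which of these four classes the eigenvalue count assigns.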
Modified Linear Projection for Large Spatial Data Sets
Recent developments in engineering techniques for spatial data collection
such as geographic information systems have resulted in an increasing need for
methods to analyze large spatial data sets. These sorts of data sets can be
found in various fields of the natural and social sciences. However, model
fitting and spatial prediction using these large spatial data sets are
impractically time-consuming, because of the necessary matrix inversions.
Various methods have been developed to deal with this problem, including a
reduced rank approach and a sparse matrix approximation. In this paper, we
propose a modification to an existing reduced rank approach to capture both the
large- and small-scale spatial variations effectively. We have used simulated
examples and an empirical data analysis to demonstrate that our proposed
approach consistently performs well when compared with other methods. In
particular, the performance of our new method does not depend on the dependence
properties of the spatial covariance functions.
Comment: 29 pages, 5 figures, 4 tables
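A basic reduced-rank ("predictive process" style) covariance approximation built from m knot locations illustrates the generic idea being modified, though not the paper's proposed modification itself: the O(n^3) inversion is replaced by work scaling as O(n m^2).

```python
import numpy as np

# Hedged sketch of a generic reduced-rank spatial covariance approximation
# from m knots (the baseline idea, not the paper's proposed modification).
def exp_cov(s1, s2, rho=0.3):
    d = np.abs(s1[:, None] - s2[None, :])
    return np.exp(-d / rho)            # exponential covariance, 1-D locations

n, m = 200, 20
s = np.linspace(0, 1, n)               # data locations
knots = np.linspace(0, 1, m)           # knot locations
C_sk = exp_cov(s, knots)               # n x m cross-covariance
C_kk = exp_cov(knots, knots)           # m x m knot covariance
C_low = C_sk @ np.linalg.solve(C_kk, C_sk.T)   # rank-m approximation of C
C_full = exp_cov(s, s)
err = np.linalg.norm(C_full - C_low) / np.linalg.norm(C_full)
print(round(err, 3))                   # small relative Frobenius error
```

The known weakness of this baseline is that it captures large-scale variation well but underestimates small-scale variance between knots, which is exactly the gap the paper's modification targets.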
Sharp analysis of low-rank kernel matrix approximations
We consider supervised learning problems within the positive-definite kernel
framework, such as kernel ridge regression, kernel logistic regression or the
support vector machine. With kernels leading to infinite-dimensional feature
spaces, a common practical limiting difficulty is the necessity of computing
the kernel matrix, which most frequently leads to algorithms with running time
at least quadratic in the number of observations n, i.e., O(n^2). Low-rank
approximations of the kernel matrix are often considered as they allow the
reduction of running time complexities to O(p^2 n), where p is the rank of the
approximation. The practicality of such methods thus depends on the required
rank p. In this paper, we show that in the context of kernel ridge regression,
for approximations based on a random subset of columns of the original kernel
matrix, the rank p may be chosen to be linear in the degrees of freedom
associated with the problem, a quantity which is classically used in the
statistical analysis of such methods, and is often seen as the implicit number
of parameters of non-parametric estimators. This result enables simple
algorithms that have sub-quadratic running time complexity, but provably
exhibit the same predictive performance as existing algorithms, for any given
problem instance, and not only for worst-case situations.
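A column-sampling (Nyström) approximation for kernel ridge regression can be sketched as follows; the Gaussian kernel, uniform column sampling, and Woodbury-style solve are standard ingredients assumed here, not details taken from the paper.

```python
import numpy as np

# Sketch of column-sampling (Nystrom) kernel ridge regression (assumed
# details: Gaussian kernel, uniform sampling). The rank-p approximation
# K approx K_np K_pp^{-1} K_np^T avoids ever forming the n x n matrix.
def gauss_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, p, lam = 300, 30, 1e-2
X = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

idx = rng.choice(n, p, replace=False)   # p sampled columns
Knp = gauss_kernel(X, X[idx])           # n x p block
Kpp = gauss_kernel(X[idx], X[idx])      # p x p block
L = np.linalg.cholesky(Kpp + 1e-8 * np.eye(p))
Z = np.linalg.solve(L, Knp.T).T         # n x p factor with Z Z^T approx K
# Identity Z Z^T (Z Z^T + lam I)^{-1} y = Z (Z^T Z + lam I)^{-1} Z^T y
# turns the O(n^3) solve into an O(n p^2) one.
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
fitted = Z @ beta                       # approximate KRR fit at training points
print(round(float(np.mean((fitted - y) ** 2)), 4))
```

The paper's contribution is the choice of p: it shows that taking p proportional to the degrees of freedom of the problem already preserves the full method's predictive performance.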