
    Online Unsupervised Multi-view Feature Selection

    In the era of big data, it is becoming common to have data with multiple modalities or coming from multiple sources, known as "multi-view data". Because multi-view data are usually unlabeled and come from high-dimensional spaces (such as language vocabularies), unsupervised multi-view feature selection is crucial to many applications. However, it is nontrivial due to the following challenges. First, there may be too many instances, or the feature dimensionality may be too large, for the data to fit in memory: how can useful features be selected with limited memory space? Second, how can features be selected from streaming data while handling concept drift? Third, how can the consistent and complementary information from different views be leveraged to improve feature selection when the data are too big or arrive as streams? To the best of our knowledge, none of the previous works solves all of these challenges simultaneously. In this paper, we propose Online unsupervised Multi-View Feature Selection (OMVFS), which deals with large-scale/streaming multi-view data in an online fashion. OMVFS embeds unsupervised feature selection into a clustering algorithm via NMF with sparse learning. It further incorporates graph regularization to preserve local structure information and help select discriminative features. Instead of storing all the historical data, OMVFS processes the multi-view data chunk by chunk and aggregates all the necessary information into several small matrices. By using a buffering technique, OMVFS reduces computational and storage costs while taking advantage of the structure information. Furthermore, OMVFS can capture concept drift in the data streams. Extensive experiments on four real-world datasets show the effectiveness and efficiency of the proposed OMVFS method. More importantly, OMVFS is about 100 times faster than the off-line methods.
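
    A minimal single-view sketch of the chunk-wise aggregation idea described above: per-chunk coefficients are computed, only small summary matrices are kept across chunks, and features are ranked by the rows of the shared factor. The names (W, A, B) and the single-view simplification are illustrative assumptions; the paper's multi-view coupling, sparsity penalty, graph regularization, and buffering are omitted.

        # Hedged sketch, not the OMVFS algorithm itself: online NMF with aggregated
        # sufficient statistics, used here only to score features by row norms of W.
        import numpy as np

        def online_nmf_feature_scores(chunks, d, k, n_iter=20, eps=1e-9):
            """chunks: iterable of (n_t, d) nonnegative arrays; returns per-feature scores."""
            rng = np.random.default_rng(0)
            W = rng.random((d, k))          # feature factor, kept across chunks
            A = np.zeros((k, k))            # aggregated H H^T (small, k x k)
            B = np.zeros((d, k))            # aggregated X^T H^T (small, d x k)
            for X in chunks:                # one pass over the stream, chunk by chunk
                H = rng.random((k, X.shape[0]))
                for _ in range(n_iter):     # multiplicative updates for chunk coefficients
                    H *= (W.T @ X.T) / (W.T @ W @ H + eps)
                A += H @ H.T                # only the summaries are stored, not the data
                B += X.T @ H.T
                for _ in range(n_iter):     # refresh W from the aggregated statistics
                    W *= B / (W @ A + eps)
            return np.linalg.norm(W, axis=1)   # rank features by row norm of W

        # usage (illustrative): scores = online_nmf_feature_scores(chunk_iter, d=10000, k=20)
        #                       selected = np.argsort(scores)[::-1][:200]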

    Streaming Coreset Constructions for M-Estimators

    We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P -> [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <= rho(D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon)COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon)COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning on streaming and distributed data. M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.
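
    The COST function and the coreset guarantee above translate directly into code. Below is a minimal sketch for the k-means special case psi(x) = x^2 in Euclidean space; the construction of (Q, v) itself is assumed to come from some external routine, and the check tests only one candidate center set C, whereas the guarantee must hold for every such C.

        # Hedged sketch of the definitions above; not the streaming construction.
        import numpy as np

        def cost(P, w, C, psi=lambda x: x ** 2):
            """COST(P, w, C) = sum_p w(p) * min_c psi(d(p, c)), with Euclidean d."""
            d = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2)  # pairwise distances
            return float(np.sum(w * psi(d).min(axis=1)))

        def is_coreset_for(P, w, Q, v, C, eps):
            """Check (1-eps)COST(P,w,C) <= COST(Q,v,C) <= (1+eps)COST(P,w,C) for one C."""
            full, small = cost(P, w, C), cost(Q, v, C)
            return (1 - eps) * full <= small <= (1 + eps) * full

    In practice the inequality is spot-checked on sampled or optimized center sets, since enumerating every C of k points is infeasible.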

    The Small Scale Velocity Dispersion of Galaxies: A Comparison of Cosmological Simulations

    The velocity dispersion of galaxies on small scales (r ~ 1 h^{-1} Mpc), sigma_12(r), can be estimated from the anisotropy of the galaxy-galaxy correlation function in redshift space. We apply this technique to "mock catalogs" extracted from N-body simulations of several different variants of Cold Dark Matter dominated cosmological models to obtain results which may be consistently compared to similar results from observations. We find a large variation in the value of sigma_12(1 h^{-1} Mpc) in different regions of the same simulation. We conclude that this statistic should not be considered to conclusively rule out any of the cosmological models we have studied. We attempt to make the statistic more robust by removing clusters from the simulations using an automated cluster-removing routine, but this appears to reduce the discriminatory power of the statistic. However, studying sigma_12 as clusters with different internal velocity dispersions are removed leads to interesting information about the amount of power on cluster and subcluster scales. We also compute the pairwise velocity dispersion directly and compare this to the values obtained using the Davis-Peebles method, and find that the agreement is fairly good. We evaluate the models used for the mean streaming velocity and the pairwise peculiar velocity distribution in the original Davis-Peebles method by comparing the models with the results from the simulations. Comment: 20 pages, uuencoded (LaTeX file + 8 PostScript figures), uses AAS macro
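
    For the "direct" computation mentioned at the end of the abstract, here is a hedged sketch of one common way to measure a pairwise dispersion from simulation particles: take pairs with separation near r and compute the dispersion of the relative velocity component along the pair separation. Pair weighting, periodic boundaries, and the exact estimator used in the paper are not reproduced; this is not the Davis-Peebles redshift-space estimator itself.

        # Hedged sketch; brute-force O(N^2) pairs, suitable only for a few thousand objects.
        import numpy as np

        def sigma12_direct(pos, vel, r, dr=0.2):
            """pos, vel: (N, 3) arrays (e.g. h^-1 Mpc and km/s); returns sigma_12(r) in km/s."""
            dx = pos[:, None, :] - pos[None, :, :]          # pair separation vectors
            sep = np.linalg.norm(dx, axis=2)
            i, j = np.where((sep > r - dr) & (sep < r + dr) & (sep > 0))
            rhat = dx[i, j] / sep[i, j, None]               # unit separation vectors
            v12 = np.einsum('ij,ij->i', vel[i] - vel[j], rhat)  # radial relative velocity
            return float(np.std(v12))                       # dispersion over selected pairs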

    Redshift-Space Distortions and the Real-Space Clustering of Different Galaxy Types

    We study the distortions induced by peculiar velocities on the redshift-space correlation function of galaxies of different morphological types in the Pisces-Perseus redshift survey. Redshift-space distortions affect early- and late-type galaxies in different ways. In particular, at small separations, the dominant effect comes from virialized cluster cores, where ellipticals are the dominant population. The net result is that a meaningful comparison of the clustering strength of different morphological types can be performed only in real space, i.e., after projecting out the redshift distortions on the two-point correlation function xi(r_p,pi). A power-law fit to the projected function w_p(r_p) on scales smaller than 10/h Mpc gives r_0 = 8.35_{-0.76}^{+0.75} /h Mpc, gamma = 2.05_{-0.08}^{+0.10} for the early-type population, and r_0 = 5.55_{-0.45}^{+0.40} /h Mpc, gamma = 1.73_{-0.08}^{+0.07} for spirals and irregulars. These values are derived for a sample with luminosities brighter than M_{Zw} = -19.5. We detect a 25% increase of r_0 with luminosity for all types combined, from M_{Zw} = -19 to -20. In the framework of a simple stable-clustering model for the mean streaming of pairs, we estimate sigma_12(1), the one-dimensional pairwise velocity dispersion between 0 and 1 /h Mpc, to be 865^{+250}_{-165} km/s for early-type galaxies and 345^{+95}_{-65} km/s for late types. This latter value should be a fair estimate of the pairwise dispersion for "field" galaxies; it is stable with respect to the presence or absence of clusters in the sample, and is consistent with the values found for non-cluster galaxies and IRAS galaxies at similar separations. Comment: 17 LaTeX pages including 3 tables, plus 11 PS figures. Uses AASTeX macro package (aaspp4.sty) and epsf.sty. To appear in ApJ, 489, Nov 199
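
    The power-law fit quoted above relies on the standard relation between a power law xi(r) = (r/r_0)^-gamma and its projection, w_p(r_p) = r_p (r_0/r_p)^gamma Gamma(1/2) Gamma((gamma-1)/2) / Gamma(gamma/2). Below is a hedged sketch of recovering (r_0, gamma) from measured w_p values; the survey's pair counting, error bars, and integration limits are not reproduced, and the fitting setup is an illustrative assumption.

        # Hedged sketch: least-squares fit of the projected power-law model.
        import numpy as np
        from scipy.optimize import curve_fit
        from scipy.special import gamma as G

        def wp_model(rp, r0, g):
            """Projected correlation function for xi(r) = (r/r0)^-g."""
            return rp * (r0 / rp) ** g * G(0.5) * G((g - 1) / 2) / G(g / 2)

        # rp, wp: measured projected correlation function on scales < 10/h Mpc
        # (r0, g), _ = curve_fit(wp_model, rp, wp, p0=(5.0, 1.8))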

    Effects of Unstable Dark Matter on Large-Scale Structure and Constraints from Future Surveys

    In this paper we explore the effect of decaying dark matter (DDM) on large-scale structure and possible constraints from galaxy imaging surveys. DDM models have been studied, in part, as a way to address apparent discrepancies between the predictions of standard cold dark matter models and observations of galactic structure. Our study is aimed at developing independent constraints on these models. In such models, DDM decays into a less massive, stable dark matter (SDM) particle and a significantly lighter particle. The small mass splitting between the parent DDM and the daughter SDM provides the SDM with a recoil or "kick" velocity v_k, inducing a free-streaming suppression of matter fluctuations. This suppression may be probed via weak lensing power spectra measured by a number of forthcoming imaging surveys that aim primarily to constrain dark energy. Using scales on which linear perturbation theory alone is valid (multipoles < 300), surveys like Euclid or LSST can be sensitive to v_k > 90 km/s for lifetimes ~ 1-5 Gyr. To estimate more aggressive constraints, we model nonlinear corrections to lensing power using a simple halo evolution model that is in good agreement with numerical simulations. In our most ambitious forecasts, using multipoles < 3000, we find that imaging surveys can be sensitive to v_k ~ 10 km/s for lifetimes < 10 Gyr. Lensing will provide a particularly interesting complement to existing constraints in that it will probe the long lifetime regime far better than contemporary techniques. A caveat to these ambitious forecasts is that the evolution of perturbations on nonlinear scales will need to be well calibrated by numerical simulations before they can be realized. This work motivates the pursuit of such a numerical simulation campaign to constrain dark matter with cosmological weak lensing. Comment: 15 pages, 7 figures. Submitted to PR
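
    As a rough aid to the numbers quoted above, the recoil velocity in such models follows from momentum conservation in the two-body decay: for a fractional mass splitting eps = Delta m / m_DDM << 1 and a (nearly) massless second daughter, the SDM receives v_k ~ eps * c. A back-of-the-envelope sketch; the specific eps value shown is an illustrative assumption, not a number from the paper.

        # Hedged sketch of the kick-velocity scaling, small-splitting approximation only.
        C_KM_S = 299792.458  # speed of light in km/s

        def kick_velocity(eps):
            """Daughter kick (km/s) for fractional mass splitting eps << 1."""
            return eps * C_KM_S

        # e.g. eps ~ 3e-4 gives v_k ~ 90 km/s, the linear-scale sensitivity quoted above.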

    Gravitational Clustering: A Simple, Robust and Adaptive Approach for Distributed Networks

    Distributed signal processing for wireless sensor networks enables different devices to cooperate in solving different signal processing tasks. A crucial first step is to answer the question: who observes what? Recently, several distributed algorithms have been proposed that frame the signal/object labelling problem in terms of cluster analysis after extracting source-specific features; however, they assume that the number of clusters is known. We propose a new method called Gravitational Clustering (GC) to adaptively estimate the time-varying number of clusters based on a set of feature vectors. The key idea is to exploit the physical principle of gravitational force between mass units: streaming-in feature vectors are treated as mass units with fixed positions in the feature space, around which mobile mass units are injected at each time instant. The cluster enumeration exploits the fact that the highest attraction on the mobile mass units is exerted by regions with a high density of feature vectors, i.e., gravitational clusters. By sharing estimates among neighboring nodes via a diffusion-adaptation scheme, cooperative and distributed cluster enumeration is achieved. Numerical experiments concerning robustness against outliers, convergence, and computational complexity are conducted. The application in a distributed cooperative multi-view camera network illustrates the applicability to real-world problems. Comment: 12 pages, 9 figure
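
    A toy, single-node sketch of the gravitational attraction idea described above, assuming simple inverse-square forces and a fixed merge tolerance: fixed feature-vector masses pull randomly injected mobile units toward dense regions, and the number of distinct convergence points serves as the cluster-number estimate. The diffusion-adaptation across nodes and the robustness machinery of the actual GC algorithm are omitted; all constants are illustrative.

        # Hedged sketch, not the GC algorithm from the paper.
        import numpy as np

        def toy_gravitational_clusters(features, n_mobile=50, steps=200, step_size=0.05,
                                       merge_tol=0.5, seed=0):
            rng = np.random.default_rng(seed)
            lo, hi = features.min(axis=0), features.max(axis=0)
            mobile = rng.uniform(lo, hi, size=(n_mobile, features.shape[1]))
            for _ in range(steps):
                diff = features[None, :, :] - mobile[:, None, :]        # vectors to fixed masses
                dist2 = np.sum(diff ** 2, axis=2) + 1e-6
                force = np.sum(diff / dist2[..., None] ** 1.5, axis=1)  # inverse-square pull
                force_norm = np.linalg.norm(force, axis=1, keepdims=True) + 1e-12
                mobile += step_size * force / force_norm                # fixed-length steps
            centers = []                                                # merge converged units
            for m in mobile:
                if not any(np.linalg.norm(m - c) < merge_tol for c in centers):
                    centers.append(m)
            return len(centers), np.array(centers)                      # cluster-number estimate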