Online Unsupervised Multi-view Feature Selection
In the era of big data, it is becoming common to have data with multiple
modalities or coming from multiple sources, known as "multi-view data".
Because multi-view data are usually unlabeled and come from high-dimensional spaces
(such as language vocabularies), unsupervised multi-view feature selection is
crucial to many applications. However, it is nontrivial due to the following
challenges. First, the number of instances or the feature dimensionality may be
too large for the data to fit in memory: how can we select useful features
with limited memory space? Second, how can we select features from
streaming data while handling concept drift? Third, how can we leverage the
consistent and complementary information from different views to improve the
feature selection in the situation when the data are too big or come in as
streams? To the best of our knowledge, none of the previous works can solve all
the challenges simultaneously. In this paper, we propose an Online unsupervised
Multi-View Feature Selection method, OMVFS, which deals with large-scale/streaming
multi-view data in an online fashion. OMVFS embeds unsupervised feature
selection into a clustering algorithm via NMF with sparse learning. It further
incorporates the graph regularization to preserve the local structure
information and help select discriminative features. Instead of storing all the
historical data, OMVFS processes the multi-view data chunk by chunk and
aggregates all the necessary information into several small matrices. By using
the buffering technique, the proposed OMVFS can reduce the computational and
storage cost while taking advantage of the structure information. Furthermore,
OMVFS can capture the concept drifts in the data streams. Extensive experiments
on four real-world datasets show the effectiveness and efficiency of the
proposed OMVFS method. More importantly, OMVFS is about 100 times faster than
the off-line methods.
Streaming Coreset Constructions for M-Estimators
We introduce a new method of maintaining a (k,epsilon)-coreset for clustering M-estimators over insertion-only streams. Let (P,w) be a weighted set (where w : P -> [0,infty) is the weight function) of points in a rho-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x,z) <= rho (D(x,y) + D(y,z)) for all x,y,z in X). For any set of points C, we define COST(P,w,C) = sum_{p in P} w(p) min_{c in C} D(p,c). A (k,epsilon)-coreset for (P,w) is a weighted set (Q,v) such that for every set C of k points, (1-epsilon) COST(P,w,C) <= COST(Q,v,C) <= (1+epsilon) COST(P,w,C). Essentially, the coreset (Q,v) can be used in place of (P,w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning of streaming and distributed data.
M-estimators are functions D(x,y) that can be written as psi(d(x,y)) where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (psi(x) = x) and k-means (psi(x) = x^2) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(epsilon^{-2} k log k log n) points of storage. The previous state-of-the-art required storing at least O(epsilon^{-2} k log k log^{4} n) points.
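The COST function and the coreset inequality defined above can be illustrated with a minimal Python sketch. It is not the paper's streaming construction: it only evaluates COST for an M-estimator psi(d(x,y)) and checks the (1 +/- epsilon) bound for one given set of centers C, whereas the definition requires the bound for every set of k points. The function names (cost, holds_for_centers) and the choice of Euclidean distance for d are assumptions made for illustration.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def cost(points: Sequence[Point],
         weights: Sequence[float],
         centers: Sequence[Point],
         psi: Callable[[float], float] = lambda t: t) -> float:
    """COST(P, w, C) = sum_p w(p) * min_c psi(d(p, c)).

    d is taken to be the Euclidean metric (an assumption for this sketch);
    psi(t) = t gives k-median, psi(t) = t**2 gives k-means.
    """
    return sum(
        w * min(psi(math.dist(p, c)) for c in centers)
        for p, w in zip(points, weights)
    )


def holds_for_centers(P, w, Q, v, centers, eps,
                      psi: Callable[[float], float] = lambda t: t) -> bool:
    """Check the coreset inequality for ONE candidate set of centers.

    A true (k, eps)-coreset must satisfy this for every set of k centers;
    this helper only tests the single set that is passed in.
    """
    full = cost(P, w, centers, psi)
    reduced = cost(Q, v, centers, psi)
    return (1 - eps) * full <= reduced <= (1 + eps) * full


# Toy example: merging duplicate points into one weighted point
# preserves COST exactly, so the inequality holds even with eps = 0.
P = [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
w = [1.0, 1.0, 1.0]
Q = [(0.0, 0.0), (2.0, 0.0)]
v = [2.0, 1.0]
C = [(1.0, 0.0)]

print(cost(P, w, C), cost(Q, v, C))               # k-median cost
print(holds_for_centers(P, w, Q, v, C, eps=0.0))  # single-C check
```

Passing psi=lambda t: t**2 switches the same check from the k-median cost to the k-means cost.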
The Small Scale Velocity Dispersion of Galaxies: A Comparison of Cosmological Simulations
The velocity dispersion of galaxies on small scales (r ~ 1/h Mpc),
sigma_12(r), can be estimated from the anisotropy of the galaxy-galaxy
correlation function in redshift space. We apply this technique to
"mock-catalogs" extracted from N-body simulations of several different
variants of Cold Dark Matter dominated cosmological models to obtain results
which may be consistently compared to similar results from observations. We
find a large variation in the value of sigma_12 in different
regions of the same simulation. We conclude that this statistic should not be
considered to conclusively rule out any of the cosmological models we have
studied. We attempt to make the statistic more robust by removing clusters from
the simulations using an automated cluster-removing routine, but this appears
to reduce the discriminatory power of the statistic. However, studying
sigma_12 as clusters with different internal velocity dispersions are
removed leads to interesting information about the amount of power on cluster
and subcluster scales. We also compute the pairwise velocity dispersion
directly and compare this to the values obtained using the Davis-Peebles
method, and find that the agreement is fairly good. We evaluate the models used
for the mean streaming velocity and the pairwise peculiar velocity distribution
in the original Davis-Peebles method by comparing the models with the results
from the simulations.
Comment: 20 pages, uuencoded (Latex file + 8 Postscript figures), uses AAS
macro
Redshift-Space Distortions and the Real-Space Clustering of Different Galaxy Types
We study the distortions induced by peculiar velocities on the redshift-space
correlation function of galaxies of different morphological types in the
Pisces-Perseus redshift survey. Redshift-space distortions affect early- and
late-type galaxies in different ways. In particular, at small separations, the
dominant effect comes from virialized cluster cores, where ellipticals are the
dominant population. The net result is that a meaningful comparison of the
clustering strength of different morphological types can be performed only in
real space, i.e., after projecting out the redshift distortions on the
two-point correlation function xi(r_p,pi). A power-law fit to the projected
function w_p(r_p) on scales smaller than 10/h Mpc gives r_o =
8.35_{-0.76}^{+0.75} /h Mpc, \gamma = 2.05_{-0.08}^{+0.10} for the early-type
population, and r_o = 5.55_{-0.45}^{+0.40} /h Mpc, \gamma =
1.73_{-0.08}^{+0.07} for spirals and irregulars. These values are derived for a
sample luminosity brighter than M_{Zw} = -19.5. We detect a 25% increase of r_o
with luminosity for all types combined, from M_{Zw} = -19 to -20. In the
framework of a simple stable-clustering model for the mean streaming of pairs,
we estimate sigma_12(1), the one-dimensional pairwise velocity dispersion
between 0 and 1 /h Mpc, to be 865^{+250}_{-165} km/s for early-type galaxies
and 345^{+95}_{-65} km/s for late types. This latter value should be a fair
estimate of the pairwise dispersion for "field" galaxies; it is stable with
respect to the presence or absence of clusters in the sample, and is consistent
with the values found for non-cluster galaxies and IRAS galaxies at similar
separations.
Comment: 17 LaTeX pages including 3 tables, plus 11 PS figures. Uses AASTeX
macro package (aaspp4.sty) and epsf.sty. To appear on ApJ, 489, Nov 199
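To make the quoted fit parameters concrete, the short Python sketch below evaluates the conventional power-law form of the real-space correlation function, xi(r) = (r / r_o)^(-gamma), for the two populations. The abstract only states that a power law was fitted to w_p(r_p); using this standard form for xi(r), the function name xi_power_law, and the sample separations are assumptions of this sketch, and the quoted uncertainties are ignored.

```python
def xi_power_law(r: float, r0: float, gamma: float) -> float:
    """Power-law correlation function xi(r) = (r / r0) ** (-gamma).

    r and r0 are in the same units (here 1/h Mpc); xi equals 1 at r = r0.
    """
    return (r / r0) ** (-gamma)


# Best-fit central values quoted in the abstract (error bars omitted).
early = {"r0": 8.35, "gamma": 2.05}   # early-type galaxies
late = {"r0": 5.55, "gamma": 1.73}    # spirals and irregulars

# Separations inside the fitted range (scales smaller than 10/h Mpc).
for r in (0.5, 1.0, 5.0):
    xi_e = xi_power_law(r, **early)
    xi_l = xi_power_law(r, **late)
    print(f"r = {r:4.1f}/h Mpc  xi_early = {xi_e:8.2f}  "
          f"xi_late = {xi_l:8.2f}  ratio = {xi_e / xi_l:5.2f}")
```

Because the early-type fit has both a larger r_o and a steeper gamma, the ratio of the two correlation functions grows toward small separations.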
Effects of Unstable Dark Matter on Large-Scale Structure and Constraints from Future Surveys
In this paper we explore the effect of decaying dark matter (DDM) on
large-scale structure and possible constraints from galaxy imaging surveys. DDM
models have been studied, in part, as a way to address apparent discrepancies
between the predictions of standard cold dark matter models and observations of
galactic structure. Our study is aimed at developing independent constraints on
these models. In such models, DDM decays into a less massive, stable dark
matter (SDM) particle and a significantly lighter particle. The small mass
splitting between the parent DDM and the daughter SDM provides the SDM with a
recoil or "kick" velocity vk, inducing a free-streaming suppression of matter
fluctuations. This suppression may be probed via weak lensing power spectra
measured by a number of forthcoming imaging surveys that aim primarily to
constrain dark energy. Using scales on which linear perturbation theory alone
is valid (multipoles < 300), surveys like Euclid or LSST can be sensitive to vk
> 90 km/s for lifetimes ~ 1-5 Gyr. To estimate more aggressive constraints, we
model nonlinear corrections to lensing power using a simple halo evolution
model that is in good agreement with numerical simulations. In our most
ambitious forecasts, using multipoles < 3000, we find that imaging surveys can
be sensitive to vk ~ 10 km/s for lifetimes < 10 Gyr. Lensing will provide a
particularly interesting complement to existing constraints in that it will
probe the long lifetime regime far better than contemporary techniques. A
caveat to these ambitious forecasts is that the evolution of perturbations on
nonlinear scales will need to be well calibrated by numerical simulations
before they can be realized. This work motivates the pursuit of such a
numerical simulation campaign to constrain dark matter with cosmological weak
lensing.
Comment: 15 pages, 7 figures. Submitted to PR
Gravitational Clustering: A Simple, Robust and Adaptive Approach for Distributed Networks
Distributed signal processing for wireless sensor networks enables
different devices to cooperate to solve different signal processing tasks. A
crucial first step is to answer the question: who observes what? Recently,
several distributed algorithms have been proposed, which frame the
signal/object labelling problem in terms of cluster analysis after extracting
source-specific features; however, the number of clusters is assumed to be
known. We propose a new method called Gravitational Clustering (GC) to
adaptively estimate the time-varying number of clusters based on a set of
feature vectors. The key idea is to exploit the physical principle of
gravitational force between mass units: streaming-in feature vectors are
considered as mass units of fixed position in the feature space, around which
mobile mass units are injected at each time instant. The cluster enumeration
exploits the fact that the highest attraction on the mobile mass units is
exerted by regions with a high density of feature vectors, i.e., gravitational
clusters. By sharing estimates among neighboring nodes via a
diffusion-adaptation scheme, cooperative and distributed cluster enumeration is
achieved. Numerical experiments concerning robustness against outliers,
convergence and computational complexity are conducted. The application in a
distributed cooperative multi-view camera network illustrates the applicability
to real-world problems.
Comment: 12 pages, 9 figure