Cluster membership probabilities from proper motions and multiwavelength photometric catalogues: I. Method and application to the Pleiades cluster
We present a new technique designed to take full advantage of the high
dimensionality (photometric, astrometric, temporal) of the DANCe survey to
derive self-consistent and robust membership probabilities of the Pleiades
cluster. We aim to develop a methodology to infer membership probabilities
for the Pleiades cluster from the DANCe multidimensional astro-photometric data
set in a consistent way throughout the entire derivation. The determination of
the membership probabilities has to be applicable to censored data and must
incorporate the measurement uncertainties into the inference procedure.
We use Bayes' theorem and a curvilinear forward model for the likelihood of
the measurements of cluster members in the colour-magnitude space, to infer
posterior membership probabilities. The distributions of the cluster members'
proper motions and of the contaminants in the full
multidimensional astro-photometric space are modelled with a
mixture-of-Gaussians likelihood. We analyse several representation spaces
composed of the proper motions plus a subset of the available magnitudes and
colour indices. We select two prominent representation spaces composed of
variables selected using feature relevance determination techniques based on
Random Forests, and analyse the resulting samples of high-probability
candidates. We consistently find lists of high-probability (p > 0.9975)
candidates with 1000 sources, 4 to 5 times more than obtained in the
most recent astro-photometric studies of the cluster.
The methodology presented here is ready for application in data sets that
include more dimensions, such as radial and/or rotational velocities, spectral
indices and variability.
Comment: 14 pages, 4 figures, accepted by A&
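The Bayesian step described in this abstract can be illustrated with a deliberately reduced sketch: one proper-motion component, one Gaussian for the cluster and one for the field, and the measurement uncertainty added in quadrature to each intrinsic width. The function names, the two-component setup, and all numbers are assumptions made for illustration; the paper itself works in a multidimensional astro-photometric space with mixtures of Gaussians and a curvilinear colour-magnitude model.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def membership_probability(pm, pm_err, prior_cluster, cluster, field):
    """Posterior P(member | pm) from Bayes' theorem.

    cluster and field are (mean, intrinsic_sigma) tuples (hypothetical
    parameters). The measurement uncertainty pm_err is added in quadrature
    to each intrinsic width, so noisier sources get broader likelihoods.
    """
    def likelihood(component):
        mean, sigma = component
        return gaussian_pdf(pm, mean, math.hypot(sigma, pm_err))

    numerator = prior_cluster * likelihood(cluster)
    denominator = numerator + (1.0 - prior_cluster) * likelihood(field)
    return numerator / denominator


# Made-up values in mas/yr, not fitted to any survey.
cluster = (20.0, 1.0)
field = (0.0, 15.0)
for pm in (0.0, 10.0, 20.0):
    print(pm, membership_probability(pm, 0.5, 0.01, cluster, field))
```

A useful sanity check on any such model: when the cluster and field likelihoods are identical, the posterior reduces to the prior.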
Efficient Scalable Accurate Regression Queries in In-DBMS Analytics
Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined based on the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and near-perfect approximations of the underlying relationships among data attributes.
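To make the two query types concrete, the sketch below computes their exact answers on a small in-memory table: a mean-value query and a linear regression query, both restricted to a subspace defined by a range on another attribute. This is the brute-force baseline that the abstract describes as inefficient, not the predictive model the authors propose. The helper names, the attribute names `x`, `y`, `z`, and the data are hypothetical, and the regression uses a single predictor for brevity.

```python
def select_subspace(rows, attr, low, high):
    """Keep the rows whose value of attr lies in [low, high]."""
    return [r for r in rows if low <= r[attr] <= high]


def mean_value(rows, target):
    """Exact answer to a mean-value query over the selected rows."""
    return sum(r[target] for r in rows) / len(rows)


def fit_line(rows, x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(r[x] for r in rows) / n
    mean_y = sum(r[y] for r in rows) / n
    sxx = sum((r[x] - mean_x) ** 2 for r in rows)
    sxy = sum((r[x] - mean_x) * (r[y] - mean_y) for r in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic table: y depends linearly on x; z defines the subspace.
rows = [{"z": 0.1 * i, "x": float(i), "y": 2.0 * i + 1.0} for i in range(20)]

subspace = select_subspace(rows, "z", 0.0, 1.05)
print(mean_value(subspace, "y"))
print(fit_line(subspace, "x", "y"))
```

Each query scans every row, which is the cost an approximate, model-based approach aims to avoid.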
Subset selection in dimension reduction methods
Dimension reduction methods play an important role in multivariate statistical analysis, in particular with high-dimensional data. Linear methods can be seen as a linear mapping from the original feature space to a dimension reduction subspace. The aim is to transform the data so that the essential structure is more easily understood. However, highly correlated variables provide redundant information, whereas other features may be irrelevant, and we would like to identify and then discard both kinds while pursuing dimension reduction. Here we propose a greedy search algorithm, which avoids the search over all possible subsets, for ranking subsets of variables based on their ability to explain variation in the dimension reduction variates.
Keywords: Dimension reduction methods, Linear mapping, Subset selection, Greedy search
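A greedy search of this kind can be sketched as forward selection: at each step, add the variable that most increases the fraction of variation explained in a variate, so that p variables need on the order of p² fits rather than 2^p. The version below is a minimal illustration under stated assumptions, not the authors' algorithm: it scores subsets by the R² of a least-squares fit to a single variate, and the function names and data are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def centre(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def r_squared(columns, variate):
    """Fraction of the variate's variation explained by the columns."""
    cols = [centre(c) for c in columns]
    y = centre(variate)
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(gram, rhs)  # assumes the columns are not exactly collinear
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - ss_res / sum(yi * yi for yi in y)


def greedy_rank(data, variate):
    """Return [(variable, cumulative R^2)] in order of greedy selection."""
    remaining, chosen, ranking = set(data), [], []
    while remaining:
        best = max(
            sorted(remaining),
            key=lambda name: r_squared([data[n] for n in chosen + [name]], variate),
        )
        chosen.append(best)
        remaining.remove(best)
        ranking.append((best, r_squared([data[n] for n in chosen], variate)))
    return ranking


# Made-up data: the variate is x1 + 0.5 * x2; x3 is unrelated to it.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
variate = [a + 0.5 * b for a, b in zip(data["x1"], data["x2"])]
print(greedy_rank(data, variate))
```

Variables that add nothing to the cumulative R² once the earlier ones are included are candidates for removal as redundant or irrelevant.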
The discriminative functional mixture model for a comparative analysis of bike sharing systems
Bike sharing systems (BSSs) have become a means of sustainable intermodal
transport and are now proposed in many cities worldwide. Most BSSs also provide
open access to their data, particularly to real-time status reports on their
bike stations. The analysis of the mass of data generated by such systems is of
particular interest to BSS providers to update system structures and policies.
This work was motivated by interest in analyzing and comparing several European
BSSs to identify common operating patterns in BSSs and to propose practical
solutions to avoid potential issues. Our approach relies on the identification
of common patterns between and within systems. To this end, a model-based
clustering method, called FunFEM, for time series (or more generally functional
data) is developed. It is based on a functional mixture model that allows the
clustering of the data in a discriminative functional subspace. In this
context, the model has the advantage of being parsimonious and of allowing the
visualization of the clustered systems. Numerical experiments confirm the good
behavior of FunFEM, particularly compared to state-of-the-art methods. The
application of FunFEM to BSS data from JCDecaux and the Transport for London
Initiative allows us to identify 10 general patterns, including pathological
ones, and to propose practical improvement strategies based on the system
comparison. The visualization of the clustered data within the discriminative
subspace turns out to be particularly informative regarding the system
efficiency. The proposed methodology is implemented in a package for the R
software, named funFEM, which is available on CRAN. The package also
provides a subset of the data analyzed in this work.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS861 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org