Unweighted regression models perform better than weighted regression techniques for respondent-driven sampling data: results from a simulation study
Background: It is unclear whether weighted or unweighted regression is preferred for analysing data derived from respondent-driven sampling (RDS). Our objective was to evaluate the validity of various regression models, with and without weights and with various controls for clustering, in estimating the risk of group membership from data collected using RDS.
Methods: Twelve networked populations, with varying levels of homophily and prevalence and based on a known distribution of a continuous predictor, were simulated, and 1000 RDS samples were drawn from each population. Weighted and unweighted binomial and Poisson generalized linear models, with and without various clustering controls and standard error adjustments, were fitted to each sample and evaluated with respect to validity, bias and coverage rate. Population prevalence was also estimated.
Results: In the regression analysis, the unweighted log-link (Poisson) models maintained the nominal type-I error rate across all populations. Bias was substantial and type-I error rates were unacceptably high for weighted binomial regression. Coverage rates for the estimation of prevalence were highest using RDS-weighted logistic regression, except at low prevalence (10%), where unweighted models are recommended.
Conclusions: Caution is warranted when undertaking regression analysis of RDS data. Even when reported degree is accurate, low reported degree can unduly influence regression estimates. Unweighted Poisson regression is therefore recommended. York University Librarie
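The remark that a low reported degree can unduly influence estimates can be illustrated with a small sketch. The abstract does not say which RDS weights were used, so the inverse-degree weighting below (in the style of the Volz-Heckathorn "RDS-II" estimator) is an assumption, and the function names and the toy data are hypothetical.

```python
def unweighted_prevalence(outcomes):
    """Plain sample proportion of a 0/1 outcome."""
    return sum(outcomes) / len(outcomes)


def inverse_degree_prevalence(outcomes, degrees):
    """Prevalence estimate with weights proportional to 1 / reported degree.

    Assumption: this inverse-degree scheme stands in for whatever RDS
    weights the study actually used.
    """
    if len(outcomes) != len(degrees):
        raise ValueError("outcomes and degrees must have the same length")
    if any(d <= 0 for d in degrees):
        raise ValueError("reported degrees must be positive")
    weights = [1.0 / d for d in degrees]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


# Hypothetical sample: the first respondent reports a degree of 1.
toy_outcomes = [1, 0, 0, 0, 1, 0]
toy_degrees = [1, 10, 12, 8, 15, 10]

plain_estimate = unweighted_prevalence(toy_outcomes)
weighted_estimate = inverse_degree_prevalence(toy_outcomes, toy_degrees)
print(plain_estimate, weighted_estimate)
```

In this toy sample a single respondent with reported degree 1 carries most of the total weight, so the weighted estimate moves far from the plain proportion even though no degree is misreported.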
Graph Laplacians and their convergence on random neighborhood graphs
Given a sample from a probability measure with support on a submanifold in
Euclidean space, one can construct a neighborhood graph which can be seen as an
approximation of the submanifold. The graph Laplacian of such a graph is used
in several machine learning methods like semi-supervised learning,
dimensionality reduction and clustering. In this paper we determine the
pointwise limit of three different graph Laplacians used in the literature as
the sample size increases and the neighborhood size approaches zero. We show
that for a uniform measure on the submanifold all graph Laplacians have the
same limit up to constants. However, in the case of a non-uniform measure on the
submanifold only the so-called random walk graph Laplacian converges to the
weighted Laplace-Beltrami operator. Comment: Improved presentation, typos corrected, to appear in JML
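The abstract does not name the three graph Laplacians it compares. As a minimal sketch, assuming they are the commonly used unnormalized (D - W), random walk (I - D^-1 W) and symmetric normalized (I - D^-1/2 W D^-1/2) forms, they can be built from a weight matrix as follows. The function name and the toy graph are hypothetical.

```python
import math


def graph_laplacians(weights):
    """Return (unnormalized, random_walk, normalized) Laplacians.

    `weights` is a symmetric, non-negative adjacency matrix given as a
    list of lists. Every vertex must have positive degree.
    """
    n = len(weights)
    degrees = [float(sum(row)) for row in weights]
    if any(d <= 0 for d in degrees):
        raise ValueError("every vertex needs positive degree")

    # Unnormalized: L = D - W
    unnormalized = [
        [(degrees[i] if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]
    # Random walk: L_rw = I - D^-1 W (each row of L divided by its degree)
    random_walk = [
        [unnormalized[i][j] / degrees[i] for j in range(n)] for i in range(n)
    ]
    # Symmetric normalized: L_sym = I - D^-1/2 W D^-1/2
    normalized = [
        [unnormalized[i][j] / math.sqrt(degrees[i] * degrees[j]) for j in range(n)]
        for i in range(n)
    ]
    return unnormalized, random_walk, normalized


# Toy example: a path graph on three vertices (0 - 1 - 2).
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
lap_un, lap_rw, lap_sym = graph_laplacians(path_graph)
```

On a graph where all degrees are equal the three matrices differ only by a constant factor, which loosely mirrors the uniform-measure statement in the abstract; the differences appear once degrees vary.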
Galaxy clustering with photometric surveys using PDF redshift information
Photometric surveys produce large-area maps of the galaxy distribution, but
with less accurate redshift information than is obtained from spectroscopic
methods. Modern photometric redshift (photo-z) algorithms use galaxy
magnitudes, or colors, that are obtained through multi-band imaging to produce
a probability density function (PDF) for each galaxy in the map. We used
simulated data to study the effect of using different photo-z estimators to
assign galaxies to redshift bins in order to compare their effects on angular
clustering and galaxy bias measurements. We found that if we use the entire
PDF, rather than a single-point (mean or mode) estimate, the deviations are
less biased, especially when using narrow redshift bins. When the redshift bin
widths are , the use of the entire PDF reduces the typical
measurement bias from 5%, when using single point estimates, to 3%. Comment: Matches the MNRAS published version. 19 pages, 19 Figure
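The difference between assigning a galaxy to a redshift bin by a single-point estimate and by its full PDF can be sketched as below. This is an illustration only, not the paper's pipeline: the Gaussian PDF, the grid, the bin edges and all function names are hypothetical choices.

```python
import math


def gaussian_pdf(z, mean, sigma):
    """Gaussian density, used here as a stand-in for a photo-z PDF."""
    return math.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def pdf_bin_weight(z_grid, pdf, z_lo, z_hi):
    """Fraction of the PDF (sampled on a uniform grid) inside [z_lo, z_hi)."""
    total = sum(pdf)
    inside = sum(p for z, p in zip(z_grid, pdf) if z_lo <= z < z_hi)
    return inside / total


def mode_bin_weight(z_grid, pdf, z_lo, z_hi):
    """Single-point assignment: weight 1 if the PDF mode is in the bin, else 0."""
    mode_index = max(range(len(pdf)), key=pdf.__getitem__)
    return 1.0 if z_lo <= z_grid[mode_index] < z_hi else 0.0


# Hypothetical galaxy whose PDF peaks close to the lower edge of the bin.
z_grid = [i * 0.001 for i in range(2001)]  # redshift grid from 0 to 2
pdf = [gaussian_pdf(z, mean=0.52, sigma=0.05) for z in z_grid]

weight_from_pdf = pdf_bin_weight(z_grid, pdf, 0.5, 0.6)
weight_from_mode = mode_bin_weight(z_grid, pdf, 0.5, 0.6)
print(weight_from_pdf, weight_from_mode)
```

The mode-based assignment counts this galaxy fully in the bin, while the PDF-based weight counts only the part of its probability that lies inside the bin edges. The gap between the two grows as bins become narrow relative to the PDF width.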
The Weak Clustering of Gas-Rich Galaxies
We examine the clustering properties of HI-selected galaxies through an
analysis of the HI Parkes All-Sky Survey Catalogue (HICAT) two-point
correlation function. Various sub-samples are extracted from this catalogue to
study the overall clustering of HI-rich galaxies and its dependence on
luminosity, HI gas mass and rotational velocity. These samples cover the entire
southern sky Dec < 0 deg, containing up to 4,174 galaxies over the radial
velocity range 300-12,700 km/s. A scale length of r_0 = 3.45 +/- 0.25 Mpc/h and
slope of gamma = 1.47 +/- 0.08 is obtained for the HI-rich galaxy real-space
correlation function, making gas-rich galaxies among the most weakly clustered
objects known. HI-selected galaxies also exhibit weaker clustering than
optically selected galaxies of comparable luminosities. Good agreement is found
between our results and those of synthetic HI-rich galaxy catalogues generated
from the Millennium Run CDM simulation. Bisecting HICAT using different
parameter cuts, clustering is found to depend most strongly on rotational
velocity and luminosity, while the dependency on HI mass is marginal. Splitting
the sample around v_rot = 108 km/s, a scale length of r_0 = 2.86 +/- 0.46 Mpc/h
is found for galaxies with low rotational velocities compared to r_0 = 3.96 +/-
0.33 Mpc/h for the high rotational velocity sample. Comment: Accepted for publication in the Astrophysical Journa
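The scale length and slope quoted above are usually read as parameters of a power-law correlation function, xi(r) = (r / r_0)^(-gamma). The sketch below assumes that form. The abstract gives no slope for the two rotational-velocity subsamples, so reusing gamma = 1.47 for them is an assumption made purely for illustration, and the function name is hypothetical.

```python
def xi_power_law(r, r0, gamma):
    """Power-law two-point correlation function xi(r) = (r / r0) ** (-gamma).

    r and r0 are in the same units (Mpc/h here).
    """
    if r <= 0 or r0 <= 0:
        raise ValueError("r and r0 must be positive")
    return (r / r0) ** (-gamma)


# Values quoted in the abstract for the full HI-selected sample.
R0_FULL, GAMMA_FULL = 3.45, 1.47

# By construction the correlation function equals 1 at r = r0.
xi_at_r0 = xi_power_law(R0_FULL, R0_FULL, GAMMA_FULL)

# Assumption: the same slope is reused for both rotational-velocity subsamples.
xi_low_vrot = xi_power_law(1.0, 2.86, GAMMA_FULL)
xi_high_vrot = xi_power_law(1.0, 3.96, GAMMA_FULL)
print(xi_at_r0, xi_low_vrot, xi_high_vrot)
```

At a fixed separation and slope, a larger r_0 gives a larger xi, which is why the high rotational velocity sample is described as more strongly clustered.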