
    Document Similarity from Vector Space Densities

    We propose a computationally light method for estimating similarities between text documents, which we call the density similarity (DS) method. The method is based on a word embedding in a high-dimensional Euclidean space and on kernel regression, and it takes into account semantic relations among words. We find that the accuracy of this method is virtually the same as that of a state-of-the-art method, while the gain in speed is substantial. Additionally, we introduce generalized versions of the top-k accuracy metric and of the Jaccard metric of agreement between similarity models. Comment: 12 pages, 3 figures

    On the First Crossing of Two Boundaries by an Order Statistics Risk Process

    We derive a closed-form expression for the probability that a non-decreasing, pure-jump stochastic risk process with the order statistics (OS) property will not exit the strip between two non-decreasing, possibly discontinuous, time-dependent boundaries within a finite time interval. The result yields new expressions for the ruin probability in the insurance and the dual risk models with dependence between the claim severities or capital gains, respectively.

    Tests of Statistical Methods for Estimating Galaxy Luminosity Function and Applications to the Hubble Deep Field

    We studied statistical methods for estimating the luminosity function (LF) of galaxies. We focused on four nonparametric estimators: the $1/V_{\rm max}$ estimator, the maximum-likelihood estimator of Efstathiou et al. (1988), Chołoniewski's estimator, and the improved Lynden-Bell estimator. The performance of the $1/V_{\rm max}$ estimator has recently been questioned, especially for faint-end estimation of the LF. We improved these estimators for studies of the distant Universe and examined their performance for various classes of functional forms using Monte Carlo simulations. We also applied these estimation methods to the mock 2dF redshift survey catalog prepared by Cole et al. (1998). We found that the $1/V_{\rm max}$ estimator yields a completely unbiased result if there is no inhomogeneity, but is not robust against clusters or voids. This is consistent with well-known results, and we did not confirm the bias trend of the $1/V_{\rm max}$ estimator claimed by Willmer (1997) for a homogeneous sample. We also found that the other three maximum-likelihood type estimators are quite robust and give mutually consistent results. In practice we recommend Chołoniewski's estimator for two reasons: (1) it simultaneously provides the shape and normalization of the LF; (2) it is the fastest of the four estimators because of its algorithmic simplicity. We then analyzed the photometric redshift data of the Hubble Deep Field prepared by Fernández-Soto et al. (1999) using the above four methods. We also derived the luminosity density $\rho_{\rm L}$ in the $B$ and $I$ bands. Our $B$-band estimate is roughly consistent with that of Sawicki, Lin, & Yee (1997), but a few times lower at $2.0 < z < 3.0$. The evolution of $\rho_{\rm L}(I)$ is found to be less prominent. Comment: To appear in ApJS July 2000 issue. 36 pages
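
    As a rough illustration of the $1/V_{\rm max}$ idea named in this abstract, the sketch below weights each galaxy by the inverse of the largest volume in which it would still pass the survey's apparent-magnitude limit. It is a minimal textbook version, not the improved estimator of the paper: it assumes Euclidean geometry with no K-correction, evolution, or cosmological distance measure, and the function name and parameters (`abs_mag`, `m_lim`, `solid_angle_sr`, `bin_edges`) are hypothetical.

```python
import numpy as np


def vmax_luminosity_function(abs_mag, m_lim, solid_angle_sr, bin_edges):
    """Binned 1/Vmax estimate of the LF, in galaxies per Mpc^3 per magnitude.

    Assumptions (illustrative only): Euclidean space, no K-correction,
    a single apparent-magnitude limit m_lim, survey area given in steradians.
    """
    abs_mag = np.asarray(abs_mag, dtype=float)
    bin_edges = np.asarray(bin_edges, dtype=float)

    # Distance modulus m - M = 5 log10(d / 10 pc), solved for the distance
    # at which a galaxy of absolute magnitude M reaches m_lim (pc -> Mpc).
    d_max_mpc = 10.0 ** ((m_lim - abs_mag + 5.0) / 5.0) / 1.0e6

    # Volume of a cone with the survey's solid angle out to d_max.
    v_max = solid_angle_sr / 3.0 * d_max_mpc ** 3

    # Sum the 1/Vmax weights in each magnitude bin, then divide by bin width.
    weighted_counts, _ = np.histogram(abs_mag, bins=bin_edges, weights=1.0 / v_max)
    return weighted_counts / np.diff(bin_edges)
```

    Because every galaxy is weighted using a volume computed as if the density were uniform, this form inherits the sensitivity to clusters and voids that the abstract reports.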

    Using Topological Statistics to Detect Determinism in Time Series

    Statistical differentiability of the measure along the reconstructed trajectory is a good candidate to quantify determinism in time series. The procedure is based upon a formula that explicitly shows the sensitivity of the measure to stochasticity. Numerical results for partially surrogated time series and for series derived from several stochastic models illustrate the usefulness of the proposed method. The method is also shown to work for high-dimensional systems and experimental time series. Comment: 23 RevTeX pages, 14 eps figures. To appear in Physical Review

    Color, 3D simulated images with shapelets

    We present a method to simulate color, 3-dimensional images taken with a space-based observatory by building on the established shapelets pipeline. The simulated galaxies exhibit complex morphologies, which are realistically correlated between, and include, known redshifts. The simulations are created using galaxies from the 4 optical and near-infrared bands (B, V, i and z) of the Hubble Ultra Deep Field (UDF) as a basis set to model morphologies and redshift. We include observational effects such as sky noise and pixelization and can add astronomical signals of interest such as weak gravitational lensing. The realism of the simulations is demonstrated by comparing their morphologies to the original UDF galaxies and by comparing their distribution of ellipticities as a function of redshift and magnitude to wider HST COSMOS data. These simulations have already been useful for calibrating multicolor image analysis techniques and for better optimizing the design of proposed space telescopes. Comment: 14 pages, 15 figures, accepted to Astroparticle Physics

    Kernel-based methods for combining information of several frame surveys

    A sample selected from a single sampling frame may not adequately represent the entire population. Multiple frame surveys are increasingly used by statistical agencies and private organizations, in particular in situations where several sampling frames may provide better coverage or can reduce sampling costs for estimating population quantities of interest. Auxiliary information available at the population level is often categorical in nature, so incorporating both categorical and continuous information can improve the efficiency of the estimation method. Nonparametric regression methods represent a widely used and flexible estimation approach in the survey context. We propose a kernel regression estimator for dual frame surveys that can handle both continuous and categorical data. This methodology is extended to multiple frame surveys. We derive theoretical properties of the proposed methods, and numerical experiments indicate that the proposed estimator performs well in practical settings under different scenarios. Ministerio de Economía y Competitividad; Consejería de Economía, Innovación, Ciencia y Empleo
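
    For readers unfamiliar with kernel regression, the building block this abstract refers to, the sketch below shows a plain Nadaraya-Watson estimator with a Gaussian kernel on one continuous covariate. It is only a generic illustration: it does not include survey design weights, categorical covariates, or the dual-frame adjustments of the proposed estimator, and the function name and parameters are hypothetical.

```python
import numpy as np


def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Gaussian-kernel Nadaraya-Watson regression on a 1-D covariate.

    Each prediction is a weighted average of y_train, with weights that
    decay with the distance between the query point and each x_train.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Scaled pairwise differences, shape (n_query, n_train).
    scaled = (x_query[:, None] - x_train[None, :]) / bandwidth

    # Unnormalised Gaussian kernel weights; the constant factor cancels
    # in the ratio below.
    weights = np.exp(-0.5 * scaled ** 2)

    # Weighted average of the responses for every query point.
    return weights @ y_train / weights.sum(axis=1)
```

    A smaller `bandwidth` follows the data more closely at the cost of higher variance; query points very far from all training points can produce a zero denominator, which this sketch does not guard against.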

    Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties

    Context. At about 1000 days after the launch of Gaia we present the first Gaia data release, Gaia DR1, consisting of astrometry and photometry for over 1 billion sources brighter than magnitude 20.7. Aims. A summary of Gaia DR1 is presented along with illustrations of the scientific quality of the data, followed by a discussion of the limitations due to the preliminary nature of this release. Methods. The raw data collected by Gaia during the first 14 months of the mission have been processed by the Gaia Data Processing and Analysis Consortium (DPAC) and turned into an astrometric and photometric catalogue. Results. Gaia DR1 consists of three components: a primary astrometric data set which contains the positions, parallaxes, and mean proper motions for about 2 million of the brightest stars in common with the HIPPARCOS and Tycho-2 catalogues (a realisation of the Tycho-Gaia Astrometric Solution, TGAS) and a secondary astrometric data set containing the positions for an additional 1.1 billion sources. The second component is the photometric data set, consisting of mean G-band magnitudes for all sources. The G-band light curves and the characteristics of ∼3000 Cepheid and RR Lyrae stars, observed at high cadence around the south ecliptic pole, form the third component. For the primary astrometric data set the typical uncertainty is about 0.3 mas for the positions and parallaxes, and about 1 mas yr$^{-1}$ for the proper motions. A systematic component of ∼0.3 mas should be added to the parallax uncertainties. For the subset of ∼94 000 HIPPARCOS stars in the primary data set, the proper motions are much more precise at about 0.06 mas yr$^{-1}$. For the secondary astrometric data set, the typical uncertainty of the positions is ∼10 mas. The median uncertainties on the mean G-band magnitudes range from the mmag level to ∼0.03 mag over the magnitude range 5 to 20.7. Conclusions. Gaia DR1 is an important milestone ahead of the next Gaia data release, which will feature five-parameter astrometry for all sources. Extensive validation shows that Gaia DR1 represents a major advance in the mapping of the heavens and the availability of basic stellar data that underpin observational astrophysics. Nevertheless, the very preliminary nature of this first Gaia data release does lead to a number of important limitations to the data quality which should be carefully considered before drawing conclusions from the data.