The quest for a donor: probability-based methods offer help
When a patient in need of a stem cell transplant has no compatible donor within his or her closest family, and no matched unrelated donor can be found, a remaining option is to search within the patient's extended family. This situation often arises when the patient belongs to an ethnic minority, originates from a country that lacks a well-developed stem cell donor program, and has HLA haplotypes that are rare in his or her country of residence. Searching within the extended family can be time-consuming and expensive, so tools that calculate the probability of a match within groups of untested relatives would facilitate the search. We present a general approach to calculating the probability of a match in a given relative, or group of relatives, based on the pedigree and on knowledge of the genotypes of some of the individuals. The method extends previous approaches by allowing pedigrees to be consanguineous and arbitrarily complex, with deviations from Hardy-Weinberg equilibrium. We show that this extension has a considerable effect on results, in particular for rare haplotypes. The methods are exemplified using freeware programs to solve a case of practical importance.
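As a concrete, hedged illustration of the quantity involved (not the authors' exact method, which propagates probabilities through the pedigree analytically), the Python sketch below estimates the match probability by gene-dropping Monte Carlo with rejection on the observed typings. The haplotype labels, frequencies, and two-sibling pedigree are hypothetical, and, unlike the paper's method, founders here are drawn under Hardy-Weinberg equilibrium.

```python
import random

HAPLOS = ["A", "B", "C"]          # toy haplotype labels (hypothetical)
FREQS  = [0.6, 0.3, 0.1]          # their population frequencies (hypothetical)

PEDIGREE = {                      # child -> (father, mother); founders omitted
    "patient": ("father", "mother"),
    "sibling": ("father", "mother"),
}

KNOWN = {"patient": ("A", "B")}   # observed unordered haplotype pairs

def drop_genes():
    """One gene-dropping pass: founders draw two haplotypes from population
    frequencies; children inherit one haplotype from each parent at random."""
    geno = {}
    def get(person):
        if person in geno:
            return geno[person]
        if person in PEDIGREE:
            f, m = PEDIGREE[person]
            geno[person] = (random.choice(get(f)), random.choice(get(m)))
        else:  # founder: Hardy-Weinberg draw (the paper relaxes this)
            geno[person] = (random.choices(HAPLOS, FREQS)[0],
                            random.choices(HAPLOS, FREQS)[0])
        return geno[person]
    for p in set(PEDIGREE) | {q for pr in PEDIGREE.values() for q in pr}:
        get(p)
    return geno

def match_probability(target, n=100_000):
    """P(target's unordered haplotype pair equals the patient's),
    conditioned on KNOWN typings via rejection sampling."""
    hits = kept = 0
    for _ in range(n):
        g = drop_genes()
        if any(tuple(sorted(g[p])) != tuple(sorted(h)) for p, h in KNOWN.items()):
            continue  # reject draws inconsistent with the observed typings
        kept += 1
        hits += tuple(sorted(g[target])) == tuple(sorted(g["patient"]))
    return hits / kept if kept else float("nan")

if __name__ == "__main__":
    print(f"P(sibling matches patient) ~ {match_probability('sibling'):.3f}")
```

For a heterozygous patient with untyped parents, the estimate should land near the classical sibling figure of 1/4 plus a small excess from haplotypes shared by chance through the population; richer pedigrees are handled by editing PEDIGREE and KNOWN.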
Bayesian inference from photometric redshift surveys
We show how to enhance the redshift accuracy of surveys consisting of tracers with highly uncertain positions along the line of sight. Photometric surveys with redshift uncertainty delta_z ~ 0.03 can yield final redshift uncertainties of delta_z_f ~ 0.003 in high-density regions. This increased redshift precision is achieved by imposing an isotropy and 2-point correlation prior in a Bayesian analysis and is completely independent of the process that estimates the photometric redshifts. As a byproduct, the method also infers the three-dimensional density field, essentially super-resolving high-density regions in redshift space. Our method fully takes into account the survey mask and selection function. It uses a simplified Poissonian picture of galaxy formation, relating the preferred locations of galaxies to regions of higher density in the matter field. The method quantifies the remaining uncertainties in the three-dimensional density field and in the true radial locations of galaxies by generating samples that are constrained by the survey data. The exploration of this high-dimensional, non-Gaussian joint posterior is made feasible by multiple-block Metropolis-Hastings sampling. We demonstrate the performance of our implementation on a simulation containing 2.0 x 10^7 galaxies. These results bear out the promise of Bayesian analysis for upcoming photometric large-scale structure surveys with tens of millions of galaxies.
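As an illustration of the blockwise Metropolis-Hastings exploration described above, here is a 1-D toy sketch: a log-density field drawn from a Gaussian correlation prior (a stand-in for the isotropy and 2-point correlation prior), observed through Poisson galaxy counts, and sampled block by block. Everything here is hypothetical and heavily simplified; in particular the survey mask, selection function, and photometric redshift errors are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_exp_cov(n, ell=6.0, var=0.5):
    """Squared-exponential prior covariance (toy 2-point correlation prior)."""
    x = np.arange(n, dtype=float)
    return var * np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2) \
        + 1e-8 * np.eye(n)

N, NBAR = 64, 20.0                                # cells, mean counts per cell
C = sq_exp_cov(N)
Cinv = np.linalg.inv(C)
delta_true = rng.multivariate_normal(np.zeros(N), C)
counts = rng.poisson(NBAR * np.exp(delta_true))   # observed galaxy counts

def log_post(d):
    """Log posterior: Poisson likelihood of counts + Gaussian field prior."""
    lam = NBAR * np.exp(d)
    return np.sum(counts * np.log(lam) - lam) - 0.5 * d @ Cinv @ d

def block_mh(n_steps=2000, block=8, step=0.05):
    """Cycle over contiguous blocks, proposing Gaussian moves per block."""
    d = np.zeros(N)
    lp = log_post(d)
    samples = []
    for t in range(n_steps):
        for lo in range(0, N, block):
            prop = d.copy()
            prop[lo:lo + block] += step * rng.standard_normal(min(block, N - lo))
            lp_prop = log_post(prop)
            if np.log(rng.random()) < lp_prop - lp:   # Metropolis acceptance
                d, lp = prop, lp_prop
        if t > n_steps // 2:                          # keep post-burn-in samples
            samples.append(d.copy())
    return np.array(samples)

samples = block_mh()
print("posterior mean vs truth, first 5 cells:")
print(np.round(samples.mean(0)[:5], 2), np.round(delta_true[:5], 2))
```

The retained samples quantify the remaining per-cell uncertainty, the same role the data-constrained samples play in the paper, with high-count (high-density) cells pinned down far more tightly than empty ones.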
A Deep Embedding Model for Co-occurrence Learning
Co-occurrence data is a common and important information source in many areas, such as word co-occurrence in sentences, friend co-occurrence in social networks, and product co-occurrence in commercial transaction data; such data contain rich correlation and clustering information about the items. In this paper, we study co-occurrence data using a general energy-based probabilistic model, and we analyze three categories of energy-based models that capture different levels of dependency in the co-occurrence data. We also discuss how several typical existing models relate to these three types of energy models, including the Fully Visible Boltzmann Machine (FVBM), Matrix Factorization, Log-BiLinear (LBL) models, and the Restricted Boltzmann Machine (RBM). We then propose a Deep Embedding Model (DEM) derived from the energy model in a principled manner. Furthermore, motivated by the observation that the partition function of the energy model is intractable, and by the fact that the main objective of modeling co-occurrence data is prediction via conditional probabilities, we apply the maximum pseudo-likelihood method to learn the DEM. The resulting model and learning method naturally avoid these difficulties, and the conditional probability needed for prediction is easy to compute. Interestingly, our method is equivalent to training a specially structured deep neural network with back-propagation and a special sampling strategy, which makes it scalable to large datasets. Finally, our experiments show that the DEM achieves results comparable to or better than state-of-the-art methods on datasets from several application domains.
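To make the maximum pseudo-likelihood idea concrete, the sketch below applies it to the simplest model the abstract mentions, a Fully Visible Boltzmann Machine, rather than to the DEM itself. The pseudo-likelihood replaces the intractable joint likelihood with the product of conditionals p(v_i | v_{-i}), so the partition function never appears, and the learned conditionals are exactly what prediction needs. The toy data and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_fvbm_pseudolikelihood(V, n_epochs=200, lr=0.05):
    """Fit a fully visible Boltzmann machine E(v) = -0.5 v'Wv - b'v on binary
    data V (n_samples x n_items) by gradient ascent on the pseudo-likelihood
    sum_i log p(v_i | v_{-i}); no partition function is ever evaluated."""
    n, d = V.shape
    W = np.zeros((d, d))                  # symmetric couplings, zero diagonal
    b = np.zeros(d)
    for _ in range(n_epochs):
        A = V @ W + b                     # conditional logits per unit
        R = V - sigmoid(A)                # residuals v_i - p(v_i=1 | v_-i)
        gW = (R.T @ V + V.T @ R) / n      # symmetrized gradient wrt W
        np.fill_diagonal(gW, 0.0)
        W += lr * gW
        b += lr * R.mean(axis=0)
    return W, b

# Toy co-occurrence data (hypothetical): item 1 mostly co-occurs with item 0.
V = (rng.random((500, 4)) < [[0.5, 0.5, 0.2, 0.2]]).astype(float)
V[:, 1] = np.where(rng.random(500) < 0.9, V[:, 0], V[:, 1])

W, b = fit_fvbm_pseudolikelihood(V)
print("learned coupling W[0,1] =", round(W[0, 1], 3))  # expected positive

# Prediction uses the learned conditionals directly, e.g.
# p(item 1 present | other items) = sigmoid((W @ v)[1] + b[1]).
```

The same objective carries over to deeper energy models: each conditional becomes the output of a structured network, which is why learning reduces to back-propagation as the abstract notes.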
Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an unsupervised decomposition of a network into structural groups based on statistical regularities in the network's connectivity. Although many methods exist, the No Free Lunch theorem for community detection implies that each makes some kind of tradeoff, and no algorithm can be optimal on all inputs. Thus, different algorithms will over- or underfit on different inputs, finding more, fewer, or simply different communities than is optimal, and evaluation methods that use a metadata partition as ground truth will produce misleading conclusions about general accuracy. Here, we present a broad evaluation of over- and underfitting in community detection, comparing the behavior of 16 state-of-the-art community detection algorithms on a novel and structurally diverse corpus of 406 real-world networks. We find that (i) algorithms vary widely, given the same input, both in the number of communities they find and in their composition; (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks; and (iii) these differences induce wide variation in accuracy on link-prediction and link-description tasks. We introduce a new diagnostic for evaluating overfitting and underfitting in practice, and use it to roughly divide community detection methods into general and specialized learning algorithms. Across methods and inputs, Bayesian techniques based on the stochastic block model and a minimum-description-length approach to regularization represent the best general learning approach, but they can be outperformed in specific circumstances. These results provide both a theoretically principled approach to evaluating over- and underfitting in models of network community structure and a realistic benchmark against which new methods may be evaluated and compared.
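The link-prediction protocol used for evaluation can be sketched as follows: hold out a fraction of edges, run a community detection algorithm on the remainder, and measure how well "same community" separates held-out edges from non-edges. In this illustrative sketch, greedy modularity maximization stands in for the paper's 16 algorithms, the AUC is estimated by pairwise comparisons, and all parameters are hypothetical.

```python
import random
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

random.seed(0)

def link_prediction_auc(G, holdout_frac=0.1, n_pairs=2000):
    """Hold out edges, detect communities on the rest, then estimate the AUC:
    how often a held-out (true) edge outscores a random non-edge when a pair
    scores 1 if its endpoints share a community and 0 otherwise. AUC ~ 0.5
    means the partition carries no link information."""
    edges = list(G.edges())
    random.shuffle(edges)
    held_out = edges[: int(holdout_frac * len(edges))]
    G_train = G.copy()
    G_train.remove_edges_from(held_out)

    # Community detection on the training graph (modularity maximization is
    # a stand-in here; the paper compares 16 different algorithms).
    comms = greedy_modularity_communities(G_train)
    label = {v: i for i, c in enumerate(comms) for v in c}
    score = lambda u, v: 1.0 if label.get(u, -1) == label.get(v, -2) else 0.0

    nodes = list(G)
    non_edges = []
    while len(non_edges) < n_pairs:          # sample true non-edges of G
        u, v = random.sample(nodes, 2)
        if not G.has_edge(u, v):
            non_edges.append((u, v))

    wins = ties = 0
    k = min(50, len(non_edges))
    for (u, v) in held_out:
        for (x, y) in random.sample(non_edges, k):
            s_true, s_false = score(u, v), score(x, y)
            wins += s_true > s_false
            ties += s_true == s_false
    return (wins + 0.5 * ties) / (len(held_out) * k)

G = nx.karate_club_graph()
print(f"link-prediction AUC: {link_prediction_auc(G):.2f}")
```

An overfitting method (too many tiny communities) drives the true-edge scores toward 0, while an underfitting one (too few large communities) drives the non-edge scores toward 1; either failure pulls the AUC back toward 0.5, which is what makes this a usable diagnostic.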