Fast computation of distance estimators
BACKGROUND: Some distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n^3). Unfortunately, the fastest practical algorithms known for computing the distance matrix from n sequences of length l take time proportional to l·n^2. Since the sequence length is typically much larger than the number of taxa, distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in the reconstruction of large phylogenies and in applications where many trees have to be reconstructed, e.g., bootstrapping and genome-wide applications. RESULTS: We give an advanced algorithm for computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity symbols. This new method is shown to be more accurate as well as faster than earlier methods. CONCLUSION: Our novel algorithm for computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real-world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.
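To make the l·n^2 cost concrete, here is a minimal sketch of the straightforward baseline: every pair of aligned sequences is compared site by site. It is not the fast algorithm the abstract announces. The function names and the choice of the Jukes-Cantor correction are assumptions made for illustration, and ambiguity symbols are treated as ordinary characters.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ (O(l) work)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined (infinite) once p reaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Symmetric n-by-n matrix built from n(n-1)/2 pairwise scans of length l."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = jukes_cantor(p_distance(seqs[i], seqs[j]))
    return d
```

With n sequences of length l, the double loop performs n(n-1)/2 scans of l sites each, which is the cost the abstract identifies as the bottleneck ahead of an O(n^3) tree-building step when l is much larger than n.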
A new V-fold type procedure based on robust tests
We define a general V-fold cross-validation type method based on robust
tests, which is an extension of the hold-out defined by Birgé [7, Section
9]. We give some theoretical results showing that, under some weak assumptions
on the considered statistical procedures, our selected estimator satisfies an
oracle type inequality. We also introduce a fast algorithm that implements our
method. Moreover we show in our simulations that this V-fold performs generally
well for estimating a density for different sample sizes, and can handle
well-known problems, such as binwidth selection for histograms or bandwidth
selection for kernels. We finally provide a comparison with other classical
V-fold methods and study empirically the influence of the value of V on the
risk.
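For orientation, the sketch below shows a classical V-fold cross-validation for choosing the number of histogram bins, scored by held-out log-likelihood. This is one of the classical schemes the abstract compares against, not the robust-test procedure it proposes. All names, the density floor, and the scoring rule are assumptions made for illustration.

```python
import math
import random


def histogram_density(train, bins, lo, hi):
    """Piecewise-constant density estimate on [lo, hi] with equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in train:
        k = min(int((x - lo) / width), bins - 1)  # clamp x == hi into last bin
        counts[k] += 1
    n = len(train)
    return [c / (n * width) for c in counts], width


def heldout_loglik(train, test, bins, lo, hi, floor=1e-12):
    """Mean log-density of the test points under the histogram fitted on train."""
    dens, width = histogram_density(train, bins, lo, hi)
    total = 0.0
    for x in test:
        k = min(int((x - lo) / width), bins - 1)
        total += math.log(max(dens[k], floor))  # floor avoids log(0) in empty bins
    return total / len(test)


def vfold_select_bins(data, candidates, v=5, seed=0):
    """Return the candidate bin count with the best average held-out score."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::v] for i in range(v)]  # v disjoint, near-equal folds
    lo, hi = min(data), max(data)
    scores = {}
    for bins in candidates:
        fold_scores = []
        for i in range(v):
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            fold_scores.append(heldout_loglik(train, folds[i], bins, lo, hi))
        scores[bins] = sum(fold_scores) / v
    best = max(scores, key=scores.get)
    return best, scores
```

The sketch assumes at least v data points and a non-degenerate range (min(data) < max(data)). Varying v in the call is a simple way to see empirically how the number of folds affects the selected binwidth.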
Ly-alpha forest: efficient unbiased estimation of second-order properties with missing data
Context. One important step in the statistical analysis of the Ly-alpha
forest data is the study of their second order properties. Usually, this is
accomplished by means of the two-point correlation function or, alternatively,
the K-function. In the computation of these functions it is necessary to take
into account the presence of strong metal line complexes and strong Ly-alpha
lines that can hide part of the Ly-alpha forest and represent a
non-negligible source of bias. Aims. In this work, we show quantitatively
the effects of the gaps introduced in the spectrum by the strong lines if they
are not properly accounted for in the computation of the correlation
properties. We propose a geometric method which is able to solve this problem
and is computationally more efficient than the Monte Carlo (MC) technique that
is typically adopted in Cosmology studies. The method is implemented in two
different algorithms. The first yields exact results, whereas
the second provides approximate results but is computationally very
efficient. The proposed approach can be easily extended to deal with the case
of two or more lists of lines that have to be analyzed at the same time.
Methods. Numerical experiments are presented that illustrate the consequences
of neglecting the effects due to the strong lines and the excellent performance
of the proposed approach. Results. The proposed method markedly
improves the estimates of both the two-point correlation function and the
K-function. Comment: A&A accepted, 12 pages, 15 figures
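As a point of reference, the Monte Carlo baseline mentioned in the abstract can be sketched in one dimension as follows: random points are drawn only outside the masked gaps, and the two-point correlation in a separation bin is estimated with the natural estimator DD/RR - 1 on normalised pair counts. This is not the geometric method the abstract proposes. The function names, the representation of gaps as (start, end) intervals, and the choice of estimator are assumptions made for illustration.

```python
import random


def pair_count(points, r_lo, r_hi):
    """Number of pairs whose separation falls in [r_lo, r_hi)."""
    pts = sorted(points)
    count = 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            sep = pts[j] - pts[i]
            if sep >= r_hi:
                break  # sorted input: later points are even farther away
            if sep >= r_lo:
                count += 1
    return count


def in_gap(x, gaps):
    """True if x lies inside any masked interval (start, end)."""
    return any(a <= x <= b for a, b in gaps)


def sample_outside_gaps(n, length, gaps, rng):
    """Rejection-sample n uniform points on [0, length] that avoid the gaps."""
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, length)
        if not in_gap(x, gaps):
            out.append(x)
    return out


def xi_natural(data, length, gaps, r_lo, r_hi, n_random=2000, seed=0):
    """Natural estimator DD/RR - 1 with the random catalogue respecting the gaps."""
    rng = random.Random(seed)
    rand = sample_outside_gaps(n_random, length, gaps, rng)
    nd, nr = len(data), len(rand)
    dd = pair_count(data, r_lo, r_hi) / (nd * (nd - 1) / 2)
    rr = pair_count(rand, r_lo, r_hi) / (nr * (nr - 1) / 2)
    return dd / rr - 1.0
```

The sketch assumes the gaps do not cover the whole interval and that the random catalogue is large enough for the bin to contain at least one random pair. Its accuracy depends on n_random and its pair counting is quadratic in the worst case, which is the kind of cost a geometric treatment of the gaps aims to avoid.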
A New Estimator of Intrinsic Dimension Based on the Multipoint Morisita Index
The size of datasets has been increasing rapidly both in terms of number of
variables and number of events. As a result, the empty space phenomenon and the
curse of dimensionality complicate the extraction of useful information. But,
in general, data lie on non-linear manifolds of much lower dimension than that
of the spaces in which they are embedded. In many pattern recognition tasks,
learning these manifolds is a key issue and it requires the knowledge of their
true intrinsic dimension. This paper introduces a new estimator of intrinsic
dimension based on the multipoint Morisita index. It is applied to both
synthetic and real datasets of varying complexities and comparisons with other
existing estimators are carried out. The proposed estimator turns out to be
fairly robust to sample size and noise, unaffected by edge effects, able to
handle large datasets and computationally efficient.