
    Fast computation of distance estimators

    BACKGROUND: Distance methods are among the most commonly used methods for reconstructing phylogenetic trees from sequence data. The input to a distance method is a distance matrix containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast, e.g., the famous and popular Neighbor Joining (NJ) algorithm reconstructs a phylogeny of n taxa in time O(n³). Unfortunately, the fastest practical algorithms known for computing the distance matrix from n sequences of length l take time proportional to l·n². Since the sequence length typically is much larger than the number of taxa, distance estimation is the bottleneck in phylogeny reconstruction. This bottleneck is especially apparent in the reconstruction of large phylogenies or in applications where many trees have to be reconstructed, e.g., bootstrapping and genome-wide applications. RESULTS: We give an advanced algorithm for computing the number of mutational events between DNA sequences which is significantly faster than both Phylip and Paup. Moreover, we give a new method for estimating pairwise distances between sequences which contain ambiguity symbols. This new method is shown to be more accurate as well as faster than earlier methods. CONCLUSION: Our novel algorithm for computing distance estimators provides a valuable tool in phylogeny reconstruction. Since the running time of our distance estimation algorithm is comparable to that of most distance methods, the previous bottleneck is removed. All distance methods, such as NJ, require a distance matrix as input and, hence, our novel algorithm significantly improves the overall running time of all distance methods. In particular, we show for real-world biological applications how the running time of phylogeny reconstruction using NJ is improved from a matter of hours to a matter of seconds.
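    For context, here is a minimal Python sketch of the standard l·n² baseline that this work speeds up: estimating all pairwise distances from aligned sequences, whose output matrix is exactly what NJ and other distance methods consume. The Jukes-Cantor model and the toy sequences are illustrative assumptions, not the authors' optimized algorithm.

```python
import itertools
import math

def jukes_cantor_matrix(seqs):
    """Estimate all pairwise distances under the Jukes-Cantor model.

    seqs: list of n aligned DNA strings of equal length l.  The double loop
    over sequence pairs is the O(l * n^2) step discussed in the abstract;
    its output matrix is what NJ and other distance methods take as input.
    """
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        diffs = sum(x != y for x, y in zip(seqs[i], seqs[j]))
        p = diffs / len(seqs[i])                 # observed mismatch proportion
        # Jukes-Cantor correction; undefined (infinite) when p >= 0.75.
        dist = -0.75 * math.log(1.0 - 4.0 * p / 3.0) if p < 0.75 else float("inf")
        d[i][j] = d[j][i] = dist
    return d

# Toy example: three short aligned sequences.
print(jukes_cantor_matrix(["ACGTACGT", "ACGTACGA", "ACGAACGA"]))
```

    The per-pair comparison inside the loop is where the l·n² cost arises; the paper's contribution is a much faster way of counting the mutational events between each pair of sequences.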

    A new V-fold type procedure based on robust tests

    We define a general V-fold cross-validation type method based on robust tests, which is an extension of the hold-out defined by Birgé [7, Section 9]. We give some theoretical results showing that, under some weak assumptions on the considered statistical procedures, our selected estimator satisfies an oracle-type inequality. We also introduce a fast algorithm that implements our method. Moreover, we show in our simulations that this V-fold procedure generally performs well for estimating a density for different sample sizes, and can handle well-known problems, such as binwidth selection for histograms or bandwidth selection for kernels. We finally provide a comparison with other classical V-fold methods and study empirically the influence of the value of V on the risk.
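    As a point of reference, below is a hedged Python sketch of a classical V-fold selection loop for one of the problems mentioned above, choosing the number of histogram bins, scored by held-out log-likelihood. It is not the robust-test procedure of the paper; the scoring rule and candidate grid are illustrative assumptions.

```python
import numpy as np

def vfold_histogram_bins(data, candidate_bins, V=5, seed=None):
    """Select a histogram bin count by classical V-fold cross-validation.

    Each candidate bin count is scored by the log-likelihood of the held-out
    fold under the histogram density fitted on the remaining folds.
    """
    rng = np.random.default_rng(seed)
    data = rng.permutation(np.asarray(data, dtype=float))
    folds = np.array_split(data, V)
    best_bins, best_score = None, -np.inf
    for k in candidate_bins:
        score = 0.0
        for v in range(V):
            test = folds[v]
            train = np.concatenate([folds[u] for u in range(V) if u != v])
            counts, edges = np.histogram(train, bins=k,
                                         range=(data.min(), data.max()))
            width = edges[1] - edges[0]
            dens = counts / (counts.sum() * width)        # histogram density estimate
            idx = np.clip(np.searchsorted(edges, test, side="right") - 1, 0, k - 1)
            score += np.sum(np.log(dens[idx] + 1e-12))    # held-out log-likelihood
        if score > best_score:
            best_bins, best_score = k, score
    return best_bins

# Example: pick a bin count for a Gaussian sample.
sample = np.random.default_rng(0).normal(size=500)
print(vfold_histogram_bins(sample, candidate_bins=[5, 10, 20, 40, 80]))
```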

    Ly-alpha forest: efficient unbiased estimation of second-order properties with missing data

    Context. One important step in the statistical analysis of the Ly-alpha forest data is the study of their second-order properties. Usually, this is accomplished by means of the two-point correlation function or, alternatively, the K-function. In the computation of these functions it is necessary to take into account the presence of strong metal line complexes and strong Ly-alpha lines that can hide part of the Ly-alpha forest and represent a non-negligible source of bias. Aims. In this work, we show quantitatively what the effects of the gaps introduced in the spectrum by the strong lines are if they are not properly accounted for in the computation of the correlation properties. We propose a geometric method which is able to solve this problem and is computationally more efficient than the Monte Carlo (MC) technique that is typically adopted in cosmology studies. The method is implemented in two different algorithms. The first one provides exact results, whereas the second one provides approximate results but is computationally very efficient. The proposed approach can be easily extended to deal with the case of two or more lists of lines that have to be analyzed at the same time. Methods. Numerical experiments are presented that illustrate the consequences of neglecting the effects due to the strong lines and the excellent performance of the proposed approach. Results. The proposed method is able to remarkably improve the estimates of both the two-point correlation function and the K-function. Comment: A&A accepted, 12 pages, 15 figures.
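    To make the gap problem concrete, here is a rough Python sketch (an assumption-laden illustration, not the paper's exact geometric algorithm) of a 1D K-function estimate in which each point's neighbour count is re-weighted by the fraction of its search interval that is actually observable, so that neighbours hidden behind strong-line gaps do not bias the estimate downwards.

```python
import numpy as np

def k_function_1d(lines, r_values, window, gaps=()):
    """1D K-function estimate for line positions, correcting counts for gaps.

    lines: observed line positions; window: (lo, hi) spectral interval;
    gaps: list of (a, b) intervals hidden by strong lines.  For each radius r,
    the neighbour count of a point is divided by the fraction of [x - r, x + r]
    that is observable, a simple geometric correction for the missing data.
    """
    lines = np.sort(np.asarray(lines, dtype=float))
    lo, hi = window
    observed_len = (hi - lo) - sum(b - a for a, b in gaps)  # observable length
    intensity = len(lines) / observed_len

    def observable_fraction(x, r):
        a, b = max(lo, x - r), min(hi, x + r)
        covered = b - a
        for ga, gb in gaps:                         # subtract overlap with each gap
            covered -= max(0.0, min(b, gb) - max(a, ga))
        return covered / (2.0 * r)

    k = []
    for r in r_values:
        total = 0.0
        for x in lines:
            neighbours = np.sum(np.abs(lines - x) <= r) - 1   # exclude the point itself
            total += neighbours / max(observable_fraction(x, r), 1e-12)
        k.append(total / (len(lines) * intensity))
    return np.array(k)
```

    For a homogeneous process with no gaps this reduces to the usual estimator with K(r) ≈ 2r; with gaps, the naive (unweighted) version systematically undercounts pairs, which is the bias discussed above.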

    A New Estimator of Intrinsic Dimension Based on the Multipoint Morisita Index

    The size of datasets has been increasing rapidly both in terms of the number of variables and the number of events. As a result, the empty space phenomenon and the curse of dimensionality complicate the extraction of useful information. In general, however, data lie on non-linear manifolds of much lower dimension than that of the spaces in which they are embedded. In many pattern recognition tasks, learning these manifolds is a key issue, and it requires knowledge of their true intrinsic dimension. This paper introduces a new estimator of intrinsic dimension based on the multipoint Morisita index. It is applied to both synthetic and real datasets of varying complexity, and comparisons with other existing estimators are carried out. The proposed estimator turns out to be fairly robust to sample size and noise, unaffected by edge effects, able to handle large datasets, and computationally efficient.
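    A rough Python sketch of the underlying idea follows, under the simplifying assumption that the two-point (m = 2) Morisita index I_2(δ) scales roughly like δ^(D − E) for points on a D-dimensional manifold embedded in E dimensions, so D can be read off from a log-log slope. The paper's estimator uses the multipoint generalization and a more careful fitting procedure, so the formula and scales below are illustrative only.

```python
import numpy as np

def morisita_dimension(points, scales):
    """Rough intrinsic-dimension sketch via the two-point Morisita index.

    points: (N, E) array.  For each cell size delta, the bounding box is cut
    into a grid, the classical (m = 2) Morisita index
        I_2(delta) = Q * sum_i n_i (n_i - 1) / (N (N - 1))
    is computed over the Q cells, and the dimension is read off from the
    slope of log I_2 versus log delta (assumed here to behave like D - E).
    """
    points = np.asarray(points, dtype=float)
    n_pts, emb_dim = points.shape
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    log_i, log_d = [], []
    for delta in scales:
        n_cells_axis = np.maximum(np.ceil(spans / delta).astype(int), 1)
        idx = np.minimum((points - mins) // delta, n_cells_axis - 1).astype(int)
        _, counts = np.unique(idx, axis=0, return_counts=True)  # points per occupied cell
        q = np.prod(n_cells_axis.astype(float))                 # total number of cells
        i2 = q * np.sum(counts * (counts - 1)) / (n_pts * (n_pts - 1))
        log_i.append(np.log(i2))
        log_d.append(np.log(delta))
    slope = np.polyfit(log_d, log_i, 1)[0]
    return emb_dim + slope                                       # assumed D ≈ E + slope

# Example: points on a curve in 3D should give an estimate much closer to 1 than to 3.
t = np.random.default_rng(1).uniform(0, 1, 2000)
curve = np.column_stack([t, np.sin(4 * t), np.cos(4 * t)])
print(morisita_dimension(curve, scales=[0.05, 0.1, 0.2, 0.4]))
```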