
    Recognizing Treelike k-Dissimilarities

    A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of size-k subsets of X to the real numbers. Such maps naturally arise from edge-weighted trees T with leaf-set X: given a subset Y of X of size k, D(Y) is defined to be the total length of the smallest subtree of T with leaf-set Y. In case k = 2, it is well known that 2-dissimilarities arising in this way can be characterized by the so-called "4-point condition". However, in case k > 2, Pachter and Speyer recently posed the following question: given an arbitrary k-dissimilarity, how do we test whether this map comes from a tree? In this paper, we provide an answer to this question, showing that for k >= 3 a k-dissimilarity on a set X arises from a tree if and only if its restriction to every 2k-element subset of X arises from some tree, and that 2k is the least possible subset size to ensure that this is the case. As a corollary, we show that there exists a polynomial-time algorithm to determine when a k-dissimilarity arises from a tree. We also give a 6-point condition for determining when a 3-dissimilarity arises from a tree, similar to the aforementioned 4-point condition.
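
    As an aside on the k = 2 case: the 4-point condition can be checked by brute force over all quadruples. Below is a minimal Python sketch, assuming the dissimilarity is stored as a dict keyed by two-element frozensets (a representation chosen here for illustration, not taken from the paper). A tree metric requires that, for every quadruple, the two largest of the three pairwise sums coincide.

    from itertools import combinations

    def is_tree_metric(D, X, tol=1e-9):
        # D: dict mapping frozenset({x, y}) to a float; X: sequence of elements.
        d = lambda a, b: D[frozenset((a, b))]
        for x, y, z, w in combinations(X, 4):
            # The three pairwise sums over the quadruple.
            s = sorted([d(x, y) + d(z, w),
                        d(x, z) + d(y, w),
                        d(x, w) + d(y, z)])
            # 4-point condition: the two largest sums must coincide.
            if s[2] - s[1] > tol:
                return False
        return True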

    A flexible framework for sparse simultaneous component based data integration

    Background: High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because the contributions of each of the biomolecules (transcripts, proteins) have to be taken into account.

    Results: We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can easily be transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of different penalties with respect to sparseness across and within data blocks.

    Conclusion: Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses, and second, interpretation of the results is greatly facilitated by their sparseness. The approach offered is flexible and allows the block structure to be taken into account in different ways. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach).

    Availability: The additional file contains a MATLAB implementation of the sparse simultaneous component method.
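
    As a rough illustration of how such a method can be set up, the sketch below alternates a scores update with a soft-thresholded loadings update on the concatenated data blocks. It implements a plain lasso penalty only; the function names and the initialization are assumptions made here, and this is not the authors' MATLAB implementation.

    import numpy as np

    def soft_threshold(A, lam):
        # Elementwise lasso operator: shrink entries toward zero by lam.
        return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

    def sparse_sca(blocks, n_comp=2, lam=0.1, n_iter=200, seed=0):
        # blocks: list of (n_samples, n_vars_k) arrays sharing the same rows.
        X = np.hstack(blocks)                     # concatenated data
        rng = np.random.default_rng(seed)
        T, _ = np.linalg.qr(rng.standard_normal((X.shape[0], n_comp)))
        for _ in range(n_iter):
            P = soft_threshold(X.T @ T, lam)      # sparse loadings update
            U, _, Vt = np.linalg.svd(X @ P, full_matrices=False)
            T = U @ Vt                            # orthogonal Procrustes scores
        return T, P

    A group lasso variant would instead shrink entire blockwise rows of P via their Euclidean norms, producing sparseness across rather than within data blocks.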

    MTV and MGV: Two Criteria for Nonlinear PCA

    MTV (Maximum Total Variance) and MGV (Minimum Generalized Variance) are popular criteria for PCA with optimal scaling. They are adopted by the SAS PRINQUAL procedure and OSMOD (Saito and Otsu, 1988). MTV is an intuitive generalization of the linear PCA criterion. We will show some properties of nonlinear PCA with these criteria in an application to the data of NLSY79 (Zagorsky, 1997), a large panel survey in the U.S. conducted over twenty years. We will show the following: (1) the effectiveness of PCA with optimal scaling as a tool for the analysis of large social research data; useful results can be obtained when it complements analyses by regression models; (2) features of MTV and MGV, especially their abilities and deficiencies in real data analysis.
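
    For intuition about the MTV criterion, here is a toy alternating scheme: variables are quantified numerically, a rank-r PCA fit is computed, and each category's quantification is replaced by the mean fitted value over that category. This is a simplified nominal-scaling sketch under assumptions made here; it is not the PRINQUAL or OSMOD algorithm.

    import numpy as np

    def mtv_pca(codes, n_comp=2, n_iter=50):
        # codes: (n, m) integer array of category codes, one column per variable.
        Q = codes.astype(float)                        # initial quantifications
        for _ in range(n_iter):
            Z = (Q - Q.mean(0)) / (Q.std(0) + 1e-12)   # standardize columns
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            Zhat = (U[:, :n_comp] * s[:n_comp]) @ Vt[:n_comp]  # rank-r PCA fit
            for j in range(codes.shape[1]):            # rescoring step
                for c in np.unique(codes[:, j]):
                    mask = codes[:, j] == c
                    Q[mask, j] = Zhat[mask, j].mean()  # category mean of the fit
        return (Q - Q.mean(0)) / (Q.std(0) + 1e-12)    # final quantified matrix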

    Interpreting degenerate solutions in unfolding by use of the vector model and the compensatory distance model

    In this paper, we reconsider the merits of unfolding solutions based on loss functions involving a normalization on the variance per subject. In the literature, solutions based on Stress-2 are often diagnosed to be degenerate in the majority of cases. Here, the focus lies on two frequently occurring types of degeneracies. The first type typically locates some subject points far away from a compact cluster of the other points. In the second type of solution, the object points lie on a circle. In this paper, we argue that these degenerate solutions are well fitting and informative. To reveal the information, we introduce mixtures of plots based on the ideal-point model of unfolding, the vector model, and the signed distance model. In addition to a different representation, we provide a new iterative majorization algorithm to optimize the average squared correlation between the distances in the configuration and the transformed data per individual. It is shown that this approach is equivalent to minimizing Kruskal’s Stress-2.
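
    For reference, Stress-2 normalizes each subject's squared residuals by the variance of that subject's fitted distances. A minimal sketch of this per-subject quantity, assuming subject and object coordinates are given as matrices and the data are already transformed:

    import numpy as np

    def stress2_per_subject(delta, X, Y):
        # delta: (n_subj, n_obj) transformed preference data,
        # X: (n_subj, p) subject points, Y: (n_obj, p) object points.
        D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
        num = ((delta - D) ** 2).sum(axis=1)
        # Per-subject normalization by the variance of the fitted distances;
        # this is the term the two degeneracies interact with.
        den = ((D - D.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
        return np.sqrt(num / den)

    Minimizing the average of these values is what the abstract reports to be equivalent to maximizing the average squared correlation per individual.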
