Recognizing Treelike k-Dissimilarities
A k-dissimilarity D on a finite set X, |X| >= k, is a map from the set of
size k subsets of X to the real numbers. Such maps naturally arise from
edge-weighted trees T with leaf-set X: Given a subset Y of X of size k, D(Y) is
defined to be the total length of the smallest subtree of T with leaf-set Y.
For k = 2, it is well known that 2-dissimilarities arising in this way can
be characterized by the so-called "4-point condition". However, for k > 2,
Pachter and Speyer recently posed the following question: given an arbitrary
k-dissimilarity, how do we test whether this map comes from a tree? In this
paper, we provide an answer to this question, showing that for k >= 3 a
k-dissimilarity on a set X arises from a tree if and only if its restriction to
every 2k-element subset of X arises from some tree, and that 2k is the least
possible subset size to ensure that this is the case. As a corollary, we show
that there exists a polynomial-time algorithm to determine when a
k-dissimilarity arises from a tree. We also give a 6-point condition for
determining when a 3-dissimilarity arises from a tree, similar to the
aforementioned 4-point condition.
Comment: 18 pages, 4 figures
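For k = 2, the 4-point condition can be checked directly. The following is a minimal sketch (the function and variable names are illustrative, not from the paper): for every quadruple of leaves, the two largest of the three pairwise distance sums must coincide.

```python
from itertools import combinations

def four_point_holds(d, leaves, tol=1e-9):
    """Test whether the 2-dissimilarity d (a symmetric dict of dicts)
    satisfies the 4-point condition: for every quadruple {w, x, y, z},
    the two largest of d(w,x)+d(y,z), d(w,y)+d(x,z), d(w,z)+d(x,y)
    must be equal (up to a numerical tolerance)."""
    for w, x, y, z in combinations(leaves, 4):
        sums = sorted([d[w][x] + d[y][z],
                       d[w][y] + d[x][z],
                       d[w][z] + d[x][y]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

Checking all quadruples costs O(|X|^4); the paper's result yields the analogous polynomial-time test for k >= 3 by restricting the k-dissimilarity to 2k-element subsets.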
A phase I trial of DNA vaccination with a plasmid expressing prostate-specific antigen in patients with hormone-refractory prostate cancer
A flexible framework for sparse simultaneous component based data integration
Background: High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because the contributions of each of the biomolecules (transcripts, proteins) have to be taken into account.
Results: We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of the different penalties with respect to sparseness across and within data blocks.
Conclusion: Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses, and second, interpretation of the results is greatly facilitated by their sparseness. The approach offered is flexible and allows the block structure to be taken into account in different ways. As such, structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach).
Availability: The additional file contains a MATLAB implementation of the sparse simultaneous component method.
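The shrinkage idea behind such methods can be illustrated with a toy, single-component version: a lasso-type penalty applied inside a power-iteration PCA update drives small loadings exactly to zero. This is only a hedged sketch of the general principle (the function names and the simple soft-thresholding update are ours, not the paper's algorithm, which handles multiple data blocks and several penalty types):

```python
import numpy as np

def soft_threshold(v, lam):
    """Lasso proximal step: shrink toward zero, exactly zeroing small entries."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_loading(X, lam=0.1, n_iter=100):
    """One sparse component: alternate score and loading updates,
    soft-thresholding the loadings so that weakly contributing
    variables drop out of the interpretation entirely."""
    n, _ = X.shape
    # warm-start at the ordinary first principal loading vector
    w = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_iter):
        t = X @ w                              # component scores
        w = soft_threshold(X.T @ t / n, lam)   # penalized loadings
        nrm = np.linalg.norm(w)
        if nrm == 0.0:
            return w                           # lam too large: everything shrunk away
        w /= nrm
    return w
```

With multi-block data one would concatenate the blocks columnwise and, for a group-lasso-style penalty, threshold each block's sub-vector as a whole rather than entrywise.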
MTV and MGV: Two Criteria for Nonlinear PCA
MTV (Maximum Total Variance) and MGV (Minimum Generalized Variance) are popular criteria for PCA with optimal scaling. They are adopted by the SAS PRINQUAL procedure and OSMOD (Saito and Otsu, 1988). MTV is an intuitive generalization of the linear PCA criterion. We will show some properties of nonlinear PCA with these criteria in an application to data from the NLSY79 (Zagorsky, 1997), a large panel survey in the U.S. conducted over twenty years. We will show the following: (1) the effectiveness of PCA with optimal scaling as a tool for the analysis of large social research data, where it yields useful results as a complement to analyses by regression models; (2) features of MTV and MGV, especially their abilities and deficiencies in real data analysis.
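As a rough illustration of the optimal-scaling idea behind an MTV-style criterion (this is our own minimal sketch with nominal scaling only, not the PRINQUAL or OSMOD implementation): alternate a rank-r PCA step with a re-quantification step that replaces each category's score by the mean of the current low-rank approximation within that category, then restandardize each variable.

```python
import numpy as np

def mtv_nonlinear_pca(cats, r=1, n_iter=25):
    """ALS sketch of PCA with optimal scaling under a total-variance
    (MTV-style) criterion, nominal scaling only.
    cats: (n, p) integer array of category codes, one column per variable.
    Returns the quantified data matrix and the share of total variance
    accounted for by the first r components at the last PCA step."""
    X = (cats - cats.mean(0)) / cats.std(0)       # initial quantification
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Xhat = (U[:, :r] * s[:r]) @ Vt[:r]        # rank-r PCA approximation
        for j in range(cats.shape[1]):
            for c in np.unique(cats[:, j]):
                mask = cats[:, j] == c
                X[mask, j] = Xhat[mask, j].mean() # optimal scaling step
        X = (X - X.mean(0)) / X.std(0)            # restandardize each variable
    vaf = (s[:r] ** 2).sum() / (s ** 2).sum()
    return X, vaf
```

Each half-step can only improve the fit of the low-rank approximation to the quantified data, which is the intuition behind alternating-least-squares treatments of the MTV criterion.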
Computational solutions for omics data
High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.
National Institutes of Health (U.S.) (Grant GM081871)
Diagnostics for single-peakedness of item responses with ordered conditional means (OCM)
Interpreting degenerate solutions in unfolding by use of the vector model and the compensatory distance model
Catholic University of Leuven
In this paper, we reconsider the merits of unfolding solutions based on loss functions involving a normalization on the variance per subject. In the literature, solutions based on Stress-2 are often diagnosed as degenerate in the majority of cases. Here, the focus lies on two frequently occurring types of degeneracies. The first type typically locates some subject points far away from a compact cluster of the other points. In the second type of solution, the object points lie on a circle. In this paper, we argue that these degenerate solutions are well-fitting and informative. To reveal the information, we introduce mixtures of plots based on the ideal-point model of unfolding, the vector model, and the signed distance model. In addition to a different representation, we provide a new iterative majorization algorithm to optimize the average squared correlation between the distances in the configuration and the transformed data per individual. It is shown that this approach is equivalent to minimizing Kruskal's Stress-2.
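The per-subject normalization at issue can be made concrete. Below is a minimal sketch (the names are ours) of Kruskal's Stress-2 for one subject's vector of configuration distances and disparities (the optimally transformed data): the residual sum of squares is divided by the sum of squares of the distances about their mean, so a degenerate solution with (near-)constant distances is heavily penalized rather than rewarded.

```python
import numpy as np

def stress2(dist, disp):
    """Kruskal's Stress-2 for a single subject.
    dist: distances from the subject's ideal point to the objects
          in the current configuration.
    disp: the corresponding disparities (transformed data).
    The denominator is the variance-like sum of squares of the
    distances, so equal-distance degeneracies blow the value up
    instead of driving it to zero."""
    dist = np.asarray(dist, dtype=float)
    disp = np.asarray(disp, dtype=float)
    num = np.sum((dist - disp) ** 2)
    den = np.sum((dist - dist.mean()) ** 2)
    return np.sqrt(num / den)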