171 research outputs found

    Efficient use of simultaneous multi-band observations for variable star analysis

    The luminosity changes of most types of variable stars are correlated across the different wavelengths, and these correlations may be exploited for several purposes: for variability detection, for distinguishing microvariability from noise, for period search, or for classification. Principal component analysis is a simple and well-developed statistical tool for analyzing correlated data. We discuss its use on variable objects of Stripe 82 of the Sloan Digital Sky Survey, with the aim of identifying new RR Lyrae and SX Phoenicis-type candidates. The application is not straightforward because of different noise levels in the different bands, the presence of outliers that can be confused with real extreme observations, under- or overestimated errors, and the dependence of errors on the magnitudes. These particularities require robust methods to be applied together with the principal component analysis. The results show that PCA is a valuable aid in variability analysis with multi-band data. Comment: 8 pages, 5 figures, Workshop on Astrostatistics and Data Mining in Astronomical Databases, May 29-June 4, 2011, La Palma
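    A minimal sketch of the idea (not the authors' pipeline), assuming a hypothetical array `mags` of simultaneous multi-band magnitudes: each band is robustly standardised with the median and MAD before PCA, so that correlated multi-band variability concentrates in the first component.

```python
# Minimal sketch (not the authors' pipeline): PCA on simultaneous multi-band
# magnitudes to separate correlated variability from band-dependent noise.
# `mags` is a hypothetical (n_epochs, n_bands) array of magnitudes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mags = rng.normal(18.0, 0.05, size=(200, 5))       # stand-in light curves

# Robust standardisation per band: centre by the median, scale by the MAD,
# so outlying epochs and unequal band noise levels distort PCA less.
med = np.median(mags, axis=0)
mad = np.median(np.abs(mags - med), axis=0)
scaled = (mags - med) / (1.4826 * mad)

pca = PCA(n_components=5)
scores = pca.fit_transform(scaled)

# For a genuine variable, correlated changes across bands concentrate in the
# first component; its explained-variance ratio can serve as a variability score.
print(pca.explained_variance_ratio_)
```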

    Relaxed 2-D Principal Component Analysis by L_p Norm for Face Recognition

    A relaxed two-dimensional principal component analysis (R2DPCA) approach is proposed for face recognition. Unlike 2DPCA, 2DPCA-L_1 and G2DPCA, the R2DPCA utilizes the label information (if known) of training samples to calculate a relaxation vector and assigns a weight to each subset of training data. A new relaxed scatter matrix is defined, and the computed projection axes are able to increase the accuracy of face recognition. The optimal L_p-norms are selected in a reasonable range. Numerical experiments on practical face databases indicate that the R2DPCA has high generalization ability and can achieve a higher recognition rate than state-of-the-art methods. Comment: 19 pages, 11 figures
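    For orientation, the sketch below implements plain 2DPCA, the baseline that R2DPCA extends; the relaxation vector, label weighting and L_p-norm optimisation of the paper are not reproduced. The `faces` array is a hypothetical stand-in.

```python
# Minimal sketch of plain 2DPCA (the baseline the paper extends), on a
# hypothetical (n, h, w) stack of face images.
import numpy as np

rng = np.random.default_rng(1)
faces = rng.random((100, 32, 32))                 # stand-in face images

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# Image scatter matrix: average of X^T X over the centred images (w x w).
scatter = np.einsum('nij,nik->jk', centered, centered) / len(faces)

# Projection axes = leading eigenvectors of the scatter matrix.
eigvals, eigvecs = np.linalg.eigh(scatter)
axes = eigvecs[:, ::-1][:, :8]                    # keep 8 axes

# 2-D feature matrices used for recognition (e.g. nearest neighbour).
features = centered @ axes                        # shape (n, h, 8)
print(features.shape)
```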

    Searching for motifs in the behaviour of larval Drosophila melanogaster and Caenorhabditis elegans reveals continuity between behavioural states

    We present a novel method for the unsupervised discovery of behavioural motifs in larval Drosophila melanogaster and Caenorhabditis elegans. A motif is defined as a particular sequence of postures that recurs frequently. The animal's changing posture is represented by an eigenshape time series, and we look for motifs in this time series. To find motifs, the eigenshape time series is segmented, and the segments clustered using spline regression. Unlike previous approaches, our method can classify sequences of unequal duration as the same motif. The behavioural motifs are used as the basis of a probabilistic behavioural annotator, the eigenshape annotator (ESA). Probabilistic annotation avoids rigid threshold values and allows classification uncertainty to be quantified. We apply eigenshape annotation to both larval Drosophila and C. elegans and produce a good match to hand annotation of behavioural states. However, we find many behavioural events cannot be unambiguously classified. By comparing the results with ESA of an artificial agent's behaviour, we argue that the ambiguity is due to greater continuity between behavioural states than is generally assumed for these organisms.
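    The motif search rests on the eigenshape representation: PCA of body-midline angles turns each frame into a few eigenshape amplitudes. The sketch below shows only that step, on a hypothetical `angles` array; segmentation and spline-regression clustering are omitted.

```python
# Minimal sketch of an eigenshape representation: PCA of midline tangent
# angles gives a low-dimensional posture time series per animal.
# `angles` is a hypothetical (n_frames, n_points) array of midline angles.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
angles = np.cumsum(rng.normal(0, 0.1, size=(5000, 48)), axis=0) % (2 * np.pi)

# Remove each frame's mean angle (overall orientation) before PCA.
centered = angles - angles.mean(axis=1, keepdims=True)

pca = PCA(n_components=4)                 # a handful of eigenshapes
eigenshape_series = pca.fit_transform(centered)

# Each row is the posture at one frame expressed as eigenshape amplitudes;
# motifs are recurring trajectories through this space.
print(eigenshape_series.shape)            # (n_frames, 4)
```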

    Sparsest factor analysis for clustering variables: a matrix decomposition approach

    We propose a new procedure for sparse factor analysis (FA) such that each variable loads on only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for a given number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices, and their least squares function is minimized through a specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to efficiently estimate factor correlations. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA.
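    The defining constraint is that each row of the loading matrix has exactly one nonzero entry. The sketch below illustrates that constraint with a crude alternating assignment scheme on hypothetical data; it is not the SSFA matrix-decomposition algorithm itself.

```python
# Crude illustration of the one-nonzero-per-row loading structure; hypothetical
# data, not the SSFA algorithm from the paper.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))                     # stand-in data: 12 variables
n_factors = 3
assign = np.arange(X.shape[1]) % n_factors         # initial variable-to-factor map

for _ in range(20):
    # Factor "scores": mean of the variables currently assigned to each factor.
    F = np.column_stack([X[:, assign == k].mean(axis=1) if np.any(assign == k)
                         else np.zeros(len(X)) for k in range(n_factors)])
    # Correlations between every variable and every factor score.
    corr = np.corrcoef(X, F, rowvar=False)[:X.shape[1], X.shape[1]:]
    corr = np.nan_to_num(corr)
    # Reassign each variable to its best factor: exactly one nonzero per row.
    assign = np.abs(corr).argmax(axis=1)

loadings = np.zeros((X.shape[1], n_factors))
loadings[np.arange(X.shape[1]), assign] = corr[np.arange(X.shape[1]), assign]
print(np.count_nonzero(loadings, axis=1))          # one nonzero loading per variable
```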

    The projection score - an evaluation criterion for variable subset selection in PCA visualization

    Background: In many scientific domains, it is becoming increasingly common to collect high-dimensional data sets, often with an exploratory aim, to generate new and relevant hypotheses. The exploratory perspective often makes statistically guided visualization methods, such as Principal Component Analysis (PCA), the methods of choice. However, the clarity of the obtained visualizations, and thereby the potential to use them to formulate relevant hypotheses, may be confounded by the presence of many non-informative variables. For microarray data, more easily interpretable visualizations are often obtained by filtering the variable set, for example by removing the variables with the smallest variances or by only including the variables most highly related to a specific response. The resulting visualization may depend heavily on the inclusion criterion, that is, effectively the number of retained variables. To our knowledge, there exists no objective method for determining the optimal inclusion criterion in the context of visualization. Results: We present the projection score, which is a straightforward, intuitively appealing measure of the informativeness of a variable subset with respect to PCA visualization. This measure can be universally applied to find suitable inclusion criteria for any type of variable filtering. We apply the presented measure to find optimal variable subsets for different filtering methods in both microarray data sets and synthetic data sets. We note also that the projection score can be applied in general contexts, to compare the informativeness of any variable subsets with respect to visualization by PCA. Conclusions: We conclude that the projection score provides an easily interpretable and universally applicable measure of the informativeness of a variable subset with respect to visualization by PCA, which can be used to systematically find the most interpretable PCA visualization in practical exploratory analysis.
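    A hedged sketch in the spirit of this idea (not the paper's exact definition of the projection score): score a variable subset by how much variance its leading components capture relative to a column-permuted null, and sweep the inclusion criterion of a variance filter.

```python
# Hedged sketch, not the paper's exact definition: compare the variance captured
# by the top PCA components of a variable subset against a permuted-data null,
# and sweep a variance filter to pick the subset that scores best.
import numpy as np
from sklearn.decomposition import PCA

def subset_score(X, n_components=2, n_perm=20, seed=0):
    rng = np.random.default_rng(seed)
    k = min(n_components, min(X.shape) - 1)
    observed = PCA(k).fit(X).explained_variance_ratio_.sum()
    null = []
    for _ in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        null.append(PCA(k).fit(Xp).explained_variance_ratio_.sum())
    return observed - np.mean(null)          # larger = more structure retained

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 500))                             # stand-in expression matrix
X[:, :20] += np.outer(rng.normal(size=60), np.ones(20))    # 20 informative variables

variances = X.var(axis=0)
for top in (20, 100, 500):                   # candidate inclusion criteria
    idx = np.argsort(variances)[::-1][:top]
    print(top, round(subset_score(X[:, idx]), 3))
```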

    Data-Driven Understanding of Smart Service Systems Through Text Mining

    Smart service systems are everywhere, in homes and in the transportation, energy, and healthcare sectors. However, such systems have yet to be fully understood in the literature. Given the widespread applications of and research on smart service systems, we used text mining to develop a unified understanding of such systems in a data-driven way. Specifically, we used a combination of metrics and machine learning algorithms to preprocess and analyze text data related to smart service systems, including text from the scientific literature and news articles. By analyzing 5,378 scientific articles and 1,234 news articles, we identify important keywords, 16 research topics, 4 technology factors, and 13 application areas. We define "smart service system" based on the analytics results. Furthermore, we discuss the theoretical and methodological implications of our work, such as the 5Cs (connection, collection, computation, and communications for co-creation) of smart service systems and the text mining approach to understanding service research topics. We believe this work, which aims to establish common ground for understanding these systems across multiple disciplinary perspectives, will encourage further research and development of modern service systems.
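    The abstract does not spell out the pipeline; the sketch below shows one conventional way to extract topics from article text (bag-of-words plus LDA) on a tiny stand-in corpus, not the study's own analysis.

```python
# Hedged sketch of topic extraction from article text: CountVectorizer + LDA
# on a tiny stand-in corpus (not the study's actual pipeline or data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "smart home energy service sensor data",
    "healthcare monitoring service platform sensor",
    "transportation mobility service data platform",
    "energy grid smart meter analytics",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:4]]
    print(f"topic {k}:", ", ".join(top))
```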

    Sparse principal component analysis for natural language processing

    High-dimensional data are growing rapidly in many disciplines, particularly in natural language processing, where analysis requires working with high-dimensional matrices of word embeddings obtained from text data. Those matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high-dimensional data. In this paper, we study and apply sparse principal component analysis for natural language processing, which can effectively handle large sparse matrices. We study several formulations of sparse principal component analysis, together with algorithms for implementing those formulations. Our work is motivated and illustrated by a real text dataset. We find that sparse principal component analysis performs as well as ordinary principal component analysis in terms of accuracy and precision, while offering two major advantages: faster calculations and easier interpretation of the principal components. These advantages are especially helpful in big data situations.
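    A minimal sketch of the comparison described, using one common formulation of sparse PCA (scikit-learn's SparsePCA, not necessarily the formulations studied in the paper) on a mostly-zero stand-in matrix.

```python
# Sparse PCA versus ordinary PCA on a mostly-zero stand-in matrix, showing the
# easier-to-interpret, mostly-zero loadings of the sparse variant.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(5)
X = rng.random((200, 50)) * (rng.random((200, 50)) < 0.1)   # mostly zeros

pca = PCA(n_components=5).fit(X)
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)

print("dense loadings  != 0:", np.count_nonzero(pca.components_))
print("sparse loadings != 0:", np.count_nonzero(spca.components_))
```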

    Semi-sparse PCA

    It is well-known that the classical exploratory factor analysis (EFA) of data with more observations than variables has several types of indeterminacy. We study the factor indeterminacy and show some new aspects of this problem by considering EFA as a specific data matrix decomposition. We adopt a new approach to the EFA estimation and achieve a new characterization of the factor indeterminacy problem. A new alternative model is proposed, which gives determinate factors and can be seen as a semi-sparse principal component analysis (PCA). An alternating algorithm is developed, where in each step a Procrustes problem is solved. It is demonstrated that the new model/algorithm can act as a specific sparse PCA and as a low-rank-plus-sparse matrix decomposition. Numerical examples with several large data sets illustrate the versatility of the new model, and the performance and behaviour of its algorithmic implementation.
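    The algorithm is described as alternating steps in which a Procrustes problem is solved. The sketch below shows only that generic subproblem, finding the orthogonal matrix Q that minimises ||AQ - B||_F, not the full semi-sparse PCA.

```python
# Orthogonal Procrustes subproblem: find orthogonal Q minimising ||A Q - B||_F.
# This is only the generic building block mentioned in the abstract.
import numpy as np

def procrustes(A, B):
    # Optimal orthogonal Q comes from the SVD of A^T B.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 4))
Q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
B = A @ Q_true

Q = procrustes(A, B)
print(np.allclose(A @ Q, B))              # recovers the rotation
```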

    A flexible framework for sparse simultaneous component based data integration

    Background: High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because contributions of each of the biomolecules (transcripts, proteins) have to be taken into account. Results: We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of different penalties with respect to sparseness across and within data blocks. Conclusion: Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses, and second, interpretation of the results is greatly facilitated by their sparseness. The approach offered is flexible and allows the block structure to be taken into account in different ways, so that structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach). Availability: The additional file contains a MATLAB implementation of the sparse simultaneous component method.
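    A hedged sketch of the basic setting (not the authors' penalised algorithm): two measurement blocks for the same samples are concatenated column-wise and sparse components are extracted, after which one can check which block each component draws on.

```python
# Hedged sketch: concatenate two data blocks measured on the same samples,
# extract sparse components, and inspect per-block loadings (not the paper's
# own penalty scheme).
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
n = 40
shared = rng.normal(size=(n, 1))                              # common structure
block1 = shared @ rng.normal(size=(1, 30)) + 0.1 * rng.normal(size=(n, 30))
block2 = shared @ rng.normal(size=(1, 20)) + 0.1 * rng.normal(size=(n, 20))

X = np.hstack([block1, block2])           # simultaneous (concatenated) data
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

for k, w in enumerate(spca.components_):
    n1 = np.count_nonzero(w[:30])         # loadings on platform 1
    n2 = np.count_nonzero(w[30:])         # loadings on platform 2
    print(f"component {k}: block1 nonzeros={n1}, block2 nonzeros={n2}")
```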

    Cage Matching: Head to Head Competition Experiments of an Invasive Plant Species from Different Regions as a Means to Test for Differentiation

    Many hypotheses are prevalent in the literature predicting why some plant species can become invasive. However, in some respects, we lack a standard approach to compare the breadth of various studies and differentiate between alternative explanations. Furthermore, most of these hypotheses rely on ‘changes in density’ of an introduced species to infer invasiveness. Here, we propose a simple method to screen invasive plant species for potential differences in density effects between novel regions. Studies of plant competition using density series are a fundamental tool applied to virtually every aspect of plant population ecology to better understand evolution. Hence, we use a simple density series with substitution, contrasting the performance of Centaurea solstitialis in monoculture (seeds from one region) with mixtures (seeds from two regions). All else being equal, if there is no difference between the introduced species in the two novel regions compared, Argentina and California, then there should be no competitive differences between intra- and inter-regional competition series. Using a replicated regression design, seeds from each region were sown in the greenhouse at 5 densities, in monoculture and in mixture, and grown until the onset of flowering. Centaurea seeds from California had higher germination, and the seedlings had significantly greater survival, than those from Argentina. There was no evidence for density dependence in any measure for the California region, but negative density dependence was detected in the germination of seeds from Argentina. The relative differences in competition also differed between regions, with no evidence of differential competitive effects of seeds from Argentina in mixture versus monoculture, while seeds from California expressed a relative cost in germination and relative growth rate in mixtures with Argentina. In the former instance, lack of difference does not mean ‘no ecological differences’ but does suggest that local adaptation in competitive abilities has not occurred. Importantly, this method successfully detected differences in the response of an invasive species to changes in density between novel regions, which suggests that it is a useful preliminary means to explore invasiveness.