171 research outputs found

    Efficient use of simultaneous multi-band observations for variable star analysis

    The luminosity changes of most types of variable stars are correlated across the different wavelengths, and these correlations may be exploited for several purposes: for variability detection, for distinguishing microvariability from noise, for period search, or for classification. Principal component analysis is a simple and well-developed statistical tool for analyzing correlated data. We discuss its use on variable objects of Stripe 82 of the Sloan Digital Sky Survey, with the aim of identifying new RR Lyrae and SX Phoenicis-type candidates. The application is not straightforward because of different noise levels in the different bands, the presence of outliers that can be confused with real extreme observations, under- or overestimated errors, and the dependence of errors on the magnitudes. These particularities require robust methods to be applied together with the principal component analysis. The results show that PCA is a valuable aid in variability analysis with multi-band data. Comment: 8 pages, 5 figures, Workshop on Astrostatistics and Data Mining in Astronomical Databases, May 29-June 4, 2011, La Palma
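    A minimal sketch of the idea (not the authors' pipeline), assuming a hypothetical array `mags` of simultaneous multi-band magnitudes: each band is robustly standardised with the median and MAD before PCA, so that correlated multi-band variability concentrates in the first component.

```python
# Minimal sketch (not the authors' pipeline): PCA on simultaneous multi-band
# magnitudes to separate correlated variability from band-dependent noise.
# `mags` is a hypothetical (n_epochs, n_bands) array of magnitudes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mags = rng.normal(18.0, 0.05, size=(200, 5))       # stand-in light curves

# Robust standardisation per band: centre by the median, scale by the MAD,
# so outlying epochs and unequal band noise levels distort PCA less.
med = np.median(mags, axis=0)
mad = np.median(np.abs(mags - med), axis=0)
scaled = (mags - med) / (1.4826 * mad)

pca = PCA(n_components=5)
scores = pca.fit_transform(scaled)

# For a genuine variable, correlated changes across bands concentrate in the
# first component; its explained-variance ratio can serve as a variability score.
print(pca.explained_variance_ratio_)
```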

    Relaxed 2-D Principal Component Analysis by L_p Norm for Face Recognition

    A relaxed two-dimensional principal component analysis (R2DPCA) approach is proposed for face recognition. Unlike 2DPCA, 2DPCA-L_1 and G2DPCA, the R2DPCA utilizes the label information (if known) of training samples to calculate a relaxation vector and assigns a weight to each subset of training data. A new relaxed scatter matrix is defined, and the computed projection axes are able to increase the accuracy of face recognition. The optimal L_p-norms are selected in a reasonable range. Numerical experiments on practical face databases indicate that the R2DPCA has high generalization ability and can achieve a higher recognition rate than state-of-the-art methods. Comment: 19 pages, 11 figures
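    For orientation, the sketch below implements plain 2DPCA, the baseline that R2DPCA extends; the relaxation vector, label weighting and L_p-norm optimisation of the paper are not reproduced. The `faces` array is a hypothetical stand-in.

```python
# Minimal sketch of plain 2DPCA (the baseline the paper extends), on a
# hypothetical (n, h, w) stack of face images.
import numpy as np

rng = np.random.default_rng(1)
faces = rng.random((100, 32, 32))                 # stand-in face images

mean_face = faces.mean(axis=0)
centered = faces - mean_face

# Image scatter matrix: average of X^T X over the centred images (w x w).
scatter = np.einsum('nij,nik->jk', centered, centered) / len(faces)

# Projection axes = leading eigenvectors of the scatter matrix.
eigvals, eigvecs = np.linalg.eigh(scatter)
axes = eigvecs[:, ::-1][:, :8]                    # keep 8 axes

# 2-D feature matrices used for recognition (e.g. nearest neighbour).
features = centered @ axes                        # shape (n, h, 8)
print(features.shape)
```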

    Searching for motifs in the behaviour of larval Drosophila melanogaster and Caenorhabditis elegans reveals continuity between behavioural states

    We present a novel method for the unsupervised discovery of behavioural motifs in larval Drosophila melanogaster and Caenorhabditis elegans. A motif is defined as a particular sequence of postures that recurs frequently. The animal's changing posture is represented by an eigenshape time series, and we look for motifs in this time series. To find motifs, the eigenshape time series is segmented, and the segments clustered using spline regression. Unlike previous approaches, our method can classify sequences of unequal duration as the same motif. The behavioural motifs are used as the basis of a probabilistic behavioural annotator, the eigenshape annotator (ESA). Probabilistic annotation avoids rigid threshold values and allows classification uncertainty to be quantified. We apply eigenshape annotation to both larval Drosophila and C. elegans and produce a good match to hand annotation of behavioural states. However, we find many behavioural events cannot be unambiguously classified. By comparing the results with ESA of an artificial agent's behaviour, we argue that the ambiguity is due to greater continuity between behavioural states than is generally assumed for these organisms.
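    The motif search rests on the eigenshape representation: PCA of body-midline angles turns each frame into a few eigenshape amplitudes. The sketch below shows only that step, on a hypothetical `angles` array; segmentation and spline-regression clustering are omitted.

```python
# Minimal sketch of an eigenshape representation: PCA of midline tangent
# angles gives a low-dimensional posture time series per animal.
# `angles` is a hypothetical (n_frames, n_points) array of midline angles.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
angles = np.cumsum(rng.normal(0, 0.1, size=(5000, 48)), axis=0) % (2 * np.pi)

# Remove each frame's mean angle (overall orientation) before PCA.
centered = angles - angles.mean(axis=1, keepdims=True)

pca = PCA(n_components=4)                 # a handful of eigenshapes
eigenshape_series = pca.fit_transform(centered)

# Each row is the posture at one frame expressed as eigenshape amplitudes;
# motifs are recurring trajectories through this space.
print(eigenshape_series.shape)            # (n_frames, 4)
```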

    Sparsest factor analysis for clustering variables: a matrix decomposition approach

    We propose a new procedure for sparse factor analysis (FA) such that each variable loads on only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for a given number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices, and their least squares function is minimized through a specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to efficiently estimate factor correlations. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA.
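    The defining constraint is that each row of the loading matrix has exactly one nonzero entry. The sketch below illustrates that constraint with a crude alternating assignment scheme on hypothetical data; it is not the SSFA matrix-decomposition algorithm itself.

```python
# Crude illustration of the one-nonzero-per-row loading structure; hypothetical
# data, not the SSFA algorithm from the paper.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))                     # stand-in data: 12 variables
n_factors = 3
assign = np.arange(X.shape[1]) % n_factors         # initial variable-to-factor map

for _ in range(20):
    # Factor "scores": mean of the variables currently assigned to each factor.
    F = np.column_stack([X[:, assign == k].mean(axis=1) if np.any(assign == k)
                         else np.zeros(len(X)) for k in range(n_factors)])
    # Correlations between every variable and every factor score.
    corr = np.corrcoef(X, F, rowvar=False)[:X.shape[1], X.shape[1]:]
    corr = np.nan_to_num(corr)
    # Reassign each variable to its best factor: exactly one nonzero per row.
    assign = np.abs(corr).argmax(axis=1)

loadings = np.zeros((X.shape[1], n_factors))
loadings[np.arange(X.shape[1]), assign] = corr[np.arange(X.shape[1]), assign]
print(np.count_nonzero(loadings, axis=1))          # one nonzero loading per variable
```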

    The projection score - an evaluation criterion for variable subset selection in PCA visualization

    Background: In many scientific domains, it is becoming increasingly common to collect high-dimensional data sets, often with an exploratory aim, to generate new and relevant hypotheses. The exploratory perspective often makes statistically guided visualization methods, such as Principal Component Analysis (PCA), the methods of choice. However, the clarity of the obtained visualizations, and thereby the potential to use them to formulate relevant hypotheses, may be confounded by the presence of many non-informative variables. For microarray data, more easily interpretable visualizations are often obtained by filtering the variable set, for example by removing the variables with the smallest variances or by only including the variables most highly related to a specific response. The resulting visualization may depend heavily on the inclusion criterion, that is, effectively the number of retained variables. To our knowledge, there exists no objective method for determining the optimal inclusion criterion in the context of visualization. Results: We present the projection score, which is a straightforward, intuitively appealing measure of the informativeness of a variable subset with respect to PCA visualization. This measure can be universally applied to find suitable inclusion criteria for any type of variable filtering. We apply the presented measure to find optimal variable subsets for different filtering methods in both microarray data sets and synthetic data sets. We note also that the projection score can be applied in general contexts, to compare the informativeness of any variable subsets with respect to visualization by PCA. Conclusions: We conclude that the projection score provides an easily interpretable and universally applicable measure of the informativeness of a variable subset with respect to visualization by PCA, which can be used to systematically find the most interpretable PCA visualization in practical exploratory analysis.
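    A hedged sketch in the spirit of this idea (not the paper's exact definition of the projection score): score a variable subset by how much variance its leading components capture relative to a column-permuted null, and sweep the inclusion criterion of a variance filter.

```python
# Hedged sketch, not the paper's exact definition: compare the variance captured
# by the top PCA components of a variable subset against a permuted-data null,
# and sweep a variance filter to pick the subset that scores best.
import numpy as np
from sklearn.decomposition import PCA

def subset_score(X, n_components=2, n_perm=20, seed=0):
    rng = np.random.default_rng(seed)
    k = min(n_components, min(X.shape) - 1)
    observed = PCA(k).fit(X).explained_variance_ratio_.sum()
    null = []
    for _ in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        null.append(PCA(k).fit(Xp).explained_variance_ratio_.sum())
    return observed - np.mean(null)          # larger = more structure retained

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 500))                             # stand-in expression matrix
X[:, :20] += np.outer(rng.normal(size=60), np.ones(20))    # 20 informative variables

variances = X.var(axis=0)
for top in (20, 100, 500):                   # candidate inclusion criteria
    idx = np.argsort(variances)[::-1][:top]
    print(top, round(subset_score(X[:, idx]), 3))
```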

    Data-Driven Understanding of Smart Service Systems Through Text Mining

    Smart service systems are everywhere, in homes and in the transportation, energy, and healthcare sectors. However, such systems have yet to be fully understood in the literature. Given the widespread applications of and research on smart service systems, we used text mining to develop a unified understanding of such systems in a data-driven way. Specifically, we used a combination of metrics and machine learning algorithms to preprocess and analyze text data related to smart service systems, including text from the scientific literature and news articles. By analyzing 5,378 scientific articles and 1,234 news articles, we identify important keywords, 16 research topics, 4 technology factors, and 13 application areas. We define "smart service system" based on the analytics results. Furthermore, we discuss the theoretical and methodological implications of our work, such as the 5Cs (connection, collection, computation, and communications for co-creation) of smart service systems and the text mining approach to understanding service research topics. We believe this work, which aims to establish common ground for understanding these systems across multiple disciplinary perspectives, will encourage further research and development of modern service systems.
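    The abstract does not spell out the pipeline; the sketch below shows one conventional way to extract topics from article text (bag-of-words plus LDA) on a tiny stand-in corpus, not the study's own analysis.

```python
# Hedged sketch of topic extraction from article text: CountVectorizer + LDA
# on a tiny stand-in corpus (not the study's actual pipeline or data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "smart home energy service sensor data",
    "healthcare monitoring service platform sensor",
    "transportation mobility service data platform",
    "energy grid smart meter analytics",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[::-1][:4]]
    print(f"topic {k}:", ", ".join(top))
```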

    Sparse principal component analysis for natural language processing

    High-dimensional data are growing rapidly in many disciplines, particularly in natural language processing, where analysis requires working with high-dimensional matrices of word embeddings obtained from text data. Those matrices are often sparse in the sense that they contain many zero elements. Sparse principal component analysis is an advanced mathematical tool for the analysis of high-dimensional data. In this paper, we study and apply sparse principal component analysis for natural language processing, which can effectively handle large sparse matrices. We study several formulations of sparse principal component analysis, together with algorithms for implementing those formulations. Our work is motivated and illustrated by a real text dataset. We find that sparse principal component analysis performs as well as ordinary principal component analysis in terms of accuracy and precision, while offering two major advantages: faster calculations and easier interpretation of the principal components. These advantages are especially helpful in big data situations.
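    A minimal sketch of the comparison described, using one common formulation of sparse PCA (scikit-learn's SparsePCA, not necessarily the formulations studied in the paper) on a mostly-zero stand-in matrix.

```python
# Sparse PCA versus ordinary PCA on a mostly-zero stand-in matrix, showing the
# easier-to-interpret, mostly-zero loadings of the sparse variant.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(5)
X = rng.random((200, 50)) * (rng.random((200, 50)) < 0.1)   # mostly zeros

pca = PCA(n_components=5).fit(X)
spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)

print("dense loadings  != 0:", np.count_nonzero(pca.components_))
print("sparse loadings != 0:", np.count_nonzero(spca.components_))
```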

    Semi-sparse PCA

    It is well-known that the classical exploratory factor analysis (EFA) of data with more observations than variables has several types of indeterminacy. We study the factor indeterminacy and show some new aspects of this problem by considering EFA as a specific data matrix decomposition. We adopt a new approach to the EFA estimation and achieve a new characterization of the factor indeterminacy problem. A new alternative model is proposed, which gives determinate factors and can be seen as a semi-sparse principal component analysis (PCA). An alternating algorithm is developed, where in each step a Procrustes problem is solved. It is demonstrated that the new model/algorithm can act as a specific sparse PCA and as a low-rank-plus-sparse matrix decomposition. Numerical examples with several large data sets illustrate the versatility of the new model, and the performance and behaviour of its algorithmic implementation.
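    The algorithm is described as alternating steps in which a Procrustes problem is solved. The sketch below shows only that generic subproblem, finding the orthogonal matrix Q that minimises ||AQ - B||_F, not the full semi-sparse PCA.

```python
# Orthogonal Procrustes subproblem: find orthogonal Q minimising ||A Q - B||_F.
# This is only the generic building block mentioned in the abstract.
import numpy as np

def procrustes(A, B):
    # Optimal orthogonal Q comes from the SVD of A^T B.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 4))
Q_true, _ = np.linalg.qr(rng.normal(size=(4, 4)))
B = A @ Q_true

Q = procrustes(A, B)
print(np.allclose(A @ Q, B))              # recovers the rotation
```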

    A flexible framework for sparse simultaneous component based data integration

    Background: High-throughput data are complex, and methods that reveal the structure underlying the data are most useful. Principal component analysis, frequently implemented as a singular value decomposition, is a popular technique in this respect. Nowadays the challenge is often to reveal structure in several sources of information (e.g., transcriptomics, proteomics) that are available for the same biological entities under study. Simultaneous component methods are most promising in this respect. However, the interpretation of the principal and simultaneous components is often daunting because contributions of each of the biomolecules (transcripts, proteins) have to be taken into account. Results: We propose a sparse simultaneous component method that makes many of the parameters redundant by shrinking them to zero. It includes principal component analysis, sparse principal component analysis, and ordinary simultaneous component analysis as special cases. Several penalties can be tuned that account in different ways for the block structure present in the integrated data. This yields known sparse approaches such as the lasso, the ridge penalty, the elastic net, the group lasso, the sparse group lasso, and the elitist lasso. In addition, the algorithmic results can be easily transposed to the context of regression. Metabolomics data obtained with two measurement platforms for the same set of Escherichia coli samples are used to illustrate the proposed methodology and the properties of different penalties with respect to sparseness across and within data blocks. Conclusion: Sparse simultaneous component analysis is a useful method for data integration: first, simultaneous analyses of multiple blocks offer advantages over sequential and separate analyses, and second, interpretation of the results is greatly facilitated by their sparseness. The approach offered is flexible and allows the block structure to be taken into account in different ways, so that structures can be found that are exclusively tied to one data platform (group lasso approach) as well as structures that involve all data platforms (elitist lasso approach). Availability: The additional file contains a MATLAB implementation of the sparse simultaneous component method.
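    A hedged sketch of the basic setting (not the authors' penalised algorithm): two measurement blocks for the same samples are concatenated column-wise and sparse components are extracted, after which one can check which block each component draws on.

```python
# Hedged sketch: concatenate two data blocks measured on the same samples,
# extract sparse components, and inspect per-block loadings (not the paper's
# own penalty scheme).
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(7)
n = 40
shared = rng.normal(size=(n, 1))                              # common structure
block1 = shared @ rng.normal(size=(1, 30)) + 0.1 * rng.normal(size=(n, 30))
block2 = shared @ rng.normal(size=(1, 20)) + 0.1 * rng.normal(size=(n, 20))

X = np.hstack([block1, block2])           # simultaneous (concatenated) data
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

for k, w in enumerate(spca.components_):
    n1 = np.count_nonzero(w[:30])         # loadings on platform 1
    n2 = np.count_nonzero(w[30:])         # loadings on platform 2
    print(f"component {k}: block1 nonzeros={n1}, block2 nonzeros={n2}")
```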

    Cage Matching: Head to Head Competition Experiments of an Invasive Plant Species from Different Regions as a Means to Test for Differentiation

    Many hypotheses are prevalent in the literature predicting why some plant species can become invasive. However, in some respects, we lack a standard approach to compare the breadth of various studies and differentiate between alternative explanations. Furthermore, most of these hypotheses rely on ‘changes in density’ of an introduced species to infer invasiveness. Here, we propose a simple method to screen invasive plant species for potential differences in density effects between novel regions. Studies of plant competition using density series are a fundamental tool applied to virtually every aspect of plant population ecology to better understand evolution. Hence, we use a simple density series with substitution, contrasting the performance of Centaurea solstitialis in monoculture (seeds from one region) with mixtures (seeds from two regions). All else being equal, if there is no difference between the introduced species in the two novel regions compared, Argentina and California, then there should be no competitive differences between intra- and inter-regional competition series. Using a replicated regression design, seeds from each region were sown in the greenhouse at 5 densities, in monoculture and in mixture, and grown until the onset of flowering. Centaurea seeds from California had higher germination, and the seedlings had significantly greater survival, than those from Argentina. There was no evidence for density dependence in any measure for the California region, but negative density dependence was detected in the germination of seeds from Argentina. The relative differences in competition also differed between regions, with no evidence of differential competitive effects of seeds from Argentina in mixture versus monoculture, while seeds from California expressed a relative cost in germination and relative growth rate in mixtures with Argentina. In the former instance, lack of difference does not mean ‘no ecological differences’ but does suggest that local adaptation in competitive abilities has not occurred. Importantly, this method successfully detected differences in the response of an invasive species to changes in density between novel regions, which suggests that it is a useful preliminary means to explore invasiveness.