Estimating basis functions in massive fields under the spatial random effects model
Spatial prediction under the assumption of a Gaussian random field (GRF) is commonly achieved by obtaining maximum likelihood estimates of the parameters and then using the kriging equations to arrive at predicted values. For massive datasets, fixed rank kriging with Expectation-Maximization (EM) estimation has been proposed as an alternative to the usual but computationally prohibitive kriging method. It reduces the computational cost of estimation by redefining the spatial process as a linear combination of basis functions and spatial random effects. A disadvantage of this method is that it imposes constraints on the relationship between the observed locations and the knots. We develop an alternative method that retains the Spatial Mixed Effects (SME) model but allows for additional flexibility by estimating the range of the spatial dependence between the observations and the knots via an Alternating Expectation Conditional Maximization (AECM) algorithm. Experiments show that our methodology improves estimation without sacrificing prediction accuracy, while also minimizing the additional computational burden of estimating the extra parameter. The methodology is applied to a temperature data set archived by the United States National Climate Data Center, with improved results over previous methodology.
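The computational appeal of the basis-function representation can be made concrete. The following is a minimal sketch, not the authors' implementation: a 1-D field, a bisquare basis (a common choice in fixed rank kriging), a diagonal random-effects covariance K, and a fixed bandwidth standing in for the range parameter that the AECM extension would estimate, are all illustrative assumptions. The point is that the Sherman-Morrison-Woodbury identity turns the n x n kriging solve into an r x r solve, with r the number of knots.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D toy field: n observations, r knots (r << n), so all solves are r x r.
n, r = 500, 10
locs = np.sort(rng.uniform(0, 1, n))
knots = np.linspace(0.05, 0.95, r)

def bisquare(d, bandwidth):
    """Bisquare basis: (1 - (d/bw)^2)^2 for d < bw, else 0."""
    w = 1.0 - (d / bandwidth) ** 2
    return np.where(d < bandwidth, w ** 2, 0.0)

# Basis matrix S (n x r); the bandwidth plays the role of the range
# parameter that the AECM extension estimates rather than fixes.
bandwidth = 0.3
S = bisquare(np.abs(locs[:, None] - knots[None, :]), bandwidth)

# Simulate from the SME model: Z = S @ eta + eps.
tau2, sigma2 = 1.0, 0.1
eta = rng.normal(0, np.sqrt(tau2), r)
Z = S @ eta + rng.normal(0, np.sqrt(sigma2), n)

# Fixed rank kriging predictor at new locations s0:
#   Zhat(s0) = S0 K S' Sigma^{-1} Z,  with  Sigma = S K S' + sigma2 I.
# Sherman-Morrison-Woodbury replaces the n x n solve by an r x r solve:
#   Sigma^{-1} Z = Z/sigma2 - S (sigma2 K^{-1} + S'S)^{-1} S'Z / sigma2.
K = tau2 * np.eye(r)                      # diagonal K: a simplifying assumption
inner = sigma2 * np.linalg.inv(K) + S.T @ S
Sigma_inv_Z = Z / sigma2 - S @ np.linalg.solve(inner, S.T @ Z) / sigma2

s0 = np.linspace(0, 1, 50)
S0 = bisquare(np.abs(s0[:, None] - knots[None, :]), bandwidth)
Z_hat = S0 @ (K @ (S.T @ Sigma_inv_Z))
print(Z_hat.shape)  # (50,)
```

Estimating the bandwidth (range) jointly with K and sigma2, rather than fixing it as above, is the flexibility the AECM algorithm adds.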
Solution Path Clustering with Robust Loss and Concave Penalty
The main purpose of this dissertation is to demonstrate that using a robust loss function (instead of the usual least squares loss) improves the clustering quality in the solution path clustering scheme. Cluster analysis simultaneously attempts to determine the number of clusters and to estimate cluster locations and memberships. Convex clustering, distinguishing itself from other popular clustering methods, casts the clustering objective as a convex optimization problem and thus admits a global solution. It is a useful exploratory technique that outputs a solution path, evoking the name "solution path clustering." The solution path is a tree-like structure with cluster results ranging from n clusters down to a single cluster. However, the benefits of convex clustering come at a cost: the use of a convex penalty can seriously bias the results and ruin the search for good cluster results. To lessen the bias, Ma and Huang (2017) proposed concave penalties to form the cluster centers. While the clustering objective is no longer convex, the quality of the solutions is improved. We extend the solution path clustering scheme by implementing robust loss functions instead of the usual least squares loss. Following Ma and Huang (2017), we also use a concave penalty to form clusters. The robust loss and concave penalty work together to mitigate the influence of outliers and minimize bias in the estimation of cluster locations, especially when the true distance between clusters is large. We introduce the IRLS-ADMM algorithm to minimize our proposed objective function and prove its convergence to a local minimum. Any loss function that admits an IRLS formulation or a majorizing surrogate can be used. We also study asymptotic and oracle properties of the estimator.
Finally, we demonstrate the performance of our proposed method through simulation experiments and on real data sets, as well as provide some preliminary results on choosing the number of clusters via the modified BIC (Wang, Li, and Leng, 2009).
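The two ingredients of the objective can be sketched concretely. This is an illustrative sketch under stated assumptions, not the dissertation's algorithm: the Huber loss stands in for "a robust loss with an IRLS formulation," the minimax concave penalty (MCP) for "a concave penalty," and the final lines only demonstrate the IRLS reweighting that downweights an outlier, not the full IRLS-ADMM iteration.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss: quadratic near 0, linear in the tails (robust to outliers)."""
    r = np.abs(r)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def mcp(t, lam, gamma=3.0):
    """Minimax concave penalty: rises like lam*t, then flattens at t = gamma*lam,
    so large between-center distances are not penalized further (less bias)."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

def objective(X, U, lam):
    """Robust solution-path-clustering objective (sketch):
    sum_i huber(||x_i - u_i||) + sum_{i<j} mcp(||u_i - u_j||, lam)."""
    fit = huber(np.linalg.norm(X - U, axis=1)).sum()
    n = len(U)
    pen = sum(mcp(np.linalg.norm(U[i] - U[j]), lam)
              for i in range(n) for j in range(i + 1, n))
    return fit + pen

def irls_weights(residual_norms, delta=1.0):
    """IRLS weights for the Huber loss: w = psi(r)/r = min(1, delta/r).
    Outlying points (large residuals) receive weight < 1."""
    r = np.maximum(residual_norms, 1e-12)
    return np.minimum(1.0, delta / r)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),   # cluster near the origin
               rng.normal(5, 0.2, (20, 2)),   # cluster near (5, 5)
               [[50.0, 50.0]]])               # one gross outlier
U = X.copy()                                  # lam = 0: n singleton clusters
w = irls_weights(np.linalg.norm(X - X.mean(axis=0), axis=1))
print(w[-1] < w[0])  # the outlier is downweighted relative to an inlier
```

As lam increases along the solution path, the fusion penalty pulls the rows of U together, merging singletons into clusters; the IRLS weights keep the outlier from dragging the merged centers.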
New methodological contributions in time series clustering
Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01
This thesis presents new procedures to address the cluster analysis of time series. First, a two-stage procedure based on comparing the frequencies and magnitudes of the absolute maxima of the spectral densities is proposed. Assuming that the clustering purpose is to group series according to the underlying dependence structures, a detailed study of the behavior in clustering of a dissimilarity based on comparing estimated quantile autocovariance functions (QAF) is also carried out. A prediction-based resampling algorithm proposed by Dudoit and Fridlyand is adapted to select the optimal number of clusters. The asymptotic behavior of the sample quantile autocovariances is studied, and an algorithm to determine optimal combinations of lags and pairs of quantile levels to perform clustering is introduced. The proposed metric is used to perform hard and soft partitioning-based clustering. First, a broad simulation study examines the behavior of the proposed metric in crisp clustering using hierarchical and PAM procedures. Then, a novel fuzzy C-medoids algorithm based on the QAF-dissimilarity is proposed. Three robust versions of this fuzzy algorithm are also presented to deal with data containing outlier time series. Finally, other approaches to soft clustering are explored, namely probabilistic D-clustering and clustering based on mixture models.
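The QAF dissimilarity at the heart of this thesis can be sketched directly from its definition. The sketch below is illustrative, not the thesis code: the sample quantile autocovariance gamma_hat(l; tau, tau') is the sample cross-covariance of the indicators 1{X_t <= q_tau} and 1{X_{t+l} <= q_tau'}, and the dissimilarity between two series is the Euclidean distance between their vectors of such quantities over a chosen set of lags and quantile-level pairs (the default lags and levels here are arbitrary choices, exactly the tuning the thesis's lag/quantile selection algorithm addresses).

```python
import numpy as np

def qaf_features(x, lags=(1, 2), probs=(0.1, 0.5, 0.9)):
    """Vector of sample quantile autocovariances
    gamma_hat(l; tau, tau') = Cov( 1{X_t <= q_tau}, 1{X_{t+l} <= q_tau'} )
    over all lags and all pairs of quantile levels."""
    q = np.quantile(x, probs)
    feats = []
    for l in lags:
        a = (x[:-l, None] <= q).astype(float)   # indicators at time t
        b = (x[l:, None] <= q).astype(float)    # indicators at time t + l
        a -= a.mean(axis=0)
        b -= b.mean(axis=0)
        feats.append((a.T @ b / len(a)).ravel())  # all (tau, tau') pairs
    return np.concatenate(feats)

def qaf_distance(x, y, **kw):
    """QAF dissimilarity: Euclidean distance between QAF feature vectors."""
    return np.linalg.norm(qaf_features(x, **kw) - qaf_features(y, **kw))

rng = np.random.default_rng(2)
n = 2000

def ar1(phi):
    """Simulate an AR(1) series with autoregressive coefficient phi."""
    e = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + e[t]
    return x

# Two series sharing a dependence structure, one with the opposite structure.
x1, x2, y = ar1(0.8), ar1(0.8), ar1(-0.8)
print(qaf_distance(x1, x2) < qaf_distance(x1, y))  # same-structure pair is closer
```

The resulting pairwise distance matrix is what the hierarchical, PAM, and fuzzy C-medoids procedures studied in the thesis would consume.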