    Clustering mixed-type data using a probabilistic distance algorithm

    Cluster analysis is a broadly used unsupervised data analysis technique for finding groups of homogeneous units in a data set. Probabilistic distance clustering adjusted for cluster size (PDQ), discussed in this contribution, falls within the broad category of clustering methods initially developed for continuous data; it has the advantages of fuzzy membership and robustness. However, a common issue in clustering is the treatment of mixed-type data, i.e., data containing both continuous and categorical variables, which are among the most common types encountered in practice. This paper extends PDQ to mixed-type data by using different dissimilarities for different kinds of variables. First, PDQ for mixed-type data is defined; then a simulation design shows its advantages over some state-of-the-art techniques; finally, it is applied to a real data set. The conclusion includes some future developments.
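
    A minimal sketch of the two ingredients the abstract describes, under assumptions: a mixed-type dissimilarity built from squared Euclidean distance on the continuous variables and simple matching on the categorical ones, plus cluster-size-adjusted probabilistic-distance memberships. The weight w and these particular dissimilarities are illustrative choices, not necessarily the paper's.

        import numpy as np

        def mixed_dissimilarity(x_cont, x_cat, c_cont, c_cat, w=0.5):
            # Squared Euclidean on the continuous part plus simple matching
            # (count of mismatches) on the categorical part; w balances the two.
            d_cont = np.sum((np.asarray(x_cont) - np.asarray(c_cont)) ** 2)
            d_cat = np.sum(np.asarray(x_cat) != np.asarray(c_cat))
            return w * d_cont + (1.0 - w) * d_cat

        def pdq_memberships(dists, sizes):
            # Probabilistic-distance principle with cluster-size adjustment:
            # membership in cluster k proportional to size_k / dist_k, normalized.
            dists = np.maximum(np.asarray(dists, dtype=float), 1e-12)
            p = np.asarray(sizes, dtype=float) / dists
            return p / p.sum()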

    Spectral Clustering of Mixed-Type Data

    Cluster analysis seeks to assign objects with similar characteristics into groups, called clusters, so that objects within a group are similar to each other and dissimilar to objects in other groups. Spectral clustering has been shown to perform well in different scenarios on continuous data: it can detect both convex and non-convex clusters, and it can detect overlapping clusters. However, the restriction to continuous data can be limiting in real applications, where data are often of mixed type, i.e., contain both continuous and categorical features. This paper extends spectral clustering to mixed-type data. The new method replaces the Euclidean distance used in conventional spectral clustering with different dissimilarity measures for continuous and categorical variables. A global dissimilarity measure is then computed as a weighted sum, and a Gaussian kernel is used to convert the dissimilarity matrix into a similarity matrix. The new method includes automatic tuning of the variable weight and the kernel parameter. The performance of spectral clustering in different scenarios is compared with that of two state-of-the-art mixed-type data clustering methods, k-prototypes and KAMILA, using several simulated and real data sets.
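
    The similarity construction described above can be sketched as follows, with the caveat that the fixed weight w and kernel parameter sigma stand in for the paper's automatically tuned values, and that the Hamming distance assumes integer-coded categorical variables.

        import numpy as np
        from scipy.spatial.distance import cdist
        from sklearn.cluster import KMeans

        def mixed_similarity(X_cont, X_cat, w=0.5, sigma=1.0):
            # Weighted sum of a continuous dissimilarity (Euclidean) and a
            # categorical one (Hamming, i.e. fraction of mismatching categories),
            # converted to similarities with a Gaussian kernel.
            d_cont = cdist(X_cont, X_cont, metric="euclidean")
            d_cat = cdist(X_cat, X_cat, metric="hamming")
            d = w * d_cont + (1.0 - w) * d_cat
            return np.exp(-d ** 2 / (2.0 * sigma ** 2))

        def spectral_clusters(S, k):
            # Standard spectral clustering on a precomputed similarity matrix:
            # normalized Laplacian, k leading eigenvectors, row-normalize, k-means.
            deg = S.sum(axis=1)
            L = np.eye(len(S)) - S / np.sqrt(np.outer(deg, deg))
            vals, vecs = np.linalg.eigh(L)
            U = vecs[:, :k]
            U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
            return KMeans(n_clusters=k, n_init=10).fit_predict(U)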

    Unsupervised Learning via Mixtures of Skewed Distributions with Hypercube Contours

    Mixture models whose components have skewed hypercube contours are developed via a generalization of the multivariate shifted asymmetric Laplace density. Specifically, we develop mixtures of multiple scaled shifted asymmetric Laplace distributions. The component densities have two unique features: they include a multivariate weight function, and the marginal distributions are also asymmetric Laplace. We use these mixtures of multiple scaled shifted asymmetric Laplace distributions for clustering applications, but they could equally well be used in the supervised or semi-supervised paradigms. The expectation-maximization algorithm is used for parameter estimation and the Bayesian information criterion is used for model selection. Simulated and real data sets are used to illustrate the approach and, in some cases, to visualize the skewed hypercube structure of the components.
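
    The component density itself is beyond a short sketch, but the fitting workflow described (EM per candidate number of components, then BIC for model selection) looks roughly as below. GaussianMixture is only a stand-in; the paper fits mixtures of multiple scaled shifted asymmetric Laplace distributions, which scikit-learn does not provide.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def select_by_bic(X, max_G=6):
            # Fit mixtures with G = 1..max_G components via EM and keep the
            # model with the lowest BIC (-2 log-likelihood + #params * log n).
            best, best_bic = None, np.inf
            for G in range(1, max_G + 1):
                gm = GaussianMixture(n_components=G, n_init=5).fit(X)
                bic = gm.bic(X)
                if bic < best_bic:
                    best, best_bic = gm, bic
            return best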

    Cluster Correspondence Analysis and Reduced K-Means: A Two-Step Approach to Cluster Low Back Pain Patients

    For the IFCS 2017 data challenge on clustering low back pain (LBP) patients, we used a two-step approach. Two challenging characteristics of the data set are the presence of missing values and mixed-type variables. After a specific pretreatment, in the first step we performed domain-wise clustering using cluster correspondence analysis (clusCA). Using the output variables from each domain, we then performed the second step, reduced K-means clustering, to obtain the final clusters of patients. The conclusion section shows the final clustering results and a profile plot of the clusters. Every cluster is highly interpretable and evaluates well against the descriptive variables used to measure the clustering results.
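
    A simplified tandem analogue of this two-step pipeline is sketched below; the actual clusCA and reduced K-means jointly optimize the subspace and the partition, whereas here each domain is merely one-hot coded and reduced by truncated SVD before K-means. The list-of-DataFrames input and the dimension settings are assumptions for illustration.

        import numpy as np
        import pandas as pd
        from sklearn.decomposition import TruncatedSVD
        from sklearn.cluster import KMeans

        def two_step_clusters(domains, n_dims=2, n_clusters=4):
            # domains: list of pandas DataFrames, one per variable domain.
            scores = []
            for df in domains:
                indicator = pd.get_dummies(df.astype(str))           # one-hot coding
                Z = TruncatedSVD(n_components=n_dims).fit_transform(indicator)
                scores.append(Z)
            X = np.hstack(scores)                                    # joint reduced space
            return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)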

    Back Pain: A Spectral Clustering Approach

    We used a spectral clustering algorithm to find clusters among medical patients with lower back pain symptoms, and then we assessed the health outcomes within each cluster. First, we mapped all of the variables onto [0,1] intervals. This allowed us to compute a similarity score between every pair of patients, using an adaptation of Pearson correlation. We then calculated the spectral (eigen) decomposition of this similarity matrix, and we used the first few eigenvectors to create a low-dimensional subspace. Finally, we performed k-means clustering in this new subspace to find four clusters. We compared the cluster means and variances for each recovery assessment variable to differentiate the health outcomes for each cluster. Lastly, we highlighted the identifying symptoms of each patient cluster by inspecting any variable whose within-cluster average is extraordinarily low or high, relative to the other clusters.
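
    A compact sketch of the stated pipeline (min-max scaling to [0, 1], patient-by-patient correlation as the similarity, eigendecomposition, k-means on the leading eigenvectors) is given below; plain Pearson correlation is used here where the paper uses an adaptation of it.

        import numpy as np
        from sklearn.cluster import KMeans

        def cluster_patients(X, n_clusters=4, n_dims=4):
            # X: patients x variables. Map each variable onto [0, 1].
            X = np.asarray(X, dtype=float)
            X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
            S = np.corrcoef(X)                   # similarity between every pair of patients
            vals, vecs = np.linalg.eigh(S)       # spectral (eigen) decomposition
            U = vecs[:, -n_dims:]                # low-dimensional subspace
            return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)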

    Evaluation of Coordinated Ramp Metering (CRM) Implemented By Caltrans

    Coordinated ramp metering (CRM) is a critical component of smart freeway corridors that rely on real-time traffic data from ramps and the freeway mainline to improve decision-making by motorists and Traffic Management Center (TMC) personnel. CRM uses an algorithm that considers real-time traffic volumes on the freeway mainline and ramps and then adjusts the metering rates on the ramps accordingly for optimal flow along the entire corridor. Improving capacity through smart corridors is less costly and easier to deploy than freeway widening, given the high costs associated with right-of-way acquisition and construction. Nevertheless, conversion to smart corridors still represents a sizable investment for public agencies, and in the U.S. there have been limited evaluations of smart corridors in general, and CRM in particular, based on real operational data. This project examined the recent Smart Corridor implementation on Interstate 80 (I-80) in the Bay Area and State Route 99 (SR-99) in Sacramento based on travel time reliability measures, efficiency measures, and a before-and-after safety evaluation using the Empirical Bayes (EB) approach. As such, this evaluation represents the most complete before-and-after evaluation of such systems. The reliability measures include the buffer index, planning time, and measures from the literature that account for both the skew and width of the travel time distribution. For efficiency, the study estimates the ratio of vehicle miles traveled to vehicle hours traveled. The research contextualizes the before-and-after comparisons of efficiency and reliability measures through similar measures from control corridors in the same region that did not have CRM implemented (I-280 in District 4 and I-5 in District 3). The results show an improvement in freeway operation based on the efficiency data; post-CRM implementation, the travel time reliability measures do not show a similar improvement. The report also provides a counterfactual estimate of expected crashes in the post-implementation period, which can be compared with the actual number of crashes in the “after” period to evaluate effectiveness.
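
    The reliability and efficiency measures named above have standard definitions, sketched below; the report's exact formulations may differ in detail.

        import numpy as np

        def reliability_measures(travel_times):
            # travel_times: observed corridor travel times for one period (minutes).
            mean_tt = np.mean(travel_times)
            p95 = np.percentile(travel_times, 95)
            return {
                "planning_time": p95,                        # time a motorist should budget
                "buffer_index": (p95 - mean_tt) / mean_tt,   # extra buffer relative to the mean
            }

        def efficiency(vmt, vht):
            # Ratio of vehicle miles traveled to vehicle hours traveled,
            # i.e., the corridor's space-mean speed in mph.
            return vmt / vht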

    K-Means Clustering on Multiple Correspondence Analysis Coordinates

    On April 18, 2017, the International Federation of Classification Societies (IFCS) issued a challenge to its members and the classification community to analyze a data set of 928 low back pain patients. In this paper, we present our contribution in the form of a cluster analysis of this data set. We discuss our data cleaning process, which we view as a two-pronged approach: inferring values that are missing not at random and imputing values that are missing at random. We also discuss the challenges of clustering mixed data types and the data transformation required before applying a clustering algorithm; we call our proposed data transformation process split-then-join. Finally, we offer our interpretation of the clustering results with respect to the validation variables, and we present some thoughts on selecting important variables to classify new observations.
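
    A self-contained sketch of the title's pipeline, clustering on multiple correspondence analysis coordinates, is given below: the categorical variables are one-hot coded, correspondence analysis of the indicator matrix yields the MCA row coordinates, and K-means is run on the leading dimensions. This is a generic implementation for illustration, not the authors' exact code or their split-then-join transformation.

        import numpy as np
        import pandas as pd
        from sklearn.cluster import KMeans

        def mca_coordinates(df, n_dims=2):
            # Correspondence analysis of the one-hot indicator matrix
            # (equivalent to a basic MCA); returns row principal coordinates.
            Z = pd.get_dummies(df.astype(str)).to_numpy(dtype=float)
            P = Z / Z.sum()
            r = P.sum(axis=1)                                    # row masses
            c = P.sum(axis=0)                                    # column masses
            S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
            U, sig, _ = np.linalg.svd(S, full_matrices=False)
            coords = (U * sig) / np.sqrt(r)[:, None]
            return coords[:, :n_dims]

        def cluster_on_mca(df, n_clusters=4, n_dims=2):
            return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
                mca_coordinates(df, n_dims))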