
    Differential Performance Debugging with Discriminant Regression Trees

    Differential performance debugging is a technique for finding performance problems. It applies in situations where the performance of a program is (unexpectedly) different for different classes of inputs. The task is to explain the differences in asymptotic performance among the input classes in terms of program internals. We propose a data-driven technique based on discriminant regression tree (DRT) learning, where the goal is to discriminate among different classes of inputs. We propose a new algorithm for DRT learning that first clusters the data into functional clusters, capturing different asymptotic performance classes, and then invokes off-the-shelf decision-tree learning algorithms to explain these clusters. We focus on linear functional clusters and adapt classical clustering algorithms (K-means and spectral) to produce them. For the K-means algorithm, we generalize the notion of the cluster centroid from a point to a linear function. We adapt spectral clustering by defining a novel kernel function that captures the notion of linear similarity between two data points. We evaluate our approach on benchmarks consisting of Java programs whose performance we are interested in debugging. We show that our algorithm significantly outperforms other well-known regression tree learning algorithms in terms of running time and classification accuracy. Comment: To appear in AAAI 201
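The generalized K-means step described in this abstract, where each cluster centroid is a linear function rather than a point, can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (one scalar input feature, least-squares line refitting, random label initialization), not the authors' implementation:

```python
import numpy as np

def linear_kmeans(x, y, k, iters=50, seed=0):
    """K-means-style clustering in which each cluster 'centroid' is a
    line y = a*x + b: points are assigned to the line with the smallest
    squared residual, and each line is refit by least squares."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(x))
    coeffs = np.zeros((k, 2))  # (slope, intercept) per cluster
    for _ in range(iters):
        for j in range(k):
            if (labels == j).sum() >= 2:
                coeffs[j] = np.polyfit(x[labels == j], y[labels == j], 1)
        # squared residual of every point against every cluster's line
        resid = (coeffs[:, 0][:, None] * x[None, :]
                 + coeffs[:, 1][:, None] - y[None, :]) ** 2
        new_labels = resid.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coeffs
```

As with ordinary K-means, initialization matters: in practice one would restart from several random label assignments and keep the lowest-residual solution.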

    Bacteria classification using Cyranose 320 electronic nose

    Background: An electronic nose (e-nose), the Cyrano Sciences Cyranose 320, comprising an array of thirty-two polymer/carbon-black composite sensors, has been used to identify six species of bacteria responsible for eye infections when present at a range of concentrations in saline solutions. Readings were taken from the headspace of the samples by manually introducing the portable e-nose system into a sterile glass containing a fixed volume of bacteria in suspension. The gathered data reflected a very complex mixture of different chemical compounds. Method: Linear Principal Component Analysis (PCA) was able to separate four of the six bacteria classes, though the remaining two classes were not clearly distinguished, giving 74% classification accuracy. An innovative data-clustering approach was investigated for these data by combining 3-dimensional scatter plots, Fuzzy C-Means (FCM) and a Self-Organizing Map (SOM) network. Using these three clustering methods together gave a better 'classification' of the six eye-bacteria classes. Three supervised classifiers, namely the Multi-Layer Perceptron (MLP), Probabilistic Neural Network (PNN) and Radial Basis Function (RBF) network, were then used to classify the six bacteria classes. Results: A [6 × 1] SOM network gave 96% accuracy for bacteria classification, the best result among the clustering methods. A comparative evaluation of the classifiers was conducted for this application; the best results suggest that the six classes of bacteria can be predicted with up to 98% accuracy using the RBF network. Conclusion: Feature extraction from this type of bacteria data is very difficult, but the combined use of three nonlinear methods can solve the feature-extraction problem for very complex data and enhance the performance of the Cyranose 320.
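The PCA-then-classifier pipeline the abstract describes can be sketched compactly. The sketch below uses a from-scratch PCA and a minimal RBF network (Gaussian units centered at the training points, output weights solved by least squares); the sensor count, class count, and kernel width are illustrative assumptions, not the study's configuration:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (SVD of centered data)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return (X - mu) @ Vt[:n_components].T, mu, Vt[:n_components]

def rbf_fit(X, y, n_classes, gamma=1.0):
    """Minimal RBF network: one Gaussian unit per training point,
    linear output weights solved by least squares on one-hot targets."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-gamma * d2)
    T = np.eye(n_classes)[y]
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return X, W  # centers and output weights

def rbf_predict(centers, W, Xnew, gamma=1.0):
    """Class with the largest network output wins."""
    d2 = ((Xnew[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return (np.exp(-gamma * d2) @ W).argmax(axis=1)
```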

    An Adaptation of K-means-type algorithms to the Grassmann manifold

    Spring 2019. Includes bibliographical references. The Grassmann manifold provides a robust framework for the analysis of high-dimensional data through the use of subspaces. Treating data as subspaces allows for separability between data classes that is not otherwise achieved in Euclidean space, particularly with the use of the smallest-principal-angle pseudometric. Clustering algorithms focus on identifying similarities within data and highlighting its underlying structure. To exploit the properties of the Grassmannian for unsupervised data analysis, two variations of the popular K-means algorithm are adapted to perform clustering directly on the manifold. We provide the theoretical foundations needed for computations on the Grassmann manifold and detailed derivations of the key equations. Both algorithms are then thoroughly tested on toy data and on two benchmark data sets from machine learning: the MNIST handwritten digit database and the AVIRIS Indian Pines hyperspectral data. Performance of the algorithms is tested on manifolds of varying dimension. Unsupervised classification results on the benchmark data are compared to those currently found in the literature.
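The smallest-principal-angle pseudometric mentioned above is easy to compute from an SVD, and clustering directly on the manifold can then be illustrated with a K-medoids stand-in (medoids sidestep computing means on the Grassmannian; the thesis's actual K-means adaptations differ):

```python
import numpy as np

def smallest_principal_angle(A, B):
    """Smallest principal angle between span(A) and span(B): orthonormalise
    both bases by QR, then take arccos of the largest singular value of
    Qa^T Qb."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.arccos(np.clip(s.max(), -1.0, 1.0)))

def grassmann_kmedoids(subspaces, k, iters=20):
    """K-medoids under the smallest-principal-angle pseudometric.
    Deterministic init: the first k subspaces serve as medoids."""
    n = len(subspaces)
    D = np.array([[smallest_principal_angle(subspaces[i], subspaces[j])
                   for j in range(n)] for i in range(n)])
    medoids = list(range(k))
    labels = D[:, medoids].argmin(axis=1)
    for _ in range(iters):
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):  # medoid = member minimising total distance
                sub = D[np.ix_(members, members)]
                medoids[j] = int(members[sub.sum(axis=1).argmin()])
        labels = D[:, medoids].argmin(axis=1)
    return labels, medoids
```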

    Unsupervised classification and areal measurement of land and water coastal features on the Texas coast

    Multispectral scanner (MSS) digital data from ERTS-1 were used to delineate coastal land, vegetative, and water features in two portions of the Texas Coastal Zone. The data (Scene IDs 1037-16244 and 1037-16251), acquired on August 29, 1972, were analyzed on NASA Johnson Space Center systems using two clustering algorithms. Seventeen to thirty spectrally homogeneous classes were thus defined. Many classes were identified as pure features such as water masses, salt marsh, beaches, pine, hardwoods, and exposed soil or construction materials. Most classes were identified as mixtures of the pure class types. Using an objective technique for measuring the percentage of wetland along salt-marsh boundaries, the accuracy of areal measurement of salt marshes was analyzed. Accuracies ranged from 89 to 99 percent. Aircraft photography was used as the basis for determining the true areal size of the salt marshes in the study sites.
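The cluster-then-measure workflow generalizes readily: cluster pixel spectra into spectrally homogeneous classes, then convert each class's pixel count to area. A minimal numpy sketch, with plain K-means standing in for the JSC clustering algorithms and the band values and per-pixel area as illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain K-means on pixel spectra (rows of X are per-pixel band values)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def class_area(labels, cls, pixel_area):
    """Areal estimate for one spectral class: pixel count times pixel area."""
    return int((labels == cls).sum()) * pixel_area
```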

    Consistency of spectral clustering

    Consistency is a key property of all statistical procedures analyzing randomly sampled data. Surprisingly, despite decades of work, little is known about the consistency of most clustering algorithms. In this paper we investigate the consistency of the popular family of spectral clustering algorithms, which cluster the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that, as the sample size increases, those eigenvectors converge to the eigenvectors of certain limit operators. As a result, we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/009053607000000640.
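In algorithmic form, the normalized class the paper proves consistent corresponds to clustering with eigenvectors of a normalized graph Laplacian. A minimal sketch of one standard normalized variant (Gaussian affinity and the kernel width are illustrative choices, not the paper's setting):

```python
import numpy as np

def normalized_spectral_clustering(X, k, sigma=1.0, seed=0):
    """One standard normalized variant: Gaussian affinity, symmetric
    normalized Laplacian, eigenvectors of the k smallest eigenvalues,
    row normalisation, then a tiny K-means in the embedded space."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    dinv = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - dinv[:, None] * W * dinv[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)    # K-means on the rows of U
    C = U[rng.choice(len(U), k, replace=False)]
    for _ in range(50):
        labels = ((U[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = U[labels == j].mean(axis=0)
    return labels
```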

    DCSI -- An improved measure of cluster separability based on separation and connectedness

    Whether the class labels in a given data set correspond to meaningful clusters is crucial for the evaluation of clustering algorithms on real-world data sets. This property can be quantified by separability measures. A review of the existing literature shows that neither classification-based complexity measures nor cluster validity indices (CVIs) adequately incorporate the central aspects of separability for density-based clustering: between-class separation and within-class connectedness. A newly developed measure, the density cluster separability index (DCSI), aims to quantify these two characteristics and can also be used as a CVI. Extensive experiments on synthetic data indicate that DCSI correlates strongly with the performance of DBSCAN measured via the adjusted Rand index (ARI), but lacks robustness on multi-class data sets with overlapping classes that are ill-suited for density-based hard clustering. Detailed evaluation on frequently used real-world data sets shows that DCSI can correctly identify touching or overlapping classes that do not form meaningful clusters.
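As a toy illustration of the two ingredients only (this is not the DCSI definition, which is density-based; it is the simplest possible combination of between-class separation and within-class connectedness, with connectedness proxied by the largest edge in each class's Euclidean minimum spanning tree):

```python
import numpy as np

def mst_max_edge(X):
    """Largest edge in the Euclidean MST of X (Prim's algorithm) --
    a crude proxy for within-class connectedness."""
    n = len(X)
    if n < 2:
        return 0.0
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    in_tree = np.zeros(n, bool)
    in_tree[0] = True
    best = D[0].copy()          # cheapest connection of each node to the tree
    max_edge = 0.0
    for _ in range(n - 1):
        best[in_tree] = np.inf
        i = best.argmin()
        max_edge = max(max_edge, best[i])
        in_tree[i] = True
        best = np.minimum(best, D[i])
    return max_edge

def simple_separability(X, y):
    """Score in (0, 1): high when classes are far apart relative to how
    tightly connected each class is internally."""
    classes = np.unique(y)
    conn = max(mst_max_edge(X[y == c]) for c in classes)
    D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    sep = min(D[np.ix_(y == a, y == b)].min()
              for a in classes for b in classes if a < b)
    return sep / (sep + conn)
```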

    Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes

    BACKGROUND: Cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see whether the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus the Pearson's correlation coefficient as the measure of dissimilarity), such groupings can oftentimes be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters, using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species. RESULTS: In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in terms of its ability to produce biologically meaningful clusters. The first measure is the biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. It can be used to quantify the performance of a given clustering algorithm, such as UPGMA, in grouping genes for a particular data set, and also to compare the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called the biological stability index (BSI).
    For a given clustering algorithm and expression data set, it measures the consistency of the algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have a high BHI and a moderate to high BSI. We evaluated the performance of ten well-known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples from breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency. CONCLUSION: Functional information on annotated genes, available from various GO databases and mined using ontology tools, can be used to systematically judge the results of an unsupervised clustering algorithm applied to a gene expression data set. This information can be used to select the right algorithm from a class of clustering algorithms for the given data set.
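A homogeneity index of the kind described can be sketched directly. One common formulation scores pairs of annotated genes placed in the same cluster; the paper's exact BHI definition may differ (e.g. by averaging per cluster), so treat this as an illustration:

```python
from itertools import combinations

def bhi(clusters, annotations):
    """Fraction of same-cluster pairs of annotated genes that share at
    least one functional class.  `clusters` maps cluster id -> gene list;
    `annotations` maps gene -> set of functional classes (e.g. GO terms)."""
    pairs = shared = 0
    for members in clusters.values():
        annotated = [g for g in members if annotations.get(g)]
        for g1, g2 in combinations(annotated, 2):
            pairs += 1
            shared += bool(annotations[g1] & annotations[g2])
    return shared / pairs if pairs else 0.0
```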