2,924 research outputs found
Recommended from our members
Statistical clustering of data
Cluster analysis aims to segment objects into groups of similar members and thereby helps to discover the distribution of properties and correlations in large datasets. Data clustering has been widely studied because it arises in many domains, including marketing, engineering, and the social sciences. In particular, the growth of large-scale transactional and experimental datasets in recent years has significantly increased the need for clustering techniques that reduce the number of objects to consider and give a better understanding of the data. This report introduces fundamental concepts related to cluster analysis, addresses the similarity and dissimilarity measurements used to define clusters, and explains three major clustering algorithms (hierarchical clustering, K-means clustering, and a Gaussian mixture model fitted by the Expectation-Maximization (EM) algorithm) both theoretically and experimentally to illustrate the process of clustering. Finally, methods for determining the number of clusters and validating the clustering are presented for clustering evaluation.
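As an illustration of one of the algorithms named in this abstract, the following is a minimal sketch of K-means (Lloyd's iteration). It is not the report's code: the function name `kmeans`, the choice of Euclidean distance, and the requirement that the caller supply the initial centroids are all assumptions made here to keep the example small and deterministic.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd's iteration; `centroids` holds caller-chosen starting points.

    Hypothetical helper for illustration only. Returns (labels, centroids).
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, which is one reason the abstract treats choosing the number of clusters and validating the clustering as separate steps.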
Statistical clustering of temporal networks through a dynamic stochastic block model
Statistical node clustering in discrete time dynamic networks is an emerging
field that raises many challenges. Here, we explore statistical properties and
frequentist inference in a model that combines a stochastic block model (SBM)
for its static part with independent Markov chains for the evolution of the
nodes groups through time. We model binary data as well as weighted dynamic
random graphs (with discrete or continuous edges values). Our approach,
motivated by the importance of controlling for label switching issues across
the different time steps, focuses on detecting groups characterized by a stable
within group connectivity behavior. We study identifiability of the model
parameters, propose an inference procedure based on a variational expectation
maximization algorithm as well as a model selection criterion to select for the
number of groups. We carefully discuss our initialization strategy which plays
an important role in the method and compare our procedure with existing ones on
synthetic datasets. We also illustrate our approach on dynamic contact
networks, one of encounters among high school students and two others on animal
interactions. An implementation of the method is available as an R package
called dynsbm.
A Simple BATSE Measure of GRB Duty Cycle
We introduce a definition of gamma-ray burst (GRB) duty cycle that describes
the GRB's efficiency as an emitter; it is the GRB's average flux relative to
the peak flux. This GRB duty cycle is easily described in terms of measured
BATSE parameters; it is essentially fluence divided by the quantity peak flux
times duration. Since fluence and duration are two of the three defining
characteristics of the GRB classes identified by statistical clustering
techniques (the other is spectral hardness), duty cycle is a potentially
valuable probe for studying properties of these classes.
Comment: 4 pages, 1 figure, presented at the 5th Huntsville Gamma-Ray Burst
Symposium
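The duty cycle described in this abstract (fluence divided by peak flux times duration) can be sketched as a one-line ratio. The function name and the unit convention below are assumptions, not taken from the paper: fluence and peak flux must be expressed in matching units (for example erg/cm^2 and erg/cm^2/s) with duration in seconds, so that the result is dimensionless.

```python
def duty_cycle(fluence, peak_flux, duration):
    """Average flux relative to peak flux: fluence / (peak_flux * duration).

    Hypothetical helper; assumes consistent units so the ratio is dimensionless.
    """
    if peak_flux <= 0 or duration <= 0:
        raise ValueError("peak_flux and duration must be positive")
    return fluence / (peak_flux * duration)


# A burst that emits at its peak flux for the whole duration has duty cycle 1;
# this illustrative burst averages half its peak flux.
example = duty_cycle(fluence=2.0, peak_flux=1.0, duration=4.0)
```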
RGFGA: An efficient representation and crossover for grouping genetic algorithms
There is substantial research into genetic algorithms that are used to group large numbers of
objects into mutually exclusive subsets based upon some fitness function. However, nearly all
methods involve degeneracy to some degree.
We introduce a new representation for grouping genetic algorithms, the restricted growth function
genetic algorithm, that effectively removes all degeneracy, resulting in a more efficient search. A new crossover operator is also described that exploits a measure of similarity between chromosomes in a population. Using several synthetic datasets, we compare the performance of our representation and crossover with another well-known state-of-the-art GA method, a strawman
optimisation method and a well-established statistical clustering algorithm, with encouraging results.
Properties of Gamma-Ray Burst Classes
The three gamma-ray burst (GRB) classes identified by statistical clustering
analysis (Mukherjee et al. 1998) are examined using the pattern recognition
algorithm C4.5 (Quinlan 1986). Although the statistical existence of Class 3
(intermediate duration, intermediate fluence, soft) is supported, the
properties of this class do not need to arise from a distinct source
population. Class 3 properties can easily be produced from Class 1 (long, high
fluence, intermediate hardness) by a combination of measurement error,
hardness/intensity correlation, and a newly-identified BATSE bias (the fluence
duration bias). Class 2 (short, low fluence, hard) does not appear to be
related to Class 1.
Comment: 5 pages, 4 imbedded figures, presented at the 5th Huntsville
Gamma-Ray Burst Symposium
Statistical Clustering of Glioblastoma Multiforme for Graph Theory Analysis
In statistical clustering, proteins that cluster together are likely to possess a functional relationship with each other. By statistically clustering and filtering proteomic data, networks can be created so that the vast complexity of protein-protein interaction data can be understood and meaningfully analyzed. Here, glioblastoma and glioblastoma multiforme phosphorylation data were obtained from PhosphoSitePlus and subsequently analyzed using R. The binary data were input into a dataframe and collapsed by their gene names. The Spearman-Euclidean and Euclidean distances were then calculated, with t-distributed stochastic neighbor embedding being performed separately on the outputs. The results were then divided into discrete clusters. Excessively large clusters were broken down to a manageable size via a penalized matrix decomposition. The rank of the penalized matrix decomposition was determined by interpolating values of the data cluster using DINEOF, running PCA on the populated dataframe, plotting the number of principal components against the proportion of variance explained, and finally choosing the point of diminishing returns that still explained over 90% of the variance. Clusters were transformed into networks and then visualized in Cytoscape. The final networks represent a useful tool for researchers concerned with protein-protein interactions in glioblastomas. Work is being done to integrate these networks with those obtained from mass spectrometry peak intensities, allowing meaningful analysis of legacy datasets.
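The rank-selection step in this abstract can be sketched in simplified form. The abstract's authors worked in R and picked a point of diminishing returns from a plot; the Python helper below only applies the 90% floor, returning the smallest number of components whose cumulative share of variance exceeds it. The function name and the input (a list of per-component variances) are assumptions made for illustration.

```python
def components_for_variance(explained_variance, threshold=0.90):
    """Smallest k such that the top-k components explain more than `threshold`.

    Hypothetical helper; `explained_variance` is a list of per-component
    variances (for example PCA eigenvalues), in any order.
    """
    total = sum(explained_variance)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    ranked = sorted(explained_variance, reverse=True)
    for k, variance in enumerate(ranked, start=1):
        cumulative += variance
        if cumulative / total > threshold:
            return k
    return len(ranked)


# Cumulative shares are 0.50, 0.80, 0.95, 1.00, so three components are needed.
rank = components_for_variance([5.0, 3.0, 1.5, 0.5])
```

A visual elbow check, as described in the abstract, could select a larger rank than this floor alone.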
Discovery of Activities via Statistical Clustering of Fixation Patterns
Human behavior often consists of a series of distinct activities, each characterized by a unique signature of visual behavior. This is true even in a restricted domain, such as piloting an aircraft, where patterns of visual signatures might represent activities like communicating, navigating, and monitoring. We propose a novel analysis method for gaze-tracking data, to perform blind discovery of these activities based on their behavioral signatures. The method is in some respects similar to recurrence analysis, but here we compare not individual fixations, but groups of fixations aggregated over a fixed time interval. The duration of this interval is a parameter of the method. We assume that the environment has been divided into a set of N different areas-of-interest (AOIs). For a given interval of time of this duration, we compute the proportion of time spent fixating each AOI, resulting in an N-dimensional vector. These proportions can be converted to counts by multiplying by the interval duration divided by the average fixation duration (another parameter that we fix at 280 milliseconds). We compare different intervals by computing the chi-square statistic. The p-value associated with the statistic is the likelihood of observing the data under the hypothesis that the data in the two intervals were generated by a single process with a single set of probabilities governing the fixation of each AOI. We have investigated the method using a set of 10 synthetic "activities" that sample 4 AOIs. Four of these activities visit 3 of the 4 AOIs with equal probability; as there are four different ways to leave one out, there are four such activities. Similarly, there are six different activities that leave two out. Sequences of simulated behavior were generated by running each activity for 40 seconds, in sequence, for a total of 6.7 minutes. The figure to the right shows the matrix of chi-square statistics, using a value of 2.8 seconds for the interval duration, corresponding to 10 fixations.
Low values (dark) indicate poor evidence for activity differences, while high values (bright) indicate strong evidence. The dark squares along the main diagonal each correspond to the forty-second intervals in which the activity was held constant; the 4x4 block at the lower left corresponds to the four leave-one-out activities, while the 6x6 block in the upper right corresponds to the leave-two-out activities. (The anti-diagonal pattern of white squares indicates those activity pairs that share no AOIs.) The chi-square values can be binarized by choosing a particular significance level; we are interested in grouping bins that represent the same activity, effectively accepting the null hypothesis. Therefore, we may adopt a relatively lax criterion; for example, choosing a p-value of 0.2 means that two behaviors that have only a 1-in-5 chance of being produced by a single activity might nevertheless be clustered together. We have explored several methods to perform clustering on the data and solve for the activity probabilities. Greedy methods begin by selecting the time bin that is similar to the most (or least) other bins, and then forming a cluster from it and all other non-discriminable bins. These methods show mediocre performance, as they do not take into account temporal contiguity. Preliminary results indicate that methods that "grow" clusters in time from seed points perform better.
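The interval comparison in this abstract can be sketched as follows. The abstract does not say which form of the chi-square statistic was used, so this example assumes a standard two-sample test of homogeneity on a 2 x N table of fixation counts; the function names are hypothetical. Only the statistic and its degrees of freedom are computed. Turning them into a p-value would need a chi-square distribution function from a statistics library.

```python
def proportions_to_counts(proportions, interval_s, mean_fixation_s=0.28):
    """Scale AOI time proportions to approximate fixation counts."""
    n_fixations = interval_s / mean_fixation_s
    return [p * n_fixations for p in proportions]


def chi_square_homogeneity(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two count vectors.

    Assumed test form: both intervals share one set of AOI probabilities
    under the null hypothesis. AOIs unvisited in both intervals are skipped.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    statistic, used = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # no information from this AOI
        used += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            statistic += (observed - expected) ** 2 / expected
    return statistic, max(used - 1, 0)


# Two 2.8 s intervals that share no AOIs, versus two identical intervals.
a = proportions_to_counts([0.5, 0.5, 0.0, 0.0], interval_s=2.8)
b = proportions_to_counts([0.0, 0.0, 0.5, 0.5], interval_s=2.8)
different, dof = chi_square_homogeneity(a, b)
same, _ = chi_square_homogeneity(a, a)
```

With only about 10 fixations per interval the expected counts are small, so the chi-square approximation is rough. That is consistent with the abstract's choice of a lax significance criterion.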
Novel Statistical Clustering Method for Accurate Characterization of Word Pronunciation
This paper discusses the development of a method to determine the accuracy of the pronunciation of a word using global statistical signal analysis parameters. The engineering word chosen is "leaching". The pronunciation of the word "leaching" in the French language was recorded from one native speaker and four students. The recording process used a microphone-laptop system configuration, and the signal analysis used MATLAB software. Time and frequency domain plots show a variety of waveforms according to the recorded pronunciation. For data processing, the statistical signal analysis parameters used to extract the signal's features are kurtosis, root mean square and skewness. A mapping process was performed to cluster the data. The positions of the samples from the students are compared with the samples from the native speaker. The accuracy of each student's pronunciation can then be evaluated by comparing the positions of all the samples. In conclusion, the mapping and clustering methods developed are able to characterize the accuracy of the pronunciation of words.