2,924 research outputs found

Statistical clustering of data

Author: Zhang Lihao
Publication date: 16/11/2015

Cluster analysis aims to segment objects into groups of similar members and therefore helps reveal the distribution of properties and correlations in large datasets. Data clustering has been widely studied because it arises in many domains, including marketing, engineering, and the social sciences. In particular, the growth of large-scale transactional and experimental datasets in recent years has significantly increased the need for clustering techniques that reduce the number of objects under consideration and yield a better understanding of the data. This report introduces fundamental concepts of cluster analysis, addresses the similarity and dissimilarity measures used to define clusters, and explains three major clustering algorithms (hierarchical clustering, K-means clustering, and the Gaussian mixture model fitted by the Expectation-Maximization (EM) algorithm) both theoretically and experimentally to illustrate the clustering process. Finally, it presents methods for determining the number of clusters and for validating the clustering as means of clustering evaluation.

Texas ScholarWorks

Statistical clustering of temporal networks through a dynamic stochastic block model

Authors: Matias Catherine; Miele Vincent
Publication date: 22/06/2016

Statistical node clustering in discrete-time dynamic networks is an emerging field that raises many challenges. Here, we explore statistical properties and frequentist inference in a model that combines a stochastic block model (SBM) for its static part with independent Markov chains for the evolution of the node groups through time. We model binary data as well as weighted dynamic random graphs (with discrete or continuous edge values). Our approach, motivated by the importance of controlling for label-switching issues across the different time steps, focuses on detecting groups characterized by a stable within-group connectivity behavior. We study identifiability of the model parameters and propose an inference procedure based on a variational expectation-maximization algorithm, as well as a model selection criterion for choosing the number of groups. We carefully discuss our initialization strategy, which plays an important role in the method, and compare our procedure with existing ones on synthetic datasets. We also illustrate our approach on dynamic contact networks: one of encounters among high school students and two others of animal interactions. An implementation of the method is available as an R package called dynsbm.

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

A Simple BATSE Measure of GRB Duty Cycle

Authors: Hakkila Jon; Pendleton Geoffrey N.; Preece Robert D.
Publication venue: AIP Publishing
Publication date: 01/01/2000

We introduce a definition of gamma-ray burst (GRB) duty cycle that describes the GRB's efficiency as an emitter; it is the GRB's average flux relative to the peak flux. This GRB duty cycle is easily described in terms of measured BATSE parameters; it is essentially fluence divided by the quantity peak flux times duration. Since fluence and duration are two of the three defining characteristics of the GRB classes identified by statistical clustering techniques (the other is spectral hardness), duty cycle is a potentially valuable probe for studying properties of these classes. Comment: 4 pages, 1 figure, presented at the 5th Huntsville Gamma-Ray Burst Symposium

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server
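
The duty cycle described in the abstract above (fluence divided by peak flux times duration) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and argument names are hypothetical, and the inputs are assumed to be in mutually consistent units (for example fluence in erg/cm^2, peak flux in erg/cm^2/s, duration in seconds). The paper's actual choice of BATSE quantities is not given in the abstract.

```python
def duty_cycle(fluence, peak_flux, duration):
    """Average flux relative to peak flux: fluence / (peak_flux * duration).

    Assumes consistent units so that the result is dimensionless.
    The names here are illustrative, not taken from the paper.
    """
    if peak_flux <= 0 or duration <= 0:
        raise ValueError("peak_flux and duration must be positive")
    # fluence / duration is the average flux; dividing by the peak flux
    # gives the fraction of peak output sustained over the burst.
    return fluence / (peak_flux * duration)


if __name__ == "__main__":
    # Made-up numbers purely to exercise the function.
    print(duty_cycle(fluence=10.0, peak_flux=2.0, duration=10.0))
```

A burst that emits at its peak flux for its whole duration would have a duty cycle of 1; smaller values indicate a burst that spends most of its duration well below peak.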

RGFGA: An efficient representation and crossover for grouping genetic algorithms

Authors: Crampton J; Swift S; Tucker A
Publication venue: MIT Press - Journals
Publication date: 01/01/2005

There is substantial research into genetic algorithms that are used to group large numbers of objects into mutually exclusive subsets based upon some fitness function. However, nearly all methods involve degeneracy to some degree. We introduce a new representation for grouping genetic algorithms, the restricted growth function genetic algorithm, that effectively removes all degeneracy, resulting in a more efficient search. A new crossover operator is also described that exploits a measure of similarity between chromosomes in a population. Using several synthetic datasets, we compare the performance of our representation and crossover with another well-known state-of-the-art GA method, a strawman optimisation method, and a well-established statistical clustering algorithm, with encouraging results.

Brunel University Research Archive
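
To illustrate why a restricted growth function removes degeneracy, the sketch below relabels any group assignment into restricted-growth form, so that every labeling of the same partition maps to one canonical sequence. It uses the standard combinatorial definition (the first element is 0 and each element is at most one more than the largest value before it). It does not reproduce the paper's chromosome encoding or crossover operator, and the function names are hypothetical.

```python
def to_restricted_growth(labels):
    """Relabel groups in order of first appearance (0, 1, 2, ...).

    Two label vectors describing the same partition yield the same output,
    which is what removes the redundancy between equivalent encodings.
    """
    mapping = {}
    canonical = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)  # next unused group index
        canonical.append(mapping[label])
    return canonical


def is_restricted_growth(seq):
    """Check that each value is at most one more than the largest seen so far."""
    highest = -1
    for value in seq:
        if value < 0 or value > highest + 1:
            return False
        highest = max(highest, value)
    return True


if __name__ == "__main__":
    a = [2, 2, 0, 1, 0]
    b = [5, 5, 3, 9, 3]  # same partition as a, different group labels
    print(to_restricted_growth(a), to_restricted_growth(b))
```

Both example vectors place objects 0 and 1 together and objects 2 and 4 together, so they reduce to the same canonical sequence even though their raw labels differ.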

Properties of Gamma-Ray Burst Classes

Authors: Haglin David J.; Hakkila Jon; Mallozzi Robert S.; Meegan Charles A.; Pendleton Geoffrey N.; Roiger Richard J.
Publication venue: AIP Publishing
Publication date: 01/01/2000

The three gamma-ray burst (GRB) classes identified by statistical clustering analysis (Mukherjee et al. 1998) are examined using the pattern recognition algorithm C4.5 (Quinlan 1986). Although the statistical existence of Class 3 (intermediate duration, intermediate fluence, soft) is supported, the properties of this class do not need to arise from a distinct source population. Class 3 properties can easily be produced from Class 1 (long, high fluence, intermediate hardness) by a combination of measurement error, hardness/intensity correlation, and a newly identified BATSE bias (the fluence duration bias). Class 2 (short, low fluence, hard) does not appear to be related to Class 1. Comment: 5 pages, 4 embedded figures, presented at the 5th Huntsville Gamma-Ray Burst Symposium

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server

Statistical Clustering of Glioblastoma Multiforme for Graph Theory Analysis

Author: Syrenne Jed
Publication venue: ScholarWorks at University of Montana
Publication date: 27/04/2018

In statistical clustering, proteins that cluster together are likely to possess a functional relationship with each other. By statistically clustering and filtering proteomic data, networks can be created so that the vast complexity of protein-protein interaction data can be understood and meaningfully analyzed. Here, glioblastoma and glioblastoma multiforme phosphorylation data were obtained from PhosphoSitePlus and subsequently analyzed using R. The binary data were input into a dataframe and collapsed by their gene names. The Spearman-Euclidean and Euclidean distances were then calculated, with t-distributed stochastic neighbor embedding being performed separately on the outputs. The results were then divided into discrete clusters. Excessively large clusters were broken down to a manageable size via a penalized matrix decomposition. The rank of the penalized matrix decomposition was determined by interpolating values of the data cluster using DINEOF, running PCA on the populated dataframe, plotting the number of principal components against the proportion of variance explained, and finally choosing the point of diminishing returns that still explained over 90% of the variance. Clusters were transformed into networks and then visualized in Cytoscape. The final networks represent a useful tool for researchers concerned with protein-protein interactions in glioblastomas. Work is being done to integrate these networks with those obtained from mass spectrometry peak intensities, allowing meaningful analysis of legacy datasets.

University of Montana

Discovery of Activities via Statistical Clustering of Fixation Patterns

Author: Mulligan Jeffrey B
Publication venue: Purdue University (bepress)
Publication date: 16/05/2018

Human behavior often consists of a series of distinct activities, each characterized by a unique signature of visual behavior. This is true even in a restricted domain, such as piloting an aircraft, where patterns of visual signatures might represent activities like communicating, navigating, and monitoring. We propose a novel analysis method for gaze-tracking data, to perform blind discovery of these activities based on their behavioral signatures. The method is in some respects similar to recurrence analysis, but here we compare not individual fixations, but groups of fixations aggregated over a fixed time interval. The duration of this interval is a parameter that we will refer to as . We assume that the environment has been divided into a set of N different areas-of-interest (AOIs). For a given interval of time of duration , we compute the proportion of time spent fixating each AOI, resulting in an N-dimensional vector. These proportions can be converted to counts by multiplying by divided by the average fixation duration (another parameter that we fix at 280 milliseconds). We compare different intervals by computing the chi-square statistic. The p-value associated with the statistic is the likelihood of observing the data under the hypothesis that the data in the two intervals were generated by a single process with a single set of probabilities governing the fixation of each AOI. We have investigated the method using a set of 10 synthetic "activities" that sample 4 AOIs. Four of these activities visit 3 of the 4 AOIs, with equal probability; as there are four different ways to leave-one-out, there are four such activities. Similarly, there are six different activities that leave-two-out. Sequences of simulated behavior were generated by running each activity for 40 seconds, in sequence, for a total of 6.7 minutes. The figure to the right shows the matrix of chi-square statistics, using a value of 2.8 seconds for , corresponding to 10 fixations.
Low values (dark) indicate poor evidence for activity differences, while high values (bright) indicate strong evidence. The dark squares along the main diagonal each correspond to the forty-second intervals in which the activity was held constant; the 4x4 block at the lower left corresponds to the four leave-one-out activities, while the 6x6 block in the upper right corresponds to the leave-two-out activities. (The anti-diagonal pattern of white squares indicates those activity pairs that share no AOIs.) The chi-square values can be binarized by choosing a particular significance level; we are interested in grouping bins that represent the same activity, effectively accepting the null hypothesis. Therefore, we may adopt a relatively lax criterion; for example, choosing a p-value of 0.2 means that two behaviors that have only a 1-in-5 chance of being produced by a single activity might nevertheless be clustered together. We have explored several methods to cluster the data and solve for the activity probabilities. Greedy methods begin by selecting the time bin that is similar to the most (or least) other bins, and then forming a cluster from it and all other non-discriminable bins. These methods show mediocre performance, as they do not take into account temporal contiguity. Preliminary results indicate that methods that "grow" clusters in time from seed points perform better.

NASA Technical Reports Server

Purdue E-Pubs
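
The interval comparison described in the abstract above can be sketched as follows. This is a minimal illustration, not the author's implementation. It assumes that fixation proportions are scaled to counts by the interval duration divided by the 280 ms average fixation duration, and that the two intervals are compared with a Pearson chi-square test of homogeneity on a 2 x N table. The function and parameter names (`proportions_to_counts`, `chi_square_two_intervals`, `interval_s`) are hypothetical. The p-value is left out; it could be obtained from a chi-square survival function such as `scipy.stats.chi2.sf`.

```python
def proportions_to_counts(proportions, interval_s, mean_fixation_s=0.28):
    """Scale per-AOI dwell proportions to approximate fixation counts."""
    n_fixations = interval_s / mean_fixation_s  # e.g. 2.8 s / 0.28 s = 10
    return [p * n_fixations for p in proportions]


def chi_square_two_intervals(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two intervals.

    Treats the two count vectors as rows of a 2 x N contingency table.
    AOIs with zero counts in both intervals are skipped.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both intervals must cover the same AOIs")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each interval needs a positive total count")
    grand = total_a + total_b
    statistic, used = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        if column == 0:
            continue  # AOI never fixated in either interval
        used += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            statistic += (observed - expected) ** 2 / expected
    return statistic, max(used - 1, 0)


if __name__ == "__main__":
    # Two leave-two-out activities that share no AOIs, 2.8 s intervals.
    a = proportions_to_counts([0.5, 0.5, 0.0, 0.0], interval_s=2.8)
    b = proportions_to_counts([0.0, 0.0, 0.5, 0.5], interval_s=2.8)
    print(chi_square_two_intervals(a, b))
```

Intervals drawn from the same activity give a statistic near zero, while activities that share no AOIs give the largest values, which matches the bright anti-diagonal pattern the abstract describes.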

Novel Statistical Clustering Method for Accurate Characterization of Word Pronunciation

Authors: Bahari Abdul Rahim; Mat Saad Suziana; Musa Aminatuzzaharah; Nuawi Mohd Zaki; Rizman Zairi Ismael
Publication venue: Insight Society
Publication date: 31/08/2017

This paper describes the development of a method for assessing the accuracy of a word's pronunciation using global statistical signal analysis parameters. The engineering term chosen is ‘leaching’. The pronunciation of ‘leaching’ in French was recorded from 1 native speaker and 4 students. Recording used a microphone-laptop configuration, and signal analysis was carried out in MATLAB. Time- and frequency-domain plots show a variety of waveforms across the recorded pronunciations. The statistical signal analysis parameters used to extract the signal’s features are kurtosis, root mean square, and skewness. A mapping process was then performed to cluster the data. The positions of the students' samples are compared against those of the native speaker's samples, and the accuracy of each student's pronunciation is evaluated from the relative positions of all the samples. In conclusion, the mapping and clustering methods developed are able to characterize the accuracy of word pronunciation.

International Journal on Advanced Science, Engineering and Information Technology
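
The three features named in the abstract above (kurtosis, root mean square, and skewness) can be computed from a sampled signal as sketched below. The abstract does not say which estimator conventions the authors used in MATLAB, so this sketch assumes population (biased) central moments and non-excess kurtosis. The function name `signal_features` is hypothetical.

```python
import math


def signal_features(samples):
    """Return (rms, skewness, kurtosis) of a sequence of samples.

    Skewness is m3 / m2**1.5 and kurtosis is m4 / m2**2 (not excess),
    where m2, m3, m4 are population central moments. These conventions
    are an assumption; the source does not state them.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("signal must contain at least one sample")
    mean = sum(samples) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    m2 = sum((x - mean) ** 2 for x in samples) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for a constant signal")
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return rms, m3 / m2 ** 1.5, m4 / m2 ** 2


if __name__ == "__main__":
    # A toy square-wave-like signal, not real speech data.
    print(signal_features([1.0, -1.0, 1.0, -1.0]))
```

Each recording then becomes a point in a three-dimensional feature space, where a student's sample can be compared by position with the native speaker's, as the abstract outlines.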