Segregation Indices for Disease Clustering
Spatial clustering has important implications in various fields. In
particular, disease clustering is of major public concern in epidemiology. In
this article, we propose the use of two distance-based segregation indices to
test the significance of disease clustering among subjects whose locations are
from a homogeneous or an inhomogeneous population. We derive their asymptotic
distributions and compare them with other distance-based disease clustering
tests in terms of empirical size and power by extensive Monte Carlo
simulations. The null pattern we consider is the random labeling (RL) of cases
and controls to the given locations. Along this line, we investigate the
sensitivity of the size of these tests to the underlying background pattern
(e.g., clustered or homogeneous) on which the RL is applied, the level of
clustering and number of clusters, or differences in relative abundances of the
classes. We demonstrate that differences in relative abundance have the highest
impact on the empirical sizes of the tests. We also propose various non-RL
patterns as alternatives to the RL pattern and assess the empirical power
performance of the tests under these alternatives. We illustrate the methods on
two real-life examples from epidemiology. Comment: 31 pages, 13 figures, 3 tables
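The random labeling null described in this abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' segregation indices: the statistic used here (the number of cases whose nearest neighbour is also a case) is a simple stand-in chosen for illustration, and the function names, the label coding (1 = case, 0 = control) and the toy coordinates are all assumptions.

```python
import math
import random


def nearest_neighbours(points):
    """Return, for each point, the index of its nearest other point."""
    nn = []
    for i, (xi, yi) in enumerate(points):
        best_j, best_d = -1, math.inf
        for j, (xj, yj) in enumerate(points):
            if j == i:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d < best_d:
                best_j, best_d = j, d
        nn.append(best_j)
    return nn


def case_case_count(nn, labels):
    """Number of cases (label 1) whose nearest neighbour is also a case."""
    return sum(1 for i, j in enumerate(nn) if labels[i] == 1 and labels[j] == 1)


def random_labeling_test(points, labels, n_perm=999, seed=0):
    """One-sided Monte Carlo test of clustering under random labeling."""
    rng = random.Random(seed)
    nn = nearest_neighbours(points)  # locations stay fixed under RL
    observed = case_case_count(nn, labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign labels, keeping the class sizes
        if case_case_count(nn, shuffled) >= observed:
            at_least += 1
    p_value = (at_least + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical toy data: ten tightly grouped cases, ten distant controls.
cases = [(0.1 * k, 0.0) for k in range(10)]
controls = [(10.0 + k, 10.0) for k in range(10)]
observed, p_value = random_labeling_test(cases + controls, [1] * 10 + [0] * 10)
```

Because only the labels are permuted, the nearest-neighbour structure is computed once and reused, which mirrors the idea that RL conditions on the given locations.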
Frequent-pattern based iterative projected clustering
Irrelevant attributes add noise to high dimensional clusters and make traditional clustering techniques inappropriate. Projected clustering algorithms have been proposed to find the clusters in hidden subspaces. We observe an analogy between mining frequent itemsets and discovering the relevant subspace for a given cluster. We propose a methodology for finding projected clusters by mining frequent itemsets and present heuristics that improve its quality. Our techniques are evaluated with synthetic and real data; they are scalable and discover projected clusters accurately. © 2003 IEEE.
Modeling Asymmetric Volatility Clusters Using Copulas and High Frequency Data
Volatility clustering is a well-known stylized feature of financial asset returns. In this paper, we investigate the asymmetric pattern of volatility clustering in both the stock and foreign exchange markets. To this end, we employ copula-based semi-parametric univariate time-series models that accommodate the clusters of both large and small volatilities in the analysis. Using daily realized volatilities of individual company stocks, stock indices and foreign exchange rates constructed from high frequency data, we find that volatility clustering is strongly asymmetric in the sense that clusters of large volatilities tend to be much stronger than those of small volatilities. In addition, the asymmetric pattern of volatility clusters remains visible even when the clusters are allowed to change over time, and the volatility clusters themselves remain persistent even after forty days. Keywords: Volatility clustering, Copulas, Realized volatility, High-frequency data.
Floodplain connectivity, disturbance and change: a palaeoentomological investigation of floodplain ecology from south-west England
1. Floodplain environments are increasingly subject to enhancement and restoration, with the purpose of increasing their biodiversity and returning them to a more 'natural' state. Defining such a state based solely upon neoecological data is problematic and has led several authors to suggest the use of a palaeoecological approach.
2. Fossil Coleopteran assemblages recovered from multiple palaeochannel fills in south-west England were used to investigate past floodplain and channel characteristics during the mid- to late-Holocene. Ordination of coleopteran data was performed using Detrended Correspondence Analysis (DCA) and produced clear and discrete clustering. This clustering pattern is related to the nature of the environment in which assemblages were deposited and hence channel configuration and dynamics.
3. The DCA clustering pattern is strongly related to measures of ecological evenness, and a strong relationship between these indices and the composition of the water beetle assemblage within samples was revealed. Repeating the ordination with presence–absence data results in a similar pattern of clustering, implying that assemblage composition is crucial in determining cluster placement.
4. As assemblage composition is primarily a function of floodplain topography and hence disturbance regime, we attempt to relate these data to the Intermediate Disturbance Hypothesis (IDH). A significant positive correlation was found between ecological diversity (Shannon's H') and Axis 1 of all ordinations in predominantly aquatic assemblages
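For readers unfamiliar with the diversity measure named in this abstract, Shannon's H' for a sample with taxon proportions p_i is H' = -sum(p_i ln p_i). The sketch below computes it from raw counts. The abstract does not say which evenness measure was used, so Pielou's J' = H' / ln(S), with S the number of taxa present, is shown only as one common choice; the function names and the example counts are hypothetical.

```python
import math


def shannon_h(counts):
    """Shannon diversity H' (natural log) from per-taxon abundance counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Pielou's J' = H' / ln(S); assumed here, not stated in the abstract."""
    richness = sum(1 for c in counts if c > 0)
    if richness < 2:
        return 0.0  # evenness is not meaningful for a single taxon
    return shannon_h(counts) / math.log(richness)


# Hypothetical beetle counts for two samples with the same total abundance.
even_sample = [10, 10, 10, 10]
skewed_sample = [37, 1, 1, 1]
h_even, h_skewed = shannon_h(even_sample), shannon_h(skewed_sample)
```

A sample dominated by one taxon yields a lower H' and J' than an equally rich sample with balanced counts, which is the sense in which such indices summarize assemblage composition.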
Clustering based on Random Graph Model embedding Vertex Features
Large datasets with interactions between objects are common to numerous
scientific fields (e.g., social science, the internet, biology). The interactions
naturally define a graph, and a common way to explore or summarize such a dataset
is graph clustering. Most techniques for clustering graph vertices use only the
topology of connections, ignoring information in the vertex features. In this
paper, we provide a clustering algorithm exploiting both types of data, based on
a statistical model with latent structure characterizing each vertex both by a
vector of features and by its connectivity. We perform simulations to
compare our algorithm with existing approaches, and also evaluate our method
with real datasets based on hyper-textual documents. We find that our algorithm
successfully exploits whatever information is found both in the connectivity
pattern and in the features.
Pattern Classification Based On Multi-Hyperellipsoid Clustering.
Traditional model-based pattern classification is based on the assumption that the distribution of the training samples of each pattern class can be formulated by a single statistical function. It is difficult to make an accurate classification by the traditional method when the training samples of different classes do not conform to this assumption. The main contribution of this research is the development of a new clustering technique, called Multi-Hyperellipsoid Clustering, that is able to handle irregular pattern distributions. The new method uses a supervised maximum likelihood estimation to derive a set of distribution functions for the training samples of each class, and then uses an improved Bayesian probability decision model to partition the pattern space. The new classifier achieved a higher rate of correct classification than the traditional method, with respect to some rather complex pattern distributions in a number of test examples.
Clustering Scatter Plots Using Data Depth Measures.
Clustering is rapidly becoming a powerful data mining technique, and has been broadly applied to many domains such as bioinformatics and text mining. However, the existing methods can only deal with a data matrix of scalars. In this paper, we introduce a hierarchical clustering procedure that can handle a data matrix of scatter plots. To more accurately reflect the nature of data, we introduce a dissimilarity statistic based on "data depth" to measure the discrepancy between two bivariate distributions without oversimplifying the nature of the underlying pattern. We then combine hypothesis testing with hierarchical clustering to simultaneously cluster the rows and columns of the data matrix of scatter plots. We also propose novel painting metrics and construct heat maps to allow visualization of the clusters. We demonstrate the utility and power of our new clustering method through simulation studies and application to a microbe-host-interaction study