Noise resistant generalized parametric validity index of clustering for gene expression data
This article has been made available through the Brunel Open Access Publishing Fund. Validity indices have been investigated for decades. However, since there is no study of the noise-resistance performance of these indices in the literature, there is no guideline for determining the best clustering in noisy data sets, especially microarray data sets. In this paper, we propose a generalized parametric validity (GPV) index which employs two tunable parameters α and β to control the proportions of objects considered when calculating the dissimilarities. The greatest advantage of the proposed GPV index is its noise-resistance ability, which results from the flexibility of tuning the parameters. Several rules are set to guide the selection of parameter values. To illustrate the noise-resistance performance of the proposed index, we evaluate the GPV index for assessing five clustering algorithms in two gene expression data simulation models with different noise levels and compare its ability to determine the number of clusters with that of eight existing indices. We also test the GPV index on three groups of real gene expression data sets. The experimental results suggest that the proposed GPV index has superior noise-resistance ability and provides fairly accurate judgements.
Relational visual cluster validity
The assessment of cluster validity plays a very important role in cluster analysis. The most commonly used cluster validity methods are based on statistical hypothesis testing or on finding the best clustering scheme by computing a number of different cluster validity indices. A number of visual methods of cluster validity have been produced to display directly the validity of clusters by mapping data into two- or three-dimensional space. However, these methods may lose too much information to correctly estimate the results of clustering algorithms. Although the visual cluster validity (VCV) method of Hathaway and Bezdek can successfully solve this problem, it can only be applied to object data, i.e. feature measurements. There are very few validity methods that can be used to analyze the validity of data where only a similarity or dissimilarity relation exists – relational data. To tackle this problem, this paper presents a relational visual cluster validity (RVCV) method to assess the validity of clustering relational data. This is done by combining the results of the non-Euclidean relational fuzzy c-means (NERFCM) algorithm with a modification of the VCV method to produce a visual representation of cluster validity. RVCV can cluster complete and incomplete relational data and adds to the visual cluster validity theory. Numeric examples using synthetic and real data are presented.
Multifractal current distribution in random diode networks
Recently it has been shown analytically that electric currents in a random
diode network are distributed in a multifractal manner [O. Stenull and H. K.
Janssen, Europhys. Lett. 55, 691 (2001)]. In the present work we investigate
the multifractal properties of a random diode network at the critical point by
numerical simulations. We analyze the currents running on a directed
percolation cluster and confirm the field-theoretic predictions for the scaling
behavior of moments of the current distribution. It is pointed out that a
random diode network is a particularly good candidate for a possible
experimental realization of directed percolation.
Organizational Pay Mix: The Implications of Various Theoretical Perspectives for the Conceptualization and Measurement of Individual Pay Components
While pay mix is one of the most frequently used variables in recent compensation research, its theoretical relevance and measurement remain underdeveloped. There is little agreement among studies on the definitions of the various forms of pay that go into pay mix. Even studies that examine the same theories tend to overlook the implications of differences in the measures and meanings of pay mix used in other studies. Our study explores the meaning of pay mix using several theories commonly used in recent compensation research (agency, efficiency wage, expectancy, equity, and person-organization fit). Recent studies generally use a single measure of mix (e.g., bonus/base, or stock options/total, or benefits/base). We argue that to fully understand the effects of employee compensation, the multiple forms of compensation must be taken into account. Therefore, we derived pay mix measures from the theories commonly used in compensation research. We classified the pay mix policies of 478 firms using cluster-analytic techniques. We found that the classification of organizations based on their pay mix depends on the measures used. We suggest that using more realistic measures of pay mix leads to a reinterpretation of compensation research and offers directions for theory development.
Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data
Recent years have seen the rise of more sophisticated attacks including
advanced persistent threats (APTs) which pose severe risks to organizations and
governments by targeting confidential proprietary information. Additionally,
new malware strains are appearing at a higher rate than ever before. Since many
of these strains are designed to evade existing security products, traditional
defenses deployed by most enterprises today, e.g., anti-virus, firewalls,
intrusion detection systems, often fail at detecting infections at an early
stage.
We address the problem of detecting early-stage infection in an enterprise
setting by proposing a new framework based on belief propagation inspired by
graph theory. Belief propagation can be used either with "seeds" of compromised
hosts or malicious domains (provided by the enterprise security operation
center -- SOC) or without any seeds. In the latter case we develop a detector
of C&C communication particularly tailored to enterprises which can detect a
stealthy compromise of only a single host communicating with the C&C server.
We demonstrate that our techniques perform well on detecting enterprise
infections. We achieve high accuracy with low false detection and false
negative rates on two months of anonymized DNS logs released by Los Alamos
National Lab (LANL), which include APT infection attacks simulated by LANL
domain experts. We also apply our algorithms to 38TB of real-world web proxy
logs collected at the border of a large enterprise. Through careful manual
investigation in collaboration with the enterprise SOC, we show that our
techniques identified hundreds of malicious domains overlooked by
state-of-the-art security products.