A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, and so purify the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
k-MLE: A fast algorithm for learning statistical mixture models
We describe k-MLE, a fast and efficient local search algorithm for learning
finite statistical mixtures of exponential families such as Gaussian mixture
models. Mixture models are traditionally learned using the
expectation-maximization (EM) soft clustering technique that monotonically
increases the incomplete (expected complete) likelihood. Given prescribed
mixture weights, the hard clustering k-MLE algorithm iteratively assigns data
to the most likely weighted component and updates the component models using
Maximum Likelihood Estimators (MLEs). Using the duality between exponential
families and Bregman divergences, we prove that the local convergence of the
complete likelihood of k-MLE follows directly from the convergence of a dual
additively weighted Bregman hard clustering. The inner loop of k-MLE can be
implemented using any k-means heuristic like the celebrated Lloyd's batched
or Hartigan's greedy swap updates. We then show how to update the mixture
weights by minimizing a cross-entropy criterion, which amounts to setting each
weight to the relative proportion of points in its cluster, and reiterate the mixture
parameter update and mixture weight update processes until convergence. Hard EM
is interpreted as a special case of k-MLE when both the component update and
the weight update are performed successively in the inner loop. To initialize
k-MLE, we propose k-MLE++, a careful initialization of k-MLE guaranteeing
probabilistically a global bound on the best possible complete likelihood.
Comment: 31 pages, Extends preliminary paper presented at IEEE ICASSP 201
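The alternation described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it restricts the components to one-dimensional Gaussians, uses a Lloyd-style batched inner loop, and takes the initial parameters from the caller instead of from k-MLE++. The names `k_mle_1d`, `log_gauss`, and `min_var` are hypothetical.

```python
import math


def log_gauss(x, mu, var):
    """Log-density of a 1D Gaussian N(mu, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)


def k_mle_1d(data, mus, variances, weights,
             outer_iters=50, inner_iters=50, min_var=1e-6):
    """Hard-clustering sketch for a 1D Gaussian mixture (illustrative only).

    Inner loop (weights fixed): assign each point to the component that
    maximizes log(weight) + log-density, then refit each component by MLE.
    Outer step: set each weight to the proportion of points in its cluster.
    min_var is an assumed floor that keeps variances strictly positive.
    """
    mus, variances, weights = list(mus), list(variances), list(weights)
    k, n = len(mus), len(data)
    labels = None

    def score(x, j):
        # An emptied component has weight 0 and can never win a point.
        if weights[j] <= 0.0:
            return -math.inf
        return math.log(weights[j]) + log_gauss(x, mus[j], variances[j])

    for _ in range(outer_iters):
        for _ in range(inner_iters):
            new_labels = [max(range(k), key=lambda j: score(x, j)) for x in data]
            if new_labels == labels:
                break  # assignments are stable for the current weights
            labels = new_labels
            for j in range(k):
                members = [x for x, lab in zip(data, labels) if lab == j]
                if not members:
                    continue  # keep the old parameters of an empty cluster
                mus[j] = sum(members) / len(members)
                var = sum((x - mus[j]) ** 2 for x in members) / len(members)
                variances[j] = max(var, min_var)
        # Weight update: relative proportion of points per cluster.
        new_weights = [labels.count(j) / n for j in range(k)]
        if new_weights == weights:
            break
        weights = new_weights
    return mus, variances, weights, labels
```

Calling `k_mle_1d(data, [0.5, 4.0], [1.0, 1.0], [0.5, 0.5])` on two well-separated groups of points returns the fitted means, variances, weights, and the hard labels. Like any local search, the result depends on the starting parameters.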
Techniques for clustering gene expression data
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.
Adaptive Evolutionary Clustering
In many practical applications of clustering, the objects to be clustered
evolve over time, and a clustering result is desired at each time step. In such
applications, evolutionary clustering typically outperforms traditional static
clustering by producing clustering results that reflect long-term trends while
being robust to short-term variations. Several evolutionary clustering
algorithms have recently been proposed, often by adding a temporal smoothness
penalty to the cost function of a static clustering method. In this paper, we
introduce a different approach to evolutionary clustering by accurately
tracking the time-varying proximities between objects followed by static
clustering. We present an evolutionary clustering framework that adaptively
estimates the optimal smoothing parameter using shrinkage estimation, a
statistical approach that improves a naive estimate using additional
information. The proposed framework can be used to extend a variety of static
clustering algorithms, including hierarchical, k-means, and spectral
clustering, into evolutionary clustering algorithms. Experiments on synthetic
and real data sets indicate that the proposed framework outperforms static
clustering and existing evolutionary clustering algorithms in many scenarios.
Comment: To appear in Data Mining and Knowledge Discovery, MATLAB toolbox
available at http://tbayes.eecs.umich.edu/xukevin/affec
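The idea of tracking proximities over time before applying a static clustering method can be illustrated with a simple temporal smoother. The sketch below assumes the smoothed proximity matrix is a convex combination of the previous smoothed matrix and the current snapshot, with a fixed smoothing parameter `alpha` chosen by the caller. The abstract's adaptive, shrinkage-based estimate of that parameter is not reproduced here, and the function name `smooth_proximities` is hypothetical.

```python
def smooth_proximities(snapshots, alpha):
    """Smooth a sequence of proximity matrices over time (illustrative only).

    snapshots: list of square matrices (lists of lists), one per time step.
    alpha: fixed smoothing parameter in [0, 1); larger values give more
           weight to the past. A fixed alpha is an assumption of this sketch.
    Returns one smoothed matrix per time step; each can then be passed to
    any static clustering method.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    smoothed = []
    prev = None
    for current in snapshots:
        if prev is None:
            # No history at the first time step: use the snapshot as is.
            estimate = [list(row) for row in current]
        else:
            estimate = [
                [alpha * p + (1.0 - alpha) * c for p, c in zip(prev_row, cur_row)]
                for prev_row, cur_row in zip(prev, current)
            ]
        smoothed.append(estimate)
        prev = estimate
    return smoothed
```

With `alpha = 0` this reduces to clustering each snapshot independently. Values closer to 1 favour long-term trends over short-term variation.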
Containing epidemic outbreaks by message-passing techniques
The problem of targeted network immunization can be defined as the one of
finding a subset of nodes in a network to immunize or vaccinate in order to
minimize a tradeoff between the cost of vaccination and the final (stationary)
expected infection under a given epidemic model. Although computing the
expected infection is a hard computational problem, simple and efficient
mean-field approximations have been put forward in the literature in recent
years. The optimization problem can be recast into a constrained one in which
the constraints enforce local mean-field equations describing the average
stationary state of the epidemic process. For a wide class of epidemic models,
including the susceptible-infected-removed and the
susceptible-infected-susceptible models, we define a message-passing approach
to network immunization that allows us to study the statistical properties of
epidemic outbreaks in the presence of immunized nodes as well as to find
(nearly) optimal immunization sets for a given choice of parameters and costs.
The algorithm scales linearly with the size of the graph and it can be made
efficient even on large networks. We compare its performance with topologically
based heuristics, greedy methods, and simulated annealing.
Eigenvector Synchronization, Graph Rigidity and the Molecule Problem
The graph realization problem has received a great deal of attention in
recent years, due to its importance in applications such as wireless sensor
networks and structural biology. In this paper, we extend previous work and
propose the 3D-ASAP algorithm for the graph realization problem in
$\mathbb{R}^3$, given a sparse and noisy set of distance measurements. 3D-ASAP
is a divide and conquer, non-incremental and non-iterative algorithm, which
integrates local distance information into a global structure determination.
Our approach starts with identifying, for every node, a subgraph of its 1-hop
neighborhood graph, which can be accurately embedded in its own coordinate
system. In the noise-free case, the computed coordinates of the sensors in each
patch must agree with their global positioning up to some unknown rigid motion,
that is, up to translation, rotation and possibly reflection. In other words,
to every patch there corresponds an element of the Euclidean group Euc(3) of
rigid transformations in $\mathbb{R}^3$, and the goal is to estimate the group
elements that will properly align all the patches in a globally consistent way.
Furthermore, 3D-ASAP successfully incorporates information specific to the
molecule problem in structural biology, in particular information on known
substructures and their orientation. In addition, we also propose 3D-SP-ASAP, a
faster version of 3D-ASAP, which uses a spectral partitioning algorithm as a
preprocessing step for dividing the initial graph into smaller subgraphs. Our
extensive numerical simulations show that 3D-ASAP and 3D-SP-ASAP are very
robust to high levels of noise in the measured distances and to sparse
connectivity in the measurement graph, and compare favorably to similar
state-of-the-art localization algorithms.
Comment: 49 pages, 8 figure