Data clustering using a model granular magnet
We present a new approach to clustering, based on the physical properties of
an inhomogeneous ferromagnet. No assumption is made regarding the underlying
distribution of the data. We assign a Potts spin to each data point and
introduce an interaction between neighboring points, whose strength is a
decreasing function of the distance between the neighbors. This magnetic system
exhibits three phases. At very low temperatures it is completely ordered; all
spins are aligned. At very high temperatures the system does not exhibit any
ordering and in an intermediate regime clusters of relatively strongly coupled
spins become ordered, whereas different clusters remain uncorrelated. This
intermediate phase is identified by a jump in the order parameters. The
spin-spin correlation function is used to partition the spins and the
corresponding data points into clusters. We demonstrate on three synthetic and
three real data sets how the method works. Detailed comparison to the
performance of other techniques clearly indicates the relative success of our
method. Comment: 46 pages, postscript, 15 ps figures included
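As a rough illustration of the interaction described in this abstract, the sketch below assigns a coupling to each pair of neighboring points that decays with their distance. The abstract only states that the strength is a decreasing function of distance; the Gaussian form, the k-nearest-neighbour definition of "neighbors", and the function name `couplings` are assumptions made here for illustration, not details taken from the paper.

```python
import math

def couplings(points, k=2):
    """Return {(i, j): J_ij} for pairs linked by a k-nearest-neighbour relation.

    Assumed form (illustrative only): J_ij = exp(-d_ij^2 / (2 a^2)),
    where a is the mean distance over all neighbour pairs.
    """
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])
    edges = set()
    for i in range(n):
        # k closest other points to point i
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: d(i, j))[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    a = sum(d(i, j) for i, j in edges) / len(edges)  # typical neighbour distance
    return {(i, j): math.exp(-d(i, j) ** 2 / (2 * a ** 2)) for i, j in edges}
```

Under these assumptions, pairs inside a tight group receive couplings close to 1, while links that bridge distant groups receive much weaker couplings, which is what lets the intermediate phase order clusters separately.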
Discovery of new stellar groups in the Orion complex
We test the ability of two unsupervised machine learning algorithms,
EnLink and Shared Nearest Neighbour (SNN), to identify stellar
groupings in the Orion star-forming complex as an application to the
5-dimensional astrometric data from Gaia DR2. The algorithms represent
two distinct approaches to limiting user bias when selecting parameter values
and evaluating the relative weights among astrometric parameters.
EnLink adopts a locally adaptive distance metric and eliminates the
need for parameter tuning through automation. The original SNN relies only on
human input for parameter tuning, so we modified SNN to run in two stages. We
first ran the original SNN 7,000 times, each with a randomly generated sample
according to the within-source covariance matrices provided in Gaia DR2
and random parameter values within reasonable ranges. During the second stage,
we modified SNN to identify the most frequently recurring stellar groups among the 25,798
obtained in the first stage. We reveal 21 spatially and kinematically coherent
groups in the Orion complex, 12 of which were previously unknown. The groups show a
wide distribution of distances extending as far as about 150 pc in front of the
star-forming Orion molecular clouds, to about 50 pc beyond them where we find,
unexpectedly, several groups. Our results reveal the wealth of
substructure in the OB association, within and beyond the classical Blaauw
Orion OB1 sub-groups. A full characterization of the new groups is
essential, as it offers the potential to unveil how star formation proceeds
globally in large complexes such as Orion. The data and code that generated the
groups in this work, as well as the final table, can be found at
https://github.com/BoquanErwinChen/GaiaDR2_Orion_Dissection. Comment: 9 pages, 4 figures. Accepted by A&A. Comments welcome
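For readers unfamiliar with shared-nearest-neighbour similarity, the sketch below shows one common formulation: two points are as similar as the number of points their k-nearest-neighbour lists have in common. It is a brute-force illustration under that assumption; the function name `snn_similarity` is hypothetical, and the sketch does not reproduce the two-stage procedure or the code published with the paper.

```python
import math

def snn_similarity(points, k=2):
    """Return {(i, j): shared-neighbour count} for all pairs i < j.

    Similarity is the size of the overlap between the two points'
    k-nearest-neighbour sets (a common SNN formulation, assumed here).
    """
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])
    # k-nearest-neighbour set of every point, by brute force
    neigh = [
        set(sorted((j for j in range(n) if j != i), key=lambda j: d(i, j))[:k])
        for i in range(n)
    ]
    return {
        (i, j): len(neigh[i] & neigh[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
```

Because the similarity depends only on neighbour rankings, not on absolute distances, it adapts to regions of different density, which is one reason SNN-style methods are applied to heterogeneous astrometric data.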
A novel double-hybrid learning method for modal frequency-based damage assessment of bridge structures under different environmental variation patterns
Monitoring of modal frequencies under an unsupervised learning framework is a practical strategy for damage assessment of civil structures, especially bridges. However, the key challenge is related to the high sensitivity of modal frequencies to environmental and/or operational changes that may lead to economic and safety losses. The other challenge pertains to different environmental and/or operational variation patterns in modal frequencies due to differences in structural types, materials, and applications, measurement periods in terms of short and/or long monitoring programs, geographical locations of structures, weather conditions, and influences of single or multiple environmental and/or operational factors, which may cause barriers to employing state-of-the-art unsupervised learning approaches. To cope with these issues, this paper proposes a novel double-hybrid learning technique in an unsupervised manner. It contains two stages, data partitioning and anomaly detection, both of which comprise two hybrid algorithms. For the first stage, an improved hybrid clustering method based on a coupling of shared nearest neighbor searching and density peaks clustering is proposed to prepare local information for anomaly detection with the focus on mitigating environmental and/or operational effects. For the second stage, this paper proposes an innovative non-parametric hybrid anomaly detector based on the local outlier factor. In both stages, the number of nearest neighbors is the key hyperparameter that is automatically determined by leveraging a self-adaptive neighbor searching algorithm. Modal frequencies of two full-scale bridges are utilized to validate the proposed technique with several comparisons. Results indicate that this technique is able to successfully eliminate different environmental and/or operational variations and correctly detect damage.
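The second stage described above builds on the local outlier factor (LOF). The sketch below implements the standard, plain LOF score by brute force, only to show what quantity such a detector starts from; it is not the hybrid detector proposed in the paper, the number of neighbours `k` is fixed by hand rather than chosen by a self-adaptive search, and the function name is hypothetical.

```python
import math

def local_outlier_factor(points, k=2):
    """Return the standard LOF score of every point (brute force).

    Scores near 1 indicate a point about as dense as its neighbours;
    scores well above 1 indicate a locally sparse point (an outlier).
    Assumes no duplicate points, so reachability distances are non-zero.
    """
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])
    # k nearest neighbours of each point, closest first
    neigh = [
        sorted((j for j in range(n) if j != i), key=lambda j: d(i, j))[:k]
        for i in range(n)
    ]
    # distance from each point to its k-th nearest neighbour
    k_dist = [d(i, neigh[i][-1]) for i in range(n)]

    def lrd(i):
        # local reachability density: inverse mean reachability distance
        reach = [max(k_dist[o], d(i, o)) for o in neigh[i]]
        return len(reach) / sum(reach)

    dens = [lrd(i) for i in range(n)]
    # LOF: mean ratio of the neighbours' density to the point's own density
    return [
        sum(dens[o] for o in neigh[i]) / (len(neigh[i]) * dens[i])
        for i in range(n)
    ]
```

In a monitoring setting, each point would be a vector of modal frequencies, and a score far above 1 would flag a measurement that is sparse relative to its local neighbourhood.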
PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search
This paper studies density-based clustering of point sets. These methods use
dense regions of points to detect clusters of arbitrary shapes. In particular,
we study variants of density peaks clustering, a popular type of algorithm that
has been shown to work well in practice. Our goal is to cluster large
high-dimensional datasets, which are prevalent in practice. Prior solutions are
either sequential, and cannot scale to large data, or are specialized for
low-dimensional data.
This paper unifies the different variants of density peaks clustering into a
single framework, PECANN, by abstracting out several key steps common to this
class of algorithms. One such key step is to find nearest neighbors that
satisfy a predicate function, and one of the main contributions of this paper
is an efficient way to do this predicate search using graph-based approximate
nearest neighbor search (ANNS). To provide ample parallelism, we propose a
doubling search technique that enables points to find an approximate nearest
neighbor satisfying the predicate in a small number of rounds. Our technique
can be applied to many existing graph-based ANNS algorithms, which can all be
plugged into PECANN.
We implement five clustering algorithms with PECANN and evaluate them on
synthetic and real-world datasets with up to 1.28 million points and up to 1024
dimensions on a 30-core machine with two-way hyper-threading. Compared to the
state-of-the-art FASTDP algorithm for high-dimensional density peaks
clustering, which is sequential, our best algorithm is 45x-734x faster while
achieving competitive ARI scores. Compared to the state-of-the-art parallel
DPC-based algorithm, which is optimized for low dimensions, we show that PECANN
is two orders of magnitude faster. As far as we know, our work is the first to
evaluate DPC variants on large high-dimensional real-world image and text
embedding datasets.
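To make the "nearest neighbor satisfying a predicate" step concrete, the sketch below computes, for each point, its dependent point in a basic density peaks setup: the closest point that has a higher density. This is a brute-force illustration, not PECANN itself. The density estimate (inverse distance to the k-th nearest neighbour), the index-based tie-breaking, and the function name `dependent_points` are assumptions made here; the paper replaces the exhaustive scan with graph-based approximate nearest neighbor search.

```python
import math

def dependent_points(points, k=2):
    """For each point, return the index of its nearest denser point.

    The point with the highest density has no dependent point (None).
    Density is assumed to be 1 / (distance to the k-th nearest neighbour);
    ties are broken in favour of the lower index.
    """
    n = len(points)
    d = lambda i, j: math.dist(points[i], points[j])

    def kth_dist(i):
        return sorted(d(i, j) for j in range(n) if j != i)[k - 1]

    # (density, -index) gives a strict total order even when densities tie
    rank = [(1.0 / kth_dist(i), -i) for i in range(n)]
    dep = []
    for i in range(n):
        # predicate: candidate j must rank strictly denser than i
        denser = [j for j in range(n) if rank[j] > rank[i]]
        dep.append(min(denser, key=lambda j: d(i, j)) if denser else None)
    return dep
```

Points whose dependent point is far away (or missing) are natural cluster-center candidates, and every other point can inherit the label of its dependent point.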