Machine learning for crystal identification and discovery
As computers get faster, researchers -- not hardware or algorithms -- become
the bottleneck in scientific discovery. Computational study of colloidal
self-assembly is one area that is keenly affected: even after computers
generate massive amounts of raw data, performing an exhaustive search to
determine what (if any) ordered structures occur in a large parameter space of
many simulations can be excruciating. We demonstrate how machine learning can
be applied to discover interesting areas of parameter space in colloidal
self-assembly. We create numerical fingerprints -- inspired by bond orientational
order diagrams -- of structures found in self-assembly studies and use these
descriptors to both find interesting regions in a phase diagram and identify
characteristic local environments in simulations in an automated manner for
simple and complex crystal structures. These methods allow analysis to keep
pace with the data-generation ability of modern high-throughput
computing environments.
Comment: Fixed typo, added missing acknowledgment, added supplementary information
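As a rough illustration of what a numerical fingerprint of a local environment can look like, the sketch below computes a simplified two-dimensional bond-orientational order parameter |psi_k| for one particle. This is a stand-in chosen for brevity, not the descriptor used in the paper; the names `bond_order_fingerprint` and `ring` are hypothetical.

```python
import cmath
import math

def bond_order_fingerprint(center, neighbors, orders=(4, 6)):
    """Return [|psi_k| for k in orders] for one 2D particle.

    psi_k is the average of exp(i * k * theta) over the bond angles theta
    from the center to each neighbor. Its magnitude does not change when
    the whole neighborhood is rotated.
    """
    angles = [math.atan2(y - center[1], x - center[0]) for x, y in neighbors]
    fingerprint = []
    for k in orders:
        psi = sum(cmath.exp(1j * k * a) for a in angles) / len(angles)
        fingerprint.append(abs(psi))
    return fingerprint

def ring(n):
    """Hypothetical helper: n neighbors evenly spaced on the unit circle."""
    return [(math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
            for j in range(n)]

# A hexagonal neighborhood has strong 6-fold order and no 4-fold order;
# a square neighborhood shows the opposite pattern.
hex_fp = bond_order_fingerprint((0.0, 0.0), ring(6))
square_fp = bond_order_fingerprint((0.0, 0.0), ring(4))
```

Fingerprints like these could then be fed to a clustering or classification step to group similar local environments, which is the role the abstract assigns to its descriptors.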
A novel double-hybrid learning method for modal frequency-based damage assessment of bridge structures under different environmental variation patterns
Monitoring of modal frequencies under an unsupervised learning framework is a practical strategy for damage assessment of civil structures, especially bridges. However, the key challenge is related to high sensitivity of modal frequencies to environmental and/or operational changes that may lead to economic and safety losses. The other challenge pertains to different environmental and/or operational variation patterns in modal frequencies due to differences in structural types, materials, and applications, measurement periods in terms of short and/or long monitoring programs, geographical locations of structures, weather conditions, and influences of single or multiple environmental and/or operational factors, which may cause barriers to employing state-of-the-art unsupervised learning approaches. To cope with these issues, this paper proposes a novel double-hybrid learning technique in an unsupervised manner. It contains two stages of data partitioning and anomaly detection, both of which comprise two hybrid algorithms. For the first stage, an improved hybrid clustering method based on a coupling of shared nearest neighbor searching and density peaks clustering is proposed to prepare local information for anomaly detection with the focus on mitigating environmental and/or operational effects. For the second stage, this paper proposes an innovative non-parametric hybrid anomaly detector based on local outlier factor. In both stages, the number of nearest neighbors is the key hyperparameter that is automatically determined by leveraging a self-adaptive neighbor searching algorithm. Modal frequencies of two full-scale bridges are utilized to validate the proposed technique with several comparisons. Results indicate that this technique is able to successfully eliminate different environmental and/or operational variations and correctly detect damage.
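The second stage builds on the local outlier factor (LOF). The sketch below is the plain, textbook LOF with a fixed number of neighbors `k`, written in pure Python; it is not the paper's hybrid detector, and it does not include the self-adaptive neighbor search. The frequency values are made up for illustration.

```python
import math

def local_outlier_factor(points, k):
    """Textbook LOF scores for a list of coordinate tuples.

    Scores near 1 mean a point is about as dense as its neighbors;
    scores well above 1 mark points in much sparser regions.
    Assumes no duplicate points and k < len(points).
    """
    n = len(points)
    neighbors, k_distance = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        knn = order[:k]
        neighbors.append(knn)
        # Distance from point i to its k-th nearest neighbor.
        k_distance.append(math.dist(points[i], points[knn[-1]]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], math.dist(points[i], points[j]))
                 for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical modal frequencies in Hz; the last value mimics a damaged state.
frequencies = [(3.90,), (3.92,), (3.94,), (3.96,), (3.98,), (4.00,), (3.40,)]
scores = local_outlier_factor(frequencies, k=3)
suspect = scores.index(max(scores))
```

In this toy data the isolated reading receives by far the largest score. In the paper's setting, the scores would be computed within each data partition from the first stage, so that environmental variation is not mistaken for damage.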
Data clustering using a model granular magnet
We present a new approach to clustering, based on the physical properties of
an inhomogeneous ferromagnet. No assumption is made regarding the underlying
distribution of the data. We assign a Potts spin to each data point and
introduce an interaction between neighboring points, whose strength is a
decreasing function of the distance between the neighbors. This magnetic system
exhibits three phases. At very low temperatures it is completely ordered; all
spins are aligned. At very high temperatures the system does not exhibit any
ordering and in an intermediate regime clusters of relatively strongly coupled
spins become ordered, whereas different clusters remain uncorrelated. This
intermediate phase is identified by a jump in the order parameters. The
spin-spin correlation function is used to partition the spins and the
corresponding data points into clusters. We demonstrate on three synthetic and
three real data sets how the method works. Detailed comparison to the
performance of other techniques clearly indicates the relative success of our
method.
Comment: 46 pages, postscript, 15 ps figures included
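The setup in this abstract can be sketched as follows, under stated assumptions. The abstract only says the interaction is a decreasing function of distance between neighboring points; the Gaussian form, the `length_scale` parameter, and the use of all pairs rather than only neighbors are assumptions made here for brevity. The energy is the standard Potts form, where a pair contributes only when its two spins agree.

```python
import math

def couplings(points, length_scale):
    """Pairwise couplings J_ij = exp(-d_ij^2 / (2 * length_scale^2)).

    The Gaussian form is an assumed example of a decreasing function of
    distance, not necessarily the one used in the paper.
    """
    J = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            J[(i, j)] = math.exp(-d * d / (2 * length_scale ** 2))
    return J

def potts_energy(spins, J):
    """Potts energy H = -sum over pairs of J_ij * [s_i == s_j]."""
    return -sum(w for (i, j), w in J.items() if spins[i] == spins[j])

# Two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0)]
J = couplings(points, length_scale=0.5)

aligned = potts_energy([0, 0, 0, 0], J)      # every spin equal
by_cluster = potts_energy([0, 0, 1, 1], J)   # aligned within each pair only
mixed = potts_energy([0, 1, 0, 1], J)        # close pairs broken
```

Aligning the two distant pairs with each other gains almost no energy, while breaking a close pair costs a lot. That asymmetry is what allows an intermediate temperature regime in which strongly coupled spins order together while separate clusters stay uncorrelated.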
PECANN: Parallel Efficient Clustering with Graph-Based Approximate Nearest Neighbor Search
This paper studies density-based clustering of point sets. These methods use
dense regions of points to detect clusters of arbitrary shapes. In particular,
we study variants of density peaks clustering, a popular type of algorithm that
has been shown to work well in practice. Our goal is to cluster large
high-dimensional datasets, which are prevalent in practice. Prior solutions are
either sequential, and cannot scale to large data, or are specialized for
low-dimensional data.
This paper unifies the different variants of density peaks clustering into a
single framework, PECANN, by abstracting out several key steps common to this
class of algorithms. One such key step is to find nearest neighbors that
satisfy a predicate function, and one of the main contributions of this paper
is an efficient way to do this predicate search using graph-based approximate
nearest neighbor search (ANNS). To provide ample parallelism, we propose a
doubling search technique that enables points to find an approximate nearest
neighbor satisfying the predicate in a small number of rounds. Our technique
can be applied to many existing graph-based ANNS algorithms, which can all be
plugged into PECANN.
We implement five clustering algorithms with PECANN and evaluate them on
synthetic and real-world datasets with up to 1.28 million points and up to 1024
dimensions on a 30-core machine with two-way hyper-threading. Compared to the
state-of-the-art FASTDP algorithm for high-dimensional density peaks
clustering, which is sequential, our best algorithm is 45x-734x faster while
achieving competitive ARI scores. Compared to the state-of-the-art parallel
DPC-based algorithm, which is optimized for low dimensions, we show that PECANN
is two orders of magnitude faster. As far as we know, our work is the first to
evaluate DPC variants on large high-dimensional real-world image and text
embedding datasets.
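The predicate search and doubling idea described in this abstract can be sketched on a small scale. The code below is a sequential, pure-Python illustration and not the PECANN implementation: a brute-force sort stands in for graph-based ANNS, the k-NN density estimate is one assumed choice among the variants, and `center_dist` is a hypothetical threshold for deciding which points are cluster centers.

```python
import math

def k_nearest(points, i, k):
    """Brute-force stand-in for an ANNS query: the k points nearest to i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def density(points, i, k):
    """Assumed k-NN density: inverse distance to the k-th nearest neighbor."""
    kth = k_nearest(points, i, k)[-1]
    return 1.0 / math.dist(points[i], points[kth])

def dependent_point(points, rho, i, start_k=2):
    """Doubling search for the nearest neighbor with strictly higher density.

    Query k neighbors; if none satisfies the predicate, double k and retry.
    Returns None when i is a global density maximum.
    """
    n = len(points)
    k = start_k
    while True:
        for j in k_nearest(points, i, min(k, n - 1)):
            if rho[j] > rho[i]:  # the predicate
                return j
        if k >= n - 1:
            return None
        k *= 2

def density_peaks(points, k=2, center_dist=1.0):
    """Label each point by the index of the density peak it leads to."""
    rho = [density(points, i, k) for i in range(len(points))]
    parent = [dependent_point(points, rho, i) for i in range(len(points))]

    def root(i):
        # Follow dependent points until reaching a center: a point with no
        # denser neighbor, or whose denser neighbor is farther than center_dist.
        while (parent[i] is not None
               and math.dist(points[i], points[parent[i]]) <= center_dist):
            i = parent[i]
        return i

    return [root(i) for i in range(len(points))]

points = [(0.0, 0.0), (0.1, 0.0), (0.25, 0.0), (0.1, 0.12),
          (5.0, 0.0), (5.1, 0.05), (5.3, 0.0), (5.12, -0.1)]
labels = density_peaks(points)
```

Most points find a denser neighbor in the first round with a small `k`; only points near a density peak need the later, larger queries. Because each point's search is independent of the others, the rounds could in principle be run in parallel.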