Cluster-based ensemble means for climate model intercomparison
Clustering – the automated grouping of similar data – can provide powerful and unique insight into large and complex data sets, in a fast and computationally efficient manner. While clustering has been used in a variety of fields (from medical image processing to economics), its application within atmospheric science has been fairly limited to date, and the potential benefits of applying advanced clustering techniques to climate data (both model output and observations) have yet to be fully realised. In this paper, we explore the specific application of clustering to a multi-model climate ensemble. We hypothesise that clustering techniques can provide (a) a flexible, data-driven method of testing model–observation agreement and (b) a mechanism with which to identify model development priorities. We focus our analysis on chemistry–climate model (CCM) output of tropospheric ozone – an important greenhouse gas – from the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). Tropospheric column ozone from the ACCMIP ensemble was clustered using the Data Density based Clustering (DDC) algorithm. We find that a multi-model mean (MMM) calculated using members of the most-populous cluster identified at each location offers a reduction of up to ∼ 20 % in the global absolute mean bias between the MMM and an observed satellite-based tropospheric ozone climatology, with respect to a simple, all-model MMM. On a spatial basis, the bias is reduced at ∼ 62 % of all locations, with the largest bias reductions occurring in the Northern Hemisphere – where ozone concentrations are relatively large. However, the bias is unchanged at 9 % of all locations and increases at 29 %, particularly in the Southern Hemisphere. The latter demonstrates that although cluster-based subsampling acts to remove outlier model data, such data may in fact be closer to observed values in some locations.
We further demonstrate that clustering can provide a viable and useful framework in which to assess and visualise model spread, offering insight into geographical areas of agreement among models and a measure of diversity across an ensemble. Finally, we discuss caveats of the clustering techniques and note that while we have focused on tropospheric ozone, the principles underlying the cluster-based MMMs are applicable to other prognostic variables from climate models.
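The cluster-based MMM at a single grid location can be sketched directly: cluster the ensemble members' values and average only the most populous cluster. The sketch below substitutes a simple 1-D gap-based grouping for the paper's DDC algorithm, and the ozone-like values and `gap` threshold are invented for illustration:

```python
def cluster_mmm(values, gap=2.0):
    """Mean of the most populous cluster of ensemble values at one location.

    Stand-in clustering: sort the 1-D values and cut wherever consecutive
    values differ by more than `gap`; each run between cuts is a cluster.
    The simple all-model MMM would instead average every value, outliers
    included.
    """
    v = sorted(values)
    clusters, current = [], [v[0]]
    for a, b in zip(v, v[1:]):
        if b - a > gap:             # a large gap starts a new cluster
            clusters.append(current)
            current = []
        current.append(b)
    clusters.append(current)
    largest = max(clusters, key=len)
    return sum(largest) / len(largest)

models = [29.5, 30.2, 30.8, 29.9, 31.0, 60.0]   # one outlier member
print(cluster_mmm(models))                       # mean of the tight cluster
```

As the abstract notes, excluding the outlier improves agreement with observations only where the outlier really is spurious; at locations where the "outlier" model is closest to the truth, this subsampling increases the bias.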
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review.
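Both key insights hinge on how gradients are aggregated: an allreduce that elementwise-sums each worker's gradient vector. A minimal pure-Python simulation of the ring-allreduce pattern that Baidu Allreduce and Horovod build on (worker count and gradient values here are made up; real designs pipeline these steps and, as this paper proposes, offload the reductions to CUDA kernels):

```python
def ring_allreduce(grads):
    """Simulate a ring allreduce over N logical workers, each holding a
    gradient vector split into N chunks. Phase 1 (reduce-scatter): in step
    s, worker r sends chunk (r - s) to worker r + 1, which accumulates it;
    after N-1 steps worker r owns the full sum of chunk r + 1. Phase 2
    (allgather): the completed chunks circulate the ring and overwrite
    stale copies. Every worker ends with the elementwise sum.
    """
    n = len(grads)
    chunk = len(grads[0]) // n
    buf = [list(g) for g in grads]

    def sl(c):                      # indices covering chunk c (mod n)
        c %= n
        return range(c * chunk, (c + 1) * chunk)

    for s in range(n - 1):          # reduce-scatter
        for r in range(n):
            dst = (r + 1) % n
            for i in sl(r - s):
                buf[dst][i] += buf[r][i]
    for s in range(n - 1):          # allgather
        for r in range(n):
            dst = (r + 1) % n
            for i in sl(r + 1 - s):
                buf[dst][i] = buf[r][i]
    return buf

# three workers, gradient length 3: every worker ends with [12, 15, 18]
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Each worker sends only 2(N-1)/N times its gradient size in total, which is why this pattern scales well and why the allreduce step dominates No-gRPC performance.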
Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references.
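As a concrete point of reference for the agglomerative algorithms surveyed, a naive single-linkage scheme can be written in a few lines: start with singleton clusters and repeatedly merge the closest pair. This O(n^3) sketch on invented 1-D data is purely illustrative; the efficient implementations discussed (e.g. in R) use far better strategies such as nearest-neighbour chains:

```python
def single_linkage(points, k):
    """Naive agglomerative clustering with single linkage: the distance
    between two clusters is the smallest pairwise distance between their
    members. Merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return [sorted(c) for c in clusters]

print(single_linkage([1.0, 1.2, 1.1, 8.0, 8.3, 20.0], k=3))
```

Recording the distance at each merge, rather than stopping at a fixed k, yields the full dendrogram that hierarchical methods are usually valued for.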
SEED: efficient clustering of next-generation sequences.
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from http://manuals.bioinformatics.ucr.edu/home/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
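The hash-table lookup at the heart of this approach can be illustrated with an ordinary (non-block) spaced seed: positions marked '0' in the seed pattern are ignored when forming the hash key, so reads that differ only at masked positions collide into the same bucket. The pattern and reads below are invented, and SEED's actual block spaced seeds and virtual-center search are more elaborate:

```python
def spaced_key(seq, pattern):
    """Build a spaced-seed key: keep bases where pattern has '1', drop
    bases where it has '0'. A mismatch at a dropped position leaves the
    key unchanged, which is how spaced seeds tolerate mismatches."""
    return "".join(base for base, p in zip(seq, pattern) if p == "1")

def cluster_reads(reads, pattern):
    """Bucket reads by spaced-seed key with a single hash-table pass;
    reads differing only at masked positions land in the same bucket."""
    buckets = {}
    for read in reads:
        buckets.setdefault(spaced_key(read, pattern), []).append(read)
    return buckets

reads = ["ACGTACGT", "ACGAACGT", "TTTTTTTT"]
pattern = "11101111"          # position 3 is a don't-care
print(cluster_reads(reads, pattern))
```

A single pattern only catches mismatches at its masked positions; schemes like SEED probe several patterns (or block variants) so that any placement of up to three mismatches is covered.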
Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications
Wireless sensor networks monitor dynamic environments that change rapidly
over time. This dynamic behavior is either caused by external factors or
initiated by the system designers themselves. To adapt to such conditions,
sensor networks often adopt machine learning techniques to eliminate the need
for unnecessary redesign. Machine learning also inspires many practical
solutions that maximize resource utilization and prolong the lifespan of the
network. In this paper, we present an extensive literature review over the
period 2002-2013 of machine learning methods that were used to address common
issues in wireless sensor networks (WSNs). The advantages and disadvantages of
each proposed algorithm are evaluated against the corresponding problem. We
also provide a comparative guide to aid WSN designers in developing suitable
machine learning solutions for their specific application challenges. Comment: Accepted for publication in IEEE Communications Surveys and Tutorials.
Adaptive Multilevel Cluster Analysis by Self-Organizing Box Maps
Title and table of contents
Introduction
1. Cluster Analysis in High-Dimensional Data
1.1 Modeling
1.2 Problem reduction via representative clustering
1.3 Efficient cluster description
1.4 How many clusters?
2. Decomposition
2.1 General Definition
2.2 Approximate box decomposition
2.3 Decomposition based representative clustering
2.4 Efficient cluster description via approximate box decomposition
3. Adaptive Decomposition by Self-Organized Neural Networks
3.1 Self-Organizing Maps (SOM)
3.2 Self-Organizing Box Maps (SOBM)
3.3 Comparison of SOM and SOBM
3.4 Computational complexity
3.5 Practical extensions
4. Multilevel Representative Clustering
4.1 General approach
4.2 Adaptive decomposition refinement
4.3 Approach based on Perron Cluster analysis
5. Applications
5.1 Conformational Analysis of biomolecules
5.2 Cluster analysis of insurance customers
Conclusion
Appendix
Symbols
Bibliography

The aim of this thesis is a fruitful combination of Perron Cluster analysis
and self-organized neural networks within an adaptive multilevel clustering
approach that allows a fast and robust identification and an efficient
description of clusters in high-dimensional data. In a general variant that
needs a correct number of clusters k as an input, this new approach is
relevant for a great number of cluster problems since it uses a cluster model
that covers geometrically as well as dynamically based clusters. Its essential
part is a method called representative clustering that guarantees the
applicability to large cluster problems: Based on an adaptive decomposition of
the object space via self-organized neural networks, the original problem is
reduced to a smaller cluster problem. The general clustering approach can be
extended by Perron Cluster analysis so that it can be used for large
reversible dynamic cluster problems, even if a correct number of clusters k is
unknown a priori. The basic application of the extended clustering approach is
the conformational analysis of biomolecules, with great impact in the field of
Drug Design. Here, for the first time the analysis of practically relevant and
large molecules like an HIV protease inhibitor becomes possible.

Cluster analysis denotes the process of finding and describing groups (clusters) of objects such that the objects within a cluster are maximally homogeneous with respect to a given measure. The homogeneity of the objects depends, directly or indirectly, on the values they take for a number of fixed attributes. The search for clusters can thus be regarded as an optimisation problem, in which the number of clusters must be known in advance. When the number of objects and attributes is large, one speaks of complex, high-dimensional cluster problems. In this case, direct optimisation is too expensive, and one needs either heuristic optimisation methods or methods for reducing the complexity. In the past, research has almost exclusively developed methods for geometrically based cluster problems, in which the objects can be modelled as points in a metric space spanned by the attributes, and the homogeneity measure is based on the geometric distance between the points assigned to the objects. Such methods are clearly unsuited to identifying so-called metastable clusters, however, since metastable clusters, which are of central importance in the conformational analysis of biomolecules, for example, rest on dynamical rather than geometric similarity. The present work proposes a general cluster model that is suitable for modelling both geometric and dynamic cluster problems. A method for reducing the complexity of cluster problems is presented, based on a previously generated compression of the objects within the data space; it is proved that such a reduction does not destroy the cluster structure, provided the compression is fine enough. Suitable compressions can be computed by means of self-organized neural networks. To achieve a significant reduction in complexity without destroying the cluster structure, these methods are embedded in a multilevel scheme. Since, in addition to identifying the clusters, an efficient description of them is also required, a special kind of compression based on a box discretisation of the data space is introduced, which allows rule-based cluster descriptions to be generated easily. For a special type of homogeneity functions that possess a stochastic property, the multilevel clustering scheme is extended by a Perron Cluster analysis; in contrast to conventional methods, the number of clusters is then no longer needed as an input parameter. With the clustering method developed here, a computer-aided conformational analysis of large, practically relevant biomolecules becomes possible for the first time. This is described in detail for the example of the HIV protease inhibitor VX-478.
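The representative-clustering idea, compressing the data space with a self-organized neural network before clustering, can be illustrated with a toy 1-D self-organizing map. This is a generic SOM sketch, not the thesis's SOBM algorithm; node count, learning rate, and data are invented for illustration:

```python
def train_som(data, n_nodes=4, epochs=30, lr=0.5, radius=1):
    """Toy 1-D self-organizing map. Nodes are prototype values; for each
    data point the closest node (the winner) moves toward it, and nodes
    within `radius` of the winner move half as far. The trained prototypes
    form a small 'representative' problem that can be clustered in place
    of the full data set."""
    lo, hi = min(data), max(data)
    nodes = [lo + (hi - lo) * i / (n_nodes - 1) for i in range(n_nodes)]
    for e in range(epochs):
        a = lr * (1 - e / epochs)       # decaying learning rate
        for x in data:
            w = min(range(n_nodes), key=lambda i: abs(nodes[i] - x))
            for i in range(n_nodes):
                h = 1.0 if i == w else (0.5 if abs(i - w) <= radius else 0.0)
                nodes[i] += a * h * (x - nodes[i])
    return nodes

# two well-separated groups; prototypes settle near both of them
prototypes = train_som([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
```

Clustering the handful of prototypes (e.g. with a Perron Cluster analysis, as in the thesis) is then far cheaper than clustering the original objects, and the reduction preserves the cluster structure when the compression is fine enough.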