
    Cluster-based ensemble means for climate model intercomparison

    Clustering – the automated grouping of similar data – can provide powerful and unique insight into large and complex data sets in a fast and computationally efficient manner. While clustering has been used in a variety of fields (from medical image processing to economics), its application within atmospheric science has been fairly limited to date, and the potential benefits of applying advanced clustering techniques to climate data (both model output and observations) have yet to be fully realised. In this paper, we explore the specific application of clustering to a multi-model climate ensemble. We hypothesise that clustering techniques can provide (a) a flexible, data-driven method of testing model–observation agreement and (b) a mechanism with which to identify model development priorities. We focus our analysis on chemistry–climate model (CCM) output of tropospheric ozone – an important greenhouse gas – from the recent Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP). Tropospheric column ozone from the ACCMIP ensemble was clustered using the Data Density based Clustering (DDC) algorithm. We find that a multi-model mean (MMM) calculated using members of the most-populous cluster identified at each location offers a reduction of up to ∼ 20 % in the global absolute mean bias between the MMM and an observed satellite-based tropospheric ozone climatology, relative to a simple, all-model MMM. On a spatial basis, the bias is reduced at ∼ 62 % of all locations, with the largest bias reductions occurring in the Northern Hemisphere, where ozone concentrations are relatively large. However, the bias is unchanged at 9 % of all locations and increases at 29 %, particularly in the Southern Hemisphere. The latter demonstrates that although cluster-based subsampling acts to remove outlier model data, such data may in fact be closer to observed values in some locations.
We further demonstrate that clustering can provide a viable and useful framework in which to assess and visualise model spread, offering insight into geographical areas of agreement among models and a measure of diversity across an ensemble. Finally, we discuss caveats of the clustering techniques and note that while we have focused on tropospheric ozone, the principles underlying the cluster-based MMMs are applicable to other prognostic variables from climate models.
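The cluster-based subsampling described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's DDC algorithm: a simple one-dimensional gap threshold stands in for DDC, and the ozone values and threshold are invented for the example.

```python
# Minimal sketch of a cluster-based multi-model mean (MMM) at a single
# grid location. A simple 1-D gap-based grouping stands in for the DDC
# algorithm used in the paper; the model values and the gap threshold
# below are illustrative, not taken from ACCMIP.

def gap_clusters(values, gap=2.0):
    """Group sorted values into clusters, splitting wherever two
    consecutive values differ by more than `gap`."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters

def cluster_based_mmm(values, gap=2.0):
    """Mean of the most-populous cluster (ties: first encountered)."""
    clusters = gap_clusters(values, gap)
    biggest = max(clusters, key=len)
    return sum(biggest) / len(biggest)

ozone = [28.1, 28.9, 29.4, 30.2, 41.7]    # column ozone per model, DU (made up)
all_model_mmm = sum(ozone) / len(ozone)   # simple all-model mean
subsampled_mmm = cluster_based_mmm(ozone) # mean of the most-populous cluster
```

At this hypothetical location the outlying model (41.7 DU) is excluded from the cluster-based MMM, pulling the mean toward the majority of models; the abstract's Southern Hemisphere caveat is that such an excluded outlier can sometimes be the model closest to observations.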

    Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

    TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. The most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X, where X = InfiniBand Verbs, Message Passing Interface, or GPUDirect RDMA, and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters, including the Piz Daint system (ranked 6th on the Top500 list). We perform experiments to gain novel insights along the following vectors: 1) application-level scalability of DNN training, 2) effect of batch size on scaling efficiency, 3) impact of the MPI library used for No-gRPC approaches, and 4) type and size of DNN architectures. Based on these experiments, we present two key insights: 1) overall, No-gRPC designs achieve better performance than gRPC-based approaches for most configurations, and 2) the performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
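The gradient aggregation that dominates No-gRPC performance is an Allreduce across all ranks. As a rough illustration only, the sketch below simulates the classic ring Allreduce pattern (the scheme behind Baidu Allreduce and Horovod's backends) in a single process; real implementations (MPI_Allreduce, NCCL) pipeline these exchanges over the network, and this toy version assumes the vector length is divisible by the number of ranks.

```python
# Single-process simulation of a ring Allreduce. Each "rank" holds one
# gradient vector; after a reduce-scatter phase and an allgather phase,
# every rank holds the element-wise sum of all vectors.

def ring_allreduce(grads):
    """grads: one equal-length gradient vector per rank.
    Returns each rank's vector after the allreduce (all identical)."""
    n = len(grads)
    size = len(grads[0]) // n
    chunks = [[g[i * size:(i + 1) * size] for i in range(n)] for g in grads]
    # Reduce-scatter: at step s, rank r sends chunk (r - s) % n to rank r+1,
    # which accumulates it. After n-1 steps, rank r owns the full sum of
    # chunk (r + 1) % n.
    for s in range(n - 1):
        sent = [chunks[r][(r - s) % n][:] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            c = (src - s) % n
            chunks[r][c] = [a + b for a, b in zip(chunks[r][c], sent[src])]
    # Allgather: the completed chunks circulate once more around the ring.
    for s in range(n - 1):
        sent = [chunks[r][(r + 1 - s) % n][:] for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            chunks[r][(src + 1 - s) % n] = sent[src]
    return [sum(chunks[r], []) for r in range(n)]

# Three ranks, each with a local gradient; all ranks end with the sum.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Each rank sends and receives only `2 * (n - 1)` chunks of size `len(g) / n`, which is why the pattern scales well for the large gradient tensors discussed in the paper.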

    Methods of Hierarchical Clustering

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed, very efficient (linear-time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references.
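The agglomerative principle the survey covers can be shown in a few lines. The sketch below is a naive O(n^3) single-linkage merge on 1-D points, purely for illustration; the efficient implementations the survey discusses (e.g. R's hclust, nearest-neighbor-chain algorithms) avoid this brute-force search.

```python
# Naive agglomerative (single-linkage) hierarchical clustering on 1-D
# points: start with singleton clusters and repeatedly merge the two
# closest clusters until k remain. Data values are illustrative.

def single_linkage(points, k):
    """Merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

clusters = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
```

Recording the merge distance at every step, rather than stopping at k, yields the full dendrogram that hierarchical methods are usually valued for.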

    SEED: efficient clustering of next-generation sequences.

    Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced-seed method, called block spaced seeds. Its clustering component operates on hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from http://manuals.bioinformatics.ucr.edu/home/ Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
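The core idea behind spaced-seed hashing can be illustrated briefly. The sketch below is not SEED's block spaced seed design or its center-finding logic; it only shows how a mask of care/don't-care positions lets reads with mismatches at don't-care positions hash to the same bucket. The mask and reads are invented for the example.

```python
# Illustrative spaced-seed hashing: positions marked '1' in the mask
# must match for two reads to share a bucket, while '0' (don't-care)
# positions absorb mismatches. Mask and reads are made up.

SEED_MASK = "110110110110"  # '1' = care position, '0' = don't care

def spaced_key(read, mask=SEED_MASK):
    """Project a read onto the care positions of the spaced seed."""
    return "".join(base for base, m in zip(read, mask) if m == "1")

def bucket_reads(reads):
    """Group reads whose spaced-seed keys are identical."""
    buckets = {}
    for r in reads:
        buckets.setdefault(spaced_key(r), []).append(r)
    return buckets

# Reads 1 and 2 differ only at a don't-care position (index 2), so they
# share a bucket; read 3 differs at a care position and does not.
buckets = bucket_reads(["ACGTACGTACGT", "ACTTACGTACGT", "TCGTACGTACGT"])
```

A real implementation would apply several such seeds and then verify candidate pairs against the full mismatch/overhang thresholds, since a single seed alone neither guarantees nor bounds the three-mismatch criterion.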

    Machine Learning in Wireless Sensor Networks: Algorithms, Strategies, and Applications

    Wireless sensor networks monitor dynamic environments that change rapidly over time. This dynamic behavior is either caused by external factors or initiated by the system designers themselves. To adapt to such conditions, sensor networks often adopt machine learning techniques to eliminate the need for unnecessary redesign. Machine learning also inspires many practical solutions that maximize resource utilization and prolong the lifespan of the network. In this paper, we present an extensive literature review over the period 2002-2013 of machine learning methods that were used to address common issues in wireless sensor networks (WSNs). The advantages and disadvantages of each proposed algorithm are evaluated against the corresponding problem. We also provide a comparative guide to aid WSN designers in developing suitable machine learning solutions for their specific application challenges. Comment: Accepted for publication in IEEE Communications Surveys and Tutorials.

    Adaptive Multilevel Cluster Analysis by Self-Organizing Box Maps

    Table of contents:
    Introduction
    1. Cluster Analysis in High-Dimensional Data (1.1 Modeling; 1.2 Problem reduction via representative clustering; 1.3 Efficient cluster description; 1.4 How many clusters?)
    2. Decomposition (2.1 General definition; 2.2 Approximate box decomposition; 2.3 Decomposition-based representative clustering; 2.4 Efficient cluster description via approximate box decomposition)
    3. Adaptive Decomposition by Self-Organized Neural Networks (3.1 Self-Organizing Maps (SOM); 3.2 Self-Organizing Box Maps (SOBM); 3.3 Comparison SOM-SOBM; 3.4 Computational complexity; 3.5 Practical extensions)
    4. Multilevel Representative Clustering (4.1 General approach; 4.2 Adaptive decomposition refinement; 4.3 Approach based on Perron Cluster analysis)
    5. Applications (5.1 Conformational analysis of biomolecules; 5.2 Cluster analysis of insurance customers)
    Conclusion; Appendix; Symbols; Bibliography

The aim of this thesis is a fruitful combination of Perron Cluster analysis and self-organized neural networks within an adaptive multilevel clustering approach that allows fast and robust identification and efficient description of clusters in high-dimensional data. In a general variant that needs the correct number of clusters k as an input, this new approach is relevant for a great number of cluster problems, since it uses a cluster model that covers geometrically based but also dynamically based clusters. Its essential part is a method called representative clustering that guarantees applicability to large cluster problems: based on an adaptive decomposition of the object space via self-organized neural networks, the original problem is reduced to a smaller cluster problem. The general clustering approach can be extended by Perron Cluster analysis so that it can be used for large reversible dynamic cluster problems, even if the correct number of clusters k is unknown a priori.
The basic application of the extended clustering approach is the conformational analysis of biomolecules, with great impact in the field of drug design. Here, for the first time, the analysis of practically relevant and large molecules such as an HIV protease inhibitor becomes possible.
Cluster analysis denotes the process of finding and describing groups (clusters) of objects such that the objects within a cluster are maximally homogeneous with respect to a given measure. The homogeneity of the objects depends, directly or indirectly, on the values they take for a number of fixed attributes. The search for clusters can thus be viewed as an optimization problem, in which the number of clusters must be known in advance. When the number of objects and attributes is large, one speaks of complex, high-dimensional cluster problems. In this case, direct optimization is too expensive, and either heuristic optimization methods or methods for reducing the complexity are needed. In the past, research has developed almost exclusively methods for geometrically based cluster problems, in which the objects can be modeled as points in a metric space spanned by the attributes, and the homogeneity measure is based on the geometric distance between the points assigned to the objects. Such methods are, however, clearly unsuited to identifying so-called metastable clusters, since metastable clusters, which are of central importance e.g. in the conformational analysis of biomolecules, rest not on geometric but on dynamic similarity. This thesis proposes a general cluster model suitable for modeling both geometric and dynamic cluster problems.
A method for reducing the complexity of cluster problems is presented, based on a previously generated compression of the objects within the data space. It is proven that such a reduction does not destroy the cluster structure if the compression is fine enough. Suitable compressions can be computed by means of self-organized neural networks. To achieve a significant reduction of complexity without destroying the cluster structure, these methods are embedded in a multilevel scheme. Since the clusters must not only be identified but also described efficiently, a special kind of compression based on a box discretization of the data space is presented, which enables the simple generation of rule-based cluster descriptions. For a special type of homogeneity functions possessing a stochastic property, the multilevel clustering method is extended by a Perron Cluster analysis, so that, in contrast to conventional methods, the number of clusters is no longer required as an input parameter. The clustering method developed here makes a computer-aided conformational analysis of large, practically relevant biomolecules possible for the first time. This is described in detail using the example of the HIV protease inhibitor VX-478.
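As a rough illustration of the self-organizing component underlying the thesis, the sketch below trains a minimal 1-D SOM with point prototypes. The SOBM variant described above adapts boxes rather than point prototypes, and all data and hyperparameters here are illustrative, not drawn from the thesis.

```python
# Minimal 1-D Self-Organizing Map (SOM): each training sample pulls its
# best-matching unit (and that unit's chain neighbors, more weakly)
# toward itself, so the units spread out to compress the data space.
import random

def train_som(data, n_units=4, epochs=50, lr=0.3, radius=1, seed=0):
    rng = random.Random(seed)
    units = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            # Best-matching unit for this sample.
            w = min(range(n_units), key=lambda i: abs(units[i] - x))
            for i in range(n_units):
                if abs(i - w) <= radius:       # neighborhood on the unit chain
                    h = 1.0 if i == w else 0.5 # weaker pull on neighbors
                    units[i] += lr * h * (x - units[i])
    return sorted(units)

# Two well-separated groups of 1-D samples; the trained units act as the
# compressed representatives on which representative clustering operates.
units = train_som([0.0, 0.1, -0.1, 10.0, 10.1, 9.9])
```

The trained units (or, in the SOBM case, boxes) then define the reduced cluster problem, which is what makes the multilevel approach tractable for large data sets.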