Parallel D2-Clustering: Large-Scale Clustering of Discrete Distributions
The discrete distribution clustering algorithm, namely D2-clustering, has
demonstrated its usefulness in image classification and annotation where each
object is represented by a bag of weighted vectors. The high computational
complexity of the algorithm, however, limits its applications to large-scale
problems. We present a parallel D2-clustering algorithm with substantially
improved scalability. A hierarchical structure for parallel computing is
devised to achieve a balance between the individual-node computation and the
integration process of the algorithm. Additionally, it is shown that even with
a single CPU, the hierarchical structure results in significant speed-up.
Experiments on real-world large-scale image data, Youtube video data, and
protein sequence data demonstrate the efficiency and wide applicability of the
parallel D2-clustering algorithm. The loss in clustering accuracy is minor in
comparison with the original sequential algorithm.
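The hierarchical scheme can be illustrated with a minimal sketch (not the authors' implementation): each worker clusters its local chunk of distributions, and only the local representatives are clustered again in an integration step. For simplicity the sketch uses 1-D empirical distributions, the exact 1-D Wasserstein distance between equal-size samples, and k-medoids in place of the true D2 centroid computation; all function names are illustrative.

```python
import numpy as np

def w1(a, b):
    # Exact 1-D Wasserstein distance between equal-size empirical samples
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def init_medoids(items, k):
    # Farthest-point seeding for stable medoid initialization
    medoids = [0]
    for _ in range(k - 1):
        d = [min(w1(x, items[m]) for m in medoids) for x in items]
        medoids.append(int(np.argmax(d)))
    return medoids

def kmedoids(items, k, iters=10):
    medoids = init_medoids(items, k)
    labels = [0] * len(items)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: w1(x, items[medoids[j]]))
                  for x in items]
        for j in range(k):
            members = [i for i, l in enumerate(labels) if l == j]
            if members:
                medoids[j] = min(members, key=lambda i: sum(
                    w1(items[i], items[m]) for m in members))
    return medoids, labels

def parallel_d2(items, k, n_chunks=4):
    # Level 1: cluster each chunk independently (parallelizable across nodes)
    local = []
    for idx in np.array_split(np.arange(len(items)), n_chunks):
        med, _ = kmedoids([items[i] for i in idx], k)
        local.extend(items[idx[m]] for m in med)
    # Level 2: integrate by clustering only the local representatives
    top, _ = kmedoids(local, k)
    centers = [local[m] for m in top]
    labels = [min(range(k), key=lambda j: w1(x, centers[j])) for x in items]
    return centers, labels
```

The integration step sees only `n_chunks * k` representatives instead of all objects, which is the source of the speed-up even on a single CPU.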
A Distributed Approach towards Discriminative Distance Metric Learning
Distance metric learning is successful in discovering intrinsic relations in
data. However, most algorithms are computationally demanding when the problem
size becomes large. In this paper, we propose a discriminative metric learning
algorithm, and develop a distributed scheme that learns metrics on
moderate-sized subsets of data and aggregates the results into a global
solution. The technique leverages the power of parallel computation. The
aggregated distance metric learning (ADML) algorithm scales well with the data size and
can be controlled by the partition. We theoretically analyse and provide bounds
for the error induced by the distributed treatment. We have conducted
experimental evaluation of ADML, both on specially designed tests and on
practical image annotation tasks. Those tests have shown that ADML achieves the
state-of-the-art performance at only a fraction of the cost incurred by most
existing methods.
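The partition-then-aggregate idea can be sketched as follows. This is not the ADML algorithm itself: as a hedged stand-in, each subset learns a simple Fisher-style Mahalanobis metric (the inverse within-class scatter), and the aggregation is a plain average of the local metric matrices; `local_metric` and `aggregated_metric` are illustrative names.

```python
import numpy as np

def local_metric(X, y, reg=1e-3):
    # Discriminative metric on one subset: inverse of the regularized
    # within-class scatter (a Fisher-style Mahalanobis metric).
    d = X.shape[1]
    S = reg * np.eye(d)
    for c in np.unique(y):
        Xc = X[y == c]
        S += (Xc - Xc.mean(0)).T @ (Xc - Xc.mean(0))
    return np.linalg.inv(S / len(X))

def aggregated_metric(X, y, n_parts=4):
    parts = np.array_split(np.arange(len(X)), n_parts)
    Ms = [local_metric(X[p], y[p]) for p in parts]  # each parallelizable
    return np.mean(Ms, axis=0)                      # aggregate into one metric

def mahalanobis(M, a, b):
    d = a - b
    return float(d @ M @ d)
```

The number of partitions controls the trade-off between per-node cost and how much each local solution sees, mirroring the role of the partition in ADML.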
Convolutional Neural Networks for Skull-stripping in Brain MR Imaging using Consensus-based Silver standard Masks
Convolutional neural networks (CNN) for medical imaging are constrained by
the number of annotated data required in the training stage. Usually, manual
annotation is considered to be the "gold standard". However, medical imaging
datasets that include expert manual segmentation are scarce as this step is
time-consuming, and therefore expensive. Moreover, single-rater manual
annotation is most often used in data-driven approaches, making the network
optimal with respect to only that single expert. In this work, we propose a CNN
for brain extraction in magnetic resonance (MR) imaging, that is fully trained
with what we refer to as silver standard masks. Our method consists of 1)
developing a dataset with "silver standard" masks as input, and implementing
both 2) a tri-planar method using parallel 2D U-Net-based CNNs (referred to as
CONSNet) and 3) an auto-context implementation of CONSNet. The term CONSNet
refers to our integrated approach, i.e., training with silver standard masks
and using a 2D U-Net-based architecture. Our results showed that we
outperformed (i.e., achieved larger Dice coefficients than) the current
state-of-the-art skull-stripping (SS) methods. Our use of silver standard
masks reduced the cost of manual annotation, decreased inter- and intra-rater
variability, and avoided CNN
segmentation super-specialization towards one specific manual annotation
guideline that can occur when gold standard masks are used. Moreover, the usage
of silver standard masks greatly enlarges the volume of input annotated data
because we can relatively easily generate labels for unlabeled data. In
addition, our method has the advantage that, once trained, it takes only a few
seconds to process a typical brain image volume using modern hardware, such as
a high-end graphics processing unit. In contrast, many of the other competitive
methods have processing times on the order of minutes.
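A consensus "silver standard" mask can be approximated very simply by voxel-wise majority voting over the outputs of several automated methods; the paper's actual consensus procedure may differ (STAPLE is a common alternative), so the sketch below is only illustrative.

```python
import numpy as np

def consensus_mask(masks, threshold=0.5):
    """Fuse binary masks from several automated methods into a single
    'silver standard' by voxel-wise majority voting. `masks` is a list
    of equally shaped binary arrays; a voxel is kept when at least
    `threshold` of the methods agree it is brain."""
    stack = np.stack([np.asarray(m, dtype=float) for m in masks])
    return (stack.mean(axis=0) >= threshold).astype(np.uint8)
```

Because the voting inputs come from automated methods, labels can be generated for unlabeled volumes at essentially no annotation cost, which is the point made in the abstract.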
HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems
With tremendous growing interest in Big Data systems, analyzing and
facilitating their performance improvement become increasingly important.
Although there have been many research efforts toward improving Big Data
systems' performance, efficiently analyzing and diagnosing performance
bottlenecks across these massively distributed systems remains a major
challenge. In this paper, we propose a spatio-temporal correlation analysis
approach based on the stage and distribution characteristics of Big Data
applications, which can associate multi-level performance data at a fine
granularity. On the basis of the correlated data, we define a priori rules,
select features, and vectorize the corresponding datasets for different
performance bottlenecks, such as workload imbalance, data skew, abnormal
nodes, and outlier metrics. We then utilize data- and model-driven algorithms
for bottleneck detection and diagnosis. In addition, we design and develop a
lightweight, extensible tool, HybridTune, and validate its diagnosis
effectiveness with BigDataBench on several benchmark experiments, in which it
outperforms state-of-the-art methods. Our experiments show that the accuracy
of the abnormal/outlier detection we obtain reaches about 80%. Finally, we
report several Spark and Hadoop use cases that demonstrate how HybridTune
supports users in carrying out performance analysis and diagnosis efficiently
on Spark and Hadoop applications; our experience shows that HybridTune can
help users find performance bottlenecks and provides optimization
recommendations.
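One of the listed bottleneck classes, abnormal nodes and outlier metrics, can be detected with a simple robust statistic once per-node metrics have been correlated by stage. The sketch below is not HybridTune's actual algorithm; it flags nodes whose value deviates from the cluster median by a large robust z-score, with illustrative names throughout.

```python
import numpy as np

def abnormal_nodes(metrics, z_thresh=3.0):
    """Flag nodes whose stage-level metric deviates strongly from the
    cluster median, using a robust z-score (median / MAD). `metrics`
    maps node name -> value for one metric in one stage."""
    nodes = list(metrics)
    v = np.array([metrics[n] for n in nodes], dtype=float)
    med = np.median(v)
    mad = np.median(np.abs(v - med)) or 1e-9   # avoid division by zero
    z = 0.6745 * (v - med) / mad               # 0.6745 rescales MAD to sigma
    return [n for n, s in zip(nodes, z) if abs(s) > z_thresh]
```

Using the median and MAD instead of mean and standard deviation keeps a single straggler from masking itself by inflating the spread estimate.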
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We first outline the motivation for this release and our plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
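The claim that index structures such as the R*-tree yield major performance gains can be illustrated with a much simpler spatial index: a uniform grid that answers radius queries by inspecting only nearby cells instead of scanning all points. This is a hedged stand-in for the idea, not ELKI's R*-tree implementation, and all names are illustrative.

```python
import numpy as np

def build_grid(points, cell):
    # Bucket 2-D points into a uniform grid keyed by cell coordinates
    grid = {}
    for i, p in enumerate(points):
        grid.setdefault((int(p[0] // cell), int(p[1] // cell)), []).append(i)
    return grid

def range_query(points, grid, cell, q, radius):
    # Only cells within `radius` of the query cell can contain matches
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    r = int(np.ceil(radius / cell))
    out = []
    for gx in range(cx - r, cx + r + 1):
        for gy in range(cy - r, cy + r + 1):
            for i in grid.get((gx, gy), []):
                if np.linalg.norm(points[i] - q) <= radius:
                    out.append(i)
    return sorted(out)
```

Range queries like this dominate the cost of density-based clustering and outlier detection, which is why an index can speed up those algorithms so dramatically.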
Machine Learning in Network Centrality Measures: Tutorial and Outlook
Complex networks are ubiquitous in several Computer Science domains.
Centrality measures are an important analysis mechanism to uncover vital
elements of complex networks. However, these metrics have high computational
costs and requirements that hinder their applications in large real-world
networks. In this tutorial, we explain how neural network learning algorithms
can enable the application of these metrics to complex networks of arbitrary
size. Moreover, the tutorial describes how to identify the best configuration
for neural network training and learning for such tasks, besides
presenting an easy way to generate and acquire training data. We do so by means
of a general methodology, using complex network models adaptable to any
application. We show that a regression model generated by the neural network
successfully approximates the metric values and is therefore a robust,
effective alternative in real-world applications. The methodology and proposed
machine learning model use only a fraction of time with respect to other
approximation algorithms, which is crucial in complex network applications.
Comment: 7 tables, 9 figures; version accepted at ACM Computing Surveys.
https://doi.org/10.1145/323719
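The core idea, replacing an expensive centrality computation with a cheap learned regressor trained on model-generated graphs, can be sketched as follows. As a simplification, the sketch fits a linear least-squares model (standing in for the neural network) to predict BFS-computed closeness centrality from cheap node features; the graph model, sizes, and feature choices are all illustrative.

```python
import numpy as np

def closeness(adj):
    # BFS-based closeness centrality for a connected undirected graph
    n = len(adj)
    out = np.zeros(n)
    for s in range(n):
        dist = np.full(n, -1)
        dist[s] = 0
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in np.flatnonzero(adj[u]):
                    if dist[v] < 0:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        out[s] = (n - 1) / dist.sum()
    return out

def node_features(adj):
    # Cheap local features: bias, degree, summed neighbor degree
    deg = adj.sum(axis=1)
    return np.column_stack([np.ones(len(adj)), deg, adj @ deg])

def random_connected_graph(n, p, rng):
    adj = np.triu((rng.random((n, n)) < p).astype(int), 1)
    adj += adj.T
    for i in range(1, n):  # chain edges guarantee connectivity
        adj[i - 1, i] = adj[i, i - 1] = 1
    return adj

# Train a regressor on synthetic model graphs, as the tutorial suggests
rng = np.random.default_rng(0)
graphs = [random_connected_graph(30, 0.1, rng) for _ in range(20)]
X = np.vstack([node_features(g) for g in graphs])
y = np.concatenate([closeness(g) for g in graphs])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
```

Once trained, the regressor costs a feature computation and a dot product per node, versus an all-pairs shortest-path computation for the exact metric.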
Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition
Statisticians have made great progress in creating methods that reduce our
reliance on parametric assumptions. However, this explosion in research has
resulted in a breadth of inferential strategies that both create opportunities
for more reliable inference and complicate the choices that an applied
researcher has to make and defend. Relatedly, researchers advocating for new
methods typically compare their method to at best two or three other causal
inference strategies, testing with simulations that may or may not be designed
to equally tease out flaws in all the competing methods. The causal inference data
analysis challenge, "Is Your SATT Where It's At?", launched as part of the 2016
Atlantic Causal Inference Conference, sought to make progress with respect to
both of these issues. The researchers creating the data testing grounds were
distinct from the researchers submitting methods whose efficacy would be
evaluated. Results from 30 competitors across the two versions of the
competition (black box algorithms and do-it-yourself analyses) are presented
along with post-hoc analyses that reveal information about the characteristics
of causal inference strategies and settings that affect performance. The most
consistent conclusion was that methods that flexibly model the response surface
perform better overall than methods that fail to do so. Finally, new methods
are proposed that combine features of several of the top-performing submitted
methods.
Neurally Plausible Model of Robot Reaching Inspired by Infant Motor Babbling
In this paper we present a neurally plausible model of robot reaching
inspired by human infant reaching that is based on embodied artificial
intelligence, which emphasizes the importance of the sensory-motor interaction
of an agent and the world. This model encompasses both learning sensory-motor
correlations through motor babbling and arm motion planning using
spreading activation. This model is organized in three layers of neural maps
with parallel structures representing the same sensory-motor space. The motor
babbling period shapes the structure of the three neural maps as well as the
connections within and between them. We describe an implementation of this
model and an investigation of this implementation using a simple reaching task
on a humanoid robot. The robot successfully learned to plan reaching motions
from a test set with high accuracy and smoothness.
Fast Dynamic Routing Based on Weighted Kernel Density Estimation
Capsules, as well as the dynamic routing between them, are recently proposed
structures for deep neural networks. A capsule groups data into vectors or
matrices as poses, rather than conventional scalars, to represent specific
properties of a target instance. Besides its pose, a capsule should be
attached with a probability (often denoted as activation) for its presence.
Dynamic routing helps capsules achieve more generalization capacity with many
fewer model parameters. However, the bottleneck that prevents widespread
application of capsules is the expense of computation during routing. To
address this problem, we generalize existing routing methods within the
framework of weighted kernel density estimation, and propose two fast routing
methods with different optimization strategies. Our methods improve the time
efficiency of routing by nearly 40% with negligible performance degradation.
By stacking a
hybrid of convolutional layers and capsule layers, we construct a network
architecture to handle inputs at a resolution of pixels. The
proposed models achieve performance on par with other leading methods in
multiple benchmarks.
Comment: 16 pages, 4 figures, submitted to eccv 201
Building pattern recognition applications with the SPARE library
This paper presents the SPARE C++ library, an open source software tool
conceived to build pattern recognition and soft computing systems. The
library follows the requirement of generality: most of the implemented
algorithms
are able to process user-defined input data types transparently, such as
labeled graphs and sequences of objects, as well as standard numeric vectors.
Here we present a high-level picture of the SPARE library's characteristics,
focusing on the specific practical possibility of constructing pattern
recognition systems for different input data types. In particular, as a proof
of concept, we discuss two application instances involving clustering of
real-valued multidimensional sequences and classification of labeled graphs.
Comment: Home page: https://sourceforge.net/p/libspare/home/Spare
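The generality requirement, algorithms that process user-defined data types transparently, amounts to parameterizing algorithms by a dissimilarity rather than by a vector space. A minimal Python analogue of this C++ design (illustrative names, not SPARE's API): a medoid-based clusterer that accepts any dissimilarity callable, here used with edit distance on strings.

```python
import numpy as np

def medoid_cluster(items, dissim, k, iters=5):
    """Tiny generic clusterer: works for any data type given a
    user-supplied dissimilarity, mirroring the generality idea."""
    medoids = [0]
    for _ in range(k - 1):  # farthest-point seeding
        d = [min(dissim(x, items[m]) for m in medoids) for x in items]
        medoids.append(int(np.argmax(d)))
    labels = [0] * len(items)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dissim(x, items[medoids[j]]))
                  for x in items]
        for j in range(k):
            members = [i for i, l in enumerate(labels) if l == j]
            if members:
                medoids[j] = min(members, key=lambda i: sum(
                    dissim(items[i], items[m]) for m in members))
    return labels

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance on sequences
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]
```

The same clusterer runs unchanged on numeric vectors, sequences, or labeled graphs, provided a suitable dissimilarity is supplied, which is the design point the library makes.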