Parallel D2-Clustering: Large-Scale Clustering of Discrete Distributions
The discrete distribution clustering algorithm, namely D2-clustering, has
demonstrated its usefulness in image classification and annotation where each
object is represented by a bag of weighted vectors. The high computational
complexity of the algorithm, however, limits its applications to large-scale
problems. We present a parallel D2-clustering algorithm with substantially
improved scalability. A hierarchical structure for parallel computing is
devised to achieve a balance between the individual-node computation and the
integration process of the algorithm. Additionally, it is shown that even with
a single CPU, the hierarchical structure results in significant speed-up.
Experiments on real-world large-scale image data, Youtube video data, and
protein sequence data demonstrate the efficiency and wide applicability of the
parallel D2-clustering algorithm. The loss in clustering accuracy is minor in
comparison with the original sequential algorithm.
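The hierarchical scheme can be illustrated with a minimal sketch (not the authors' implementation): each worker clusters its local chunk of distributions, and only the local representatives are clustered again in an integration step. For simplicity the sketch uses 1-D empirical distributions, the exact 1-D Wasserstein distance between equal-size samples, and k-medoids in place of the true D2 centroid computation; all function names are illustrative.

```python
import numpy as np

def w1(a, b):
    # Exact 1-D Wasserstein distance between equal-size empirical samples
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def init_medoids(items, k):
    # Farthest-point seeding for stable medoid initialization
    medoids = [0]
    for _ in range(k - 1):
        d = [min(w1(x, items[m]) for m in medoids) for x in items]
        medoids.append(int(np.argmax(d)))
    return medoids

def kmedoids(items, k, iters=10):
    medoids = init_medoids(items, k)
    labels = [0] * len(items)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: w1(x, items[medoids[j]]))
                  for x in items]
        for j in range(k):
            members = [i for i, l in enumerate(labels) if l == j]
            if members:
                medoids[j] = min(members, key=lambda i: sum(
                    w1(items[i], items[m]) for m in members))
    return medoids, labels

def parallel_d2(items, k, n_chunks=4):
    # Level 1: cluster each chunk independently (parallelizable across nodes)
    local = []
    for idx in np.array_split(np.arange(len(items)), n_chunks):
        med, _ = kmedoids([items[i] for i in idx], k)
        local.extend(items[idx[m]] for m in med)
    # Level 2: integrate by clustering only the local representatives
    top, _ = kmedoids(local, k)
    centers = [local[m] for m in top]
    labels = [min(range(k), key=lambda j: w1(x, centers[j])) for x in items]
    return centers, labels
```

The integration step sees only `n_chunks * k` representatives instead of all objects, which is the source of the speed-up even on a single CPU.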
A Distributed Approach towards Discriminative Distance Metric Learning
Distance metric learning is successful in discovering intrinsic relations in
data. However, most algorithms are computationally demanding when the problem
size becomes large. In this paper, we propose a discriminative metric learning
algorithm, and develop a distributed scheme that learns metrics on
moderate-sized subsets of data and aggregates the results into a global
solution. The technique leverages the power of parallel computation. The
aggregated distance metric learning (ADML) algorithm scales well with the data size and
can be controlled by the partition. We theoretically analyse and provide bounds
for the error induced by the distributed treatment. We have conducted
experimental evaluation of ADML, both on specially designed tests and on
practical image annotation tasks. Those tests have shown that ADML achieves the
state-of-the-art performance at only a fraction of the cost incurred by most
existing methods.
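The partition-then-aggregate idea can be sketched as follows. This is not the ADML algorithm itself: as a hedged stand-in, each subset learns a simple Fisher-style Mahalanobis metric (the inverse within-class scatter), and the aggregation is a plain average of the local metric matrices; `local_metric` and `aggregated_metric` are illustrative names.

```python
import numpy as np

def local_metric(X, y, reg=1e-3):
    # Discriminative metric on one subset: inverse of the regularized
    # within-class scatter (a Fisher-style Mahalanobis metric).
    d = X.shape[1]
    S = reg * np.eye(d)
    for c in np.unique(y):
        Xc = X[y == c]
        S += (Xc - Xc.mean(0)).T @ (Xc - Xc.mean(0))
    return np.linalg.inv(S / len(X))

def aggregated_metric(X, y, n_parts=4):
    parts = np.array_split(np.arange(len(X)), n_parts)
    Ms = [local_metric(X[p], y[p]) for p in parts]  # each parallelizable
    return np.mean(Ms, axis=0)                      # aggregate into one metric

def mahalanobis(M, a, b):
    d = a - b
    return float(d @ M @ d)
```

The number of partitions controls the trade-off between per-node cost and how much each local solution sees, mirroring the role of the partition in ADML.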
Convolutional Neural Networks for Skull-stripping in Brain MR Imaging using Consensus-based Silver standard Masks
Convolutional neural networks (CNN) for medical imaging are constrained by
the number of annotated data required in the training stage. Usually, manual
annotation is considered to be the "gold standard". However, medical imaging
datasets that include expert manual segmentation are scarce as this step is
time-consuming, and therefore expensive. Moreover, single-rater manual
annotation is most often used in data-driven approaches, making the network
optimal with respect to only that single expert. In this work, we propose a CNN
for brain extraction in magnetic resonance (MR) imaging, that is fully trained
with what we refer to as silver standard masks. Our method consists of 1)
developing a dataset with "silver standard" masks as input, and implementing
both 2) a tri-planar method using parallel 2D U-Net-based CNNs (referred to as
CONSNet) and 3) an auto-context implementation of CONSNet. The term CONSNet
refers to our integrated approach, i.e., training with silver standard masks
and using a 2D U-Net-based architecture. Our results showed that we
outperformed (i.e., achieved larger Dice coefficients than) the current
state-of-the-art skull-stripping (SS) methods. Our use of silver standard
masks reduced the cost of manual annotation, decreased inter- and intra-rater
variability, and avoided CNN
segmentation super-specialization towards one specific manual annotation
guideline that can occur when gold standard masks are used. Moreover, the usage
of silver standard masks greatly enlarges the volume of input annotated data
because we can relatively easily generate labels for unlabeled data. In
addition, our method has the advantage that, once trained, it takes only a few
seconds to process a typical brain image volume using modern hardware, such as
a high-end graphics processing unit. In contrast, many of the other competitive
methods have processing times on the order of minutes.
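A consensus "silver standard" mask can be approximated very simply by voxel-wise majority voting over the outputs of several automated methods; the paper's actual consensus procedure may differ (STAPLE is a common alternative), so the sketch below is only illustrative.

```python
import numpy as np

def consensus_mask(masks, threshold=0.5):
    """Fuse binary masks from several automated methods into a single
    'silver standard' by voxel-wise majority voting. `masks` is a list
    of equally shaped binary arrays; a voxel is kept when at least
    `threshold` of the methods agree it is brain."""
    stack = np.stack([np.asarray(m, dtype=float) for m in masks])
    return (stack.mean(axis=0) >= threshold).astype(np.uint8)
```

Because the voting inputs come from automated methods, labels can be generated for unlabeled volumes at essentially no annotation cost, which is the point made in the abstract.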
HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems
With tremendous growing interest in Big Data systems, analyzing and
facilitating their performance improvement become increasingly important.
Although there have been many research efforts toward improving Big Data
systems' performance, efficiently analyzing and diagnosing performance
bottlenecks across these massively distributed systems remains a major
challenge. In this paper, we propose a spatio-temporal correlation analysis
approach based on the stage and distribution characteristics of Big Data
applications, which can associate multi-level performance data at a fine
granularity. On the basis of the correlated data, we define a priori rules,
select features, and vectorize the corresponding datasets for different
performance bottlenecks, such as workload imbalance, data skew, abnormal
nodes, and outlier metrics. We then utilize data- and model-driven algorithms
for bottleneck detection and diagnosis. In addition, we design and develop a
lightweight, extensible tool, HybridTune, and validate its diagnosis
effectiveness with BigDataBench on several benchmark experiments, in which it
outperforms state-of-the-art methods. Our experiments show that the accuracy
of the abnormal/outlier detection we obtain reaches about 80%. Finally, we
report several Spark and Hadoop use cases that demonstrate how HybridTune
supports users in carrying out performance analysis and diagnosis efficiently
on Spark and Hadoop applications; our experience shows that HybridTune can
help users find performance bottlenecks and provides optimization
recommendations.
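One of the listed bottleneck classes, abnormal nodes and outlier metrics, can be detected with a simple robust statistic once per-node metrics have been correlated by stage. The sketch below is not HybridTune's actual algorithm; it flags nodes whose value deviates from the cluster median by a large robust z-score, with illustrative names throughout.

```python
import numpy as np

def abnormal_nodes(metrics, z_thresh=3.0):
    """Flag nodes whose stage-level metric deviates strongly from the
    cluster median, using a robust z-score (median / MAD). `metrics`
    maps node name -> value for one metric in one stage."""
    nodes = list(metrics)
    v = np.array([metrics[n] for n in nodes], dtype=float)
    med = np.median(v)
    mad = np.median(np.abs(v - med)) or 1e-9   # avoid division by zero
    z = 0.6745 * (v - med) / mad               # 0.6745 rescales MAD to sigma
    return [n for n, s in zip(nodes, z) if abs(s) > z_thresh]
```

Using the median and MAD instead of mean and standard deviation keeps a single straggler from masking itself by inflating the spread estimate.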
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We first outline the motivation for this release and our plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
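The claim that index structures such as the R*-tree yield major performance gains can be illustrated with a much simpler spatial index: a uniform grid that answers radius queries by inspecting only nearby cells instead of scanning all points. This is a hedged stand-in for the idea, not ELKI's R*-tree implementation, and all names are illustrative.

```python
import numpy as np

def build_grid(points, cell):
    # Bucket 2-D points into a uniform grid keyed by cell coordinates
    grid = {}
    for i, p in enumerate(points):
        grid.setdefault((int(p[0] // cell), int(p[1] // cell)), []).append(i)
    return grid

def range_query(points, grid, cell, q, radius):
    # Only cells within `radius` of the query cell can contain matches
    cx, cy = int(q[0] // cell), int(q[1] // cell)
    r = int(np.ceil(radius / cell))
    out = []
    for gx in range(cx - r, cx + r + 1):
        for gy in range(cy - r, cy + r + 1):
            for i in grid.get((gx, gy), []):
                if np.linalg.norm(points[i] - q) <= radius:
                    out.append(i)
    return sorted(out)
```

Range queries like this dominate the cost of density-based clustering and outlier detection, which is why an index can speed up those algorithms so dramatically.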
Machine Learning in Network Centrality Measures: Tutorial and Outlook
Complex networks are ubiquitous in several Computer Science domains.
Centrality measures are an important analysis mechanism to uncover vital
elements of complex networks. However, these metrics have high computational
costs and requirements that hinder their applications in large real-world
networks. In this tutorial, we explain how neural network learning algorithms
can enable the application of these metrics to complex networks of arbitrary
size. Moreover, the tutorial describes how to identify the best configuration
for neural network training and learning for such tasks, besides
presenting an easy way to generate and acquire training data. We do so by means
of a general methodology, using complex network models adaptable to any
application. We show that a regression model generated by the neural network
successfully approximates the metric values and is therefore a robust,
effective alternative in real-world applications. The methodology and proposed
machine learning model use only a fraction of time with respect to other
approximation algorithms, which is crucial in complex network applications.
Comment: 7 tables, 9 figures; version accepted at ACM Computing Surveys.
https://doi.org/10.1145/323719
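The core idea, replacing an expensive centrality computation with a cheap learned regressor trained on model-generated graphs, can be sketched as follows. As a simplification, the sketch fits a linear least-squares model (standing in for the neural network) to predict BFS-computed closeness centrality from cheap node features; the graph model, sizes, and feature choices are all illustrative.

```python
import numpy as np

def closeness(adj):
    # BFS-based closeness centrality for a connected undirected graph
    n = len(adj)
    out = np.zeros(n)
    for s in range(n):
        dist = np.full(n, -1)
        dist[s] = 0
        frontier = [s]
        while frontier:
            nxt = []
            for u in frontier:
                for v in np.flatnonzero(adj[u]):
                    if dist[v] < 0:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        out[s] = (n - 1) / dist.sum()
    return out

def node_features(adj):
    # Cheap local features: bias, degree, summed neighbor degree
    deg = adj.sum(axis=1)
    return np.column_stack([np.ones(len(adj)), deg, adj @ deg])

def random_connected_graph(n, p, rng):
    adj = np.triu((rng.random((n, n)) < p).astype(int), 1)
    adj += adj.T
    for i in range(1, n):  # chain edges guarantee connectivity
        adj[i - 1, i] = adj[i, i - 1] = 1
    return adj

# Train a regressor on synthetic model graphs, as the tutorial suggests
rng = np.random.default_rng(0)
graphs = [random_connected_graph(30, 0.1, rng) for _ in range(20)]
X = np.vstack([node_features(g) for g in graphs])
y = np.concatenate([closeness(g) for g in graphs])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
```

Once trained, the regressor costs a feature computation and a dot product per node, versus an all-pairs shortest-path computation for the exact metric.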
Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition
Statisticians have made great progress in creating methods that reduce our
reliance on parametric assumptions. However, this explosion in research has
resulted in a breadth of inferential strategies that both create opportunities
for more reliable inference and complicate the choices that an applied
researcher has to make and defend. Relatedly, researchers advocating for new
methods typically compare their method to at best two or three other causal
inference strategies, testing with simulations that may or may not be designed
to equally tease out flaws in all the competing methods. The causal inference data
analysis challenge, "Is Your SATT Where It's At?", launched as part of the 2016
Atlantic Causal Inference Conference, sought to make progress with respect to
both of these issues. The researchers creating the data testing grounds were
distinct from the researchers submitting methods whose efficacy would be
evaluated. Results from 30 competitors across the two versions of the
competition (black box algorithms and do-it-yourself analyses) are presented
along with post-hoc analyses that reveal information about the characteristics
of causal inference strategies and settings that affect performance. The most
consistent conclusion was that methods that flexibly model the response surface
perform better overall than methods that fail to do so. Finally, new methods
are proposed that combine features of several of the top-performing submitted
methods.
Neurally Plausible Model of Robot Reaching Inspired by Infant Motor Babbling
In this paper we present a neurally plausible model of robot reaching
inspired by human infant reaching that is based on embodied artificial
intelligence, which emphasizes the importance of the sensory-motor interaction
of an agent and the world. This model encompasses both learning sensory-motor
correlations through motor babbling and arm motion planning using
spreading activation. This model is organized in three layers of neural maps
with parallel structures representing the same sensory-motor space. The motor
babbling period shapes the structure of the three neural maps as well as the
connections within and between them. We describe an implementation of this
model and an investigation of this implementation using a simple reaching task
on a humanoid robot. The robot successfully learned to plan reaching motions
from a test set with high accuracy and smoothness.
Fast Dynamic Routing Based on Weighted Kernel Density Estimation
Capsules, as well as the dynamic routing between them, are recently proposed
structures for deep neural networks. A capsule groups data into vectors or
matrices as poses, rather than conventional scalars, to represent specific
properties of a target instance. Besides its pose, a capsule should be
attached with a probability (often denoted as activation) for its presence.
Dynamic routing helps capsules achieve more generalization capacity with many
fewer model parameters. However, the bottleneck that prevents widespread
application of capsules is the expense of computation during routing. To
address this problem, we generalize existing routing methods within the
framework of weighted kernel density estimation, and propose two fast routing
methods with different optimization strategies. Our methods improve the time
efficiency of routing by nearly 40% with negligible performance degradation.
By stacking a
hybrid of convolutional layers and capsule layers, we construct a network
architecture to handle inputs at a resolution of pixels. The
proposed models achieve performance on par with other leading methods in
multiple benchmarks.
Comment: 16 pages, 4 figures, submitted to eccv 201
Building pattern recognition applications with the SPARE library
This paper presents the SPARE C++ library, an open source software tool
conceived to build pattern recognition and soft computing systems. The
library follows the requirement of generality: most of the implemented
algorithms
are able to process user-defined input data types transparently, such as
labeled graphs and sequences of objects, as well as standard numeric vectors.
Here we present a high-level picture of the SPARE library's characteristics,
focusing on the specific practical possibility of constructing pattern
recognition systems for different input data types. In particular, as a proof
of concept, we discuss two application instances involving clustering of
real-valued multidimensional sequences and classification of labeled graphs.
Comment: Home page: https://sourceforge.net/p/libspare/home/Spare
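The generality requirement, algorithms that process user-defined data types transparently, amounts to parameterizing algorithms by a dissimilarity rather than by a vector space. A minimal Python analogue of this C++ design (illustrative names, not SPARE's API): a medoid-based clusterer that accepts any dissimilarity callable, here used with edit distance on strings.

```python
import numpy as np

def medoid_cluster(items, dissim, k, iters=5):
    """Tiny generic clusterer: works for any data type given a
    user-supplied dissimilarity, mirroring the generality idea."""
    medoids = [0]
    for _ in range(k - 1):  # farthest-point seeding
        d = [min(dissim(x, items[m]) for m in medoids) for x in items]
        medoids.append(int(np.argmax(d)))
    labels = [0] * len(items)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dissim(x, items[medoids[j]]))
                  for x in items]
        for j in range(k):
            members = [i for i, l in enumerate(labels) if l == j]
            if members:
                medoids[j] = min(members, key=lambda i: sum(
                    dissim(items[i], items[m]) for m in members))
    return labels

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance on sequences
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]
```

The same clusterer runs unchanged on numeric vectors, sequences, or labeled graphs, provided a suitable dissimilarity is supplied, which is the design point the library makes.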