Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost becomes
prohibitively high when the data size and the number of clusters are large. It
is well known that the processing bottleneck of k-means lies in the operation
of seeking the closest centroid in each iteration. In this paper, a novel
solution to the scalability issue of k-means is presented. In the proposed
method, k-means is supported by an approximate k-nearest-neighbor graph. In
each k-means iteration, a data sample is compared only to the clusters in
which its nearest neighbors reside. Since the number of nearest neighbors
considered is much smaller than k, the processing cost of this step becomes
minor and independent of k, and the bottleneck is thereby overcome. Notably,
the k-nearest-neighbor graph is itself constructed by iteratively calling the
fast k-means procedure. Compared with existing fast k-means variants, the
proposed algorithm achieves a speed-up of hundreds to thousands of times while
maintaining high clustering quality. When tested on 10 million 512-dimensional
data points, it takes only 5.2 hours to produce 1 million clusters; clustering
at the same scale would take traditional k-means about 3 years.
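The accelerated assignment step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, an exact brute-force KNN graph stands in for the approximate graph the paper builds with fast k-means itself, and empty clusters are assumed not to occur.

```python
import numpy as np

def brute_force_knn(X, k):
    # exact KNN for this sketch; the paper uses an approximate graph instead
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def assign_via_knn_graph(X, knn, labels, centroids):
    # each sample is compared only to the centroids of the clusters
    # in which its nearest neighbors (and itself) currently reside
    new_labels = labels.copy()
    for i, x in enumerate(X):
        cands = np.unique(np.append(labels[knn[i]], labels[i]))
        dists = np.linalg.norm(centroids[cands] - x, axis=1)
        new_labels[i] = cands[np.argmin(dists)]
    return new_labels

def update_centroids(X, labels, n_clusters):
    # recompute each centroid as the mean of its members
    # (assumes no cluster becomes empty in this small sketch)
    return np.array([X[labels == c].mean(axis=0) for c in range(n_clusters)])

# toy demo: two well-separated pairs of points
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
knn = brute_force_knn(X, k=1)
labels = np.array([0, 1, 1, 1])          # one point deliberately mislabeled
centroids = update_centroids(X, labels, n_clusters=2)
labels = assign_via_knn_graph(X, knn, labels, centroids)  # → [0, 0, 1, 1]
```

Because each sample examines only the handful of clusters its neighbors occupy, the per-sample cost no longer scales with the total cluster number k.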
Efficient Toxicity Prediction via Simple Features Using Shallow Neural Networks and Decision Trees
Toxicity prediction of chemical compounds is a grand challenge. Recent work
has achieved significant gains in accuracy, but at the cost of huge feature
sets, complex black-box techniques such as deep neural networks, and enormous
computational resources. In this paper, we argue strongly for models and
methods that are simple in their machine-learning characteristics, efficient
in computing-resource usage, and yet powerful enough to achieve very high
accuracy. To demonstrate this, we develop a single-task chemical toxicity
prediction framework using only 2D features, which are less compute-intensive.
We use a decision tree to select an optimal subset of features from a
collection of thousands, and we jointly optimize a shallow neural network with
the decision tree, taking both network parameters and input features into
account. Our model needs only about a minute on a single CPU for training,
while existing methods using deep neural networks need about 10 minutes on an
NVIDIA Tesla K40 GPU. Nevertheless, we obtain similar or better performance on
several toxicity benchmark tasks. We also develop a cumulative feature-ranking
method, which enables us to identify features that help chemists perform
prescreening of toxic compounds effectively.
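The two ingredients of the pipeline can be sketched as below. This is a simplified illustration under stated assumptions: it stages the optimization sequentially (tree-based feature selection, then network training) rather than jointly as the paper describes, the function names are invented for the sketch, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def select_features_by_tree(X, y, n_keep):
    # rank features by decision-tree importance and keep the top n_keep
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    order = np.argsort(tree.feature_importances_)[::-1]
    return order[:n_keep]

def train_shallow_net(X, y, feat_idx):
    # one hidden layer only: a shallow network on the selected features
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                        random_state=0)
    net.fit(X[:, feat_idx], y)
    return net
```

On synthetic data where the label depends on a single column, the tree ranks that column first and the shallow network trained on the reduced feature set fits the data, which is the intuition behind pruning thousands of descriptors down to a small informative subset.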