Product Classification in E-Commerce using Distributional Semantics
Product classification is the task of automatically predicting a taxonomy
path for a product in a predefined taxonomy hierarchy given a textual product
description or title. For efficient product classification we require a
suitable representation for a document (the textual description of a product)
feature vector and efficient and fast algorithms for prediction. To address the
above challenges, we propose a new distributional semantics representation for
document vector formation. We also develop a new two-level ensemble approach
utilizing path-wise, node-wise, and depth-wise classifiers (defined with
respect to the taxonomy tree) for error reduction in the final product classification.
Our experiments on data sets from a leading e-commerce platform show the
effectiveness of the distributional representation and the ensemble approach,
achieving better results on various evaluation metrics than earlier
approaches.
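The two-level idea of combining several classifiers' predicted taxonomy paths can be illustrated with a toy vote (a hypothetical combination rule for illustration only; the paper's actual error-reduction scheme is more involved):

```python
from collections import Counter

def ensemble_predict(path_pred, node_pred, depth_pred):
    """Combine the three classifiers' predicted taxonomy paths by
    majority vote; ties fall back to the path-wise prediction
    (a hypothetical tie-break, not the paper's exact rule)."""
    winner, count = Counter([path_pred, node_pred, depth_pred]).most_common(1)[0]
    return winner if count > 1 else path_pred

print(ensemble_predict("Electronics>Phones", "Electronics>Phones",
                       "Electronics>Tablets"))  # -> Electronics>Phones
```

The fallback choice simply privileges the path-wise classifier when the three disagree entirely.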
Wisdom of Crowds cluster ensemble
The Wisdom of Crowds is a phenomenon described in social science that
suggests four criteria applicable to groups of people. It is claimed that, if
these criteria are satisfied, then the aggregate decisions made by a group will
often be better than those of its individual members. Inspired by this concept,
we present a novel feedback framework for the cluster ensemble problem, which
we call Wisdom of Crowds Cluster Ensemble (WOCCE). While many conventional
cluster ensemble methods proposed recently focus on diversity alone, WOCCE
analyzes the conditions necessary for a crowd to exhibit this collective
wisdom. These include decentralization criteria for generating primary results,
independence criteria for the base algorithms, and diversity criteria for the
ensemble members. We suggest appropriate procedures for evaluating these
measures, and propose a new measure to assess the diversity. We evaluate the
performance of WOCCE against some other traditional base algorithms as well as
state-of-the-art ensemble methods. The results demonstrate the efficiency of
WOCCE's aggregate decision-making compared to other algorithms.
Comment: Intelligent Data Analysis (IDA), IOS Press
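A simple way to quantify diversity among an ensemble's base clusterings is one minus the mean pairwise Rand index; the sketch below is a generic proxy of that kind, not WOCCE's proposed measure:

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree
    (same-cluster in both, or split in both)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def ensemble_diversity(partitions):
    """1 minus the mean pairwise Rand index over all base clusterings:
    a generic diversity proxy (not WOCCE's measure)."""
    prs = list(combinations(partitions, 2))
    return 1 - sum(rand_index(a, b) for a, b in prs) / len(prs)

# identical clusterings carry no diversity
print(ensemble_diversity([[0, 0, 1, 1], [0, 0, 1, 1]]))  # -> 0.0
```

Higher values indicate that the base algorithms disagree more, which is the property the ensemble criteria above try to balance against quality.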
Clustering and Learning from Imbalanced Data
A learning classifier must outperform a trivial solution; in the case of
imbalanced data, this condition often does not hold. To overcome this
problem, we propose a novel data-level resampling method - Clustering Based
Oversampling - for improved learning from class-imbalanced datasets. The
essential idea behind the proposed method is to use the distance between a
minority class sample and its respective cluster centroid to infer the number
of new sample points to be generated for that minority class sample. The
proposed algorithm depends only weakly on the technique used for finding
cluster centroids and does not affect majority-class learning in any way.
It also improves learning from imbalanced data by incorporating the
distribution structure of minority class samples in generation of new data
samples. The newly generated minority class data is handled in a way as to
prevent outlier production and overfitting. Implementation analysis on
different datasets using deep neural networks as the learning classifier shows
the effectiveness of this method as compared to other synthetic data resampling
techniques across several evaluation metrics.
Comment: 9 pages, to appear at NIPS 2018 Workshop
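The centroid-distance idea can be sketched in a few lines (hypothetical details: synthetic points are interpolated between a minority sample and its centroid, and the allocation rule is a simple proportional draw; the paper's exact generation scheme may differ):

```python
import numpy as np

def cluster_oversample(X_min, centroids, labels, n_new, seed=0):
    """Allocate n_new synthetic minority samples across the minority
    points in proportion to each point's distance from its cluster
    centroid, then draw each new point on the segment between the
    sample and its centroid (a sketch, not the paper's algorithm).
    Assumes not all minority points coincide with their centroids."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min - centroids[labels], axis=1)
    counts = rng.multinomial(n_new, d / d.sum())  # farther points spawn more
    out = []
    for x, c, k in zip(X_min, centroids[labels], counts):
        for _ in range(k):
            t = rng.uniform()  # interpolate toward the centroid
            out.append(x + t * (c - x))
    return np.array(out)
```

Interpolating toward the centroid keeps synthetic points inside the cluster's span, which is one way to avoid manufacturing outliers.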
Scalable Constrained Clustering: A Generalized Spectral Method
We present a simple spectral approach to the well-studied constrained
clustering problem. It captures constrained clustering as a generalized
eigenvalue problem with graph Laplacians. The algorithm works in nearly-linear
time and provides concrete guarantees for the quality of the clusters, at least
for the case of 2-way partitioning. In practice this translates to a very fast
implementation that consistently outperforms existing spectral approaches both
in speed and quality.
Comment: accepted to appear in AISTATS 2016. arXiv admin note: text overlap
with arXiv:1504.0065
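The generalized-eigenvalue formulation can be illustrated loosely as follows: take L as the Laplacian of the similarity graph and L_K as the Laplacian of a (cannot-link) constraint graph, add a small ridge since Laplacians are singular, and round an eigenvector's signs into a 2-way cut. This is only a sketch of the idea; the paper's operator, rounding, and guarantees are more careful.

```python
import numpy as np

def laplacian(W):
    """Unnormalized graph Laplacian D - W."""
    return np.diag(W.sum(axis=1)) - W

def constrained_2way_cut(W, K, eps=1e-6):
    """Solve the generalized eigenproblem L v = lambda (L_K + eps I) v,
    skip the trivial constant eigenvector, and round the next
    eigenvector's signs into a 2-way partition (a loose sketch)."""
    n = len(W)
    L, LK = laplacian(W), laplacian(K)
    vals, vecs = np.linalg.eig(np.linalg.solve(LK + eps * np.eye(n), L))
    v = vecs[:, np.argsort(vals.real)[1]].real  # second-smallest eigenpair
    return np.where(v >= 0, 1, -1)
```

A small Rayleigh-quotient argument motivates the rounding: the minimizer cuts few similarity edges relative to the constraint edges it separates.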
Machine learning based hyperspectral image analysis: A survey
Hyperspectral sensors enable the study of the chemical properties of scene
materials remotely for the purpose of identification, detection, and chemical
composition analysis of objects in the environment. Hence, hyperspectral images
captured from earth observing satellites and aircraft have been increasingly
important in agriculture, environmental monitoring, urban planning, mining, and
defense. Machine learning algorithms, owing to their outstanding predictive
power, have become a key tool for modern hyperspectral image analysis.
Therefore, a solid understanding of machine learning techniques has become
essential for remote sensing researchers and practitioners. This paper reviews
and compares recent machine learning-based hyperspectral image analysis
methods published in the literature. We organize the methods by the image
analysis task and by the type
of machine learning algorithm, and present a two-way mapping between the image
analysis tasks and the types of machine learning algorithms that can be applied
to them. The paper is comprehensive in coverage of both hyperspectral image
analysis tasks and machine learning algorithms. The image analysis tasks
considered are land cover classification, target detection, unmixing, and
physical parameter estimation. The machine learning algorithms covered are
Gaussian models, linear regression, logistic regression, support vector
machines, Gaussian mixture models, latent linear models, sparse linear models,
ensemble learning, directed graphical models, undirected graphical models,
clustering, Gaussian processes, Dirichlet processes, and deep learning. We
also discuss the open challenges in the field of hyperspectral image analysis
and explore possible future directions.
SCSP: Spectral Clustering Filter Pruning with Soft Self-adaption Manners
Deep Convolutional Neural Networks (CNNs) have achieved significant success in
the computer vision field. However, the high computational cost of deep,
complex models prevents their deployment on edge devices with limited memory
and computational resources. In this paper, we propose a novel filter pruning
method for convolutional neural network compression, namely spectral
clustering filter pruning with soft self-adaption manners (SCSP). We first
apply spectral clustering to the filters layer by layer to explore their
intrinsic connections and retain only the efficient groups. Through
self-adaption manners, the pruning operations can be completed in a few
epochs, letting the network gradually choose meaningful groups. With this
strategy, we not only achieve model compression while maintaining considerable
performance, but also find a novel angle from which to interpret the model
compression process.
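Grouping redundant filters and keeping one representative per group is the core of the pruning step. The sketch below uses greedy cosine-similarity grouping as a simple stand-in for SCSP's per-layer spectral clustering:

```python
import numpy as np

def prune_redundant_filters(W, thresh=0.9):
    """Keep one representative filter per group of near-duplicates.
    W: array of shape (n_filters, channels, kh, kw). Greedy
    cosine-similarity grouping stands in for SCSP's spectral
    clustering; the returned indices play the role of the
    'efficient groups' that survive pruning."""
    F = W.reshape(len(W), -1)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    keep = []
    for i in range(len(F)):
        # keep filter i only if it is not close to an already-kept one
        if all(abs(F[i] @ F[j]) < thresh for j in keep):
            keep.append(i)
    return keep
```

On a toy layer where filter 1 is a scaled copy of filter 0, the function keeps filters 0, 2, and 3 and drops the duplicate.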
Diversity in Machine Learning
Machine learning methods have achieved good performance and been widely
applied in various real-world applications. They can learn a model adaptively
and thus fit the specific requirements of different tasks. Generally, a good
machine learning system consists of plentiful training data, a good model
training process, and accurate inference. Many factors can affect the
performance of the machine learning process, among which diversity is an
important one. Diversity helps each stage contribute to a good overall system:
diversity of the training data ensures that the data provide more
discriminative information for the model; diversity of the learned model
(diversity in the parameters of each model, or diversity among different base
models) makes each parameter/model capture unique or complementary
information; and diversity in inference provides multiple choices, each of
which corresponds to a specific plausible locally optimal result. Even though
diversity plays an important role in the machine learning process, there is no
systematic analysis of diversification in machine learning systems. In this
paper, we systematically summarize methods for data diversification, model
diversification, and inference diversification in the machine learning
process. In addition, we survey typical applications where diversity
technology improves machine learning performance, including remote sensing
imaging tasks, machine translation, camera relocalization, image segmentation,
object detection, topic modeling, and others. Finally, we discuss some
challenges of diversity technology in machine learning and point out
directions for future work.
Comment: Accepted by IEEE Access
Processing techniques development
There are no author-identified significant results in this report.
Structure fusion based on graph convolutional networks for semi-supervised classification
Faced with the diversity and complexity of multi-view data in semi-supervised
classification, most existing graph convolutional networks focus on network
architecture construction or on preserving the salient graph structure, and
ignore the contribution of the complete graph structure to semi-supervised
classification. To mine a more complete distribution structure from multi-view
data while considering both specificity and commonality, we propose structure
fusion based on graph convolutional networks (SF-GCN) for improving the
performance of semi-supervised classification. SF-GCN not only retains the
special characteristics of each view's data via spectral embedding, but also
captures the common style of the multi-view data via a distance metric between
the multi-graph structures. Assuming a linear relationship between the
multi-graph structures, we construct the optimization objective of the
structure fusion model by balancing the specificity loss and the commonality
loss. By solving this objective, we simultaneously obtain the fused spectral
embedding of the multi-view data and the fused structure, which serves as the
adjacency matrix input to graph convolutional networks for semi-supervised
classification. Experiments demonstrate that SF-GCN outperforms the state of
the art on three challenging citation-network datasets: Cora, Citeseer, and
Pubmed.
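The fusion-then-propagate pipeline can be sketched with fixed view weights standing in for SF-GCN's learned balance of specificity and commonality losses (the weighted average and the single propagation step below are illustrative assumptions, not the paper's exact model):

```python
import numpy as np

def fuse_structures(adjs, weights=None):
    """Fuse per-view adjacency matrices into one. A fixed weighted
    average stands in for SF-GCN's optimized structure fusion."""
    w = (np.full(len(adjs), 1.0 / len(adjs)) if weights is None
         else np.asarray(weights, float))
    w = w / w.sum()
    return sum(wi * A for wi, A in zip(w, adjs))

def gcn_propagate(A, X):
    """One GCN propagation step with the renormalization trick:
    D^{-1/2} (A + I) D^{-1/2} X."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return (A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]) @ X
```

Feeding the fused matrix to the propagation step mirrors how the fused structure serves as the adjacency input of the downstream GCN.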