418 research outputs found
Evolutionary Granular Kernel Machines
Kernel machines such as Support Vector Machines (SVMs) have been widely used in various data mining applications with good generalization properties. The performance of SVMs on nonlinear problems depends heavily on the choice of kernel function, and the complexity of SVM training is mainly related to the size of the training dataset. How to design a powerful kernel, how to speed up SVM training, and how to train SVMs with millions of examples are still challenging problems in SVM research. To address these problems, powerful and flexible kernel trees called Evolutionary Granular Kernel Trees (EGKTs) are designed to incorporate prior domain knowledge. A Granular Kernel Tree Structure Evolving System (GKTSES) is developed to evolve the structures of Granular Kernel Trees (GKTs) without prior knowledge. A voting scheme is also proposed to reduce the prediction deviation of GKTSES. To speed up EGKT optimization, a master-slave parallel model is implemented. To help SVMs handle large-scale data mining, a Minimum Enclosing Ball (MEB) based data reduction method is presented, and a new MEB-SVM algorithm is designed. All these kernel methods are designed based on Granular Computing (GrC). In general, Evolutionary Granular Kernel Machines (EGKMs) are investigated to optimize kernels effectively, speed up training greatly, and mine huge amounts of data efficiently.
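The abstract does not spell out the MEB-based reduction, but the standard core-set idea it builds on can be sketched: an approximate minimum enclosing ball (here the Bădoiu–Clarkson iteration, an assumption on my part; the paper's MEB-SVM algorithm may differ) identifies a small set of boundary points that determine the ball, and training can be restricted to such representative points. Function and variable names below are illustrative, not taken from the paper.

```python
import numpy as np

def meb_coreset(X, n_iter=100):
    """Approximate the minimum enclosing ball of X (n x d) with the
    Badoiu-Clarkson iteration: repeatedly pull the current center
    toward the farthest point with a shrinking step size. The set of
    farthest points visited serves as a small core set."""
    c = X[0].astype(float)
    core = {0}
    for i in range(1, n_iter + 1):
        dists = np.linalg.norm(X - c, axis=1)
        far = int(np.argmax(dists))          # farthest point from current center
        core.add(far)
        c = c + (X[far] - c) / (i + 1)       # step size 1/(i+1) guarantees convergence
    radius = np.linalg.norm(X - c, axis=1).max()
    return c, radius, sorted(core)
```

After t iterations the returned radius is within a factor (1 + 1/t) of the optimum, so a few hundred iterations give a tight ball while touching only |core| ≪ n points.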
Grounding semantics in robots for Visual Question Answering
In this thesis I describe an operational implementation of an object detection and description system, incorporate it into an end-to-end Visual Question Answering system, and evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.
Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces
© 2012 IEEE. The unprecedented increase in data volume has become a severe challenge for conventional data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing method for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighbor attribute subsets and calculates the similarity between attribute subsets according to the shared neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that it can better parallelize the SNNQGAR to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
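The abstract's similarity measure is defined from "shared neighbor information", which in the generic shared nearest-neighbor (SNN) formulation means scoring two items by the overlap of their k-nearest-neighbor lists. The SNNQGAR's exact definition over attribute subsets is not given in the abstract; the sketch below shows only the generic SNN similarity, with hypothetical names.

```python
import numpy as np

def knn_lists(D, k):
    """k-nearest-neighbor index sets for each row of a symmetric
    distance matrix D; the point itself is excluded."""
    n = D.shape[0]
    order = np.argsort(D, axis=1)
    return [set(order[i][order[i] != i][:k]) for i in range(n)]

def snn_similarity(nn_a, nn_b):
    """Shared nearest-neighbor similarity: fraction of the k-NN lists
    that the two items have in common (in [0, 1])."""
    k = max(len(nn_a), len(nn_b))
    return len(nn_a & nn_b) / k
```

Two items are SNN-similar when they are surrounded by the same neighbors, which is more robust in high dimensions than raw distance alone.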
Exploring the mobility of mobile phone users
Mobile phone datasets allow for the analysis of human behavior on an
unprecedented scale. The social network, temporal dynamics and mobile behavior
of mobile phone users have often been analyzed independently from each other
using mobile phone datasets. In this article, we explore the connections
between various features of human behavior extracted from a large mobile phone
dataset. Our observations are based on the analysis of communication data of
100000 anonymized and randomly chosen individuals in a dataset of
communications in Portugal. We show that clustering and principal component
analysis allow for a significant dimension reduction with limited loss of
information. The most important features are related to geographical location.
In particular, we observe that most people spend most of their time at only a
few locations. With the help of clustering methods, we then robustly identify
home and office locations and compare the results with official census data.
Finally, we analyze the geographic spread of users' frequent locations and show
that commuting distances can be reasonably well explained by a gravity model.

Comment: 16 pages, 12 figures
Design and analysis of algorithms for similarity search based on intrinsic dimension
One of the most fundamental operations employed in data mining tasks such as classification, cluster analysis, and anomaly detection, is that of similarity search. It has been used in numerous fields of application such as multimedia, information retrieval, recommender systems and pattern recognition. Specifically, a similarity query aims to retrieve from the database the most similar objects to a query object, where the underlying similarity measure is usually expressed as a distance function.
The cost of processing similarity queries has typically been assessed in terms of the representational dimension of the data involved, that is, the number of features used to represent individual data objects. A high representational dimension generally results in a significant increase in the processing cost of similarity queries. This relation is often attributed to an amalgamation of phenomena, collectively referred to as the curse of dimensionality. However, the observed effects of dimensionality in practice may not be as severe as expected. This has led to the development of models quantifying the complexity of data in terms of some measure of the intrinsic dimensionality.
The generalized expansion dimension (GED) is one such model, which estimates the intrinsic dimension in the vicinity of a query point q through the observation of the ranks and distances of pairs of neighbors with respect to q. This dissertation is mainly concerned with the design and analysis of search algorithms based on the GED model. In particular, three variants of the similarity search problem are considered: adaptive similarity search, flexible aggregate similarity search, and subspace similarity search. The good practical performance of the proposed algorithms demonstrates the effectiveness of dimensionality-driven design of search algorithms.
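The GED estimate from a pair of neighbor observations is commonly written as ln(j/i) / ln(d_j/d_i), where i < j are neighbor ranks and d_i < d_j the corresponding distances from q: if the number of points inside a ball grows as its radius to the power D, this ratio recovers D. A minimal sketch of that estimator (names assumed, not taken from the dissertation):

```python
import numpy as np

def ged_estimate(dists, i, j):
    """Generalized expansion dimension from two neighbor observations
    of a query point: 1-based ranks i < j with ascending distances.
    GED = ln(j/i) / ln(d_j / d_i)."""
    d = np.sort(np.asarray(dists, dtype=float))
    return np.log(j / i) / np.log(d[j - 1] / d[i - 1])
```

For points uniformly distributed in a D-dimensional ball around q, the k-th neighbor distance scales as (k/n)^(1/D), so the estimator returns D exactly in the noiseless case.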
Multi-Label Takagi-Sugeno-Kang Fuzzy System
Multi-label classification can effectively identify the relevant labels of an
instance from a given set of labels. However, the modeling of the relationship
between the features and the labels is critical to the classification
performance. To this end, we propose a new multi-label classification method,
called Multi-Label Takagi-Sugeno-Kang Fuzzy System (ML-TSK FS), to improve the
classification performance. The structure of ML-TSK FS is designed using fuzzy
rules to model the relationship between features and labels. The fuzzy system
is trained by integrating fuzzy inference based multi-label correlation
learning with multi-label regression loss. The proposed ML-TSK FS is evaluated
experimentally on 12 benchmark multi-label datasets. The results show that
the performance of ML-TSK FS is competitive with existing methods in terms of
various evaluation metrics, indicating that it is able to model the
feature-label relationship effectively using fuzzy inference rules and enhance
the classification performance.

Comment: This work has been accepted by IEEE Transactions on Fuzzy Systems
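The abstract does not detail the ML-TSK FS architecture, but the classic zero-order TSK inference it builds on can be sketched: Gaussian rule antecedents produce firing strengths, which are normalized and used to blend the rule consequents. Everything below (names, two rules, constant consequents) is an illustrative assumption, not the paper's multi-label design.

```python
import numpy as np

def tsk_predict(x, centers, sigmas, consequents):
    """Zero-order Takagi-Sugeno-Kang inference for a scalar input x:
    Gaussian membership gives each rule a firing strength, and the
    output is the normalized weighted sum of constant consequents."""
    w = np.exp(-(x - centers) ** 2 / (2 * sigmas ** 2))   # rule firing strengths
    return float(np.sum(w * consequents) / np.sum(w))     # normalized blend
```

A first-order system replaces each constant consequent with a linear function of x; a multi-label variant would emit one such blend per label, which is the kind of structure the paper's fuzzy rules model.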