27 research outputs found

    Non-Metric Multi-Dimensional Scaling for Distance-Based Privacy-Preserving Data Mining

    Recent advances in the field of data mining have led to major concerns about privacy. Sharing data with external parties for analysis puts private information at risk. The original data are often perturbed before external release to protect private information. However, data perturbation can decrease the utility of the output. A good perturbation technique requires a balance between privacy and utility. This study proposes a new method for data perturbation in the context of distance-based data mining. We propose the use of non-metric multi-dimensional scaling (MDS) as a suitable technique to perturb data that are intended for distance-based data mining. The basic premise of this approach is to transform the original data into a lower-dimensional space and generate new data that protect private details while maintaining good utility for distance-based data mining analysis. We investigate the extent to which the perturbed data preserve useful statistics for distance-based analysis and provide protection against malicious attacks. We demonstrate that our method provides an adequate alternative to data randomisation approaches and other dimensionality reduction approaches. Testing is conducted on a wide range of benchmark datasets and against some existing perturbation methods. The results confirm that our method has very good overall performance, is competitive with other techniques, and produces clustering and classification results at least as good as, and in some cases better than, the results obtained from the original data.
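    A minimal sketch of the core idea, not the authors' implementation: project the original records into a lower-dimensional space with non-metric MDS and check how well the pairwise distance rankings, which distance-based mining relies on, survive the perturbation. The dataset, dimensionality, and rank-correlation check are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X = load_iris().data                      # stand-in for the original (private) data
D = pairwise_distances(X)                 # original pairwise distances

# Non-metric MDS only tries to preserve the *rank order* of distances,
# so the released coordinates need not resemble the original attributes.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           n_init=4, random_state=0)
X_perturbed = nmds.fit_transform(D)       # data that would be released externally

# Utility check: distance ranks in the perturbed data vs. the original.
D_pert = pairwise_distances(X_perturbed)
iu = np.triu_indices_from(D, k=1)
rho, _ = spearmanr(D[iu], D_pert[iu])
print(f"Spearman rank correlation of pairwise distances: {rho:.3f}")
```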

    A Genetic Bayesian Approach for Texture-Aided Urban Land-Use/Land-Cover Classification

    Urban land-use/land-cover classification is entering a new era with the increased availability of high-resolution satellite imagery and new methods such as texture analysis and artificial intelligence classifiers. Recent research demonstrated exciting improvements from using fractal dimension, lacunarity, and Moran's I in classification, but the integration of these spatial metrics has seldom been investigated. Also, previous research focuses more on developing new classifiers than on improving the robust, simple, and fast maximum likelihood classifier. The goal of this dissertation research is to develop a new approach that utilizes a texture vector (fractal dimension, lacunarity, and Moran's I), combined with a new genetic Bayesian classifier, to improve urban land-use/land-cover classification accuracy. Examples of different land-use/land-covers were demonstrated using post-Katrina IKONOS imagery of New Orleans. Because previous geometric-step and arithmetic-step implementations of the triangular prism algorithm can leave significant numbers of pixels unutilized when measuring local fractal dimension, the divisor-step method was developed and found to yield more accurate estimation. In addition, a new lacunarity estimator based on the triangular prism method and the gliding-box algorithm was developed and found to be better than existing gray-scale estimators for classifying land-use/land-cover from IKONOS imagery. The accuracy of fractal dimension-aided classification was less sensitive to window size than that of lacunarity and Moran's I. In general, the optimal window size for the texture vector-aided approach is 27x27 to 37x37 pixels (i.e., 108x108 to 148x148 meters). As expected, the texture vector-aided approach yielded 2-16% better accuracy than the individual textural index-aided approaches. Compared to the per-pixel maximum likelihood classification, the proposed genetic Bayesian classifier yielded a 12% accuracy improvement by optimizing prior probabilities with the genetic algorithm, whereas the integrated approach with a texture vector and the genetic Bayesian classifier significantly improved classification accuracy by 17-21%. Compared to the neural network classifier and genetic algorithm-support vector machines, the genetic Bayesian classifier was slightly less accurate but more computationally efficient and required less human supervision. This research not only develops a new approach of integrating texture analysis with artificial intelligence for classification, but also reveals a promising avenue of using advanced texture analysis and classification methods to associate socioeconomic status with remote sensing image textures.
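    A minimal sketch of one component of the texture vector described above, under stated assumptions and not the dissertation's code: Moran's I computed for a single local pixel window with rook (4-neighbour) contiguity weights. The window size and random test data are illustrative.

```python
import numpy as np

def morans_i(window: np.ndarray) -> float:
    """Moran's I of a 2-D pixel window using rook (4-neighbour) adjacency weights."""
    z = window - window.mean()
    num = 0.0          # accumulates w_ij * z_i * z_j over neighbouring pixel pairs
    w_sum = 0.0        # accumulates the total weight (one per adjacency)
    rows, cols = window.shape
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += z[r, c] * z[rr, cc]
                    w_sum += 1.0
    n = window.size
    denom = (z ** 2).sum()
    return (n / w_sum) * (num / denom) if denom > 0 else 0.0

# Illustrative 27x27 window, one of the optimal window sizes reported above.
rng = np.random.default_rng(0)
print(morans_i(rng.integers(0, 256, size=(27, 27)).astype(float)))
```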

    Techniques for data pattern selection and abstraction

    This thesis concerns the problem of prototype reduction in instance-based learning. In order to deal with problems such as storage requirements, sensitivity to noise, and computational complexity, various algorithms have been presented that condense the number of stored prototypes while maintaining competent classification accuracy. Instance selection, which recovers a smaller subset of the original training set, is the most widely used technique for instance reduction. However, prototype abstraction, which generates new prototypes to replace the initial ones, has also gained considerable interest recently. The major contribution of this work is the proposal of four novel frameworks for performing prototype reduction: the Class Boundary Preserving algorithm (CBP), a hybrid method that uses both selection and generation of prototypes; Instance Seriation for Prototype Abstraction (ISPA), which is an abstraction algorithm; and two selective techniques, Spectral Instance Reduction (SIR) and Direct Weight Optimization (DWO). CBP is a multi-stage method based on a simple heuristic that is very effective in identifying samples close to class borders. Using a noise filter, harmful instances are removed, while the powerful heuristic determines the geometrical distribution of patterns around every instance. Together with the concepts of nearest enemy pairs and mean shift clustering, this algorithm decides on the final set of retained prototypes. DWO is a selection model whose output set of prototypes is decided by a set of binary weights. These weights are computed according to an objective function composed of the ratio between the nearest friend and nearest enemy of every sample. In order to obtain good-quality results, DWO is optimized using a genetic algorithm. ISPA is an abstraction technique that employs the concept of data seriation to organize instances in an arrangement that favours merging between them. As a result, a new set of prototypes is created. Results show that CBP, SIR, and DWO, the three major algorithms presented in this thesis, are competent and efficient in terms of at least one of the two basic objectives, classification accuracy and condensation ratio. The comparison against other successful condensation algorithms illustrates the competitiveness of the proposed models. The SIR algorithm presents a set of border discriminating features (BDFs) that depict the local distribution of friends and enemies of all samples. These are then used along with spectral graph theory to partition the training set into border and internal instances.
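    A minimal sketch, under illustrative assumptions rather than the thesis's algorithms: computing, for every training sample, the distance to its nearest friend (same class) and nearest enemy (different class), the quantities the DWO objective is built from, and keeping samples near class borders with a hand-picked threshold instead of the genetic-algorithm-optimized binary weights.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import pairwise_distances

X, y = load_wine(return_X_y=True)
D = pairwise_distances(X)
np.fill_diagonal(D, np.inf)               # a point is not its own friend

same = y[:, None] == y[None, :]
nearest_friend = np.where(same, D, np.inf).min(axis=1)   # closest same-class sample
nearest_enemy = np.where(~same, D, np.inf).min(axis=1)   # closest different-class sample

# Samples whose nearest enemy is not much farther than their nearest friend lie
# close to a class boundary; the 2.0 threshold is purely illustrative (the thesis
# instead optimizes binary selection weights with a genetic algorithm).
ratio = nearest_enemy / nearest_friend
keep = ratio < 2.0
print(f"retained {keep.sum()} of {len(X)} prototypes "
      f"({100 * (1 - keep.mean()):.1f}% condensation)")
```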

    An exploration of methodologies to improve semi-supervised hierarchical clustering with knowledge-based constraints

    Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant of the conventional unsupervised clustering algorithms. They have been demonstrated to achieve better performance by integrating prior knowledge during the clustering process, which enables uncovering relevant and useful information from the data being clustered. However, the development of semi-supervised hierarchical clustering techniques remains an open and active area of investigation. The majority of current semi-supervised clustering algorithms are developed as partitional clustering (PC) methods, and only a few research efforts have been made to develop semi-supervised hierarchical clustering methods. The aim of this research is to enhance hierarchical clustering (HC) algorithms based on prior knowledge by adopting novel methodologies. [Continues.
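    A minimal sketch of the general idea, not one of the thesis's methodologies: injecting must-link and cannot-link prior knowledge into hierarchical clustering by editing the pairwise distance matrix before running average-linkage agglomeration. The dataset, constraint pairs, and penalty value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances

X, y = load_iris(return_X_y=True)
D = pairwise_distances(X)

must_link = [(0, 1), (50, 51)]            # illustrative constraint pairs
cannot_link = [(0, 100)]

BIG = D.max() * 10.0
for i, j in must_link:                    # force constrained pairs to merge early
    D[i, j] = D[j, i] = 0.0
for i, j in cannot_link:                  # push constrained pairs far apart
    D[i, j] = D[j, i] = BIG

model = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                linkage="average")
labels = model.fit_predict(D)
print(labels[:10])
```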

    DATA VISUALIZATION OF ASYMMETRIC DATA USING SAMMON MAPPING AND APPLICATIONS OF SELF-ORGANIZING MAPS

    Data visualization can be used to detect hidden structures and patterns in data sets that are found in data mining applications. Although efficient data visualization algorithms that handle data sets with asymmetric proximities have been proposed, we develop an improved algorithm in this dissertation. In the first part of the proposal, we develop a modified Sammon mapping approach that uses the upper triangular part and the lower triangular part of an asymmetric distance matrix simultaneously. Our proposed approach is applied to two asymmetric data sets: an American college selection data set and a Canadian college selection data set that contains rank information. When compared to other approaches used in practice, our modified approach generates visual maps that have smaller distance errors and provide more reasonable representations of the data sets. In data visualization, self-organizing maps (SOM) have been used to cluster points. In the second part of the proposal, we assess the performance of several software implementations of SOM-based methods. Viscovery SOMine is found to be helpful in determining the number of clusters and recovering the cluster structure of data sets. A genocide and politicide data set is analyzed using Viscovery SOMine, followed by another analysis of the public and private college data sets with the goal of identifying the schools with the best value.
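    A minimal sketch assuming a simple reading of the first idea above, not the dissertation's algorithm: a Sammon-style stress that compares both delta[i, j] (upper triangle) and delta[j, i] (lower triangle) of an asymmetric proximity matrix against the single symmetric distance between embedded points, minimized numerically. The data are synthetic and the optimizer choice is illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n = 12
delta = rng.uniform(0.5, 3.0, size=(n, n))   # synthetic asymmetric proximities
np.fill_diagonal(delta, 0.0)

def stress(flat_coords: np.ndarray) -> float:
    """Sammon-style stress summed over both triangular parts of the asymmetric matrix."""
    Y = flat_coords.reshape(n, 2)
    d = squareform(pdist(Y))                 # symmetric distances of the embedded points
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:                       # uses delta[i, j] and delta[j, i] separately
                total += (delta[i, j] - d[i, j]) ** 2 / delta[i, j]
    return total / delta[np.nonzero(delta)].sum()

result = minimize(stress, rng.normal(size=n * 2), method="L-BFGS-B")
print(f"final stress: {result.fun:.4f}")
```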

    Embedding Based Link Prediction for Knowledge Graph Completion

    Knowledge Graphs (KGs) are the most widely used representation of structured information about a particular domain, consisting of billions of facts in the form of entities (nodes) and relations (edges) between them. In addition, KGs encapsulate the semantic type information of the entities. The last two decades have witnessed constant growth of KGs in various domains such as government, scholarly data, the biomedical domain, etc. KGs have been used in machine learning based applications such as entity linking, question answering, recommender systems, etc. Open KGs are mostly heuristically created, automatically generated from heterogeneous resources such as text, images, etc., or are human-curated. However, these KGs are often incomplete, i.e., there are missing links between the entities and missing links between the entities and their corresponding entity types. This thesis focuses on addressing these two challenges of link prediction for Knowledge Graph Completion (KGC): (i) general link prediction in KGs, which includes head and tail prediction and triple classification, and (ii) entity type prediction. Most graph mining algorithms are of high complexity, deterring their usage in KG-based applications. In recent years, KG embeddings have been trained to represent the entities and relations in the KG in a low-dimensional vector space that preserves the graph structure. In most published works, such as the translational models, convolutional models, semantic matching models, etc., the triple information is used to generate the latent representation of the entities and relations. In this dissertation, it is argued that contextual information about the entities, obtained from random walks and from textual entity descriptions, is the key to improving the latent representation of the entities for KGC. The experimental results show that the knowledge obtained from the context of the entities supports this hypothesis. Several methods have been proposed for KGC, and their effectiveness is shown empirically in this thesis. Firstly, a novel multi-hop attentive KG embedding model, MADLINK, is proposed for link prediction. It considers the contextual information of the entities by using random walks as well as textual entity descriptions. Secondly, a novel architecture exploiting the information contained in a pre-trained contextual Neural Language Model (NLM) is proposed for triple classification. Thirdly, the limitations of the current state-of-the-art (SoTA) entity type prediction models are analysed, and a novel entity typing model, CAT2Type, is proposed that exploits Wikipedia categories, one of the most under-exploited features of KGs. This model can also be used to predict the missing types of unseen entities, i.e., newly added entities in the KG. Finally, another novel architecture, GRAND, is proposed to predict the missing entity types in KGs using multi-label, multi-class, and hierarchical classification by leveraging different strategic graph walks in the KGs. The extensive experiments and ablation studies show that all the proposed models outperform the current SoTA models and set new baselines for KGC. The proposed models establish that NLMs and the contextual information of the entities in the KGs, together with the different neural network architectures, benefit KGC. The promising results and observations open up interesting avenues for future research, such as exploiting the proposed models in domain-specific KGs for scholarly data, biomedical data, etc. Furthermore, the link prediction model can be exploited as a base model for the entity alignment task, as it considers the neighbourhood information of the entities.
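    A minimal sketch of the link prediction task itself, assuming a plain TransE-style translational model (one of the embedding families mentioned above) rather than the MADLINK, CAT2Type, or GRAND architectures: rank candidate tails for a query triple by the score -||h + r - t||. The entities, relation, and untrained random embeddings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
entities = ["Berlin", "Germany", "Paris", "France"]
relations = ["capital_of"]

dim = 16
E = rng.normal(size=(len(entities), dim))    # entity embeddings (untrained, illustrative)
R = rng.normal(size=(len(relations), dim))   # relation embeddings (untrained, illustrative)

def score(h: int, r: int, t: int) -> float:
    """Higher is more plausible: negative translation distance ||h + r - t||."""
    return -np.linalg.norm(E[h] + R[r] - E[t])

# Tail prediction: rank all candidate tails for the query (Berlin, capital_of, ?).
h, r = entities.index("Berlin"), relations.index("capital_of")
ranking = sorted(range(len(entities)), key=lambda t: -score(h, r, t))
print([entities[t] for t in ranking])
```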

    Advanced Biometrics with Deep Learning

    Biometrics, such as fingerprint, iris, face, hand print, hand vein, speech, and gait recognition, have become a commonplace means of identity management in various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction, and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into four categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voice print, and others.

    Facing-up Challenges of Multiobjective Clustering Based on Evolutionary Algorithms: Representations, Scalability and Retrieval Solutions

    This thesis is focused on multiobjective clustering algorithms, which are based on optimizing several objectives simultaneously, obtaining a collection of potential solutions with different trade-offs among the objectives. The goal of the thesis is to design and implement a new multiobjective clustering technique based on evolutionary algorithms that faces three current challenges related to these techniques. The first challenge is to adequately define the area of possible solutions that is explored in order to find the best solution, which depends on the knowledge representation. The second challenge is to scale up the system by splitting the original data set into several data subsets in order to work with less data in the clustering process. The third challenge concerns the retrieval of the most suitable solution, according to the quality and shape of the clusters, from the most interesting region of the collection of solutions returned by the algorithm.
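    A minimal sketch under illustrative assumptions, not the thesis's evolutionary algorithm: evaluate candidate clusterings on two conflicting objectives and keep the non-dominated (Pareto) set from which a single solution would later be retrieved. Using k-means runs with different k as the candidate pool, and compactness plus silhouette as the two objectives, are assumptions for the example only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Candidate solutions: k-means runs with different numbers of clusters.
candidates = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    compactness = km.inertia_                          # objective 1: minimize
    separation = -silhouette_score(X, km.labels_)      # objective 2: minimize (negated silhouette)
    candidates.append((k, compactness, separation))

def dominated(a, b):
    """True if candidate b is at least as good as a on both objectives and better on one."""
    return (b[1] <= a[1] and b[2] <= a[2]) and (b[1] < a[1] or b[2] < a[2])

# Non-dominated filter: the surviving candidates form the Pareto collection.
pareto = [a for a in candidates if not any(dominated(a, b) for b in candidates)]
for k, comp, sep in pareto:
    print(f"k={k}: compactness={comp:.1f}, silhouette={-sep:.3f}")
```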

    Decentralized Riemannian Particle Filtering with Applications to Multi-Agent Localization

    The primary focus of this research is to develop consistent nonlinear decentralized particle filtering approaches to the problem of multiple agent localization. A key aspect of our development is the use of Riemannian geometry to exploit the inherently non-Euclidean characteristics that are typical of multiple agent localization scenarios. A decentralized formulation is considered due to the practical advantages it provides over centralized fusion architectures. Inspiration is taken from the relatively new field of information geometry and the more established research field of computer vision. Differential geometric tools such as manifolds, geodesics, tangent spaces, and exponential and logarithmic mappings are used extensively to describe probabilistic quantities. Numerous probabilistic parameterizations were considered before settling on the efficient square-root probability density function parameterization. The square-root parameterization has the benefit of allowing filter calculations to be carried out on the well-studied Riemannian unit hypersphere. A key advantage of selecting the unit hypersphere is that it permits closed-form calculations, a characteristic that is not shared by current solution approaches. Through the use of the Riemannian geometry of the unit hypersphere, we are able to demonstrate the ability to produce estimates that are not overly optimistic. Results are presented that clearly show the ability of the proposed approaches to outperform current state-of-the-art decentralized particle filtering methods. In particular, results are presented that emphasize the achievable improvement in estimation error, estimator consistency, and required computational burden.
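    A minimal sketch of two of the differential-geometric tools mentioned above, the exponential and logarithmic maps on the unit hypersphere, written as standalone formulas rather than as part of the decentralized filter itself. The 3-dimensional test points are illustrative.

```python
import numpy as np

def exp_map(p: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Map a tangent vector v at p (v orthogonal to p) onto the unit sphere."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * (v / norm_v)

def log_map(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Map a sphere point q into the tangent space at p (inverse of exp_map)."""
    cos_theta = np.clip(np.dot(p, q), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros_like(p)
    u = q - cos_theta * p
    return theta * u / np.linalg.norm(u)

# Round trip: square-root-parameterized densities live on the unit hypersphere,
# so a point there maps to the tangent space and back without loss.
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
print(np.allclose(exp_map(p, log_map(p, q)), q))   # True
```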