    Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of Information

    In this paper, we provide an approach to clustering relational matrices whose entries correspond to either similarities or dissimilarities between objects. Our approach is based on the value of information, a parameterized, information-theoretic criterion that measures the change in costs associated with changes in information. Optimizing the value of information yields a deterministic-annealing style of clustering with many benefits. For instance, investigators need not specify the number of clusters a priori: the partitions naturally undergo phase changes during the annealing process, whereby the number of clusters changes in a data-driven fashion. The global-best partition can also often be identified. Comment: Submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

    Supervised Classification: Quite a Brief Overview

    The original problem of supervised classification considers the task of automatically assigning objects to their respective classes on the basis of numerical measurements derived from these objects. Classifiers are the tools that implement the actual functional mapping from these measurements (also called features or inputs) to the so-called class label (or output). The fields of pattern recognition and machine learning study ways of constructing such classifiers. The main idea behind supervised methods is that of learning from examples: given a number of example input-output relations, to what extent can the general mapping be learned that takes any new and unseen feature vector to its correct class? This chapter provides a basic introduction to the underlying ideas of how to arrive at a supervised classification problem. In addition, it provides an overview of some specific classification techniques, delves into the issues of object representation and classifier evaluation, and (very) briefly covers some variations on the basic supervised classification task that may also be of interest to the practitioner.

    Median topographic maps for biomedical data sets

    Median clustering extends popular neural data analysis methods such as the self-organizing map or neural gas to general data structures given by a dissimilarity matrix only. This offers flexible and robust global data inspection methods which are particularly suited for the variety of data that occurs in biomedical domains. In this chapter, we give an overview of median clustering and its properties and extensions, with a particular focus on efficient implementations adapted to large-scale data analysis.
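    To make the idea of clustering from a dissimilarity matrix alone concrete, here is a minimal sketch of a batch median (medoid-style) clustering loop. It is not the chapter's implementation: the function name `median_clustering`, the initialisation with the first `k` indices, the restriction of prototype candidates to cluster members, and the toy data are all assumptions made for illustration.

```python
def median_clustering(D, k, n_iter=20):
    """Batch median clustering on a square dissimilarity matrix D.

    Hypothetical helper: prototypes are data indices, initialised to the
    first k indices (an assumption that keeps the sketch deterministic).
    """
    n = len(D)
    prototypes = list(range(k))
    assignment = [0] * n
    for _ in range(n_iter):
        # Assignment step: each object joins its least dissimilar prototype.
        assignment = [
            min(range(k), key=lambda c: D[i][prototypes[c]]) for i in range(n)
        ]
        # Update step: the new prototype of a cluster is the member with the
        # smallest summed dissimilarity to all members of that cluster.
        updated = []
        for c in range(k):
            members = [i for i in range(n) if assignment[i] == c]
            if not members:
                updated.append(prototypes[c])  # keep prototype of an empty cluster
                continue
            updated.append(
                min(members, key=lambda j: sum(D[i][j] for i in members))
            )
        if updated == prototypes:
            break  # prototypes stopped changing
        prototypes = updated
    return prototypes, assignment


# Toy data (made up): only the dissimilarity matrix is passed to the algorithm.
points = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in points] for a in points]
prototypes, assignment = median_clustering(D, k=2)
```

    Because prototypes are always data indices, the loop never needs vector coordinates, which is what makes this family of methods applicable to non-vectorial data.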

    Tree Edit Distance Learning via Adaptive Symbol Embeddings

    Metric learning aims to improve classification accuracy by learning a distance measure which brings data points from the same class closer together and pushes data points from different classes further apart. Recent research has demonstrated that metric learning approaches can also be applied to trees, such as molecular structures, abstract syntax trees of computer programs, or syntax trees of natural language, by learning the cost function of an edit distance, i.e. the costs of replacing, deleting, or inserting nodes in a tree. However, learning such costs directly may yield an edit distance which violates metric axioms, is challenging to interpret, and may not generalize well. In this contribution, we propose a novel metric learning approach for trees which we call embedding edit distance learning (BEDL) and which learns an edit distance indirectly by embedding the tree nodes as vectors, such that the Euclidean distance between those vectors supports class discrimination. We learn such embeddings by reducing the distance to prototypical trees from the same class and increasing the distance to prototypical trees from different classes. In our experiments, we show that BEDL improves upon the state-of-the-art in metric learning for trees on six benchmark data sets, ranging from computer science and biomedical data to a natural-language processing data set containing over 300,000 nodes. Comment: Paper at the International Conference on Machine Learning (2018), 2018-07-10 to 2018-07-15 in Stockholm, Sweden.
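    The idea of deriving edit costs from symbol embeddings can be sketched as follows. This is a simplified stand-in, not BEDL itself: it uses a sequence edit distance rather than a tree edit distance, it does no learning, and it assumes the gap symbol is embedded at the origin so that deletion and insertion cost the norm of the symbol's vector. The names `euclid`, `embedding_edit_distance`, and the toy embedding `emb` are hypothetical.

```python
import math


def euclid(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def embedding_edit_distance(xs, ys, emb):
    """Sequence edit distance whose costs come from symbol embeddings.

    Replacement cost is the Euclidean distance between the two embeddings.
    Deletion and insertion cost is the distance to an all-zero gap vector
    (an assumption made for this sketch).
    """
    dim = len(next(iter(emb.values())))
    gap = (0.0,) * dim
    d = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + euclid(emb[xs[i - 1]], gap)
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + euclid(gap, emb[ys[j - 1]])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + euclid(emb[xs[i - 1]], emb[ys[j - 1]]),  # replace
                d[i - 1][j] + euclid(emb[xs[i - 1]], gap),                 # delete
                d[i][j - 1] + euclid(gap, emb[ys[j - 1]]),                 # insert
            )
    return d[len(xs)][len(ys)]


# Toy embedding (made up): "a" and "c" are close, "b" is far from both.
emb = {"a": (1.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 0.1)}
close = embedding_edit_distance("a", "c", emb)
far = embedding_edit_distance("a", "b", emb)
```

    Since every cost is a Euclidean distance between vectors, the costs are symmetric and satisfy the triangle inequality by construction, which is the property that directly learned cost tables may lack.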

    A dissimilarity representation approach to designing systems for signature verification and bio-cryptography

    Automation of legal and financial processes requires enforcing the authenticity, confidentiality, and integrity of the involved transactions. This Thesis focuses on developing offline signature verification (OLSV) systems for enforcing authenticity of transactions. In addition, bio-cryptography systems are developed based on offline handwritten signature images for enforcing confidentiality and integrity of transactions. Design of OLSV systems is challenging, as signatures are behavioral biometric traits that have intrinsic intra-personal variations and inter-personal similarities. Standard OLSV systems are designed in the feature representation (FR) space, where high-dimensional feature representations are needed to capture the invariance of the signature images. With the numerous users found in real-world applications, e.g., banking systems, decision boundaries in the high-dimensional FR spaces become complex. Accordingly, a large number of training samples is required to design complex classifiers, which is not practical in typical applications. The design of bio-cryptography systems based on offline signature images is even more challenging. In these systems, signature images lock the cryptographic keys, and a user retrieves his key by applying a query signature sample. For practical bio-cryptographic schemes, the locking feature vector should be concise. In addition, such schemes employ simple error correction decoders, and therefore no complex classification rules can be employed. In this Thesis, the challenging problems of designing OLSV and bio-cryptography systems are addressed by employing the dissimilarity representation (DR) approach. Instead of designing classifiers in the feature space, the DR approach provides a classification space that is defined by some proximity measure. 
This way, a multi-class classification problem, with few samples per class, is transformed to a more tractable two-class problem with a large number of training samples. Since many feature extraction techniques have already been proposed for OLSV applications, a DR approach based on FR is employed. In this case, proximity between two signatures is measured by applying a dissimilarity measure on their feature vectors. The main hypothesis of this Thesis is as follows. The FRs and dissimilarity measures should be properly designed, so that signatures belonging to the same writer are close, while signatures of different writers are well separated in the resulting DR spaces. In that case, more cost-effective classifiers, and therefore simpler OLSV and bio-cryptography systems, can be designed. To this end, in Chapter 2, an approach for optimizing FR-based DR spaces is proposed such that concise representations are discriminant, and simple classification thresholds are sufficient. High-dimensional feature representations are translated to an intermediate DR space, where pairwise feature distances are the space constituents. Then, a two-step boosting feature selection (BFS) algorithm is applied. The first step uses samples from a development database, and aims to produce a universal space of reduced dimensionality. The resulting universal space is further reduced and tuned for specific users through a second BFS step using a user-specific training set. In the resulting space, feature variations are modeled and an adaptive dissimilarity measure is designed. This measure generates the final DR space, where discriminant prototypes are selected for enhanced representation. The OLSV and bio-cryptographic systems are formulated as simple threshold classifiers that operate in the designed DR space. Proof-of-concept simulations on the Brazilian signature database indicate the viability of the proposed approach. Concise DRs with few features and a single prototype are produced. 
Employing a simple threshold classifier, the DRs have shown state-of-the-art accuracy of about 7% AER, comparable to complex systems in the literature. In Chapter 3, the OLSV problem is further studied. Although the aforementioned OLSV implementation has shown acceptable recognition accuracy, the resulting systems are not secure, as signature templates must be stored for verification. For enhanced security, we modified the previous implementation as follows. The first BFS step is implemented as described above, producing a writer-independent (WI) system. This enables starting system operation even if users provide a single signature sample in the enrollment phase. However, the second BFS step is modified to run in an FR space instead of a DR space, so that no signature templates are used for verification. To this end, the universal space is translated back to an FR space of reduced dimensionality, so that designing a writer-dependent (WD) system from the few user-specific samples is tractable in the reduced space. Simulation results on two real-world offline signature databases confirm the feasibility of the proposed approach. The initial universal (WI) verification mode showed comparable performance to that of state-of-the-art OLSV systems. The final secure WD verification mode showed enhanced accuracy with decreased computational complexity. A single compact classifier produced a similar level of accuracy (AER of about 5.38% and 13.96% for the Brazilian and the GPDS signature databases, respectively) as complex WI and WD systems in the literature. Finally, in Chapter 4, a key-binding bio-cryptographic scheme known as the fuzzy vault (FV) is implemented based on the offline signature images. The proposed DR-based two-step BFS technique is employed for selecting a compact and discriminant user-specific FR from a large number of feature extractions. This representation is used to generate the FV locking/unlocking points. 
Representation variability modeled in the DR space is considered for matching the unlocking and locking points during FV decoding. Proof-of-concept simulations on the Brazilian signature database have shown FV recognition accuracy of 3% AER and system entropy of about 45 bits. For enhanced security, an adaptive chaff generation method is proposed, where the modeled variability controls the chaff generation process. Similar recognition accuracy is reported, with an enhanced entropy of about 69 bits.
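    The step from a multi-class problem in feature space to a two-class problem in a dissimilarity space can be illustrated with a short sketch. It is not the thesis's method: it omits the boosting feature selection and the adaptive dissimilarity measure, and it assumes element-wise absolute differences as the dissimilarity vector and a fixed threshold on their sum. The function names, feature values, and threshold are hypothetical.

```python
from itertools import combinations


def dissimilarity_vector(u, v):
    """Element-wise absolute difference of two feature vectors."""
    return [abs(a - b) for a, b in zip(u, v)]


def pairwise_training_set(samples):
    """Turn (writer_id, features) samples into a two-class pair set.

    Each pair of samples yields one dissimilarity vector, labelled 1 when
    both samples come from the same writer and 0 otherwise.
    """
    pairs = []
    for (w1, f1), (w2, f2) in combinations(samples, 2):
        pairs.append((dissimilarity_vector(f1, f2), int(w1 == w2)))
    return pairs


def verify(query, prototype, threshold):
    """Accept the query when its summed dissimilarity to the prototype is small."""
    return sum(dissimilarity_vector(query, prototype)) <= threshold


# Toy features (made up): three writers with two samples each.
samples = [
    ("w1", [0.10, 0.20]), ("w1", [0.12, 0.18]),
    ("w2", [0.90, 0.80]), ("w2", [0.88, 0.83]),
    ("w3", [0.50, 0.10]), ("w3", [0.52, 0.12]),
]
pairs = pairwise_training_set(samples)
n_within = sum(label for _, label in pairs)
accepted = verify([0.11, 0.19], [0.10, 0.20], threshold=0.1)
```

    Six samples spread over three classes become fifteen labelled pairs in two classes, which shows why the pairwise view gives more training data per class than the original per-writer view.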

    Explanation of Siamese Neural Networks for Weakly Supervised Learning

    A new method for explaining the Siamese neural network (SNN) as a black-box model for weakly supervised learning is proposed under the condition that the output of every subnetwork of the SNN is an accessible vector. The main problem of the explanation is that the perturbation technique cannot be used directly for input instances because only their semantic similarity or dissimilarity is known. Moreover, there is no "inverse" map between the SNN output vector and the corresponding input instance. Therefore, a special autoencoder is proposed, which takes into account the proximity of its hidden representation and the SNN outputs. Its pre-trained decoder part as well as the encoder are used to reconstruct original instances from the perturbed SNN output vectors. The important features of the explained instances are determined by averaging the corresponding changes of the reconstructed instances. Numerical experiments with synthetic data and with the well-known MNIST dataset illustrate the proposed method.