68 research outputs found

    Spectral Dimensionality Reduction

    In this paper, we study and put under a common framework a number of non-linear dimensionality reduction methods, such as Locally Linear Embedding, Isomap, Laplacian Eigenmaps and kernel PCA, which are based on performing an eigen-decomposition (hence the name 'spectral'). That framework also includes classical methods such as PCA and metric multidimensional scaling (MDS), as well as the data transformation step used in spectral clustering. We show that in all of these cases the learning algorithm estimates the principal eigenfunctions of an operator that depends on the unknown data density and on a kernel that is not necessarily positive semi-definite. This helps to generalize some of these algorithms so as to predict an embedding for out-of-sample examples without having to retrain the model. It also makes more transparent what these algorithms are minimizing on the empirical data and gives a corresponding notion of generalization error.
    Keywords: non-parametric models, non-linear dimensionality reduction, kernel models
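
The out-of-sample idea in this abstract can be illustrated with a small sketch. It assumes a Gaussian kernel with double centering (a kernel-PCA-style setting) and uses a Nyström-type formula to embed new points from the eigenvectors of the training Gram matrix. The function names, the `sigma` parameter and the dictionary layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_spectral_embedding(X, n_components=2, sigma=1.0):
    """Eigen-decompose the centered Gram matrix of the training data."""
    K = gaussian_kernel(X, X, sigma)
    col_mean = K.mean(axis=0)          # equals the row means, since K is symmetric
    total_mean = K.mean()
    Kc = K - col_mean[None, :] - col_mean[:, None] + total_mean
    eigvals, eigvecs = np.linalg.eigh(Kc)
    top = np.argsort(eigvals)[::-1][:n_components]
    return {"X": X, "sigma": sigma, "col_mean": col_mean,
            "total_mean": total_mean, "lam": eigvals[top], "V": eigvecs[:, top]}

def embed(model, X_new):
    """Embed new points without refitting (Nystrom-type formula)."""
    k = gaussian_kernel(X_new, model["X"], model["sigma"])
    # Center the new kernel rows with the training statistics.
    kc = k - k.mean(axis=1, keepdims=True) - model["col_mean"][None, :] + model["total_mean"]
    return kc @ model["V"] / np.sqrt(model["lam"])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
model = fit_spectral_embedding(X_train, n_components=2)
Y_new = embed(model, rng.normal(size=(5, 3)))   # shape (5, 2)
```

Applied to the training points themselves, `embed` reproduces the in-sample coordinates `V * sqrt(lam)`, which is a convenient sanity check for this kind of extension.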

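    Metodología de visualización de datos utilizando métodos espectrales y basados en divergencias para la reducción interactiva de la dimensión [Data visualization methodology using spectral and divergence-based methods for interactive dimensionality reduction]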

    Pattern recognition tasks apply methods that evolve in step with the growth of data, achieving efficient metrics in terms of optimization and computational performance applied to data exploration, selection and representation. However, the results provided by these methods and tools can be ambiguous and/or abstract for the user, making them hard to apply, all the more so when users have no prior knowledge of the data. Having a priori knowledge guarantees, in most cases, the correct selection of the model as well as of suitable algorithms and methods. In massive data, however, where such knowledge is scarce and hard to obtain, interpretation can be arduous for users, especially non-expert ones. Consequently, several problems have emerged that pattern recognition must face; among the most important are dimensionality reduction, interaction with large volumes of information, and the interpretation and visualization of data. These can be framed in terms of controllability and interaction, properties that are mostly absent from typical research in the field of dimensionality reduction. This thesis presents a new approach to data visualization based on the interactive mixture of the results of dimensionality reduction (DR) methods. The mixture is a weighted sum whose weighting factors are defined by the user through a visual and intuitive interface. In addition, the low-dimensional representation space produced by DR methods is rendered graphically as scatter plots driven by a controlled interactive data visualization. To this end, pairwise distances are computed by similarity and used to define the graphic shown in the scatter plot.
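
The weighted-sum mixture described in this abstract can be sketched as follows. This is only a minimal illustration: it assumes the outputs of the DR methods are available as arrays of equal shape, and the standardization step and the names `standardize` and `mix_embeddings` are assumptions made here, not details given in the thesis.

```python
import numpy as np

def standardize(Y):
    """Center an embedding and scale it to unit Frobenius norm so methods are comparable."""
    Y = np.asarray(Y, dtype=float)
    Y = Y - Y.mean(axis=0)
    scale = np.linalg.norm(Y)
    return Y / scale if scale > 0 else Y

def mix_embeddings(embeddings, weights):
    """Weighted sum of several low-dimensional embeddings of the same points."""
    w = np.asarray(weights, dtype=float)
    if w.ndim != 1 or len(w) != len(embeddings):
        raise ValueError("one weight per embedding is required")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()                     # user-chosen weights, normalized to sum to 1
    return sum(wi * standardize(Y) for wi, Y in zip(w, embeddings))

# Two hypothetical 2-D embeddings of the same four points.
Y_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y_b = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 0.5], [2.0, 0.5]])
mixed = mix_embeddings([Y_a, Y_b], weights=[0.7, 0.3])
```

In an interactive tool, the `weights` argument would be driven by the user interface, and the scatter plot would be redrawn from `mixed` after each change.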

    Studies on dimension reduction and feature spaces

    Today's world produces and stores huge amounts of data, which calls for methods that can tackle both growing sizes and growing dimensionalities of data sets. Dimension reduction aims at answering the challenges posed by the latter. Many dimension reduction methods consist of a metric transformation part followed by optimization of a cost function. Several classes of cost functions have been developed and studied, while metrics have received less attention. We promote the view that metrics should be lifted to a more independent role in dimension reduction research. The subject of this work is the interaction of metrics with dimension reduction. The work is built on a series of studies on current topics in dimension reduction and neural network research. Neural networks are used both as a tool and as a target for dimension reduction. When the results of modeling or clustering are represented as a metric, they can be studied using dimension reduction, or they can be used to introduce new properties into a dimension reduction method. We give two examples of such use: visualizing results of hierarchical clustering, and creating supervised variants of existing dimension reduction methods by using a metric that is built on the feature space of a neural network. Combining clustering with dimension reduction results in a novel way of creating space-efficient visualizations that convey both the hierarchical structure and the distances between clusters. We study feature spaces used in a recently developed neural network architecture called the extreme learning machine. We give a novel interpretation of such neural networks, and recognize the need to parameterize extreme learning machines with the variance of network weights. This has practical implications for the use of extreme learning machines, since current practice emphasizes the role of hidden units and ignores the variance.
A current trend in the research of deep neural networks is to use cost functions from dimension reduction methods to train the network for supervised dimension reduction. We show that equally good results can be obtained by training a bottlenecked neural network for classification or regression, which is faster than using a dimension reduction cost. We demonstrate that, contrary to current belief, using sparse distance matrices for creating fast dimension reduction methods is feasible if a proper balance between short-distance and long-distance entries in the sparse matrix is maintained. This observation opens up a promising research direction, with the possibility of using modern dimension reduction methods on much larger data sets than are manageable today.

    Batch and median neural gas

    Neural Gas (NG) is a very robust clustering algorithm for Euclidean data which does not suffer from the problem of local minima, as simple vector quantization does, or from topological restrictions, as the self-organizing map does. Based on the cost function of NG, we introduce a batch variant of NG which shows much faster convergence and which can be interpreted as an optimization of the cost function by the Newton method. This formulation has the additional benefit that, based on the notion of the generalized median in analogy to Median SOM, a variant for non-vectorial proximity data can be introduced. We prove convergence of batch and median versions of NG, SOM, and k-means in a unified formulation, and we investigate the behavior of the algorithms in several experiments.
    Comment: In Special Issue after WSOM 05 Conference, 5-8 September 2005, Pari
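
A batch update of the kind this abstract refers to can be sketched in a few lines. The sketch assumes the usual rank-based neighborhood function exp(-rank / lambda) with an exponentially decreasing lambda; the annealing schedule, the initialization from random data points and all parameter names are illustrative choices, not values taken from the paper.

```python
import numpy as np

def batch_neural_gas(X, n_prototypes=3, n_epochs=50, seed=0):
    """Batch Neural Gas sketch: each epoch recomputes all prototypes at once."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: prototypes start at randomly chosen data points.
    W = X[rng.choice(len(X), size=n_prototypes, replace=False)].copy()
    lam_start, lam_end = max(n_prototypes / 2.0, 0.01), 0.01
    for t in range(n_epochs):
        # Neighborhood range, annealed from lam_start down to lam_end.
        lam = lam_start * (lam_end / lam_start) ** (t / max(n_epochs - 1, 1))
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=-1)   # shape (n, k)
        # ranks[j, i] = number of prototypes closer to point j than prototype i.
        ranks = d2.argsort(axis=1).argsort(axis=1)
        h = np.exp(-ranks / lam)
        numer = h.T @ X                                             # shape (k, d)
        denom = h.sum(axis=0)                                       # shape (k,)
        # Each prototype becomes the rank-weighted mean of all data points;
        # keep the old prototype if its weights underflow to zero.
        safe = np.maximum(denom, np.finfo(float).tiny)[:, None]
        W = np.where(denom[:, None] > 0, numer / safe, W)
    return W

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
                  for c in ([0, 0], [5, 0], [0, 5])])
prototypes = batch_neural_gas(data, n_prototypes=3)
```

With a single prototype every rank is zero, so the update reduces to the mean of the data, the same fixed point as batch k-means with one center.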

    Kern-basierte Lernverfahren für das virtuelle Screening [Kernel-based learning methods for virtual screening]

    We investigate the utility of modern kernel-based machine learning methods for ligand-based virtual screening. In particular, we introduce a new graph kernel based on iterative graph similarity and optimal assignments, apply kernel principal component analysis to projection error-based novelty detection, and discover a new selective agonist of the peroxisome proliferator-activated receptor gamma using Gaussian process regression. Virtual screening, the computational ranking of compounds with respect to a predicted property, is a cheminformatics problem relevant to the hit generation phase of drug development. Its ligand-based variant relies on the similarity principle, which states that (structurally) similar compounds tend to have similar properties. We describe the kernel-based machine learning approach to ligand-based virtual screening; in this, we stress the role of molecular representations, including the (dis)similarity measures defined on them, investigate effects in high-dimensional chemical descriptor spaces and their consequences for similarity-based approaches, review literature recommendations on retrospective virtual screening, and present an example workflow. Graph kernels are formal similarity measures that are defined directly on graphs, such as the annotated molecular structure graph, and correspond to inner products. We review graph kernels, in particular those based on random walks, subgraphs, and optimal vertex assignments. Combining the latter with an iterative graph similarity scheme, we develop the iterative similarity optimal assignment graph kernel, give an iterative algorithm for its computation, prove convergence of the algorithm and the uniqueness of the solution, and provide an upper bound on the number of iterations necessary to achieve a desired precision. In a retrospective virtual screening study, our kernel consistently improved performance over chemical descriptors as well as other optimal assignment graph kernels.
Chemical data sets often lie on manifolds of lower dimensionality than the embedding chemical descriptor space. Dimensionality reduction methods try to identify these manifolds, effectively providing descriptive models of the data. For spectral methods based on kernel principal component analysis, the projection error is a quantitative measure of how well new samples are described by such models. This can be used for the identification of compounds structurally dissimilar to the training samples, leading to projection error-based novelty detection for virtual screening using only positive samples. We provide proof of principle by using principal component analysis to learn the concept of fatty acids. The peroxisome proliferator-activated receptor (PPAR) is a nuclear transcription factor that regulates lipid and glucose metabolism, playing a crucial role in the development of type 2 diabetes and dyslipidemia. We establish a Gaussian process regression model for PPAR gamma agonists using a combination of chemical descriptors and the iterative similarity optimal assignment kernel via multiple kernel learning. Screening of a vendor library and subsequent testing of 15 selected compounds in a cell-based transactivation assay resulted in 4 active compounds. One compound, a natural product with a cyclobutane scaffold, is a full selective PPAR gamma agonist (EC50 = 10 ± 0.2 µM, inactive on PPAR alpha and PPAR beta/delta at 10 µM). The study delivered a novel PPAR gamma agonist, de-orphanized a natural bioactive product, and hints at the natural product origins of pharmacophore patterns in synthetic ligands.
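
Projection error-based novelty detection with kernel PCA can be sketched as below. The sketch assumes a Gaussian kernel on plain numeric vectors (the thesis works with chemical descriptors and graph kernels, which are not reproduced here); the function names and the `sigma` and `n_components` parameters are illustrative assumptions. The score is the squared distance in feature space between a centered sample and its projection onto the leading principal axes.

```python
import numpy as np

def rbf(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_projection_error(X_train, X_test, n_components, sigma=1.0):
    """Squared feature-space distance of each test sample to the kernel PCA subspace."""
    K = rbf(X_train, X_train, sigma)
    col_mean, total_mean = K.mean(axis=0), K.mean()
    Kc = K - col_mean[None, :] - col_mean[:, None] + total_mean
    eigvals, eigvecs = np.linalg.eigh(Kc)
    top = np.argsort(eigvals)[::-1][:n_components]
    lam, V = eigvals[top], eigvecs[:, top]

    k = rbf(X_test, X_train, sigma)
    row_mean = k.mean(axis=1)
    kc = k - row_mean[:, None] - col_mean[None, :] + total_mean
    proj = kc @ V / np.sqrt(lam)                   # coordinates on the principal axes
    # Squared norm of the centered test sample; k(z, z) = 1 for the Gaussian kernel.
    sq_norm = 1.0 - 2.0 * row_mean + total_mean
    return sq_norm - (proj ** 2).sum(axis=1)

rng = np.random.default_rng(0)
positives = rng.normal(size=(30, 2))               # stand-in for the positive training samples
candidates = np.array([[0.1, -0.2], [100.0, 100.0]])
errors = kpca_projection_error(positives, candidates, n_components=5)
```

A sample far from all training data receives a larger error than the training samples do on average, so thresholding the error gives a one-class decision rule that uses positive samples only.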

    A novel approach for multimodal graph dimensionality reduction

    This thesis deals with the problem of multimodal dimensionality reduction (DR), which arises when the input objects, to be mapped on a low-dimensional space, consist of multiple vectorial representations, instead of a single one. Herein, the problem is addressed in two alternative manners. One is based on the traditional notion of modality fusion, but using a novel approach to determine the fusion weights. In order to optimally fuse the modalities, the known graph embedding DR framework is extended to multiple modalities by considering a weighted sum of the involved affinity matrices. The weights of the sum are automatically calculated by minimizing an introduced notion of inconsistency of the resulting multimodal affinity matrix. The other manner for dealing with the problem is an approach to consider all modalities simultaneously, without fusing them, which has the advantage of minimal information loss due to fusion. In order to avoid fusion, the problem is viewed as a multi-objective optimization problem. The multiple objective functions are defined based on graph representations of the data, so that their individual minimization leads to dimensionality reduction for each modality separately. The aim is to combine the multiple modalities without the need to assign importance weights to them, or at least postpone such an assignment as a last step. The proposed approaches were experimentally tested in mapping multimedia data on low-dimensional spaces for purposes of visualization, classification and clustering. The no-fusion approach, namely Multi-objective DR, was able to discover mappings revealing the structure of all modalities simultaneously, which cannot be discovered by weight-based fusion methods. However, it results in a set of optimal trade-offs, from which one needs to be selected, which is not trivial. 
The optimal-fusion approach, namely Multimodal Graph Embedding DR, is able to easily extend unimodal DR methods to multiple modalities, but depends on the limitations of the unimodal DR method used. Both the no-fusion and the optimal-fusion approaches were compared to state-of-the-art multimodal dimensionality reduction methods, and the comparison showed performance improvement in visualization, classification and clustering tasks. The proposed approaches were also evaluated for different types of problems and data, in two diverse application fields: a visual-accessibility-enhanced search engine and a visualization tool for mobile network security data. The results verified their applicability in different domains and suggested promising directions for future advancements.
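
The fusion step, a weighted sum of per-modality affinity matrices followed by a graph embedding, can be sketched as follows. The abstract does not specify the inconsistency criterion used to choose the weights, so this sketch takes the weights as an input; the Gaussian affinity, the Laplacian-eigenmaps-style embedding and all names are assumptions made for illustration.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Dense Gaussian affinity matrix for one modality."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fused_affinity(modalities, weights, sigma=1.0):
    """Weighted sum of the per-modality affinity matrices (weights normalized to 1)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * gaussian_affinity(X, sigma) for wi, X in zip(w, modalities))

def graph_embedding(A, n_components=2):
    """Laplacian-eigenmaps-style embedding: solves L y = lam * D y for small lam > 0."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalized Laplacian; its eigenvectors u give y = D^(-1/2) u.
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)
    keep = slice(1, n_components + 1)              # skip the trivial constant solution
    return d_inv_sqrt[:, None] * U[:, keep], eigvals[keep]

rng = np.random.default_rng(0)
visual = rng.normal(size=(15, 3))                  # hypothetical modality 1
textual = rng.normal(size=(15, 5))                 # hypothetical modality 2
A = fused_affinity([visual, textual], weights=[0.6, 0.4])
Y, lam = graph_embedding(A, n_components=2)
```

Any procedure that selects the fusion weights automatically would slot in as the source of the `weights` argument, leaving the embedding step unchanged.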

    Advances in dissimilarity-based data visualisation

    Gisbrecht A. Advances in dissimilarity-based data visualisation. Bielefeld: Universitätsbibliothek Bielefeld; 2015

    Parametric nonlinear dimensionality reduction using kernel t-SNE

    Gisbrecht A, Schulz A, Hammer B. Parametric nonlinear dimensionality reduction using kernel t-SNE. Neurocomputing. 2015;147:71-82