15 research outputs found

    Algebraic Properties of the Block Kronecker Product and a Block Vector-Operator for Matrices over a Commutative Semiring

    Get PDF
    ABSTRACT We extend the notion of the Kronecker product to a block Kronecker product for matrices over a commutative semiring.
It turns out that this matrix product is compatible with matrix addition, scalar multiplication, the usual matrix multiplication, transposition, and the trace. Certain algebraic properties of matrices, such as symmetry, invertibility, similarity, congruence, and diagonalizability, are preserved under the block Kronecker product. In addition, we investigate a relation between this matrix product and a block vector-operator. This relation can be applied to reduce certain linear matrix equations to simple vector-matrix equations.
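
The reduction of linear matrix equations to vector-matrix equations mentioned in the abstract rests on the classical identity vec(AXB) = (Bᵀ ⊗ A) vec(X). A minimal numerical check with NumPy (an illustrative sketch, not the authors' code, and for ordinary real matrices rather than a general semiring):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 5, (2, 3)).astype(float)
X = rng.integers(0, 5, (3, 4)).astype(float)
B = rng.integers(0, 5, (4, 2)).astype(float)

# vec stacks the columns of a matrix into one long vector
vec = lambda M: M.reshape(-1, order="F")

# vec(A X B) == (B^T kron A) vec(X): the matrix equation A X B = C
# becomes the ordinary linear system (B^T kron A) vec(X) = vec(C)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```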

    Visualizing multidimensional data similarities: Improvements and applications

    Get PDF
    Multidimensional data is increasingly prominent and important in many application domains. Such data typically consists of a large set of elements, each of which is described by several measurements (dimensions). When designing techniques and tools to process this data, a key task is to gain insight into its structure and patterns, which can be described by the notion of similarity between elements. Among these techniques, multidimensional projections and similarity trees can effectively capture similarity patterns and handle a large number of data elements and dimensions. However, understanding and interpreting these patterns in terms of the original data dimensions is still hard. This thesis addresses the development of visual explanatory techniques for the easy interpretation of similarity patterns present in multidimensional projections and similarity trees, through several contributions. First, we propose methods that make the computation of similarity trees efficient for large datasets, and also enhance their visual representation to allow the exploration of more data on a limited screen. Secondly, we propose methods for the visual explanation of multidimensional projections in terms of groups of similar elements. These groups are automatically annotated to describe which dimensions are most important in defining their notion of similarity. We next show how these explanatory mechanisms can be adapted to handle both static and time-dependent data. Our proposed techniques are designed to be easy to use, work nearly automatically, and are demonstrated on a variety of large real-world datasets obtained from image collections, text archives, scientific measurements, and software engineering.
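
As a hedged illustration of what a multidimensional projection does (classical Torgerson MDS, a textbook baseline rather than any of the thesis's own techniques), the sketch below maps points described by many dimensions to 2D while approximately preserving their pairwise distances:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points in k dimensions from an n x n pairwise distance
    matrix D (classical/Torgerson MDS via eigendecomposition)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # pick the top-k eigenpairs
    L = np.sqrt(np.maximum(w[idx], 0))    # guard against tiny negatives
    return V[:, idx] * L                  # n x k embedding

# toy data: 10 points in 3D, projected to 2D
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, k=2)
print(Y.shape)  # (10, 2)
```

When the input distances already come from points of dimension at most k, classical MDS reproduces them exactly; for genuinely higher-dimensional data the 2D embedding is only an approximation, which is precisely why the explanatory techniques discussed above are needed.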

    Explanatory visualization of multidimensional projections

    Get PDF

    Advances in dissimilarity-based data visualisation

    Get PDF
    Gisbrecht A. Advances in dissimilarity-based data visualisation. Bielefeld: Universitätsbibliothek Bielefeld; 2015

    An integrated clustering analysis framework for heterogeneous data

    Get PDF
    Big data is a growing area of research with some important challenges that motivate our work. We focus on one such challenge, the variety aspect. First, we introduce our problem by defining heterogeneous data as data about objects that are described by different data types, e.g., structured data, text, time series, images, etc. Throughout our work we use five datasets for experimentation: a real dataset of prostate cancer data and four synthetic datasets that we created and made publicly available. Each dataset covers a different combination of the data types used to describe objects. Our strategy for clustering is based on fusion approaches, and we compare intermediate and late fusion schemes. We propose an intermediate fusion approach, Similarity Matrix Fusion (SMF), where the integration takes place at the level of calculating similarities. SMF produces a single fused distance matrix and two uncertainty expression matrices. We then propose a clustering algorithm, Hk-medoids, a modified version of the standard k-medoids algorithm that utilises uncertainty calculations to improve clustering performance. We evaluate our results by comparing them to clusterings produced using individual elements, and show that the fusion approach produces equal or significantly better results. We also show that there are advantages in utilising the uncertainty information, as Hk-medoids does. In addition, from a theoretical point of view, our proposed Hk-medoids algorithm has lower computational complexity than the popular PAM implementation of the k-medoids algorithm. Finally, we employed late fusion, which aggregates the results of clustering by individual elements by combining cluster labels using an object co-occurrence matrix technique; the final clustering is then derived by a hierarchical clustering algorithm. We show that intermediate fusion for clustering heterogeneous data is a feasible and efficient approach using our proposed Hk-medoids algorithm.
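
The intermediate-fusion idea can be sketched very simply (hypothetical illustration, not the authors' SMF implementation): compute one distance matrix per data type, normalise them to a common scale, fuse by averaging, and keep the per-pair disagreement across types as a crude uncertainty signal:

```python
import numpy as np

def fuse_distance_matrices(matrices):
    """Intermediate fusion: average per-type distance matrices after
    min-max normalisation; the per-pair variance across types serves
    as a rough uncertainty estimate (illustrative only)."""
    norm = []
    for D in matrices:
        d_min, d_max = D.min(), D.max()
        norm.append((D - d_min) / (d_max - d_min + 1e-12))
    stack = np.stack(norm)            # shape: (types, n, n)
    fused = stack.mean(axis=0)        # single fused distance matrix
    uncertainty = stack.var(axis=0)   # disagreement between data types
    return fused, uncertainty

# two hypothetical data types describing the same 4 objects
D_text = np.array([[0, 1, 4, 5], [1, 0, 3, 4],
                   [4, 3, 0, 1], [5, 4, 1, 0]], dtype=float)
D_img  = np.array([[0, 2, 8, 9], [2, 0, 7, 8],
                   [8, 7, 0, 2], [9, 8, 2, 0]], dtype=float)
fused, unc = fuse_distance_matrices([D_text, D_img])
print(fused.shape, unc.shape)  # (4, 4) (4, 4)
```

A downstream clustering algorithm could then, in the spirit of Hk-medoids, down-weight object pairs whose `uncertainty` entries are high; the actual SMF and Hk-medoids procedures are more involved than this averaging sketch.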

    Efficient feature reduction and classification methods

    Get PDF
    The sheer volume of data today and its expected growth over the next years are some of the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high-dimensional nature of data arising in many applications creates the need to develop effective and efficient techniques that are able to deal with this massive amount of data. In addition to the significant increase in the demand for computational resources, such large datasets might also influence the quality of several data mining applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervised learning. 
Dimensionality reduction and feature (subset) selection are two types of techniques for reducing the attribute space. While feature selection extracts a subset of the original attributes, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to select a low-dimensional subset of the attribute space that covers most of the information in the original data. Over the last years, feature selection and dimensionality reduction techniques have become a real prerequisite for data mining applications. There are several open questions in this research field, and the often increasing number of candidate features in various application areas (e.g., email filtering or drug classification/molecular modeling) raises new ones. In this thesis, we focus on some open research questions in this context, such as the relationship between feature reduction techniques and the resulting classification accuracy, and the relationship between the variability captured in the linear combinations produced by dimensionality reduction techniques (e.g., PCA, SVD) and the accuracy of machine learning algorithms operating on them. Another important goal is to better understand new techniques for dimensionality reduction, such as nonnegative matrix factorization (NMF), which can be applied to find parts-based, linear representations of nonnegative data. This "sum-of-parts" representation is especially useful if the interpretability of the original data should be retained. Moreover, performance aspects of feature reduction algorithms are investigated: as data grow, implementations of feature selection and dimensionality reduction techniques for high-performance parallel and distributed computing environments become more and more important. 
In this thesis, we focus on two types of open research questions: methodological advances without a specific application context, and application-driven advances for a specific application context. Summarizing, the new methodological contributions are the following. The utilization of nonnegative matrix factorization in the context of classification methods is investigated; in particular, it is of interest how the improved interpretability of NMF factors due to the non-negativity constraints (which is of central importance in various problem settings) can be exploited. Motivated by this problem context, two new fast initialization techniques for NMF based on feature selection are introduced. It is shown how approximation accuracy can be increased and/or computational effort reduced compared to standard randomized seeding of the NMF and to state-of-the-art initialization strategies suggested earlier. For example, for a given number of iterations and a required approximation error, a speedup of 3.6 over standard initialization and of 3.4 over state-of-the-art initialization strategies could be achieved. Beyond that, novel classification methods based on NMF are proposed and investigated. We show that they are not only competitive with state-of-the-art classifiers in terms of classification accuracy, but also provide important advantages in terms of computational effort (especially for low-rank approximations). Moreover, parallelization and distributed execution of NMF is investigated: several algorithmic variants for efficiently computing NMF on multi-core systems are studied and compared to each other, in particular several approaches for exploiting task and/or data parallelism. We show that for some scenarios the new algorithmic variants clearly outperform existing implementations. 
Last but not least, a computationally very efficient adaptation of the implementation of the ALS algorithm in Matlab 2009a is investigated. This variant reduces the runtime significantly (in some settings by a factor of 8) and also offers several possibilities for concurrent execution. In addition to purely methodological questions, we also address questions arising in the adaptation of feature selection and classification methods to two specific application problems: email classification and in silico screening for drug discovery. Different research challenges arise in these application areas, such as the dynamic nature of data for email classification, or the imbalance in the number of available samples per class for drug discovery. Application-driven advances of this thesis comprise the adaptation and application of latent semantic indexing (LSI) to the task of email filtering. Experimental results show that LSI achieves significantly better classification results than the widespread de facto standard method for this application context. In the context of drug discovery, several groups of well-discriminating descriptors could be identified by utilizing the "sum-of-parts" representation of NMF. The number of important descriptors could be further increased by applying sparseness constraints on the NMF factors.
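
The NMF at the centre of this abstract can be sketched in a few lines (illustrative Lee-Seung multiplicative updates with plain random seeding; the thesis's contributions concern better initialization, NMF-based classifiers, and parallel variants, none of which are shown here):

```python
import numpy as np

def nmf(V, r, iters=200, seed=0):
    """Factor a nonnegative matrix V (m x n) as W @ H with W >= 0, H >= 0
    using Lee-Seung multiplicative updates and random initialization."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 1e-3   # strictly positive start
    H = rng.random((r, n)) + 1e-3
    eps = 1e-12                     # avoid division by zero
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H, stays >= 0
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W, stays >= 0
    return W, H

V = np.abs(np.random.default_rng(2).normal(size=(20, 10)))
W, H = nmf(V, r=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(err, 3))  # relative approximation error of the rank-4 factorization
```

Because the updates are multiplicative, W and H remain nonnegative throughout, which is what yields the interpretable "sum-of-parts" representation; better seeding, as proposed in the thesis, mainly reduces how many of these iterations are needed for a given approximation error.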

    Explanatory visualization of multidimensional projections

    Get PDF
    Insight into large data collections (nowadays known as 'big data') can be obtained by depicting them visually and then interactively exploring these visualizations. Yet both the number of data points or measurements, and the number of dimensions describing each measurement, can be very large, like a table with many rows and columns. Visualizing such so-called high-dimensional datasets is very challenging. One way to do this is to create a low-dimensional (two- or three-dimensional) depiction, in which one then searches for interesting data patterns instead of searching for them in the original high-dimensional data. Techniques that support this scenario, so-called projections, have several advantages: they are visually scalable, they are robust to noisy data, and they are fast. Yet the use of projections is severely limited by the fact that they are hard to interpret. We approach this problem by developing several techniques that ease interpretation, such as depicting projection errors and explaining projections in terms of the original high dimensions. Our techniques are easy to learn, fast to compute, and easy to add to any data exploration scenario that uses any projection. We demonstrate our solutions with several applications and data from measurements, scientific simulations, software engineering, and networks

    Multidimensional projections for the visual exploration of multimedia data

    Get PDF
    Multidimensional data analysis is considerably important when dealing with large and complex datasets. Among the possibilities for analyzing such data, applying visualization techniques can help users find and understand patterns and trends, and establish new goals. This thesis presents several visualization methods for interactively exploring multidimensional datasets, aimed at audiences ranging from specialized to casual users, by making use of both static and dynamic representations created by multidimensional projections.