
    Evaluating Clusterings by Estimating Clarity

    In this thesis I examine clustering evaluation, with a particular focus on text clusterings. The principal work of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness. I begin by reviewing clustering in general. I then review current clustering quality measures, accompanied by an in-depth discussion of many of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that expose problems with standard clustering evaluation practices. I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sound and empirically effective. I present a generalization of informativeness that leverages external clustering quality measures. I also demonstrate its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings that lead to superior spam filters when few true labels are available. I conclude the thesis with a discussion of clustering evaluation in general, of informativeness, and of the directions I believe clustering evaluation research should take in the future.
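    The abstract does not give the formal definition of informativeness, so the following is only a minimal sketch of the stated idea, assuming scikit-learn: score a clustering by how accurately a classifier can recover its cluster assignments from the features on held-out data, using that accuracy as a proxy for human assessment. The function name, classifier choice, and cross-validation setup are illustrative assumptions, not the thesis's definition.

        # Minimal sketch, not the thesis's formal measure: a clustering is
        # "clear" to the extent that a classifier can recover its cluster
        # assignments from the raw features on held-out data.
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        def clarity_score(X, cluster_labels, cv=5):
            """Mean held-out accuracy of predicting cluster labels from features."""
            clf = LogisticRegression(max_iter=1000)  # classifier choice is an assumption
            return cross_val_score(clf, X, cluster_labels, cv=cv).mean()

    In the spam-filtering application described above, such a score could be computed for each candidate clustering and the highest-scoring one kept when true labels are scarce.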

    Vibration Monitoring: Gearbox identification and fault detection

    The abstract is provided in the attachment.

    Discretize and Conquer: Scalable Agglomerative Clustering in Hamming Space

    Clustering is one of the most fundamental tasks in many machine learning and information retrieval applications. Roughly speaking, the goal is to partition data instances such that similar instances end up in the same group while dissimilar instances lie in different groups. Quite surprisingly, though, a formal and rigorous definition of clustering remains elusive, mainly because there is no consensus about what constitutes a cluster. That said, across all disciplines, from mathematics and statistics to genetics, people frequently try to get a first intuition about their data by identifying meaningful groups. Finding similar instances and grouping them are the two main steps in clustering, and not surprisingly, both have been the subject of extensive study over recent decades.

    It has been shown that using large datasets is key to achieving acceptable levels of performance in data-driven applications. Today, the Internet is a vast resource for such datasets, each of which may contain millions or even billions of high-dimensional items such as images and text documents. For such large-scale datasets, however, the performance of the employed machine learning algorithm quickly becomes the main bottleneck. Conventional clustering algorithms are no exception, and a great deal of effort has been devoted to developing scalable clustering algorithms.

    Clustering tasks can vary both in the input they receive and in the output they are expected to generate. For instance, the input of a clustering algorithm can hold various types of data, such as continuous numerical and categorical types. This thesis focuses on a particular setting in which the input instances are represented as binary strings. Binary representation has several advantages, such as storage efficiency, simplicity, the absence of noise in the numerical-data sense, and being naturally normalized. The literature abounds with applications of clustering binary data, for example in marketing, document clustering, and image clustering. As a concrete example, in marketing for an online store, each customer's basket is a binary representation of items; by clustering customers, the store can recommend items to customers with the same interests. In document clustering, documents can be represented as binary codes in which each element indicates whether a word occurs in the document or not. Another notable application of binary codes is binary hashing, which has been the topic of significant research in the last decade. The goal of binary hashing is to encode high-dimensional items, such as images, with compact binary strings so as to preserve a given notion of similarity. Such codes enable extremely fast nearest-neighbour searches, as the distance between two codes (often the Hamming distance) can be computed quickly using bit-wise operations implemented at the hardware level.
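    As a concrete illustration of the hardware-level trick just mentioned, here is a minimal sketch assuming each code is packed into a Python integer (one XOR plus one population count per machine word):

        # Hamming distance between two binary codes packed into integers:
        # XOR flags the differing bit positions, popcount tallies them.
        def hamming(a: int, b: int) -> int:
            return (a ^ b).bit_count()  # int.bit_count() requires Python >= 3.10

        assert hamming(0b1011, 0b0010) == 2  # bits 0 and 3 differ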
    Like other types of data, binary datasets have recently seen considerable clustering research. Unfortunately, most existing approaches are concerned only with devising density- and centroid-based clustering algorithms, even though many other types of clustering techniques can be applied to binary data. One of the most popular and intuitive algorithms in connectivity-based clustering is Hierarchical Agglomerative Clustering (HAC), which builds on the core idea that objects are more related to nearby objects than to objects farther away. As the name suggests, HAC is a family of clustering methods that return a dendrogram as their output: a hierarchical tree of domain subsets, with singleton instances at the leaves and the entire dataset at the root. Such algorithms need no prior knowledge of the number of clusters. Most are deterministic and applicable to different cluster shapes, but these advantages come at the price of high computational and storage costs compared with other popular clustering algorithms such as k-means.

    In this thesis, a family of HAC algorithms called Discretized Agglomerative Clustering (DAC) is proposed, designed to work with binary data. By leveraging the discretized and bounded nature of binary representations, the proposed algorithms achieve significant speedup factors both in theory and in practice compared with existing solutions. From a theoretical perspective, DAC algorithms reduce the computational cost of hierarchical clustering from cubic to quadratic, matching the known lower bounds for HAC. The proposed approach is also empirically compared with other well-known clustering algorithms, such as k-means, DBSCAN, and average- and complete-linkage HAC, on standard large-scale benchmarks such as TEXMEX, CIFAR-10, and MNIST. Results indicate that by mapping real-valued points to binary vectors using existing binary hashing algorithms and clustering them with DAC, one can achieve speedups of several orders of magnitude without losing much clustering quality, and in some cases even improve it.
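    The DAC algorithms themselves are specified in the thesis rather than in this abstract; the sketch below shows only the conventional average-linkage HAC baseline on binary codes that such methods are compared against, using SciPy. The dataset and cluster count are illustrative placeholders.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import pdist

        # Conventional HAC baseline on binary vectors: quadratic-size pairwise
        # distances plus the standard linkage step -- the cost DAC improves on.
        rng = np.random.default_rng(0)
        X = rng.integers(0, 2, size=(500, 64), dtype=np.uint8)  # 500 codes, 64 bits

        d = pdist(X, metric="hamming")                    # condensed distance matrix
        Z = linkage(d, method="average")                  # dendrogram, the HAC output
        labels = fcluster(Z, t=10, criterion="maxclust")  # cut into 10 flat clusters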

    Annales Mathematicae et Informaticae 2021


    Representation and Transfer Learning for Machine Learning Applications

    Machine learning deals with learning models from sample data. Its combination with neural networks is commonly referred to as deep learning and has led to a paradigm shift in almost all areas of science. Deep learning is being used for a variety of tasks, including face recognition, protein-folding prediction, medical diagnostics, and even the creation of original art. These application scenarios, as well as a sizable portion of practical data sources such as audio, video, or photos, are high-dimensional. Directly forwarding such data to linear models usually leads to poor results due to the curse of dimensionality. Feature engineering has therefore long been used for effective processing: a suitable set of features is extracted manually based on domain knowledge. This process is time-consuming and costly. In contrast, neural networks can process high-dimensional data directly. Features are extracted automatically across multiple network layers and become more specific as they are successively combined. The activations of a layer can then be understood as a representation of the input. The question of how a network must be trained so that it extracts good representations is addressed by the field of representation learning. Transfer learning builds on this by addressing the transfer of learned representations to downstream tasks, i.e. how the knowledge of pretrained networks can be exploited.

    This thesis is concerned with representation and transfer learning for machine learning applications, with special attention to the processing of acoustic signals. To this end, we first present new algorithms and network architectures for primate vocalization classification and acoustic anomaly detection that outperform previous architectures in accuracy. Then, the suitability of transfer learning for acoustic anomaly detection is examined in more detail. It is shown that transfer learning can increase the performance of anomaly detection and that pretrained networks from a wide variety of domains, such as music or image processing, are suitable for this purpose. Finally, we address new approaches to representation learning for further application scenarios: discrete communication in multi-agent systems via clustering of the agents' internal representations, and learning representations of soccer teams. In both cases, the presented algorithms are shown to be superior to comparable approaches.
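    The abstract names the recipe (reusing networks pretrained on other domains as feature extractors for acoustic anomaly detection) but not the concrete models. A minimal sketch follows, assuming a torchvision ResNet-18 backbone, log-mel spectrograms rendered as images, and a k-nearest-neighbour anomaly score; none of these choices is necessarily the thesis's.

        # Hedged sketch of cross-domain transfer learning for acoustic anomaly
        # detection: an image-pretrained backbone embeds spectrograms, and the
        # distance to the nearest "normal" embeddings is the anomaly score.
        import torch
        import torchvision.models as models
        from sklearn.neighbors import NearestNeighbors

        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = torch.nn.Identity()  # drop the classifier head, keep embeddings
        backbone.eval()

        @torch.no_grad()
        def embed(x):  # x: (N, 3, 224, 224) spectrogram "images"
            return backbone(x).numpy()

        # Placeholder tensors; in practice these hold spectrograms of real audio.
        train_normal = torch.randn(32, 3, 224, 224)
        test_clips = torch.randn(8, 3, 224, 224)

        knn = NearestNeighbors(n_neighbors=5).fit(embed(train_normal))
        scores = knn.kneighbors(embed(test_clips))[0].mean(axis=1)  # higher = more anomalous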

    Dynamics in Logistics

    This open access book highlights the interdisciplinary aspects of logistics research. Featuring empirical, methodological, and practice-oriented articles, it addresses the modelling, planning, optimization, and control of processes. Chiefly focusing on supply chains, logistics networks, production systems, and systems and facilities for material flows, the contributions combine research on classical supply chain management, digitalized business processes, production engineering, electrical engineering, computer science, and mathematical optimization. To celebrate 25 years of interdisciplinary and collaborative research at the Bremen Research Cluster for Dynamics in Logistics (LogDynamics), hand-picked experts currently or formerly affiliated with the Cluster provide retrospectives, present cutting-edge research, and outline future research directions.
