Evaluating Clusterings by Estimating Clarity
In this thesis I examine clustering evaluation, with a particular focus on text clustering. The principal work of this thesis is the development, analysis, and testing of a new internal clustering quality measure called informativeness.
I begin by reviewing clustering in general. I then review current clustering quality measures, together with an in-depth discussion of the important properties one needs to understand about such measures. This is followed by extensive document clustering experiments that expose problems with standard clustering evaluation practices.
I then develop informativeness, my new internal clustering quality measure for estimating the clarity of clusterings. I show that informativeness, which uses classification accuracy as a proxy for human assessment of clusterings, is both theoretically sensible and empirically effective. I present a generalization of informativeness that leverages external clustering quality measures. I also show its use in a realistic application: email spam filtering. I show that informativeness can be used to select clusterings that lead to superior spam filters when few true labels are available.
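The classification-accuracy proxy can be illustrated with a small sketch (a hypothetical reconstruction using scikit-learn, not the thesis's exact formulation): cluster the data, then measure how well a held-out classifier recovers the cluster assignments from the features. A clear clustering should be easy to learn; an arbitrary labeling should not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def informativeness_proxy(X, labels, cv=5):
    """Clarity of a clustering, estimated as the held-out accuracy of a
    classifier trained to predict cluster assignments from the features.
    (Illustrative stand-in for the thesis's informativeness measure.)"""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=cv).mean()

# A clustering that follows real structure scores higher than random labels.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
clear = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
noisy = np.random.RandomState(0).randint(0, 3, size=len(X))

print(informativeness_proxy(X, clear) > informativeness_proxy(X, noisy))  # True
```

The key property, as in the thesis's motivation, is that no true labels are needed: the classifier is trained against the clustering's own assignments.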
I conclude this thesis with a discussion of clustering evaluation in general, informativeness, and the directions I believe clustering evaluation research should take in the future.
Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs that are dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate tasks relevant to program classification. In this thesis, we discuss our current approaches to identifying programs having similar behavioral features from multiple perspectives.
We first discuss how to detect programs having similar functionality. While determining a program's functionality precisely is undecidable, we use the inputs and outputs (I/Os) of programs as a proxy for it. We then use program I/Os as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has previously been studied and developed for the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some problems natural to object-oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically, and then applies a novel similarity model considering both the inputs and outputs of programs to detect functionally similar programs. We develop the tool HitoshiIO based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
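The I/O-based view of functional similarity can be sketched in a few lines (a toy illustration only; HitoshiIO's input driving and similarity model are far richer): two functions are scored by the fraction of a shared input pool on which they agree.

```python
def io_similarity(f, g, inputs):
    """Fraction of shared inputs on which two functions produce the same
    output -- a toy proxy for I/O-based functional similarity."""
    def safe(fn, x):
        try:
            return fn(x)
        except Exception:
            return ("<error>",)  # treat a crash as a distinct output
    agree = sum(safe(f, x) == safe(g, x) for x in inputs)
    return agree / len(inputs)

# Two syntactically different implementations of summation...
def sum_loop(xs):
    total = 0
    for v in xs:
        total += v
    return total

inputs = [[], [1, 2, 3], [5], [-1, 1], list(range(10))]
print(io_similarity(sum_loop, sum, inputs))  # 1.0 (functionally similar)
print(io_similarity(sum_loop, len, inputs))  # 0.2 (agree only on [])
```

Real inputs drawn from actual executions (the "in-vivo" part) replace the hand-written pool used here.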
In addition to the functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, precisely characterizing a program's execution behavior is undecidable, so we use the instructions executed at run-time as a behavioral feature. We create DyCLINK, which observes program executions and encodes them as dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction, and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions then reduces to inexact graph isomorphism. We propose a link-analysis-based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K-Nearest-Neighbor (KNN) based program classification experiment, DyCLINK achieves over 90% precision.
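The link-analysis idea can be sketched as follows (an illustrative reconstruction with hypothetical graphs and opcode names, not LinkSub itself): score each executed instruction with a PageRank-style importance, collapse the scores into a fixed-length vector keyed by opcode, and compare executions by cosine similarity instead of solving inexact graph isomorphism directly.

```python
import math

def pagerank(graph, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [successors]},
    standing in for a link-analysis importance score per instruction."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for v, succs in graph.items():
            if succs:
                share = d * rank[v] / len(succs)
                for u in succs:
                    new[u] += share
            else:  # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

def graph_vector(graph, opcodes, vocab):
    """Collapse per-instruction importance into one slot per opcode,
    giving every execution a fixed-length, directly comparable vector."""
    rank = pagerank(graph)
    vec = [0.0] * len(vocab)
    for node, score in rank.items():
        vec[vocab.index(opcodes[node])] += score
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "dynamic instruction graphs": vertices are executed instructions,
# edges are dependencies between them (all names hypothetical).
vocab = ["load", "add", "store"]
g1 = {0: [1], 1: [2], 2: []}
ops1 = {0: "load", 1: "add", 2: "store"}
g2 = {0: [1], 1: [2], 2: [], 3: [1]}  # same core, one extra load
ops2 = {0: "load", 1: "add", 2: "store", 3: "load"}

sim = cosine(graph_vector(g1, ops1, vocab), graph_vector(g2, ops2, vocab))
print(round(sim, 3))  # close to 1.0: near-identical executions
```

Vectorizing first makes the comparison linear in vector length, sidestepping the combinatorial cost of subgraph matching.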
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they are better able to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from common problems of dynamic analysis, such as input generation and run-time overhead, which may make our approaches challenging to scale. Thus, we create the system Macneto, which integrates static analysis with topic modeling and deep learning to approximate program behaviors from their binaries without actually executing the programs. In our deobfuscation experiments, considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves over 90% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions provide. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace and collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques we invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
Vibration Monitoring: Gearbox identification and fault detection
The abstract is provided in the attachment.
Discretize and Conquer: Scalable Agglomerative Clustering in Hamming Space
Clustering is one of the most fundamental tasks in many machine learning and information retrieval applications. Roughly speaking, the goal is to partition data instances such that similar instances end up in the same group while dissimilar instances lie in different groups. Quite surprisingly, though, there is no formal and rigorous definition of clustering, mainly because there is no consensus about what constitutes a cluster. That said, across all disciplines, from mathematics and statistics to genetics, people frequently try to get a first intuition about their data by identifying meaningful groups. Finding similar instances and grouping them are the two main steps in clustering, and, not surprisingly, both have been the subject of extensive study over recent decades.
It has been shown that using large datasets is key to achieving acceptable levels of performance in data-driven applications. Today, the Internet is a vast resource of such datasets, each of which may contain millions or even billions of high-dimensional items such as images and text documents. At this scale, however, the performance of the employed machine-learning algorithm quickly becomes the main bottleneck. Conventional clustering algorithms are no exception, and a great deal of effort has been devoted to developing scalable clustering algorithms.
Clustering tasks can vary both in the input they receive and in the output they are expected to generate. For instance, the input to a clustering algorithm can hold various types of data, such as continuous numerical and categorical types. This thesis focuses on a particular setting in which the input instances are represented as binary strings. Binary representation has several advantages, such as storage efficiency, simplicity, the absence of a numerical-data-like concept of noise, and being naturally normalized.
The literature abounds with applications of clustering binary data, such as in marketing, document clustering, and image clustering. As a more concrete example, in marketing for an online store, each customer's basket can be represented as a binary vector over the store's items. By clustering customers, the store can recommend items to customers with the same interests. In document clustering, documents can be represented as binary codes in which each element indicates whether a word occurs in the document or not. Another notable application of binary codes is binary hashing, which has been the topic of significant research in the last decade. The goal of binary hashing is to encode high-dimensional items, such as images, with compact binary strings so as to preserve a given notion of similarity. Such codes enable extremely fast nearest-neighbour searches, as the distance between two codes (often the Hamming distance) can be computed quickly using bit-wise operations implemented at the hardware level.
Like other types of data, binary datasets have seen considerable clustering research recently. Unfortunately, most existing approaches are concerned only with devising density- and centroid-based clustering algorithms, even though many other types of clustering techniques can be applied to binary data. One of the most popular and intuitive algorithms in connectivity-based clustering is Hierarchical Agglomerative Clustering (HAC), which is based on the core idea that objects are more related to nearby objects than to objects farther away. As the name suggests, HAC is a family of clustering methods that return a dendrogram as their output: a hierarchical tree of domain subsets, with singleton instances at its leaves and the entire dataset at its root. Such algorithms need no prior knowledge about the number of clusters. Most of them are deterministic and applicable to different cluster shapes, but these advantages come at the price of high computational and storage costs in comparison with other popular clustering algorithms such as k-means.
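A minimal HAC run on binary data, using SciPy's standard implementation (for illustration of the classical algorithm and its dendrogram output, not the DAC variant proposed in this thesis):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy binary dataset: two obvious groups in Hamming space.
X = np.array([[0, 0, 0, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 1, 1, 1]])

# Average-linkage HAC on pairwise Hamming distances; Z encodes the
# dendrogram: each row merges two subtrees until one root covers all.
Z = linkage(pdist(X, metric="hamming"), method="average")

# The tree can be cut afterwards to obtain flat clusters, so the number
# of clusters never has to be fixed before running the algorithm.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # first three instances share one label, last three another
```

The quadratic number of pairwise distances computed by `pdist` hints at why naive HAC scales poorly, which motivates the discretized approach below.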
In this thesis, a family of HAC algorithms called Discretized Agglomerative Clustering (DAC) is proposed, designed to work with binary data. By leveraging the discretized and bounded nature of binary representation, the proposed algorithms achieve significant speedups both in theory and in practice compared to existing solutions. From the theoretical perspective, DAC algorithms reduce the computational cost of hierarchical clustering from cubic to quadratic, matching the known lower bounds for HAC. The proposed approach is also empirically compared with other well-known clustering algorithms, such as k-means, DBSCAN, and average- and complete-linkage HAC, on standard large-scale benchmarks such as TEXMEX, CIFAR-10, and MNIST. Results indicate that by mapping real points to binary vectors using existing binary hashing algorithms and clustering them with DAC, one can achieve speedups of several orders of magnitude without losing much clustering quality, and in some cases even improving it.
Representation and Transfer Learning for Machine Learning Applications
Machine learning deals with learning models from sample data. Its combination with neural networks is commonly referred to as deep learning and has led to a paradigm shift in almost all areas of science. Deep learning is utilized today for a variety of tasks, including medical diagnostics, protein folding prediction, face recognition, and even the creation of original art. However, these application scenarios, as well as a sizable portion of practically relevant data sources such as audio, video, or images, are high-dimensional. Directly feeding such data into linear models usually leads to poor results due to the curse of dimensionality. For a long time, feature engineering was the standard remedy: a suitable set of features is extracted manually based on domain knowledge. This process is time-consuming and costly. In contrast, neural networks can process high-dimensional data directly. Features are extracted automatically across multiple network layers and become increasingly specific as they are combined. The activations of a layer can then be understood as a representation of the input. The question of how a network must be trained so that it extracts good representations is addressed by the field of representation learning. Transfer learning builds on this and deals with transferring the learned representations to downstream training tasks, i.e. how the knowledge of pretrained networks can be exploited effectively. This thesis is concerned with representation and transfer learning for machine learning applications, with special attention to the processing of acoustic signals.
To this end, we first present new algorithms and network architectures for primate vocalization classification and acoustic anomaly detection that outperform the accuracy of previous architectures. We then examine the suitability of transfer learning for acoustic anomaly detection in more detail, showing that transfer learning can increase anomaly detection performance and that pretrained networks from a wide variety of domains, such as music or image processing, are suitable for this purpose. Finally, we address new approaches to representation learning for further application scenarios, including discrete communication in multi-agent systems via clustering of the agents' internal representations, and learning representations of soccer teams. In both cases, the presented algorithms are shown to be superior to comparable approaches.
Dynamics in Logistics
This open access book highlights the interdisciplinary aspects of logistics research. Featuring empirical, methodological, and practice-oriented articles, it addresses the modelling, planning, optimization and control of processes. Chiefly focusing on supply chains, logistics networks, production systems, and systems and facilities for material flows, the respective contributions combine research on classical supply chain management, digitalized business processes, production engineering, electrical engineering, computer science and mathematical optimization. To celebrate 25 years of interdisciplinary and collaborative research conducted at the Bremen Research Cluster for Dynamics in Logistics (LogDynamics), hand-picked experts currently or formerly affiliated with the Cluster provide retrospectives, present cutting-edge research, and outline future research directions.