
    A census of ρ Oph candidate members from Gaia DR2

    The Ophiuchus cloud complex is one of the best laboratories for studying the early stages of stellar and protoplanetary disc evolution. The wealth of accurate astrometric measurements in the Gaia Data Release 2 can be used to update the census of Ophiuchus member candidates. We seek to find potential new members of Ophiuchus and to identify those surrounded by a circumstellar disc. We constructed a control sample of 188 bona fide Ophiuchus members. Using this sample as a reference, we applied three density-based machine-learning clustering algorithms (DBSCAN, OPTICS, and HDBSCAN) to a sample drawn from the Gaia catalogue centred on the Ophiuchus cloud. The clustering analysis was applied in the five astrometric dimensions defined by the three-dimensional Cartesian space and the proper motions in right ascension and declination. The three clustering algorithms systematically identify a similar set of candidate members in a main cluster whose astrometric properties are consistent with those of the control sample. The increased flexibility of the OPTICS and HDBSCAN algorithms enables these methods to identify a secondary cluster. We constructed a common sample containing 391 member candidates, including 166 new objects that have not yet been discussed in the literature. By combining the Gaia data with 2MASS and WISE photometry, we built the spectral energy distributions from 0.5 to 22 µm for a subset of 48 objects and found a total of 41 discs, including 11 new Class II and 1 new Class III discs. Density-based clustering algorithms are a promising tool for identifying candidate members of star-forming regions in large astrometric databases. If confirmed, the candidate members discussed in this work would represent an increase of roughly 40% over the current census of Ophiuchus. (Comment: Accepted in A&A. Abridged abstract.)
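The clustering step described above can be sketched with scikit-learn's DBSCAN on five standardised features. This is a minimal illustration on synthetic data, not the paper's pipeline: the member/field populations are simulated, and the `eps`/`min_samples` values are placeholders, not the authors' tuned parameters.

```python
# Synthetic sketch of density-based clustering in five astrometric dimensions
# (3D Cartesian position plus the two proper-motion components).
# Data are simulated, not Gaia DR2 sources; eps/min_samples are illustrative.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# A compact group of "members" plus a diffuse field population, 5 features each.
members = rng.normal(loc=0.0, scale=0.1, size=(150, 5))
field = rng.uniform(low=-5.0, high=5.0, size=(300, 5))
X = np.vstack([members, field])

# Standardise so positions and proper motions carry comparable weight.
Xs = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(Xs)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
noise = int(np.sum(labels == -1))
print(n_clusters, noise)
```

With this setup the dense member group is recovered as a single cluster while the diffuse field points are labelled as noise, mirroring how a density-based method separates co-moving members from field stars.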

    Cluster analysis for outlier detection: A case study of applying unsupervised machine learning on diesel engine data

    With the advent of modern data-driven methods, engine manufacturers and maintainers are attempting to pivot from corrective to predictive maintenance. One way to achieve this goal is to install sensors on the engine and look for anomalies in the data patterns they produce. Companies that provide condition-monitoring services, such as Wärtsilä, use the Fast Fourier Transform to manually look for anomalies in the data. The Edge-project is an industrial research project involving institutions such as universities and private companies, with the goal of developing technical solutions and edge analytics for autonomous devices and vessels. Several papers and theses have been written as a result of the project, using techniques such as autoencoders to perform anomaly detection on data produced by sensors on a diesel engine. This thesis explores the use of cluster analysis for anomaly detection on diesel engine data from the Edge-project. The clusters found could potentially represent different states of the running engine, with anomalies represented, e.g., by data points far from cluster centroids, or by data points not belonging to any particular cluster. K-means, DBSCAN, and spectral clustering are used for assigning clusters, with the silhouette coefficient and eigengap used as hyperparameter-tuning heuristics. Distance from cluster centroids and reduced kernel density estimation are used to flag anomalies. t-SNE and Self-Organizing Maps are used as dimensionality-reduction techniques to visualize the data in 3-dimensional and 2-dimensional spaces, respectively. Results show that which data are flagged as anomalies is highly sensitive to the choice of algorithm and hyperparameters. The different methods suggest different data as anomaly candidates. Therefore, further evaluation is needed from subject-matter experts to determine which of the models provides the most interesting results.
    Further work could include building an ensemble model that combines the used approaches, which could flag certain areas of the data space as a high risk for being anomalous.
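The centroid-distance flagging strategy described in the abstract can be sketched in a few lines. This is an illustrative reconstruction on synthetic data, not the thesis's model or Wärtsilä's data: the number of clusters, the quantile threshold, and the simulated "engine states" are all assumptions.

```python
# Minimal sketch of anomaly flagging by distance from K-means centroids:
# fit K-means, then flag points whose distance to their assigned centroid
# exceeds a quantile threshold. All values here are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic "engine states" plus a handful of injected outliers.
state_a = rng.normal(loc=[0, 0], scale=0.5, size=(200, 2))
state_b = rng.normal(loc=[5, 5], scale=0.5, size=(200, 2))
outliers = rng.uniform(low=10, high=12, size=(5, 2))
X = np.vstack([state_a, state_b, outliers])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of each point to its assigned cluster centroid.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag the most distant ~2% of points as anomaly candidates.
threshold = np.quantile(dist, 0.98)
flagged = np.where(dist > threshold)[0]
print(len(flagged))
```

The injected outliers end up far from both centroids and are among the flagged indices; as the abstract notes, the exact set of flagged points depends strongly on the chosen algorithm and threshold.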

    Density-based algorithms for active and anytime clustering

    Data-intensive applications such as biology, medicine, and neuroscience require effective and efficient data mining technologies. Advanced data acquisition methods produce data of constantly increasing volume and complexity. As a consequence, the need for new data mining technologies to deal with complex data has emerged over the last decades. In this thesis, we focus on the data mining task of clustering, in which objects are separated into different groups (clusters) such that objects inside a cluster are more similar to each other than to objects in other clusters. In particular, we consider density-based clustering algorithms and their applications in biomedicine. The core idea of the density-based clustering algorithm DBSCAN is that each object within a cluster must have a certain number of other objects inside its neighborhood. Compared with other clustering algorithms, DBSCAN has many attractive properties: for example, it can detect clusters of arbitrary shape and is robust to outliers. DBSCAN has therefore attracted a great deal of research interest over the last decades, with many extensions and applications. In the first part of this thesis, we aim to develop new algorithms based on the DBSCAN paradigm to deal with the new challenges of complex data, particularly expensive distance measures and incomplete availability of the distance matrix. Like many other clustering algorithms, DBSCAN suffers from poor performance when facing expensive distance measures for complex data. To tackle this problem, we propose a new algorithm based on the DBSCAN paradigm, called Anytime Density-based Clustering (A-DBSCAN), that works in an anytime scheme: in contrast to the original batch scheme of DBSCAN, A-DBSCAN first produces a quick approximation of the clustering result and then continuously refines it as it runs.
    Experts can interrupt the algorithm, examine the results, and choose between (1) stopping the algorithm at any time, whenever they are satisfied with the result, to save runtime, and (2) continuing the algorithm to achieve better results. Such an anytime scheme has proven in the literature to be a very useful technique for time-consuming problems. We also introduce an extended version of A-DBSCAN, called A-DBSCAN-XS, which is more efficient and effective than A-DBSCAN when dealing with expensive distance measures. Since DBSCAN relies on the cardinality of the neighborhood of objects, it requires the full distance matrix. For complex data, these distances are usually expensive, time-consuming, or even impossible to acquire due to high cost, high time complexity, or noisy and missing data. Motivated by these difficulties of acquiring the distances among objects, we propose another DBSCAN-based approach, called Active Density-based Clustering (Act-DBSCAN). Given a budget limitation B, Act-DBSCAN is only allowed to use up to B pairwise distances, ideally producing the same result as if it had the entire distance matrix at hand. The general idea of Act-DBSCAN is to actively select the most promising pairs of objects, calculate the distances between them, and approximate the desired clustering result as closely as possible with each distance calculation. This scheme provides an efficient way to reduce the total cost of performing the clustering, and thus limits the potential weakness of DBSCAN when dealing with the distance-sparseness problem of complex data. As a fundamental data clustering algorithm, density-based clustering has many applications in diverse fields. In the second part of this thesis, we focus on an application of density-based clustering in neuroscience: the segmentation of the white-matter fiber tracts in the human brain acquired from Diffusion Tensor Imaging (DTI).
    We propose a model to evaluate the similarity between two fibers as a combination of structural similarity and connectivity-related similarity of fiber tracts. Various distance measure techniques from fields like time-sequence mining are adapted to calculate the structural similarity of fibers. Density-based clustering is used as the segmentation algorithm. We show how A-DBSCAN and A-DBSCAN-XS are used as novel solutions for the segmentation of massive fiber datasets and provide unique features to assist experts during the fiber segmentation process.
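The core DBSCAN criterion stated in the abstract above (each object in a cluster must have a certain number of neighbors within its neighborhood) can be illustrated directly. This is a minimal didactic sketch of the core-point test only, not the A-DBSCAN or Act-DBSCAN algorithms; the function name and data are made up for the example.

```python
# Minimal illustration of DBSCAN's core-point criterion (not A-DBSCAN itself):
# a point is a "core" point if at least min_samples points lie within
# distance eps of it (counting the point itself).
import numpy as np

def core_points(X, eps, min_samples):
    """Boolean mask marking points with >= min_samples neighbours inside
    an eps-radius ball. O(n^2) pairwise distances; real implementations
    use a spatial index instead."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= eps).sum(axis=1) >= min_samples

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
mask = core_points(X, eps=0.2, min_samples=4)
print(mask.tolist())  # the four close points are core; the isolated one is not
```

The pairwise distance matrix `d` in this sketch is exactly the object whose full computation A-DBSCAN-XS and Act-DBSCAN try to avoid or budget.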

    Exposing and fixing causes of inconsistency and nondeterminism in clustering implementations

    Cluster analysis, also known as clustering, is used in myriad applications, including high-stakes domains, by millions of users. Clustering users should be able to assume that clustering implementations are correct, reliable, and, for a given algorithm, interchangeable. Based on observations in a wide range of real-world clustering implementations, this dissertation challenges the aforementioned assumptions. This dissertation introduces an approach named SmokeOut that uses differential clustering to show that clustering implementations suffer from nondeterminism and inconsistency: on a given input dataset and using a given clustering algorithm, clustering outcomes and accuracy vary widely between (1) successive runs of the same toolkit, i.e., nondeterminism, and (2) different toolkits, i.e., inconsistency. Using a statistical approach, this dissertation quantifies and exposes statistically significant differences across runs and toolkits. It exposes the diverse root causes of nondeterminism and inconsistency, such as default parameter settings, noise insertion, distance metrics, and termination criteria. Based on these findings, it introduces an automatic approach for locating these root causes. This dissertation makes several contributions: (1) quantifying clustering outcomes across different algorithms, toolkits, and multiple runs; (2) using a statistically rigorous approach for testing clustering implementations; (3) exposing the root causes of nondeterminism and inconsistency; and (4) automatically finding those root causes.
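The kind of run-to-run nondeterminism the dissertation measures can be demonstrated in a few lines. This is an illustrative re-implementation of the idea of differential clustering, not the SmokeOut tool itself: the data, the algorithm, and the agreement metric are choices made for this sketch.

```python
# Sketch of a differential-clustering check: run the same algorithm twice
# with different random seeds and measure how much the partitions disagree.
# Illustrative only; not the SmokeOut tool.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))  # unstructured data exaggerates run-to-run variance

labels_run1 = KMeans(n_clusters=5, n_init=1, random_state=0).fit_predict(X)
labels_run2 = KMeans(n_clusters=5, n_init=1, random_state=1).fit_predict(X)

# ARI = 1.0 means identical partitions; lower values expose nondeterminism.
ari = adjusted_rand_score(labels_run1, labels_run2)
print(round(ari, 3))
```

Here the seed plays the role of one of the root causes the dissertation identifies (initialization/noise insertion); the same comparison across two different toolkits would expose inconsistency rather than nondeterminism.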