95 research outputs found

    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements. In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles are subsequently introduced. How to choose algorithms to address common issues as well as methods for pre-clustering data processing, clustering evaluation and validation are then discussed. Importantly, we also provide general guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and librarie

    A Data-Driven Approach for Generating Vortex Shedding Regime Maps for an Oscillating Cylinder

    Recent developments in methods for extracting wind energy from vortex-induced vibration (VIV) have fueled research into vortex shedding behaviour. The vortex shedding map is vital for the consistent use of normalized amplitude and wavelength to validate the predictive power of forced vibration experiments. However, there is a lack of demonstrated methods of generating this map at Reynolds numbers feasible for energy generation due to the high computational cost and complex dynamics. Leveraging data-driven methods addresses the limitations of the traditional experimental vortex shedding map generation, which requires large amounts of data and intensive supervision that is unsuitable for many applications and Reynolds numbers. This thesis presents a data-driven approach for generating vortex shedding maps of a cylinder undergoing forced vibration that requires less data and supervision while accurately extracting the underlying vortex structure patterns. The quantitative analysis in this dissertation requires the univariate time series signatures of local fluid flow measurements in the wake of an oscillating cylinder experiencing forced vibration. The datasets were extracted from a 2-dimensional computational fluid dynamics (CFD) simulation of a cylinder oscillating at various normalized amplitude and wavelength parameters, conducted at two discrete Reynolds numbers of 4000 and 10,000. First, the validity of clustering local flow measurements was demonstrated by proposing a vortex shedding mode classification strategy using two supervised machine learning models, random forest and k-nearest neighbour, which achieved 99.3% and 99.8% classification accuracy, respectively, using the velocity sensors oriented transverse to the predominant flow. Next, the dataset of local flow measurements of the transverse component of velocity was used to develop the procedure of generating vortex shedding maps using unsupervised clustering techniques. 
The clustering task was conducted on subsequences of repeated patterns extracted from the whole time series using the novel matrix profile method. The vortex shedding map was validated by reproducing a benchmark map produced at a low Reynolds number. The method was then extended to a higher Reynolds number case of vortex shedding, where it demonstrated the insight gained into the underlying dynamical regimes of the physical system. The proposed multi-step clustering methods, denoted Hybrid Method B (combining Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Agglomerative algorithms) and Hybrid Method C (combining k-Means and Agglomerative algorithms), demonstrated the ability to extract meaningful clusters from more complex vortex structures that become increasingly indistinguishable. The data-driven methods yield strong performance and versatility, which significantly improves the map generation method while reducing the data input and supervision required.
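
The abstract above does not spell out how the two stages of Hybrid Method C are combined, so the following is only a minimal sketch of one common way to chain k-Means and agglomerative clustering: over-segment the data into several micro-clusters with k-Means, then merge the micro-cluster centroids with single-linkage agglomeration. The function names, the deterministic initialisation, the single-linkage choice and the use of plain 2-D points instead of matrix-profile subsequences are all assumptions made for illustration, not details taken from the thesis.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic, evenly spaced initialisation.

    Assumes 1 <= k <= len(points).
    """
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group happens to be empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    labels = [
        min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
    ]
    return centroids, labels


def agglomerate(centroids, n_clusters):
    """Merge centroids bottom-up with single linkage until n_clusters remain.

    Returns a dict mapping each centroid index to its merged cluster id.
    """
    clusters = [[i] for i in range(len(centroids))]

    def linkage(a, b):
        return min(math.dist(centroids[i], centroids[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda xy: linkage(clusters[xy[0]], clusters[xy[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return {i: cid for cid, members in enumerate(clusters) for i in members}


def hybrid_cluster(points, n_micro, n_final):
    """Two-stage clustering: k-means micro-clusters, then agglomerative merge."""
    centroids, micro_labels = kmeans(points, n_micro)
    merged = agglomerate(centroids, n_final)
    return [merged[m] for m in micro_labels]


if __name__ == "__main__":
    blob_a = [(0.1 * x, 0.1 * y) for x in range(4) for y in range(4)]
    blob_b = [(10 + 0.1 * x, 10 + 0.1 * y) for x in range(4) for y in range(4)]
    print(hybrid_cluster(blob_a + blob_b, n_micro=4, n_final=2))
```

The appeal of a two-stage design like this is that the first stage only has to find compact local groups, while the second stage decides which of those groups belong together, which can help when the final clusters are not well described by a single centroid.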

    State of the Art in Face Recognition

    Notwithstanding the tremendous effort to solve the face recognition problem, it is not yet possible to design a face recognition system with a potential close to human performance. New computer vision and pattern recognition approaches need to be investigated. New knowledge and perspectives from different fields, such as psychology and neuroscience, must also be incorporated into the current field of face recognition to design a robust face recognition system. Indeed, much more effort is required to arrive at a human-like face recognition system. This book attempts to reduce the gap between the previous state of face recognition research and its future state.

    Clustering and its Application in Requirements Engineering

    Large-scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately, many of these complexities can be addressed by clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported by dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements by surveying standard clustering techniques and discussing their application to the requirements clustering process.

    Automated Morphology Analysis of Nanoparticles

    The functional properties of nanoparticles highly depend on the surface morphology of the particles, so precise measurements of a particle's morphology enable reliable characterizing of the nanoparticle's properties. Obtaining the measurements requires image analysis of electron microscopic pictures of nanoparticles. Today's labor-intensive image analysis of electron micrographs of nanoparticles is a significant bottleneck for efficient material characterization. The objective of this dissertation is to develop automated morphology analysis methods. Morphology analysis is comprised of three tasks: separate individual particles from an agglomerate of overlapping nano-objects (image segmentation); infer the particle's missing contours (shape inference); and ultimately, classify the particles by shape based on their complete contours (shape classification). Two approaches are proposed in this dissertation: the divide-and-conquer approach and the convex shape analysis approach. The divide-and-conquer approach solves each task separately, taking less than one minute to complete the required analysis, even for the largest-sized micrograph. However, its separating capability of particle overlaps is limited, meaning that it is able to split only touching particles. The convex shape analysis approach solves shape inference and classification simultaneously for better accuracy, but it requires more computation time, ten minutes for the biggest-sized electron micrograph. However, with a little sacrifice of time efficiency, the second approach achieves far superior separation than the divide-and-conquer approach, and it handles the chain-linked structure of particle overlaps well. The capabilities of the two proposed methods cannot be substituted by generic image processing and bio-imaging methods. 
This is due to the unique features of electron microscopic pictures of nanoparticles, including special particle overlap structures and the large number of particles to be processed. The application of the proposed methods to real electron microscopic pictures showed that the two proposed methods were more capable of extracting the morphology information than the state-of-the-art methods. When nanoparticles do not have many overlaps, the divide-and-conquer approach performed adequately. When nanoparticles have many overlaps, forming chain-linked clusters, the convex shape analysis approach performed much better than the state-of-the-art alternatives in bio-imaging. The author believes that the capabilities of the proposed methods expedite the morphology characterization process of nanoparticles. The author further conjectures that, given their technical generality, the proposed methods could even be a competent alternative to the current methods for analyzing general overlapping convex-shaped objects other than nanoparticles.

    Generalized and efficient outlier detection for spatial, temporal, and high-dimensional data mining

    Knowledge Discovery in Databases (KDD) is the process of extracting non-trivial patterns from large databases, with the focus on extracting novel, potentially useful, statistically valid and understandable patterns. The process involves multiple phases including selection, preprocessing, evaluation and the analysis step, which is known as Data Mining. One of the key techniques of Data Mining is outlier detection, that is, the identification of observations that are unusual and seemingly inconsistent with the majority of the data set. Such rare observations can have various causes: they can be measurement errors, unusually extreme (but valid) measurements, data corruption or even manipulated data. In recent years, various outlier detection algorithms have been proposed that often appear to be only slightly different from their predecessors but "clearly outperform" the others in the experiments. 
A key focus of this thesis is to unify and modularize the various approaches into a common formalism to make the analysis of the actual differences easier, but at the same time increase the flexibility of the approaches by allowing the addition and replacement of modules to adapt the methods to different requirements and data types. To show the benefits of the modularized structure, (i) several existing algorithms are formalized within the new framework; (ii) new modules are added that improve the robustness, efficiency, statistical validity and score usability and that can be combined with existing methods; (iii) modules are modified to allow existing and new algorithms to run on other, often more complex data types including spatial, temporal and high-dimensional data spaces; (iv) the combination of multiple algorithm instances into an ensemble method is discussed; (v) the scalability to large data sets is improved using approximate as well as exact indexing. The starting point is the Local Outlier Factor (LOF) algorithm, which is extended with slight modifications to increase robustness and the usability of the produced scores. In order to get the same benefits for other methods, these methods are abstracted to a general framework for local outlier detection. By abstracting from a single vector space, other data types that involve spatial and temporal relationships can be analyzed. The use of subspace and correlation neighborhoods allows the algorithms to detect new kinds of outliers in arbitrarily oriented subspaces. Improvements in the score normalization bring back a statistical intuition of probabilities to the outlier scores that previously were only useful for ranking objects, while improved models also offer explanations of why an object was considered to be an outlier. 
Subsequently, for different modules found in the framework, improved modules are presented that, for example, allow the same algorithms to run on significantly larger data sets (in approximately linear rather than quadratic complexity) by accepting approximated neighborhoods at little loss in precision and effectiveness. Additionally, multiple algorithms with different intuitions can be run at the same time, and the results combined into an ensemble method that is able to detect outliers of different types. Finally, new outlier detection methods are constructed for real data sets, customized for the specific problems of these data sets. The new methods yield insightful results that could not be obtained with the existing methods. Because they are constructed from the same building blocks, there is a strong and explicit connection to the previous approaches, and by using the indexing strategies introduced earlier, the algorithms can be executed efficiently even on large data sets.
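
Since the abstract takes the Local Outlier Factor (LOF) as its starting point, a small sketch of the basic LOF score may help make the idea concrete. This is a simplified, brute-force illustration and not the thesis's framework: it uses exact neighbour search in quadratic time, takes exactly k neighbours per point (ties at the k-distance are cut by index rather than all included), and assumes the data contain no exact duplicates. The function names and the default `k=3` are illustrative choices.

```python
import math


def k_nearest(points, i, k):
    """Return the k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points))
        if j != i
    )
    return dists[:k]


def local_outlier_factor(points, k=3):
    """Compute a basic LOF score for every point (brute force, O(n^2))."""
    n = len(points)
    neighbours = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(p, o) = max(k_dist(o), dist(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbours' density to the point's own density.
    # Values near 1 indicate an inlier; values well above 1 indicate an outlier.
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    grid = [(x, y) for x in range(3) for y in range(3)]
    scores = local_outlier_factor(grid + [(10, 10)], k=3)
    for point, score in zip(grid + [(10, 10)], scores):
        print(point, round(score, 2))
```

In this toy data set the isolated point at (10, 10) receives by far the largest score, while the grid points stay close to 1. The raw score is only meaningful for ranking, which is the limitation the abstract's score normalization is meant to address.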

    Semi-Supervised Learning Vector Quantization method enhanced with regularization for anomaly detection in air conditioning time-series data

    Researchers of semi-supervised learning methods have been developing the family of Learning Vector Quantization (LVQ) models, which originated from the well-known Self-Organizing Map algorithm. Models of this type can be characterized as prototype-based, self-explanatory and flexible. The thesis contributes to the development of one of the LVQ models, the Semi-Supervised Relational Prototype Classifier for dissimilarity data. The model implementation is developed based on related research work and the thesis author's findings, and applied to the task of anomaly detection in real-time air conditioning data. We propose a regularization algorithm for gradient descent in order to achieve better convergence, and a new strategy for initializing prototypes. We develop an innovative framework involving a human expert as a source of labeled data. The framework detects anomalies of environment parameters in both real-time and long-run observations and updates the model according to its findings. The data set used for experiments is collected in real time from sensors installed inside the Aalto Mechanical Engineering building located at Otakaari 4, Espoo. Installation was done as a part of a project of VTT and Korean National Research Institute. The data consists of 3 main parameters: air temperature, humidity and CO2 concentration. The total number of deployed sensors is around 150. One month of recorded observations contains approximately 1.5M data points. The results of the project demonstrate the efficiency of the developed regularized LVQ method for classification in the given settings. The regularized version generally outperforms its parent and various baseline methods on air conditioning, synthetic and UCI data. Together with the proposed classification framework, the system has shown its robustness and efficiency and is ready for deployment to a production environment.

    Hybrid models for combination of visual and textual features in context-based image retrieval.

    Visual Information Retrieval poses a challenge to intelligent information search systems. This is due to the semantic gap, the difference between human perception (information needs) and the machine representation of multimedia objects. Most existing image retrieval systems are monomodal, as they utilize only visual or only textual information about images. The semantic gap can be reduced by improving existing visual representations, making them suitable for large-scale generic image retrieval. The best current candidates for large-scale Content-based Image Retrieval are models based on the Bag of Visual Words framework. Existing approaches, however, produce high-dimensional representations that are expensive in terms of data storage and computation. Because the standard Bag of Visual Words framework disregards the relationships between the histogram bins, the model can be further enhanced by exploiting the correlations between the visual words. Even the improved visual features will find it hard to capture the abstract semantic meaning of some queries, e.g. "straight road in the USA". Textual features, on the other hand, would struggle with such queries as "church with more than two towers", as in many cases the information about the number of towers would be missing. Thus, both visual and textual features represent complementary yet correlated aspects of the same information object, an image. Existing hybrid approaches for the combination of visual and textual features do not take these inherent relationships into account, and thus the combination's performance improvement is limited. Visual and textual features can also be combined in the context of relevance feedback. The relevance feedback can help us narrow down and correct the search. The feedback mechanism would produce subsets of visual query and feedback representations as well as subsets of textual query and textual feedback representations. 
A meaningful feature combination in the context of relevance feedback should take the inherent inter (visual-textual) and intra (visual-visual, textual-textual) relationships into account. In this work, we propose a principled framework for semantic gap reduction in large-scale generic image retrieval. The proposed framework comprises the development and enhancement of novel visual features, a hybrid model for the combination of visual and textual features, and a hybrid model for the combination of features in the context of relevance feedback, with both fixed and adaptive weighting schemes (importance of a query and its context). Apart from the experimental evaluation of our models, theoretical validations of some interesting discoveries on feature fusion strategies were also performed. The proposed models were incorporated into our prototype system with an interactive user interface.