
    Powellsnakes II: a fast Bayesian approach to discrete object detection in multi-frequency astronomical data sets

    Powellsnakes (PwS) is a Bayesian algorithm for detecting compact objects embedded in a diffuse background. It was selected and successfully employed by the Planck consortium in the production of its first public deliverable, the Early Release Compact Source Catalogue (ERCSC). We present the critical foundations and main directions of further development of PwS, which extend it in terms of formal correctness and the optimal use of all available information in a consistent, unified framework, in which no distinction is made between point sources (unresolved objects) and SZ clusters, or between single- and multi-channel detection. An emphasis is placed on the necessity of a multi-frequency, multi-model detection algorithm in order to achieve optimality.
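    The following is a minimal, self-contained sketch of the core idea behind this family of detectors, not of PwS itself: for a compact source with a known beam profile in Gaussian noise, detection reduces to comparing the "source present" and "background only" models via a log-likelihood ratio. The names and the single-channel simplification are ours.

```python
# A hedged sketch, not the PwS implementation: generalized likelihood-ratio
# detection of a compact source with a known beam profile in white Gaussian
# noise. For data D = A*b + n, the ML amplitude is A_hat = (b.D)/(b.b) and
# the log-likelihood ratio is A_hat^2 (b.b) / (2 sigma^2).
import numpy as np

def log_likelihood_ratio(data, beam, sigma):
    """ln p(D | source) - ln p(D | background only), at one trial position."""
    a_hat = beam @ data / (beam @ beam)       # ML source amplitude
    return 0.5 * a_hat**2 * (beam @ beam) / sigma**2

rng = np.random.default_rng(0)
x = np.arange(-16, 17)
beam = np.exp(-0.5 * (x / 3.0) ** 2)          # Gaussian beam profile
noise = rng.normal(0.0, 1.0, x.size)
print(log_likelihood_ratio(2.0 * beam + noise, beam, 1.0))  # source present: large
print(log_likelihood_ratio(noise, beam, 1.0))               # background only: small
```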

    Indexing and knowledge discovery of Gaussian mixture models and multiple-instance learning

    Due to the increasing quantity and variety of generated and stored data, manual and automatic analysis becomes an increasingly challenging task in many modern applications, such as biometric identification and content-based image retrieval. In this thesis, we consider two very typical, related inherent structures of objects: Multiple-Instance (MI) objects and Gaussian Mixture Models (GMM). In both approaches, each object is represented by a set. For MI, each object is a set of vectors from a multi-dimensional space. For GMM, each object is a set of multivariate Gaussian distribution functions, providing the ability to approximate arbitrary distributions in a concise way. Both approaches are powerful and natural, as they allow us to express (1) that an object is additively composed of several components, or (2) that an object may have several different, alternative kinds of behavior. Thus we can model, e.g., an image which may depict a set of different things (1). Likewise, we can model a sports player who has performed differently at different games (2). We can use GMM to approximate MI objects and vice versa. Both directions of approximation can be appealing, because GMM are more concise, whereas for MI objects the single components are less complex.

    A similarity measure quantifies the similarity between two objects to assess how much alike they are. On this basis, indexing and similarity search play essential roles in data mining, providing efficient and often indispensable support for a variety of algorithms such as classification and clustering. This thesis aims to solve challenges in the indexing and knowledge discovery of complex data represented as MI objects and GMM.

    For the indexing of GMM, several techniques are available, including universal index structures and GMM-specific methods. However, the well-known approaches either suffer from poor performance or have too many limitations. To exploit the parameterized form of GMM and to tackle the problem of components of potentially unequal length, we propose the Gaussian Components based Index (GCI) for efficient queries on GMM. GCI decomposes GMM into their components and stores the n-lets of Gaussian combinations, which have parameter vectors of uniform length, in traditional index structures. We introduce an efficient pruning strategy that filters out unqualified GMM using the so-called Matching Probability (MP) as the similarity measure; MP integrates the joint density of two objects over the whole space. GCI achieves better performance than its competitors on both synthetic and real-world data. To further increase its efficiency, we propose a strategy that stores GMM components in a normalized form, which improves the ability to filter out unqualified GMM. Based on this normalizing transformation, we derive a set of novel similarity measures for GMM. Since MP is not a metric (i.e., a symmetric, positive definite distance function that guarantees the triangle inequality), which would be essential for the application of various analysis techniques, we introduce the Infinite Euclidean Distance (IED) for probability distribution functions, a metric with a closed-form expression for GMM. IED allows us to store GMM in well-known metric trees such as the Vantage-Point tree or the M-tree, which facilitate similarity search in sublinear time by exploiting the triangle inequality. Moreover, analysis techniques that require the properties of a metric (e.g., Multidimensional Scaling) can be applied to GMM with IED.
    For MI objects that are not well approximated by GMM, we introduce the potential densities of instances as a representation of MI objects. Based on these, two joint-Gaussian measures are proposed for MI objects, and we extend GCI to MI objects for efficient queries as well. To sum up, we propose in this thesis a number of novel similarity measures and indexing techniques for GMM and MI objects, enabling efficient queries and knowledge discovery on complex data. In a thorough theoretical analysis as well as extensive experiments, we demonstrate the superiority of our approaches over the state-of-the-art with respect to run-time efficiency and the quality of the results.
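    As a concrete illustration, the Matching Probability between two GMM has a closed form, since the integral of a product of two Gaussians is itself a Gaussian evaluated at the difference of the means. The sketch below is our own rendering of that formula, not the thesis code; GCI adds decomposition, indexing and pruning on top of it.

```python
# Our rendering of the closed-form Matching Probability (MP) between two
# Gaussian mixtures p and q:
#   MP(p, q) = integral p(x) q(x) dx = sum_ij w_i v_j N(mu_i; nu_j, S_i + T_j)
import numpy as np
from scipy.stats import multivariate_normal

def matching_probability(wp, mus_p, covs_p, wq, mus_q, covs_q):
    total = 0.0
    for w, mu, s in zip(wp, mus_p, covs_p):
        for v, nu, t in zip(wq, mus_q, covs_q):
            total += w * v * multivariate_normal.pdf(mu, mean=nu, cov=s + t)
    return total

# Two toy 2-D mixtures with two components each
wp, mus_p, covs_p = [0.5, 0.5], [np.zeros(2), np.ones(2)], [np.eye(2)] * 2
wq, mus_q, covs_q = [0.3, 0.7], [np.ones(2), 2.0 * np.ones(2)], [np.eye(2)] * 2
print(matching_probability(wp, mus_p, covs_p, wq, mus_q, covs_q))
```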

    Applying Machine Learning to Root Cause Analysis in Agile CI/CD Software Testing Environments

    This thesis evaluates machine learning classification and clustering algorithms with the aim of automating the root cause analysis of failed tests in agile software testing environments. The inefficiency, in terms of time and human resources, of manually categorizing root causes motivates this work. The development and testing environments of an agile team at Ericsson Finland serve as the framework for this work. The author extracts relevant features from the raw log data after interviewing the team's testing engineers (human experts). Initial efforts focus on clustering the unlabeled data; despite qualitative correlations between several clusters and failure root causes, the vagueness of the remaining clusters leads to the consideration of labeling. A new round of interviews with the testing engineers then results in the conceptualization of ground-truth categories for the test failures, with which the human experts label the dataset. A collection of artificial neural networks that either classify the data or pre-process it for clustering is then optimized. The best solution is a classification multilayer perceptron that assigns the correct failure category to new examples 88.9% of the time, on average. The primary outcome of this thesis is a methodology for the extraction of expert knowledge and its adaptation to machine learning techniques for test failure root cause analysis using test log data. The proposed methodology constitutes a prototype or baseline approach towards achieving this objective in a corporate environment.
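    A minimal sketch of the kind of final model described above, a multilayer perceptron classifying failure categories from log-derived feature vectors, might look as follows. The features, categories and hyperparameters are illustrative placeholders, not those used in the Ericsson environment.

```python
# Hedged sketch: an MLP root-cause classifier on synthetic stand-in data.
# In the thesis, X would hold features extracted from test logs and y the
# expert-assigned root-cause categories; both are randomly generated here.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 12))      # e.g. counts of error patterns per log
y = rng.integers(0, 4, size=500)    # labelled root-cause categories

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_train), y_train)
print("accuracy:", clf.score(scaler.transform(X_test), y_test))
```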

    Harnessing Deep Learning Techniques for Text Clustering and Document Categorization

    This research paper delves into the realm of deep text clustering algorithms with the aim of enhancing the accuracy of document classification. In recent years, the fusion of deep learning techniques and text clustering has shown promise in extracting meaningful patterns and representations from textual data. This paper provides an in-depth exploration of various deep text clustering methodologies, assessing their efficacy in improving document classification accuracy. It investigates a range of feature representation techniques, from conventional word embeddings to the contextual embeddings furnished by BERT and GPT models. By critically reviewing and comparing these algorithms, we shed light on their strengths, limitations, and potential applications. Through this comprehensive study, we offer insights into the evolving landscape of document analysis and classification driven by deep text clustering algorithms. As an original synthesis of the existing literature, this research serves as a guide for researchers and practitioners harnessing deep learning to improve document classification.
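    A typical embed-then-cluster pipeline of the kind surveyed here can be sketched as follows, assuming the sentence-transformers library and a small pretrained model are available; the paper compares several such representations, of which this shows only one.

```python
# Hedged sketch of one embed-then-cluster pipeline: contextual sentence
# embeddings followed by k-means. Model choice and data are illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "The central bank raised interest rates again.",
    "Inflation figures prompted a rate hike.",
    "The team won the championship final.",
    "A last-minute goal decided the match.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed available locally
embeddings = model.encode(docs)                  # one dense vector per document
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)  # documents on similar topics should share a cluster id
```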

    Biclustering electronic health records to unravel disease presentation patterns

    Master's thesis, Data Science (Ciência de Dados), Universidade de Lisboa, Faculdade de Ciências, 2019.
    Amyotrophic Lateral Sclerosis (ALS) is a heterogeneous neurodegenerative disease with highly variable presentation patterns. Given the heterogeneous nature of ALS patients and targeting a better prognosis, clinicians usually estimate disease progression at diagnosis using the rate of decay computed from the Revised ALS Functional Rating Scale (ALSFRS-R). In this context, the use of Machine Learning models able to unravel the complexity of disease presentation patterns is paramount for disease understanding, targeting improved patient care and longer survival times. Furthermore, explainable models are vital, since clinicians must be able to understand the reasoning behind a given model's result before making a decision that can impact a patient's life.
    Therefore, we aim at unravelling disease presentation patterns by proposing a new Data Mining approach called Discriminative Meta-features Discovery (DMD), which uses a combination of Biclustering, Biclustering-based Classification and Class Association Rule Mining. These patterns (called Meta-features) are composed of discriminative subsets of features together with their values, allowing us to distinguish and characterize subgroups of patients with similar disease presentation patterns. The Electronic Health Record (EHR) data used in this work comes from the JPND ONWebDUALS (ONTology-based Web Database for Understanding Amyotrophic Lateral Sclerosis) dataset, comprised of standardized questionnaire answers regarding risk factors, genetic mutations, clinical features and survival information from a cohort of patients and controls followed by ENCALS (European Network to Cure ALS), a consortium spanning several European countries, including Portugal. In this work the proposed methodology was applied to the ONWebDUALS Portuguese EHR data to find disease presentation patterns that: 1) distinguish ALS patients from their controls and 2) characterize groups of ALS patients with different progression rates (categorized into Slow, Neutral and Fast groups). No clear pattern emerged from the experiments performed for the first task. However, in the second task the patterns found for each of the three progression groups were recognized and validated by ALS expert clinicians as relevant characteristics of slow, neutral and fast progressing patients. These results suggest that our generic Biclustering approach is a promising way to unravel disease presentation patterns and could be applied to similar problems and other diseases.
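    As a rough sketch of the biclustering step at the heart of DMD, one can mine patient-by-feature biclusters with an off-the-shelf algorithm such as scikit-learn's SpectralCoclustering (a stand-in here, not necessarily the method used in the thesis); the full pipeline additionally builds classifiers and class association rules on top of the biclusters. The data below is synthetic.

```python
# Hedged sketch: mining biclusters from a binary patient-by-feature matrix
# with two planted patient/feature blocks. SpectralCoclustering is used as a
# stand-in bicluster miner on synthetic data.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(1)
X = (rng.random((40, 20)) < 0.3).astype(float)  # noisy background
X[:15, :6] = 1.0                                # planted bicluster 1
X[25:, 12:] = 1.0                               # planted bicluster 2

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
for k in range(2):
    rows = np.where(model.rows_[k])[0]          # patients in bicluster k
    cols = np.where(model.columns_[k])[0]       # features in bicluster k
    print(f"bicluster {k}: {rows.size} patients, features {cols.tolist()}")
```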

    HiER 2015: Proceedings of the 9th Hildesheim Evaluation and Retrieval Workshop (Hildesheimer Evaluierungs- und Retrievalworkshop)

    Digitalization is shaping our information environments. Disruptive technologies are entering our everyday lives ever more rapidly and pervasively, changing our information and communication behavior, and information markets are changing with them. The 9th Hildesheim Evaluation and Retrieval Workshop, HiER 2015, addresses the design and evaluation of information systems against the background of this accelerating digitalization. The focus is on the following topics: Digital Humanities, web search and online marketing, information seeking and user-centered development, and e-learning.