317 research outputs found

    Fraction-score: a generalized support measure for weighted and maximal co-location pattern mining

    Co-location patterns, which capture the phenomenon that objects with certain labels are often located in close geographic proximity, are defined based on a support measure that quantifies the prevalence of a pattern candidate in the form of a label set. Existing support measures share the idea of counting the number of instances of a given label set C as its support, where an instance of C is an object set whose objects collectively carry all labels in C and are located close to one another. However, they suffer from various weaknesses: for example, they fail to capture all possible instances or overlook cases in which multiple instances overlap. In this paper, we propose a new measure called Fraction-Score, which counts instances fractionally if they overlap. Fraction-Score captures all possible instances and handles overlapping instances appropriately, so that the supports defined are more meaningful and anti-monotonic. We develop efficient algorithms to solve the co-location pattern mining problem defined with Fraction-Score. Furthermore, to obtain representative patterns, we develop an efficient algorithm for mining the maximal co-location patterns, which are those patterns without proper superset patterns. We conduct extensive experiments using real and synthetic datasets, which verify the superiority of our proposals.

    Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)

    This volume gathers the papers presented at the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), Tampere, Finland, during 21–22 September 2023

    Public Sentiments towards the COVID-19 Pandemic: Insights from the Academic Literature Review and Twitter Analytics

    The recent COVID-19 pandemic has severely impacted nations across the globe. It has not only created economic shocks but also had long-term impacts on the social and psychological behaviors of the public. This can be attributed to the severity of the pandemic and to the preventive and control measures that governments imposed, such as global lockdowns, social distancing, and self-isolation. Previous studies have reported that significant changes in human emotions and behaviors are used to measure public sentiments about certain phenomena (such as the recent pandemic). The present study aims to examine the public's sentiments during the COVID-19 outbreak based on an analytics review of public tweets highlighting changes in emotions. A dataset of 58,320 tweets extracted from Twitter and 61 academic articles was explored to analyze behavioral and emotional changes during previous and current pandemic situations. We chose the RPA – COV (Research Process Approach – COVID-19) approach, which was combined with the LBTA (Literature-Based Thematic Analysis) and the COVTA (COVID-19 Twitter Analytics). The sentiment analysis results were coupled with word-tree analysis and highlighted that the public showed more highly neutral, positive, and mixed emotions than negative ones. The analysis indicated that people may react differently on Twitter than in real-life circumstances. The present study makes a significant contribution towards understanding how the public expresses its sentiments in pandemic situations.

    K-Means and Alternative Clustering Methods in Modern Power Systems

    As power systems evolve by integrating renewable energy sources, distributed generation, and electric vehicles, the complexity of managing these systems increases. With the increase in data accessibility and advancements in computational capabilities, clustering algorithms, including K-means, are becoming essential tools for researchers in analyzing, optimizing, and modernizing power systems. This paper presents a comprehensive review of over 440 articles published through 2022, emphasizing the application of K-means clustering, a widely recognized and frequently used algorithm, along with alternative clustering methods within modern power systems. The main contributions of this study include a bibliometric analysis to understand the historical development and wide-ranging applications of K-means clustering in power systems. This research also thoroughly examines K-means, its variants, potential limitations, and advantages. Furthermore, the study explores alternative clustering algorithms that can complement or substitute K-means. Some prominent examples include K-medoids, Time-series K-means, BIRCH, Bayesian clustering, HDBSCAN, CLIQUE, SPECTRAL, SOMs, TICC, and swarm-based methods, broadening the understanding and applications of clustering methodologies in modern power systems. The paper highlights the wide-ranging applications of these techniques, from load forecasting and fault detection to power quality analysis and system security assessment. Throughout the examination, it was observed that the number of publications employing clustering algorithms within modern power systems is following an exponential upward trend. This emphasizes the necessity for professionals to understand various clustering methods, including their benefits and potential challenges, to incorporate the most suitable ones into their studies.

    Detecting Novel Subtypes of Cancer Using Bayesian Unsupervised Clustering

    Although there have been many advances in screening programs and treatments in recent years that have reduced the mortality rate of cancer, it remains the second leading cause of death worldwide, accounting for almost 10 million deaths in 2020. Identifying and characterising subtypes based on molecular classifications can help identify the aggressiveness of the disease so that the best treatment pathway can be identified and new treatment options developed. This has been exemplified in breast cancer. Latent Process Decomposition (LPD) is a soft clustering technique that has been successfully applied to expression data to discover subtypes, including a poor prognosis subtype called DESNT. The benefit of LPD is that it better models the heterogeneous structure of tumours. The aim of this thesis is to apply LPD to transcriptome data from The Cancer Genome Atlas to detect and characterise subtypes of numerous cancer types and create a resource of the results. This was achieved through the development of Automata, an R package used to automate this methodology. In total I have identified 168 cancer subtypes spanning 28 cancer types. Moreover, I have characterised the features of each subtype, generating a unique encyclopaedic compendium of molecular subtypes of cancer that provides an in-depth source of information for the research community. I have successfully validated my findings by comparing them with known subtypes from breast carcinoma, prostate adenocarcinoma, colorectal adenocarcinoma and lung cancer. Additionally, I have discovered common features that characterise subtypes across cancer types. Finally, I have identified 26 subtypes which have a significant association with outcome, including some that were not picked up by traditional clustering methods. The results presented in this thesis are the foundation for the long-term impact of a more personalised approach to cancer patient care.

    Machine learning methods for genomic high-content screen data analysis applied to deduce organization of endocytic network

    High-content screens are widely used to gain insight into the mechanistic organization of biological systems. Chemical and/or genomic interferences are used to modulate molecular machinery; light microscopy and quantitative image analysis then yield a large number of parameters describing the phenotype. However, extracting functional information from such high-content datasets (e.g. links between cellular processes or functions of unknown genes) remains challenging. This work is devoted to the analysis of a multi-parametric image-based genomic screen of endocytosis, the process whereby cells take up cargoes (signals and nutrients) and distribute them into different subcellular compartments. The complexity of the quantitative endocytic data was approached using different machine learning techniques, namely clustering methods, Bayesian networks, principal and independent component analysis, and artificial neural networks. The main goal of such an analysis is to predict possible modes of action of the screened genes and to find candidate genes that may be involved in a process of interest. The degrees of freedom of the multidimensional phenotypic space were identified using the data distributions, and the high-content data were then deconvolved into separate signals from different cellular modules. Some of those basic signals (phenotypic traits) were straightforward to interpret in terms of known molecular processes; the other components gave insight into interesting directions for further research. The phenotypic profiles of perturbations of individual genes are sparse in the coordinates of the basic signals and therefore intrinsically suggest their functional roles in cellular processes. Because endocytosis is a very fundamental process, it is specifically modulated by a variety of different pathways in the cell; therefore, endocytic phenotyping can be used for the analysis of non-endocytic modules in the cell.
    The proposed approach can also be generalized to the analysis of other high-content screens.
    Contents:
    Objectives
    Chapter 1 Introduction
    1.1 High-content biological data
    1.1.1 Different perturbation types for HCS
    1.1.2 Types of observations in HTS
    1.1.3 Goals and outcomes of MP HTS
    1.1.4 An overview of the classical methods of analysis of biological HT- and HCS data
    1.2 Machine learning for systems biology
    1.2.1 Feature selection
    1.2.2 Unsupervised learning
    1.2.3 Supervised learning
    1.2.4 Artificial neural networks
    1.3 Endocytosis as a system process
    1.3.1 Endocytic compartments and main players
    1.3.2 Relation to other cellular processes
    Chapter 2 Experimental and analytical techniques
    2.1 Experimental methods
    2.1.1 RNA interference
    2.1.2 Quantitative multiparametric image analysis
    2.2 Detailed description of the endocytic HCS dataset
    2.2.1 Basic properties of the endocytic dataset
    2.2.2 Control subset of genes
    2.3 Machine learning methods
    2.3.1 Latent variables models
    2.3.2 Clustering
    2.3.3 Bayesian networks
    2.3.4 Neural networks
    Chapter 3 Results
    3.1 Selection of labeled data for training and validation based on KEGG information about genes pathways
    3.2 Clustering of genes
    3.2.1 Comparison of clustering techniques on control dataset
    3.2.2 Clustering results
    3.3 Independent components as basic phenotypes
    3.3.1 Algorithm for identification of the best number of independent components
    3.3.2 Application of ICA on the full dataset and on separate assays of the screen
    3.3.3 Gene annotation based on revealed phenotypes
    3.3.4 Searching for genes with target function
    3.4 Bayesian network on endocytic parameters
    3.4.1 Prediction of pathway based on parameters values using Naïve Bayesian Classifier
    3.4.2 General Bayesian Networks
    3.5 Neural networks
    3.5.1 Autoencoders as nonlinear ICA
    3.5.2 siRNA sequence motives discovery with deep NN
    3.6 Biological results
    3.6.1 Rab11 ZNF-specific phenotype found by ICA
    3.6.2 Structure of BN revealed dependency between endocytosis and cell adhesion
    Chapter 4 Discussion
    4.1 Machine learning approaches for discovery of phenotypic patterns
    4.1.1 Functional annotation of unknown genes based on phenotypic profiles
    4.1.2 Candidate genes search
    4.2 Adaptation to other HCS data and generalization
    Chapter 5 Outlook and future perspectives
    5.1 Handling sequence-dependent off-target effects with neural networks
    5.2 Transition between machine learning and systems biology models
    Acknowledgements
    References
    Appendix
    A.1 Full list of cellular and endocytic parameters
    A.2 Description of independent components of the full dataset
    A.3 Description of independent components extracted from separate assays of the HC

    Màster universitari en estadística i investigació operativa


    Fortschritte im unüberwachten Lernen und Anwendungsbereiche: Subspace Clustering mit Hintergrundwissen, semantisches Passworterraten und erlernte Indexstrukturen

    Over the past few years, advances in data science, machine learning and, in particular, unsupervised learning have enabled significant progress in many scientific fields and even in everyday life. Unsupervised learning methods are usually successful whenever they can be tailored to specific applications using appropriate requirements based on domain expertise. This dissertation shows how purely theoretical research can lead to circumstances that favor overly optimistic results, and the advantages of application-oriented research based on specific background knowledge. These observations apply to traditional unsupervised learning problems such as clustering, anomaly detection and dimensionality reduction. Therefore, this thesis presents extensions of these classical problems, such as subspace clustering and principal component analysis, as well as several specific applications with relevant interfaces to machine learning. Examples include password guessing using semantic word embeddings and learning spatial index structures using statistical models. In essence, this thesis shows that application-oriented research has many advantages for current and future research.

    Trading Indistinguishability-based Privacy and Utility of Complex Data

    The collection and processing of complex data, like structured data or infinite streams, facilitates novel applications. At the same time, it raises privacy requirements on the part of the data owners. Consequently, data administrators use privacy-enhancing technologies (PETs) to sanitize the data; these are frequently based on indistinguishability-based privacy definitions. In engineering PETs, a well-known challenge is the privacy-utility trade-off. Although the literature is aware of a couple of trade-offs, there are still combinations of involved entities, privacy definition, type of data and application for which valuable trade-offs are missing. In this thesis, for two important groups of applications processing complex data, we (a) study which indistinguishability-based privacy and utility requirements are relevant, (b) study whether existing PETs solve the trade-off sufficiently, and (c) propose novel PETs that extend the state of the art substantially in terms of methodology as well as achieved privacy or utility. Overall, we provide four contributions divided into two parts. In the first part, we study applications that analyze structured data with distance-based mining algorithms. We reveal that an essential utility requirement is the preservation of the pair-wise distances of the data items. Consequently, we propose distance-preserving encryption (DPE), together with a general procedure to engineer respective PETs by leveraging existing encryption schemes. As a proof of concept, we apply it to SQL log mining, which is useful for database performance tuning. In the second part, we study applications that monitor query results over infinite streams. To this end, w-event differential privacy is the state of the art. Here, PETs use mechanisms that typically add noise to query results. First, we study state-of-the-art mechanisms with respect to the utility they provide.
Conducting the largest benchmark so far, one that fulfills requirements derived from limitations of prior experimental studies, we contribute new insights into the strengths and weaknesses of existing mechanisms. One of the most unexpected, yet explainable, results is a baseline supremacy: one of the two baseline mechanisms delivers high or even the best utility. A natural follow-up question is whether baseline mechanisms already provide reasonable utility. So, second, we perform a case study from the area of electricity grid monitoring, revealing two results. First, achieving reasonable utility is only possible under weak privacy requirements. Second, the utility measured with application-specific utility metrics decreases faster than the sanitization error, which is used as the utility metric in most studies, suggests. As a third contribution, we propose a novel differential privacy-based privacy definition called Swellfish privacy. It allows tuning utility beyond incremental w-event mechanism design by supporting time-dependent privacy requirements. Formally, as well as by experiments, we prove that it increases utility significantly. In total, our thesis contributes substantially to the research field and reveals directions for future research.

    Fundamentals

    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters