10 research outputs found

    Advances in Unsupervised Learning and Applications: Subspace Clustering with Background Knowledge, Semantic Password Guessing, and Learned Index Structures

    Over the past few years, advances in data science, machine learning and, in particular, unsupervised learning have enabled significant progress in many scientific fields and even in everyday life. Unsupervised learning methods are usually successful whenever they can be tailored to specific applications using appropriate requirements based on domain expertise. This dissertation shows how purely theoretical research can lead to circumstances that favor overly optimistic results, and the advantages of application-oriented research based on specific background knowledge. These observations apply to traditional unsupervised learning problems such as clustering, anomaly detection and dimensionality reduction. Therefore, this thesis presents extensions of these classical problems, such as subspace clustering and principal component analysis, as well as several specific applications with relevant interfaces to machine learning. Examples include password guessing using semantic word embeddings and learning spatial index structures using statistical models. In essence, this thesis shows that application-oriented research has many advantages for current and future research.

    Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

    We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied to linear regression, leading to a new sampling strategy that surprisingly matches the performance of leverage score sampling, while being conceptually simpler and more scalable.
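A minimal sketch of clustering-based sensitivity sampling, to make the idea in the abstract above concrete. It is not the paper's exact procedure: the scoring rule (a point's share of the $k$-means cost plus a uniform share within its cluster) is a common coreset-style choice assumed here for illustration, and the names `kmeans` and `sensitivity_sample` are hypothetical.

```python
import numpy as np


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    # Final assignment against the final centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)


def sensitivity_sample(X, k, m, seed=0):
    """Draw m indices with probability proportional to an assumed
    sensitivity score, and return importance weights so that
    sum(weights * loss[idx]) is an unbiased estimate of sum(loss)."""
    rng = np.random.default_rng(seed)
    centers, labels = kmeans(X, k, seed=seed)
    d2 = ((X - centers[labels]) ** 2).sum(axis=1)  # squared distance to own center
    phi = d2.sum()                                 # k-means cost of the clustering
    sizes = np.bincount(labels, minlength=k)
    scores = 1.0 / sizes[labels]                   # uniform share within the cluster
    if phi > 0:
        scores = scores + d2 / phi                 # share of the clustering cost
    probs = scores / scores.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights
```

In this sketch, points far from their cluster center are sampled more often and down-weighted accordingly, so a model loss that varies smoothly in embedding space can be estimated from the weighted sample.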

    Simple, Scalable and Effective Clustering via One-Dimensional Projections

    Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks. Comment: 41 pages, 6 figures, to appear in NeurIPS 202
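The projection idea in the abstract above can be illustrated with a short sketch: project the points onto one random direction, then run $k$-means++ ($D^2$) seeding on the resulting one-dimensional values. This is an assumption-laden illustration rather than the authors' algorithm. In particular, the seeding loop below is the naive $O(nk)$ version, not the expected $O(n \log n)$ implementation the abstract describes, and the function names are hypothetical.

```python
import numpy as np


def kmeanspp_seed_1d(values, k, rng):
    """Naive k-means++ (D^2) seeding on 1-D values; returns k indices."""
    n = len(values)
    first = int(rng.integers(n))
    chosen = [first]
    # Squared distance from every point to its nearest chosen seed.
    d2 = (values - values[first]) ** 2
    for _ in range(k - 1):
        total = d2.sum()
        if total == 0:
            # Every point coincides with a seed; fall back to a uniform pick.
            nxt = int(rng.integers(n))
        else:
            nxt = int(rng.choice(n, p=d2 / total))
        chosen.append(nxt)
        d2 = np.minimum(d2, (values - values[nxt]) ** 2)
    return np.array(chosen)


def seed_via_projection(X, k, seed=0):
    """Pick k initial centers from the rows of X using a 1-D random projection."""
    rng = np.random.default_rng(seed)
    direction = rng.standard_normal(X.shape[1])  # random Gaussian direction
    projected = X @ direction                    # one pass over the entries of X
    return X[kmeanspp_seed_1d(projected, k, rng)]
```

The returned rows can serve as initial centers for any $k$-means refinement step; only the projection touches the full $d$-dimensional data.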

    Cancer drug sensitivity prediction from routine histology images

    Drug sensitivity prediction models can aid in personalising cancer therapy, biomarker discovery, and drug design. Such models require survival data from randomised controlled trials, which can be time-consuming and expensive to obtain. In this proof-of-concept study, we demonstrate for the first time that deep learning can link histological patterns in whole slide images (WSIs) of Haematoxylin & Eosin (H&E) stained breast cancer sections with drug sensitivities inferred from cell lines. We employ patient-wise drug sensitivities imputed from gene expression-based mapping of drug effects on cancer cell lines to train a deep learning model that predicts patients' sensitivity to multiple drugs from WSIs. We show that it is possible to use routine WSIs to predict the drug sensitivity profile of a cancer patient for a number of approved and experimental drugs. We also show that the proposed approach can identify cellular and histological patterns associated with drug sensitivity profiles of cancer patients.

    Graph set data mining

    Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes and with very few exceptions prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm StruClus with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. 
Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization, a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time.
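The greedy approximation for maximum coverage mentioned in the abstract above can be sketched as follows. This is the textbook greedy rule (repeatedly pick the set with the largest marginal gain), which is known to achieve a $1 - 1/e$ approximation for this problem. It is not the thesis's sampling-based algorithm, and the pattern names and graph ids in the usage example are made up for illustration.

```python
def greedy_max_coverage(candidates, budget):
    """Greedy maximum coverage.

    candidates: dict mapping a pattern (any hashable) to the set of
                graph ids it represents.
    budget:     maximum number of patterns to select.
    Returns (chosen patterns in selection order, set of covered ids).
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)  # shallow copy; the input is not modified
    for _ in range(budget):
        # Pattern with the largest number of not-yet-covered graphs.
        best = max(remaining,
                   key=lambda p: len(remaining[p] - covered),
                   default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical usage: four subgraph patterns over six molecule ids.
patterns = {
    "C-C": {1, 2, 3},
    "C=O": {3, 4},
    "N-H": {4, 5, 6},
    "ring": {1, 2},
}
selected, represented = greedy_max_coverage(patterns, budget=2)
print(selected)     # ['C-C', 'N-H']
print(represented)  # {1, 2, 3, 4, 5, 6}
```

Ties are broken by dictionary insertion order, which is why `"C-C"` is picked before `"N-H"` in the example.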

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

    23rd Edition of the Workshop de Investigadores en Ciencias de la Computación: Proceedings

    Compilation of the papers presented at the XXIII Workshop de Investigadores en Ciencias de la Computación (WICC), held in Chilecito (La Rioja) in April 2021. Red de Universidades con Carreras en Informátic

    Longitudinal multi-dimensional investigation of metabolic and endocrine genetics

    Genome-wide association studies (GWASs) in recent decades have revealed the genetic landscape and shared aetiology of common, complex traits across the spectrum of human phenotypes. In this work, I develop and apply statistical tools to interrogate the genetic basis of, and relationships between, metabolic and endocrine traits. I demonstrate that under-explored primary care electronic health records (EHRs), linked to massive biobank projects across the globe, are a valuable source of longitudinal and rare biomarker data for genetics studies. Using EHRs, I find a common missense variant in the APOE gene that is associated with weight loss in adulthood, which replicates in three global biobanking cohorts of between 125,000 and 475,000 individuals each. While the heritability of weight-change is low ( 700,000 participants across seven global biobanks), to characterise the genetic contributions to these common but poorly understood phenotypes. I find 21 unique genetic loci for infertility, of which only six colocalise with reproductive hormone levels. While there is modest correlation between female infertility and heritable diseases of the reproductive tract, such as endometriosis (rG = 58%) and polycystic ovary syndrome (PCOS) (rG = 40%), I find no evidence for metabolic conditions such as obesity in the genetic aetiology of infertility. I explore these findings further through Mendelian Randomisation analyses to reveal heterogeneity in the genetically predicted causal effects of overall and central obesity on the risk of female reproductive conditions, including infertility, endometriosis, and PCOS, which may be partly genetically mediated by hormone levels. Through a range of genetics-based investigations, I outline the shared and distinct mechanisms of metabolic and endocrine disease in humans.