
10 research outputs found

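Fortschritte im unüberwachten Lernen und Anwendungsbereiche: Subspace Clustering mit Hintergrundwissen, semantisches Passworterraten und erlernte Indexstrukturen [Advances in unsupervised learning and application areas: subspace clustering with background knowledge, semantic password guessing, and learned index structures]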

Author: Hünemörder Maximilian
Publication date: 01/01/2023

Over the past few years, advances in data science, machine learning and, in particular, unsupervised learning have enabled significant progress in many scientific fields and even in everyday life. Unsupervised learning methods are usually successful whenever they can be tailored to specific applications using appropriate requirements based on domain expertise. This dissertation shows how purely theoretical research can lead to circumstances that favor overly optimistic results, and the advantages of application-oriented research based on specific background knowledge. These observations apply to traditional unsupervised learning problems such as clustering, anomaly detection and dimensionality reduction. Therefore, this thesis presents extensions of these classical problems, such as subspace clustering and principal component analysis, as well as several specific applications with relevant interfaces to machine learning. Examples include password guessing using semantic word embeddings and learning spatial index structures using statistical models. In essence, this thesis shows that application-oriented research has many advantages for current and future research.

MACAU: Open Access Repository of Kiel University

Similarity search and applications: 12th international conference, SISAP 2019, Newark, NJ, USA, October 2-4, 2019, proceedings

Author: Amato Giuseppe
Gennaro Claudio
Oria Vincent
Radovanović Milos
Publication venue: Springer International Publishing AG
Publication date: 01/01/2019

CERN Document Server

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Author: Axiotis Kyriakos
Cohen-Addad Vincent
Henzinger Monika
Jerome Sammy
Mirrokni Vahab
Saulpic David
Woodruff David
Wunder Michael
Publication date: 27/02/2024

We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of "typical" $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon \lambda \Phi_k$, where $\Phi_k$ represents the $k$-means cost for the input embeddings and $\lambda$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable.
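To make the idea of clustering-based sensitivity sampling concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not state the exact sampling distribution, so the score used below (a point's share of the k-means cost plus a uniform share within its cluster) is an assumption borrowed from standard coreset-style sensitivity sampling, and the names `sq_dist` and `sensitivity_sample` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sensitivity_sample(points, centers, m, seed=0):
    """Draw m points (with replacement) by a sensitivity-style score.

    Assumed score (not taken from the paper):
        cost share of the point in Phi_k + 1 / (size of its cluster).
    Returns (index, weight) pairs with weight 1 / (m * p_i), so that a
    weighted sum over the sample is an unbiased estimate of the sum
    over all points.
    """
    rng = random.Random(seed)
    n = len(points)

    # Assign every point to its nearest center and record its cost.
    assign, costs = [], []
    for p in points:
        dists = [sq_dist(p, c) for c in centers]
        j = min(range(len(centers)), key=dists.__getitem__)
        assign.append(j)
        costs.append(dists[j])

    total_cost = sum(costs)  # plays the role of Phi_k
    sizes = [0] * len(centers)
    for j in assign:
        sizes[j] += 1

    # Sensitivity proxy per point, then normalise to probabilities.
    scores = []
    for i in range(n):
        cost_share = costs[i] / total_cost if total_cost > 0 else 0.0
        scores.append(cost_share + 1.0 / sizes[assign[i]])
    norm = sum(scores)
    probs = [s / norm for s in scores]

    picked = rng.choices(range(n), weights=probs, k=m)
    return [(i, 1.0 / (m * probs[i])) for i in picked]


# Toy usage: two well-separated 1-D clusters with known centers.
toy_points = [(0.0,), (2.0,), (10.0,), (12.0,)]
toy_centers = [(1.0,), (11.0,)]
toy_sample = sensitivity_sample(toy_points, toy_centers, m=2)
```

The importance weights are what allow the average loss on the sample to stand in for the average loss on the full dataset; in practice the centers would come from running a k-means algorithm on the embeddings rather than being supplied by hand.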

arXiv.org e-Print Archive

Simple, Scalable and Effective Clustering via One-Dimensional Projections

Author: Charikar Moses
Henzinger Monika
Hu Lunjia
Vötsch Maximilian
Waingarten Erik
Publication date: 25/10/2023

Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis. Popular clustering algorithms such as Lloyd's algorithm and $k$-means++ can take $\Omega(ndk)$ time when clustering $n$ points in a $d$-dimensional space (represented by an $n\times d$ matrix $X$) into $k$ clusters. In applications with moderate to large $k$, the multiplicative $k$ factor can become very expensive. We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$. Here $\mathrm{nnz}(X)$ is the total number of non-zero entries in the input dataset $X$, which is upper bounded by $nd$ and can be significantly smaller for sparse datasets. We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective. We also believe that our theoretical analysis is of independent interest, as we show that the approximation ratio of a $k$-means algorithm is approximately preserved under a class of projections and that $k$-means++ seeding can be implemented in expected $O(n \log n)$ time in one dimension. Finally, we show experimentally that our clustering algorithm gives a new tradeoff between running time and cluster quality compared to previous state-of-the-art methods for these tasks.

Comment: 41 pages, 6 figures, to appear in NeurIPS 202
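The combination of a one-dimensional projection with k-means++ seeding can be illustrated with a short Python sketch. It rests on assumptions: the abstract does not spell out the projection or the data structure behind the O(n log n) bound, so the code below uses a random Gaussian direction and the plain O(nk) form of D^2 seeding. All function names are hypothetical.

```python
import random


def project_1d(points, rng):
    """Project d-dimensional points onto one random Gaussian direction."""
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [sum(x * u for x, u in zip(p, direction)) for p in points]


def kmeanspp_seed_1d(values, k, rng):
    """Plain k-means++ (D^2) seeding on scalars.

    This is the naive O(n * k) version; it does NOT reproduce the
    O(n log n) implementation mentioned in the abstract.
    Returns the indices of the chosen seeds.
    """
    first = rng.randrange(len(values))
    chosen = [first]
    # Squared distance of every value to its closest chosen seed.
    d2 = [(v - values[first]) ** 2 for v in values]
    while len(chosen) < k:
        if sum(d2) == 0:
            break  # every value coincides with an existing seed
        nxt = rng.choices(range(len(values)), weights=d2, k=1)[0]
        chosen.append(nxt)
        d2 = [min(old, (v - values[nxt]) ** 2) for old, v in zip(d2, values)]
    return chosen


def seed_via_projection(points, k, seed=0):
    """Pick k seed points by running D^2 seeding on a 1-D projection."""
    rng = random.Random(seed)
    values = project_1d(points, rng)
    return [points[i] for i in kmeanspp_seed_1d(values, k, rng)]


# Toy usage on a handful of 2-D points.
toy = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 0.5), (9.1, 0.4)]
toy_seeds = seed_via_projection(toy, k=3, seed=1)
```

Working on scalars is what makes the seeding cheap: every distance computation costs O(1) instead of O(d), and the seeds found in one dimension are then used as centers in the original space.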

arXiv.org e-Print Archive

Cancer drug sensitivity prediction from routine histology images

Author: Branson Kim
Dawood Muhammad
Jones Louise
Minhas Fayyaz ul Amir Afsar
Rajpoot Nasir M.
Vu Quoc Dang
Young Lawrence S.
Publication venue: Nature Publishing Group
Publication date: 06/01/2024

Drug sensitivity prediction models can aid in personalising cancer therapy, biomarker discovery, and drug design. Such models require survival data from randomised controlled trials which can be time consuming and expensive. In this proof-of-concept study, we demonstrate for the first time that deep learning can link histological patterns in whole slide images (WSIs) of Haematoxylin & Eosin (H&E) stained breast cancer sections with drug sensitivities inferred from cell lines. We employ patient-wise drug sensitivities imputed from gene expression-based mapping of drug effects on cancer cell lines to train a deep learning model that predicts patients’ sensitivity to multiple drugs from WSIs. We show that it is possible to use routine WSIs to predict the drug sensitivity profile of a cancer patient for a number of approved and experimental drugs. We also show that the proposed approach can identify cellular and histological patterns associated with drug sensitivity profiles of cancer patients

Warwick Research Archives Portal Repository

Graph set data mining

Author: Schäfer Till
Publication date: 01/01/2023

Graphs are among the most versatile abstract data types in computer science. With the variety comes great adoption in various application fields, such as chemistry, biology, social analysis, logistics, and computer science itself. With the growing capacities of digital storage, the collection of large amounts of data has become the norm in many application fields. Data mining, i.e., the automated extraction of non-trivial patterns from data, is a key step to extract knowledge from these datasets and generate value. This thesis is dedicated to concurrent scalable data mining algorithms beyond traditional notions of efficiency for large-scale datasets of small labeled graphs; more precisely, structural clustering and representative subgraph pattern mining. It is motivated by, but not limited to, the need to analyze molecular libraries of ever-increasing size in the drug discovery process. Structural clustering makes use of graph theoretical concepts, such as (common) subgraph isomorphisms and frequent subgraphs, to model cluster commonalities directly in the application domain. It is considered computationally demanding for non-restricted graph classes and with very few exceptions prior algorithms are only suitable for very small datasets. This thesis discusses the first truly scalable structural clustering algorithm StruClus with linear worst-case complexity. At the same time, StruClus embraces the inherent values of structural clustering algorithms, i.e., interpretable, consistent, and high-quality results. A novel two-fold sampling strategy with stochastic error bounds for frequent subgraph mining is presented. It enables fast extraction of cluster commonalities in the form of common subgraph representative sets. StruClus is the first structural clustering algorithm with a directed selection of structural cluster-representative patterns regarding homogeneity and separation aspects in the high-dimensional subgraph pattern space. 
Furthermore, a novel concept of cluster homogeneity balancing using dynamically-sized representatives is discussed. The second part of this thesis discusses the representative subgraph pattern mining problem in more general terms. A novel objective function maximizes the number of represented graphs for a cardinality-constrained representative set. It is shown that the problem is a special case of the maximum coverage problem and is NP-hard. Based on the greedy approximation of Nemhauser, Wolsey, and Fisher for submodular set function maximization a novel sampling approach is presented. It mines candidate sets that contain an optimal greedy solution with a probabilistic maximum error. This leads to a constant-time algorithm to generate the candidate sets given a fixed-size sample of the dataset. In combination with a cheap single-pass streaming evaluation of the candidate sets, this enables scalability to datasets with billions of molecules on a single machine. Ultimately, the sampling approach leads to the first distributed subgraph pattern mining algorithm that distributes the pattern space and the dataset graphs at the same time
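The greedy approximation for maximum coverage that the abstract builds on can be sketched in a few lines of Python. This is only the textbook greedy step under a cardinality constraint, not the thesis's sampling or streaming algorithm. The pattern names, the graph ids, and the function name `greedy_max_coverage` are hypothetical.

```python
def greedy_max_coverage(candidates, budget):
    """Greedy selection for cardinality-constrained maximum coverage.

    candidates: dict mapping a pattern name to the set of graph ids it
                represents (illustrative data model, an assumption).
    budget:     maximum number of patterns to select.

    Each round picks the pattern that covers the most graphs not yet
    covered. For monotone submodular objectives such as coverage, this
    greedy rule is known to achieve a (1 - 1/e) approximation.
    """
    remaining = dict(candidates)
    covered = set()
    chosen = []
    for _ in range(budget):
        best = max(
            remaining,
            key=lambda p: len(remaining[p] - covered),
            default=None,
        )
        # Stop when nothing is left or no pattern adds new graphs.
        if best is None or not (remaining[best] - covered):
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Toy usage: four candidate patterns over seven graphs.
toy_patterns = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1},
}
toy_choice, toy_covered = greedy_max_coverage(toy_patterns, budget=2)
```

The sets here are materialised in memory for clarity; the approach described in the abstract instead evaluates candidate sets in a single streaming pass, which is what lets it scale to very large molecule collections.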

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung

27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

Author: ESA <27. 2019, München>
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Publication date: 01/09/2019

Digitale Bibliothek Thüringen

XXIII Edición del Workshop de Investigadores en Ciencias de la Computación: Libro de actas [23rd edition of the Workshop of Computer Science Researchers: proceedings]

Author: Carmona Fernanda Beatriz
Frati Fernando Emmanuel
Publication venue: Universidad Nacional de Chilecito (UNdeC)
Publication date: 31/05/2021

Compilation of the papers presented at the XXIII Workshop de Investigadores en Ciencias de la Computación (WICC), held in Chilecito (La Rioja) in April 2021. Red de Universidades con Carreras en Informátic

Servicio de Difusión de la Creación Intelectual

Longitudinal multi-dimensional investigation of metabolic and endocrine genetics

Author: Sudheesh Venkatesh Samvida
Publication date: 23/04/2024

Genome-wide association studies (GWASs) in recent decades have revealed the genetic landscape and shared aetiology of common, complex traits across the spectrum of human phenotypes. In this work, I develop and apply statistical tools to interrogate the genetic basis of, and relationships between, metabolic and endocrine traits. I demonstrate that under-explored primary care electronic health records (EHRs), linked to massive biobank projects across the globe, are a valuable source of longitudinal and rare biomarker data for genetics studies. Using EHRs, I find a common missense variant in the APOE gene that is associated with weight loss in adulthood, which replicates in three global biobanking cohorts of between 125,000 and 475,000 individuals each. While the heritability of weight-change is low ( 700,000 participants across seven global biobanks), to characterise the genetic contributions to these common but poorly understood phenotypes. I find 21 unique genetic loci for infertility, of which only six colocalise with reproductive hormone levels. While there is modest correlation between female infertility and heritable diseases of the reproductive tract, such as endometriosis (rG = 58%) and polycystic ovary syndrome (PCOS) (rG = 40%), I find no evidence for metabolic conditions such as obesity in the genetic aetiology of infertility. I explore these findings further through Mendelian Randomisation analyses to reveal heterogeneity in the genetically predicted causal effects of overall and central obesity on the risk of female reproductive conditions, including infertility, endometriosis, and PCOS, which may be partly genetically mediated by hormone levels. Through a range of genetics-based investigations, I outline the shared and distinct mechanisms of metabolic and endocrine disease in humans.

Oxford University Research Archive