745 research outputs found

    Graph Laplacian for Semi-Supervised Learning

    Full text link
    Semi-supervised learning is highly useful in common scenarios where labeled data is scarce but unlabeled data is abundant. The graph (or nonlocal) Laplacian is a fundamental smoothing operator for solving various learning tasks. For unsupervised clustering, a spectral embedding is often used, based on graph-Laplacian eigenvectors. For semi-supervised problems, the common approach is to solve a constrained optimization problem, regularized by a Dirichlet energy, based on the graph-Laplacian. However, as supervision decreases, Dirichlet optimization becomes suboptimal. We therefore would like to obtain a smooth transition between unsupervised clustering and low-supervised graph-based classification. In this paper, we propose a new type of graph-Laplacian which is adapted for Semi-Supervised Learning (SSL) problems. It is based on both density and contrastive measures and allows the encoding of the labeled data directly in the operator. Thus, we can perform successfully semi-supervised learning using spectral clustering. The benefits of our approach are illustrated for several SSL problems.Comment: 12 pages, 6 figure

    Applications of No-Collision Transportation Maps in Manifold Learning

    Full text link
    In this work, we investigate applications of no-collision transportation maps introduced in [Nurbekyan et. al., 2020] in manifold learning for image data. Recently, there has been a surge in applying transportation-based distances and features for data representing motion-like or deformation-like phenomena. Indeed, comparing intensities at fixed locations often does not reveal the data structure. No-collision maps and distances developed in [Nurbekyan et. al., 2020] are sensitive to geometric features similar to optimal transportation (OT) maps but much cheaper to compute due to the absence of optimization. In this work, we prove that no-collision distances provide an isometry between translations (respectively dilations) of a single probability measure and the translation (respectively dilation) vectors equipped with a Euclidean distance. Furthermore, we prove that no-collision transportation maps, as well as OT and linearized OT maps, do not in general provide an isometry for rotations. The numerical experiments confirm our theoretical findings and show that no-collision distances achieve similar or better performance on several manifold learning tasks compared to other OT and Euclidean-based methods at a fraction of a computational cost

    Data- og ekspertdreven variabelseleksjon for prediktive modeller i helsevesenet : mot økt tolkbarhet i underbestemte maskinlæringsproblemer

    Get PDF
    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e. the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capabilities to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection. Experimental evaluations demonstrate favorable properties of both predictive performance, stability, as well as interpretability of results, which carries a high potential for better data-driven decision support in clinical practice.Moderne datainnsamlingsteknikker i helsevesenet genererer store datamengder fra flere kilder, som for eksempel nye diagnose- og behandlingsmetoder. Noen konkrete eksempler er elektroniske helsejournalsystemer, genomikk og medisinske bilder. Slike pasientkohortdata er ofte ustrukturerte, høydimensjonale og heterogene og hvor klassiske statistiske metoder ikke er tilstrekkelige for optimal utnyttelse av dataene og god informasjonsbasert beslutningstaking. Derfor kan det være lovende å analysere slike datastrukturer ved bruk av moderne maskinlæringsteknikker for å øke forståelsen av pasientenes helseproblemer og for å gi klinikerne en bedre plattform for informasjonsbasert beslutningstaking. Sentrale krav til dette formålet inkluderer (a) tilstrekkelig nøyaktige prediksjoner og (b) modelltolkbarhet. Å oppnå begge aspektene samtidig er vanskelig, spesielt for datasett med få pasienter, noe som er vanlig for data i helsevesenet. I slike tilfeller må maskinlæringsmodeller håndtere matematisk underbestemte systemer og dette kan lett føre til at modellene overtilpasses treningsdataene. Variabelseleksjon er en viktig tilnærming for å håndtere dette ved å identifisere en undergruppe av informative variabler med hensyn til responsvariablen. Samtidig som variabelseleksjonsmetoder kan lede til økt prediktiv ytelse, fremmes modelltolkbarhet ved å identifisere et lavt antall relevante modellparametere. Dette kan gi bedre forståelse av de underliggende biologiske prosessene som fører til helseproblemer. Tolkbarhet krever at variabelseleksjonen er stabil, dvs. at små endringer i datasettet ikke fører til endringer i hvilke variabler som velges. Et konsept for å adressere ustabilitet er ensemblevariableseleksjon, dvs. prosessen med å gjenta variabelseleksjon flere ganger på en delmengde av prøvene i det originale datasett og aggregere resultater i en metamodell. Denne avhandlingen presenterer to tilnærminger for ensemblevariabelseleksjon, som er skreddersydd for høydimensjonale data i helsevesenet: "Repeated Elastic Net Technique for feature selection" (RENT) og "User-Guided Bayesian Framework for feature selection" (UBayFS). Mens RENT er datadrevet og bygger på elastic net-regulariserte modeller, er UBayFS et generelt rammeverk for ensembler som muliggjør inkludering av ekspertkunnskap i variabelseleksjonsprosessen gjennom forhåndsbestemte vekter og sidebegrensninger. En case-studie som modellerer overlevelsen av kreftpasienter sammenligner disse nye variabelseleksjonsmetodene og demonstrerer deres potensiale i klinisk praksis. Utover valg av enkelte variabler gjør UBayFS det også mulig å velge blokker eller grupper av variabler som representerer de ulike datakildene som ble nevnt over. Kvantifisering av viktigheten av variabelgrupper spiller en nøkkelrolle for forståelsen av hvorvidt datakildene er viktige for responsvariablen. Tilgang til slik informasjon kan føre til at bruken av menneskelige, tekniske og økonomiske ressurser kan forbedres dersom informasjonen integreres systematisk i planleggingen av pasientbehandlingen. Slik kan man redusere innsamling av ikke-informative variabler. Siden generaliseringen av viktighet av variabelgrupper ikke er triviell, undersøkes og sammenlignes også tilnærminger for rangering av viktigheten til disse variabelgruppene. Denne avhandlingen viser at høydimensjonale datasett fra flere datakilder fra det medisinske domenet effektivt kan håndteres ved bruk av variabelseleksjonmetodene som er presentert i avhandlingen. Eksperimentene viser at disse kan ha positiv en effekt på både prediktiv ytelse, stabilitet og tolkbarhet av resultatene. Bruken av disse variabelseleksjonsmetodene bærer et stort potensiale for bedre datadrevet beslutningsstøtte i klinisk praksis

    Boosting precision crop protection towards agriculture 5.0 via machine learning and emerging technologies: A contextual review

    Get PDF
    Crop protection is a key activity for the sustainability and feasibility of agriculture in a current context of climate change, which is causing the destabilization of agricultural practices and an increase in the incidence of current or invasive pests, and a growing world population that requires guaranteeing the food supply chain and ensuring food security. In view of these events, this article provides a contextual review in six sections on the role of artificial intelligence (AI), machine learning (ML) and other emerging technologies to solve current and future challenges of crop protection. Over time, crop protection has progressed from a primitive agriculture 1.0 (Ag1.0) through various technological developments to reach a level of maturity closelyin line with Ag5.0 (section 1), which is characterized by successfully leveraging ML capacity and modern agricultural devices and machines that perceive, analyze and actuate following the main stages of precision crop protection (section 2). Section 3 presents a taxonomy of ML algorithms that support the development and implementation of precision crop protection, while section 4 analyses the scientific impact of ML on the basis of an extensive bibliometric study of >120 algorithms, outlining the most widely used ML and deep learning (DL) techniques currently applied in relevant case studies on the detection and control of crop diseases, weeds and plagues. Section 5 describes 39 emerging technologies in the fields of smart sensors and other advanced hardware devices, telecommunications, proximal and remote sensing, and AI-based robotics that will foreseeably lead the next generation of perception-based, decision-making and actuation systems for digitized, smart and real-time crop protection in a realistic Ag5.0. Finally, section 6 highlights the main conclusions and final remarks

    Comparison between gradients and parcellations for functional connectivity prediction of behavior

    Get PDF
    Resting-state functional connectivity (RSFC) is widely used to predict behavioral measures. To predict behavioral measures, representing RSFC with parcellations and gradients are the two most popular approaches. Here, we compare parcellation and gradient approaches for RSFC-based prediction of a broad range of behavioral measures in the Human Connectome Project (HCP) and Adolescent Brain Cognitive Development (ABCD) datasets. Among the parcellation approaches, we consider group-average “hard” parcellations (Schaefer et al., 2018), individual-specific “hard” parcellations (Kong et al., 2021a), and an individual-specific “soft” parcellation (spatial independent component analysis with dual regression; Beckmann et al., 2009). For gradient approaches, we consider the well-known principal gradients (Margulies et al., 2016) and the local gradient approach that detects local RSFC changes (Laumann et al., 2015). Across two regression algorithms, individual-specific hard-parcellation performs the best in the HCP dataset, while the principal gradients, spatial independent component analysis and group-average “hard” parcellations exhibit similar performance. On the other hand, principal gradients and all parcellation approaches perform similarly in the ABCD dataset. Across both datasets, local gradients perform the worst. Finally, we find that the principal gradient approach requires at least 40 to 60 gradients to perform as well as parcellation approaches. While most principal gradient studies utilize a single gradient, our results suggest that incorporating higher order gradients can provide significant behaviorally relevant information. Future work will consider the inclusion of additional parcellation and gradient approaches for comparison

    Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need

    Full text link
    Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires to build samples that are known to be semantically akin, i.e. positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad-hoc strategies e.g. applying known data-augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL) where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it unveils a theoretically grounded learning framework beyond SSL, that can be extended to tackle supervised and semi-supervised learning depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g. some observed labels, into any SSL losses without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions to annotate datasets, arguably bringing the gap between theory and practice of active learning that is based on simple-to-answer-by-non-experts queries of semantic relationships between inputs.Comment: 8 main pages, 20 totals, 10 figure

    Linearized Wasserstein dimensionality reduction with approximation guarantees

    Full text link
    We introduce LOT Wassmap, a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space. The algorithm is motivated by the observation that many datasets are naturally interpreted as probability measures rather than points in Rn\mathbb{R}^n, and that finding low-dimensional descriptions of such datasets requires manifold learning algorithms in the Wasserstein space. Most available algorithms are based on computing the pairwise Wasserstein distance matrix, which can be computationally challenging for large datasets in high dimensions. Our algorithm leverages approximation schemes such as Sinkhorn distances and linearized optimal transport to speed-up computations, and in particular, avoids computing a pairwise distance matrix. We provide guarantees on the embedding quality under such approximations, including when explicit descriptions of the probability measures are not available and one must deal with finite samples instead. Experiments demonstrate that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size. We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.Comment: 38 pages, 10 figures. Submitte

    Contrastive Learning Can Find An Optimal Basis For Approximately View-Invariant Functions

    Full text link
    Contrastive learning is a powerful framework for learning self-supervised representations that generalize well to downstream supervised tasks. We show that multiple existing contrastive learning methods can be reinterpreted as learning kernel functions that approximate a fixed positive-pair kernel. We then prove that a simple representation obtained by combining this kernel with PCA provably minimizes the worst-case approximation error of linear predictors, under a straightforward assumption that positive pairs have similar labels. Our analysis is based on a decomposition of the target function in terms of the eigenfunctions of a positive-pair Markov chain, and a surprising equivalence between these eigenfunctions and the output of Kernel PCA. We give generalization bounds for downstream linear prediction using our Kernel PCA representation, and show empirically on a set of synthetic tasks that applying Kernel PCA to contrastive learning models can indeed approximately recover the Markov chain eigenfunctions, although the accuracy depends on the kernel parameterization as well as on the augmentation strength.Comment: Published at ICLR 202
    corecore