
    Optimal Tuning for Divide-and-conquer Kernel Ridge Regression with Massive Data

    Divide-and-conquer is a powerful approach for large and massive data analysis. In the nonparametric regression setting, although various theoretical frameworks have been established to achieve optimality in estimation or hypothesis testing, how to choose the tuning parameter in a practically effective way is still an open problem. In this paper, we propose a data-driven procedure based on divide-and-conquer for selecting the tuning parameters in kernel ridge regression by modifying the popular Generalized Cross-Validation criterion (GCV; Wahba, 1990). The proposed criterion is computationally scalable for massive data sets, and under mild conditions it is shown to be asymptotically optimal, in the sense that minimizing the proposed distributed-GCV (dGCV) criterion is equivalent to minimizing the true global conditional empirical loss of the averaged function estimator. This extends the existing optimality results of GCV to the divide-and-conquer framework.
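
    The paper defines its own distributed criterion; as a rough illustration only, the numpy sketch below assumes an RBF kernel and averages per-partition GCV scores as a stand-in for the exact dGCV formula, then picks the ridge parameter that minimizes it:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=10.0):
    # Gaussian (RBF) kernel matrix between row sets X and Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gcv_score(K, y, lam):
    # classical GCV for kernel ridge regression on one partition
    n = len(y)
    A = K @ np.linalg.inv(K + n * lam * np.eye(n))  # smoother matrix
    resid = y - A @ y
    return (resid @ resid / n) / (1.0 - np.trace(A) / n) ** 2

def dgcv(partitions, lam):
    # stand-in for the paper's dGCV: aggregate per-partition GCV scores
    return np.mean([gcv_score(rbf_kernel(X, X), y, lam) for X, y in partitions])

# usage: choose the ridge parameter minimizing the aggregated criterion
rng = np.random.default_rng(0)
X = rng.uniform(size=(600, 1))
y = np.sin(4 * X[:, 0]) + 0.3 * rng.standard_normal(600)
partitions = [(X[i::3], y[i::3]) for i in range(3)]
lam_best = min(np.logspace(-6, 0, 13), key=lambda l: dgcv(partitions, l))
```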

    Domain Adaptation on Graphs by Learning Graph Topologies: Theoretical Analysis and an Algorithm

    Traditional machine learning algorithms assume that the training and test data have the same distribution, but this assumption does not necessarily hold in real applications. Domain adaptation methods take the deviations in the data distribution into account. In this work, we study the problem of domain adaptation on graphs. We consider a source graph and a target graph constructed with samples drawn from data manifolds, and study the problem of estimating the unknown class labels on the target graph using the label information on the source graph and the similarity between the two graphs. We particularly focus on a setting where the target label function is learnt such that its spectrum is similar to that of the source label function. We first propose a theoretical analysis of domain adaptation on graphs and present performance bounds that characterize the target classification error in terms of the properties of the graphs and the data manifolds. We show that the classification performance improves as the topologies of the graphs get more balanced, i.e., as the numbers of neighbors of different graph nodes become more proportionate and weak edges with small weights are avoided. Our results also suggest that graph edges between too-distant data samples should be avoided for good generalization performance. We then propose a graph domain adaptation algorithm inspired by our theoretical findings, which estimates the label functions while learning the source and target graph topologies at the same time. The joint graph learning and label estimation problem is formulated through an objective function relying on our performance bounds, which is minimized with an alternating optimization scheme. Experiments on synthetic and real data sets suggest that the proposed method outperforms baseline approaches.
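
    A hedged sketch of the spectral setting (not the paper's joint graph-learning algorithm): build k-NN graphs for source and target, expand the source labels in the source Laplacian eigenbasis, and reuse the coefficients in the target eigenbasis, so the target label function has a similar spectrum. All function names and parameters below are illustrative:

```python
import numpy as np

def knn_graph(X, k=8):
    # symmetric k-nearest-neighbor adjacency with Gaussian edge weights
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros_like(D)
    for i in range(len(X)):
        nbrs = np.argsort(D[i])[1:k + 1]       # skip self at distance 0
        W[i, nbrs] = np.exp(-D[i, nbrs] ** 2)
    return np.maximum(W, W.T)

def spectral_transfer(Xs, ys, Xt, n_eig=20):
    # expand the source labels in the source Laplacian eigenbasis and
    # reuse the coefficients in the target eigenbasis
    def low_eigvecs(X):
        W = knn_graph(X)
        L = np.diag(W.sum(axis=1)) - W         # unnormalized graph Laplacian
        _, U = np.linalg.eigh(L)               # eigenvalues in ascending order
        return U[:, :n_eig]
    Us, Ut = low_eigvecs(Xs), low_eigvecs(Xt)
    alpha = Us.T @ ys                          # source spectral coefficients
    return Ut @ alpha                          # smooth target label estimates
```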

    Machine Learning based Surrogate Modeling of Electronic Devices and Circuits

    The abstract is in the attachment.

    Local, Semi-Local and Global Models for Texture, Object and Scene Recognition

    This dissertation addresses the problems of recognizing textures, objects, and scenes in photographs. We present approaches to these recognition tasks that combine salient local image features with spatial relations and effective discriminative learning techniques. First, we introduce a bag of features image model for recognizing textured surfaces under a wide range of transformations, including viewpoint changes and non-rigid deformations. We present results of a large-scale comparative evaluation indicating that bags of features can be effective not only for texture, but also for object categorization, even in the presence of substantial clutter and intra-class variation. We also show how to augment the purely local image representation with statistical co-occurrence relations between pairs of nearby features, and develop a learning and classification framework for the task of classifying individual features in a multi-texture image. Next, we present a more structured alternative to bags of features for object recognition, namely, an image representation based on semi-local parts, or groups of features characterized by stable appearance and geometric layout. Semi-local parts are automatically learned from small sets of unsegmented, cluttered images. Finally, we present a global method for recognizing scene categories that works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting spatial pyramid representation demonstrates significantly improved performance on challenging scene categorization tasks.
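
    The spatial pyramid is simple to compute from quantized local features: histogram the visual words falling in each cell of 1x1, 2x2, 4x4, ... grids and weight finer levels more heavily. A minimal numpy sketch (the function and the exact level weighting are illustrative of, not identical to, the published scheme):

```python
import numpy as np

def spatial_pyramid(locs, words, n_words, levels=2):
    """locs: (N, 2) feature positions scaled to [0, 1); words: (N,) integer
    visual-word ids. Concatenates per-cell word histograms over all grids."""
    feats = []
    for lvl in range(levels + 1):
        g = 2 ** lvl
        cell = np.minimum((locs * g).astype(int), g - 1)  # cell index per feature
        for gx in range(g):
            for gy in range(g):
                inside = (cell[:, 0] == gx) & (cell[:, 1] == gy)
                h = np.bincount(words[inside], minlength=n_words).astype(float)
                feats.append(h * 2.0 ** (lvl - levels))   # finer levels weigh more
    return np.concatenate(feats)
```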

    Semi-Supervised Learning for Blog Classification.

    Blog classification (e.g., identifying bloggers' gender or age) is one of the most interesting current problems in blog analysis. Although this problem is usually solved by applying supervised learning techniques, the large labeled dataset required for training is not always available. In contrast, unlabeled blogs can easily be collected from the web. We therefore propose a semi-supervised learning method for blog classification that makes effective use of unlabeled data. In this method, entries from the same blog are assumed to have the same characteristics. With this assumption, the proposed method captures the characteristics of each blog, such as writing style and topic, and uses these characteristics to improve the classification accuracy.
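
    One way the same-blog assumption can be operationalized is blog-level self-training: predictions are pooled across all entries of a blog, and confidently classified blogs contribute pseudo-labels for retraining. The scikit-learn sketch below is illustrative, not the paper's exact procedure:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def blog_self_training(labeled, unlabeled, rounds=3, threshold=0.8):
    """labeled: list of (text, blog_id, label); unlabeled: list of (text, blog_id).
    Pools predictions within each blog before pseudo-labelling."""
    vec = TfidfVectorizer(min_df=2)
    X = vec.fit_transform([t for t, *_ in labeled + unlabeled])
    n = len(labeled)
    y = np.array([lab for *_, lab in labeled], dtype=object)
    blogs = np.array([b for _, b in unlabeled])
    clf = LogisticRegression(max_iter=1000).fit(X[:n], y)
    for _ in range(rounds):
        proba = clf.predict_proba(X[n:])
        pseudo = np.empty(len(unlabeled), dtype=object)
        conf = np.empty(len(unlabeled))
        for b in np.unique(blogs):             # one decision per blog
            m = blogs == b
            avg = proba[m].mean(axis=0)        # blog-level class probabilities
            pseudo[m], conf[m] = clf.classes_[avg.argmax()], avg.max()
        keep = conf > threshold                # confident blogs only
        clf = LogisticRegression(max_iter=1000).fit(
            sp.vstack([X[:n], X[n:][keep]]), np.concatenate([y, pseudo[keep]]))
    return clf, vec
```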

    Generalized Matrix Decomposition Regression: Estimation and Inference for Two-way Structured Data

    This paper studies high-dimensional regression with two-way structured data. To estimate the high-dimensional coefficient vector, we propose the generalized matrix decomposition regression (GMDR) to efficiently leverage any auxiliary information on row and column structures. The GMDR extends principal component regression (PCR) to two-way structured data, but unlike PCR, the GMDR selects the components that are most predictive of the outcome, leading to more accurate prediction. For inference on regression coefficients of individual variables, we propose the generalized matrix decomposition inference (GMDI), a general high-dimensional inferential framework for a large family of estimators that includes the proposed GMDR estimator. GMDI provides more flexibility for modeling relevant auxiliary row and column structures. As a result, GMDI does not require the true regression coefficients to be sparse; it also allows dependent and heteroscedastic observations. We study the theoretical properties of GMDI in terms of both the type-I error rate and power, and demonstrate the effectiveness of GMDR and GMDI on simulation studies and an application to human microbiome data. (25 pages, 6 figures; accepted by the Annals of Applied Statistics.)
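
    Setting aside the auxiliary structure that distinguishes GMDR from PCR, the component-selection idea can be sketched in a few lines of numpy: rank SVD components by their correlation with the outcome instead of by variance. This is a plain-SVD stand-in, not the GMDR estimator:

```python
import numpy as np

def supervised_pcr(X, y, n_comp=5):
    """PCR variant that keeps the components most correlated with y;
    GMDR additionally incorporates auxiliary row/column structure via
    the generalized matrix decomposition."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U * s                              # sample scores per component
    corr = np.abs(np.array([np.corrcoef(scores[:, j], y)[0, 1]
                            for j in range(len(s))]))
    keep = np.argsort(corr)[::-1][:n_comp]      # most predictive, not top-variance
    coef, *_ = np.linalg.lstsq(scores[:, keep], y, rcond=None)
    return Vt[keep].T @ coef                    # coefficients in the original space
```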

    Classification of Flow Regimes Using Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM).

    This dissertation project presents a novel method for the classification of vertical and horizontal two-phase flow regimes through pipes. For gas-liquid vertical and horizontal two-phase flows, the goal of the study is to predict the transition region between the flow regimes using data generated by empirical correlations. The transition region is determined with respect to pipe diameter, superficial gas velocity, and superficial liquid velocity. Accurate determination of the flow regime is critical in the design of multiphase flow systems, which are used in various industrial processes, including boiling and condensation, oil and gas pipelines, and cooling systems for nuclear reactors.
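
    The classification setup itself is standard supervised learning on three features. A hedged scikit-learn sketch on synthetic stand-in data (the study's data come from empirical correlations; the toy labeling rule below is purely illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# hypothetical feature table: pipe diameter [m], superficial gas and
# liquid velocities [m/s]; labels are flow-regime names
rng = np.random.default_rng(1)
X = np.column_stack([
    rng.uniform(0.01, 0.1, 500),    # pipe diameter
    10 ** rng.uniform(-2, 1, 500),  # superficial gas velocity
    10 ** rng.uniform(-2, 1, 500),  # superficial liquid velocity
])
y = np.where(X[:, 1] > X[:, 2], "annular", "bubbly")  # toy labeling rule

for model in (LinearDiscriminantAnalysis(),
              make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```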

    Leveraging large scale data for video retrieval

    The large amount of video data shared on the web has resulted in increased interest in retrieving videos using visual cues, since textual cues alone are not sufficient for satisfactory results. We address the problem of leveraging large-scale image and video data for capturing important characteristics in videos. We focus on three different problems, namely finding common patterns in unusual videos, large-scale multimedia event detection, and semantic indexing of videos. Unusual events are important as possible indicators of undesired consequences. The discovery of unusual events in videos is generally attacked as a problem of finding usual patterns. With this challenging problem at hand, we propose a novel descriptor to encode the rapid motions in videos, utilizing densely extracted trajectories. The proposed descriptor, trajectory snippet histograms, is used to distinguish unusual videos from usual videos, and is further exploited to discover the snapshots in which the unusualness happens. Next, we attack the Multimedia Event Detection (MED) task. We approach this problem by representing the videos in the form of prototypes, which correspond to models, each describing a different visual characteristic of a video shot. Finally, we approach the Semantic Indexing (SIN) problem and collect web images to train models for each concept. (Anıl Armağan, M.S. thesis, Department of Computer Engineering, Bilkent University, Ankara, 2014.)
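
    As a rough guess at the descriptor's flavor (the thesis's exact construction may differ): histogram the displacement magnitudes of dense trajectories over short temporal snippets, so rapid motions show up as mass in the high-magnitude bins:

```python
import numpy as np

def snippet_histograms(tracks, snippet_len=5, n_bins=8):
    """tracks: array (T, L, 2) of L-point (x, y) trajectories. Builds a
    histogram of displacement magnitudes over short snippets; a rough
    stand-in for the thesis's trajectory snippet histograms."""
    disp = np.diff(tracks, axis=1)              # per-step displacement
    mag = np.linalg.norm(disp, axis=2)          # (T, L-1) speeds
    hists = []
    for s in range(0, mag.shape[1] - snippet_len + 1, snippet_len):
        chunk = mag[:, s:s + snippet_len].ravel()
        h, _ = np.histogram(chunk, bins=n_bins, range=(0.0, 5.0))
        hists.append(h / max(h.sum(), 1))       # normalize each snippet
    return np.concatenate(hists)
```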

    Statistical learning methods for functional data with applications to prediction, classification and outlier detection

    In the era of big data, Functional Data Analysis has become increasingly important insofar as it constitutes a powerful tool for tackling inference problems in statistics. In this thesis we propose several methods aimed at solving problems of time series prediction, classification, and outlier detection from a functional approach. The thesis is organized as follows. In Chapter 1 we introduce the concept of functional data and give an overview of the thesis. In Chapter 2 we present the theoretical framework used to develop the proposed methodologies. In Chapters 3 and 4, two new ordering mappings for functional data are proposed. The first is a kernel depth measure, which satisfies the corresponding theoretical properties, while the second is an entropy measure. In both cases we propose parametric and non-parametric estimation methods that allow us to define an order on the data set at hand; a natural application of these measures is the identification of atypical observations (functions). In Chapter 5 we study the Functional Autoregressive Hilbertian model. We also propose a new family of basis functions for the estimation and prediction of the aforementioned model, which belong to a reproducing kernel Hilbert space; the continuity properties obtained in this space allow us to construct confidence bands for the corresponding predictions over a given time horizon. In order to boost different classification methods, in Chapter 6 we propose a divergence measure for functional data. This metric allows us to determine in which part of the domain two classes of functional data exhibit divergent behavior. This methodology is framed in the field of domain selection and is aimed at solving classification problems by eliminating redundant information. Finally, in Chapter 7, the general conclusions of this work and future research lines are presented. (PhD thesis, Programa de Doctorado en Economía de la Empresa y Métodos Cuantitativos, Universidad Carlos III de Madrid.)
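
    A generic kernel depth for sampled curves, in the spirit of Chapter 3 but not the thesis's exact estimator: a curve is deep if it is, on average, similar to the other curves, and the lowest-depth curves are flagged as atypical:

```python
import numpy as np

def kernel_depth(curves, h=1.0):
    """curves: array (n, m) of n functions sampled on a common grid.
    Depth of each curve = average Gaussian similarity to the others."""
    d2 = ((curves[:, None, :] - curves[None, :, :]) ** 2).mean(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))
    np.fill_diagonal(K, 0.0)                 # exclude self-similarity
    return K.sum(axis=1) / (len(curves) - 1)

# usage: flag the lowest-depth curves as potential outliers
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
curves = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((30, 50))
curves[0] += 1.5                             # an atypical curve
outliers = np.argsort(kernel_depth(curves))[:3]
```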

    Uncertainty Estimation: single forward pass methods and applications in Active Learning

    Machine Learning (ML) models are now powerful enough to be used in complex automated decision-making settings such as autonomous driving and medical diagnosis. Despite being very accurate in general, these models do still make mistakes. A critical factor in being able to depend on such models is that they can quantify the uncertainty of their predictions, and it is paramount that users of the model take this uncertainty into account. Unfortunately, deep learning models cannot readily express their uncertainty, rendering them unsafe for many real-world applications. Bayesian modelling provides a mathematical framework for learning models that can express their uncertainty. However, exact Bayesian methods are computationally expensive to learn and evaluate, and approximate methods often reduce accuracy or are still prohibitively expensive. Meanwhile, ML models continue to grow in number of parameters, so one has to decide between being (more) Bayesian and using a larger model; so far, the decision has always fallen in favour of larger models. Instead of building on Bayesian methods, we deconstruct uncertainty estimation and formulate desiderata on which we base our work throughout the thesis (Chapter 1). In Chapter 3, we introduce a new model (DUQ) that is able to estimate uncertainty in a single forward pass by carefully constructing the model's parameter and output space based on the desiderata. We then extend this model in Chapter 4 (DUE) by placing it in the framework provided by Deep Kernel Learning. This enables the model to work well for both classification and regression tasks (as opposed to just classification) and to estimate uncertainty over a batch of inputs jointly. Both models are competitive with standard softmax models in terms of accuracy and speed, while having significantly improved uncertainty estimation. We additionally consider the problem of Active Learning (AL), where the goal is to maximise label efficiency by selecting only the most informative data points to be labelled. In Section 4.5, we evaluate the DUE model in AL for personalised healthcare. Here, the labelled dataset needs to adhere to specific assumptions made in causal inference, which makes this a challenging problem. In Chapter 5, we look at AL in the batch setting. We show that current methods do not select diverse batches of data, and we introduce a principled method to overcome this issue. Building upon deep kernel learning, this thesis provides a compelling foundation for single forward pass uncertainty and advances the state of the art in active learning. In the conclusions (Section 6, and at the end of each chapter), we discuss how users of ML models could make use of these tools for making sound and confident decisions.
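
    The core single-forward-pass mechanism in DUQ is distance-based: a feature vector is compared to learned class centroids through an RBF kernel, and low similarity to every centroid signals uncertainty. A minimal numpy sketch with illustrative names and shapes:

```python
import numpy as np

def duq_uncertainty(features, centroids, sigma=0.5):
    """features: (n, d) encoder outputs; centroids: (n_classes, d) learned
    class centroids. In the spirit of DUQ: RBF similarity per class."""
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))   # (n, n_classes) kernel scores
    pred = K.argmax(axis=1)              # predicted class
    certainty = K.max(axis=1)            # low max-similarity = uncertain
    return pred, certainty
```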