
    Domain Generalization by Marginal Transfer Learning

    In the problem of domain generalization (DG), there are labeled training data sets from several related prediction problems, and the goal is to make accurate predictions on future unlabeled data sets that are not known to the learner. This problem arises in several applications where data distributions fluctuate because of environmental, technical, or other sources of variation. We introduce a formal framework for DG and argue that it can be viewed as a kind of supervised learning problem by augmenting the original feature space with the marginal distribution of feature vectors. While our framework has several connections to conventional analysis of supervised learning algorithms, several unique aspects of DG require new methods of analysis. This work lays the learning-theoretic foundations of domain generalization, building on our earlier conference paper where the problem of DG was introduced (Blanchard et al., 2011). We present two formal models of data generation, corresponding notions of risk, and distribution-free generalization error analysis. By focusing our attention on kernel methods, we also provide more quantitative results and a universally consistent algorithm. An efficient implementation is provided for this algorithm, which is experimentally compared to a pooling strategy on one synthetic and three real-world data sets.
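    The augmented-feature idea can be sketched with a product kernel on pairs (P, x), where the marginal P is represented by its empirical kernel mean embedding. This is a minimal illustration under assumed choices (a plain RBF kernel and these function names), not the paper's exact construction:

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix between the rows of A and the rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def marginal_kernel(X1, X2, gamma=1.0):
    # inner product of the empirical kernel mean embeddings of two samples,
    # i.e. the average pairwise kernel value (a similarity between marginals)
    return rbf(X1, X2, gamma).mean()

def augmented_kernel(x1, X1, x2, X2, gamma=1.0):
    # product kernel on the augmented feature space (P, x):
    # K((P1, x1), (P2, x2)) = k_P(P1, P2) * k_x(x1, x2)
    kP = marginal_kernel(X1, X2, gamma)
    kx = rbf(x1[None, :], x2[None, :], gamma)[0, 0]
    return kP * kx
```

    A standard kernel machine trained with such a kernel then sees each point together with the sample it came from, which is the sense in which DG becomes a supervised learning problem on an augmented space.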

    Machine Learning for Flow Cytometry Data Analysis.

    This thesis concerns the problem of automatic flow cytometry data analysis. Flow cytometry is a technique for rapid cell analysis that is widely used in many biomedical and clinical laboratories. Quantitative measurements from a flow cytometer provide rich information about various physical and chemical characteristics of a large number of cells. In clinical applications, flow cytometry data is visualized on a sequence of two-dimensional scatter plots and analyzed through a manual process called “gating”. This conventional analysis process requires a large amount of time and labor and is highly subjective and inefficient. In this thesis, we present novel machine learning methods for flow cytometry data analysis to address these issues. We begin with a method for generating a high-dimensional flow cytometry dataset from multiple low-dimensional datasets. We present an imputation algorithm based on clustering and show that it improves upon a simple nearest-neighbor approach that often induces spurious clusters in the imputed data. This technique enables the analysis of multi-dimensional flow cytometry data beyond the fundamental measurement limits of instruments. We then present two machine learning methods for automatic gating problems. Gating is a process of identifying interesting subsets of cell populations. Pathologists make clinical decisions by inspecting the results from gating. Unfortunately, this process is performed manually in most clinical settings and poses many challenges in high-throughput analysis. The first approach is an unsupervised learning technique based on multivariate mixture models. Since measurements from a flow cytometer are often censored and truncated, standard model-fitting algorithms can cause biases and lead to poor gating results. We propose novel algorithms for fitting multivariate Gaussian mixture models to data that is truncated, censored, or truncated and censored.
    Our second approach is a transfer learning technique combined with the low-density separation principle. Unlike conventional unsupervised learning approaches, this method can leverage existing datasets previously gated by domain experts to automatically gate a new flow cytometry dataset. Moreover, the proposed algorithm can adaptively account for biological variations in multiple datasets. We demonstrate these techniques on clinical flow cytometry data and evaluate their effectiveness.
    Ph.D. thesis, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89818/1/gyemin_1.pd
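    The clustering-based imputation can be sketched as follows. This is a minimal illustration assuming a plain k-means on the markers shared between files (the function names and k-means stand-in are assumptions, not the thesis's exact algorithm):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # minimal k-means, used here as an illustrative clustering step
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def cluster_impute(ref, query_shared, shared_idx, missing_idx, k=2):
    # cluster the reference file on the markers shared with the query, then
    # fill each query cell's missing markers with its cluster's mean, rather
    # than copying a single nearest neighbor (which can create spurious
    # clusters in the imputed data)
    labels, centers = kmeans(ref[:, shared_idx], k)
    q_lab = ((query_shared[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
    means = np.stack([ref[labels == j][:, missing_idx].mean(0) for j in range(k)])
    return means[q_lab]
```

    Averaging within a cell population smooths out the per-neighbor noise that a nearest-neighbor fill would copy verbatim into the imputed markers.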

    Determinants of Esports Highlight Viewership: The Case of League of Legends Champions Korea

    Studies on esports league demand via new media platforms remain limited. This paper is the first to identify determinants of esports highlight viewership. Using set-level highlight view counts from YouTube, we analyze various determinants to explain view counts. We found that the number of kills, playoff games, the age of a video clip, second-round games, and third-set games are positively correlated with view counts. Outcome uncertainty and upset results do not affect view counts. We interpret these results as follows: because highlight clips are released after a game has finished, viewers already know the result when deciding whether to watch. Alternatively, relatively short highlight videos reduce the opportunity cost for fans, so fans may not care much about game outcomes.

    Missed a live match? Determinants of League of Legends Champions Korea highlights viewership

    This research aims to explore the determinants of League of Legends Champions Korea (LCK) highlight views and comment counts. Views and comment counts for 629 game highlights across seven tournaments were collected from YouTube. The highlight views and comment counts were regressed on a series of before-the-game factors (outcome uncertainty and game quality), after-the-game factors (sum and difference of kills, assists, multiple kills, and upset results), and match-related characteristics (game duration, evening game, and clip recentness). A multi-level least squares dummy variable regression was conducted to test the model. Among the before-the-game factors, outcome uncertainty and game quality were significantly associated with highlight views and comment counts, indicating that fans like watching games with uncertain outcomes and those involving high-quality teams. Among the after-the-game factors, an upset result was a significant determinant of esports highlight views and comment counts; thus, fans enjoy watching underdogs win. Finally, the sum of kills and assists affected only view counts, indicating that fans prefer watching offensive games with more kills and solo performances over teamwork.
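    The dummy-variable part of such a model can be sketched in a few lines. The variable names below are illustrative, and this ordinary-least-squares sketch omits the multi-level error structure used in the paper:

```python
import numpy as np

def lsdv(y, X, group):
    # least squares dummy variable regression: common slopes for the
    # covariates in X plus one fixed-effect intercept per group
    groups = np.unique(group)
    D = (group[:, None] == groups[None, :]).astype(float)  # group dummies
    Z = np.hstack([X, D])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[: X.shape[1]], beta[X.shape[1] :]  # slopes, group effects
```

    Here one would regress, say, highlight view counts on the game-level covariates with one dummy per tournament, so tournament-specific popularity is absorbed by the fixed effects.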

    Hierarchical Clustering Using One-Class Support Vector Machines

    This paper presents a novel hierarchical clustering method using support vector machines. A common approach to hierarchical clustering is based on inter-cluster distances. However, different choices for computing inter-cluster distances often lead to fairly distinct clustering outcomes, causing interpretation difficulties in practice. In this paper, we propose to use a one-class support vector machine (OC-SVM) to directly find high-density regions of the data. Our algorithm generates nested set estimates using the OC-SVM and exploits the hierarchical structure of the estimated sets. We demonstrate the proposed algorithm on synthetic datasets; the cluster hierarchy is visualized with dendrograms and spanning trees.
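    The idea of a cluster hierarchy from nested density level sets can be illustrated with a plain kernel density estimate standing in for the OC-SVM: thresholding the density at decreasing levels yields nested sets, and their connected components form the hierarchy. The KDE stand-in, the linking radius, and the function names are assumptions for illustration:

```python
import numpy as np

def kde(X, pts, h=0.5):
    # Gaussian kernel density estimate of X, evaluated at pts
    d2 = ((pts[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * h * h)).mean(1)

def components(pts, r=1.0):
    # number of connected components of the graph linking points within r
    n = len(pts)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    for i in range(n):
        for j in range(i + 1, n):
            if d[i, j] <= r:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

def level_set_tree(X, levels, h=0.5, r=1.0):
    # nested level sets {x : f(x) >= lam}; clusters are their components
    f = kde(X, X, h)
    return [(lam, X[f >= lam], components(X[f >= lam], r)) for lam in levels]
```

    Sweeping the level from high to low and tracking when components appear and merge is what produces the dendrogram; the OC-SVM replaces the KDE with a set estimate learned directly.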

    Nested Support Vector Machines

    The one-class and cost-sensitive support vector machines (SVMs) are state-of-the-art machine learning methods for estimating density level sets and solving weighted classification problems, respectively. However, the solutions of these SVMs do not necessarily produce set estimates that are nested as the parameters controlling the density level or cost asymmetry are continuously varied. Such a nesting constraint is desirable for applications requiring the simultaneous estimation of multiple sets, including clustering, anomaly detection, and ranking problems. We propose new quadratic programs whose solutions give rise to nested extensions of the one-class and cost-sensitive SVMs. Furthermore, like conventional SVMs, the solution paths in our construction are piecewise linear in the control parameters, with significantly fewer breakpoints. We also describe decomposition algorithms to solve the quadratic programs. These methods are compared to conventional SVMs on synthetic and benchmark data sets and are shown to exhibit more stable rankings and decreased sensitivity to parameter settings.
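    On a finite sample the nesting constraint itself is easy to state. The sketch below only checks nestedness of a family of set estimates and crudely repairs a violation by cumulative union; it is NOT the paper's method, which instead formulates a single quadratic program whose solutions are nested by construction:

```python
import numpy as np

def is_nested(masks):
    # masks[i]: boolean membership of each sample point in the i-th set
    # estimate, ordered from the smallest set (highest density level or
    # cost asymmetry) to the largest; nesting holds when every set
    # contains the one before it
    return all((~a | b).all() for a, b in zip(masks, masks[1:]))

def enforce_nesting(masks):
    # crude post-hoc repair by cumulative union (illustrative only)
    out, cur = [], np.zeros_like(masks[0])
    for m in masks:
        cur = cur | m
        out.append(cur.copy())
    return out
```

    Independently trained one-class or cost-sensitive SVMs at different parameter values can fail this check, which is the motivation for solving for the whole nested family jointly.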