Semi-supervised novelty detection
A common setting for novelty detection assumes that labeled examples
from the nominal class are available, but that labeled examples of novelties
are unavailable. The standard (inductive) approach is to declare novelties
where the nominal density is low, which reduces the problem to density level
set estimation. In this paper, we consider the setting where an unlabeled and
possibly contaminated sample is also available at learning time. We argue
that novelty detection in this semi-supervised setting is naturally solved by
a general reduction to a binary classification problem. In particular, a
detector with a desired false positive rate can be achieved through a
reduction to Neyman-Pearson classification. Unlike the inductive approach,
semi-supervised novelty detection (SSND) yields detectors that are optimal
(e.g., statistically consistent) regardless of the distribution on novelties.
Therefore, in novelty detection, unlabeled data have a substantial impact on
the theoretical properties of the decision rule. We validate the practical
utility of SSND with an extensive experimental study. We also show that SSND
provides distribution-free, learning-theoretic solutions to two well known
problems in hypothesis testing. First, our results provide a general solution
to the two-sample problem, that is, the problem of determining
whether two random samples arise from the same distribution. Second, a
specialization of SSND coincides with the standard p-value approach to
multiple testing under the so-called random effects model. Unlike standard
rejection regions based on thresholded p-values, the general SSND framework
allows for adaptation to arbitrary alternative distributions.
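The reduction described above can be illustrated with a minimal synthetic sketch, assuming scikit-learn is available. The data, the logistic-regression classifier, and the quantile-based threshold calibration are illustrative stand-ins, not the paper's exact Neyman-Pearson construction:

```python
# Hypothetical sketch: semi-supervised novelty detection (SSND) via a
# reduction to binary classification, with a threshold calibrated on
# nominal data to target a false positive rate (Neyman-Pearson spirit).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Nominal sample (label 0) and an unlabeled, contaminated sample (label 1):
# the unlabeled data mixes nominal points with novelties drawn elsewhere.
nominal = rng.normal(0.0, 1.0, size=(500, 2))
unlabeled = np.vstack([
    rng.normal(0.0, 1.0, size=(400, 2)),   # nominal component
    rng.normal(4.0, 1.0, size=(100, 2)),   # novelties
])

X = np.vstack([nominal, unlabeled])
y = np.concatenate([np.zeros(len(nominal)), np.ones(len(unlabeled))])

clf = LogisticRegression().fit(X, y)

# Calibrate the threshold so that roughly a fraction alpha of nominal
# points score above it (approximate false positive rate control).
alpha = 0.05
scores_nominal = clf.predict_proba(nominal)[:, 1]
threshold = np.quantile(scores_nominal, 1.0 - alpha)

def detect(points):
    """Flag points whose classifier score exceeds the threshold as novelties."""
    return clf.predict_proba(points)[:, 1] > threshold

fpr = detect(nominal).mean()   # should be close to alpha
```

With well-separated novelties, most of the contaminating points in `unlabeled` are flagged while the nominal false positive rate stays near `alpha`.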
Domain Generalization by Marginal Transfer Learning
In the problem of domain generalization (DG), there are labeled training data
sets from several related prediction problems, and the goal is to make accurate
predictions on future unlabeled data sets that are not known to the learner.
This problem arises in several applications where data distributions fluctuate
because of environmental, technical, or other sources of variation. We
introduce a formal framework for DG, and argue that it can be viewed as a kind
of supervised learning problem by augmenting the original feature space with
the marginal distribution of feature vectors. While our framework has several
connections to conventional analysis of supervised learning algorithms, several
unique aspects of DG require new methods of analysis.
This work lays the learning theoretic foundations of domain generalization,
building on our earlier conference paper where the problem of DG was introduced
(Blanchard et al., 2011). We present two formal models of data generation,
corresponding notions of risk, and distribution-free generalization error
analysis. By focusing our attention on kernel methods, we also provide more
quantitative results and a universally consistent algorithm. An efficient
implementation is provided for this algorithm, which is experimentally compared
to a pooling strategy on one synthetic and three real-world data sets.
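The feature-augmentation idea can be sketched on synthetic data. Here the dataset-level feature mean is a crude stand-in for the kernel mean embedding of the marginal distribution, and the shifting label rule is invented for illustration:

```python
# Hypothetical sketch of marginal transfer learning: augment each feature
# vector with a summary of its dataset's marginal distribution, then train
# one supervised learner over the augmented space.
import numpy as np
from sklearn.linear_model import LogisticRegression

def augment(X):
    """Append the dataset-level feature mean to every point in X."""
    marginal = X.mean(axis=0)
    return np.hstack([X, np.tile(marginal, (len(X), 1))])

rng = np.random.default_rng(1)

# Several related training datasets whose distributions shift; the label
# rule moves with the dataset's mean, so the marginal is informative.
train_X, train_y = [], []
for shift in (-2.0, 0.0, 2.0):
    X = rng.normal(shift, 1.0, size=(200, 2))
    y = (X[:, 0] > shift).astype(int)   # decision threshold tracks the shift
    train_X.append(augment(X))
    train_y.append(y)

clf = LogisticRegression().fit(np.vstack(train_X), np.concatenate(train_y))

# At test time a new, unlabeled dataset is augmented the same way.
X_new = rng.normal(1.0, 1.0, size=(200, 2))
acc = (clf.predict(augment(X_new)) == (X_new[:, 0] > 1.0)).mean()
```

A pooling strategy that ignores the marginal cannot represent the shifting threshold, whereas the augmented learner can express the rule "x exceeds its own dataset's mean" linearly.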
Machine Learning for Flow Cytometry Data Analysis.
This thesis concerns the problem of automatic flow cytometry data analysis. Flow cytometry
is a technique for rapid cell analysis and is widely used in many biomedical and clinical laboratories. Quantitative measurements from a flow cytometer provide rich information about various physical and chemical characteristics of a large number of cells. In clinical applications, flow cytometry data is visualized on a sequence of two-dimensional scatter plots and analyzed through a manual process called “gating”. This conventional analysis process requires a large amount of time and labor and is highly subjective and inefficient. In this thesis, we present novel machine learning methods for flow cytometry data analysis that address these issues.
We begin with a method for generating a high-dimensional flow cytometry dataset from multiple low-dimensional datasets. We present an imputation algorithm based on clustering and show that it improves upon a simple nearest-neighbor approach that often induces spurious clusters in the imputed data. This technique enables the analysis of multi-dimensional flow cytometry data beyond the fundamental measurement limits of instruments.
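A minimal synthetic sketch of the cluster-based imputation idea follows. The two-tube setup, marker names, and K-means choice are illustrative assumptions, not the thesis algorithm:

```python
# Hypothetical sketch: merging two flow cytometry "tubes" that share a
# marker. Cluster on the shared marker, then fill each cell's missing
# marker with its cluster's mean instead of copying one nearest neighbor.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Two cell populations with distinct levels of markers A and B.
centers_a = np.array([1.0, 5.0])
centers_b = np.array([2.0, 8.0])

# Tube 2 measures both markers.
pop2 = rng.integers(0, 2, size=600)
tube2_a = centers_a[pop2] + rng.normal(0, 0.3, 600)
tube2_b = centers_b[pop2] + rng.normal(0, 0.3, 600)

# Tube 1 measures only marker A; marker B must be imputed.
pop1 = rng.integers(0, 2, size=400)
tube1_a = centers_a[pop1] + rng.normal(0, 0.3, 400)

# Cluster tube 2 on the shared marker A; record mean B per cluster.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tube2_a[:, None])
cluster_b_mean = np.array([tube2_b[km.labels_ == k].mean() for k in range(2)])

# Impute marker B for tube 1 cells from their cluster's mean B.
imputed_b = cluster_b_mean[km.predict(tube1_a[:, None])]
```

Because every cell in a cluster receives the same smoothed value, this avoids the spurous satellite clusters that copying individual neighbors can create.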
We then present two machine learning methods for automatic gating problems. Gating is a process of identifying interesting subsets of cell populations. Pathologists make clinical decisions by inspecting the results from gating. Unfortunately, this process is performed manually in most clinical settings and poses many challenges in high-throughput analysis.
The first approach is an unsupervised learning technique based on multivariate mixture models. Since measurements from a flow cytometer are often censored and truncated, standard model-fitting algorithms can cause biases and lead to poor gating results. We propose novel algorithms for fitting multivariate Gaussian mixture models to data that is truncated, censored, or truncated and censored.
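The bias that motivates these algorithms is easy to demonstrate on synthetic data. The sketch below only illustrates why naive fitting fails under censoring; the corrected EM algorithms from the thesis are not reproduced here:

```python
# Sketch: censoring biases standard estimates. Values below an
# instrument's detection limit are recorded at the limit itself,
# piling probability mass on the boundary and shifting the naive mean.
import numpy as np

rng = np.random.default_rng(3)
true_mean = 0.0
x = rng.normal(true_mean, 1.0, size=5000)

limit = -0.5                       # hypothetical detection limit
x_censored = np.maximum(x, limit)  # censored observations

naive_mean = x_censored.mean()     # biased upward, away from true_mean
```

For a standard normal censored at -0.5, the naive mean comes out near 0.2 rather than 0, so any mixture-fitting procedure that ignores the censoring mechanism inherits a comparable bias in its component estimates.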
Our second approach is a transfer learning technique combined with the low-density
separation principle. Unlike conventional unsupervised learning approaches, this method
can leverage existing datasets previously gated by domain experts to automatically gate a
new flow cytometry dataset. Moreover, the proposed algorithm can adaptively account for biological variations across multiple datasets.
We demonstrate these techniques on clinical flow cytometry data and evaluate their
effectiveness.

Ph.D. Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/89818/1/gyemin_1.pd
Determinants of Esports Highlight Viewership: The Case of League of Legends Champions Korea
Studies on esports league demand via new media platforms remain limited. This paper is the first to identify determinants of esports highlight viewership. Using set-level highlight view counts from YouTube, we analyze various determinants to explain view counts. We find that the number of kills, playoff games, age of the video clip, second-round games, and third sets are positively correlated with view counts. Outcome uncertainty and upset results do not affect view counts. We interpret these results as follows: because highlight clips are released after a game has finished, viewers can already know the result when deciding whether to watch. Alternatively, relatively short highlight videos reduce the opportunity cost for fans, so fans may not care much about game outcomes.
Missed a live match? Determinants of League of Legends Champions Korea highlights viewership
This research aims to explore the determinants of the League of Legends Champions Korea (LCK) highlight views and comment counts. The data of 629 game highlight views and comment counts for seven tournaments were collected from YouTube. The highlight views and comment counts were regressed on a series of before-the-game factors (outcome uncertainty and game quality), after-the-game factors (sum and difference of kills, assists, multiple kills, and upset results), and match-related characteristics (game duration, evening game, and clip recentness). A multi-level least-squares dummy variable regression was conducted to test the model. Among the before-the-game factors, outcome uncertainty and game quality were significantly associated with highlight views and comment counts. This indicated that fans liked watching games with uncertain outcomes and those involving high-quality teams. Among the after-the-game factors, an upset result was a significant determinant of esports highlight views and comment counts. Thus, fans enjoy watching underdogs win. Finally, the sum of kills and assists affected only view counts, indicating that fans prefer watching offensive games with more kills and solo performances rather than teamwork.
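The least-squares dummy variable (LSDV) design used above can be sketched as follows. The data are synthetic and the single covariate is illustrative; the point is only the mechanics of absorbing tournament-level effects with dummy columns:

```python
# Hypothetical LSDV sketch: log view counts regressed on a game covariate
# (kills) plus one dummy per tournament, estimated by ordinary least squares.
import numpy as np

rng = np.random.default_rng(5)
n = 300
tournament = rng.integers(0, 7, size=n)          # seven tournaments
kills = rng.poisson(25, size=n).astype(float)

# Synthetic ground truth: kills effect 0.05 plus a tournament fixed effect.
tour_effect = np.linspace(10.0, 16.0, 7)[tournament]
log_views = 0.05 * kills + tour_effect + rng.normal(0, 0.2, n)

# Design matrix: kills column plus seven dummy columns (no global intercept).
dummies = np.eye(7)[tournament]
X = np.column_stack([kills, dummies])
beta, *_ = np.linalg.lstsq(X, log_views, rcond=None)
kills_coef = beta[0]                             # recovers ~0.05
```

The dummy columns soak up tournament-level popularity differences, so `kills_coef` isolates the within-tournament association between kills and viewership.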
Hierarchical Clustering Using One-Class Support Vector Machines
This paper presents a novel hierarchical clustering method using support vector machines. A common approach to hierarchical clustering relies on inter-cluster distances. However, different choices for computing inter-cluster distances often lead to fairly distinct clustering outcomes, causing interpretation difficulties in practice. In this paper, we propose to use a one-class support vector machine (OC-SVM) to directly find high-density regions of data. Our algorithm generates nested set estimates using the OC-SVM and exploits the hierarchical structure of the estimated sets. We demonstrate the proposed algorithm on synthetic datasets. The cluster hierarchy is visualized with dendrograms and spanning trees.
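The level-set sweep behind this idea can be sketched with scikit-learn's `OneClassSVM`. The data, `nu` grid, and kernel width are illustrative; the paper's nesting and dendrogram construction is more careful than this one-parameter sweep:

```python
# Hypothetical sketch: sweep the OC-SVM's nu parameter to estimate a
# family of density regions. Larger nu discards more points, so the
# estimated support shrinks toward high-density cores of the clusters.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)

# Two well-separated Gaussian clusters.
X = np.vstack([
    rng.normal(-3.0, 0.5, size=(150, 2)),
    rng.normal(3.0, 0.5, size=(150, 2)),
])

fractions_kept = []
for nu in (0.05, 0.3, 0.6):
    inside = OneClassSVM(nu=nu, gamma=0.5).fit(X).predict(X) == 1
    fractions_kept.append(inside.mean())   # shrinks as nu grows
```

Tracking how the connected components of these shrinking regions split as the level rises is what yields the cluster hierarchy.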
Nested Support Vector Machines
The one-class and cost-sensitive support vector machines (SVMs) are state-of-the-art machine learning methods for estimating density level sets and solving weighted classification problems, respectively. However, the solutions of these SVMs do not necessarily produce set estimates that are nested as the parameters controlling the density level or cost asymmetry are continuously varied. Such a nesting constraint is desirable for applications requiring the simultaneous estimation of multiple sets, including clustering, anomaly detection, and ranking problems. We propose new quadratic programs whose solutions give rise to nested extensions of the one-class and cost-sensitive SVMs. Furthermore, like conventional SVMs, the solution paths in our construction are piecewise linear in the control parameters, with significantly fewer breakpoints. We also describe decomposition algorithms to solve the quadratic programs. These methods are compared to conventional SVMs on synthetic and benchmark data sets, and are shown to exhibit more stable rankings and decreased sensitivity to parameter settings.
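To make the nesting issue concrete, here is a naive post hoc workaround: intersect successive OC-SVM region estimates so that nesting holds by construction. The paper instead builds the constraint into new quadratic programs, which this sketch does not reproduce:

```python
# Hypothetical sketch: forcing nested set estimates by intersecting
# independent OC-SVM fits at increasing nu (a crude stand-in for the
# paper's nested quadratic programs).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X = rng.normal(0.0, 1.0, size=(300, 2))

estimates = []
current = np.ones(len(X), dtype=bool)
for nu in (0.1, 0.4, 0.7):          # increasing nu -> higher density level
    inside = OneClassSVM(nu=nu, gamma=0.5).fit(X).predict(X) == 1
    current = current & inside       # intersect to guarantee nesting
    estimates.append(current.copy())
```

Independent fits at different `nu` values need not nest on their own, which is exactly the defect the paper's construction removes without resorting to this kind of ad hoc intersection.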