Semi-supervised novelty detection
A common setting for novelty detection assumes that labeled examples
from the nominal class are available, but that labeled examples of novelties
are unavailable. The standard (inductive) approach is to declare novelties
where the nominal density is low, which reduces the problem to density level
set estimation. In this paper, we consider the setting where an unlabeled and
possibly contaminated sample is also available at learning time. We argue
that novelty detection in this semi-supervised setting is naturally solved by
a general reduction to a binary classification problem. In particular, a
detector with a desired false positive rate can be achieved through a
reduction to Neyman-Pearson classification. Unlike the inductive approach,
semi-supervised novelty detection (SSND) yields detectors that are optimal
(e.g., statistically consistent) regardless of the distribution on novelties.
Therefore, in novelty detection, unlabeled data have a substantial impact on
the theoretical properties of the decision rule. We validate the practical
utility of SSND with an extensive experimental study. We also show that SSND
provides distribution-free, learning-theoretic solutions to two well known
problems in hypothesis testing. First, our results provide a general solution
to the two-sample problem, that is, the problem of determining
whether two random samples arise from the same distribution. Second, a
specialization of SSND coincides with the standard p-value approach to
multiple testing under the so-called random effects model. Unlike standard
rejection regions based on thresholded p-values, the general SSND framework
allows for adaptation to arbitrary alternative distributions.
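The reduction described above can be illustrated with a minimal synthetic sketch, assuming scikit-learn is available. The data, the logistic-regression classifier, and the quantile-based threshold calibration are illustrative stand-ins, not the paper's exact Neyman-Pearson construction:

```python
# Hypothetical sketch: semi-supervised novelty detection (SSND) via a
# reduction to binary classification, with a threshold calibrated on
# nominal data to target a false positive rate (Neyman-Pearson spirit).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Nominal sample (label 0) and an unlabeled, contaminated sample (label 1):
# the unlabeled data mixes nominal points with novelties drawn elsewhere.
nominal = rng.normal(0.0, 1.0, size=(500, 2))
unlabeled = np.vstack([
    rng.normal(0.0, 1.0, size=(400, 2)),   # nominal component
    rng.normal(4.0, 1.0, size=(100, 2)),   # novelties
])

X = np.vstack([nominal, unlabeled])
y = np.concatenate([np.zeros(len(nominal)), np.ones(len(unlabeled))])

clf = LogisticRegression().fit(X, y)

# Calibrate the threshold so that roughly a fraction alpha of nominal
# points score above it (approximate false positive rate control).
alpha = 0.05
scores_nominal = clf.predict_proba(nominal)[:, 1]
threshold = np.quantile(scores_nominal, 1.0 - alpha)

def detect(points):
    """Flag points whose classifier score exceeds the threshold as novelties."""
    return clf.predict_proba(points)[:, 1] > threshold

fpr = detect(nominal).mean()   # should be close to alpha
```

With well-separated novelties, most of the contaminating points in `unlabeled` are flagged while the nominal false positive rate stays near `alpha`.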
Domain Generalization by Marginal Transfer Learning
In the problem of domain generalization (DG), there are labeled training data
sets from several related prediction problems, and the goal is to make accurate
predictions on future unlabeled data sets that are not known to the learner.
This problem arises in several applications where data distributions fluctuate
because of environmental, technical, or other sources of variation. We
introduce a formal framework for DG, and argue that it can be viewed as a kind
of supervised learning problem by augmenting the original feature space with
the marginal distribution of feature vectors. While our framework has several
connections to conventional analysis of supervised learning algorithms, several
unique aspects of DG require new methods of analysis.
This work lays the learning theoretic foundations of domain generalization,
building on our earlier conference paper where the problem of DG was introduced
(Blanchard et al., 2011). We present two formal models of data generation,
corresponding notions of risk, and distribution-free generalization error
analysis. By focusing our attention on kernel methods, we also provide more
quantitative results and a universally consistent algorithm. An efficient
implementation is provided for this algorithm, which is experimentally compared
to a pooling strategy on one synthetic and three real-world data sets.
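The feature-augmentation idea can be sketched on synthetic data. Here the dataset-level feature mean is a crude stand-in for the kernel mean embedding of the marginal distribution, and the shifting label rule is invented for illustration:

```python
# Hypothetical sketch of marginal transfer learning: augment each feature
# vector with a summary of its dataset's marginal distribution, then train
# one supervised learner over the augmented space.
import numpy as np
from sklearn.linear_model import LogisticRegression

def augment(X):
    """Append the dataset-level feature mean to every point in X."""
    marginal = X.mean(axis=0)
    return np.hstack([X, np.tile(marginal, (len(X), 1))])

rng = np.random.default_rng(1)

# Several related training datasets whose distributions shift; the label
# rule moves with the dataset's mean, so the marginal is informative.
train_X, train_y = [], []
for shift in (-2.0, 0.0, 2.0):
    X = rng.normal(shift, 1.0, size=(200, 2))
    y = (X[:, 0] > shift).astype(int)   # decision threshold tracks the shift
    train_X.append(augment(X))
    train_y.append(y)

clf = LogisticRegression().fit(np.vstack(train_X), np.concatenate(train_y))

# At test time a new, unlabeled dataset is augmented the same way.
X_new = rng.normal(1.0, 1.0, size=(200, 2))
acc = (clf.predict(augment(X_new)) == (X_new[:, 0] > 1.0)).mean()
```

A pooling strategy that ignores the marginal cannot represent the shifting threshold, whereas the augmented learner can express the rule "x exceeds its own dataset's mean" linearly.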
Machine Learning for Flow Cytometry Data Analysis.
This thesis concerns the problem of automatic flow cytometry data analysis. Flow cytometry
is a technique for rapid cell analysis and is widely used in many biomedical and clinical laboratories. Quantitative measurements from a flow cytometer provide rich information about various physical and chemical characteristics of a large number of cells. In clinical applications, flow cytometry data is visualized on a sequence of two-dimensional scatter plots and analyzed through a manual process called “gating”. This conventional analysis process requires a large amount of time and labor and is highly subjective and inefficient. In this thesis, we present novel machine learning methods for flow cytometry data analysis that address these issues.
We begin with a method for generating a high-dimensional flow cytometry dataset from multiple low-dimensional datasets. We present an imputation algorithm based on clustering and show that it improves upon a simple nearest-neighbor approach that often induces spurious clusters in the imputed data. This technique enables the analysis of multi-dimensional flow cytometry data beyond the fundamental measurement limits of instruments.
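A minimal synthetic sketch of the cluster-based imputation idea follows. The two-tube setup, marker names, and K-means choice are illustrative assumptions, not the thesis algorithm:

```python
# Hypothetical sketch: merging two flow cytometry "tubes" that share a
# marker. Cluster on the shared marker, then fill each cell's missing
# marker with its cluster's mean instead of copying one nearest neighbor.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Two cell populations with distinct levels of markers A and B.
centers_a = np.array([1.0, 5.0])
centers_b = np.array([2.0, 8.0])

# Tube 2 measures both markers.
pop2 = rng.integers(0, 2, size=600)
tube2_a = centers_a[pop2] + rng.normal(0, 0.3, 600)
tube2_b = centers_b[pop2] + rng.normal(0, 0.3, 600)

# Tube 1 measures only marker A; marker B must be imputed.
pop1 = rng.integers(0, 2, size=400)
tube1_a = centers_a[pop1] + rng.normal(0, 0.3, 400)

# Cluster tube 2 on the shared marker A; record mean B per cluster.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tube2_a[:, None])
cluster_b_mean = np.array([tube2_b[km.labels_ == k].mean() for k in range(2)])

# Impute marker B for tube 1 cells from their cluster's mean B.
imputed_b = cluster_b_mean[km.predict(tube1_a[:, None])]
```

Because every cell in a cluster receives the same smoothed value, this avoids the spurous satellite clusters that copying individual neighbors can create.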
We then present two machine learning methods for automatic gating problems. Gating is a process of identifying interesting subsets of cell populations. Pathologists make clinical decisions by inspecting the results from gating. Unfortunately, this process is performed manually in most clinical settings and poses many challenges in high-throughput analysis.
The first approach is an unsupervised learning technique based on multivariate mixture models. Since measurements from a flow cytometer are often censored and truncated, standard model-fitting algorithms can cause biases and lead to poor gating results. We propose novel algorithms for fitting multivariate Gaussian mixture models to data that is truncated, censored, or truncated and censored.
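The bias that motivates these algorithms is easy to demonstrate on synthetic data. The sketch below only illustrates why naive fitting fails under censoring; the corrected EM algorithms from the thesis are not reproduced here:

```python
# Sketch: censoring biases standard estimates. Values below an
# instrument's detection limit are recorded at the limit itself,
# piling probability mass on the boundary and shifting the naive mean.
import numpy as np

rng = np.random.default_rng(3)
true_mean = 0.0
x = rng.normal(true_mean, 1.0, size=5000)

limit = -0.5                       # hypothetical detection limit
x_censored = np.maximum(x, limit)  # censored observations

naive_mean = x_censored.mean()     # biased upward, away from true_mean
```

For a standard normal censored at -0.5, the naive mean comes out near 0.2 rather than 0, so any mixture-fitting procedure that ignores the censoring mechanism inherits a comparable bias in its component estimates.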
Our second approach is a transfer learning technique combined with the low-density
separation principle. Unlike conventional unsupervised learning approaches, this method
can leverage existing datasets previously gated by domain experts to automatically gate a
new flow cytometry dataset. Moreover, the proposed algorithm can adaptively account for biological variations across multiple datasets.
We demonstrate these techniques on clinical flow cytometry data and evaluate their
effectiveness.

Ph.D. Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/89818/1/gyemin_1.pd
Determinants of Esports Highlight Viewership: The Case of League of Legends Champions Korea
Studies on esports league demand via new media platforms remain limited. This paper is the first to identify determinants of esports highlight viewership. Using set-level highlight view counts from YouTube, we analyze various determinants to explain view counts. We find that the number of kills, playoff games, age of the video clip, second-round games, and third sets are positively correlated with view counts. Outcome uncertainty and upset results do not affect view counts. We interpret these results as follows: because highlight clips are released after a game has finished, viewers can already know the result when deciding whether to watch. Alternatively, relatively short highlight videos reduce the opportunity cost for fans, so fans may not care much about game outcomes.
Missed a live match? Determinants of League of Legends Champions Korea highlights viewership
This research aims to explore the determinants of the League of Legends Champions Korea (LCK) highlight views and comment counts. The data of 629 game highlight views and comment counts for seven tournaments were collected from YouTube. The highlight views and comment counts were regressed on a series of before-the-game factors (outcome uncertainty and game quality), after-the-game factors (sum and difference of kills, assists, multiple kills, and upset results), and match-related characteristics (game duration, evening game, and clip recentness). A multi-level least-squares dummy variable regression was conducted to test the model. Among the before-the-game factors, outcome uncertainty and game quality were significantly associated with highlight views and comment counts. This indicated that fans liked watching games with uncertain outcomes and those involving high-quality teams. Among the after-the-game factors, an upset result was a significant determinant of esports highlight views and comment counts. Thus, fans enjoy watching underdogs win. Finally, the sum of kills and assists affected only view counts, indicating that fans prefer watching offensive games with more kills and solo performances rather than teamwork.
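The least-squares dummy variable (LSDV) design used above can be sketched as follows. The data are synthetic and the single covariate is illustrative; the point is only the mechanics of absorbing tournament-level effects with dummy columns:

```python
# Hypothetical LSDV sketch: log view counts regressed on a game covariate
# (kills) plus one dummy per tournament, estimated by ordinary least squares.
import numpy as np

rng = np.random.default_rng(5)
n = 300
tournament = rng.integers(0, 7, size=n)          # seven tournaments
kills = rng.poisson(25, size=n).astype(float)

# Synthetic ground truth: kills effect 0.05 plus a tournament fixed effect.
tour_effect = np.linspace(10.0, 16.0, 7)[tournament]
log_views = 0.05 * kills + tour_effect + rng.normal(0, 0.2, n)

# Design matrix: kills column plus seven dummy columns (no global intercept).
dummies = np.eye(7)[tournament]
X = np.column_stack([kills, dummies])
beta, *_ = np.linalg.lstsq(X, log_views, rcond=None)
kills_coef = beta[0]                             # recovers ~0.05
```

The dummy columns soak up tournament-level popularity differences, so `kills_coef` isolates the within-tournament association between kills and viewership.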
Hierarchical Clustering Using One-Class Support Vector Machines
This paper presents a novel hierarchical clustering method using support vector machines. A common approach to hierarchical clustering relies on inter-cluster distances. However, different choices for computing inter-cluster distances often lead to fairly distinct clustering outcomes, causing interpretation difficulties in practice. In this paper, we propose to use a one-class support vector machine (OC-SVM) to directly find high-density regions of data. Our algorithm generates nested set estimates using the OC-SVM and exploits the hierarchical structure of the estimated sets. We demonstrate the proposed algorithm on synthetic datasets. The cluster hierarchy is visualized with dendrograms and spanning trees.
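The level-set sweep behind this idea can be sketched with scikit-learn's `OneClassSVM`. The data, `nu` grid, and kernel width are illustrative; the paper's nesting and dendrogram construction is more careful than this one-parameter sweep:

```python
# Hypothetical sketch: sweep the OC-SVM's nu parameter to estimate a
# family of density regions. Larger nu discards more points, so the
# estimated support shrinks toward high-density cores of the clusters.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)

# Two well-separated Gaussian clusters.
X = np.vstack([
    rng.normal(-3.0, 0.5, size=(150, 2)),
    rng.normal(3.0, 0.5, size=(150, 2)),
])

fractions_kept = []
for nu in (0.05, 0.3, 0.6):
    inside = OneClassSVM(nu=nu, gamma=0.5).fit(X).predict(X) == 1
    fractions_kept.append(inside.mean())   # shrinks as nu grows
```

Tracking how the connected components of these shrinking regions split as the level rises is what yields the cluster hierarchy.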
Nested Support Vector Machines
The one-class and cost-sensitive support vector machines (SVMs) are state-of-the-art machine learning methods for estimating density level sets and solving weighted classification problems, respectively. However, the solutions of these SVMs do not necessarily produce set estimates that are nested as the parameters controlling the density level or cost asymmetry are continuously varied. Such a nesting constraint is desirable for applications requiring the simultaneous estimation of multiple sets, including clustering, anomaly detection, and ranking problems. We propose new quadratic programs whose solutions give rise to nested extensions of the one-class and cost-sensitive SVMs. Furthermore, like conventional SVMs, the solution paths in our construction are piecewise linear in the control parameters, with significantly fewer breakpoints. We also describe decomposition algorithms to solve the quadratic programs. These methods are compared to conventional SVMs on synthetic and benchmark data sets, and are shown to exhibit more stable rankings and decreased sensitivity to parameter settings.
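To make the nesting issue concrete, here is a naive post hoc workaround: intersect successive OC-SVM region estimates so that nesting holds by construction. The paper instead builds the constraint into new quadratic programs, which this sketch does not reproduce:

```python
# Hypothetical sketch: forcing nested set estimates by intersecting
# independent OC-SVM fits at increasing nu (a crude stand-in for the
# paper's nested quadratic programs).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(6)
X = rng.normal(0.0, 1.0, size=(300, 2))

estimates = []
current = np.ones(len(X), dtype=bool)
for nu in (0.1, 0.4, 0.7):          # increasing nu -> higher density level
    inside = OneClassSVM(nu=nu, gamma=0.5).fit(X).predict(X) == 1
    current = current & inside       # intersect to guarantee nesting
    estimates.append(current.copy())
```

Independent fits at different `nu` values need not nest on their own, which is exactly the defect the paper's construction removes without resorting to this kind of ad hoc intersection.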