177 research outputs found
Significance Testing Against the Random Model for Scoring Models on Top k Predictions
Performance at top k predictions, where instances are ranked by a (learned) scoring model, has
been used as an evaluation metric in machine learning for various reasons such as where the entire
corpus is unknown (e.g., the web) or where the results are to be used by a person with limited time or
resources (e.g., ranking financial news stories where the investor only has time to look at relatively
few stories per day). This evaluation metric is primarily used to report whether the performance
of a given method is significantly better than other (baseline) methods. It has not, however, been
used to show whether the result is significant when compared to the simplest of baselines — the
random model. If no models outperform the random model at a given confidence level, then the
results may not be worth reporting. This paper introduces a technique to perform an analysis of the
expected performance of the top k predictions from the random model given k and a p-value on an
evaluation dataset D. The technique is based on the realization that the distribution of the number
of positives seen in the top k predictions follows a hypergeometric distribution, which has well-defined
statistical density functions. As this distribution is discrete, we show that parametric
estimations based on a binomial distribution are almost always in complete agreement with the
discrete distribution and that, if they differ, an interpolation of the discrete bounds gets very close
to the parametric estimations. The technique is demonstrated on results from three prior published
works, in which it clearly shows that even though performance is greatly increased (sometimes over
100%) with respect to the expected performance of the random model (at p = 0.5), these results,
although qualitatively impressive, are not always as significant (p = 0.1) as might be suggested
by the impressive qualitative improvements. The technique is used to show, given k, both how
many positive instances are needed to achieve a specific significance threshold as well as how
significant a given top k performance is. The technique, when used in a more global setting, is able
to identify the crossover points, with respect to k, when a method becomes significant for a given
p. Lastly, the technique is used to generate a complete confidence curve, which shows a general
trend over all k and visually shows where a method is significantly better than the random model
over all values of k.
Information Systems Working Papers Series
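The hypergeometric test the abstract describes fits in a few lines of standard-library Python. The function names and example counts below are illustrative, not taken from the paper; the survival function gives the probability that the random model's top-k list contains at least x positives.

```python
from math import comb

def hypergeom_sf(x, N, K, k):
    """P(X >= x) when k instances are drawn at random from a dataset of
    N instances containing K positives (the random model's top-k list).
    PMF: P(X = i) = C(K, i) * C(N - K, k - i) / C(N, k)."""
    total = comb(N, k)
    return sum(comb(K, i) * comb(N - K, k - i)
               for i in range(x, min(k, K) + 1)) / total

def positives_needed(N, K, k, p):
    """Smallest number of positives in the top k that is significant at level p,
    i.e. too many to attribute to the random model."""
    for x in range(k + 1):
        if hypergeom_sf(x, N, K, k) <= p:
            return x
    return None  # no attainable count reaches significance
```

For example, with 100 instances of which 10 are positive, `positives_needed(100, 10, 10, 0.05)` returns the crossover count at which a top-10 list becomes significant at p = 0.05.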
Confidence Bands for ROC Curves: Methods and an Empirical Study
In this paper we study techniques for generating
and evaluating confidence bands on ROC curves. ROC
curve evaluation is rapidly becoming a commonly used evaluation
metric in machine learning, although evaluating ROC
curves has thus far been limited to studying the area under
the curve (AUC) or generation of one-dimensional confidence
intervals by freezing one variable—the false-positive rate, or
threshold on the classification scoring function. Researchers in
the medical field have long been using ROC curves and have
many well-studied methods for analyzing such curves, including
generating confidence intervals as well as simultaneous
confidence bands. In this paper we introduce these techniques
to the machine learning community and show their empirical
fitness on the Covertype data set—a standard machine learning
benchmark from the UCI repository. We show how some
of these methods work remarkably well, others are too loose,
and that existing machine learning methods for generation
of 1-dimensional confidence intervals do not translate well to
generation of simultanous bands—their bands are too tight.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
A Simple Relational Classifier
We analyze a Relational Neighbor (RN) classifier, a simple relational
predictive model that predicts only based on class labels of related neighbors,
using no learning and no inherent attributes. We show that it performs surprisingly
well by comparing it to more complex models such as Probabilistic Relational
Models and Relational Probability Trees on three data sets from published work.
We argue that a simple model such as this should be used as a baseline to assess
the performance of relational learners.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
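The RN idea above fits in a few lines. This is an unweighted sketch with illustrative names (the paper's version also weights neighbors by link strength); it returns class probabilities estimated purely from labeled neighbors, with no learning and no node attributes.

```python
from collections import Counter

def rn_predict(node, graph, labels):
    """Relational Neighbor sketch: estimate class membership for `node` from
    the class labels of its labeled neighbors. `graph` maps node -> neighbor
    list; `labels` maps a subset of nodes to known classes."""
    votes = Counter(labels[n] for n in graph[node] if n in labels)
    total = sum(votes.values())
    if total == 0:
        return None  # no labeled neighbors to vote
    return {c: v / total for c, v in votes.items()}
```

A baseline this simple is easy to run on any relational dataset before reaching for heavier models.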
Confidence Bands for Roc Curves
In this paper we study techniques for generating and evaluating
confidence bands on ROC curves. ROC curve evaluation is
rapidly becoming a commonly used evaluation metric in machine
learning, although evaluating ROC curves has thus far been limited
to studying the area under the curve (AUC) or generation of
one-dimensional confidence intervals by freezing one variable—
the false-positive rate, or threshold on the classification scoring
function. Researchers in the medical field have long been using
ROC curves and have many well-studied methods for analyzing
such curves, including generating confidence intervals as
well as simultaneous confidence bands. In this paper we introduce
these techniques to the machine learning community and
show their empirical fitness on the Covertype data set—a standard
machine learning benchmark from the UCI repository. We
show how some of these methods work remarkably well, others
are too loose, and that existing machine learning methods for generation
of 1-dimensional confidence intervals do not translate well
to generation of simultaneous bands—their bands are too tight.
Information Systems Working Papers Series
Predicting citation rates for physics papers: Constructing features for an ordered probit model
Gehrke et al. introduce the citation prediction task in their paper "Overview of the KDD Cup 2003" (in this issue). The objective was to predict the change in the number of citations a paper will receive, not the absolute number of citations. There are obvious factors affecting the number of citations, including the quality and the topic of the paper, and the reputation of the authors. However, it is not clear which factors might influence the change in citations between quarters, rendering the construction of predictive features a challenging task. A high-quality and timely paper will be cited more often than a lower-quality paper, but that does not suggest the change in citation counts. The selection of training data was critical, as the evaluation would only be on papers that received more than 5 citations in the quarter following the submission of results. After considering several modeling approaches, we used a modified version of an ordered probit model. We describe each of these steps in turn.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
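For readers unfamiliar with the ordered probit model mentioned above, its per-observation likelihood can be written directly. The cut-point notation below is the standard textbook form, not taken from the paper: category j is observed when the latent index x'b plus Gaussian noise falls between cut points c_{j-1} and c_j.

```python
from math import erf, log, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ordered_probit_loglik(y, xb, cuts):
    """Log-likelihood of one observation in an ordered probit:
    P(Y = j) = Phi(c_j - x'b) - Phi(c_{j-1} - x'b),
    with c_{-1} = -inf and c_J = +inf, and y an ordinal category in 0..J."""
    c = [float("-inf")] + sorted(cuts) + [float("inf")]
    return log(norm_cdf(c[y + 1] - xb) - norm_cdf(c[y] - xb))
```

Fitting amounts to maximizing the sum of these terms over observations with respect to the coefficients and the cut points.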
Excellent diagnostic characteristics for ultrafast gene profiling of DEFA1-IL1B-LTF in detection of prosthetic joint infections
The timely and exact diagnosis of prosthetic joint infection (PJI) is crucial for surgical decision-making. Intraoperatively, delivery of the result within an hour is required. Alpha-defensin lateral immunoassay of joint fluid (JF) is precise for the intraoperative exclusion of PJI; however, for patients with a limited amount of JF and/or in cases where the JF is bloody, this test is unhelpful. Important information is hidden in periprosthetic tissues that may much better reflect the current status of implant pathology. We therefore investigated the utility of the gene expression patterns of 12 candidate genes (TLR1, -2, -4, -6, and -10, DEFA1, LTF, IL1B, BPI, CRP, IFNG, and DEFB4A) previously associated with infection for detection of PJI in periprosthetic tissues of patients with total joint arthroplasty (TJA) (n = 76) reoperated for PJI (n = 38) or aseptic failure (n = 38), using the ultrafast quantitative reverse transcription-PCR (RT-PCR) Xxpress system (BJS Biotechnologies Ltd.). Advanced data-mining algorithms were applied for data analysis. For PJI, we detected elevated mRNA expression levels of DEFA1 (P < 0.0001), IL1B (P < 0.0001), LTF (P < 0.0001), TLR1 (P = 0.02), and BPI (P = 0.01) in comparison to those in tissues from aseptic cases. A feature selection algorithm revealed that the DEFA1-IL1B-LTF pattern was the most appropriate for detection/exclusion of PJI, achieving 94.5% sensitivity and 95.7% specificity, with likelihood ratios (LRs) for positive and negative results of 16.3 and 0.06, respectively. Taken together, the results show that DEFA1-IL1B-LTF gene expression detection by use of ultrafast qRT-PCR linked to an electronic calculator allows detection of patients with a high probability of PJI within 45 min after sampling. Further testing on a larger cohort of patients is needed.
Web of Science
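The likelihood ratios reported above translate into post-test probabilities via Bayes' rule in odds form. The sketch below shows the standard definitions; the numbers in the test values are illustrative, not the study's cohort figures.

```python
def likelihood_ratios(sensitivity, specificity):
    """Diagnostic likelihood ratios:
    LR+ = sensitivity / (1 - specificity), LR- = (1 - sensitivity) / specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

def post_test_probability(pretest_prob, lr):
    """Bayes' rule in odds form: post-test odds = pre-test odds * LR."""
    odds = pretest_prob / (1.0 - pretest_prob)
    post_odds = odds * lr
    return post_odds / (1.0 + post_odds)
```

An electronic calculator of the kind the abstract mentions is, at its core, this pair of formulas applied to the assay's measured sensitivity and specificity.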
DeepWalk: Online Learning of Social Representations
We present DeepWalk, a novel approach for learning latent representations of
vertices in a network. These latent representations encode social relations in
a continuous vector space, which is easily exploited by statistical models.
DeepWalk generalizes recent advancements in language modeling and unsupervised
feature learning (or deep learning) from sequences of words to graphs. DeepWalk
uses local information obtained from truncated random walks to learn latent
representations by treating walks as the equivalent of sentences. We
demonstrate DeepWalk's latent representations on several multi-label network
classification tasks for social networks such as BlogCatalog, Flickr, and
YouTube. Our results show that DeepWalk outperforms challenging baselines which
are allowed a global view of the network, especially in the presence of missing
information. DeepWalk's representations can provide scores up to 10%
higher than competing methods when labeled data is sparse. In some experiments,
DeepWalk's representations are able to outperform all baseline methods while
using 60% less training data. DeepWalk is also scalable. It is an online
learning algorithm which builds useful incremental results, and is trivially
parallelizable. These qualities make it suitable for a broad class of real
world applications such as network classification and anomaly detection.
Comment: 10 pages, 5 figures, 4 tables
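The walk-generation stage DeepWalk builds on can be sketched as follows; the function name and parameters are illustrative. Feeding the resulting node-id "sentences" to a skip-gram trainer (e.g., a word2vec implementation) is the second stage and is omitted here.

```python
import random

def random_walks(graph, walk_len=5, walks_per_node=2, seed=0):
    """Truncated random walks over an adjacency dict (node -> neighbor list).
    Each walk is treated like a sentence of node ids for skip-gram training."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_len:
                nbrs = graph[walk[-1]]
                if not nbrs:
                    break  # dead end: truncate the walk early
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks
```

Because each walk depends only on local neighborhoods, walk generation streams over nodes and parallelizes trivially, which is what makes the approach an online, scalable one.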
- …