Search CORE

2,821 research outputs found

A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer

Author: A Subramanian
C Desmedt
Christine Staiger
D Hanahan
E Lee
F Reyal
G Abraham
GR Mishra
Gunnar W. Klau
HY Chuang
I Ulitsky
IW Taylor
Joaquín Dopazo
K Chin
KR Brown
L Ein-Dor
L Tian
LD Miller
LFA Wessels
LJ van’t Veer
Lodewyk F. A. Wessels
M Kanehisa
Marcus Dittrich
MH van Vliet
MJ van de Vijver
ML Gatza
MT Dittrich
P Dao
Raul Kooter
S Loi
S Ma
SA Chowdhury
Sidney Cadot
Tobias Müller
TSK Prasad
V Popovici
Y Pawitan
Y Wang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/10/2011
Field of study

Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single gene classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single gene classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single gene classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single gene sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single gene classifiers for predicting outcome in breast cancer

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

VU Research Portal

CWI's Institutional Repository

Directory of Open Access Journals

PubMed Central

Online-Publikations-Server der Universität Würzburg

FigShare

Stable Feature Selection for Biomarker Discovery

Author: He Zengyou
Yu Weichuan
Publication venue
Publication date: 01/01/2010
Field of study

Feature selection techniques have been used as the workhorse in biomarker discovery applications for a long time. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. It is only until recently that this issue has received more and more attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchal framework. We have two objectives: (1) providing an overview on this new yet fast growing topic for a convenient reference; (2) categorizing existing methods under an expandable framework for future research and development

arXiv.org e-Print Archive

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository

Ranking with Submodular Valuations

Author: Azar Yossi
Gamzu Iftah
Publication venue
Publication date: 15/07/2010
Field of study

We study the problem of ranking with submodular valuations. An instance of this problem consists of a ground set

[m]

, and a collection of

n

monotone submodular set functions

f^1, \ldots, f^n

, where each

f^i: 2^{[m]} \to R_+

. An additional ingredient of the input is a weight vector

w \in R_+^n

. The objective is to find a linear ordering of the ground set elements that minimizes the weighted cover time of the functions. The cover time of a function is the minimal number of elements in the prefix of the linear ordering that form a set whose corresponding function value is greater than a unit threshold value. Our main contribution is an

O(\ln(1 / \epsilon))

-approximation algorithm for the problem, where

\epsilon

is the smallest non-zero marginal value that any function may gain from some element. Our algorithm orders the elements using an adaptive residual updates scheme, which may be of independent interest. We also prove that the problem is

\Omega(\ln(1 / \epsilon))

-hard to approximate, unless P = NP. This implies that the outcome of our algorithm is optimal up to constant factors.Comment: 16 pages, 3 figure

arXiv.org e-Print Archive

CiteSeerX

An Efficient Bandit Algorithm for Realtime Multivariate Optimization

Author: Hill Daniel N
Iyer Anand
Liu Yi
Nassif Houssam
Vishwanathan S V N
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 22/10/2018
Field of study

Optimization is commonly employed to determine the content of web pages, such as to maximize conversions on landing pages or click-through rates on search engine result pages. Often the layout of these pages can be decoupled into several separate decisions. For example, the composition of a landing page may involve deciding which image to show, which wording to use, what color background to display, etc. Such optimization is a combinatorial problem over an exponentially large decision space. Randomized experiments do not scale well to this setting, and therefore, in practice, one is typically limited to optimizing a single aspect of a web page at a time. This represents a missed opportunity in both the speed of experimentation and the exploitation of possible interactions between layout decisions. Here we focus on multivariate optimization of interactive web pages. We formulate an approach where the possible interactions between different components of the page are modeled explicitly. We apply bandit methodology to explore the layout space efficiently and use hill-climbing to select optimal content in realtime. Our algorithm also extends to contextualization and personalization of layout selection. Simulation results show the suitability of our approach to large decision spaces with strong interactions between content. We further apply our algorithm to optimize a message that promotes adoption of an Amazon service. After only a single week of online optimization, we saw a 21% conversion increase compared to the median layout. Our technique is currently being deployed to optimize content across several locations at Amazon.com.Comment: KDD'17 Audience Appreciation Awar

arXiv.org e-Print Archive

Crossref

FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

Author: Andoni A.
Broder A. Z.
Li P.
Lv Q.
Shrivastava A.
Shrivastava A.
Weber R.
Publication venue
Publication date: 03/07/2018
Field of study

We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for \textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (

n^2D

), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results

arXiv.org e-Print Archive

Crossref

Recommended from our members

Metaheuristic approaches for the quartet method of hierarchical clustering

Author: Consoli S
Darby-Dowman K
Geleijnse G
Korst J
Pauws S
Publication venue
Publication date: 01/01/2008
Field of study

Given a set of objects and their pairwise distances, we wish to determine a visual representation of the data. We use the quartet paradigm to compute a hierarchy of clusters of the objects. The method is based on an NP-hard graph optimization problem called the Minimum Quartet Tree Cost problem. This paper presents and compares several metaheuristic approaches to approximate the optimal hierarchy. The performance of the algorithms is tested through extensive computational experiments and it is shown that the Reduced Variable Neighbourhood Search metaheuristic is the most effective approach to the problem, obtaining high quality solutions in short computational running times

Brunel University Research Archive

Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms

Author: Chu Wei
Langford John
Li Lihong
Wang Xuanhui
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2011
Field of study

Contextual bandit algorithms have become popular for online recommendation systems such as Digg, Yahoo! Buzz, and news recommendation in general. \emph{Offline} evaluation of the effectiveness of new algorithms in these applications is critical for protecting online user experiences but very challenging due to their "partial-label" nature. Common practice is to create a simulator which simulates the online environment for the problem at hand and then run an algorithm against this simulator. However, creating simulator itself is often difficult and modeling bias is usually unavoidably introduced. In this paper, we introduce a \emph{replay} methodology for contextual bandit algorithm evaluation. Different from simulator-based approaches, our method is completely data-driven and very easy to adapt to different applications. More importantly, our method can provide provably unbiased evaluations. Our empirical results on a large-scale news article recommendation dataset collected from Yahoo! Front Page conform well with our theoretical results. Furthermore, comparisons between our offline replay and online bucket evaluation of several contextual bandit algorithms show accuracy and effectiveness of our offline evaluation method.Comment: 10 pages, 7 figures, revised from the published version at the WSDM 2011 conferenc

arXiv.org e-Print Archive

CiteSeerX

Crossref

TimeMachine: Timeline Generation for Knowledge-Base Entities

Author: Baeza-Yates R. A.
Dasgupta A.
Do Q. X.
Graus D.
Krause A.
Lin C.-Y.
Lin H.
Ling X.
Minoux M.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 08/06/2015
Field of study

We present a method called TIMEMACHINE to generate a timeline of events and relations for entities in a knowledge base. For example for an actor, such a timeline should show the most important professional and personal milestones and relationships such as works, awards, collaborations, and family relationships. We develop three orthogonal timeline quality criteria that an ideal timeline should satisfy: (1) it shows events that are relevant to the entity; (2) it shows events that are temporally diverse, so they distribute along the time axis, avoiding visual crowding and allowing for easy user interaction, such as zooming in and out; and (3) it shows events that are content diverse, so they contain many different types of events (e.g., for an actor, it should show movies and marriages and awards, not just movies). We present an algorithm to generate such timelines for a given time period and screen size, based on submodular optimization and web-co-occurrence statistics with provable performance guarantees. A series of user studies using Mechanical Turk shows that all three quality criteria are crucial to produce quality timelines and that our algorithm significantly outperforms various baseline and state-of-the-art methods.Comment: To appear at ACM SIGKDD KDD'15. 12pp, 7 fig. With appendix. Demo and other info available at http://cs.stanford.edu/~althoff/timemachine

arXiv.org e-Print Archive

Crossref