11 research outputs found

    SlimPLS: A Method for Feature Selection in Gene Expression-Based Disease Classification

    A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a labeled training set containing samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve diagnosis and treatment selection for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured on the microarray) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers. We demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.
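    The abstract does not spell out the algorithm, so the following is a minimal sketch of the general idea of PLS-based gene filtering, ranking genes by their weights on the leading Partial Least Squares component, assuming scikit-learn and synthetic data. It is illustrative only, not the SlimPLS implementation itself.

```python
# Sketch: rank genes by |weight| on the first PLS component (illustrative,
# not the SlimPLS algorithm itself). Assumes scikit-learn and toy data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2000))   # 40 samples, 2000 gene expression levels
y = rng.integers(0, 2, size=40)       # case/control labels

pls = PLSRegression(n_components=3)
pls.fit(X, y.astype(float))

# Genes with large |weight| on the first component contribute most to
# separating the two classes, so keep only the highest-scoring ones.
scores = np.abs(pls.x_weights_[:, 0])
top_genes = np.argsort(scores)[::-1][:50]   # keep the 50 top-ranked genes
X_reduced = X[:, top_genes]                 # input for any standard classifier
```

    Filtering on PLS weights like this is one common multivariate alternative to univariate gene ranking; the actual SlimPLS variants may differ in how components are combined and how many genes are retained.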

    TRAF6 and IRF7 Control HIV Replication in Macrophages

    The innate immune system recognizes virus infection and evokes antiviral responses, which include the production of type I interferons (IFNs). The induction of IFN provides a crucial mechanism of antiviral defense by upregulating interferon-stimulated genes (ISGs) that restrict viral replication. ISGs inhibit the replication of many viruses by acting at different steps of the viral life cycle. Specifically, IFN treatment prior to in vitro human immunodeficiency virus (HIV) infection stops or significantly delays HIV-1 production, indicating that potent inhibitory factors are generated. We report that HIV-1 infection of primary human macrophages decreases the expression of tumor necrosis factor receptor-associated factor 6 (TRAF6) and virus-induced signaling adaptor (VISA), both components of the IFN signaling pathway controlling viral replication. Knocking down the expression of TRAF6 in macrophages increased HIV-1 replication and augmented the expression of IRF7 but not IRF3. Suppressing VISA had no impact on viral replication. Overexpression of IRF7 resulted in enhanced viral replication, while knocking down IRF7 expression in macrophages significantly reduced viral output. These findings are the first demonstration that TRAF6 can regulate HIV-1 production and, furthermore, that expression of IRF7 promotes HIV-1 replication.

    Graph Kernels

    As new graph-structured data is constantly being generated, learning and data mining on graphs have become a challenge in application areas such as molecular biology, telecommunications, chemoinformatics, and social network analysis. The central algorithmic problem in these areas, measuring the similarity of graphs, has therefore received extensive attention in the recent past. Unfortunately, existing approaches are slow, lacking in expressivity, or hard to parameterize. Graph kernels have recently been proposed as a theoretically sound and promising approach to the problem of graph comparison. Their attractiveness stems from the fact that by defining a kernel on graphs, a whole family of data mining and machine learning algorithms becomes applicable to graphs. These kernels must respect both the topology and the node and edge labels of the graphs, while being efficient to compute. Existing methods fall woefully short: they miss out on important topological information, are plagued by runtime issues, and do not scale to large graphs. Hence the primary goal of this thesis is to make learning and data mining with graph kernels feasible. In the first half of this thesis, we review and analyze the shortcomings of state-of-the-art graph kernels. We then propose solutions to overcome these weaknesses. As highlights of our research, we
    - speed up the classic random walk graph kernel from O(n^6) to O(n^3), where n is the number of nodes in the larger graph, and by a factor of up to 1,000 in CPU runtime, by extending concepts from linear algebra to Reproducing Kernel Hilbert Spaces,
    - define novel graph kernels based on shortest paths that avoid tottering and outperform random walk kernels in accuracy (a sketch of this idea follows the abstract), and
    - define novel graph kernels that estimate the frequency of small subgraphs within a large graph and that work on large graphs hitherto not handled by existing graph kernels.
    In the second half of this thesis, we present algorithmic solutions to two novel problems in graph mining. First, we define a two-sample test on graphs. Given two sets of graphs, or a pair of graphs, this test lets us decide whether these graphs are likely to originate from the same underlying distribution. To solve this so-called two-sample problem, we define the first kernel-based two-sample test. Combined with graph kernels, this results in the first two-sample test on graphs described in the literature. Second, we propose a principled approach to supervised feature selection on graphs. As in feature selection on vectors, feature selection on graphs aims at finding features that are correlated with the class membership of a graph. Towards this goal, we first define a family of supervised feature selection algorithms based on kernels and the Hilbert-Schmidt Independence Criterion. We then show how to extend this principle of feature selection to graphs, and how to combine it with gSpan, the state-of-the-art method for frequent subgraph mining. On several benchmark datasets, our novel procedure manages to select a small subset of dozens of informative features among the thousands to millions of subgraphs detected by gSpan. In classification experiments, the features selected by our method outperform those chosen by other feature selectors in terms of classification accuracy. Along the way, we also solve several problems that can be deemed contributions in their own right:
    - We define a unifying framework for describing both variants of random walk graph kernels proposed in the literature.
    - We present the first theoretical connection between graph kernels and molecular descriptors from chemoinformatics.
    - We show how to determine sample sizes for estimating the frequency of certain subgraphs within a large graph with a given precision and confidence, which promises to be a key to the solution of important problems in data mining and bioinformatics.
    Three branches of computer science immediately benefit from our findings: data mining, machine learning, and bioinformatics. For data mining, our efficient graph kernels allow us to bring the large family of kernel methods to bear on mining problems on real-world graph data. For machine learning, we open the door to extending strong theoretical results on learning on graphs into useful practical applications. For bioinformatics, we make a number of principled kernel methods and efficient kernel functions available for biological network comparison and structural comparisons of proteins. Apart from these three areas, other fields may also benefit from our findings, as our algorithms are general in nature and not restricted to a particular type of application.
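    As a rough illustration of the shortest-path idea highlighted above, here is a minimal sketch of a shortest-path graph kernel that compares the multisets of shortest-path lengths of two graphs with a delta kernel. It assumes networkx, ignores node and edge labels, and is not the thesis's implementation.

```python
# Sketch of a shortest-path graph kernel: count pairs of shortest paths
# (one from each graph) whose lengths match exactly. Labels are ignored.
import networkx as nx
from collections import Counter

def sp_length_histogram(G: nx.Graph) -> Counter:
    """Histogram of shortest-path lengths over all ordered node pairs."""
    hist = Counter()
    for _, lengths in nx.all_pairs_shortest_path_length(G):
        for _, d in lengths.items():
            if d > 0:                      # skip trivial zero-length paths
                hist[d] += 1
    return hist

def shortest_path_kernel(G1: nx.Graph, G2: nx.Graph) -> int:
    """k(G1, G2) = number of path-length pairs that agree exactly."""
    h1, h2 = sp_length_histogram(G1), sp_length_histogram(G2)
    return sum(h1[d] * h2[d] for d in h1.keys() & h2.keys())

print(shortest_path_kernel(nx.cycle_graph(5), nx.path_graph(5)))
```

    Because every node pair contributes exactly one shortest-path length, this construction avoids the "tottering" walks (repeatedly traversing the same edge back and forth) that inflate random walk kernels.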

    Feature Selection for Gene Expression Data Based on Hilbert-Schmidt Independence Criterion

    DNA microarrays are capable of measuring the expression levels of thousands of genes, even the whole genome, in a single experiment. Based on this, they have been widely used to extend the study of cancerous tissues to the genomic level. One of the main goals in DNA microarray experiments is to identify a set of relevant genes such that the desired outputs of the experiment depend mostly on this set, to the exclusion of the rest of the genes. This is motivated by the fact that a biological process in the cell typically involves only a subset of genes, not the whole genome. The task of selecting a subset of relevant genes is called feature (gene) selection. Herein, we propose a feature selection algorithm for gene expression data. It is based on the Hilbert-Schmidt independence criterion (HSIC), and is partly motivated by Rank-One Downdate (R1D) and the Singular Value Decomposition (SVD). The algorithm is computationally very fast and scalable to large data sets, and can be applied to response variables of arbitrary type (categorical or continuous). Experimental results for the proposed technique are presented on synthetic and well-known microarray data sets. Later, we discuss the capability of HSIC to provide a general framework which encapsulates many widely used techniques for dimensionality reduction, clustering, and metric learning. We use this framework to explain two metric learning algorithms, namely Fisher discriminant analysis (FDA) and closed-form metric learning (CFML). As a result of this framework, we are able to propose a new metric learning method. The proposed technique uses concepts from normalized-cut spectral clustering and is associated with an underlying convex optimization problem.
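    For concreteness, here is a minimal sketch of the empirical HSIC statistic that underlies this line of work, HSIC(K, L) = (n-1)^(-2) tr(K H L H), used here as a per-gene relevance score with a linear kernel on each gene and a delta kernel on the labels. The scoring scheme shown is an illustration of HSIC-based relevance, not the proposed algorithm.

```python
# Sketch: biased empirical HSIC and a simple per-gene relevance ranking.
import numpy as np

def hsic(K: np.ndarray, L: np.ndarray) -> float:
    """Biased empirical HSIC: (n-1)^(-2) * tr(K H L H)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 500))           # 30 samples, 500 genes
y = rng.integers(0, 2, size=30)
L = np.equal.outer(y, y).astype(float)       # delta kernel on class labels

# The linear kernel on a single gene is the rank-one matrix x x^T; score each
# gene by its HSIC dependence with the labels and keep the top scorers.
scores = [hsic(np.outer(X[:, j], X[:, j]), L) for j in range(X.shape[1])]
top = np.argsort(scores)[::-1][:20]          # 20 most label-dependent genes
```

    Because HSIC only needs kernel matrices on the features and the response, the same score applies unchanged to categorical and continuous responses, which is what makes the criterion attractive for arbitrary response types.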

    Kernelized Supervised Dictionary Learning

    The representation of a signal using a learned dictionary instead of predefined operators, such as wavelets, has led to state-of-the-art results in various applications such as denoising, texture analysis, and face recognition. The area of dictionary learning is closely associated with sparse representation, which means that the signal is represented using only a few atoms of the dictionary. Despite recent advances in fast dictionary-learning algorithms such as K-SVD, online learning, and cyclic coordinate descent, which make computing a dictionary from millions of data samples feasible, the dictionary is still mainly computed using unsupervised approaches such as k-means. These approaches learn the dictionary by minimizing the reconstruction error without taking the category information into account, which is not optimal for classification tasks. In this thesis, we propose a supervised dictionary learning (SDL) approach that incorporates class-label information into the learning of the dictionary. To this end, we propose to learn the dictionary in a space where the dependency between the signals and their corresponding labels is maximized. To maximize this dependency, the recently introduced Hilbert-Schmidt independence criterion (HSIC) is used. The learned dictionary is compact and has a closed form, so the proposed approach is fast. We show that it outperforms other unsupervised and supervised dictionary learning approaches in the literature on real-world data. Moreover, the main advantage of the proposed SDL approach is that it can easily be kernelized, particularly by incorporating a data-driven kernel, such as a compression-based kernel, into the formulation. In this thesis, we propose a novel compression-based (dis)similarity measure. The proposed measure utilizes a 2D MPEG-1 encoder, which takes into consideration the spatial locality and connectivity of pixels in the images. The formulation has been carefully designed around the MPEG encoder's functionality: by design, it solely uses P-frame coding to find the (dis)similarity among patches/images. We show that the proposed measure works properly on both small and large patch sizes on textures. Experimental results show that incorporating the proposed measure as a kernel into our SDL significantly improves the performance of supervised pixel-based texture classification on Brodatz and outdoor images compared to other compression-based dissimilarity measures, as well as state-of-the-art SDL methods. It also improves computation speed by about 40% compared to its closest rival. Finally, we extend the proposed SDL to multiview learning, where more than one representation is available for a dataset. We propose two different multiview approaches: one fuses the feature sets in the original space and then learns the dictionary and sparse coefficients on the fused set; the other learns one dictionary and the corresponding coefficients in each view separately, and then fuses the representations in the space of the learned dictionaries. We show that the proposed multiview approaches benefit from the complementary information in multiple views, and we investigate their relative performance in the application of emotion recognition.
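    As a hedged sketch of the closed-form flavor described above: one simplified reading of maximizing the HSIC-style dependency between signals and labels in the linear case is to pick orthonormal atoms D maximizing tr(D^T X H L H X^T D), whose solution is the top eigenvectors of X H L H X^T. This is an assumption-laden illustration, not the thesis's full SDL algorithm (which also covers sparse coding and kernelization).

```python
# Sketch: a closed-form, label-aware dictionary as the top eigenvectors of
# X H L H X^T (linear case). Illustrative simplification, not the full SDL.
import numpy as np

def hsic_dictionary(X: np.ndarray, y: np.ndarray, n_atoms: int) -> np.ndarray:
    """X: (features, samples); y: class labels; returns (features, n_atoms)."""
    n = X.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    L = np.equal.outer(y, y).astype(float)    # delta kernel on labels
    M = X @ H @ L @ H @ X.T                   # symmetric, so eigh applies
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:n_atoms]]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 60))            # 100-dim signals, 60 samples
y = rng.integers(0, 3, size=60)
D = hsic_dictionary(X, y, n_atoms=10)         # compact, closed-form dictionary
codes = D.T @ X                               # coefficients in the learned space
```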

    Gene selection via the BAHSIC family of algorithms

    Motivation: Identifying significant genes among the thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the gene lists selected by different methods differ greatly. To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert-Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems. Results: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data, then non-linear kernels can be more suitable.
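    A minimal sketch of HSIC-driven backward elimination in the BAHSIC spirit follows, assuming a linear kernel on the genes and a delta kernel on class labels: at each step, drop the gene whose removal leaves the remaining set with the highest HSIC against the labels. The published method eliminates features in batches and supports arbitrary kernels, so this greedy one-at-a-time version is illustrative only.

```python
# Sketch: greedy backward elimination of genes by HSIC (BAHSIC-style).
import numpy as np

def hsic(K: np.ndarray, L: np.ndarray) -> float:
    """Biased empirical HSIC: (n-1)^(-2) * tr(K H L H)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

def bahsic(X: np.ndarray, y: np.ndarray, n_keep: int) -> list:
    """X: (samples, features). Greedily eliminate genes, keep n_keep."""
    L = np.equal.outer(y, y).astype(float)       # delta kernel on class labels
    active = list(range(X.shape[1]))

    def score_without(j):
        rest = [i for i in active if i != j]
        K = X[:, rest] @ X[:, rest].T            # linear kernel on remaining genes
        return hsic(K, L)

    while len(active) > n_keep:
        # Drop the gene whose removal leaves the highest remaining dependence.
        active.remove(max(active, key=score_without))
    return active

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 40))
y = rng.integers(0, 2, size=25)
print(bahsic(X, y, n_keep=5))                    # indices of surviving genes
```

    Swapping the linear kernel K for a non-linear one (e.g., an RBF kernel) changes only the kernel-matrix computation, which is how the framework yields a whole family of feature selectors.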