3,147 research outputs found
Can We Rebrand the Humanities?
As someone who studied both marketing and history (and who
finds her history degree a super valuable part of that mix), the
question often crosses my mind: “How can I sell my history
degree?”
Reconstruction of biological networks by supervised machine learning approaches
We review a recent trend in computational systems biology which aims at using
pattern recognition algorithms to infer the structure of large-scale biological
networks from heterogeneous genomic data. We present several strategies that
have been proposed and that lead to different pattern recognition problems and
algorithms. The strength of these approaches is illustrated on the
reconstruction of metabolic, protein-protein and regulatory networks of model
organisms. In all cases, state-of-the-art performance is reported.
Kernel methods in genomics and computational biology
Support vector machines and kernel methods are increasingly popular in
genomics and computational biology, due to their good performance in real-world
applications and strong modularity that makes them suitable to a wide range of
problems, from the classification of tumors to the automatic annotation of
proteins. Their ability to work in high dimension, to process non-vectorial
data, and the natural framework they provide to integrate heterogeneous data
are particularly relevant to various problems arising in computational biology.
In this chapter we survey some of the most prominent applications published so
far, highlighting the particular developments in kernel methods triggered by
problems in biology, and mention a few promising research directions likely to
expand in the future.
Optimal rates for plug-in estimators of density level sets
In the context of density level set estimation, we study the convergence of
general plug-in methods under two main assumptions on the density for a given
level $\lambda$. More precisely, it is assumed that the density (i) is smooth
in a neighborhood of $\lambda$ and (ii) has $\gamma$-exponent at level
$\lambda$. Condition (i) ensures that the density can be estimated at a
standard nonparametric rate and condition (ii) is similar to Tsybakov's margin
assumption which is stated for the classification framework. Under these
assumptions, we derive optimal rates of convergence for plug-in estimators.
Explicit convergence rates are given for plug-in estimators based on kernel
density estimators when the underlying measure is the Lebesgue measure. Lower
bounds proving optimality of the rates in a minimax sense when the density is
Hölder smooth are also provided.
Comment: Published at http://dx.doi.org/10.3150/09-BEJ184 in Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
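As a rough illustration of the plug-in idea described in this abstract, the sketch below estimates a density with a kernel density estimator and keeps the points where the estimate is at least the chosen level. The Gaussian kernel, the one-dimensional setting, the evaluation grid, the toy sample, and the function names `gaussian_kde` and `plug_in_level_set` are all assumptions made for illustration; they are not taken from the paper.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate from `sample`."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

def plug_in_level_set(sample, level, bandwidth, grid):
    """Plug-in estimate: grid points whose estimated density is >= level."""
    density = gaussian_kde(sample, bandwidth)
    return [x for x in grid if density(x) >= level]

# Toy data (hypothetical): two clusters, near 0 and near 3.
sample = [-0.3, -0.1, 0.0, 0.1, 0.2, 2.9, 3.0, 3.2]
grid = [i / 10.0 for i in range(-20, 51)]
level_set = plug_in_level_set(sample, level=0.2, bandwidth=0.3, grid=grid)
```

On this toy sample the estimated level set covers grid points around both clusters and excludes the low-density gap between them. The bandwidth here is picked by hand; the rates discussed in the abstract depend on how it is chosen relative to the sample size.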
Joint segmentation of many aCGH profiles using fast group LARS
Array-Based Comparative Genomic Hybridization (aCGH) is a method used to
search for genomic regions with copy number variations. For a given aCGH
profile, one challenge is to accurately segment it into regions of constant
copy number. Subjects sharing the same disease status, for example a type of
cancer, often have aCGH profiles with similar copy number variations, due to
duplications and deletions relevant to that particular disease. We introduce a
constrained optimization algorithm that jointly segments aCGH profiles of many
subjects. It simultaneously penalizes the amount of freedom the set of profiles
has to jump from one level of constant copy number to another, at genomic
locations known as breakpoints. We show that breakpoints shared by many
different profiles tend to be found first by the algorithm, even in the
presence of significant amounts of noise. The algorithm can be formulated as a
group LARS problem. We propose an extremely fast way to find the solution path,
i.e., a sequence of shared breakpoints in order of importance. For no extra
cost, the algorithm smooths all of the aCGH profiles into piecewise-constant
regions of equal copy number, giving low-dimensional versions of the original
data. These can be shown for all profiles on a single graph, allowing for
intuitive visual interpretation. Simulations and an implementation of the
algorithm on bladder cancer aCGH profiles are provided.
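To make the notion of a breakpoint shared across profiles concrete, here is a minimal sketch that finds a single common breakpoint by brute force and then smooths each profile into two constant pieces. This is not the group LARS algorithm from the abstract: it swaps in an exhaustive least-squares search over one breakpoint, and the function names and toy profiles are hypothetical.

```python
def segment_cost(values):
    """Sum of squared deviations of `values` from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_shared_breakpoint(profiles):
    """Position b splitting every profile into [0:b] and [b:] with the
    smallest total squared error summed over all profiles."""
    length = len(profiles[0])
    best_pos, best_cost = None, float("inf")
    for b in range(1, length):
        cost = sum(segment_cost(p[:b]) + segment_cost(p[b:]) for p in profiles)
        if cost < best_cost:
            best_pos, best_cost = b, cost
    return best_pos

def smooth(profile, b):
    """Replace each of the two segments by its mean (piecewise constant)."""
    left, right = profile[:b], profile[b:]
    return ([sum(left) / len(left)] * len(left)
            + [sum(right) / len(right)] * len(right))

# Toy profiles (hypothetical): two jump at position 3, one stays flat.
profiles = [
    [0.1, -0.1, 0.0, 1.1, 0.9, 1.0],
    [0.0, 0.2, -0.2, -0.9, -1.1, -1.0],
    [0.1, 0.0, -0.1, 0.0, 0.1, -0.1],
]
breakpoint_pos = best_shared_breakpoint(profiles)
smoothed = [smooth(p, breakpoint_pos) for p in profiles]
```

Because the cost is summed over profiles, a jump present in several profiles dominates the choice of position even when one profile shows no change there, which is the intuition behind pooling profiles.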
Kernel matrix regression
We address the problem of filling missing entries in a kernel Gram matrix,
given a related full Gram matrix. We attack this problem from the viewpoint of
regression, assuming that the two kernel matrices can be considered as
explanatory variables and response variables, respectively. We propose a
variant of the regression model based on the underlying features in the
reproducing kernel Hilbert space by modifying the idea of kernel canonical
correlation analysis, and we estimate the missing entries by fitting this model
to the existing samples. We obtain promising experimental results on gene
network inference and protein 3D structure prediction from genomic datasets. We
also discuss the relationship with the em-algorithm based on information
geometry.
Epitope prediction improved by multitask support vector machines
Motivation: In silico methods for the prediction of antigenic peptides
binding to MHC class I molecules play an increasingly important role in the
identification of T-cell epitopes. Statistical and machine learning methods, in
particular, are widely used to score candidate epitopes based on their
similarity with known epitopes and non-epitopes. The genes coding for the MHC
molecules, however, are highly polymorphic, and statistical methods have
difficulty building models for alleles with few known epitopes. In this case,
recent work has demonstrated the utility of leveraging information across
alleles to improve the performance of the prediction. Results: We design a
support vector machine algorithm that is able to learn epitope models for all
alleles simultaneously, by sharing information across similar alleles. The
sharing of information across alleles is controlled by a user-defined measure
of similarity between alleles. We show that this similarity can be defined in
terms of supertypes, or more directly by comparing key residues known to play a
role in the peptide-MHC binding. We illustrate the potential of this approach
on various benchmark experiments where it outperforms other state-of-the-art
methods.
A bagging SVM to learn from positive and unlabeled examples
We consider the problem of learning a binary classifier from a training set
of positive and unlabeled examples, both in the inductive and in the
transductive setting. This problem, often referred to as PU learning,
differs from the standard supervised classification problem by the lack of
negative examples in the training set. It corresponds to a ubiquitous
situation in many applications such as information retrieval or gene ranking,
when we have identified a set of data of interest sharing a particular
property, and we wish to automatically retrieve additional data sharing the
same property among a large and easily available pool of unlabeled data. We
propose a conceptually simple method, akin to bagging, to approach both
inductive and transductive PU learning problems, by converting them into a series
of supervised binary classification problems discriminating the known positive
examples from random subsamples of the unlabeled set. We empirically
demonstrate the relevance of the method on simulated and real data, where it
performs at least as well as existing methods while being faster.
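The bagging scheme described in this abstract can be sketched as follows: repeatedly draw a random subsample of the unlabeled set, treat it as negative, train a binary scorer against the known positives, and average the scores. To stay dependency-free, this sketch swaps the SVM for a nearest-centroid scorer, and it covers only the inductive case. The function names, the number of rounds, the subsample size, and the synthetic data are all assumptions for illustration.

```python
import random

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_base(positives, negatives):
    """Nearest-centroid scorer (stand-in for an SVM): a higher score means
    the point is closer to the positive centroid than to the negative one."""
    c_pos, c_neg = centroid(positives), centroid(negatives)
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)

def bagging_pu(positives, unlabeled, n_rounds=50, subsample_size=None, seed=0):
    """Average base scorers, each trained on the positives versus a random
    subsample of the unlabeled set used as pseudo-negatives."""
    rng = random.Random(seed)
    k = subsample_size or len(positives)
    scorers = []
    for _ in range(n_rounds):
        pseudo_negatives = rng.sample(unlabeled, k)
        scorers.append(train_base(positives, pseudo_negatives))
    return lambda x: sum(s(x) for s in scorers) / len(scorers)

# Synthetic data (hypothetical): positives near (2, 2); the unlabeled pool
# mixes hidden positives near (2, 2) with negatives near (-2, -2).
data_rng = random.Random(1)

def cloud(center, n):
    return [[center + data_rng.gauss(0, 0.5), center + data_rng.gauss(0, 0.5)]
            for _ in range(n)]

positives = cloud(2.0, 10)
unlabeled = cloud(2.0, 10) + cloud(-2.0, 40)
score = bagging_pu(positives, unlabeled)
```

Calling `score([2.0, 2.0])` and `score([-2.0, -2.0])` should give a higher value for the first point. Averaging over many subsamples dampens the effect of hidden positives that happen to land among the pseudo-negatives in any single round.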