
MPG.PuRe

Relating the thermodynamic arrow of time to the causal arrow

Author: Armen E Allahverdyan
Balian R
Becker R
Berdichevsky V L
Berry M V
Campisi M
Carroll S M Chen J
De Roeck W Maes C Netocny K
Dominik Janzing
Gemmer J
Gorban A
Haken H
Hoffding H
Kano Y Shimizu S
Kasuga T
Lichtenberg A J
Lindblad G
Munster A
Pearl J
Penrose O
Reichenbach H
Risken H
Sagdeev R Z
Schulman L S
Schölkopf B
Spirtes P
Sun X Janzing D Schölkopf B
Wald R M
Weiss U
Zeh H D
Publication venue: IOP Publishing
Publication date: 08/08/2007

Consider a Hamiltonian system that consists of a slow subsystem S and a fast subsystem F. The autonomous dynamics of S is driven by an effective Hamiltonian, but its thermodynamics is unexpected. We show that a well-defined thermodynamic arrow of time (second law) emerges for S whenever there is a well-defined causal arrow from S to F and the back-action is negligible. This is because the back-action of F on S is described by a non-globally Hamiltonian Born-Oppenheimer term that violates the Liouville theorem and makes the second law inapplicable to S. If S and F are mixing, under the causal arrow condition they are described by microcanonic distributions P(S) and P(S|F). Their structure supports a causal inference principle proposed recently in machine learning. Comment: 10 pages

MPG.PuRe

Statistical M-Estimation and Consistency in Large Deformable Models for Image Warping

Author: A. Antoniadis
A. Trouvé
A. Waart Van der
A.K. Jain
A.P. Korostelëv
B. Markussen
B. Schölkopf
C. Boor De
C.A. Glasbey
D.G. Kendall
E. Candès
F. Gamboa
G. Charpiat
I.L. Dryden
J. Glaunès
J.-M. Loubes
J.B. MacQueen
Jean-Michel Loubes
Jérémie Bigot
L. Huilling
L. Younes
M. Vaillant
O. Faugeras
R.J. Biscay
S. Allassonière
S. Mallat
S.A. Geer van de
Sébastien Gadat
T. Hastie
U. Grenander
V. Vapnik
X. Pennec
Y. Amit
Y. LeCun
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

The problem of defining appropriate distances between shapes or images and modeling the variability of natural images by group transformations is at the heart of modern image analysis. A current trend is the study of probabilistic and statistical aspects of deformation models, and the development of consistent statistical procedures for the estimation of template images. In this paper, we consider a set of images randomly warped from a mean template which has to be recovered. For this, we define an appropriate statistical parametric model to generate random diffeomorphic deformations in two dimensions. Then, we focus on the problem of estimating the mean pattern when the images are observed with noise. This problem is challenging from both a theoretical and a practical point of view. M-estimation theory enables us to build an estimator defined as a minimizer of a well-tailored empirical criterion. We prove the convergence of this estimator and propose a gradient descent algorithm to compute this M-estimator in practice. Simulations of template extraction and an application to image clustering and classification are also provided.

Open Archive Toulouse Archive Ouverte

ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

Author: A Su
B Brancotte
B Calvo
B Linghu
B Liu
B Schölkopf
C Giallourakis
C Perez-Iratxeta
C Son
CC Chang
EA Adie
F Denis
F Mordelet
Fantine Mordelet
FS Turner
G Lanckriet
GRG Lanckriet
J Freudenberg
Jean-Philippe Vert
K Bleakley
K Lage
L Jacob
LC Tranchevent
M van Driel
N López-Bigas
N Tiffin
O Vanunu
P Pavlidis
RI Kondor
S Aerts
S Köhler
S Yu
T De Bie
T Evgeniou
T Hwang
U Ala
V McKusick
X Wu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

Kernel Spectral Clustering and applications

In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and $L_0$, $L_1$, $L_0 + L_1$, Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered. Comment: chapter contribution to the book "Unsupervised Learning Algorithms"

Multiple Imputation Ensembles (MIE) for dealing with missing data

Author: A Farhangfar
AM Sefidian
B Schölkopf
C Cortes
CT Tran
DA Newman
DB Rubin
DH Wolpert
EL Silva-Ramírez
GE Batista
GJ van der Heijden
H Gao
IH Witten
J Demšar
J Honaker
J Scheffer
JA Sterne
JL Schafer
JR Quinlan
K Abayomi
KM Ting
L Breiman
L Rokach
M Fichman
M Khalilia
M Spratt
MA Klebanoff
MJ Azur
NJ Horton
PJ García-Laencina
PJ Kelly
PN Tan
RJ Little
S García
S Van Buuren
SS Chae
SS Choi
U Garciarena
V Vapnik
X Chen
Y Dong
Y Freund
Y He
Z Che
Z Liu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/05/2020

Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single/multiple imputation methods to recover the missing values and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation by using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach combining multiple imputation with ensemble techniques outperforms others, particularly as missing data increases.
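The workflow summarized in this abstract (introduce values Missing Completely at Random, impute several times, train one classifier per imputed dataset, combine the predictions) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the random hot-deck imputation, the nearest-centroid classifier, the plain majority vote (standing in for bagging or stacking), the toy data, and all function names are hypothetical.

```python
import random
from collections import Counter


def make_mcar(rows, rate, rng):
    """Blank out each value independently with probability `rate` (MCAR)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_once(rows, rng):
    """One stochastic imputation: fill each gap with a random observed value
    from the same column (a simple hot-deck draw, chosen for illustration)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [v if v is not None else (rng.choice(observed[j]) if observed[j] else 0.0)
         for j, v in enumerate(row)]
        for row in rows
    ]


def fit_centroids(rows, labels):
    """Nearest-centroid 'classifier': store the mean vector of each class."""
    counts = Counter(labels)
    sums = {}
    for row, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda y: dist(centroids[y]))


def mie_predict(rows_with_gaps, labels, test_row, n_imputations, rng):
    """Train one model per imputed copy of the data and take a majority vote."""
    votes = Counter()
    for _ in range(n_imputations):
        completed = impute_once(rows_with_gaps, rng)
        model = fit_centroids(completed, labels)
        votes[predict(model, test_row)] += 1
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated classes with two features each.
complete_rows = ([[0.1 * i, 0.05 * i] for i in range(20)]
                 + [[9.0 + 0.05 * i, 9.0 + 0.02 * i] for i in range(20)])
labels = ["a"] * 20 + ["b"] * 20

rng = random.Random(0)
missing_rows = make_mcar(complete_rows, rate=0.2, rng=rng)
prediction = mie_predict(missing_rows, labels, [0.5, 0.3], n_imputations=5, rng=rng)
print(prediction)
```

Each imputed copy differs because the hot-deck draws are random, so the vote averages over imputation uncertainty; a real study would replace each hypothetical piece with a proper imputation method and classifier.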

University of East Anglia digital repository

A kernel-based approach for detecting outliers of high-dimensional biological data

Author: A Malossini
B Schölkopf
C Aggarwal
D Koller
E Knorr
F Angiulli
H Ressom
J Oh
Jean Gao
JS Wang
Jung Hun Oh
K Kadota
L Manevitz
M Tumminello
R Lilien
S Bandyopadhyay
S Zhou
T Fawcett
T Golub
U Alon
W Lee
Publication venue: BioMed Central
Publication date: 29/04/2009

Knowledge-based energy functions for computational studies of proteins

Author: A. Ben-Naim
A. Godzik
A. Rossi
A.J. Bordner
A.V. Finkelstein
B. Fain
B. Krishnamoorthy
B. Kuhlman
B. Schölkopf
B.H. Park
B.I. Dahiyat
B.J. McConkey
B.O. Mitchell
C. Anfinsen
C. Carter Jr.
C. Czaplewski
C. Hoppe
C. Hu
C. Micheletti
C. Papadimitriou
C. Zhang
C.A. Rohl
C.B. Anfinsen
C.M.R Lemer
C.S. Mészáros
D. Gilis
D. Tobi
D. Xu
E. Venclovas
E.I. Shakhnovich
F.A. Momany
H. Dobbs
H. Edelsbrunner
H. Gan
H. Li
H. Lu
H. Zhou
H.S. Chan
I. Muegge
J. Khatun
J. Liang
J.A. Kocher
J.A. Rank
J.M. Deutsch
J.R. Bienkowska
K. Nishikawa
K. Sale
K.H. Lee
K.K. Koretke
K.T. Simons
L. Adamian
L.A. Mirny
L.L. Looger
L.M. Amzel
M. Karplus
M. Levitt
M. Vendruscolo
M.H. Hao
M.J. Sippl
M.P. Eastwood
M.R. Betancourt
M.S. Friedrichs
N. Karmarkar
N.V. Buchete
P. Koehl
P.D. Thomas
P.G. Wolynes
P.J. Munson
R. Goldstein
R. Guerois
R. Jackups Jr.
R. Janicke
R. Méndez
R. Samudrala
R.B. Hill
R.I. Dima
R.J. Vanderbei
R.K. Singh
R.L. Jernigan
R.S. DeWitte
S. Liu
S. Miyazawa
S. Shimizu
S. Tanaka
S.J. Wodak
T. Kortemme
T. Lazaridis
T.L. Chiu
U. Bastolla
U. Bastolla
V. Vapnik
V.N. Maiorov
W.P. Russ
X. Li
Y. Duan
Y. Park
Y. Xia
Publication venue: Springer Science and Business Media LLC
Publication date: 19/01/2006

This chapter discusses the theoretical framework and methods for developing knowledge-based potential functions essential for protein structure prediction, protein-protein interaction, and protein sequence design. We discuss in some detail the Miyazawa-Jernigan contact statistical potential, distance-dependent statistical potentials, and geometric statistical potentials. We also describe a geometric model for developing both linear and non-linear potential functions by optimization. Applications of knowledge-based potential functions in protein-decoy discrimination, in protein-protein interactions, and in protein design are then described. Finally, several issues of knowledge-based potential functions are discussed. Comment: 57 pages, 6 figures. To be published in a book by Springer

Public Library of Science (PLOS)

Segmentation of Multi-Isotope Imaging Mass Spectrometry Data for Semi-Automatic Detection of Regions of Interest

Author: B Schölkopf
BE Boser
C Burges
C Cortes
C Lechene
C-W Hsu
CC Chang
Christoph W. Turck
Claude Lechene
D-S Zhang
E Frank
G Cohen
G McMahon
G Székely
I El-Naqa
I Guyon
J. Collin Poczatek
JA Nelder
M Steinhauser
MPS Brown
N Cristianini
NR Pal
Philipp Gormanns
RJ Zawadzki
S Hua
Simon Rogers
Stefan Reckow
U Kreßel
X Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2012

Multi-isotope imaging mass spectrometry (MIMS) associates secondary ion mass spectrometry (SIMS) with detection of several atomic masses, the use of stable isotopes as labels, and affiliated quantitative image-analysis software. By associating image and measure, MIMS allows one to obtain quantitative information about biological processes in sub-cellular domains. MIMS can be applied to a wide range of biomedical problems, in particular metabolism and cell fate [1], [2], [3]. In order to obtain morphologically pertinent data from MIMS images, we have to define regions of interest (ROIs). ROIs are drawn by hand, a tedious and time-consuming process. We have developed and successfully applied a support vector machine (SVM) for segmentation of MIMS images that allows fast, semi-automatic boundary detection of regions of interest. Using the SVM, high-quality ROIs (as compared to an expert's manual delineation) were obtained for 2 types of images derived from unrelated data sets. This automation simplifies, accelerates and improves the post-processing analysis of MIMS images. This approach has been integrated into “Open MIMS,” an ImageJ plugin for comprehensive analysis of MIMS images that is available online at http://www.nrims.hms.harvard.edu/NRIMS_ImageJ.php

Harvard University - DASH

MPG.PuRe

Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data

Author: A Antoniadis
A Butte
AL Boulesteix
B Nadler
B Schölkopf
C Chatfield
CC Chang
CCC Liu
Christian Ruckert
Christoph Bartenhagen
CL Nutt
D Geman
D Singh
DV Nguyen
H Hotelling
Hans-Ulrich Klein
HU Klein
I Del Giudice
IS Lim
IT Jolliffe
J Baek
J Misra
JB Tenenbaum
JI Powell
JJ Dai
K Dawson
KQ Weinberger
KY Yeung
LJP Van der Maaten
LK Saul
M Belkin
M Mramor
M Vlachos
MA Hibbs
Martin Dugas
N Cristianini
N Pochet
O Chapelle
R Verhaak
R Xu
S Chao
S Lafon
SB Cho
ST Roweis
T Li
TF Cox
TJ Umpai
TR Golub
U Alon
VD Silva
X Lin
Xiaoyi Jiang
Y Su
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Visualization of DNA microarray data in two- or three-dimensional spaces is an important exploratory analysis step for detecting quality issues or generating new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data. Results: A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two- and three-dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods. Conclusions: Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.
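As a reminder of what the linear baseline in this comparison computes, the following is a minimal sketch of PCA's first component on a tiny toy matrix. It is an illustration under stated assumptions, not the study's pipeline: the power-iteration solver, the fixed iteration count, the toy data, and all function names are hypothetical choices, and a real analysis would use an established linear-algebra library.

```python
import math


def center(rows):
    """Subtract the per-feature mean so the data cloud sits at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (features x features) of centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by the covariance and renormalize.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, v):
    """One-dimensional coordinates of each sample along direction v."""
    return [sum(a * b for a, b in zip(r, v)) for r in centered]


# Toy data (hypothetical): feature 2 is twice feature 1, feature 3 is constant.
samples = [[float(x), 2.0 * x, 5.0] for x in range(5)]
centered = center(samples)
pc1 = first_component(covariance(centered))
coords = project(centered, pc1)
print(pc1, coords)
```

Because all variance in the toy data lies along one direction, the first component recovers that direction and the constant feature receives zero weight; the nonlinear methods named in the abstract replace this single linear projection with neighborhood- or kernel-based embeddings.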