415 research outputs found

Reconciling modern machine learning practice and the bias-variance trade-off

Author: Belkin Mikhail
Hsu Daniel
Ma Siyuan
Mandal Soumik
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 10/09/2019

Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in modern machine learning practice. The bias-variance trade-off implies that a model should balance under-fitting and over-fitting: rich enough to express underlying structure in data, simple enough to avoid fitting spurious patterns. However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered over-fit, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine learning models delineates the limits of classical analyses, and has implications for both the theory and practice of machine learning.

arXiv.org e-Print Archive

Recommended from our members

An assessment of upper ocean salinity content from the ocean reanalyses inter-comparison project (ORA-IP)

Author: A Storto
A. Storto
AV Fedorov
B Huang
B Ingleby
BD Santer
C Maes
E Hackert
ES Johnson
F Hernandez
F. Gaillard
F. Hernandez
G. Chepurin
G. Vernieres
GR Foltz
H Storch von
H Zuo
IM Belkin
IM Held
J Ballabrera-Poy
J Marshall
J Sprintall
J Vialard
J Waters
J Zhu
J-E Kim
JA Carton
JE Janowiak
K Vranes
K. Haines
K.A. Peterson
L. Shi
M Zhao
M. A. Balmaseda
M. Palmer
M. Valdivieso
MA Balmaseda
MF Cronin
MR Wadley
N. Ferry
NS Cooper
O. Alves
PJ Durack
R Curry
R Murtugudde
R Zhang
R. Wedd
RR Dickson
S Guinehut
S Levitus
S Masuda
S Rahmstorf
S Zhang
S. A. Good
S. Guinehut
S. Masuda
SC Yang
T Lee
T Toyoda
T Vinje
T. Lee
T. Toyoda
TJ O’Kane
TP Boyer
X Wang
X. Wang
Y Fujii
Y Xue
Y Yin
Y-S Chang
Y. Chang
Y. Fujii
Y. Yin
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/10/2015

Many institutions worldwide have developed ocean reanalysis systems (ORAs) utilizing a variety of ocean models and assimilation techniques. However, the quality of salinity reanalyses arising from the various ORAs has not yet been comprehensively assessed. In this study, we assess the upper ocean salinity content (depth-averaged over 0–700 m) from 14 ORAs and 3 objective ocean analysis systems (OOAs) as part of the Ocean Reanalyses Intercomparison Project. Our results show that the best agreement between estimates of salinity from different ORAs is obtained in the tropical Pacific, likely due to relatively abundant atmospheric and oceanic observations in this region. The largest disagreement in salinity reanalyses is in the Southern Ocean along the Antarctic Circumpolar Current as a consequence of the sparseness of both atmospheric and oceanic observations in this region. The West Pacific warm pool is the largest region where the signal-to-noise ratio of reanalysed salinity anomalies is >1. Therefore, the current salinity reanalyses in the tropical Pacific Ocean may be more reliable than those in the Southern Ocean and regions along the western boundary currents. Moreover, we found that the assimilation of salinity in ocean regions with relatively strong ocean fronts is still a common problem as seen in most ORAs. The impact of the Argo data on the salinity reanalyses is visible, especially within the upper 500 m, where the interannual variability is large. The increasing trend in global-averaged salinity anomalies can only be found within the top 0–300 m layer, but with quite large diversity among different ORAs. Beneath the 300 m depth, the global-averaged salinity anomalies from most ORAs switch their trends from a slightly growing trend before 2002 to a decreasing trend after 2002. The rapid switch in the trend is most likely an artefact of the dramatic change in the observing system due to the implementation of Argo.

Central Archive at the University of Reading

Crossref

ArchiMer - Institutional Archive of Ifremer

HAL-Université de Bretagne Occidentale

HAL-INSU

JAMSTEC Repository

Explore Bristol Research

Regulation of mammary gland branching morphogenesis by the extracellular matrix and its remodeling enzymes.

Author: A Amour
A Garcia-Gasca
A Lochter
AC Andres
AJ D'Ardenne
AM Belkin
AR Nelson
B Lelongt
BL Hogan
BM Gumbiner
BS Wiseman
C Brisken
CJ Sympson
CS Atwood
CW Daniel
D Alford
DG Stupack
ED Hay
EW Thompson
F Berdichevsky
F Kheradmand
FE Jones
G Bani
G Giannelli
G Shyamala
GB Silberstein
GL Ganser
GW Robinson
H Birkedal-Hansen
H Joseph
HJ Hathaway
HY Ha
J Chen
J Lilla
J Muschler
J Russo
J Schlondorff
J Yant
JE Fata
JE Ferguson
JF Wiesen
JJ Wysolmerski
JM Bradbury
JM Whitelock
JM Williams
JP Irigoyen
JP Lydon
JP Witty
JT Emerman
K Elenius
K Hotary
K Ito
K Morita
K Wolf
L Hennighausen
LA Rudolph-Owen
LJ van't Veer
LR Lund
M Affolter
M Durbeej
M Egeblad
M Rytomaa
M Simian
MA Arnaout
MD Infeld
MD Sternlicht
MF Horster
MH Barcellos-Hoff
MJ van de Vijver
MM Zutter
MP Osborne
MR Crowley
MR Warner
MS Wicha
N Koshikawa
N Quarto
NJ Kenney
NK Wessells
OW Petersen
P Bashkin
PR Warfield
PY Desprez
R Lagace
RH Wetzels
RS Talhouk
S Coleman
S Mukherjee
S Selvarajan
S Sinha
S Stahl
SD Banerjee
SD Robinson
SE Baker
SH Barsky
SL Bullock
SM Ellerbroek
T Darribere
T Gudjonsson
T Hayakawa
TC Klinowska
TH Vu
TN Seagroves
U Felbor
V Gouon-Evans
V Noe
VM Weaver
W Ruan
W Wang
W Xie
WF Vogel
WH Yu
WP Bocchinfuso
Y Fukuda
Y Hirai
Y Kadono
Y Nakanishi
YS Kanwar
Z Werb
Z Zhu
Publication venue: eScholarship, University of California
Publication date: 19/08/2003

A considerable body of research indicates that mammary gland branching morphogenesis is dependent, in part, on the extracellular matrix (ECM), ECM receptors such as integrins, and ECM-degrading enzymes, including matrix metalloproteinases (MMPs) and their inhibitors, tissue inhibitors of metalloproteinases (TIMPs). There is some evidence that these ECM cues affect one or more of the following processes: cell survival, polarity, proliferation, differentiation, adhesion, and migration. Both three-dimensional culture models and genetic manipulations of the mouse mammary gland have been used to study the signaling pathways that affect these processes. However, the precise mechanisms of ECM-directed mammary morphogenesis are not well understood. Mammary morphogenesis involves epithelial 'invasion' of adipose tissue, a process akin to invasion by breast cancer cells, although the former is a highly regulated developmental process. How these morphogenic pathways are integrated in the normal gland and how they become dysregulated and subverted in the progression of breast cancer also remain largely unanswered questions.

Crossref

PubMed Central

eScholarship - University of California

Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data

Author: A Antoniadis
A Butte
AL Boulesteix
B Nadler
B Schölkopf
C Chatfield
CC Chang
CCC Liu
Christian Ruckert
Christoph Bartenhagen
CL Nutt
D Geman
D Singh
DV Nguyen
H Hotelling
Hans-Ulrich Klein
HU Klein
I Del Giudice
IS Lim
IT Jolliffe
J Baek
J Misra
JB Tenenbaum
JI Powell
JJ Dai
K Dawson
KQ Weinberger
KY Yeung
LJP Van der Maaten
LK Saul
M Belkin
M Mramor
M Vlachos
MA Hibbs
Martin Dugas
N Cristianini
N Pochet
O Chapelle
R Verhaak
R Xu
S Chao
S Lafon
SB Cho
ST Roweis
T Li
TF Cox
TJ Umpai
TR Golub
U Alon
VD Silva
X Lin
Xiaoyi Jiang
Y Su
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Visualization of DNA microarray data in two- or three-dimensional spaces is an important exploratory analysis step in order to detect quality issues or to generate new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data. Results: A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two- and three-dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods. Conclusions: Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.
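
As a rough illustration of the linear mapping that PCA defines, the sketch below finds the first principal component of a tiny dataset by power iteration on the sample covariance matrix and projects the centered data onto it. This is a minimal sketch, not the benchmark used in the study: the function name `first_principal_component`, the all-ones start vector, and the toy data are assumptions made for the example, and power iteration would fail if the start vector happened to be orthogonal to the leading component.

```python
import math


def first_principal_component(rows, iters=200):
    """Return (direction, scores) for the leading principal component.

    rows is a list of equal-length numeric lists (one list per sample).
    """
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0] * d  # assumed start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One-dimensional representation: projection onto the component.
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores


# Hypothetical toy data lying exactly on the line y = 2x.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(data)
```

For data on the line y = 2x, the recovered direction is proportional to (1, 2), and the scores order the samples along that line.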

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Optimal Reaction Coordinates

Author: Banushkina
Belkin
Berezhkovskii
Best
Bolhuis
Ceriotti
Chodera
Chung
Chung
Coifman
Cote
Cox
Darve
Das
Du
Du
Freddolino
Freeman
Geissler
Hu
Huang
Hummer
Kevrekidis
Krivov
Li
Li
Lindorff-Larsen
Liu
Lu
Ma
Maximova
Metzner
Mori
Mu
Nüske
Onsager
Ovchinnikov
Peters
Piana
Radou
Rao
Roweis
Rowley
Schuetz
Schwantes
Shaw
Silver
Sosnick
Tian
Tiwary
Torrie
Valsson
Vreede
Weinan
Williams
Zwanzig
Publication venue: 'Wiley'
Publication date: 10/08/2016

The dynamic behavior of complex systems with many degrees of freedom is often analyzed by projection onto one or a few reaction coordinates. The dynamics is then described in a simple and intuitive way as diffusion on the associated free energy profile. In order to use such a picture for a quantitative description of the dynamics one needs to select the coordinate in an optimal way so as to minimize non-Markovian effects due to the projection. For equilibrium dynamics between two boundary states (e.g., a reaction) the optimal coordinate is known as the committor or the pfold coordinate in protein folding studies. While the dynamics projected on the committor is not Markovian, many important quantities of the original multidimensional dynamics on an arbitrarily complex landscape can be computed exactly. Here we summarize the derivation of this result, discuss different approaches to determine and validate the committor coordinate and present three illustrative applications: protein folding, the game of chess, and patient recovery dynamics after kidney transplant.
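
To make the committor concrete, the sketch below computes it for a small discrete-state Markov chain, using the standard definition: q(i) is the probability of reaching boundary state B before boundary state A when starting from state i. This is a minimal sketch and not the approach discussed in the paper; the names `committor` and `random_walk`, the Gauss-Seidel-style iteration, and the six-state symmetric random walk are all assumptions chosen for illustration.

```python
def committor(P, A, B, tol=1e-12, max_iter=100_000):
    """Committor q for a row-stochastic transition matrix P (list of lists).

    q[i] = 0 on states in A, q[i] = 1 on states in B, and
    q[i] = sum_j P[i][j] * q[j] on all other states.
    """
    n = len(P)
    q = [1.0 if i in B else 0.0 for i in range(n)]
    interior = [i for i in range(n) if i not in A and i not in B]
    for _ in range(max_iter):
        change = 0.0
        for i in interior:
            # Update in place, reusing values already refreshed this sweep.
            new = sum(P[i][j] * q[j] for j in range(n))
            change = max(change, abs(new - q[i]))
            q[i] = new
        if change < tol:
            break
    return q


def random_walk(n_states):
    """Symmetric nearest-neighbour random walk with absorbing end states."""
    P = [[0.0] * n_states for _ in range(n_states)]
    P[0][0] = 1.0
    P[-1][-1] = 1.0
    for i in range(1, n_states - 1):
        P[i][i - 1] = 0.5
        P[i][i + 1] = 0.5
    return P


# Hypothetical example: states 0..5, A = {0}, B = {5}.
q = committor(random_walk(6), A={0}, B={5})
```

For this symmetric walk the committor increases linearly from 0 at state A to 1 at state B, which gives a simple check on the iteration.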

Crossref

White Rose Research Online

Muscle Fiber Viability, a Novel Method for the Fast Detection of Ischemic Muscle Injury in Rats

Author: AB Novikoff
AP Ciena
Attila Szijártó
D Troitzsch
DH Moore
Dávid Garbaisz
E Gyurkovics
E Jennische
E Malan
EM Carmo-Araujo
FC Seifert
FW Blaisdell
G Karpati
Gábor Lotz
H Tsubota
HH Klein
J Blebea
J Janda
J Nanobashvili
JL Farber
JP Idstrom
JR Miedema
JX Wang
K Harris
KJ Herbert
KR Knight
L Norgren
László Harsányi
M Belkin
M Pemberton
M Wachstein
MA Creager
MA Merrick
MD Woitaske
MJ Hickey
MK Heatley
ML Brandao
NJ Cahoon
Peter Rosenberger
PF Petrasek
PK Henke
PS Barie
Péter Arányi
RB Jennings
RB Rutherford
RE Bryant
RK Chan
RL Schmidt
S Gillani
S Homer-Vanniasinkam
SE McAllister
TF Lindsay
TJ Walters
W Schaper
WZ Wang
Zsolt Turóczi
Ákos Lukáts
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 13/01/2014

Acute lower extremity ischemia is a limb- and life-threatening clinical problem. Rapid detection of the degree of injury is crucial; however, at present there are no exact diagnostic tests available to achieve this purpose. Our goal was to examine a novel technique, which has the potential to accurately assess the degree of ischemic muscle injury within a short period of time, in a clinically relevant rodent model. Male Wistar rats were exposed to 4, 6, 8 and 9 hours of bilateral lower limb ischemia induced by the occlusion of the infrarenal aorta. Additional animals underwent 8 and 9 hours of ischemia followed by 2 hours of reperfusion to examine the effects of revascularization. Muscle samples were collected from the left anterior tibial muscle for viability assessment. The degree of muscle damage (muscle fiber viability) was assessed by morphometric evaluation of NADH-tetrazolium reductase reaction on frozen sections. Right hind limbs were perfusion-fixed with paraformaldehyde and glutaraldehyde for light and electron microscopic examinations. Muscle fiber viability decreased progressively over the time of ischemia, with significant differences found between the consecutive times. High correlation was detected between the length of ischemia and the values of muscle fiber viability. After reperfusion, viability showed significant reduction in the 8-hour-ischemia and 2-hour-reperfusion group compared to the 8-hour-ischemia-only group, and decreased further after 9 hours of ischemia and 2 hours of reperfusion. Light and electron microscopic findings correlated strongly with the values of muscle fiber viability: lower viability values represented a higher degree of ultrastructural injury, while similar viability results corresponded to similar morphological injury. Muscle fiber viability was capable of accurately determining the degree of muscle injury in our rat model. Our method might therefore be useful in clinical settings in the diagnostics of acute ischemic muscle injury.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

Semmelweis Repository

Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Author: AGK Janacek
André Skupin
BC Vanteru
Bob Schijvenaars
Colin Allen
David Newman
DJ Newman
DK Harman
DM Blei
EM Voorhees
EP Jiang
F Janssens
G Gorrell
G Salton
GL Poulter
GR Hjaltason
HM Müller
J Lewis
J Lin
Joseph R. Biberstine
K Börner
K Järvelin
K Sparck Jones
Katy Börner
Kevin W. Boyack
KW Boyack
MA Hearst
MD Cao
Michael Patek
MW Berry
N Jardine
Nianli Ma
NJ Belkin
P Ahlgren
P Calado
P Castells
R Kassab
R Klavans
Richard Klavans
Russell J. Duhon
S Deerwester
S Martin
SE Robertson
T Couto
T Hofmann
T Kohonen
T Theodosiou
TG Kolda
TK Landauer
WS Cooper
Y Aphinyanaphongs
Y Yamamoto
Publication venue: Public Library of Science
Publication date: 01/01/2011

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
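
Two of the steps named in the abstract, tf-idf cosine similarity and keeping only the top-n similarities per document, can be sketched on a toy corpus. This is a minimal sketch under stated assumptions and not the study's pipeline: the raw-count term frequency, the log(N/df) weighting, the helper names `tfidf_vectors`, `cosine` and `top_n_similar`, and the four toy documents are all choices made for illustration, and the graph layout and average-link clustering steps are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """One sparse tf-idf vector (dict: term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, n):
    """For each document, keep the n most similar other documents as (index, score)."""
    result = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by index
        result.append([(j, s) for s, j in sims[:n]])
    return result


# Hypothetical toy corpus, already tokenized.
docs = [
    "muscle ischemia injury rat".split(),
    "ischemia reperfusion muscle injury".split(),
    "text clustering similarity documents".split(),
    "documents clustering tf idf similarity".split(),
]
vectors = tfidf_vectors(docs)
neighbours = top_n_similar(vectors, n=1)
```

With n=1, each toy document keeps only its single nearest neighbour: the two ischemia documents pair with each other, as do the two clustering documents. A real corpus of millions of records would need sparse matrix routines rather than this quadratic loop.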

Public Library of Science (PLOS)

Crossref

IUScholarWorks (University of Indiana)

Directory of Open Access Journals

PubMed Central

eScholarship - University of California