Reconciling modern machine learning practice and the bias-variance trade-off
Breakthroughs in machine learning are rapidly changing science and society,
yet our fundamental understanding of this technology has lagged far behind.
Indeed, one of the central tenets of the field, the bias-variance trade-off,
appears to be at odds with the observed behavior of methods used in modern
machine learning practice. The bias-variance trade-off implies that a model
should balance under-fitting and over-fitting: rich enough to express
underlying structure in data, simple enough to avoid fitting spurious patterns.
However, in modern practice, very rich models such as neural networks are
trained to exactly fit (i.e., interpolate) the data. Classically, such models
would be considered over-fit, and yet they often obtain high accuracy on test
data. This apparent contradiction has raised questions about the mathematical
foundations of machine learning and their relevance to practitioners.
In this paper, we reconcile the classical understanding and the modern
practice within a unified performance curve. This "double descent" curve
subsumes the textbook U-shaped bias-variance trade-off curve by showing how
increasing model capacity beyond the point of interpolation results in improved
performance. We provide evidence for the existence and ubiquity of double
descent for a wide spectrum of models and datasets, and we posit a mechanism
for its emergence. This connection between the performance and the structure of
machine learning models delineates the limits of classical analyses, and has
implications for both the theory and practice of machine learning.
An assessment of upper ocean salinity content from the ocean reanalyses inter-comparison project (ORA-IP)
Many institutions worldwide have developed ocean reanalysis systems (ORAs) utilizing a variety of ocean models and assimilation techniques. However, the quality of salinity reanalyses arising from the various ORAs has not yet been comprehensively assessed. In this study, we assess the upper ocean salinity content (depth-averaged over 0–700 m) from 14 ORAs and 3 objective ocean analysis systems (OOAs) as part of the Ocean Reanalyses Intercomparison Project. Our results show that the best agreement between estimates of salinity from different ORAs is obtained in the tropical Pacific, likely due to relatively abundant atmospheric and oceanic observations in this region. The largest disagreement in salinity reanalyses is in the Southern Ocean along the Antarctic Circumpolar Current, as a consequence of the sparseness of both atmospheric and oceanic observations in this region. The West Pacific warm pool is the largest region where the signal-to-noise ratio of reanalysed salinity anomalies is >1. Therefore, the current salinity reanalyses in the tropical Pacific Ocean may be more reliable than those in the Southern Ocean and regions along the western boundary currents. Moreover, we found that the assimilation of salinity in ocean regions with relatively strong ocean fronts is still a common problem, as seen in most ORAs. The impact of the Argo data on the salinity reanalyses is visible, especially within the upper 500 m, where the interannual variability is large. An increasing trend in global-averaged salinity anomalies is found only within the top 0–300 m layer, and with quite large diversity among different ORAs. Beneath 300 m depth, the global-averaged salinity anomalies from most ORAs switch from a slightly growing trend before 2002 to a decreasing trend after 2002. This rapid switch in trend is most likely an artefact of the dramatic change in the observing system due to the implementation of Argo.
Regulation of mammary gland branching morphogenesis by the extracellular matrix and its remodeling enzymes
A considerable body of research indicates that mammary gland branching morphogenesis is dependent, in part, on the extracellular matrix (ECM), ECM receptors such as integrins, and ECM-degrading enzymes, including matrix metalloproteinases (MMPs) and their inhibitors, tissue inhibitors of metalloproteinases (TIMPs). There is some evidence that these ECM cues affect one or more of the following processes: cell survival, polarity, proliferation, differentiation, adhesion, and migration. Both three-dimensional culture models and genetic manipulations of the mouse mammary gland have been used to study the signaling pathways that affect these processes. However, the precise mechanisms of ECM-directed mammary morphogenesis are not well understood. Mammary morphogenesis involves epithelial 'invasion' of adipose tissue, a process akin to invasion by breast cancer cells, although the former is a highly regulated developmental process. How these morphogenic pathways are integrated in the normal gland, and how they become dysregulated and subverted in the progression of breast cancer, also remain largely unanswered questions.
Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data
Background: Visualization of DNA microarray data in two- or three-dimensional spaces is an important exploratory analysis step for detecting quality issues or generating new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data. Results: A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two- and three-dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods. Conclusions: Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and thus are favorable alternatives for the visualization of microarray data.
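To illustrate the linear mapping that PCA defines between high-dimensional data and a low-dimensional representation, here is a minimal sketch that finds the first principal component by power iteration on the sample covariance matrix. It is not the benchmark code from the study; the function name `first_principal_component` and the toy data are assumptions made for illustration only.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length numeric sequences (one row per sample).
    Hypothetical helper; uses plain power iteration, not a library PCA.
    """
    n, d = len(rows), len(rows[0])
    # Center each feature at zero mean.
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply cov and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance; keep the current direction
        v = [x / norm for x in w]
    # Project the centered samples onto the component (1-D representation).
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, scores


# Toy data lying exactly on the line y = 2x (assumed example).
rows = [(float(x), 2.0 * x) for x in range(-2, 3)]
direction, scores = first_principal_component(rows)
```

For the toy data the recovered direction is proportional to (1, 2), the line the points lie on. Nonlinear methods such as Isomap or Locally Linear Embedding replace this single linear projection with a mapping built from neighborhood structure.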
Optimal Reaction Coordinates
The dynamic behavior of complex systems with many degrees of freedom is often analyzed by projection onto one or a few reaction coordinates. The dynamics is then described in a simple and intuitive way as diffusion on the associated free energy profile. In order to use such a picture for a quantitative description of the dynamics, one needs to select the coordinate in an optimal way so as to minimize non-Markovian effects due to the projection. For equilibrium dynamics between two boundary states (e.g., a reaction), the optimal coordinate is known as the committor, or the pfold coordinate in protein folding studies. While the dynamics projected on the committor is not Markovian, many important quantities of the original multidimensional dynamics on an arbitrarily complex landscape can be computed exactly. Here we summarize the derivation of this result, discuss different approaches to determine and validate the committor coordinate, and present three illustrative applications: protein folding, the game of chess, and patient recovery dynamics after kidney transplant.
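As a small worked example of the committor, the sketch below computes it for a discrete Markov chain: the committor of a state is the probability of reaching boundary set B before boundary set A. This is a generic textbook construction under assumed names (`committor`, `random_walk`), not the estimation procedure of the paper. Interior values satisfy q(i) = Σ_j P(i, j) q(j), with q fixed at 0 on A and 1 on B; the system is solved here by simple Gauss-Seidel sweeps.

```python
def committor(P, A, B, tol=1e-12, max_iter=100_000):
    """Probability of hitting set B before set A from each state.

    P: row-stochastic transition matrix as a list of lists.
    A, B: disjoint sets of boundary state indices.
    Hypothetical helper; solves q = P q on interior states iteratively.
    """
    n = len(P)
    q = [1.0 if i in B else 0.0 for i in range(n)]
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            if i in A or i in B:
                continue  # boundary values stay fixed at 0 or 1
            new = sum(P[i][j] * q[j] for j in range(n))
            delta = max(delta, abs(new - q[i]))
            q[i] = new
        if delta < tol:
            break
    return q


def random_walk(n, p_right=0.5):
    """Nearest-neighbor walk on states 0..n-1 with absorbing end states."""
    P = [[0.0] * n for _ in range(n)]
    P[0][0] = 1.0
    P[n - 1][n - 1] = 1.0
    for i in range(1, n - 1):
        P[i][i + 1] = p_right
        P[i][i - 1] = 1.0 - p_right
    return P


# Symmetric walk on six states: A is the left end, B is the right end.
q = committor(random_walk(6), A={0}, B={5})
```

For the symmetric walk the committor is linear in the state index, q(i) = i/5, which is the familiar gambler's-ruin result.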
Muscle Fiber Viability, a Novel Method for the Fast Detection of Ischemic Muscle Injury in Rats
Acute lower extremity ischemia is a limb- and life-threatening clinical problem. Rapid detection of the degree of injury is crucial; however, at present there are no exact diagnostic tests available for this purpose. Our goal was to examine a novel technique, which has the potential to accurately assess the degree of ischemic muscle injury within a short period of time, in a clinically relevant rodent model. Male Wistar rats were exposed to 4, 6, 8 and 9 hours of bilateral lower limb ischemia induced by the occlusion of the infrarenal aorta. Additional animals underwent 8 and 9 hours of ischemia followed by 2 hours of reperfusion to examine the effects of revascularization. Muscle samples were collected from the left anterior tibial muscle for viability assessment. The degree of muscle damage (muscle fiber viability) was assessed by morphometric evaluation of NADH-tetrazolium reductase reaction on frozen sections. Right hind limbs were perfusion-fixed with paraformaldehyde and glutaraldehyde for light and electron microscopic examinations. Muscle fiber viability decreased progressively over the time of ischemia, with significant differences found between the consecutive times. High correlation was detected between the length of ischemia and the values of muscle fiber viability. After reperfusion, viability showed significant reduction in the 8-hour-ischemia and 2-hour-reperfusion group compared to the 8-hour-ischemia-only group, and decreased further after 9 hours of ischemia and 2 hours of reperfusion. Light and electron microscopic findings correlated strongly with the values of muscle fiber viability: lower viability values represented a higher degree of ultrastructural injury, while similar viability results corresponded to similar morphological injury. Muscle fiber viability was capable of accurately determining the degree of muscle injury in our rat model. Our method might therefore be useful in clinical settings in the diagnostics of acute ischemic muscle injury.
Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches
We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
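The tf-idf cosine step and the top-n filtering described in this abstract can be sketched on a toy corpus as follows. This is only an illustration under stated assumptions: the weighting used here (raw term count times log(N/df)) is one common tf-idf variant and may differ from the study's, whitespace tokenization stands in for real text processing, and the function names and the four example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse tf-idf vectors (dicts) using weight = count * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(vectors, n):
    """For each document, keep its n most similar other documents."""
    result = []
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
        result.append([(j, s) for s, j in sims[:n]])
    return result


# Assumed toy corpus, not data from the study.
docs = [
    "protein folding kinetics",
    "protein folding free energy",
    "ocean salinity reanalysis",
    "ocean salinity observations",
]
vectors = tfidf_vectors(docs)
neighbors = top_n_neighbors(vectors, n=1)
```

The all-pairs loop is quadratic in the number of documents, so it only suits small examples; a corpus of millions of records would need sparse matrix or inverted-index techniques. The resulting top-n lists form the sparse similarity graph that a clustering step would then consume.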