
    Reconciling modern machine learning practice and the bias-variance trade-off

    Breakthroughs in machine learning are rapidly changing science and society, yet our fundamental understanding of this technology has lagged far behind. Indeed, one of the central tenets of the field, the bias-variance trade-off, appears to be at odds with the observed behavior of methods used in modern machine learning practice. The bias-variance trade-off implies that a model should balance under-fitting and over-fitting: rich enough to express underlying structure in data, yet simple enough to avoid fitting spurious patterns. However, in modern practice, very rich models such as neural networks are trained to exactly fit (i.e., interpolate) the data. Classically, such models would be considered over-fit, and yet they often obtain high accuracy on test data. This apparent contradiction has raised questions about the mathematical foundations of machine learning and their relevance to practitioners. In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance. We provide evidence for the existence and ubiquity of double descent for a wide spectrum of models and datasets, and we posit a mechanism for its emergence. This connection between the performance and the structure of machine learning models delineates the limits of classical analyses and has implications for both the theory and practice of machine learning.
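    The double-descent picture described above can be illustrated in miniature with minimum-norm least-squares fits whose capacity is swept past the interpolation threshold. The sketch below is not from the paper: the sine target, the random Fourier feature map, and all sizes are illustrative assumptions, and how pronounced the second descent is depends on them.

```python
# Minimal sketch (not the authors' experiments): test error vs. model capacity
# for minimum-norm least-squares fits with random cosine features.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + noise * rng.standard_normal(n)
    return x, y

def features(x, n_features, freqs, phases):
    # Random Fourier-style feature map, shared across train and test sets.
    return np.cos(np.outer(x, freqs[:n_features]) + phases[:n_features])

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)
freqs = rng.uniform(0, 10, size=200)
phases = rng.uniform(0, 2 * np.pi, size=200)

for p in [2, 5, 10, 20, 40, 100, 200]:
    Phi_train = features(x_train, p, freqs, phases)
    Phi_test = features(x_test, p, freqs, phases)
    # Minimum-norm solution; it interpolates the training data once p >= n.
    w = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"p = {p:4d}  test MSE = {test_mse:.3f}")
```

    With around 20 training points, the test error typically peaks near the interpolation point (p close to n) and can fall again as p grows, which is the qualitative shape the abstract refers to.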

    Regulation of mammary gland branching morphogenesis by the extracellular matrix and its remodeling enzymes.

    A considerable body of research indicates that mammary gland branching morphogenesis is dependent, in part, on the extracellular matrix (ECM), ECM receptors such as integrins, and ECM-degrading enzymes, including matrix metalloproteinases (MMPs) and their inhibitors, the tissue inhibitors of metalloproteinases (TIMPs). There is some evidence that these ECM cues affect one or more of the following processes: cell survival, polarity, proliferation, differentiation, adhesion, and migration. Both three-dimensional culture models and genetic manipulations of the mouse mammary gland have been used to study the signaling pathways that affect these processes. However, the precise mechanisms of ECM-directed mammary morphogenesis are not well understood. Mammary morphogenesis involves epithelial 'invasion' of adipose tissue, a process akin to invasion by breast cancer cells, although the former is a highly regulated developmental process. How these morphogenic pathways are integrated in the normal gland, and how they become dysregulated and subverted in the progression of breast cancer, also remain largely unanswered questions.

    Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data

    Background: Visualization of DNA microarray data in two- or three-dimensional spaces is an important exploratory analysis step in order to detect quality issues or to generate new hypotheses. Principal Component Analysis (PCA) is a widely used linear method to define the mapping between the high-dimensional data and its low-dimensional representation. During the last decade, many new nonlinear methods for dimension reduction have been proposed, but it is still unclear how well these methods capture the underlying structure of microarray gene expression data. In this study, we assessed the performance of the PCA approach and of six nonlinear dimension reduction methods, namely Kernel PCA, Locally Linear Embedding, Isomap, Diffusion Maps, Laplacian Eigenmaps and Maximum Variance Unfolding, in terms of visualization of microarray data. Results: A systematic benchmark, consisting of Support Vector Machine classification, cluster validation and noise evaluations, was applied to ten microarray and several simulated datasets. Significant differences between PCA and most of the nonlinear methods were observed in two- and three-dimensional target spaces. With an increasing number of dimensions and an increasing number of differentially expressed genes, all methods showed similar performance. PCA and Diffusion Maps were less sensitive to noise than the other nonlinear methods. Conclusions: Locally Linear Embedding and Isomap showed a superior performance on all datasets. In very low-dimensional representations and with few differentially expressed genes, these two methods preserve more of the underlying structure of the data than PCA, and are thus favorable alternatives for the visualization of microarray data.
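    As a rough illustration of the benchmarking idea, and not the authors' pipeline, the sketch below embeds a synthetic high-dimensional dataset into two dimensions with several of the named methods and scores each embedding with a cross-validated SVM; the data generator, neighbor counts, and scoring choices are assumptions.

```python
# Minimal sketch: compare 2-D embeddings from linear and nonlinear methods by
# how well a simple SVM separates the classes in the embedded space.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for a microarray matrix: samples x genes, few informative features.
X, y = make_classification(n_samples=200, n_features=2000, n_informative=30,
                           n_redundant=0, random_state=0)

methods = {
    "PCA": PCA(n_components=2),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
    "Isomap": Isomap(n_components=2, n_neighbors=10),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10),
    "Laplacian Eigenmaps": SpectralEmbedding(n_components=2, n_neighbors=10),
}

for name, method in methods.items():
    Z = method.fit_transform(X)                      # 2-D embedding
    acc = cross_val_score(SVC(), Z, y, cv=5).mean()  # separability proxy
    print(f"{name:20s} mean CV accuracy = {acc:.2f}")
```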

    Optimal Reaction Coordinates

    The dynamic behavior of complex systems with many degrees of freedom is often analyzed by projection onto one or a few reaction coordinates. The dynamics is then described in a simple and intuitive way as diffusion on the associated free energy profile. In order to use such a picture for a quantitative description of the dynamics, one needs to select the coordinate in an optimal way so as to minimize non-Markovian effects due to the projection. For equilibrium dynamics between two boundary states (e.g., a reaction), the optimal coordinate is known as the committor, or the pfold coordinate in protein folding studies. While the dynamics projected on the committor is not Markovian, many important quantities of the original multidimensional dynamics on an arbitrarily complex landscape can be computed exactly. Here we summarize the derivation of this result, discuss different approaches to determine and validate the committor coordinate, and present three illustrative applications: protein folding, the game of chess, and patient recovery dynamics after kidney transplant.
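    For a discrete-state Markov chain, the committor mentioned above has a concrete definition that is easy to compute: q(i) is the probability of reaching the product state B before the reactant state A when starting from state i, and on the non-boundary states it solves a small linear system. The sketch below uses an illustrative biased random walk as the model (an assumption, not an example from the paper).

```python
# Minimal sketch: committor of a discrete Markov chain. On interior states the
# committor satisfies q = P q, with boundary conditions q(A) = 0 and q(B) = 1,
# which rearranges to the linear system (I - P_II) q_I = P_IB.
import numpy as np

n = 11                                   # states 0..10; A = {0}, B = {10}
P = np.zeros((n, n))
P[0, 0] = P[n - 1, n - 1] = 1.0          # absorbing boundary states
for i in range(1, n - 1):                # biased nearest-neighbour walk
    P[i, i - 1] = 0.45
    P[i, i + 1] = 0.55

interior = np.arange(1, n - 1)
A_mat = np.eye(len(interior)) - P[np.ix_(interior, interior)]
b = P[interior, n - 1]                   # one-step probability of landing in B
q_interior = np.linalg.solve(A_mat, b)

q = np.zeros(n)
q[n - 1] = 1.0
q[interior] = q_interior
print(np.round(q, 3))                    # committor increases from A to B
```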

    Muscle Fiber Viability, a Novel Method for the Fast Detection of Ischemic Muscle Injury in Rats

    Acute lower extremity ischemia is a limb- and life-threatening clinical problem. Rapid detection of the degree of injury is crucial; however, at present there are no exact diagnostic tests available to achieve this purpose. Our goal was to examine a novel technique, which has the potential to accurately assess the degree of ischemic muscle injury within a short period of time, in a clinically relevant rodent model. Male Wistar rats were exposed to 4, 6, 8 and 9 hours of bilateral lower limb ischemia induced by occlusion of the infrarenal aorta. Additional animals underwent 8 and 9 hours of ischemia followed by 2 hours of reperfusion to examine the effects of revascularization. Muscle samples were collected from the left anterior tibial muscle for viability assessment. The degree of muscle damage (muscle fiber viability) was assessed by morphometric evaluation of the NADH-tetrazolium reductase reaction on frozen sections. Right hind limbs were perfusion-fixed with paraformaldehyde and glutaraldehyde for light and electron microscopic examinations. Muscle fiber viability decreased progressively over the time of ischemia, with significant differences found between the consecutive time points. A high correlation was detected between the length of ischemia and the values of muscle fiber viability. After reperfusion, viability showed a significant reduction in the 8-hour-ischemia and 2-hour-reperfusion group compared to the 8-hour-ischemia-only group, and decreased further after 9 hours of ischemia and 2 hours of reperfusion. Light and electron microscopic findings correlated strongly with the values of muscle fiber viability: lower viability values corresponded to a higher degree of ultrastructural injury, while similar viability results corresponded to similar morphological injury. Muscle fiber viability was capable of accurately determining the degree of muscle injury in our rat model. Our method might therefore be useful in clinical settings in the diagnostics of acute ischemic muscle injury.

    Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

    We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques were cosine similarity on term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models, BM25 and PMRA (PubMed Related Articles). The two data sources were (a) MeSH subject headings and (b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
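    A minimal sketch of the tf-idf cosine branch of such a comparison is given below; it is not the authors' pipeline (which filters to the top-n similarities per document and combines graph layout with average-link clustering at a much larger scale), and the toy corpus and cluster count are assumptions.

```python
# Minimal sketch: tf-idf vectors, document-document cosine similarity, and
# average-link clustering of the resulting distance matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "matrix metalloproteinases in mammary gland branching morphogenesis",
    "integrin signalling and extracellular matrix remodelling in the breast",
    "double descent and the bias-variance trade-off in neural networks",
    "interpolation of training data and generalization in machine learning",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = cosine_similarity(tfidf)                       # document-document similarity
dist = np.clip(1.0 - sim, 0.0, None)                 # convert to a distance
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")  # average-link
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                                        # e.g. [1 1 2 2]
```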