76 research outputs found

    The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets

    We study universal traits that emerge in both real-world complex datasets and artificially generated ones. Our approach is to treat data as analogous to a physical system and to employ tools from statistical physics and Random Matrix Theory (RMT) to reveal its underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) the power-law scalings exhibited by the bulk of its eigenvalues are vastly different for uncorrelated random data compared to real-world data; (ii) this scaling behavior can be completely recovered by introducing long-range correlations in a simple way into the synthetic data; (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems; (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior; (v) the Shannon entropy is correlated with local RMT structure and eigenvalue scaling, is substantially smaller in strongly correlated datasets than in uncorrelated synthetic data, and requires fewer samples to reach the distribution entropy. These findings can have numerous implications for characterizing the complexity of datasets, including differentiating synthetically generated from natural data, quantifying noise, developing better data pruning methods, and classifying effective learning models utilizing these scaling laws. Comment: 16 pages, 7 figures

    The Universal Statistical Structure and Scaling Laws of Chaos and Turbulence

    Turbulence is a complex spatial and temporal structure created by the strong non-linear dynamics of fluid flows at high Reynolds numbers. Despite being a ubiquitous phenomenon that has been studied for centuries, a full understanding of turbulence remains a formidable challenge. Here, we introduce tools from the fields of quantum chaos and Random Matrix Theory (RMT) and present a detailed analysis of image datasets generated from turbulence simulations of incompressible and compressible fluid flows. Focusing on two observables, the data Gram matrix and the single-image distribution, we study both the local and global eigenvalue statistics and compare them to classical chaos, uncorrelated noise, and natural images. We show that, from the RMT perspective, the turbulence Gram matrices lie in the same universality class as quantum chaotic rather than integrable systems, and that the data exhibit power-law scalings in the bulk of their eigenvalues which are vastly different from those of uncorrelated classical chaos, random data, and natural images. Interestingly, we find that the single sample distribution only appears as fully RMT chaotic, but deviates from chaos at larger correlation lengths, as well as exhibiting different scaling properties. Comment: 9 pages, 4 figures

    Scheduling with Testing

    We study a new class of scheduling problems that capture common settings in service environments, in which one has to serve a collection of jobs that have a priori uncertain attributes (e.g., processing times and priorities), and the service provider has to decide how to dynamically allocate resources (e.g., people, equipment, and time) between testing (diagnosing) jobs to learn more about their uncertain attributes and processing jobs. Testing could inform future decisions but could delay service for other jobs, while processing directly advances the jobs but requires making decisions under uncertainty. Through novel analysis we obtain surprising structural results on optimal policies that provide operational and managerial insights, efficient optimal and near-optimal algorithms, and a quantification of the value of testing. We believe that our approach will lead to further research exploring this important practical trade-off.

    Mucin gene expression in bile of patients with and without gallstone disease, collected by endoscopic retrograde cholangiography

    AIM: To investigate the pattern of mucin expression and concentration in bile obtained during endoscopic retrograde cholangiography (ERC) in relation to gallstone disease.

    Decision Fatigue and Heuristic Analyst Forecasts

    Psychological evidence indicates that decision quality declines after an extensive session of decision-making, a phenomenon known as decision fatigue. We study whether decision fatigue affects analysts’ judgments. Analysts cover multiple firms and often issue several forecasts in a single day. We find that forecast accuracy declines over the course of a day as the number of forecasts the analyst has already issued increases. Also consistent with decision fatigue, we find that the more forecasts an analyst issues, the more likely the analyst is to resort to heuristic decisions: herding more closely with the consensus forecast, self-herding (i.e., reissuing their own previous outstanding forecasts), and issuing a rounded forecast. Finally, we find that the stock market understands these effects and discounts for analyst decision fatigue.