The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets
We study universal traits which emerge both in real-world complex datasets
and in artificially generated ones. Our approach is to analogize data to
a physical system and employ tools from statistical physics and Random Matrix
Theory (RMT) to reveal their underlying structure. We focus on the
feature-feature covariance matrix, analyzing both its local and global
eigenvalue statistics. Our main observations are: (i) The power-law scalings
that the bulk of its eigenvalues exhibit are vastly different for uncorrelated
random data compared to real-world data, (ii) this scaling behavior can be
completely recovered by introducing long range correlations in a simple way to
the synthetic data, (iii) both generated and real-world datasets lie in the
same universality class from the RMT perspective, as chaotic rather than
integrable systems, (iv) the expected RMT statistical behavior already
manifests for empirical covariance matrices at dataset sizes significantly
smaller than those conventionally used for real-world training, and can be
related to the number of samples required to approximate the population
power-law scaling behavior, (v) the Shannon entropy is correlated with local
RMT structure and eigenvalue scaling, is substantially smaller in strongly
correlated datasets than in uncorrelated synthetic data, and requires fewer
samples to reach the distribution entropy. These findings can have numerous
implications for the characterization of the complexity of data sets, including
differentiating synthetically generated from natural data, quantifying noise,
developing better data pruning methods and classifying effective learning
models utilizing these scaling laws.
Comment: 16 pages, 7 figures
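As a rough illustration of observation (i) in the abstract above, the following sketch compares the eigenvalue scaling of the feature-feature covariance matrix for uncorrelated Gaussian data and for synthetic data with an imposed power-law population spectrum. It is not the authors' code: the helper names (`covariance_spectrum`, `powerlaw_exponent`), the dataset sizes, the exponent `alpha`, the fitting window, and the diagonal construction of the correlated data are all assumptions chosen for illustration.

```python
import numpy as np

def covariance_spectrum(X):
    """Eigenvalues of the empirical feature-feature covariance, largest first."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = Xc.T @ Xc / X.shape[0]            # (n_features, n_features)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]

def powerlaw_exponent(eigs, lo, hi):
    """Minus the least-squares slope of log(eigenvalue) vs log(rank) on ranks[lo:hi]."""
    ranks = np.arange(1, len(eigs) + 1)
    slope, _ = np.polyfit(np.log(ranks[lo:hi]), np.log(eigs[lo:hi]), 1)
    return -slope

rng = np.random.default_rng(0)
n_samples, n_features, alpha = 4000, 100, 1.0   # hypothetical sizes and exponent

# Uncorrelated baseline: i.i.d. standard normal features.
X_iid = rng.standard_normal((n_samples, n_features))

# Correlated stand-in (an assumption, not the paper's construction):
# population covariance with eigenvalues i^-(1 + alpha), built by rescaling columns.
pop_eigs = np.arange(1, n_features + 1, dtype=float) ** -(1.0 + alpha)
X_corr = rng.standard_normal((n_samples, n_features)) * np.sqrt(pop_eigs)

exp_iid = powerlaw_exponent(covariance_spectrum(X_iid), 5, 50)
exp_corr = powerlaw_exponent(covariance_spectrum(X_corr), 5, 50)
print(f"fitted exponent, uncorrelated data: {exp_iid:.2f}")
print(f"fitted exponent, correlated data:   {exp_corr:.2f}")
```

Under these assumptions the fitted exponent for the correlated data should land near 1 + alpha, while the uncorrelated spectrum is nearly flat over the same window.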
The Universal Statistical Structure and Scaling Laws of Chaos and Turbulence
Turbulence is a complex spatial and temporal structure created by the strong
non-linear dynamics of fluid flows at high Reynolds numbers. Despite being a
ubiquitous phenomenon that has been studied for centuries, a full understanding
of turbulence remains a formidable challenge. Here, we introduce tools from
the fields of quantum chaos and Random Matrix Theory (RMT) and present a
detailed analysis of image datasets generated from turbulence simulations of
incompressible and compressible fluid flows. Focusing on two observables: the
data Gram matrix and the single image distribution, we study both the local and
global eigenvalue statistics and compare them to classical chaos, uncorrelated
noise and natural images. We show that from the RMT perspective, the turbulence
Gram matrices lie in the same universality class as quantum chaotic rather than
integrable systems, and the data exhibits power-law scalings in the bulk of its
eigenvalues which are vastly different from uncorrelated classical chaos,
random data, and natural images. Interestingly, we find that the single sample
distribution only appears as fully RMT chaotic, but deviates from chaos at
larger correlation lengths, as well as exhibiting different scaling properties.
Comment: 9 pages, 4 figures
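The distinction between chaotic and integrable universality classes drawn in the abstract above is often probed through local eigenvalue statistics. One common diagnostic, used here as an assumption since the abstract does not name a specific statistic, is the mean ratio of consecutive level spacings: roughly 0.53 for the Gaussian orthogonal ensemble (chaotic) and 2 ln 2 - 1, about 0.39, for uncorrelated Poisson levels (integrable). The sketch below applies it to the Gram matrix of random Gaussian data as a stand-in for a real dataset; the function name and matrix sizes are hypothetical.

```python
import numpy as np

def mean_spacing_ratio(levels):
    """Mean of min(s_i, s_{i+1}) / max(s_i, s_{i+1}) over consecutive level spacings."""
    levels = np.sort(levels)
    s = np.diff(levels)
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()

rng = np.random.default_rng(0)
n_samples, n_features = 600, 1200           # hypothetical sizes

# Sample-sample Gram matrix of random data (stand-in for an image dataset).
X = rng.standard_normal((n_samples, n_features))
gram = X @ X.T / n_features
r_gram = mean_spacing_ratio(np.linalg.eigvalsh(gram))

# Integrable reference: independent, uniformly distributed levels.
r_poisson = mean_spacing_ratio(rng.uniform(size=n_samples))

print(f"Gram matrix   <r> = {r_gram:.3f}")
print(f"Poisson levels <r> = {r_poisson:.3f}")
```

The spacing ratio is convenient because it needs no unfolding of the spectrum, so the raw eigenvalues can be used directly.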
Growth factors in human disease: The realities, pitfalls, and promise
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/25599/1/0000146.pd
Information Architecture and Intertemporal Choice: A Randomized Field Experiment in the United States
In a randomized field experiment, I show that information architecture significantly affects individuals' spending and savings behavior. I present users of a large online account aggregation provider with a personalized financial index. This index represents the inflation-protected, lifetime monthly cash flow that they can obtain, given their personal financial and demographic information and current market prices. Users receiving this information tool reduce their spending by 10.7% relative to a control group. This effect is sensitive to the description of the index using a consumption frame rather than an investment frame and to the presentation of an explicit comparison between the index and historical spending levels. Further, spending reductions are primarily in large, infrequent transactions. This experiment is the first to directly affect overall spending behavior and to demonstrate the importance of information architecture in that context. It demonstrates the potential of low-cost digital information tools to impact financial behavior on a large scale.
Scheduling with Testing
We study a new class of scheduling problems that capture common settings in service environments, in which one has to serve a collection of jobs that have a priori uncertain attributes (e.g., processing times and priorities) and the service provider has to decide how to dynamically allocate resources (e.g., people, equipment, and time) between testing (diagnosing) jobs to learn more about their respective uncertain attributes and processing jobs. The former could inform future decisions, but could delay the service time for other jobs, while the latter directly advances the processing of the jobs but requires making decisions under uncertainty. Through novel analysis, we obtain surprising structural results of optimal policies that provide operational managerial insights, efficient optimal and near-optimal algorithms, and quantification of the value of testing. We believe that our approach will lead to further research to explore this important practical trade-off.