    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number $n$ of acquired samples (statistical replicates) is far smaller than the number $p$ of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting where the sample size $n$ is fixed and the dimension $p$ grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
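
    The sample-starved regime described above can be illustrated with a small simulation. The sketch below is not the paper's method; it is a minimal, assumption-laden example that draws $p$ mutually independent Gaussian variables with a fixed sample size $n$ and reports the largest absolute sample correlation over all pairs. The function names `sample_correlation` and `max_abs_correlation`, the seed, and the chosen values of $n$ and $p$ are all hypothetical.

```python
import math
import random


def sample_correlation(x, y):
    """Pearson sample correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(n, p, seed=0):
    """Largest |sample correlation| among p independent N(0, 1) variables.

    Each variable is observed n times. The true correlation of every pair
    is zero, so any large value found here is spurious.
    """
    rng = random.Random(seed)
    # Variables are generated one after another, so with a fixed seed the
    # first k variables are identical for every p >= k.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    best = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            r = abs(sample_correlation(columns[i], columns[j]))
            best = max(best, r)
    return best


if __name__ == "__main__":
    n = 10  # fixed sample size (assumed value for illustration)
    for p in (10, 50, 200):
        print(f"n={n}, p={p}: max |r| = {max_abs_correlation(n, p):.3f}")
```

    Because the variables are generated sequentially from the same seed, the set of pairs examined for a larger $p$ contains the pairs examined for a smaller $p$, so the reported maximum cannot decrease as $p$ grows while $n$ stays fixed. This is one informal way to see why thresholds for declaring a correlation significant must depend on both $n$ and $p$.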

    Solving the tasks of subsurface resources management in GIS RAPID environment

    Purpose. To solve subsurface resources management tasks using the GIS RAPID geoinformation technology. Methods. Close spatial relationships between lineament network characteristics and earthquake epicenters were detected in 3 seismically active areas located in the mountainous regions of Central Europe. Digital elevation models (DEM) based on ASTER satellite surveys and earthquake epicenter data were used. The nature of the spatial relationship between the lineament network and vein ore objects was studied in the Lake Kivu area of DR Congo using space imagery. Gold ore objects were searched for and forecasted in Uzbekistan at the site of the Jamansai Mountains. High-resolution imagery from the QuickBird 2 satellite, geophysical field surveys, and geological and geochemical data were used. Findings. It was found that a significant number of epicenters are located in areas with a high concentration of lineaments with “non-standard” azimuths (from 27 to 34% of the total number of lineaments). It was revealed that 59.6% of the epicenters are located within the 10% of sites with the highest values on the complex deformation maps; the 50% of areas with the highest values on these maps contain, on average, 89% of all earthquake epicenters. It was found that concentration maps of satellite image lineaments with “non-standard” azimuths reflect the spatial relationship with known deposits much better than the concentration map of all lineaments. It was determined that the total area of sites prospective for gold ore objects is about 20 km². Originality. The use of GIS RAPID in a number of areas of the earth’s crust has made it possible to establish new regularities linking the characteristics of physical field and landscape lineament networks with the localization of ore bodies and earthquake epicenters. Practical implications.

    Role based behavior analysis

    Automatic Bayesian Density Analysis

    Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for exploratory data analysis are usually not flexible enough to deal with the uncertainty inherent to real-world data: they are often restricted to fixed latent interaction models and homogeneous likelihoods; they are sensitive to missing, corrupt and anomalous data; moreover, their expressiveness generally comes at the price of intractable inference. As a result, supervision from statisticians is usually needed to find the right model for the data. However, since domain experts are not necessarily also experts in statistics, we propose Automatic Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible at large. Specifically, ABDA allows for automatic and efficient missing value estimation, statistical data type and likelihood discovery, anomaly detection and dependency structure mining, on top of providing accurate density estimation. Extensive empirical evidence shows that ABDA is a suitable tool for automatic exploratory analysis of mixed continuous and discrete tabular data. Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

    Conditional network embeddings

    Network Embeddings (NEs) map the nodes of a given network into $d$-dimensional Euclidean space $\mathbb{R}^d$. Ideally, this mapping is such that 'similar' nodes are mapped onto nearby points, so that the NE can be used for purposes such as link prediction (if 'similar' means 'being more likely to be connected') or classification (if 'similar' means 'being more likely to have the same label'). In recent years various methods for NE have been introduced, all following a similar strategy: defining a notion of similarity between nodes (typically some distance measure within the network), a distance measure in the embedding space, and a loss function that penalizes large distances for similar nodes and small distances for dissimilar nodes. A difficulty faced by existing methods is that certain networks are fundamentally hard to embed due to their structural properties: (approximate) multipartiteness, certain degree distributions, assortativity, etc. To overcome this, we introduce a conceptual innovation to the NE literature and propose to create Conditional Network Embeddings (CNEs): embeddings that maximally add information with respect to given structural properties (e.g. node degrees, block densities, etc.). We use a simple Bayesian approach to achieve this, and propose a block stochastic gradient descent algorithm for fitting it efficiently. We demonstrate that CNEs are superior for link prediction and multi-label classification when compared to state-of-the-art methods, without adding significant mathematical or computational complexity. Finally, we illustrate the potential of CNE for network visualization.
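
    The idea of combining prior structural knowledge with an embedding through Bayes' rule can be sketched as follows. The abstract does not specify the model, so everything below is an assumption made for illustration: the prior link probability stands in for the "given structural properties", embedding distances are assumed to follow half-normal densities with a smaller spread for linked pairs than for unlinked pairs, and the names `half_normal_pdf`, `link_posterior`, `sigma_linked`, and `sigma_unlinked` are hypothetical.

```python
import math


def half_normal_pdf(distance, sigma):
    """Density of a half-normal distribution with scale sigma at distance >= 0."""
    coeff = math.sqrt(2.0 / math.pi) / sigma
    return coeff * math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def link_posterior(x_i, x_j, prior, sigma_linked=1.0, sigma_unlinked=2.0):
    """Posterior probability that nodes i and j are linked, given their embeddings.

    prior          -- link probability implied by structural knowledge alone
                      (for example node degrees); an assumed input here
    sigma_linked   -- assumed distance scale for linked pairs
    sigma_unlinked -- assumed (larger) distance scale for unlinked pairs
    """
    distance = math.dist(x_i, x_j)  # Euclidean distance in the embedding space
    linked = prior * half_normal_pdf(distance, sigma_linked)
    unlinked = (1.0 - prior) * half_normal_pdf(distance, sigma_unlinked)
    return linked / (linked + unlinked)


if __name__ == "__main__":
    near = link_posterior((0.0, 0.0), (0.1, 0.0), prior=0.2)
    far = link_posterior((0.0, 0.0), (3.0, 0.0), prior=0.2)
    print(f"nearby pair: {near:.3f}, distant pair: {far:.3f}")
```

    Under these assumptions, the embedding only has to explain what the prior does not: a pair with a high prior link probability can be predicted as linked even when its embedded points are not especially close, while a small distance raises the posterior above the prior.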