
    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number $n$ of acquired samples (statistical replicates) is far smaller than the number $p$ of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting where the sample size $n$ is fixed and the dimension $p$ grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
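The sample-starved regime described in this abstract can be illustrated with a small simulation. The sketch below is not the paper's method: it only assumes independent Gaussian variables (so every true correlation is zero), holds the sample size fixed, and reports the largest absolute sample correlation as the number of variables grows. The function names and the values of n and p are hypothetical choices for illustration.

```python
import itertools
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_correlation(columns):
    """Largest |correlation| over all pairs of distinct variables."""
    return max(abs(pearson(x, y))
               for x, y in itertools.combinations(columns, 2))


def simulate(n, p, seed=0):
    """p independent standard-normal variables with n samples each.

    Assumption for illustration: the true correlation of every pair is 0.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]


if __name__ == "__main__":
    columns = simulate(n=10, p=300)
    for p in (10, 50, 300):
        # Nested subsets of variables, so the maximum can only grow with p.
        print(p, round(max_abs_correlation(columns[:p]), 3))
```

Because the subsets are nested, the reported maximum never decreases as p grows, even though no pair of variables is truly correlated. This is the kind of spurious discovery that motivates quantifying sample complexity when n is fixed.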

    Solving the tasks of subsurface resources management in GIS RAPID environment

    Purpose. Solving the tasks of subsurface resources management on the basis of the newly created GIS RAPID geoinformation technology. Methods. Close spatial relationships between lineament network characteristics and earthquake epicenters were detected in 3 seismically active areas located in the mountainous regions of Central Europe. Digital elevation models (DEM) based on ASTER satellite surveys and earthquake epicenter data were used. The nature of the spatial relationship between the lineament network and vein ore objects was studied in the Democratic Republic of the Congo, in the Lake Kivu area, using space imagery. Gold ore objects were searched for and forecast in Uzbekistan at a site in the Jamansai Mountains. High-resolution imagery from the QuickBird 2 satellite, geophysical field surveys, and geological and geochemical data were used. Findings. It was found that a significant number of epicenters are located in areas with a high concentration of lineaments with “non-standard” azimuths (from 27 to 34% of the total number of lineaments). It was revealed that 59.6% of the epicenters are located within the 10% of the site area with the highest values of the complex deformation maps; the 50% of the area with the highest values of these maps contains, on average, 89% of all earthquake epicenters. It was found that concentration maps of satellite image lineaments with “non-standard” azimuths reflect the spatial relationship with known deposits much better than the concentration map of all lineaments. It was found that the total area of the prospective gold ore sites is about 20 km². Originality. The use of GIS RAPID in a number of areas of the earth’s crust has made it possible to establish new regularities linking the characteristics of lineament networks of physical fields and landscape with the localization of ore bodies and earthquake epicenters. Practical implications. A new technology has been developed for solving geological forecasting and prospecting problems. The technology can be used to solve a wide range of practical problems, especially in difficult geological conditions when searching for deep objects that are weakly expressed in external fields and landscape. The work was performed as part of the planned research of the geoinformation systems department of the Dnipro University of Technology. The results were obtained without financial support from grants or research projects. The authors express appreciation to reviewers and editors for their valuable comments, recommendations, and attention to the work.

    Role based behavior analysis

    Master's thesis, Information Security, Universidade de Lisboa, Faculdade de Ciências, 2009. Nowadays, the success of a corporation hinges on its agility and ability to adapt to fast-changing conditions. Proactive workers and an agile IT/IS infrastructure that can support them are requirements for this success. Unfortunately, this is not always the case. The users' network requirements may not be fully understood, which slows down relocation and reorganization. Also, if there is no grasp of the real requirements, the IT/IS infrastructure may not be efficiently used, with waste in some areas and deficiencies in others. Finally, enabling proactivity does not mean full unrestricted access, since this may leave the systems vulnerable to outsider and insider threats. The purpose of the work described in this thesis is to develop a system that can characterize user network behavior. We propose a modular system architecture to extract information from tagged network flows. The system process begins by creating user profiles from their network flow information. Then, similar profiles are automatically grouped into clusters, creating role profiles. Finally, the individual profiles are compared against the roles, and the ones that differ significantly are flagged as anomalies for further inspection. Considering this architecture, we propose a model to describe user and role network behavior. We also propose visualization methods to quickly inspect all the information contained in the model. The system and model were evaluated using a real dataset from a large telecommunications operator. The results confirm that the roles accurately map similar behavior. The anomaly results were also as expected, considering the underlying population. With the knowledge that the system can extract from the raw data, users' network needs can be better fulfilled and anomalous users flagged for inspection, giving an edge in agility to any company that uses it.
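The three-step process in this abstract (build user profiles, group them into roles, flag users that differ from their role) can be sketched as follows. The abstract does not name a clustering algorithm, a distance measure, or the profile features, so plain k-means, Euclidean distance, the fixed threshold, and the two-feature profiles below are all assumptions made only for illustration.

```python
import math


def nearest(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance, an assumption)."""
    return min(range(len(centroids)), key=lambda k: math.dist(vec, centroids[k]))


def cluster_roles(profiles, initial_centroids, iterations=20):
    """Plain k-means over user profiles; returns (role centroids, role per user)."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        assignment = {u: nearest(v, centroids) for u, v in profiles.items()}
        for k in range(len(centroids)):
            members = [profiles[u] for u, r in assignment.items() if r == k]
            if members:  # an empty role keeps its previous centroid
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    # Final assignment against the final centroids.
    assignment = {u: nearest(v, centroids) for u, v in profiles.items()}
    return centroids, assignment


def flag_anomalies(profiles, centroids, assignment, threshold):
    """Users whose profile lies farther than threshold from their role centroid."""
    return sorted(u for u, v in profiles.items()
                  if math.dist(v, centroids[assignment[u]]) > threshold)


# Hypothetical profiles: [share of web traffic, share of mail traffic].
profiles = {
    "u1": [0.90, 0.10], "u2": [0.80, 0.20], "u3": [0.85, 0.10],
    "u4": [0.10, 0.90], "u5": [0.20, 0.80], "u6": [0.15, 0.85],
    "u7": [0.50, 0.10],
}
centroids, assignment = cluster_roles(profiles, [profiles["u1"], profiles["u4"]])
flagged = flag_anomalies(profiles, centroids, assignment, threshold=0.2)
print(flagged)
```

A fixed distance threshold is the simplest possible anomaly rule; a per-role threshold based on the spread of each cluster would be a natural refinement, but nothing in the abstract indicates which rule the thesis uses.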

    Automatic Bayesian Density Analysis

    Making sense of a dataset in an automatic and unsupervised fashion is a challenging problem in statistics and AI. Classical approaches for exploratory data analysis are usually not flexible enough to deal with the uncertainty inherent to real-world data: they are often restricted to fixed latent interaction models and homogeneous likelihoods; they are sensitive to missing, corrupt and anomalous data; moreover, their expressiveness generally comes at the price of intractable inference. As a result, supervision from statisticians is usually needed to find the right model for the data. However, since domain experts are not necessarily also experts in statistics, we propose Automatic Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible at large. Specifically, ABDA allows for automatic and efficient missing value estimation, statistical data type and likelihood discovery, anomaly detection and dependency structure mining, on top of providing accurate density estimation. Extensive empirical evidence shows that ABDA is a suitable tool for automatic exploratory analysis of mixed continuous and discrete tabular data. Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).

    Conditional network embeddings

    Network Embeddings (NEs) map the nodes of a given network into $d$-dimensional Euclidean space $\mathbb{R}^d$. Ideally, this mapping is such that 'similar' nodes are mapped onto nearby points, such that the NE can be used for purposes such as link prediction (if 'similar' means being 'more likely to be connected') or classification (if 'similar' means 'being more likely to have the same label'). In recent years various methods for NE have been introduced, all following a similar strategy: defining a notion of similarity between nodes (typically some distance measure within the network), a distance measure in the embedding space, and a loss function that penalizes large distances for similar nodes and small distances for dissimilar nodes. A difficulty faced by existing methods is that certain networks are fundamentally hard to embed due to their structural properties: (approximate) multipartiteness, certain degree distributions, assortativity, etc. To overcome this, we introduce a conceptual innovation to the NE literature and propose to create Conditional Network Embeddings (CNEs): embeddings that maximally add information with respect to given structural properties (e.g. node degrees, block densities, etc.). We use a simple Bayesian approach to achieve this, and propose a block stochastic gradient descent algorithm for fitting it efficiently. We demonstrate that CNEs are superior for link prediction and multi-label classification when compared to state-of-the-art methods, and this without adding significant mathematical or computational complexity. Finally, we illustrate the potential of CNE for network visualization.
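The generic strategy this abstract attributes to existing NE methods (a similarity notion, a distance in the embedding space, and a loss that penalizes large distances for similar nodes and small distances for dissimilar ones) might look like the sketch below. It is not the CNE method itself, which the abstract describes as Bayesian. Treating linked nodes as "similar", the squared-distance penalties, and the `margin` parameter are all assumptions for illustration.

```python
import math


def embedding_loss(embedding, edges, margin=1.0):
    """Contrastive-style loss over all node pairs.

    Linked pairs are pulled together; unlinked pairs are penalized only
    when they sit closer than `margin` (a hypothetical parameter).
    """
    nodes = sorted(embedding)
    linked = {frozenset(e) for e in edges}
    loss = 0.0
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            d = math.dist(embedding[u], embedding[v])
            if frozenset((u, v)) in linked:
                loss += d ** 2                     # similar: penalize large distance
            else:
                loss += max(0.0, margin - d) ** 2  # dissimilar: penalize small distance
    return loss


edges = [("a", "b")]
good = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [2.0, 0.0]}  # a and b close, c far
bad = {"a": [0.0, 0.0], "b": [2.0, 0.0], "c": [0.1, 0.0]}   # a and b far, c close
print(embedding_loss(good, edges), embedding_loss(bad, edges))
```

The embedding that places the linked pair close together and the unlinked node far away receives the lower loss, which is the behavior the abstract describes for this family of methods.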