10,961 research outputs found

    The OTree: multidimensional indexing with efficient data sampling for HPC

    Get PDF
    Spatial big data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of petabytes of spatial data per year. However, many authors have pointed out that the lack of specialized frameworks for multidimensional Big Data is limiting possible applications and precluding many scientific breakthroughs. Paramount in achieving High-Performance Data Analytics is to optimize and reduce the I/O operations required to analyze large data sets. To do so, we need to organize and index the data according to its multidimensional attributes. At the same time, to enable fast and interactive exploratory analysis, it is vital to generate approximate representations of large datasets efficiently. In this paper, we propose the Outlook Tree (or OTree), a novel Multidimensional Indexing with efficient data Sampling (MIS) algorithm. The OTree enables exploratory analysis of large multidimensional datasets with arbitrary precision, a vital missing feature in current distributed data management solutions. Our algorithm reduces the indexing overhead and achieves high performance even for write-intensive HPC applications. Indeed, we use the OTree to store the scientific results of a study on the efficiency of drug inhalers. Then we compare the OTree implementation on Apache Cassandra, named Qbeast, with PostgreSQL and plain storage. Lastly, we demonstrate that our proposal delivers better performance and scalability.Peer ReviewedPostprint (author's final draft

    Providing Diversity in K-Nearest Neighbor Query Results

    Full text link
    Given a point query Q in multi-dimensional space, K-Nearest Neighbor (KNN) queries return the K closest answers according to given distance metric in the database with respect to Q. In this scenario, it is possible that a majority of the answers may be very similar to some other, especially when the data has clusters. For a variety of applications, such homogeneous result sets may not add value to the user. In this paper, we consider the problem of providing diversity in the results of KNN queries, that is, to produce the closest result set such that each answer is sufficiently different from the rest. We first propose a user-tunable definition of diversity, and then present an algorithm, called MOTLEY, for producing a diverse result set as per this definition. Through a detailed experimental evaluation on real and synthetic data, we show that MOTLEY can produce diverse result sets by reading only a small fraction of the tuples in the database. Further, it imposes no additional overhead on the evaluation of traditional KNN queries, thereby providing a seamless interface between diversity and distance.Comment: 20 pages, 11 figure

    Anonymous Networking amidst Eavesdroppers

    Full text link
    The problem of security against timing based traffic analysis in wireless networks is considered in this work. An analytical measure of anonymity in eavesdropped networks is proposed using the information theoretic concept of equivocation. For a physical layer with orthogonal transmitter directed signaling, scheduling and relaying techniques are designed to maximize achievable network performance for any given level of anonymity. The network performance is measured by the achievable relay rates from the sources to destinations under latency and medium access constraints. In particular, analytical results are presented for two scenarios: For a two-hop network with maximum anonymity, achievable rate regions for a general m x 1 relay are characterized when nodes generate independent Poisson transmission schedules. The rate regions are presented for both strict and average delay constraints on traffic flow through the relay. For a multihop network with an arbitrary anonymity requirement, the problem of maximizing the sum-rate of flows (network throughput) is considered. A selective independent scheduling strategy is designed for this purpose, and using the analytical results for the two-hop network, the achievable throughput is characterized as a function of the anonymity level. The throughput-anonymity relation for the proposed strategy is shown to be equivalent to an information theoretic rate-distortion function

    Measuring and validating social cohesion: a bottom-up approach

    Get PDF
    The aim of this paper is to provide a synthetic macro index of social cohesion based on the observation of several individual level variables. Based on the definition of social cohesion by Bernard (1999) and Chan et al. (2006) an index of social cohesion (henceforth VALCOS Index) was created. It covers the political and sociocultural domains of life in their formal and substantial relations. Results suggest that the VALCOS-Index of social cohesion is strongly and significantly correlated with other macro indicators largely used by the scientific community. The aggregation of EVS 2008 data on social cohesion together with many macro indicators of several dimensions of social life (including economic, socio-demographic, health and subjective well-being indicators) allowed us to rank social cohesion across 39 European countries and to explore differences across groups of countries. Subsequently, we validated our index by correlating it with many national level variables.social cohesion; methodology; macro index; micro index; EVS
    corecore