7,796 research outputs found

    A temporal and spatial locality theory for characterizing very large data bases

    Stuart E. Madnick, Allen Moulton. Bibliography: p. 22-23.

    CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases

    © 2019 The Authors. Methods in Ecology and Evolution published by John Wiley & Sons Ltd on behalf of British Ecological Society. Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo-referencing or dating, can diminish their usefulness. Manual cleaning is time-consuming, error prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records. Here, we present CoordinateCleaner, an r-package to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo-referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio-temporal tests for fossils. We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing datasets (18.5%) might be biased by rasterized coordinates. In PBDB, 1205 records (6.3%) are potentially problematic. 
    All cleaning functions and the biodiversity institution database are open-source and available within the CoordinateCleaner r-package.
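
    The kinds of tests the abstract describes (flagging zero or impossible coordinates, and records that fall suspiciously close to a biodiversity institution) can be sketched as follows. This is a minimal Python re-imagining for illustration only; the record fields, the institution list, and the 2 km radius are assumptions, not CoordinateCleaner's actual R API.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def flag_record(lat, lon, institutions, radius_km=2.0):
    """Return the list of test names the record fails (empty list = clean)."""
    flags = []
    if lat == 0 and lon == 0:
        flags.append("zero_coordinates")      # common data-entry default
    if abs(lat) > 90 or abs(lon) > 180:
        flags.append("invalid_range")         # impossible coordinates
    if lat == lon:
        flags.append("equal_lat_lon")         # frequent transposition error
    for ilat, ilon in institutions:
        if haversine_km(lat, lon, ilat, ilon) < radius_km:
            flags.append("near_institution")  # likely cultivated or captive
            break
    return flags

# Example: a record at a (hypothetical) herbarium's coordinates is flagged
institutions = [(52.3667, 4.9000)]
print(flag_record(52.3667, 4.9000, institutions))  # -> ['near_institution']
```

    In the package itself these tests run over whole datasets at once and draw on gazetteers and the 9,691-institution database mentioned above; the sketch only shows the per-record flagging logic.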

    One size does not fit all: accelerating OLAP workloads with GPUs

    GPUs have been considered one of the next-generation platforms for real-time query processing in databases. In this paper we empirically demonstrate that representative GPU databases [e.g., OmniSci (Open Source Analytical Database & SQL Engine, 2019)] may be slower than representative in-memory databases [e.g., Hyper (Neumann and Leis, IEEE Data Eng Bull 37(1):3-11, 2014)] on typical OLAP workloads (the Star Schema Benchmark), even if the actual dataset size of each query fits completely in GPU memory. Therefore, we argue that GPU database designs should not be one-size-fits-all; a general-purpose GPU database engine may not be well-suited for OLAP workloads without carefully designed GPU memory assignment and GPU computing locality. In order to achieve better performance for GPU OLAP, we need to re-organize the OLAP operators and re-optimize the OLAP model. In particular, we propose a 3-layer OLAP model to match heterogeneous computing platforms. The core idea is to maximize data and computing locality on the specified hardware. We design the vector grouping algorithm for data-intensive workloads, which proves well-suited to the CPU platform. We design the TOP-DOWN query plan tree strategy to guarantee optimal operation in the final stage and push the respective optimizations down to the lower layers for global optimization gains. With this strategy, we design a 3-stage processing model (the OLAP acceleration engine) for hybrid CPU-GPU platforms, where the computing-intensive star-join stage is accelerated by the GPU and the data-intensive grouping & aggregation stage is accelerated by the CPU. This design maximizes the locality of the different workloads and simplifies the GPU acceleration implementation. Our experimental results show that with vector grouping and a GPU-accelerated star-join implementation, the OLAP acceleration engine runs 1.9x, 3.05x and 3.92x faster than Hyper, OmniSci GPU and OmniSci CPU, respectively, in the SSB evaluation with a dataset of SF = 100.
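
    The two-stage split described above can be sketched in a few lines: a computing-intensive star-join stage (run on the GPU in the paper; plain Python here) followed by a data-intensive grouping & aggregation stage on the CPU. The table layouts and column names below are illustrative assumptions loosely modelled on SSB, not the paper's actual implementation.

```python
def star_join(fact_rows, dim_tables):
    """Stage 1: resolve each fact row's foreign keys against the dimension
    tables, producing (group_key, measure) pairs for the next stage."""
    out = []
    for fk_date, fk_part, revenue in fact_rows:
        year = dim_tables["date"].get(fk_date)
        brand = dim_tables["part"].get(fk_part)
        if year is not None and brand is not None:  # inner-join semantics
            out.append(((year, brand), revenue))
    return out

def vector_group(joined):
    """Stage 2: data-intensive grouping & aggregation (SUM) over the join output."""
    agg = {}
    for key, measure in joined:
        agg[key] = agg.get(key, 0) + measure
    return agg

dims = {"date": {1: 1997, 2: 1998}, "part": {10: "MFGR#12"}}
facts = [(1, 10, 100), (2, 10, 50), (1, 99, 7)]  # last row fails the part join
print(vector_group(star_join(facts, dims)))
# -> {(1997, 'MFGR#12'): 100, (1998, 'MFGR#12'): 50}
```

    The point of the split is that the join is compute-bound (well matched to GPU parallelism) while grouping and aggregation are memory-bandwidth-bound, so assigning each stage to its best-fit processor maximizes locality.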

    Exploiting Data Skew for Improved Query Performance

    Analytic queries enable sophisticated large-scale data analysis within many commercial, scientific and medical domains today. Data skew is a ubiquitous feature of these real-world domains. In a retail database, some products are typically much more popular than others. In a text database, word frequencies follow a Zipf distribution with a small number of very common words, and a long tail of infrequent words. In a geographic database, some regions have much higher populations (and data measurements) than others. Current systems do not make the most of caches for exploiting skew. In particular, a whole cache line may remain cache resident even though only a small part of the cache line corresponds to a popular data item. In this paper, we propose a novel index structure for repositioning data items to concentrate popular items into the same cache lines. The net result is better spatial locality, and better utilization of limited cache resources. We develop a theoretical model for analyzing the cache behavior, and implement database operators that are efficient in the presence of skew. Our experiments on real and synthetic data show that exploiting skew can significantly improve in-memory query performance. In some cases, our techniques can speed up queries by over an order of magnitude.
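
    The repositioning idea can be sketched as follows: reorder items by access frequency so that hot items share cache lines, keeping an index from each item's original id to its new slot. This is an illustrative sketch, not the paper's index structure; the 8-items-per-line figure (e.g. a 64-byte line holding 8-byte values) is an assumption.

```python
ITEMS_PER_LINE = 8  # assumed: 64-byte cache line holding 8-byte values

def reposition(values, frequencies):
    """Return (reordered_values, index) with the most popular items first,
    so the hot set occupies as few cache lines as possible."""
    order = sorted(range(len(values)), key=lambda i: -frequencies[i])
    reordered = [values[i] for i in order]
    index = {orig: new for new, orig in enumerate(order)}  # id -> new position
    return reordered, index

def lines_touched(index, accessed_ids):
    """Distinct cache lines a set of accesses touches under a given layout."""
    return len({index[i] // ITEMS_PER_LINE for i in accessed_ids})

vals = list(range(32))
freq = [1000 if v % 4 == 0 else 1 for v in vals]  # hot items scattered (Zipf-like)
hot = [v for v in vals if freq[v] == 1000]
identity = {i: i for i in vals}                   # original, unordered layout
reordered, idx = reposition(vals, freq)
print(lines_touched(identity, hot), "->", lines_touched(idx, hot))  # 4 -> 1
```

    In the example, the eight hot items are spread across four cache lines in the original layout but fit in a single line after repositioning, which is exactly the spatial-locality gain the abstract describes.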

    Face Transplantation


    NSFW: An Empirical Study of Scandalous Trademarks

    This project is an empirical analysis of trademarks that have received rejections based on their “scandalous” nature. It is the first of its kind. The Lanham Act bars registration for trademarks that are “scandalous” and “immoral.” While much has been written on the morality provisions in the Lanham Act, this piece is the first scholarly project to undertake an empirical analysis of Section 2(a) rejections based on scandalousness; it contains a look behind the scenes at how the morality provisions are applied throughout the trademark registration process. This study analyzes which marks are being rejected, what evidence is being used to reject them, and who the applicants are. We pay particularly close attention to the evidence used to determine whether a mark is scandalous. We also consider whether this bar is effective at removing these marks from the consumer marketplace.

    Upgrade of the CEDIT database of earthquake-induced ground effects in Italy

    The database of the Italian Catalogue of Earthquake-Induced Ground Failures (CEDIT) was recently upgraded and updated to 2017 in the framework of a work in progress focused on the following issues: i) reorganization of the geo-database architecture; ii) revision of the earthquake parameters from the CFTI5 and CPTI15 catalogues by INGV; iii) addition of new data on effects induced by earthquakes that occurred from 2009 to 2017; iv) attribution of a macroseismic intensity value to each effect site, according to the CFTI5 and CPTI15 catalogues by INGV. The revised CEDIT database aims at achieving: i) the optimization of the CEDIT catalogue in order to increase its usefulness for both public institutions and individual users; ii) a new architecture of the geo-database in view of a future implementation of the online catalogue, which will make it usable via web-app, also to support post-event detection and surveying activities. Here we illustrate the new geo-database design and discuss the statistics that can be derived from the updated database. Statistical analysis was carried out on the data recorded in the last update of CEDIT to 2017 and, compared with the analysis of the previous update, outlines that:
    - the most represented ground effects are landslides, with a percentage of 55%, followed by ground cracks with a percentage of 23%;
    - the MCS intensity (IMCS) distribution of the effect sites shows a maximum at IMCS class 8, even if a second frequency peak appears at IMCS class 7 for surface faulting effects only;
    - the distribution of the effects according to the epicentral distance shows a decrease for all typologies of induced ground effects with increasing epicentral distance.
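
    The summary statistics described above (the share of each effect type and the decay of counts with epicentral distance) can be computed with a few lines over the catalogue records. The record layout and sample values below are illustrative assumptions, not the actual CEDIT schema.

```python
from collections import Counter

def effect_shares(records):
    """Percentage of records per effect type, e.g. {'landslide': 55.0, ...}."""
    counts = Counter(r["type"] for r in records)
    total = sum(counts.values())
    return {t: round(100.0 * n / total, 1) for t, n in counts.items()}

def distance_histogram(records, bin_km=10):
    """Count effects per epicentral-distance bin; counts should decay with distance."""
    counts = Counter(int(r["epicentral_km"] // bin_km) for r in records)
    return [counts.get(b, 0) for b in range(max(counts) + 1)]

# Toy sample: landslides dominate and counts fall off with distance, as in CEDIT
records = (
    [{"type": "landslide", "epicentral_km": d} for d in (2, 5, 8, 12, 15, 25)]
    + [{"type": "ground crack", "epicentral_km": d} for d in (3, 9, 18)]
)
print(effect_shares(records))
print(distance_histogram(records))  # -> [5, 3, 1]
```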