
    Multi-Resource Parallel Query Scheduling and Optimization

    Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and communicates with remote sites by message-passing. Earlier work on parallel query scheduling employs either (a) one-dimensional models of parallel task scheduling, effectively ignoring the potential benefits of resource sharing, or (b) models of globally accessible resource units, which are appropriate only for shared-memory architectures, since they cannot capture the affinity of system resources to sites. In this paper, we develop a general approach capturing the full complexity of scheduling distributed, multi-dimensional resource units for all forms of parallelism within and across queries and operators. We present a level-based list scheduling heuristic algorithm for independent query tasks (i.e., physical operator pipelines) that is provably near-optimal for given degrees of partitioned parallelism (with a worst-case performance ratio that depends on the number of time-shared and space-shared resources per site and the granularity of the clones). We also propose extensions to handle blocking constraints in logical operator (e.g., hash-join) pipelines and bushy query plans as well as on-line task arrivals (e.g., in a dynamic or multi-query execution environment). Experiments with our scheduling algorithms implemented on top of a detailed simulation model verify their effectiveness compared to existing approaches in a realistic setting. Based on our analytical and experimental results, we revisit the open problem of designing efficient cost models for parallel query optimization and propose a solution that captures all the important parameters of parallel execution. Comment: 50 pages; the conference version of the paper appeared in the Proceedings of the 23rd International Conference on Very Large Databases (VLDB'1997), Athens, Greece, August 1997.
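
    To make the multi-resource flavour concrete, here is a minimal sketch (not the paper's algorithm; task and site parameters are invented) of greedy list scheduling for one level of independent tasks, where each task carries a time-shared demand (CPU work) and a space-shared demand (memory), and each site has its own CPU queue and memory budget:

```python
# Hypothetical sketch of level-based list scheduling: each task in a level is
# greedily placed on the site that finishes it earliest among sites with
# enough free memory. Parameters below are illustrative only.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cpu_work: float   # time-shared demand (seconds of CPU)
    memory: float     # space-shared demand (MB)

@dataclass
class Site:
    name: str
    ready_time: float = 0.0     # when the site's CPU becomes free
    free_memory: float = 512.0  # remaining space-shared budget

def schedule_level(tasks, sites):
    """Greedy list scheduling of one level of independent tasks."""
    placement = {}
    for task in sorted(tasks, key=lambda t: t.cpu_work, reverse=True):
        feasible = [s for s in sites if s.free_memory >= task.memory]
        if not feasible:
            raise RuntimeError(f"no site can hold {task.name}")
        site = min(feasible, key=lambda s: s.ready_time + task.cpu_work)
        site.ready_time += task.cpu_work
        site.free_memory -= task.memory
        placement[task.name] = site.name
    return placement

if __name__ == "__main__":
    tasks = [Task("scan_A", 4.0, 128), Task("scan_B", 2.5, 64),
             Task("pipeline_C", 6.0, 256)]
    sites = [Site("site0"), Site("site1")]
    print(schedule_level(tasks, sites))
    print({s.name: s.ready_time for s in sites})  # per-site makespan
```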

    Granular association rules for multi-valued data

    Granular association rules are a new approach to revealing patterns hidden in the many-to-many relationships of relational databases. Different types of data, such as nominal, numeric, and multi-valued attributes, must be handled in the rule mining process. In this paper, we study multi-valued data and develop techniques to filter out strong but uninteresting rules. An example of such a rule is "male students rate movies released in the 1990s that are NOT thriller." Rules of this kind, called negative granular association rules, often overwhelm the more useful positive ones. To address this issue, we filter out negative granules such as "NOT thriller" during granule generation. In this way, only positive granular association rules are generated and strong ones are mined. Experimental results on the MovieLens data set indicate that most rules are negative and that our technique is effective in filtering them out. Comment: Proceedings of the 2013 Canadian Conference on Electrical and Computer Engineering (to appear).
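
    As a rough illustration of the filtering idea (attribute values are invented, and this is not the paper's mining algorithm), the sketch below generates only positive granules for a multi-valued genre attribute, so complements such as "NOT thriller" never enter rule generation:

```python
# Illustrative sketch: for a multi-valued attribute such as movie genre,
# generate only positive granules ("genre = Thriller"), never complements
# ("genre != Thriller"), so only positive granular rules can be mined later.

movies = {
    "m1": {"Thriller", "Action"},
    "m2": {"Comedy"},
    "m3": {"Thriller"},
    "m4": {"Drama", "Comedy"},
}

def positive_granules(objects, min_support=0.25):
    """Map each attribute value to the objects containing it, keeping only
    granules whose support clears the threshold."""
    granules = {}
    for obj, values in objects.items():
        for v in values:
            granules.setdefault(v, set()).add(obj)
    n = len(objects)
    # Complements such as "NOT Thriller" are simply never generated here.
    return {v: objs for v, objs in granules.items()
            if len(objs) / n >= min_support}

if __name__ == "__main__":
    for value, objs in positive_granules(movies).items():
        print(f"genre = {value}: support {len(objs)}/{len(movies)}")
```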

    Firm Entry, Trade, and Welfare in Zipf's World

    Firm size follows Zipf's Law, a very fat-tailed distribution that implies a few large firms account for a disproportionate share of overall economic activity. This distribution of firm size is crucial for evaluating the welfare impact of macroeconomic policies such as barriers to entry or trade liberalization. Using a multi-country model of production and trade in which the parameters are calibrated to match the observed distribution of firm size, we show that the welfare impact of high entry costs is small. In the sample of the largest 50 economies in the world, a reduction in entry costs all the way to the U.S. level leads to an average increase in welfare of only 3.25%. In addition, when the firm size distribution follows Zipf's Law, the welfare impact of the extensive margin of trade -- newly imported goods -- vanishes. The extensive margin of imports accounts for only about 3.5% of the total gains from a 10% reduction in trade barriers in our model. This is because under Zipf's Law, the large, inframarginal firms have a far greater welfare impact than the much smaller firms that comprise the extensive margin in these policy experiments. The distribution of firm size matters for these results: in a counterfactual model economy that does not exhibit Zipf's Law the gains from a reduction in entry barriers are an order of magnitude larger, while the gains from trade liberalization are an order of magnitude smaller. Keywords: Zipf's Law, welfare, entry costs, trade barriers
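
    A back-of-the-envelope simulation (not the paper's calibrated model; the tail exponent is assumed) illustrates how a Zipf-like firm size distribution concentrates activity in a few large firms:

```python
# Draw firm sizes from a Pareto distribution with tail exponent close to 1
# (Zipf's Law) and report the share of activity held by the top 1% of firms.

import numpy as np

rng = np.random.default_rng(0)
tail_exponent = 1.05            # assumed; Zipf's Law corresponds to roughly 1
sizes = 1.0 + rng.pareto(tail_exponent, size=100_000)

sizes_sorted = np.sort(sizes)[::-1]
top_1pct = sizes_sorted[: len(sizes_sorted) // 100]
print(f"top 1% of firms hold {top_1pct.sum() / sizes.sum():.1%} of activity")
```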

    Effective Spatial Data Partitioning for Scalable Query Processing

    Recently, MapReduce based spatial query systems have emerged as a cost effective and scalable solution to large scale spatial data processing and analytics. MapReduce based systems achieve massive scalability by partitioning the data and running query tasks on those partitions in parallel. Therefore, effective data partitioning is critical for task parallelization and load balancing, and it directly affects system performance. However, several pitfalls of spatial data partitioning make this task particularly challenging. First, data skew is very common in spatial applications; to achieve the best query performance, data skew needs to be reduced. Second, spatial partitioning approaches generate boundary objects that cross multiple partitions and add extra query processing overhead; consequently, boundary objects need to be minimized. Third, the high computational complexity of spatial partitioning algorithms, combined with massive amounts of data, requires an efficient approach to partitioning to achieve overall fast query response. In this paper, we provide a systematic evaluation of multiple spatial partitioning methods with a set of different partitioning strategies, and study their implications for the performance of MapReduce based spatial queries. We also study sampling based partitioning methods and their impact on queries, and propose several MapReduce based high performance spatial partitioning methods. The main objective of our work is to provide comprehensive guidance for optimal spatial data partitioning to support scalable and fast spatial data processing in massively parallel data processing frameworks such as MapReduce. The algorithms developed in this work are open source and can easily be integrated into different high performance spatial data processing systems.
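
    As a toy example of the trade-offs discussed above (not one of the partitioning methods evaluated in the paper), a fixed-grid partitioner shows both partition skew and how objects spanning cell borders become replicated boundary objects:

```python
# Simplified fixed-grid spatial partitioner (illustrative only): each object
# is a bounding box assigned to every grid cell it overlaps, so objects that
# cross cell borders become replicated "boundary objects" which later query
# stages must deduplicate.

from collections import defaultdict

def grid_partition(boxes, cell_size):
    """boxes: dict of id -> (xmin, ymin, xmax, ymax). Returns (cell -> ids, boundary ids)."""
    partitions = defaultdict(list)
    boundary_objects = set()
    for oid, (xmin, ymin, xmax, ymax) in boxes.items():
        cells = [(cx, cy)
                 for cx in range(int(xmin // cell_size), int(xmax // cell_size) + 1)
                 for cy in range(int(ymin // cell_size), int(ymax // cell_size) + 1)]
        if len(cells) > 1:
            boundary_objects.add(oid)
        for cell in cells:
            partitions[cell].append(oid)
    return partitions, boundary_objects

if __name__ == "__main__":
    boxes = {"a": (0, 0, 3, 3), "b": (9, 9, 11, 11), "c": (4, 4, 5, 5)}
    parts, boundary = grid_partition(boxes, cell_size=10)
    print(dict(parts))                 # partition skew is visible in cell sizes
    print("boundary objects:", boundary)
```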

    StreetX: Spatio-Temporal Access Control Model for Data

    Cities are a big source of spatio-temporal data that is shared across entities to drive potential use cases. Many spatio-temporal datasets are confidential and are shared selectively. To allow selective sharing, several access control models exist; however, they do not let users express arbitrary space and time constraints on data attributes. In this paper we focus on a spatio-temporal access control model. We show through a motivating example that the location and time attributes of data may decide its confidentiality and thus can affect a user's access control policy. We present StreetX, which enables users to express constraints on multiple arbitrary space regions and time windows using a simple abstract language. StreetX is scalable and is designed to handle large amounts of spatio-temporal data from multiple users. Multiple space and time constraints can affect query performance and may also result in conflicts. StreetX automatically resolves conflicts and optimizes query evaluation under access control to improve performance. We implemented and tested a prototype of StreetX using space constraints, defining a region having 1749 polygon coordinates over 10 million data records. Our testing shows that StreetX extends current access control with spatio-temporal capabilities. Comment: 10 pages.
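
    The following sketch illustrates the basic idea of a spatio-temporal policy check (field names and the bounding-box policy format are hypothetical, not StreetX's actual language): a record is released only if its location lies inside some allowed region and its timestamp falls within the paired time window:

```python
# Minimal spatio-temporal access check: a policy is a list of
# (space region, time window) pairs; a record must satisfy at least one pair.

from datetime import datetime

policy = [
    # ((xmin, ymin, xmax, ymax), (start, end)) -- both assumed formats
    ((-74.05, 40.70, -73.90, 40.80),
     (datetime(2017, 1, 1), datetime(2017, 6, 30))),
]

def allowed(record, policy):
    x, y, ts = record["lon"], record["lat"], record["timestamp"]
    for (xmin, ymin, xmax, ymax), (start, end) in policy:
        if xmin <= x <= xmax and ymin <= y <= ymax and start <= ts <= end:
            return True
    return False

if __name__ == "__main__":
    rec = {"lon": -73.98, "lat": 40.75, "timestamp": datetime(2017, 3, 15)}
    print(allowed(rec, policy))   # True: inside the region and the window
```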

    Sector Concentration in Loan Portfolios and Economic Capital

    The purpose of this paper is to measure the potential impact of business-sector concentration on economic capital for loan portfolios and to explore a tractable model for its measurement. The empirical part evaluates the increase in economic capital in a multi-factor asset value model for portfolios with increasing sector concentration. The sector composition is based on credit information from the German central credit register. Finding that business sector concentration can substantially increase economic capital, the theoretical part of the paper explores whether this risk can be measured by a tractable model that avoids Monte Carlo simulations. We analyze a simplified version of the analytic value-at-risk approximation developed by Pykhtin (2004), which only requires risk parameters on a sector level. Sensitivity analyses with various input parameters show that the analytic approximation formulae perform well in approximating economic capital for portfolios which are homogeneous on a sector level in terms of PD and exposure size. Furthermore, we explore the robustness of our results for portfolios which are heterogeneous in terms of these two characteristics. We find that low granularity ceteris paribus causes the analytic approximation formulae to underestimate economic capital, whereas heterogeneity in individual PDs causes overestimation. Indicative results imply that in typical credit portfolios, PD heterogeneity will at least compensate for the granularity effect. This suggests that the analytic approximations estimate economic capital reasonably well and/or err on the conservative side. Keywords: sector concentration risk, economic capital
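
    For intuition, here is a small Monte Carlo sketch of economic capital in a two-sector asset-value model (all parameters are assumed for illustration; the paper's empirical portfolio and the Pykhtin approximation are not reproduced here):

```python
# Monte Carlo economic capital in a toy two-sector Gaussian asset-value model:
# economic capital = portfolio loss quantile minus expected loss.

import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
inv = NormalDist().inv_cdf

pd_, lgd, rho, n_per_sector = 0.02, 0.45, 0.20, 500   # assumed parameters
factor_corr = 0.5        # assumed correlation between the two sector factors
threshold = inv(pd_)     # default threshold on asset returns
n_sims = 20_000

# Correlated sector systematic factors
L = np.linalg.cholesky([[1.0, factor_corr], [factor_corr, 1.0]])
z = rng.standard_normal((n_sims, 2)) @ L.T             # shape (sims, 2 sectors)

losses = np.zeros(n_sims)
for s in range(2):
    eps = rng.standard_normal((n_sims, n_per_sector))  # idiosyncratic shocks
    assets = np.sqrt(rho) * z[:, [s]] + np.sqrt(1 - rho) * eps
    defaults = assets < threshold
    losses += lgd * defaults.sum(axis=1) / (2 * n_per_sector)  # equal exposures

expected_loss = losses.mean()
var_999 = np.quantile(losses, 0.999)
print(f"EL = {expected_loss:.3%}, VaR(99.9%) = {var_999:.3%}, "
      f"economic capital = {var_999 - expected_loss:.3%}")
```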

    Geometry-Informed Material Recognition

    Our goal is to recognize material categories using images and geometry information. In many applications, such as construction management, coarse geometry information is available. We investigate how 3D geometry (surface normals, camera intrinsic and extrinsic parameters) can be used with 2D features (texture and color) to improve material classification. We introduce a new dataset, GeoMat, which is the first to provide both image and geometry data in the form of: (i) training and testing patches that were extracted at different scales and perspectives from real world examples of each material category, and (ii) a large scale construction site scene that includes 160 images and over 800,000 hand labeled 3D points. Our results show that using 2D and 3D features both jointly and independently to model materials improves classification accuracy across multiple scales and viewing directions for both material patches and images of a large scale construction site scene. Comment: IEEE Conference on Computer Vision and Pattern Recognition 2016 (CVPR '16).
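
    A toy sketch of the fusion idea (the feature extractors below are generic stand-ins, not the descriptors used in the paper): concatenate a 2D appearance descriptor with a 3D geometry descriptor into a single joint feature vector for a material classifier:

```python
# Stand-in 2D + 3D feature fusion: a color histogram (appearance) is
# concatenated with a surface-normal elevation histogram (geometry).

import numpy as np

def color_histogram(patch_rgb, bins=8):
    """2D feature: per-channel intensity histograms, concatenated."""
    return np.concatenate([np.histogram(patch_rgb[..., c], bins=bins,
                                        range=(0, 255), density=True)[0]
                           for c in range(3)])

def normal_histogram(normals, bins=8):
    """3D feature: histogram of surface-normal elevation angles."""
    elevation = np.arccos(np.clip(normals[:, 2], -1, 1))
    return np.histogram(elevation, bins=bins, range=(0, np.pi), density=True)[0]

def fuse(patch_rgb, normals):
    return np.concatenate([color_histogram(patch_rgb), normal_histogram(normals)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.integers(0, 256, size=(32, 32, 3))          # fake image patch
    normals = rng.standard_normal((500, 3))                  # fake surface normals
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    print(fuse(patch, normals).shape)                        # (24 + 8,) = (32,)
```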

    Toward Open Data Blockchain Analytics: A Bitcoin Perspective

    Bitcoin is the first implementation of what has become known as a 'public permissionless' blockchain. Guaranteeing security and protocol conformity through its elegant combination of cryptographic assurances and game theoretic economic incentives, it permits censorship resistant public read-write access to its append-only blockchain database without the need for any mediating central authority. Not until its advent has such a trusted, transparent, comprehensive and granular data set of digital economic behaviours been available for public network analysis. In this article, by translating the cumbersome binary data structure of the Bitcoin blockchain into a high fidelity graph model, we demonstrate through various analyses the often overlooked social and econometric benefits of employing such a novel open data architecture. Specifically we show (a) how repeated patterns of transaction behaviours can be revealed to link user activity across the blockchain; (b) how newly mined bitcoin can be associated to demonstrate individual accumulations of wealth; (c) how, through application of the naive quantity theory of money, Bitcoin's disinflationary properties can be revealed and measured; and (d) how the user community can develop coordinated defences against repeated denial of service attacks on the network. All of the aforementioned are exemplary benefits that would be lost with the closed data models of the 'private permissioned' distributed ledger architectures that dominate enterprise-level development due to existing blockchain issues of governance, scalability and confidentiality. Comment: 17 pages, 9 figures.
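
    As one example of the kind of user-linking analysis mentioned in (a), here is a sketch of the common-input-ownership heuristic on made-up transactions (a standard blockchain-analysis heuristic, not necessarily the article's exact graph queries):

```python
# Common-input-ownership heuristic: addresses spent together as inputs of one
# transaction are clustered as likely belonging to the same user (union-find).

class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

transactions = [
    {"inputs": ["addr1", "addr2"], "outputs": [("addr5", 1.5)]},
    {"inputs": ["addr2", "addr3"], "outputs": [("addr6", 0.7)]},
    {"inputs": ["addr4"],          "outputs": [("addr7", 2.0)]},
]

uf = UnionFind()
for tx in transactions:
    inputs = tx["inputs"]
    uf.find(inputs[0])                 # register even single-input spenders
    for addr in inputs[1:]:
        uf.union(inputs[0], addr)

clusters = {}
for addr in uf.parent:
    clusters.setdefault(uf.find(addr), set()).add(addr)
print(list(clusters.values()))   # addr1, addr2, addr3 end up in one cluster
```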

    Automatic Investigation Framework for Android Malware Cyber-Infrastructures

    The popularity of the Android system, not only in handset devices but also in IoT devices, makes it a very attractive target for malware. Indeed, malware is expanding at a similar rate, targeting such devices, which rely in most cases on the Internet to work properly. State-of-the-art malware mitigation solutions mainly focus on detecting the actual malicious Android apps, using dynamic and static analysis features to distinguish malicious apps from benign ones. However, the Internet/network dimension of malicious Android apps has received little coverage. In this paper, we present ToGather, an automatic investigation framework that takes Android malware samples as input and produces situational awareness of the malicious cyber-infrastructure of these samples' families. ToGather leverages state-of-the-art graph theory techniques to generate actionable and granular intelligence to mitigate the threat posed by the malicious Internet activity of Android malware apps. We evaluate ToGather on real malware samples from various Android families, and the obtained results are interesting and very promising. Comment: 12 pages.
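
    A minimal sketch of the underlying graph idea (samples and indicators are invented; ToGather's actual pipeline and graph algorithms are more elaborate): connect samples to the network indicators they contact and take connected components to surface shared infrastructure:

```python
# Build a graph linking malware samples to the domains/IPs they contact, then
# use connected components to expose cyber-infrastructure shared by samples.

import networkx as nx

contacts = {
    "sample_a": ["evil-cdn.example", "10.0.0.5"],
    "sample_b": ["evil-cdn.example", "tracker.example"],
    "sample_c": ["other-c2.example"],
}

G = nx.Graph()
for sample, indicators in contacts.items():
    for indicator in indicators:
        G.add_edge(sample, indicator)

for component in nx.connected_components(G):
    samples = sorted(n for n in component if n.startswith("sample_"))
    infra = sorted(n for n in component if not n.startswith("sample_"))
    print(f"infrastructure {infra} shared by {samples}")
```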

    Unsupervised Iterative Deep Learning of Speech Features and Acoustic Tokens with Applications to Spoken Term Detection

    In this paper we aim to automatically discover high-quality frame-level speech features and acoustic tokens directly from unlabeled speech data. A Multi-granular Acoustic Tokenizer (MAT) is proposed for the automatic discovery of multiple sets of acoustic tokens from a given corpus. Each acoustic token set is specified by a set of hyperparameters describing the model configuration. These different sets of acoustic tokens capture different characteristics of the given corpus and the language behind it, and can thus be mutually reinforcing. The multiple sets of token labels are then used as the targets of a Multi-target Deep Neural Network (MDNN) trained on frame-level acoustic features. Bottleneck features extracted from the MDNN are then used as the feedback input to the MAT and the MDNN itself in the next iteration. The multi-granular acoustic token sets and the frame-level speech features can thus be iteratively optimized in this iterative deep learning framework, which we call the Multi-granular Acoustic Tokenizing Deep Neural Network (MATDNN). Results were evaluated using the metrics and corpora defined in the Zero Resource Speech Challenge organized at Interspeech 2015, and improved performance was obtained in a set of query-by-example spoken term detection experiments on the same corpora. A visualization of the discovered tokens against the English phonemes is also shown. Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:1602.00426, arXiv:1506.0232
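
    A structural sketch of the iterative loop only, with deliberately crude stand-ins (KMeans for the MAT tokenizer and PCA for the MDNN bottleneck; the real system uses very different models):

```python
# Iterative feature/token refinement skeleton: tokenize at several
# granularities, compress the multi-target representation, and feed the
# compressed features back into the next iteration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def iterate_features(frames, granularities=(8, 16, 32), n_iterations=3, dim=13):
    feats = frames
    for _ in range(n_iterations):
        # "MAT" step: several token sets at different granularities
        token_distances = [KMeans(n_clusters=k, n_init=4, random_state=0)
                           .fit_transform(feats) for k in granularities]
        # "MDNN bottleneck" step: compress the stacked multi-target representation
        stacked = np.hstack(token_distances)
        feats = PCA(n_components=dim).fit_transform(stacked)
    return feats

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((500, 39))   # stand-in MFCC-like frames
    print(iterate_features(frames).shape)     # (500, 13)
```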