Multi-Resource Parallel Query Scheduling and Optimization
Scheduling query execution plans is a particularly complex problem in
shared-nothing parallel systems, where each site consists of a collection of
local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory)
resources and communicates with remote sites by message-passing. Earlier work
on parallel query scheduling employs either (a) one-dimensional models of
parallel task scheduling, effectively ignoring the potential benefits of
resource sharing, or (b) models of globally accessible resource units, which
are appropriate only for shared-memory architectures, since they cannot capture
the affinity of system resources to sites. In this paper, we develop a general
approach capturing the full complexity of scheduling distributed,
multi-dimensional resource units for all forms of parallelism within and across
queries and operators. We present a level-based list scheduling heuristic
algorithm for independent query tasks (i.e., physical operator pipelines) that
is provably near-optimal for given degrees of partitioned parallelism (with a
worst-case performance ratio that depends on the number of time-shared and
space-shared resources per site and the granularity of the clones). We also
propose extensions to handle blocking constraints in logical operator (e.g.,
hash-join) pipelines and bushy query plans as well as on-line task arrivals
(e.g., in a dynamic or multi-query execution environment). Experiments with our
scheduling algorithms implemented on top of a detailed simulation model verify
their effectiveness compared to existing approaches in a realistic setting.
Based on our analytical and experimental results, we revisit the open problem
of designing efficient cost models for parallel query optimization and propose
a solution that captures all the important parameters of parallel execution.
Comment: 50 pages; a conference version of the paper appeared in the
Proceedings of the 23rd International Conference on Very Large Databases
(VLDB'1997), Athens, Greece, August 1997
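To make the list-scheduling idea concrete, here is a minimal Python sketch of greedy list scheduling of independent tasks onto shared-nothing sites, each with one time-shared capacity (accumulated work) and one space-shared capacity (memory). The Task/Site model, numbers, and names are illustrative stand-ins; this is not the paper's provably near-optimal level-based algorithm or its formal multi-resource model.

```python
# Hedged sketch: greedy list scheduling of independent query tasks (pipelines)
# onto shared-nothing sites with a time-shared and a space-shared resource.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    work: float      # time-shared demand (e.g., CPU seconds)
    memory: float    # space-shared demand (e.g., MB)

@dataclass
class Site:
    load: float = 0.0            # accumulated time-shared work
    mem_free: float = 1024.0     # remaining space-shared capacity
    tasks: list = field(default_factory=list)

def schedule(tasks, sites):
    """Sort tasks by decreasing work, then place each on the least-loaded
    site that still has enough memory; return the estimated makespan."""
    for t in sorted(tasks, key=lambda t: t.work, reverse=True):
        candidates = [s for s in sites if s.mem_free >= t.memory]
        if not candidates:
            raise RuntimeError(f"no site can hold {t.name}")
        best = min(candidates, key=lambda s: s.load)
        best.load += t.work
        best.mem_free -= t.memory
        best.tasks.append(t.name)
    return max(s.load for s in sites)

if __name__ == "__main__":
    tasks = [Task("scan", 40, 200), Task("probe", 60, 400), Task("sort", 30, 300)]
    sites = [Site(), Site()]
    print("estimated makespan:", schedule(tasks, sites))
```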
Granular association rules for multi-valued data
Granular association rules are a new approach to revealing patterns hidden in
the many-to-many relationships of relational databases. Different types of
data, such as nominal, numeric and multi-valued ones, must be dealt with in the
process of rule mining. In this paper, we study multi-valued data and develop
techniques to filter out strong but uninteresting rules. An example of such a
rule is "male students rate movies released in the 1990s that are NOT
thriller." This kind of rule, called a negative granular association rule,
often overwhelms the more useful positive ones. To address this issue, we
filter out negative granules such as "NOT thriller" in the process of granule
generation. In this way, only positive granular association rules are generated
and strong ones are mined. Experimental results on the MovieLens data set
indicate that most rules are negative and that our technique is effective in
filtering them out.
Comment: Proceedings of The 2013 Canadian Conference on Electrical and
Computer Engineering (to appear)
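A tiny illustration of the filtering idea, under the simplifying assumption that a granule is just the set of records sharing an attribute value; the records and attribute names are invented, and this is not the paper's full granule-generation procedure.

```python
# Sketch: build only positive granules for a multi-valued attribute
# (e.g., movie genres); negated granules such as "NOT thriller" are never built.
def positive_granules(records, attribute):
    """Map each attribute value v to the set of record ids containing v."""
    granules = {}
    for rid, rec in records.items():
        for value in rec[attribute]:          # multi-valued attribute
            granules.setdefault(value, set()).add(rid)
    return granules

movies = {
    1: {"genres": {"thriller", "drama"}},
    2: {"genres": {"comedy"}},
    3: {"genres": {"drama"}},
}
print(positive_granules(movies, "genres"))
# e.g. {'thriller': {1}, 'drama': {1, 3}, 'comedy': {2}} (key order may vary)
```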
Firm Entry, Trade, and Welfare in Zipf's World
Firm size follows Zipf's Law, a very fat-tailed distribution that implies a few large firms account for a disproportionate share of overall economic activity. This distribution of firm size is crucial for evaluating the welfare impact of macroeconomic policies such as barriers to entry or trade liberalization. Using a multi-country model of production and trade in which the parameters are calibrated to match the observed distribution of firm size, we show that the welfare impact of high entry costs is small. In the sample of the largest 50 economies in the world, a reduction in entry costs all the way to the U.S. level leads to an average increase in welfare of only 3.25%. In addition, when the firm size distribution follows Zipf's Law, the welfare impact of the extensive margin of trade -- newly imported goods -- vanishes. The extensive margin of imports accounts for only about 3.5% of the total gains from a 10% reduction in trade barriers in our model. This is because under Zipf's Law, the large, inframarginal firms have a far greater welfare impact than the much smaller firms that comprise the extensive margin in these policy experiments. The distribution of firm size matters for these results: in a counterfactual model economy that does not exhibit Zipf's Law the gains from a reduction in entry barriers are an order of magnitude larger, while the gains from trade liberalization are an order of magnitude smaller.
Keywords: Zipf's Law, welfare, entry costs, trade barriers
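As a quick back-of-the-envelope illustration of why the fat tail matters (not the paper's calibrated multi-country model), one can sample firm sizes from a Pareto distribution with a tail exponent close to 1, the value usually associated with Zipf's Law for firms, and check how concentrated activity is:

```python
# Rough illustration only: Zipf-like firm sizes from a Pareto distribution.
# The tail exponent 1.06 is a commonly cited calibration, not taken from the paper.
import numpy as np

rng = np.random.default_rng(1)
u = 1.0 - rng.random(1_000_000)              # uniform draws on (0, 1]
sizes = (1.0 / u) ** (1 / 1.06)              # Pareto(minimum 1) via inverse CDF
sizes.sort()
top_share = sizes[-10_000:].sum() / sizes.sum()   # largest 1% of firms
print(f"share of total size held by the top 1% of firms: {top_share:.0%}")
```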
Effective Spatial Data Partitioning for Scalable Query Processing
Recently, MapReduce based spatial query systems have emerged as a cost
effective and scalable solution to large scale spatial data processing and
analytics. MapReduce based systems achieve massive scalability by partitioning
the data and running query tasks on those partitions in parallel. Therefore,
effective data partitioning is critical for task parallelization and load
balancing, and it directly affects system performance. However, several pitfalls
of spatial data partitioning make this task particularly challenging. First,
data skew is very common in spatial applications; to achieve the best query
performance, data skew needs to be reduced. Second, spatial partitioning
approaches generate boundary objects that cross multiple partitions and add
extra query processing overhead. Consequently, boundary objects need to be
minimized. Third, the high computational complexity of spatial partitioning
algorithms, combined with massive amounts of data, requires an efficient
approach to partitioning to achieve overall fast query response. In this paper, we
provide a systematic evaluation of multiple spatial partitioning methods with a
set of different partitioning strategies, and study their implications on the
performance of MapReduce based spatial queries. We also study sampling based
partitioning methods and their impact on queries, and propose several MapReduce
based high performance spatial partitioning methods. The main objective of our
work is to provide a comprehensive guidance for optimal spatial data
partitioning to support scalable and fast spatial data processing in massively
parallel data processing frameworks such as MapReduce. The algorithms developed
in this work are open source and can be easily integrated into different high
performance spatial data processing systems.
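To illustrate one way sampling can drive skew-aware partitioning (a generic sketch, not one of the paper's proposed methods), the following splits space at quantiles of a small sample along one axis so that partitions receive roughly equal numbers of objects even when the data are skewed:

```python
# Generic sketch of sampling-based, skew-aware spatial partitioning:
# boundaries come from sample quantiles along x, so strips get similar counts.
# Point data, parameters, and names are illustrative.
import random
from collections import Counter

def sample_quantile_partitioner(points, n_parts, sample_frac=0.01, seed=0):
    rng = random.Random(seed)
    sample = [p for p in points if rng.random() < sample_frac] or list(points)
    xs = sorted(x for x, _ in sample)
    bounds = [xs[i * len(xs) // n_parts] for i in range(1, n_parts)]
    def assign(point):
        x, _ = point
        for i, b in enumerate(bounds):
            if x < b:
                return i
        return n_parts - 1
    return assign

if __name__ == "__main__":
    rnd = random.Random(42)
    points = [(rnd.random() ** 3, rnd.random()) for _ in range(100_000)]  # skewed toward x = 0
    assign = sample_quantile_partitioner(points, n_parts=4)
    print(Counter(assign(p) for p in points))  # roughly 25,000 objects per partition
```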
StreetX: Spatio-Temporal Access Control Model for Data
Cities are a big source of spatio-temporal data that is shared across
entities to drive potential use cases. Many spatio-temporal datasets are
confidential and are shared selectively. Several access control models exist
to allow selective sharing; however, they do not let users express arbitrary
space and time constraints on data attributes. In this paper we focus on a
spatio-temporal access control model. Through a motivating example, we show
that the location and time attributes of data may decide its confidentiality
and thus can affect a user's access control policy. We present StreetX, which
enables users to express constraints over multiple arbitrary space regions and
time windows using a simple abstract language. StreetX is scalable and is
designed to handle large amounts of spatio-temporal data from multiple users.
Multiple space and time constraints can affect query performance and may also
result in conflicts. StreetX automatically resolves conflicts and optimizes
query evaluation under access control to improve performance. We implemented
and tested a prototype of StreetX using space constraints, defining a region
with 1,749 polygon coordinates over 10 million data records. Our testing shows
that StreetX extends current access control with spatio-temporal capabilities.
Comment: 10 pages
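The kind of check such a policy implies can be sketched as follows; real StreetX constraints are arbitrary polygon regions expressed in its abstract policy language, whereas this hypothetical example simplifies regions to latitude/longitude bounding boxes.

```python
# Illustrative sketch only: a spatio-temporal access policy as a list of
# (bounding box, time window) constraints; access is granted if the record
# falls inside ANY constraint pair.
from datetime import datetime

policy = [
    (((40.70, -74.02), (40.80, -73.93)),                 # lat/lon bounding box
     (datetime(2023, 1, 1), datetime(2023, 6, 30))),     # time window
]

def allowed(record, policy):
    lat, lon, ts = record
    for ((lat0, lon0), (lat1, lon1)), (t0, t1) in policy:
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1 and t0 <= ts <= t1:
            return True
    return False

print(allowed((40.75, -73.98, datetime(2023, 3, 15)), policy))  # True
print(allowed((40.75, -73.98, datetime(2023, 9, 1)), policy))   # False
```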
Sector Concentration in Loan Portfolios and Economic Capital
The purpose of this paper is to measure the potential impact of business-sector concentration on economic capital for loan portfolios and to explore a tractable model for its measurement. The empirical part evaluates the increase in economic capital in a multi-factor asset value model for portfolios with increasing sector concentration. The sector composition is based on credit information from the German central credit register. Finding that business sector concentration can substantially increase economic capital, the theoretical part of the paper explores whether this risk can be measured by a tractable model that avoids Monte Carlo simulations. We analyze a simplified version of the analytic value-at-risk approximation developed by Pykhtin (2004), which only requires risk parameters on a sector level. Sensitivity analyses with various input parameters show that the analytic approximation formulae perform well in approximating economic capital for portfolios which are homogeneous on a sector level in terms of PD and exposure size. Furthermore, we explore the robustness of our results for portfolios which are heterogeneous in terms of these two characteristics. We find that low granularity ceteris paribus causes the analytic approximation formulae to underestimate economic capital, whereas heterogeneity in individual PDs causes overestimation. Indicative results imply that in typical credit portfolios, PD heterogeneity will at least compensate for the granularity effect. This suggests that the analytic approximations estimate economic capital reasonably well and/or err on the conservative side.
Keywords: sector concentration risk, economic capital
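For readers who want to see the mechanics, here is a toy Monte Carlo sketch of economic capital in a multi-factor asset value model; every parameter (5 sectors, PD 2%, LGD 45%, 20% asset correlation) is invented for illustration, and the analytic Pykhtin (2004) approximation the paper studies is not reproduced here. In this sketch, concentrating loans in fewer sectors raises the 99.9% loss quantile, which is the concentration effect the paper quantifies.

```python
# Toy Monte Carlo for economic capital under a multi-factor asset value model.
# All risk parameters below are made up for illustration.
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n_sims, n_loans, n_sectors = 50_000, 1_000, 5
sector = rng.integers(0, n_sectors, n_loans)    # sector of each loan
pd_, lgd, ead = 0.02, 0.45, 1.0                 # homogeneous risk parameters
rho = 0.20                                      # loading on the sector factor
threshold = NormalDist().inv_cdf(pd_)           # default threshold on asset value

z = rng.standard_normal((n_sims, n_sectors))    # systematic sector factors
eps = rng.standard_normal((n_sims, n_loans))    # idiosyncratic shocks
assets = np.sqrt(rho) * z[:, sector] + np.sqrt(1 - rho) * eps
loss = ((assets < threshold) * lgd * ead).sum(axis=1)

expected_loss = loss.mean()
var_999 = np.quantile(loss, 0.999)
print("economic capital (99.9% VaR - EL):", var_999 - expected_loss)
```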
Geometry-Informed Material Recognition
Our goal is to recognize material categories using images and geometry
information. In many applications, such as construction management, coarse
geometry information is available. We investigate how 3D geometry (surface
normals, camera intrinsic and extrinsic parameters) can be used with 2D
features (texture and color) to improve material classification. We introduce a
new dataset, GeoMat, which is the first to provide both image and geometry data
in the form of: (i) training and testing patches that were extracted at
different scales and perspectives from real world examples of each material
category, and (ii) a large scale construction site scene that includes 160
images and over 800,000 hand labeled 3D points. Our results show that using 2D
and 3D features both jointly and independently to model materials improves
classification accuracy across multiple scales and viewing directions for both
material patches and images of a large scale construction site scene.
Comment: IEEE Conference on Computer Vision and Pattern Recognition 2016 (CVPR '16)
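A hypothetical sketch of the general idea of combining 2D appearance and 3D geometry cues: concatenate a color histogram with a surface-normal histogram and feed the vectors to any off-the-shelf classifier. This is not the paper's actual feature pipeline or the GeoMat evaluation protocol; patch shapes and bin counts are arbitrary.

```python
# Sketch: joint 2D (color) + 3D (surface normal) descriptors for a patch.
import numpy as np

def color_histogram(patch_rgb, bins=8):
    """Joint RGB histogram of an HxWx3 uint8 patch, normalized to sum to 1."""
    h, _ = np.histogramdd(patch_rgb.reshape(-1, 3), bins=(bins,) * 3,
                          range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def normal_histogram(patch_normals, bins=8):
    """Histogram of per-pixel surface-normal directions (HxWx3 unit vectors)."""
    h, _ = np.histogramdd(patch_normals.reshape(-1, 3), bins=(bins,) * 3,
                          range=((-1, 1),) * 3)
    return h.ravel() / h.sum()

def features(patch_rgb, patch_normals):
    # Concatenated 2D + 3D descriptor; could be fed to any classifier.
    return np.concatenate([color_histogram(patch_rgb), normal_histogram(patch_normals)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
    normals = rng.standard_normal((64, 64, 3))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    print(features(rgb, normals).shape)   # (1024,) = 512 color bins + 512 normal bins
```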
Toward Open Data Blockchain Analytics: A Bitcoin Perspective
Bitcoin is the first implementation of what has become known as a 'public
permissionless' blockchain. Guaranteeing security and protocol conformity
through its elegant combination of cryptographic assurances and game theoretic
economic incentives, it permits censorship resistant public read-write access
to its append-only blockchain database without the need for any mediating
central authority. Not until its advent has such a trusted, transparent,
comprehensive and granular data set of digital economic behaviours been
available for public network analysis. In this article, by translating the
cumbersome binary data structure of the Bitcoin blockchain into a high fidelity
graph model, we demonstrate through various analyses the often overlooked
social and econometric benefits of employing such a novel open data
architecture. Specifically we show (a) how repeated patterns of transaction
behaviours can be revealed to link user activity across the blockchain; (b) how
newly mined bitcoin can be associated to demonstrate individual accumulations
of wealth; (c) how, through application of the naive quantity theory of money,
Bitcoin's disinflationary properties can be revealed and measured; and (d) how
the user community can develop coordinated defences against repeated denial of
service attacks on the network. All of these are exemplary benefits that would
be lost with the closed data models of the 'private permissioned' distributed
ledger architectures that are dominating enterprise-level development due to
existing blockchain issues of governance, scalability and confidentiality.
Comment: 17 pages, 9 figures
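A minimal illustration of the graph-model idea using networkx: transactions become edges from spending addresses to receiving addresses, over which analyses such as flow accounting or activity linking can run. The transaction records here are invented, and parsing the actual binary block format (which the article describes translating) is omitted.

```python
# Sketch: a transaction graph over invented records (not real Bitcoin data).
import networkx as nx

transactions = [
    {"txid": "t1", "inputs": [("addrA", 1.0)], "outputs": [("addrB", 0.6), ("addrC", 0.4)]},
    {"txid": "t2", "inputs": [("addrB", 0.6)], "outputs": [("addrD", 0.6)]},
]

g = nx.MultiDiGraph()
for tx in transactions:
    for src, _ in tx["inputs"]:
        for dst, value in tx["outputs"]:
            g.add_edge(src, dst, txid=tx["txid"], value=value)

# Example analysis: total bitcoin flowing into each address.
inflow = {n: sum(d["value"] for _, _, d in g.in_edges(n, data=True)) for n in g.nodes}
print(inflow)   # {'addrA': 0, 'addrB': 0.6, 'addrC': 0.4, 'addrD': 0.6}
```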
Automatic Investigation Framework for Android Malware Cyber-Infrastructures
The popularity of the Android system, not only in handset devices but also in
IoT devices, makes it a very attractive target for malware. Indeed, malware is
expanding at a similar rate, targeting such devices, which rely in most cases
on the Internet to work properly. State-of-the-art malware mitigation solutions
mainly focus on detecting the actual malicious Android apps, using dynamic and
static analysis features to distinguish malicious apps from benign ones.
However, there is little coverage of the Internet/network dimension of
malicious Android apps. In this paper, we present ToGather, an automatic
investigation framework that takes Android malware samples as input and
produces situation awareness about the malicious cyber infrastructure of these
samples' families. ToGather leverages state-of-the-art graph theory techniques
to generate actionable and granular intelligence for mitigating the threat
posed by the malicious Internet activity of Android malware apps. We evaluate
ToGather on real malware samples from various Android families, and the
results are interesting and very promising.
Comment: 12 pages
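An illustrative sketch (not ToGather itself) of the graph-based intuition: connect malware samples to the network indicators they contact, then use connected components as a crude proxy for shared malicious infrastructure. The sample names and indicators are invented.

```python
# Sketch: sample-to-indicator graph with connected components as a rough
# stand-in for the graph analyses an investigation framework might apply.
import networkx as nx

observations = {
    "sample1": {"evil.example.com", "10.0.0.5"},
    "sample2": {"evil.example.com"},
    "sample3": {"other.example.net"},
}

g = nx.Graph()
for sample, indicators in observations.items():
    for ind in indicators:
        g.add_edge(sample, ind)

for component in nx.connected_components(g):
    samples = {n for n in component if n.startswith("sample")}
    infra = component - samples
    print(sorted(samples), "share infrastructure", sorted(infra))
```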
Unsupervised Iterative Deep Learning of Speech Features and Acoustic Tokens with Applications to Spoken Term Detection
In this paper we aim to automatically discover high quality frame-level
speech features and acoustic tokens directly from unlabeled speech data. A
Multi-granular Acoustic Tokenizer (MAT) was proposed for automatic discovery of
multiple sets of acoustic tokens from the given corpus. Each acoustic token set
is specified by a set of hyperparameters describing the model configuration.
These different sets of acoustic tokens carry different characteristics of the
given corpus and the underlying language, and thus can be mutually reinforcing. The
multiple sets of token labels are then used as the targets of a Multi-target
Deep Neural Network (MDNN) trained on frame-level acoustic features. Bottleneck
features extracted from the MDNN are then used as the feedback input to the MAT
and the MDNN itself in the next iteration. The multi-granular acoustic token
sets and the frame-level speech features can be iteratively optimized in the
iterative deep learning framework. We call this framework the Multi-granular
Acoustic Tokenizing Deep Neural Network (MATDNN). The results were evaluated
using the metrics and corpora defined in the Zero Resource Speech Challenge
organized at Interspeech 2015, and improved performance was obtained in a set
of query-by-example spoken term detection experiments on the same corpora. A
visualization of the discovered tokens against English phonemes is also shown.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language
Processing. arXiv admin note: text overlap with arXiv:1602.00426,
arXiv:1506.0232
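The iterative structure described above can be summarized as a control-flow sketch. The functions below are empty placeholders standing in for the paper's components (acoustic tokenizer, multi-target DNN training, bottleneck extraction); only the feedback loop is shown, not a working system.

```python
# Control-flow sketch of an iterative tokenizer/DNN feedback loop.
# Function bodies are stubs; they do not implement the paper's models.
def tokenize(features, configs):
    """Return one token-label sequence per hyperparameter configuration."""
    ...

def train_multitarget_dnn(features, token_label_sets):
    """Train a DNN predicting every token set jointly; return the model."""
    ...

def bottleneck_features(model, features):
    """Extract bottleneck-layer activations as new frame-level features."""
    ...

def matdnn(initial_features, configs, n_iterations=3):
    features = initial_features
    for _ in range(n_iterations):
        token_sets = tokenize(features, configs)             # MAT step
        model = train_multitarget_dnn(features, token_sets)  # MDNN step
        features = bottleneck_features(model, features)      # feedback input
    return features, token_sets
```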