Explicit probabilistic models for databases and networks
Recent work in data mining and related areas has highlighted the importance
of the statistical assessment of data mining results. Crucial to this endeavour
is the choice of a non-trivial null model for the data, to which the found
patterns can be contrasted. The most influential null models proposed so far
are defined in terms of invariants of the null distribution. Such null models
can be used by computation-intensive randomization approaches in estimating the
statistical significance of data mining results.
Here, we introduce a methodology to construct non-trivial probabilistic
models based on the maximum entropy (MaxEnt) principle. We show how MaxEnt
models allow for the natural incorporation of prior information. Furthermore,
they satisfy a number of desirable properties of previously introduced
randomization approaches. Lastly, they also have the benefit that they can be
represented explicitly. We argue that our approach can be used for a variety of
data types. However, for concreteness, we have chosen to demonstrate it in
particular for databases and networks.
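As an illustration of what an explicitly represented MaxEnt model can look like, the sketch below assumes a binary database whose expected row and column sums are the only constraints; the abstract does not state which constraints the authors use, so this choice is an assumption. Under it, the maximum entropy distribution factorizes into independent Bernoulli entries with P(D[i][j] = 1) = sigmoid(lam[i] + mu[j]). The function name `fit_maxent_bernoulli`, the toy `database`, and the step-size settings are hypothetical and chosen for this small example only.

```python
import math


def fit_maxent_bernoulli(data, steps=5000, lr=0.2):
    """Fit p[i][j] = sigmoid(lam[i] + mu[j]) so that the expected row and
    column sums of the model match the observed ones in `data`.

    Plain gradient ascent on the concave dual; `steps` and `lr` are
    hypothetical settings that suit a small matrix, not tuned values.
    """
    n, m = len(data), len(data[0])
    row_sums = [sum(row) for row in data]
    col_sums = [sum(row[j] for row in data) for j in range(m)]
    lam = [0.0] * n  # one multiplier per row constraint
    mu = [0.0] * m   # one multiplier per column constraint

    def probabilities():
        return [
            [1.0 / (1.0 + math.exp(-(lam[i] + mu[j]))) for j in range(m)]
            for i in range(n)
        ]

    for _ in range(steps):
        p = probabilities()
        # Gradient of the dual: observed margin minus expected margin.
        for i in range(n):
            lam[i] += lr * (row_sums[i] - sum(p[i]))
        for j in range(m):
            mu[j] += lr * (col_sums[j] - sum(p[i][j] for i in range(n)))
    return probabilities()


# Toy binary database (rows could be transactions, columns items).
database = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]
probs = fit_maxent_bernoulli(database)
for observed, expected in zip(database, probs):
    print(sum(observed), round(sum(expected), 4))
```

Because the fitted model is an explicit product of Bernoulli distributions, the probability of any pattern can be evaluated directly rather than estimated by repeated randomization.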
An Improved Technique for Multi-Dimensional Constrained Gradient Mining
Multi-dimensional Constrained Gradient Mining, an aspect of data mining, is based on mining constrained frequent gradient pattern pairs whose measures differ significantly in a transactional database. Previous studies used two algorithms for Multi-dimensional Constrained Gradient Mining: Top-k Fp-growth with Gradient Pruning and Top-k Fp-growth with No Gradient Pruning. Both have shortcomings. The first requires construction of an Fp-tree before searching the database, and the second requires searching the database twice to find frequent pattern pairs. As a result, both consume large amounts of time and memory, which makes mining the database cumbersome. To eliminate these drawbacks, a new algorithm is designed that combines Top-k Fp-growth with Gradient Pruning and Top-k Fp-growth with No Gradient Pruning. The new algorithm, called Top-K Fp-growth with support Gradient pruning (SUPGRAP), scans the database once, searching for the node and all descendants of the node of every task at each level. The idea is to form projected multidimensional databases and then find the multidimensional patterns within them. Evaluation of the new algorithm shows significant improvement in the time and space required over the existing algorithms.
SLOTH: Structured Learning and Task-based Optimization for Time Series Forecasting on Hierarchies
Multivariate time series forecasting with hierarchical structure is widely
used in real-world applications, e.g., sales predictions for the geographical
hierarchy formed by cities, states, and countries. Hierarchical time series
(HTS) forecasting includes two sub-tasks, i.e., forecasting and reconciliation.
In previous work, hierarchical information is integrated only in the
reconciliation step to maintain coherency, but not in the forecasting step for
accuracy improvement. In this paper, we propose two novel tree-based feature
integration mechanisms, i.e., top-down convolution and bottom-up attention to
leverage the information of the hierarchical structure to improve the
forecasting performance. Moreover, unlike most previous reconciliation methods
which either rely on strong assumptions or focus on coherent constraints
only, we utilize deep neural optimization networks, which not only achieve
coherency without any assumptions, but also allow more flexible and realistic
constraints to achieve task-based targets, e.g., lower under-estimation penalty
and meaningful decision-making loss to facilitate the subsequent downstream
tasks. Experiments on real-world datasets demonstrate that our tree-based
feature integration mechanism achieves superior performance on hierarchical
forecasting tasks compared to the state-of-the-art methods, and our neural
optimization networks can be applied to real-world tasks effectively without
any additional effort under coherence and task-based constraints.
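To make the coherency requirement concrete, the sketch below uses classic bottom-up reconciliation on a toy city/state/country hierarchy. This is not the neural optimization network described in the abstract; it is a much simpler baseline, and the hierarchy, node names, and forecast values are hypothetical.

```python
# Hypothetical hierarchy: each parent maps to its direct children.
HIERARCHY = {
    "country": ["state_a", "state_b"],
    "state_a": ["city_1", "city_2"],
    "state_b": ["city_3"],
}


def leaves_under(node):
    """Return the bottom-level series that aggregate into `node`."""
    children = HIERARCHY.get(node)
    if not children:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]


def bottom_up(base):
    """Rebuild every level from the bottom-level forecasts only."""
    nodes = set(HIERARCHY) | {c for kids in HIERARCHY.values() for c in kids}
    return {node: sum(base[leaf] for leaf in leaves_under(node)) for node in nodes}


def is_coherent(forecast, tol=1e-9):
    """A forecast is coherent when every parent equals the sum of its children."""
    return all(
        abs(forecast[parent] - sum(forecast[child] for child in children)) <= tol
        for parent, children in HIERARCHY.items()
    )


# Independent base forecasts per node (made-up numbers) need not add up.
base = {
    "country": 100.0,
    "state_a": 55.0,
    "state_b": 40.0,
    "city_1": 30.0,
    "city_2": 20.0,
    "city_3": 42.0,
}
reconciled = bottom_up(base)
print(is_coherent(base), is_coherent(reconciled))
```

Bottom-up reconciliation discards the upper-level forecasts entirely, which is one reason more flexible reconciliation methods are of interest.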
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On the one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottleneck, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and how these
features drive paradigm change in statistical and computational methods as
well as computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
exogeneity assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity. They can lead to wrong statistical
inferences and consequently wrong scientific conclusions.
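Spurious correlation, one of the challenges listed above, can be illustrated with a small simulation: a response that is independent of every predictor still shows a large maximum sample correlation once enough predictors are screened. The sketch below is a minimal illustration under assumed settings (50 samples, standard normal data, a fixed seed); the function names are hypothetical and nothing here is taken from the article itself.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n_samples, dims, seed=0):
    """Largest |correlation| between a response and the first p predictors,
    for each p in `dims`, when all variables are independent noise."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    result = {}
    running_max = 0.0
    for p in range(1, max(dims) + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        running_max = max(running_max, abs(pearson(x, y)))
        if p in dims:
            result[p] = running_max
    return result


result = max_spurious_correlation(50, [10, 100, 1000])
for p, value in sorted(result.items()):
    print(p, round(value, 3))
```

The predictors are nested, so the reported maximum can only grow with the dimension; how fast it grows depends on the sample size and the random draw.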
05051 Abstracts Collection -- Probabilistic, Logical and Relational Learning - Towards a Synthesis
From 30.01.05 to 04.02.05, the Dagstuhl Seminar 05051 "Probabilistic, Logical and Relational Learning - Towards a Synthesis" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.