Why Do Cascade Sizes Follow a Power-Law?
We introduce a random directed acyclic graph and use it to model the
information diffusion network. Subsequently, we analyze the cascade generation
model (CGM) introduced by Leskovec et al. [19]. Until now, only empirical
studies of this model had been done. In this paper, we present the first
theoretical proof that the sizes of cascades generated by the CGM follow a
power-law distribution, which is consistent with multiple empirical analyses
of large social networks. We compare the assumptions of our model with the
Twitter social network and test the goodness of the approximation.

Comment: 8 pages, 7 figures, accepted to WWW 201
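Since the abstract does not spell out the CGM's mechanics, the following is a
minimal sketch of a Leskovec-style cascade process: the substrate graph, the
infection probability beta, and all sizes are assumptions for illustration,
not the paper's setup. On log-log axes, a power-law size distribution shows up
as a roughly straight line.

```python
import random
from collections import Counter

import networkx as nx

def simulate_cascade(G, beta, start):
    """One cascade: each newly infected node infects each neighbor
    independently with probability beta; nodes are infected at most once."""
    infected = {start}
    frontier = [start]
    while frontier:
        nxt = []
        for u in frontier:
            for v in G.neighbors(u):
                if v not in infected and random.random() < beta:
                    infected.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(infected)

random.seed(0)
G = nx.barabasi_albert_graph(n=20_000, m=3, seed=0)  # assumed substrate
sizes = [simulate_cascade(G, beta=0.1,
                          start=random.randrange(G.number_of_nodes()))
         for _ in range(10_000)]

# Empirical cascade-size distribution (inspect its tail on log-log axes).
for size, count in sorted(Counter(sizes).items())[:10]:
    print(size, count)
```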
An information theoretic approach to rule induction from databases
The knowledge acquisition bottleneck in obtaining
rules directly from an expert is well known. Hence, the problem
of automated rule acquisition from data is a well-motivated one,
particularly for domains where a database of sample data exists.
In this paper we introduce a novel algorithm for the induction
of rules from examples. The algorithm is novel in the sense
that it not only learns rules for a given concept (classification),
but it simultaneously learns rules relating multiple concepts.
This type of learning, known as generalized rule induction, is
considerably more general than existing algorithms, which tend
to be classification-oriented. Initially, we focus on the problem of
determining a quantitative, well-defined rule preference measure.
In particular, we propose a quantity called the J-measure as
an information theoretic alternative to existing approaches. The
J-measure quantifies the information content of a rule or a
hypothesis. We will outline the information theoretic origins
of this measure and examine its plausibility as a hypothesis
preference measure. We then define the ITRULE algorithm, which
uses the newly proposed measure to learn a set of optimal rules
from a set of data samples, and we conclude the paper with an
analysis of experimental results on real-world data.
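As a concrete illustration, here is a small sketch of the J-measure for a
rule "if Y=y then X=x", following Smyth and Goodman's definition; the base-2
logarithm and the example probabilities are assumptions for the demo.

```python
import math

def j_measure(p_y, p_x, p_x_given_y):
    """J-measure of the rule 'if Y=y then X=x':
    J = p(y) * [ p(x|y) log2(p(x|y)/p(x))
               + (1 - p(x|y)) log2((1 - p(x|y)) / (1 - p(x))) ].
    The bracketed term is the KL divergence between the posterior and
    prior (binary) distributions of X, i.e. the rule's information
    content; the factor p(y) weights it by the rule's coverage.
    """
    def term(post, prior):
        return 0.0 if post == 0.0 else post * math.log2(post / prior)
    return p_y * (term(p_x_given_y, p_x) + term(1 - p_x_given_y, 1 - p_x))

# Example: an antecedent covering 30% of the data that raises the
# consequent's probability from 0.5 to 0.9.
print(j_measure(p_y=0.3, p_x=0.5, p_x_given_y=0.9))  # ~0.159 bits
```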
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
We present a novel hybrid algorithm for Bayesian network structure learning,
called H2PC. It first reconstructs the skeleton of a Bayesian network and then
performs a Bayesian-scoring greedy hill-climbing search to orient the edges.
The algorithm is based on divide-and-conquer constraint-based subroutines to
learn the local structure around a target variable. We conduct two series of
experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC),
currently the state-of-the-art algorithm for Bayesian network structure
learning. First, we use eight well-known Bayesian network benchmarks
with various data sizes to assess the quality of the learned structure returned
by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in
terms of goodness of fit to new data and quality of the network structure with
respect to the true dependence structure of the data. Second, we investigate
H2PC's ability to solve the multi-label learning problem. We provide
theoretical results to characterize and identify graphically the so-called
minimal label powersets that appear as irreducible factors in the joint
distribution under the faithfulness condition. The multi-label learning problem
is then decomposed into a series of multi-class classification problems, where
each multi-class variable encodes a label powerset. H2PC is shown to compare
favorably to MMHC in terms of global classification accuracy over ten
multi-label data sets covering different application domains. Overall, our
experiments support the conclusion that local structural learning with H2PC in
the form of local neighborhood induction is a theoretically well-motivated and
empirically effective learning framework that is well suited to multi-label
learning. The source code (in R) of H2PC and all data sets used for the
empirical tests are publicly available.

Comment: arXiv admin note: text overlap with arXiv:1101.5184 by other authors
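The decomposition step can be pictured with a short sketch. This is not the
authors' R implementation: the label blocks below stand in for the minimal
label powersets that H2PC would read off the learned structure (here they are
simply given), and the random-forest base learner is an arbitrary assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_label_powersets(X, Y, label_blocks):
    """Train one multi-class classifier per label powerset.

    Y is an (n_samples, n_labels) 0/1 matrix; label_blocks is a
    partition of label indices, e.g. [[0, 2], [1]].
    """
    models = []
    for block in label_blocks:
        # Encode the joint configuration of this block's labels as a
        # single multi-class target, e.g. (1, 0) -> "10".
        y_block = np.array(["".join(map(str, row)) for row in Y[:, block]])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y_block)
        models.append((block, clf))
    return models

def predict_label_powersets(models, X, n_labels):
    """Decode each block's multi-class prediction back into 0/1 labels."""
    Y_hat = np.zeros((X.shape[0], n_labels), dtype=int)
    for block, clf in models:
        for i, code in enumerate(clf.predict(X)):
            for j, bit in zip(block, code):
                Y_hat[i, j] = int(bit)
    return Y_hat
```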
Computationally efficient induction of classification rules with the PMCRI and J-PMCRI frameworks
In order to gain knowledge from large databases, scalable data mining
technologies are needed. Data are captured on a large scale, and thus
databases are growing at a fast pace. This leads to the utilisation of
parallel computing technologies to cope with large amounts of data. In the
area of classification rule induction, parallelisation has focused on the
divide-and-conquer approach, also known as Top Down Induction of Decision
Trees (TDIDT). An alternative approach to classification rule induction is
separate-and-conquer, which has only recently become a focus of
parallelisation. This work introduces and empirically evaluates a framework
for the parallel induction of classification rules generated by members of
the Prism family of algorithms, all of which follow the separate-and-conquer
approach.
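To make the separate-and-conquer idea concrete, here is a compact sequential
sketch in the spirit of Prism; it is not the PMCRI/J-PMCRI parallel framework,
and the data representation is an assumption for illustration.

```python
def prism(instances, target_class):
    """Minimal Prism-style separate-and-conquer rule learner.

    instances: list of (features: dict, label) pairs.
    Returns a list of rules; each rule is a dict of attribute=value tests.
    """
    rules, remaining = [], list(instances)
    while any(label == target_class for _, label in remaining):
        rule, covered = {}, remaining
        # Conquer: specialize until the rule covers only the target class.
        while any(label != target_class for _, label in covered):
            best = None
            for feats, _ in covered:
                for attr, val in feats.items():
                    if attr in rule:
                        continue
                    subset = [(f, l) for f, l in covered if f.get(attr) == val]
                    acc = sum(l == target_class for _, l in subset) / len(subset)
                    if best is None or acc > best[0]:
                        best = (acc, attr, val)
            if best is None:          # no attribute left to test
                break
            _, attr, val = best
            rule[attr] = val
            covered = [(f, l) for f, l in covered if f.get(attr) == val]
        rules.append(rule)
        # Separate: remove the instances the new rule covers.
        remaining = [(f, l) for f, l in remaining
                     if not all(f.get(a) == v for a, v in rule.items())]
    return rules
```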
Testing statistical hypothesis on random trees and applications to the protein classification problem
Efficient automatic protein classification is of central importance in
genomic annotation. As an independent way to check the reliability of the
classification, we propose a statistical approach to test if two sets of
protein domain sequences coming from two families of the Pfam database are
significantly different. We model protein sequences as realizations of Variable
Length Markov Chains (VLMC) and we use the context trees as a signature of each
protein family. Our approach is based on a Kolmogorov--Smirnov-type
goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences
of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is
a supremum over the space of trees of a function of the two samples; its
computation grows, in principle, exponentially fast with the maximal number of
nodes of the potential trees. We show how to transform this problem into a
max-flow problem over a related graph, which can be solved using the
Ford--Fulkerson algorithm in time polynomial in that number. We apply the
test to 10 randomly chosen protein domain families from the seed of the
Pfam-A database (high-quality,
manually curated families). The test shows that the distributions of context
trees coming from different families are significantly different. We emphasize
that this is a novel mathematical approach to validate the automatic clustering
of sequences in any context. We also study the performance of the test via
simulations on Galton--Watson related processes.

Comment: Published at http://dx.doi.org/10.1214/08-AOAS218 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
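The abstract does not give the paper's actual graph construction, so the toy
below only shows the mechanics the reduction relies on: solving a max-flow
instance in polynomial time, here with networkx's Edmonds--Karp routine (a
specialization of Ford--Fulkerson). The network itself is an assumption.

```python
import networkx as nx
from networkx.algorithms.flow import edmonds_karp

# Purely illustrative capacitated network (not the paper's construction).
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "t", capacity=2)
G.add_edge("a", "b", capacity=1)
G.add_edge("b", "t", capacity=3)

# Edmonds--Karp is a polynomial-time specialization of Ford--Fulkerson.
flow_value, flow_dict = nx.maximum_flow(G, "s", "t", flow_func=edmonds_karp)
print(flow_value)  # 5
```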
A Family of Quasisymmetry Models
We present a one-parameter family of models for square contingency tables
that interpolates between the classical quasisymmetry model and its Pearsonian
analogue. Algebraically, this corresponds to deformations of toric ideals
associated with graphs. Our discussion of the statistical issues centers around
maximum likelihood estimation.

Comment: 17 pages, 10 figures
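For reference, the classical quasisymmetry endpoint of the family (the
abstract does not state the interpolating parameterization, so only this
standard endpoint is written out) models the cell probabilities of a square
table as:

```latex
% Classical quasisymmetry (Caussinus): row effects \alpha_i, column
% effects \beta_j, and a symmetric association term \gamma_{ij}.
p_{ij} = \alpha_i \, \beta_j \, \gamma_{ij},
\qquad \gamma_{ij} = \gamma_{ji},
\qquad i, j = 1, \dots, I .
```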
Inducing Probabilistic Grammars by Bayesian Model Merging
We describe a framework for inducing probabilistic grammars from corpora of
positive samples. First, samples are {\em incorporated} by adding ad-hoc rules
to a working grammar; subsequently, elements of the model (such as states or
nonterminals) are {\em merged} to achieve generalization and a more compact
representation. The choice of what to merge and when to stop is governed by the
Bayesian posterior probability of the grammar given the data, which formalizes
a trade-off between a close fit to the data and a default preference for
simpler models (`Occam's Razor'). The general scheme is illustrated using three
types of probabilistic grammars: Hidden Markov models, class-based n-grams,
and stochastic context-free grammars.

Comment: To appear in Grammatical Inference and Applications, Second
International Colloquium on Grammatical Inference; Springer Verlag, 1994. 13
pages
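The merge-while-the-posterior-improves loop can be sketched in miniature. The
toy below works on Markov-chain transition counts rather than full HMMs or
stochastic context-free grammars, and the constant per-state prior penalty is
an assumed stand-in for the paper's structural prior; it illustrates the
search scheme, not the authors' implementation.

```python
import math
from collections import Counter

def log_likelihood(counts):
    """Data fit: sum over states s and successors t of
    n(s,t) * log(n(s,t) / n(s)), the ML transition log-likelihood."""
    ll = 0.0
    for outs in counts.values():
        total = sum(outs.values())
        for n in outs.values():
            ll += n * math.log(n / total)
    return ll

def posterior_score(counts, prior_per_state=-2.0):
    """Crude posterior: fit plus a size prior that charges each state,
    so merging is rewarded unless it hurts the fit too much."""
    return log_likelihood(counts) + prior_per_state * len(counts)

def merge(counts, a, b):
    """Merge state b into state a, pooling transition counts."""
    merged = {}
    for s, outs in counts.items():
        row = merged.setdefault(a if s == b else s, Counter())
        for t, n in outs.items():
            row[a if t == b else t] += n
    return merged

def model_merging(counts):
    """Greedily accept the best merge while the posterior does not drop."""
    while True:
        states = list(counts)
        best = max(
            (merge(counts, a, b)
             for i, a in enumerate(states) for b in states[i + 1:]),
            key=posterior_score, default=None)
        if best is None or posterior_score(best) < posterior_score(counts):
            return counts
        counts = best

# Counts from 'incorporated' samples: merging q1 and q2 leaves the total
# fit unchanged while saving a state, so it is accepted; merging further
# hurts the fit more than the prior rewards, so it is rejected.
counts = {"q0": Counter({"q1": 10, "q2": 10}),
          "q1": Counter({"a": 10}), "q2": Counter({"b": 10})}
print(sorted(model_merging(counts)))  # ['q0', 'q1']
```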