2,665 research outputs found
Learning probability distributions generated by finite-state machines
We review methods for inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the inference
in the limit and PAC formal models. The methods we review are state merging and state splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample.Peer ReviewedPostprint (author's final draft
Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively
A Grammatical Inference Approach to Language-Based Anomaly Detection in XML
False-positives are a problem in anomaly-based intrusion detection systems.
To counter this issue, we discuss anomaly detection for the eXtensible Markup
Language (XML) in a language-theoretic view. We argue that many XML-based
attacks target the syntactic level, i.e. the tree structure or element content,
and syntax validation of XML documents reduces the attack surface. XML offers
so-called schemas for validation, but in real world, schemas are often
unavailable, ignored or too general. In this work-in-progress paper we describe
a grammatical inference approach to learn an automaton from example XML
documents for detecting documents with anomalous syntax.
We discuss properties and expressiveness of XML to understand limits of
learnability. Our contributions are an XML Schema compatible lexical datatype
system to abstract content in XML and an algorithm to learn visibly pushdown
automata (VPA) directly from a set of examples. The proposed algorithm does not
require the tree representation of XML, so it can process large documents or
streams. The resulting deterministic VPA then allows stream validation of
documents to recognize deviations in the underlying tree structure or
datatypes.Comment: Paper accepted at First Int. Workshop on Emerging Cyberthreats and
Countermeasures ECTCM 201
Bootstrapping and learning PDFA in data streams
Best Student Paper ICGI 2012Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specic classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Here we focus on PDFA and give an algorithm for infering models in this class under the stringent data stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, and processes input items in amortized constant time. We provide rigorous PAC-like bounds for all of the above, as well as an
evaluation on synthetic data showing that the algorithm performs well in practice. Our
algorithm makes a key usage of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapping in a streaming setting which may be of independent interest. In experiments we have observed that this sketch yields important reductions in the examples required for performing some crucial statistical tests in our algorithm.Peer ReviewedAward-winningPostprint (published version
Towards a Robuster Interpretive Parsing
The input data to grammar learning algorithms often consist of overt forms that do not contain full structural descriptions. This lack of information may contribute to the failure of learning. Past work on Optimality Theory introduced Robust Interpretive Parsing (RIP) as a partial solution to this problem. We generalize RIP and suggest replacing the winner candidate with a weighted mean violation of the potential winner candidates. A Boltzmann distribution is introduced on the winner set, and the distribution’s parameter is gradually decreased. Finally, we show that GRIP, the Generalized Robust Interpretive Parsing Algorithm significantly improves the learning success rate in a model with standard constraints for metrical stress assignment
A Tutorial on the Expectation-Maximization Algorithm Including Maximum-Likelihood Estimation and EM Training of Probabilistic Context-Free Grammars
The paper gives a brief review of the expectation-maximization algorithm
(Dempster 1977) in the comprehensible framework of discrete mathematics. In
Section 2, two prominent estimation methods, the relative-frequency estimation
and the maximum-likelihood estimation are presented. Section 3 is dedicated to
the expectation-maximization algorithm and a simpler variant, the generalized
expectation-maximization algorithm. In Section 4, two loaded dice are rolled. A
more interesting example is presented in Section 5: The estimation of
probabilistic context-free grammars.Comment: Presented at the 15th European Summer School in Logic, Language and
Information (ESSLLI 2003). Example 5 extended (and partially corrected
Distributional Sentence Entailment Using Density Matrices
Categorical compositional distributional model of Coecke et al. (2010)
suggests a way to combine grammatical composition of the formal, type logical
models with the corpus based, empirical word representations of distributional
semantics. This paper contributes to the project by expanding the model to also
capture entailment relations. This is achieved by extending the representations
of words from points in meaning space to density operators, which are
probability distributions on the subspaces of the space. A symmetric measure of
similarity and an asymmetric measure of entailment is defined, where lexical
entailment is measured using von Neumann entropy, the quantum variant of
Kullback-Leibler divergence. Lexical entailment, combined with the composition
map on word representations, provides a method to obtain entailment relations
on the level of sentences. Truth theoretic and corpus-based examples are
provided.Comment: 11 page
Maintaining regularity and generalization in data using the minimum description length principle and genetic algorithm: case of grammatical inference
In this paper, a genetic algorithm with minimum description length (GAWMDL) is proposed for grammatical inference. The primary challenge of identifying a language of infinite cardinality from a finite set of examples should know when to generalize and specialize the training data. The minimum description length principle that has been incorporated addresses this issue is discussed in this paper. Previously, the e-GRIDS learning model was proposed, which enjoyed the merits of the minimum description length principle, but it is limited to positive examples only. The proposed GAWMDL, which incorporates a traditional genetic algorithm and has a powerful global exploration capability that can exploit an optimum offspring. This is an effective approach to handle a problem which has a large search space such the grammatical inference problem. The computational capability, the genetic algorithm poses is not questionable, but it still suffers from premature convergence mainly arising due to lack of population diversity. The proposed GAWMDL incorporates a bit mask oriented data structure that performs the reproduction operations, creating the mask, then Boolean based procedure is applied to create an offspring in a generative manner. The Boolean based procedure is capable of introducing diversity into the population, hence alleviating premature convergence. The proposed GAWMDL is applied in the context free as well as regular languages of varying complexities. The computational experiments show that the GAWMDL finds an optimal or close-to-optimal grammar. Two fold performance analysis have been performed. First, the GAWMDL has been evaluated against the elite mating pool genetic algorithm which was proposed to introduce diversity and to address premature convergence. GAWMDL is also tested against the improved tabular representation algorithm. In addition, the authors evaluate the performance of the GAWMDL against a genetic algorithm not using the minimum description length principle. Statistical tests demonstrate the superiority of the proposed algorithm. Overall, the proposed GAWMDL algorithm greatly improves the performance in three main aspects: maintains regularity of the data, alleviates premature convergence and is capable in grammatical inference from both positive and negative corpora
- …