Speeding-up q-gram mining on grammar-based compressed texts
We present an efficient algorithm for calculating q-gram frequencies on
strings represented in compressed form, namely, as a straight line program
(SLP). Given an SLP of size that represents string , the
algorithm computes the occurrence frequencies of all q-grams in , by
reducing the problem to the weighted q-gram frequencies problem on a
trie-like structure of size , where
is a quantity that represents the amount of
redundancy that the SLP captures with respect to q-grams. The reduced problem
can be solved in linear time. Since , the running time of our
algorithm is , improving our
previous algorithm when
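To make the q-gram frequency problem on an SLP concrete, here is a minimal sketch of a simple boundary-based baseline. It is not the trie-based reduction the abstract describes. It relies on the observation that, for q >= 2, every q-gram occurrence crosses the split point of exactly one node of the derivation tree, so it is enough to inspect a short string around each rule's split point and weight it by how often that rule occurs. The rule format, the function names and the example grammar are all assumptions made for illustration.

```python
from collections import Counter

def qgram_freqs_slp(rules, start, q):
    """Count q-grams (q >= 2) of the string derived from `start`.

    `rules` maps a variable to a single character or to a (left, right) pair.
    This dictionary format is an assumption of the sketch.
    """
    if q < 2:
        raise ValueError("this sketch handles q >= 2 only")

    # Post-order DFS: every variable appears after the variables it uses.
    order, seen = [], set()
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        if isinstance(rules[v], tuple):
            visit(rules[v][0])
            visit(rules[v][1])
        order.append(v)
    visit(start)

    # Prefix and suffix (at most q-1 characters) of each variable's expansion.
    pre, suf = {}, {}
    for v in order:
        rhs = rules[v]
        if isinstance(rhs, str):
            pre[v] = suf[v] = rhs
        else:
            left, right = rhs
            pre[v] = (pre[left] + pre[right])[:q - 1]
            suf[v] = (suf[left] + suf[right])[-(q - 1):]

    # occ[v]: number of times v appears in the derivation tree.
    occ = Counter({start: 1})
    for v in reversed(order):  # parents are processed before their children
        if isinstance(rules[v], tuple):
            for child in rules[v]:
                occ[child] += occ[v]

    # Every window of suf[left] + pre[right] crosses the split point of v.
    freqs = Counter()
    for v in order:
        if isinstance(rules[v], tuple):
            left, right = rules[v]
            boundary = suf[left] + pre[right]
            for i in range(len(boundary) - q + 1):
                freqs[boundary[i:i + q]] += occ[v]
    return freqs

def expand(rules, v):
    """Fully decompress variable v (used only to check the sketch)."""
    rhs = rules[v]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])

# Hypothetical example grammar deriving "abaababa".
rules = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"), "X4": ("X3", "X1"),
    "X5": ("X4", "X3"), "X6": ("X5", "X4"),
}
print(dict(qgram_freqs_slp(rules, "X6", 2)))  # ab: 3, ba: 3, aa: 1
```

The sketch inspects at most 2(q-1) characters per rule, so its work grows with q times the number of rules rather than with the length of the decompressed string.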
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context-free grammar in Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small. Comment: SPIRE 201
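For reference, the LZ78 factorization itself can be sketched on an uncompressed string. This is the plain baseline that reads the text character by character, not the SLP-based algorithm from the abstract. The function name and the dictionary-based trie are assumptions made for illustration.

```python
def lz78_factorize(text):
    """Return the LZ78 factors of `text` as a list of substrings.

    Each factor is the longest previously seen factor followed by one new
    character. The last factor may repeat an earlier one if the text ends
    in the middle of a match.
    """
    trie = {}        # (node id, character) -> child node id; node 0 is the root
    next_id = 1
    factors = []
    node, start = 0, 0
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is None:
            # Longest match ends here: emit it plus the new character.
            trie[(node, ch)] = next_id
            next_id += 1
            factors.append(text[start:i + 1])
            node, start = 0, i + 1
        else:
            node = child
    if start < len(text):
        factors.append(text[start:])
    return factors

print(lz78_factorize("abaababa"))  # ['a', 'b', 'aa', 'ba', 'ba']
```

Concatenating the factors always reproduces the input, which is a convenient sanity check for any implementation.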
Compact q-gram Profiling of Compressed Strings
We consider the problem of computing the q-gram profile of a string \str of
size compressed by a context-free grammar with production rules. We
present an algorithm that runs in expected time and uses
O(n+q+\kq) space, where is the exact number of characters
decompressed by the algorithm and \kq\leq N-\alpha is the number of distinct
q-grams in \str. This simultaneously matches the current best known time
bound and improves the best known space bound. Our space bound is
asymptotically optimal in the sense that any algorithm storing the grammar and
the q-gram profile must use \Omega(n+q+\kq) space. To achieve this we
introduce the q-gram graph that space-efficiently captures the structure of a
string with respect to its q-grams, and show how to construct it from a
grammar
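As a rough illustration of what a q-gram profile of a grammar-compressed string is, the sketch below streams the characters of the derived string left to right and keeps only a window of q characters plus the profile. It decompresses every character, so it does not achieve the time bound stated in the abstract, and it does not build the q-gram graph. The grammar encoding (a dict from nonterminal to a list of symbols, with any symbol absent from the dict treated as a terminal) and all names are assumptions.

```python
from collections import Counter, deque

def stream_expansion(rules, start):
    """Yield the characters of the derived string without materialising it."""
    stack = [start]
    while stack:
        sym = stack.pop()
        rhs = rules.get(sym)
        if rhs is None:
            yield sym                    # terminal symbol
        else:
            stack.extend(reversed(rhs))  # leftmost symbol is popped first

def qgram_profile(rules, start, q):
    """Count every q-gram of the derived string using a sliding window."""
    window = deque(maxlen=q)
    profile = Counter()
    for ch in stream_expansion(rules, start):
        window.append(ch)
        if len(window) == q:
            profile["".join(window)] += 1
    return profile

# Hypothetical grammar deriving "ababcab".
rules = {"S": ["A", "B", "A"], "A": ["a", "b"], "B": ["A", "c"]}
print(dict(qgram_profile(rules, "S", 2)))  # ab: 3, ba: 1, bc: 1, ca: 1
```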
The anatomy of a search and mining system for digital humanities : Search And Mining Tools for Language Archives (SAMTLA)
Humanities researchers are faced with an overwhelming volume of digitised
primary source material, and "born digital" information, of relevance to their
research as a result of large-scale digitisation projects. The current digital tools
do not provide consistent support for analysing the content of digital archives
that are potentially large in scale, multilingual, and come in a range of data
formats. The current language-dependent, or project specific, approach to tool
development often puts the tools out of reach for many research disciplines in
the humanities. In addition, the tools can be incompatible with the way
researchers locate and compare the relevant sources. For instance, researchers
are interested in shared structural text patterns, known as "parallel passages",
that describe a specific cultural, social, or historical context relevant to their
research topic. Identifying these shared structural text patterns is challenging
due to their repeated yet highly variable nature, as a result of differences in
the domain, author, language, time period, and orthography.
The contribution of the thesis is a novel infrastructure that directly addresses
the need for generic, flexible, extendable, and sustainable digital tools
that are applicable to a wide range of digital archives and research in the
humanities. The infrastructure adopts a character-level n-gram Statistical
Language Model (SLM), stored in a space-optimised k-truncated suffix tree
data structure as its underlying data model. A character-level n-gram model
is a relatively new approach that is competitive with word-level n-gram models,
but has the added advantage that it is domain and language-independent,
requiring little or no preprocessing of the document text, unlike word-level
models that require some form of language-dependent tokenisation and stemming.
Character-level n-grams capture word internal features that are ignored
by word-level n-gram models, which provides greater flexibility in addressing
the information need of the user through tolerant search, and compensation
for erroneous query specification or spelling errors in the document text. Furthermore,
the SLM provides a unified approach to information retrieval and
text mining, where traditional approaches have tended to adopt separate data
models that are often ad-hoc or based on heuristic assumptions. In addition,
the performance of the character-level n-gram SLM was formally evaluated
through crowdsourcing, which demonstrated that its retrieval performance is
close to human-level performance.
The proposed infrastructure supports the development of Samtla (Search
And Mining Tools for Language Archives), which provides humanities researchers
digital tools for search, browsing, and text mining of digital archives
in any domain or language, within a single system. Samtla supersedes many of
the existing tools for humanities researchers, by supporting the same or similar
functionality of the systems, but with a domain-independent and language-independent
approach. The functionality includes a browsing tool constructed
from the metadata and named entities extracted from the document text, and a
hybrid recommendation system for recommending related queries and documents.
However, some tools are novel and were developed in response to
the specific needs of the researchers, such as the document comparison tool
for visualising shared sequences between groups of related documents. Furthermore,
Samtla is the first practical example of a system with an SLM as
its primary data model that supports the real research needs of several case
studies covering different areas of research in the humanities.
A Field Guide to Genetic Programming
xiv, 233 p. : ill. ; 23 cm. Electronic book.
A Field Guide to Genetic Programming (ISBN 978-1-4092-0073-4) is an introduction to genetic programming (GP). GP is a systematic, domain-independent method for getting computers to solve problems automatically starting from a high-level statement of what needs to be done. Using ideas from natural evolution, GP starts from an ooze of random computer programs, and progressively refines them through processes of mutation and sexual recombination, until solutions emerge. All this without the user having to know or specify the form or structure of solutions in advance. GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions. The authors
Introduction --
Representation, initialisation and operators in Tree-based GP --
Getting ready to run genetic programming --
Example genetic programming run --
Alternative initialisations and operators in Tree-based GP --
Modular, grammatical and developmental Tree-based GP --
Linear and graph genetic programming --
Probabilistic genetic programming --
Multi-objective genetic programming --
Fast and distributed genetic programming --
GP theory and its applications --
Applications --
Troubleshooting GP --
Conclusions.
Contents
1 Introduction
1.1 Genetic Programming in a Nutshell
1.2 Getting Started
1.3 Prerequisites
1.4 Overview of this Field Guide
I Basics
2 Representation, Initialisation and Operators in Tree-based GP
2.1 Representation
2.2 Initialising the Population
2.3 Selection
2.4 Recombination and Mutation
3 Getting Ready to Run Genetic Programming
3.1 Step 1: Terminal Set
3.2 Step 2: Function Set
3.2.1 Closure
3.2.2 Sufficiency
3.2.3 Evolving Structures other than Programs
3.3 Step 3: Fitness Function
3.4 Step 4: GP Parameters
3.5 Step 5: Termination and solution designation
4 Example Genetic Programming Run
4.1 Preparatory Steps
4.2 Step-by-Step Sample Run
4.2.1 Initialisation
4.2.2 Fitness Evaluation
4.2.3 Selection, Crossover and Mutation
4.2.4 Termination and Solution Designation
II Advanced Genetic Programming
5 Alternative Initialisations and Operators in Tree-based GP
5.1 Constructing the Initial Population
5.1.1 Uniform Initialisation
5.1.2 Initialisation may Affect Bloat
5.1.3 Seeding
5.2 GP Mutation
5.2.1 Is Mutation Necessary?
5.2.2 Mutation Cookbook
5.3 GP Crossover
5.4 Other Techniques
6 Modular, Grammatical and Developmental Tree-based GP
6.1 Evolving Modular and Hierarchical Structures
6.1.1 Automatically Defined Functions
6.1.2 Program Architecture and Architecture-Altering
6.2 Constraining Structures
6.2.1 Enforcing Particular Structures
6.2.2 Strongly Typed GP
6.2.3 Grammar-based Constraints
6.2.4 Constraints and Bias
6.3 Developmental Genetic Programming
6.4 Strongly Typed Autoconstructive GP with PushGP
7 Linear and Graph Genetic Programming
7.1 Linear Genetic Programming
7.1.1 Motivations
7.1.2 Linear GP Representations
7.1.3 Linear GP Operators
7.2 Graph-Based Genetic Programming
7.2.1 Parallel Distributed GP (PDGP)
7.2.2 PADO
7.2.3 Cartesian GP
7.2.4 Evolving Parallel Programs using Indirect Encodings
8 Probabilistic Genetic Programming
8.1 Estimation of Distribution Algorithms
8.2 Pure EDA GP
8.3 Mixing Grammars and Probabilities
9 Multi-objective Genetic Programming
9.1 Combining Multiple Objectives into a Scalar Fitness Function
9.2 Keeping the Objectives Separate
9.2.1 Multi-objective Bloat and Complexity Control
9.2.2 Other Objectives
9.2.3 Non-Pareto Criteria
9.3 Multiple Objectives via Dynamic and Staged Fitness Functions
9.4 Multi-objective Optimisation via Operator Bias
10 Fast and Distributed Genetic Programming
10.1 Reducing Fitness Evaluations/Increasing their Effectiveness
10.2 Reducing Cost of Fitness with Caches
10.3 Parallel and Distributed GP are Not Equivalent
10.4 Running GP on Parallel Hardware
10.4.1 Master–slave GP
10.4.2 GP Running on GPUs
10.4.3 GP on FPGAs
10.4.4 Sub-machine-code GP
10.5 Geographically Distributed GP
11 GP Theory and its Applications
11.1 Mathematical Models
11.2 Search Spaces
11.3 Bloat
11.3.1 Bloat in Theory
11.3.2 Bloat Control in Practice
III Practical Genetic Programming
12 Applications
12.1 Where GP has Done Well
12.2 Curve Fitting, Data Modelling and Symbolic Regression
12.3 Human Competitive Results – the Humies
12.4 Image and Signal Processing
12.5 Financial Trading, Time Series, and Economic Modelling
12.6 Industrial Process Control
12.7 Medicine, Biology and Bioinformatics
12.8 GP to Create Searchers and Solvers – Hyper-heuristics
12.9 Entertainment and Computer Games
12.10 The Arts
12.11 Compression
13 Troubleshooting GP
13.1 Is there a Bug in the Code?
13.2 Can you Trust your Results?
13.3 There are No Silver Bullets
13.4 Small Changes can have Big Effects
13.5 Big Changes can have No Effect
13.6 Study your Populations
13.7 Encourage Diversity
13.8 Embrace Approximation
13.9 Control Bloat
13.10 Checkpoint Results
13.11 Report Well
13.12 Convince your Customers
14 Conclusions
IV Tricks of the Trade
A Resources
A.1 Key Books
A.2 Key Journals
A.3 Key International Meetings
A.4 GP Implementations
A.5 On-Line Resources
B TinyGP
B.1 Overview of TinyGP
B.2 Input Data Files for TinyGP
B.3 Source Code
B.4 Compiling and Running TinyGP
Bibliography
Inde
- …