200 research outputs found
Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks.
Genetic studies are increasingly based on short, noisy reads from next generation scanners. Typically, complete DNA sequences are assembled by matching short NextGen sequences against reference genomes. Despite considerable algorithmic gains since the turn of the millennium, matching both single ended and paired end strings to a reference remains computationally demanding. Furthermore, tailoring bioinformatics tools to each new task or scanner remains highly skilled and labour intensive. With this in mind, we recently demonstrated a genetic programming based automated technique which generated a version of the state-of-the-art alignment tool Bowtie2 that was considerably faster on short sequences produced by a scanner at the Broad Institute and released as part of The Thousand Genome Project
Genetic programming for mining DNA chip data from cancer patients
In machine learning terms DNA (gene) chip data is unusual in having thousands of attributes (the gene expression values) but few (<100) records (the patients). A GP based method for both feature selection and generating simple models based on a few genes is demonstrated on cancer data
Evolving DNA motifs to predict GeneChip probe performance
Background: Affymetrix High Density Oligonucleotide Arrays (HDONA) simultaneously measure expression of thousands of genes using millions of probes. We use correlations between measurements for the same gene across 6685 human tissue samples from NCBI's GEO database to indicate the quality of individual HG-U133A probes. Low correlation indicates a poor probe. Results: Regular expressions can be automatically created from a Backus-Naur form (BNF) context-free grammar using strongly typed genetic programming. Conclusion: The automatically produced motif is better at predicting poor DNA sequences than an existing human generated RE, suggesting runs of Cytosine and Guanine and mixtures should all be avoided. © 2009 Langdon and Harrison; licensee BioMed Central Ltd
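As a rough illustration of the kind of regular-expression motif the abstract describes, the Python sketch below flags probe sequences that contain a run of C and G bases, pure or mixed. The pattern, the minimum run length of four, and the function name are assumptions made for illustration; this is not the motif evolved in the paper.

```python
import re

# Hypothetical motif: four or more consecutive bases drawn from C and G.
# Illustrative assumption only, not the motif evolved in the paper.
POOR_PROBE_MOTIF = re.compile(r"[CG]{4,}")


def looks_like_poor_probe(sequence: str) -> bool:
    """Return True if the probe sequence contains the hypothetical motif."""
    return POOR_PROBE_MOTIF.search(sequence.upper()) is not None


if __name__ == "__main__":
    for probe in ("ATATGGGGCATAT", "ATCATGATCATGA"):
        print(probe, looks_like_poor_probe(probe))
```

A real predictor would be scored against the per-probe correlations described above; the sketch only shows the matching step.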
Kin selection with twin genetic programming
In steady state Twin GP both children created by sub-tree crossover and point mutation are used. They are born together and die together. Evolution is little changed. Indeed fitness selection using the twin’s co-conceived doppelganger is possible
Mycoplasma Contamination in The 1000 Genomes Project
Background: In silico biology is increasingly important and is often based on public datasets. While the problem of contamination is well recognised in microbiology labs, the corresponding problem of database corruption has received less attention. Results: Mapping 50 billion next generation DNA sequences from The Thousand Genome Project against published genomes reveals many that match one or more Mycoplasma but are not included in the reference human genome GRCh37.p5. Many of these are of low quality but NCBI BLAST searches confirm some high quality, high entropy sequences match Mycoplasma but no human sequences. Conclusions: It appears at least 7 percent of 1000G samples are contaminated
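The abstract separates low quality matches from high entropy ones without giving its entropy measure. A minimal sketch, assuming per-base Shannon entropy over the characters of a read (the function name and this choice of measure are assumptions, not taken from the paper):

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy, in bits per base, of the symbols in a DNA sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the observed symbols.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    for read in ("AAAAAAAAAAAA", "ACGTACGTACGT"):
        print(read, round(shannon_entropy(read), 3))
```

Under this measure a homopolymer read scores 0 bits and a read using all four bases equally scores 2 bits, so a threshold between the two could be used to discard repetitive matches.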
Failed disruption propagation in integer genetic programming
We inject a random value into the evaluation of highly evolved deep integer GP trees 9 743 720 times and find 99.7% of test outputs are unchanged, suggesting that the impact of crossover and mutation is dissipated and seldom propagates outside the program. Indeed only errors near the root node have impact, and disruption falls exponentially with depth at between $e^{-\mathrm{depth}/3}$ and $e^{-\mathrm{depth}/5}$ for recursive Fibonacci GP trees, allowing five to seven levels of nesting between the runtime perturbation and an optimal test oracle for it to detect most errors. Information theory explains that this locally flat fitness landscape is due to failed disruption propagation (FDP). Overflow is not important; instead, integer GP, like deep symbolic regression floating point GP and software in general, is not fragile, is robust, is not chaotic and suffers little from Lorenz' butterfly
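To give a feel for the decay rates quoted above, the following Python sketch tabulates the two bounding exponentials at a few depths. It treats exp(-depth/scale) as a simple model of the surviving disruption; the function name and the chosen depths are assumptions for illustration, not the paper's measurements.

```python
import math


def disruption_fraction(depth: int, scale: float) -> float:
    """Modelled fraction of disruption surviving `depth` levels: exp(-depth/scale)."""
    return math.exp(-depth / scale)


if __name__ == "__main__":
    # Scales 3 and 5 are the two bounds quoted in the abstract.
    for depth in (1, 3, 5, 7, 10):
        fast = disruption_fraction(depth, 3.0)
        slow = disruption_fraction(depth, 5.0)
        print(f"depth {depth:2d}: between {fast:.3f} and {slow:.3f}")
```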
Dissipative Arithmetic
Large arithmetic expressions are dissipative: they lose information and are robust to perturbations. Lack of conservation gives resilience to fluctuations. The limited precision of floating point and the mixture of linear and nonlinear operations make such functions anti-fragile and give a largely stable locally flat plateau a rich fitness landscape. This slows long-term evolution of complex programs, suggesting a need for depth-aware crossover and mutation operators in tree-based genetic programming. It also suggests that deeply nested computer program source code is error tolerant because disruptions tend to fail to propagate, and therefore the optimal placement of test oracles is as close to software defects as practical
Deep Genetic Programming Trees Are Robust
We sample the genetic programming tree search space and show it is smooth, since many mutations on many test cases have little or no fitness impact. We generate uniformly at random high-order polynomials composed of 12,500 and 750,000 additions and multiplications and follow the impact of small changes to them. From information theory, 32 bit floating point arithmetic is dissipative, and even with 1,501 test cases, deep mutations seldom have any impact on fitness. Absolute difference between parent and child evaluation can grow as well as fall further from the code change location, but the number of disrupted fitness tests falls monotonically. In many cases, deeply nested expressions are robust to crossover syntax changes, bugs, errors, run time glitches, perturbations, and so on, because their disruption falls to zero, and so it fails to propagate beyond the program
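The loss of information in 32 bit arithmetic can be seen in a toy example. The Python sketch below rounds every intermediate result to 32 bit precision and shows a change to an inner value vanishing once a much larger term is added. The expression, the constants, and the helper names are assumptions chosen for illustration and are far smaller than the polynomials sampled in the paper.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float to the nearest 32 bit floating point value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add32(a: float, b: float) -> float:
    """Addition with the result rounded to 32 bit precision."""
    return to_float32(a + b)


def mul32(a: float, b: float) -> float:
    """Multiplication with the result rounded to 32 bit precision."""
    return to_float32(a * b)


def evaluate(x: float) -> float:
    """Hypothetical nested expression ((x + 2) * 3) + 1e8, rounded at each step."""
    inner = add32(x, 2.0)
    scaled = mul32(inner, 3.0)
    return add32(scaled, 1.0e8)


if __name__ == "__main__":
    # The inner sums differ, but the outer results are identical because
    # 32 bit floats near 1e8 are spaced 8 apart.
    print(add32(1.0, 2.0), add32(1.25, 2.0))
    print(evaluate(1.0), evaluate(1.25))
```

Here the perturbation is visible one level above where it is injected but has disappeared by the root, which is the pattern the abstract reports for deep trees.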
CSM-423 - Evolutionary Solo Pong Players
An Internet Java Applet http://www.cs.essex.ac.uk/staff/poli/SoloPong/ allows users anywhere to play the Solo Pong game. We compare people’s performance to a hand coded ‘Optimal’ player and programs automatically produced by artificial intelligence. The AI techniques are: genetic programming, including a hybrid of GP and a human designed algorithm, and a particle swarm optimiser. The AI approaches are not fine tuned. GP and PSO find good players. Evolutionary computation (EC) is able to beat both human designed code and human players
Optimising Existing Software with Genetic Programming
We show genetic improvement of programs (GIP) can scale by evolving increased performance in a widely-used and highly complex 50000 line system. GISMOE found code that is 70 times faster (on average) and yet is at least as good functionally. Indeed it even gives a small semantic gain