Skip to content
forked from PonyGE/PonyGE2

Genetic Improvement for Regex Performance (wall-clock time)

License

Notifications You must be signed in to change notification settings

codykenb/PonyGE2

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction


Grammatical Evolution (GE) is a population-based evolutionary algorithm, where a BNF-style grammar is used in the genotype to phenotype mapping process.

PonyGE2 is an implementation of GE in Python. It's intended as an advertisement and a starting-point for those new to GE, a reference for students and researchers, a rapid-prototyping medium for our own experiments, and as a Python workout.

The original version of PonyGE (https://github.com/jmmcd/ponyge) was originally designed to be a small single-file implementation of GE. However, over time this has grown to a stage where a more formal structured approach was needed. This has led to the development of PonyGE2 (https://github.com/jmmcd/ponyge2), presented here.

A technical paper describing PonyGE2 has been published and been made freely available on arXiv at:

https://arxiv.org/abs/1703.08535

PonyGE2 can be referenced using the following citation:

    Fenton, M., McDermott, J., Fagan, D., Forstenlechner, S., Hemberg, E., and O'Neill, M. PonyGE2: Grammatical Evolution in Python. arXiv preprint, arXiv:1703.08535, 2017.

The PonyGE2 development team can be contacted at:

Michael Fenton <[email protected]>,
James McDermott <[email protected]>,
Erik Hemberg <[email protected]>,
David Fagan <[email protected]>,
Stefan Forstenlechner <[email protected]>.

PonyGE2 is copyright (C) 2009-2017

Requirements


PonyGE requires Python 3.5 or higher. Using matplotlib, numpy, scipy, scikit-learn (sklearn), pandas.

All requirements can be satisfied with Anaconda.

Running PonyGE


We don't provide any setup script. You can run an example problem (the default is regression, see below) just by typing:

$ cd src
$ python ponyge.py

This will run an example problem and generate a results folder. The folder contains several files showing the run's stats, producing graphs and documenting the parameters used, as well as a file detailing the best individual. For a more verbose command line experience run the following:

$ cd src
$ python ponyge.py --verbose

Each line of the verbose output corresponds to a generation in the evolution, and prints out all statistics on the current run (only if --verbose is specified). Upon completion of a run, the best individual is printed to the command line, along with summary statistics.

There are a number of arguments that can be used for passing values via the command-line. To see a full list of these just run the following:

$ python ponyge.py --help

About PonyGE2


Grammatical Evolution (GE) [O'Neill & Ryan, 2003] is a grammar-based form of Genetic Programming [Koza, 1992]. It marries principles from molecular biology to the representational power of formal grammars. GE’s rich modularity gives a unique flexibility, making it possible to use alternative search strategies, whether evolutionary, deterministic or some other approach, and to radically change its behaviour by merely changing the grammar supplied. As a grammar is used to describe the structures that are generated by GE, it is trivial to modify the output structures by simply editing the plain text grammar. This is one of the main advantages that makes the GE approach so attractive. The genotype-phenotype mapping also means that instead of operating exclusively on solution trees, as in standard GP, GE allows search operators to be performed on the genotype (e.g., integer or binary chromosomes), in addition to partially derived phenotypes, and the fully formed phenotypic derivation trees themselves.

PonyGE2 is primarily a Python implementation of canonical Grammatical Evolution, but it also includes a number of other popular techniques and EC aspects.

Evolutionary Parameters


One of the central components of PonyGE is the algorithm.parameters.params dictionary. This dictionary is referenced throughout the entire program and is used to streamline the whole process by keeping all optional parameters in the one place. This also means that there is little to no need for arguments to the various functions in PonyGE, as these arguments can often be read directly from the parameters dictionary. Furthermore, the parameters dictionary is used to specify and store optional functions such as initialisation, crossover, mutation, and replacement.

There are three different ways to specify operational parameters with PonyGE.

  1. The first and most basic method is to modify the algorithm.parameters.params dictionary directly in the code. This is not encouraged, as the algorithm.parameters.params dictionary contains the default values for many parameters.

  2. The second method is to list your desired parameters in a specialised parameters text file. Example parameters files are located in the parameters folder. When using parameters files, it is necessary to specify the desired parameter file from the command line. This is done by calling

     --parameters [FULL FILE NAME INCLUDING EXTENSION]
    
  3. The third and final method is to list desired parameters from the command line. To see a list of all currently available command-line arguments implemented in the parser, type

     $ python ponyge.py --help
    

NOTE that each of the above three options successively supersedes the previous ones, i.e. parameters specified in a parameters file will over-write those set in the original algorithm.parameters.params dictionary, and parameters set from the command line will over-write those set in the parameters file.

PonyGE2 automatically parses the correct path for all operators, meaning you don't have to specify the full direct path but only the name of the desired operator, e.g.

--crossover subtree

instead of

--crossover operators.crossover.subtree

However, it is still possible to specify the full correct path if you so desire. Specifying the full direct path allows you to create new operators and place them wherever you like.

It is possible to add new parameters to the algorithm.parameters.params dictionary either by editing the dictionary directly, or simply by specifying any desired parameters in a parameters file. However, it is not currently possible to specify new parameters directly from the command line. This is because any existing parameters that were merely mis-spelled from the command line would then create new parameters, and the evolutionary system would not perform as expected. Thus, any parameters entered from the command line that do not exist in the command line parser in utilities.algorithm.command_line_parser.parse_cmd_args will produce an error. Of course, it is entirely possible to modify the command line parser to accept new arguments.

NOTE that parameter names should follow the naming convention demonstrated with the existing parameters. All names should contain only uppercase letters and underscores. Avoid the use of spaces where possible and do not use a colon (:) in the name of a parameter.

Grammars


When tackling a problem with GE, a suitable BNF (Backus Naur Form) grammar definition must initially be defined. The BNF can be either the specification of an entire language or, perhaps more usefully, a subset of a language geared towards the problem at hand.

In GE, a BNF definition is used to describe the output language to be produced by the system. BNF is a notation for expressing the grammar of a language in the form of production rules. BNF grammars consist of terminals, which are items that can appear in the language, e.g. locally or globally defined variables, binary boolean operators and, or, xor, and nand, unary boolean operators not, constants, True and False etc. and non-terminals, which can be expanded into one or more terminals and non-terminals.

Each production rule is composed of a left-hand side (a single non-terminal), followed by the "goes-to" symbol ::=, followed by a list of production choices separated by the "or" symbol |. Production choices can be composed of any combination of terminals or non-terminals. Non-terminals are enclosed by angle brackets <>. For example, consider the following production rule:

<a> ::= <b>c | d

In this rule, the non-terminal <a> maps to either the choice <b>c (a combination of a new non-terminal <b> and a terminal c), or a single terminal d.

A full grammar is built up of any combinations of such rules.

NOTE that all non-terminals must be fully defined in the grammar, i.e. every non-terminal that appears in the grammar must have a full production rule with defined production choices.

Recursion

One of the most powerful aspects of GE is that the representation can be variable in length. Notably, rules can be recursive (i.e. a non-terminal production rule can contain itself as a production choice), which can allow GE to generate solutions of arbitrary size, e.g.:

<a> ::= <a> + b | b

The code produced by a grammar will consist of elements of the terminal set T. The grammar is used in a developmental approach whereby the evolutionary process evolves the production rules to be applied at each stage of a mapping process, starting from the start symbol, until a complete program is formed. A complete program is one that is comprised solely from elements of T .

In PonyGE2 the BNF definition is comprised entirely of the set of production rules, with the definition of terminals and non-terminals implicit in these rules. The first non-terminal symbol is by default the start symbol. As the BNF definition is a plug-in component of the system, it means that GE can produce code in any language thereby giving the system a unique flexibility.

Writing Grammars

You can have:

  • production separators in multiple lines,
  • entire lines of commented text,
  • comments at the end of any line,
  • single quotation within double quotation and vice versa, and
  • any characters can be used in quotation, even separators ("|") or angle brackets ("<" / ">").

Additionally, the code becomes more readable as well as maintainable and it is not as error prone. Example grammars are provided in the grammars folder.

Grammars are parsed using regular expressions. Examples on parsing some of grammars can be found here:

Variable ranges in grammars

A useful special case is available when writing grammars: a production can be given as GE_RANGE:4, for example, and this will be replaced by a set of productions: 0 | 1 | 2 | 3. With GE_RANGE:dataset_n_vars, the number of productions will be set by the number of columns in the file given by the --dataset argument, if any. Using grammar productions like the following, we can avoid hard-coding the number of independent variables in the grammar:

<var> ::= x[<varidx>]
<varidx> ::= GE_RANGE:dataset_n_vars

See grammars/supervised_learning.bnf for a full example.

Along with the fitness function, grammars are one of the most problem-specific components of the PonyGE2 algorithm. The performance of PonyGE2 can be vastly affected by the quality of the grammar used.

Grammar Files

All grammars are stored in the grammars folder. Grammars can be set with the argument:

--grammar_file [FILE_NAME]

or by setting the parameter GRAMMAR_FILE to [FILE_NAME] in either a parameters file or in the params dictionary, where [FILE_NAME] is a the full file name of the desired grammar file including the file extension.

NOTE that the full file extension (e.g. ".bnf") must be specified, but the full file path (e.g. grammars/example_grammar.bnf) does not need to be specified.

A note on unit productions.

Traditionally GE would not consume a codon for unit productions. This was a design decision taken by O'Neill et al [O'Neill and Ryan, 2003]. However, in PonyGE2 unit productions consume codons, the logic being that it helps to do linear tree-style operations. Furthermore, the checks needed for unit productions during the running of the algorithm can add up to millions of checks that aren't needed if we just consume codons for unit productions.

The original design decision on unit productions was also taken before the introduction of evolvable grammars whereby the arity of a unit production could change over time. In this case consuming codons will help to limit the ripple effect from that change in arity. This also replicates non coding regions of genome as seen in nature.

In summary, the merits for not consuming a codon for unit productions are not clearly defined in the literature. The benefits in consuming codons are a reduction in computation and improved speed with linear tree style operations. Other benefits are an increase in non-coding regions in the chromosome (more in line with nature) that through evolution of the grammar may then express useful information.

Linear Genome Representation


Canonical Grammatical Evolution uses linear genomes (also called chromosomes) to encode genetic information [O'Neill & Ryan, 2003]. These linear genomes are then mapped via the use of a formal BNF-style grammar to produce a phenotypic output. All individuals in PonyGE2 have an associated linear genome which can be used to exactly reproduce that individual.

PonyGE2 contains a number of operators that manage linear genomes. These are discussed in later sections.

NOTE that in general the use of a linear genome does not allow for "intelligent" operations. Although intelligent linear genome operators exist, e.g. [Byrne et al., 2009], they are not implemented here as similar functions can be performed in a simpler manner using derivation-tree based operations.

Codon Size

Each codon in a genome is an integer value that maps to a specific production choice when passed through the grammar. When generating a codon to represent a production choice, a random integer value is chosen that represents that correct production choice. The maximum value a codon can take is set by default at 10000. This value can be changed with the argument:

--codon_size [INT]

or by setting the parameter CODON_SIZE to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the maximum value a codon can take.

Genotype-Phenotype Mapping Process

The genotype is used to map the start symbol as defined in the Grammar onto terminals by reading codons to generate a corresponding integer value, from which an appropriate production rule is selected by using the following mapping function:

Rule = c mod r

where c is the codon integer value, and r is the number of rule choices for the current non-terminal symbol.

Consider the following rule from the given grammar, i.e., given the non-terminal <op>, which describes the set of mathematical operators that can be used, there are four production rules to select from. As can be seen, the choices are effectively labelled with integers counting from zero.

    <op> ::= + (0)
           | - (1)
           | * (2)
           | / (3)

If we assume the codon being read produces the integer 6, then

6 mod 4 = 2

would select rule (2) *. Therefore, the non-terminal <op> is replaced with the terminal * in the derivation string. Each time a production rule has to be selected to transform a non-terminal, another codon is read. In this way the system traverses the genome.

Wrapping

During the genotype-to-phenotype mapping process, it is possible for the genome to run out of codons before the mapping process has terminated. In this case, a wrapping operator can applied which results in the mapping process re-reading the genome again from the start (i.e. wrapping past the end of the genome back to the beginning). As such, codons are reused when wrapping occurs. This is quite an unusual approach in evolutionary algorithms as it is entirely possible for certain codons to be used two or more times depending on the number of wraps specified. This technique of wrapping the individual draws inspiration from the gene-overlapping phenomenon that has been observed in many organisms [Lewin, 2000]. GE works with or without wrapping, and wrapping has been shown to be useful on some problems [O'Neill & Ryan, 2003], however, it does come at the cost of introducing functional dependencies between codons that would not otherwise arise.

By default, wrapping in PonyGE2 is not used (i.e. the MAX_WRAPS parameter is set to 0). This can be changed with the argument:

--max_wraps [INT]

or by setting the parameter MAX_WRAPS to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the desired maximum number of times the mapping process is permitted to wrap past the end of the genome back to the beginning again.

NOTE that permitting the mapping process to wrap on genomes does not necessarily mean it will wrap across genomes. The provision is merely allowed.

Invalid Individuals

In GE each time the same codon is expressed it will always generate the same integer value, but depending on the current non-terminal to which it is being applied, it may result in the selection of a different production rule. This feature is referred to as intrinsic polymorphism. What is crucial however, is that each time a particular individual is mapped from its genotype to its phenotype, the same output is generated. This is the case because the same choices are made each time. It is possible that an incomplete mapping could occur, even after several wrapping events, and typically in this case the mapping process is aborted and the individual in question is given the lowest possible fitness value. The selection and replacement mechanisms then operate accordingly to increase the likelihood that this individual is removed from the population.

An incomplete mapping could arise if the integer values expressed by the genotype were applying the same production rules repeatedly. For example, consider an individual whose full genome consists of three codons [3, 21, 9], all three of which map to production choice 0 from the following rule:

<e> ::= (<e><op><e>) (0)
      | <e>          (1)
      | <op>         (2)

Even after wrapping, the mapping process would be incomplete and would carry on indefinitely unless terminated. This occurs because the non-terminal <e> is being mapped recursively by production rule 0, i.e., <e> becomes (<e><op><e>). Therefore, the leftmost <e> after each application of a production would itself be mapped to a (<e><op><e>), resulting in an expression continually growing as follows:

((<e><op><e>)<op><e>)

followed by

(((<e><op><e>)<op><e>)<op><e>

and so on.

Since the genome has been completely traversed (even after wrapping), and the derivation string (i.e. the derived expression) still contains non-terminals, such an individual is dubbed invalid as it will never undergo a complete mapping to a set of terminals. For this reason an upper limit on the number of wrapping events that can occur is imposed (as specified by the parameter MAX_WRAPS, detailed above), otherwise mapping could continue indefinitely in this case. During the mapping process therefore, beginning from the left hand side of the genome codon integer values are generated and used to select rules from the BNF grammar, until one of the following situations arise:

  1. A complete program is generated. This occurs when all the non-terminals in the expression being mapped are transformed into elements from the terminal set of the BNF grammar.
  2. The end of the genome is reached, in which case the wrapping operator is invoked. This results in the return of the genome reading frame to the left hand side of the genome once again. The reading of codons will then continue, unless an upper threshold representing the maximum number of wrapping events has occurred during this individual’s mapping process.
  3. In the event that a threshold on the number of wrapping events has occurred and the individual is still incompletely mapped, the mapping process is halted, and the individual is assigned a NaN fitness value.

To reduce the number of invalid individuals being passed from generation to generation various strategies can be employed. Strong selection pressure could be applied, for example, through a steady state replacement. One consequence of the use of a steady state method is its tendency to maintain fit individuals at the expense of less fit, and in particular, invalid individuals. Alternatively, a repair strategy can be adopted, which ensures that every individual results in a valid program. For example, in the case that there are non-terminals remaining after using all the genetic material of an individual (with or without the use of wrapping) default rules for each non-terminal can be pre-specified that are used to complete the mapping in a deterministic fashion. Another strategy is to remove the recursive production rules that cause an individual’s phenotype to grow, and then to reuse the genotype to select from the remaining non-recursive rules. Finally, the use of genetic operators which manipulate the derivation tree rather than the linear genome can be used to ensure the generation of completely mapped phenotype strings.

Derivation Tree Representation


During the genotype-to-phenotype mapping process, a derivation tree is implicitly generated. Since each production choice generates a codon, it can be viewed as a node in an overall derivation tree. The parent rule that generated that choice is viewed as the parent node, and any production choices resultant from non-terminals in the current production choice are viewed as child nodes. The depth of a particular node is defined as how many parents exist in the tree directly above it, with the root node of the entire tree (the start symbol of the grammar) being at depth 1. Finally, the root of each individual node in the derivation tree is the non-terminal production rule that generated the node choice itself.

While linear genome mapping means that each individual codon specifies the production choice to be selected from the given production rule, it is possible to do the opposite. Deriving an individual solution purely using the derivation tree (i.e. not using the genotype-to-phenotype mapping process) is entirely possible, and indeed provides a lot more flexibility towards the generation of individuals than a linear mapping.

In a derivation tree based mapping process, each individual begins with the start rule of the grammar (as with the linear mapping). However, instead of a codon from the genome defining the production to be chosen from the given rule, a random production is chosen. Once a production is chosen, it is then possible to retroactively create a codon that would result in that same production being chosen if a linear mapping were to be used. In order to generate a viable codon, first the index of the chosen production is taken from the overall list of production choices for that rule. Then, a random integer from within the range [number of choices : number of choices : CODON_SIZE] (i.e. a number from number of choices to CODON_SIZE with a step size of number of choices). Finally, the index of the chosen production is added to this random integer. This results in a codon which will re-produce the production choice. For example, consider the following rule:

<e> ::= a | b | c

Now, let us randomly select the production choice b. The index of production choice b is 1. Next, we randomly select an integer from within the range [3, CODON_SIZE], giving us a random number of 768. Finally, we add the index of production choice b, to give a codon of 769. In this manner it is possible to build a derivation tree, where each node will have an associated codon. Simply combining all codons into a list gives the full genome for the individual.

Importantly, since the genome does not define the mapping process, it is not possible for "invalid" solutions to be generated by derivation tree based methods.

Intelligent Operations

Since production choices are not set with the use of a derivation tree representation (i.e. the production choice defines the codon, rather than the codon defining the production choice), it is possible to build derivation trees in an intelligent manner by restricting certain production choices. For example, it is possible to force derivation trees to a certain depth by only allowing recursive production choices to be made until the tree is deep enough that branches can be terminated at the desired depth. This is the basis of "intelligent" derivation methods such as Ramped Half-Half (or Sensible) initialisation.

It is also possible to perform intelligent variation operations using derivation tree methods. For example, crossover and mutation can be controlled by only selecting desired types of sub-trees for variation. Such operators are included in PonyGE2, and are described in later sections.

Bloat


Bloat occurs in evolutionary algorithms when large increases in genetic material are observed without an observed increase in fitness. There are currently three methods implemented to control genetic bloat in PonyGE2:

  1. Limiting the maximum derivation tree depth
  2. Limiting the number of nodes in a derivation tree
  3. Limiting the maximum length of the genome.

Any combination of these three methods can be used with PonyGE2. Furthermore, these bloat control measures are not limited to specific representation types; it is possible to use genome length limitation with derivation tree based operators, and vice versa.

Max Tree Depth

By default the maximum depth limit for a derivation tree is set to 90. This limit is due to Python's eval() stack limit which is set from 92-99 (depending on the Python version used), but grammar design also plays a large part in stack and memory overflow errors. Such a high depth limit can lead to genetic bloat, dramatically slowing down the overall evolutionary process. One way to prevent this is to specify a global maximum tree depth with the argument:

--max_tree_depth [INT]

or by setting the parameter MAX_TREE_DEPTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the desired maximum depth limit for derivation trees.

NOTE that setting the parameter MAX_TREE_DEPTH or argument --max_tree_depth to 0 is the same as setting no maximum tree depth, i.e. trees will be allowed to grow in an un-controlled manner. This may cause memory errors and cause PonyGE2 to crash

NOTE that the parameter MAX_TREE_DEPTH is distinct from the parameter MAX_INIT_TREE_DEPTH, which is used solely to control derivation tree depth during derivation tree-based initialisation.

Max Tree Nodes

By default there are no limits to the maximum number of nodes a derivation tree can have. This can lead to genetic bloat, dramatically slowing down the overall evolutionary process. One way to prevent this is to specify a global maximum number of derivation tree nodes with the argument:

--max_tree_nodes [INT]

or by setting the parameter MAX_TREE_NODES to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the desired maximum number of nodes for derivation trees.

NOTE that setting the parameter MAX_TREE_NODES or argument --max_tree_nodes to 0 is the same as setting no limit on the maximum number of nodes a derivation tree can have, i.e. trees will be allowed to grow in an un-controlled manner.

Max Genome Length

By default there are no limits to the maximum length a genome can take. This can lead to genetic bloat, dramatically slowing down the overall evolutionary process. One way to prevent this is to specify a global maximum genome length with the argument:

--max_genome_length [INT]

or by setting the parameter MAX_GENOME_LENGTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the desired maximum global genome length.

NOTE that setting the parameter MAX_GENOME_LENGTH or argument --max_genome_length to 0 is the same as setting no limit to the lengths of genomes, i.e. genomes will be allowed to grow in an un-controlled manner.

NOTE that the parameter MAX_GENOME_LENGTH is distinct from the parameter MAX_INIT_GENOME_LENGTH, which is used solely to control genome size during genome-based initialisation.

A full breakdown of the currently implemented elements in PonyGE2 is provided below. This includes a brief description of each individual component and how to activate them.

Population Options


There are a number of parameters within PonyGE2 for controlling overall populations.

Population Size

The population size controls the total number of individuals to be generated at each generation. The default value is 500. This value can be changed with the argument:

--population_size [INT]

or by setting the parameter POPULATION_SIZE to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the population size.

Higher population sizes can improve performance on difficult problems, but require more computational effort and may lead to premature convergence.

Generations

The number of generations the evolutionary algorithm will run for. The default value is 50. This value can be changed with the argument:

--generations [INT]

or by setting the parameter GENERATIONS to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the number of generations.

Higher numbers of generations can improve performance, but will lead to longer run-times.

Search Options


The algorithm.search_loop.search_loop() function in PonyGE2 controls the overall duration of the search process. The search loop controls the initialisation of the initial population, along with the main generations loop. In canonical GE, the search process loops over the total number of specified generations.

NOTE that in PonyGE2 the total number of generations refers to the number of generations over which the search process loops (i.e. over which evolution occurs), NOT including initialisation. Thus, specifying 50 generations will mean an initial population will be generated and evaluated, and then the evolutionary process will loop for 50 generations. Since the initialised generation will be Generation 0, the total number of individuals evaluated across an entire evolutionary run will by population x (generations + 1).

At each generation in the main search loop, the main algorithm.step.step() function is called. The step function executes a full step of the evolutionary process:

  1. Selection
  2. Variation
  • Crossover
  • Mutation
  1. Evaluation
  2. Replacement

The main search loop functions of PonyGE2 are stored in algorithm.search_loop and algorithm.step. While PonyGE2 is currently set up to only use the main search loop and step functions (save for special cases such as re-loading an evolutionary run from state), it is possible for users to write their own search loop or step functions. As long as these new functions are saved in their respective files, it is possible to specify the desired search loop or step function directly through the parameters dictionary. The desired search loop can be specified with the argument:

--search_loop [SEARCH_LOOP]

or by or by setting the parameter SEARCH_LOOP to [SEARCH_LOOP] in either a parameters file or in the params dictionary, where [SEARCH_LOOP] is the name of the desired search loop function contained in the algorithm.search_loop.py file.

The desired search loop step function can be specified with the argument:

--step [STEP]

or by or by setting the parameter STEP to [STEP] in either a parameters file or in the params dictionary, where [STEP] is the name of the desired step function contained in the algorithm.step.py file.

Initialisation


As detailed previously, there are two main ways to initialise a GE individual: by generating a genome, or by generating a derivation tree. Generation of a genome can only be done by creating a random genome string, and as such the use of genome initialisation cannot guarantee control over any aspects of the initial population. Population initialisation via derivation tree generation on the other hand allows for fine control over many aspects of the initial population, e.g. depth limits or derivation tree shape. Unlike with genome initialisation, there are a number of different ways to initialise a population using derivation trees. Currently implemented methods are detailed below.

Genome

Random

To generate individuals from initialised genomes, the only option currently implemented is to generate random genome strings.

Activate with:

--initialisation uniform_genome

or by setting the parameter INITIALISATION to uniform_genome in either a parameters file or in the params dictionary.

By default in PonyGE2, genomes of length 200 codons are generated when using random genome initialisation. However, this parameter can be changed using the argument:

--init_genome_length [INT]

or by setting the parameter INIT_GENOME_LENGTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the length of genomes to be initialised.

NOTE that random genome initialisation in Grammatical Evolution should be used with caution as poor grammar design can have a negative impact on the quality of randomly initialised solutions due to the inherent bias capabilities of GE [Fagan et al., 2016; Nicolau & Fenton, 2016].

Derivation Tree

There are currently three options provided in PonyGE2 for initialising a population of individuals using derivation tree methods. You can either initialise a population of random derivation trees, or you can use various "smart" initialisation methods implemented here.

Random

Random derivation tree initialisation generates individuals by randomly building derivation trees up to the specified maximum initialisation depth limit.

Activate with:

--initialisation uniform_tree

or by setting the parameter INITIALISATION to uniform_tree in either a parameters file or in the params dictionary.

NOTE that there is no obligation that randomly generated derivation trees will extend to the depth limit; they will be of random size [Fagan et al., 2016].

NOTE that randomly generated derivation trees will have a tendency towards smaller tree sizes with the use of a grammar-based mapping [Fagan et al., 2016].

Ramped Half-Half

Ramped Half-Half initialisation in Grammatical Evolution is often called "Sensible Initialisation" [Ryan and Azad, 2003]. Sensible Initialisation follows traditional GP Ramped Half-Half initialisation by initialising a population of individuals using two separate methods: Full and Grow.

Full initialisation generates a derivation tree where all branches extend to the specified depth limit. This tends to generate very bushy, evenly balanced trees [Fagan et al., 2016].

Grow initialisation generates a randomly built derivation tree where no branch extends past the depth limit.

NOTE that Grow is analogous to random derivation tree initialisation, i.e. no branch in the tree is forced to reach the specified depth. Depending on how the grammar is written, this can result in a very high probability of small trees being generated, regardless of the specified depth limit.

Activate with:

--initialisation rhh

or by setting the parameter INITIALISATION to rhh in either a parameters file or in the params dictionary.

RHH initialisation generates pairs of solutions using both full and grow methods for a ramped range of depths. The maximum initialisation depth is set with the argument:

--max_init_tree_depth [INT]

or by setting the parameter MAX_INIT_TREE_DEPTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the maximum depth to which derivation trees are to be initialised. The default value is set at 10.

By default in PonyGE, initialisation ramping begins at a depth where sufficient unique solutions can be generated for the number of required solutions at that depth [Nicolau & Fenton, 2016]. However, this value can be over-written in favor of a user-defined minimum ramping depth. This can be set with the argument:

--min_init_tree_depth [INT]

or by setting the parameter MIN_INIT_TREE_DEPTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the minimum depth from which derivation trees are to be initialised.

NOTE that RHH initialisation with the use of a grammar-based mapping process such as GE can potentially result in a high number of duplicate individuals in the initial generation, resulting from a potentially high number of very small solutions [Nicolau & Fenton, 2016, Fagan et al., 2016]. As such, caution is advised when using RHH initialisation in grammar-based systems, as particular care needs to be given to grammar design in order to minimise this effect [Fagan et al., 2016].

Position Independent Grow (PI Grow)

Position Independent Grow (PI Grow) initialisation in Grammatical Evolution mirrors Sensible/Ramped Half-Half initialisation by initialising a population of individuals over a ramped range of depths. However, while RHH uses two separate methods Full and Grow to generate pairs of individuals at each depth, PI Grow eschews the Full component and only uses the Grow aspect. There are two further differences between traditional GP Grow and PI Grow [Fagan et al., 2016]:

  1. At least one branch of the derivation tree is forced to the specified maximum depth in PI Grow, and
  2. Non-terminals are expanded in random (i.e. position independent) order rather than the left-first derivation of traditional mappers.

Activate with:

--initialisation PI_grow

or by setting the parameter INITIALISATION to to PI_grow in either a parameters file or in the params dictionary.

As with RHH initialisation, PI Grow initialisation generates individuals for a ramped range of depths. The maximum initialisation depth is set with the argument:

--max_init_tree_depth [INT]

or by setting the parameter MAX_INIT_TREE_DEPTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the maximum depth to which derivation trees are to be initialised. The default value is set at 10.

By default in PonyGE, initialisation ramping begins at a depth where sufficient unique solutions can be generated for the number of required solutions at that depth [Nicolau & Fenton, 2016]. However, this value can be over-written in favor of a user-defined minimum ramping depth. This can be set with the argument:

--min_init_tree_depth [INT]

or by setting the parameter MIN_INIT_TREE_DEPTH to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the minimum depth from which derivation trees are to be initialised.

Selection


The selection process is a key step in Evolutionary Algorithms. Selection drives the search process towards specific areas of the search space. The selection process operates on a population of individuals, and produces a population of "parents". These parents are then traditionally used by variation operators (detailed in the next section).

The linear genome mapping process in Grammatical Evolution can generate "invalid" individuals. Only valid individuals are selected by default in PonyGE2, however this can be changed with the argument:

--invalid_selection

or by setting the parameter INVALID_SELECTION to True in either a parameters file or in the params dictionary.

Tournament

Tournament selection selects TOURNAMENT_SIZE individuals from the overall population, sorts them, and then returns the single individual with the best fitness. Since no individuals are removed from the original population, it is possible that the same individuals may be selected multiple times to appear in multiple tournaments, although the same individual may not appear multiple times in the same tournament.

Activate with:

--selection tournament

or by setting the parameter SELECTION to tournament in either a parameters file or in the params dictionary.

Tournament size is set by default at 2. This value can be changed with the argument:

--tournament_size [INT]

or by setting the parameter TOURNAMENT_SIZE to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the tournament size.

Truncation

Truncation selection takes an entire population, sorts it, and returns the best SELECTION_PROPORTION of that population.

Activate with:

--selection truncation

or by setting the parameter SELECTION to truncation in either a parameters file or in the params dictionary.

Selection proportion is set by default at 0.5 (i.e. return the top 50% of the population). This value can be changed with the argument:

--selection_proportion [NUM]

or by setting the parameter SELECTION_PROPORTION to [NUM] in either a parameters file or in the params dictionary, where [NUM] is a float between 0 and 1.

NOTE that unless the specified SELECTION_PROPORTION is 1.0 (i.e. 100%), truncation selection necessarily returns a selected parent population that is smaller in size than the original population.

Variation


Variation operators in evolutionary algorithms explore the search space by varying genetic material of individuals in order to explore new areas of the search space. There are two main types of variation operator:

  1. Crossover
  2. Mutation

Crossover

Given a parent population of individuals picked using the given selection process described in the previous section, crossover randomly selects two parents and directly swaps genetic material between them. Parents are selected from the parent population in a non-exclusive manner, i.e. it is possible to select the same parent multiple times for multiple crossover events.

Given these two parents, the crossover probability defines the probability that a given crossover operator will perform crossover on their genetic material. The probability of crossover occurring is set with the argument:

--crossover_probability [NUM]

or by setting the parameter CROSSOVER_PROBABILITY to [NUM] in either a parameters file or in the params dictionary, where [NUM] is a float between 0 and 1. The default value for crossover is 0.75 (i.e. two selected parent individuals have a 75% chance of having genetic material crossed over between them).

Unlike canonical Genetic Programming [Koza, 1992], crossover in Grammatical Evolution always produces two children given two parents [O'Neill et al., 2003]. However, this is not a requirement for PonyGE2; the user is free to add new crossover operators producing as many children as desired.

There are currently four linear genome crossover operators implemented in PonyGE2:

  1. Fixed Onepoint
  2. Fixed Twopoint
  3. Variable Onepoint
  4. Variable Twopoint

There is also currently one derivation tree based crossover method implemented in PonyGE2.

Since linear genome crossover operators are not intelligent, i.e. crossover is applied randomly, it is therefore possible for linear crossover operators to generate invalid individuals (i.e. individuals who do not terminate mapping). In order to mitigate this issue, provision has been made in PonyGE2 to prevent crossover from generating invalid solutions. While this option is set to False by default, it can be selected with the argument:

--no_crossover_invalids

or by setting the parameter NO_CROSSOVER_INVALIDS to True in either a parameters file or in the params dictionary.

If the NO_CROSSOVER_INVALIDS parameter is used, crossover will select two new parents and perform crossover again in order to generate two valid children. This process loops until valid children are created.

NOTE that since the NO_CROSSOVER_INVALIDS parameter uses a while loop to force crossover to generate valid solutions, it is possible for crossover to get stuck in an infinite loop if this option is selected. As such, caution is advised when using this option.

NOTE that since crossover operators modify the parents in some fashion, copies of the parents must first be made before crossover is applied. If copies are not made, then the original parents in the selected parent population would be modified in-place, and subsequent modification of these parents would change any children so produced.

Fixed Onepoint

Given two individuals, fixed onepoint crossover creates two children by selecting the same point on both genomes for crossover to occur. The head of genome 0 is then combined with the tail of genome 1, and the head of genome 1 is combined with the tail of genome 0. This means that genomes will always remain the same length after crossover. Fixed onepoint crossover can be activated with the argument:

--crossover fixed_onepoint

or by setting the parameter CROSSOVER to fixed_onepoint in either a parameters file or in the params dictionary.

Crossover points are selected within the used portion of the genome by default (i.e. crossover does not occur in the unused tail of the individual). This parameter can be selected with the argument:

--within_used

or by setting the parameter WITHIN_USED to either True or False in either a parameters file or in the params dictionary.

NOTE that by default WITHIN_USED is set to True.

NOTE that selecting the argument --within_used will also set the WITHIN_USED parameter to True As such, the only way to change the WITHIN_USED parameter to False is to set so in either a parameters file or in the params dictionary.

Fixed Twopoint

Given two individuals, fixed twopoint crossover creates two children by selecting the same points on both genomes for crossover to occur. The head and tail of genome 0 are then combined with the mid-section of genome 1, and the head and tail of genome 1 are combined with the mid-section of genome 0. This means that genomes will always remain the same length after crossover. Fixed twopoint crossover can be activated with the argument:

--crossover fixed_twopoint

or by setting the parameter CROSSOVER to fixed_twopoint in either a parameters file or in the params dictionary.

As with all linear genome crossovers, crossover points are selected within the used portion of the genome by default (i.e. crossover does not occur in the unused tail of the individual).

Variable Onepoint

Given two individuals, variable onepoint crossover creates two children by selecting a different point on each genome for crossover to occur. The head of genome 0 is then combined with the tail of genome 1, and the head of genome 1 is combined with the tail of genome 0. This allows genomes to grow or shrink in length. Variable onepoint crossover can be activated with the argument:

--crossover variable_onepoint

or by setting the parameter CROSSOVER to variable_onepoint in either a parameters file or in the params dictionary.

As with all linear genome crossovers, crossover points are selected within the used portion of the genome by default (i.e. crossover does not occur in the unused tail of the individual).

NOTE that variable linear crossovers can cause individuals to grow in size, leading to bloat.

Variable Twopoint

Given two individuals, variable twopoint crossover creates two children by selecting two different points on each genome for crossover to occur. The head and tail of genome 0 are then combined with the mid-section of genome 1, and the head and tail of genome 1 are combined with the mid-section of genome 0. This allows genomes to grow or shrink in length. Variable twopoint crossover can be activated with the argument:

--crossover variable_twopoint

or by setting the parameter CROSSOVER to variable_twopoint in either a parameters file or in the params dictionary.

As with all linear genome crossovers, crossover points are selected within the used portion of the genome by default (i.e. crossover does not occur in the unused tail of the individual).

NOTE that variable linear crossovers can cause individuals to grow in size, leading to bloat.

Subtree

Given two individuals, subtree crossover creates two children by selecting candidate subtrees from both parents based on matching non-terminal nodes. The chosen subtrees are then swapped between parents, creating new children. Subtree crossover can be activated with the argument:

--crossover subtree

or by setting the parameter CROSSOVER to subtree in either a parameters file or in the params dictionary.

NOTE that subtree crossover can cause individuals to grow in size, leading to bloat.

NOTE that subtree crossover will not produce invalid individuals, i.e. given two valid parents, both children are guaranteed to be valid.

Mutation

While crossover operates on pairs of selected parents to produce new children, mutation in Grammatical Evolution operates on every individual in the child population after crossover has been applied. Note that this is different in implementation so canonical GP crossover and mutation, whereby a certain percentage of the population would be selected for crossover with the remaining members of the population subjected to mutation [Koza, 1992].

There are currently two linear mutation operators and one subtree mutation operator implemented in PonyGE2.

NOTE that linear genome mutation operators are not intelligent, i.e. mutation is applied randomly. It is therefore possible for linear mutation operators to generate invalid individuals (i.e. individuals who do not terminate mapping).

Since linear genome mutation operators are not intelligent, i.e. mutation is applied randomly, it is therefore possible for linear mutation operators to generate invalid individuals (i.e. individuals who do not terminate mapping). In order to mitigate this issue, provision has been made in PonyGE2 to prevent mutation from generating invalid solutions. While this option is set to False by default, it can be selected with the argument:

--no_mutation_invalids

or by setting the parameter NO_MUTATION_INVALIDS to True in either a parameters file or in the params dictionary.

If the NO_MUTATION_INVALIDS parameter is used, mutation will be performed on the individual indefinitely until a valid solution is created.

NOTE that even though the NO_MUTATION_INVALIDS parameter uses a while loop to force mutation to generate valid solutions, unlike with crossover it is not possible for mutation to get stuck in an infinite loop if this option is selected.

Int Flip Per Codon

Int Flip Per Codon mutation operates on linear genomes and randomly mutates every individual codon in the genome with a probability [MUTATION_PROBABILITY]. Int Flip Per Codon mutation can be activated with the argument:

--mutation int_flip_per_codon

or by setting the parameter MUTATION to int_flip_per_codon in either a parameters file or in the params dictionary. The default mutation probability is for every codon 1 over the entire length of the genome. This can be changed with the argument:

--mutation_probability [NUM]

or by setting the parameter MUTATION_PROBABILITY to [NUM] in either a parameters file or in the params dictionary, where [NUM] is a float between 0 and 1. This will change the mutation probability for each codon to the probability specified. Mutation is performed over the entire genome by default, but the argument within_used is provided to limit mutation to only the effective length of the genome.

NOTE that specifying the within_used argument for int_flip_per_codon mutation will alter the probability of per-codon mutation accordingly as the used portion of the genome may be shorter than the overall length of the genome.

Int Flip Per Ind

Int Flip Per Ind mutation operates on linear genomes and mutates MUTATION_EVENTS randomly selected codons in the genome. Int Flip Per Ind mutation can be activated with the argument:

--mutation int_flip_per_ind

or by setting the parameter MUTATION to int_flip_per_ind in either a parameters file or in the params dictionary. The default mutation events is set to 1. This can be changed with the argument:

--mutation_events [INT]

or by setting the parameter MUTATION_EVENTS to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer specifying the number of desired mutation events across the entire genome. Mutation is performed over the entire genome by default, but the argument within_used is provided to limit mutation to only the effective length of the genome.

NOTE that the parameter MUTATION_PROBABILITY does not apply to int_flip_per_ind mutation.

Subtree

Subtree mutation randomly selects a subtree from the overall derivation tree of an individual and mutates that subtree by building a new random subtree from the root node. Subtree mutation uses the same random derivation function as the Grow component of Ramped Half-Half initialisation. Subtree mutation can be activated with the argument:

--mutation subtree

or by setting the parameter MUTATION to subtree in either a parameters file or in the params dictionary.

NOTE that the parameter MUTATION_PROBABILITY does not apply to subtree mutation, i.e. each individual is guaranteed MUTATION_EVENTS mutation events.

Mutation Events

The ability to specify the number of mutation events per individual is provided in PonyGE2. This works for all mutation operators currently implemented, but works slightly differently in each case. The default number of mutation events is 1 per individual. This value can be changed with the argument:

--mutation_events [INT]

or by setting the parameter MUTATION_EVENTS to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the number of mutation events per individual.

For subtree mutation, exactly MUTATION_EVENTS number of mutation events will occur. This is accomplished by calling the subtree mutation operator MUTATION_EVENTS times for each individual. NOTE that this means that the same subtree can be mutated multiple times.

For linear genome mutation operators, the MUTATION_EVENTS parameter operates slightly differently to subtree mutation. As detailed previously, with the linear mutation operator int_flip_per_ind, exactly MUTATION_EVENTS mutations will occur on the genome (i.e. there is no MUTATION_PROBABILITY used). However, with int_flip_per_codon mutation the MUTATION_EVENTS parameter will only affect the probability of per-codon mutation events occurring. This is done by changing the probability of mutation to MUTATION_EVENTS divided by the length of the genome.

NOTE that the default value for MUTATION_EVENTS is 1, meaning to the default mutation probability for int_flip_per_codon mutation is 1 divided by the length of the genome unless either MUTATION_EVENTS or MUTATION_PROBABILITY are explicitly specified.

NOTE that the parameters MUTATION_EVENTS and MUTATION_PROBABILITY cannot both be specified for int_flip_per_codon mutation as these are mutually exclusive parameters in this case.

Evaluation


Fitness Functions

Evaluation of individuals in PonyGE2 is carried out by the specified fitness function. All fitness functions are located in src/fitness. Fitness functions in PonyGE2 must be a class instance contained in its own separate file.

NOTE that fitness function classes in PonyGE2 must have the same name as their containing file.

All newly implemented fitness functions in PonyGE2 must have a maximise attribute which indicates whether or not the fitness function seeks to maximise or minimise fitness.

All newly implemented fitness functions in PonyGE2 must have a default_fitness attribute which is used to set the fitness value of those individuals who cannot be evaluated, e.g. invalid individuals.

All newly implemented fitness functions in PonyGE2 should require only one input to the __call__ method: the individual itself. Fitness functions are called from representation.individual.Individual.evaluate.

For supervised learning fitness functions which evaluate solutions on either training or test data, PonyGE2 requires the fitness function to have a training_test attribute. This allows PonyGE2 to evaluate individuals on training data by default, and on test data by specifying the correct optional input argument to the fitness function call itself. There are a number of example supervised learning problems implemented in PonyGE2 which demonstrate these attributes.

NOTE that the call to evaluate individuals on optional test data for supervised learning problems is made in stats.stats.get_stats.

Fitness functions can be specified with the argument:

--fitness_function [FITNESS_FUNCTION_NAME]

or by setting the parameter FITNESS_FUNCTION to [FITNESS_FUNCTION_NAME] in either a parameters file or in the params dictionary, where [FITNESS_FUNCTION_NAME] is a string specifying the name of the desired fitness function.

Error Metrics

Some supervised learning fitness functions require an error metric (e.g. mean-squared error, or mse) to be specified. While the default regression and classification fitness functions provided in PonyGE2 have their error metrics set to rnse and f1_score respectively by default, it is possible to specify new error metrics with the argument:

--error_metric [ERROR_METRIC_NAME]

or by setting the parameter ERROR_METRIC to [ERROR_METRIC_NAME] in either a parameters file or in the params dictionary, where [ERROR_METRIC_NAME] is a string specifying the name of the desired error metric. A list of currently implemented error metrics is available in utilities.fitness.error_metric.

NOTE that for some supervised learning problems that use specified error metrics (e.g. mean-squared error), these error metrics have their own maximise attributes. This maximise attribute is automatically used by the specified supervised learning fitness function.

Datasets

Some fitness functions may require a dataset across which to be evaluated. Datasets for PonyGE2 are saved in the datasets folder. Most supervised learning problems require two datasets: training and test data. These are specified independently.

Training datasets can be specified with the argument:

--dataset_train [DATASET_NAME]

or by setting the parameter DATASET_TRAIN to [DATASET_NAME] in either a parameters file or in the params dictionary, where [DATASET_NAME] is a string specifying the full file name of the desired training dataset. For example, the argument:

--dataset_train Dow/Train.txt

will load in the Train.txt dataset saved in the Dow folder within datasets.

Testing datasets can be specified with the argument:

--dataset_test [DATASET_NAME]

or by setting the parameter DATASET_TEST to [DATASET_NAME] in either a parameters file or in the params dictionary, where [DATASET_NAME] is a string specifying the full file name of the desired testing dataset.

NOTE that you must specify the file extension when specifying the dataset names.

While it is recommended that supervised learning problems implement training and unseen testing data, it is not necessary to use testing data with these problems. If you wish to run PonyGE2 with no test dataset, you can simply use the argument:

--dataset_test None

Dataset Delimiters

By default, PonyGE2 will try to automatically parse specified datasets using the following set of data delimiters in the following order:

  1. "\t" [tab]
  2. "," [comma]
  3. ";" [semi-colon]
  4. ":" [colon]

If the data cannot be parsed using the above separators, then PonyGE2 will default to using whitespace as the delimiter for separating data. However, it is possible to directly specify the desired dataset delimiter with the argument:

--dataset_delimiter [DELIMITER]

or by setting the parameter DATASET_DELIMITER to [DELIMITER] in either a parameters file or in the params dictionary, where [DELIMITER] is a string specifying the desired dataset delimiter.

NOTE that you do not need to escape special characters such as "\t" with "\\t" when passing in the --dataset_delimiter argument from the command line. Simply specify the raw desired string.

Targets

Some fitness functions may require a target. For example, a string match fitness function will require a target value to match. All target values will be stored in PonyGE2 as a string by default. If a fitness function requires any data structure other than a string, the target string itself must be parsed to the desired data structure within the fitness function itself (NOTE that if this parsing is done in the __init__() call of the fitness function, it only needs to be done once rather than at every fitness evaluation.)

Target strings can be specified with the argument:

--target [TARGET_STRING]

or by setting the parameter TARGET to [TARGET_STRING] in either a parameters file or in the params dictionary, where [TARGET_STRING] is a string specifying the desired target string.

Multicore evaluation

Evaluation of a population of individuals can be done in series (single core evaluation) or in parallel (multi core evaluation). Multicore evaluation can be activated with the argument:

--multicore

or by setting the parameter MULTICORE to True in either a parameters file or in the params dictionary.

Additionally, the number of processor cores used for multicore evaluation can be controlled with the argument:

--cores [INT]

or by setting the parameter CORES to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the number of cores used for fitness evaluations. The default value is to use all available cores.

NOTE that at present multicore evaluation does not work on Windows operating systems.

NOTE that multicore evaluations may not necessarily improve computational runtime for small problems as a certain overhead is necessary to run the multicore evaluation process.

NOTE that for smaller problems fitness evaluations may not necessarily present a bottleneck in terms of computational run-time. It is advised to use a python profiler to ascertain whether or not fitness evaluations present such a bottleneck. If this is the case, multicore evaluation may improve the run-time of a single evolutionary run.

NOTE that when running batches of multiple experiments, it will always be faster to run multiple single-core experiments in parallel, rather than multiple multi-core experiments in series.

Caching

Caching is provided in PonyGE2 to save on fitness evaluations and to track the number of unique solutions encountered during an evolutionary run. Cached individuals have their fitness stored in the utilities.trackers.cache dictionary. Dictionary keys are the string of the phenotype. Caching can be activated with the argument:

--cache

or by setting the parameter CACHE to True in either a parameters file or in the params dictionary.

There are currently three optional extras for use with the cache:

1. Fitness Lookup

This is the default case when caching is activated. Individuals which have already been evaluated have their previous fitness read directly from the cache, thus saving fitness evaluations. Fitness lookup can be de-activated with:

--dont_lookup_fitness

or by setting the parameter LOOKUP_FITNESS to False in either a parameters file or in the params dictionary.

2. Fitness Penalty

Individuals which have already been evaluated (i.e. duplicate individuals) are given a default bad fitness. The fitness to be assigned is the default fitness value specified in the fitness function. This parameter can be activated with the argument:

--lookup_bad_fitness

or by setting the parameter LOOKUP_BAD_FITNESS to True in either a parameters file or in the params dictionary.

3. Mutate Duplicates

Individuals which have already been evaluated (i.e. duplicate individuals) are mutated to produce new unique individuals which have not been encountered yet by the search process. This parameter can be activated with the argument:

--mutate_duplicates

or by setting the parameter MUTATE_DUPLICATES to True in either a parameters file or in the params dictionary.

NOTE that the various caching options are mutually exclusive. For example, you cannot specify --mutate_duplicates with --lookup_bad_fitness.

Replacement


The replacement strategy for an Evolutionary Algorithm defines which parents and children survive into the next generation.

Generational

Generational replacement replaces the entire parent population with the newly generated child population at every generation. Generational replacement can be activated with the argument:

--replacement generational

or by setting the parameter REPLACEMENT to generational in either a parameters file or in the params dictionary.

Elitism

Generational replacement is most commonly used in conjunction with elitism. With elitism, the best [ELITE_SIZE] individuals in the parent population are copied over unchanged to the next generation. Elitism ensures continuity of the best ever solution at all stages through the evolutionary process, and allows for the best solution to be updated at each generation.

The number of children created at each generation is known as the GENERATION_SIZE, and is equal to [POPULATION_SIZE] - [ELITE_SIZE]. This parameter is set automatically by PonyGE2.

The default number of elites is 1 percent of the population size. This value can be changed with the argument:

--elite_size [INT]

or by setting the parameter ELITE_SIZE to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the number of elites to be saved between generations.

Steady State

With steady state replacement, only 2 children are created by the evolutionary process at each evolutionary step (i.e. the GENERATION_SIZE is automatically set to 2). Steady state replacement first selects two parents, performs crossover on them to produce two children, mutates and evaluates these children, and then replaces the two worst individuals in the original population with the new children, regardless of whether or not these children are fitter than the individuals they replace. As such, steady state replacement implements its own specialised step loop.

At each generation, steady state replacement continues until POPULATION_SIZE children have been created and inserted into the original population. Adopting a steady state replacement strategy ensures that successive populations overlap to a significant degree (i.e. parents and their children can co-exist). This requires less memory as only one population of individuals needs to be maintained at any given point in time. This strategy also allows the evolutionary process to exploit good solutions as soon as they appear.

Steady state replacement can be activated with the argument:

--replacement steady_state

or by setting the parameter REPLACEMENT to steady_state in either a parameters file or in the params dictionary.

Example Problems


Seven example problems are currently provided:

  1. String-match
  2. Regression
  3. Classification
  4. Pymax
  5. Integer sequence match
  6. Program synthesis
  7. Genetic Improvement of Regex Runtime Performance

A brief description is given below of each problem, along with the command-line arguments necessary to call each problem. The developers of PonyGE2 encourage users to test out the various different operators and options available within PonyGE2 using these example problems in order to gain an appreciation of how they work.

String-match


The grammar specifies words as lists of vowels and consonants along with special characters. The aim is to match a target string.

To use it, specify the following command-line argument:

--parameters string_match.txt

The default string match target is Hello world!, but this can be changed with the --target argument.

Regression


The grammar generates a symbolic function composed of standard mathematical operations and a set of variables. This function is then evaluated using a pre-defined set of inputs, given in the datasets folder. Each problem suite has a unique set of inputs. The aim is to minimise some error between the expected output of the function and the desired output specified in the datasets. This is the default problem for PonyGE.

To try this problem, specify the following command-line argument:

--parameters regression.txt

The default regression problem is Vladislavleva4, but this can be changed with the --grammar_file, --dataset_train and --dataset_test arguments.

Classification


Classification can be considered a special case of symbolic regression but with a different error metric. Like with regression, the grammar generates a symbolic function composed of standard mathematical operations and a set of variables. This function is then evaluated using a pre-defined set of inputs, given in the datasets folder. Each problem suite has a unique set of inputs. The aim is to minimise some classification error between the expected output of the function and the desired output specified in the datasets.

To try this problem, specify the following command-line argument:

--parameters classification.txt

The default classification problem is Banknote, but this can be changed with the --grammar_file, --dataset_train and --dataset_test arguments.

Pymax


One of the strongest aspects of a grammatical mapping approach such as PonyGE2 is the ability to generate executable computer programs in an arbitrary language [O'Neill & Ryan, 2003]. In order to demonstrate this in the simplest way possible, we have included an example Python programming problem.

The Pymax problem is a traditional maximisation problem, where the goal is to produce as large a number as possible. However, instead of encoding the grammar in a symbolic manner and evaluating the result, we have encoded the grammar for the Pymax problem as a basic Python programming example. The phenotypes generated by this grammar are executable python functions, whose outputs represent the fitness value of the individual. Users are encouraged to examine the pymax.bnf grammar and the resultant individual phenotypes to gain an understanding of how grammars can be used to generate such arbitrary programs.

To try this problem, specify the following command-line argument:

--parameters pymax.txt

Integer sequence match


In the sequence-match problem, we're given an integer sequence target, say [0, 5, 0, 5, 0, 5], and we try to synthesize a program (loops, if-statements, etc) which will yield that sequence, one item at a time. There are several components to the provided fitness function, which are weighted by numerical parameters. We can specify the target sequence and weights using parameters on the command line or in a parameters file.

To try this problem, specify the following command-line argument:

--parameters sequence_match.txt

NOTE that the sequence match dependencies are currently outside of Anaconda.

Program synthesis


The General Program Synthesis Benchmark Suite is available in PonyGE2. Grammars and datasets have been provided by HeuristicLab.CFGGP. The individuals produce executable Python code. Note: multicore is currently not supported for this type of problem.

To try this problem, specify the following command-line argument:

--parameters progsys.txt

Genetic Improvement of Regex Runtime Performance


An example of Genetic Improvement for software engineering is given for program improvement of Regular Expressions (regexs). The given examples improve existing regexes by seeding them into the population. The fitness function measures runtime and functionality of regexs.

The fitness function for the regex performance improvement problem has a number of sub-modules and programs which are used for generating test cases and for timing the execution of candidate programs. This is an exmaple of how a fitness function can be extended in PonyGE2 with customised modules and classes to perform complex tasks.

To try this problem, specify the following command-line argument:

python scripts/seed_PonyGE2.py --parameters regex_improvement.txt

Adding New Problems


It has been made as simple as possible to add new problems to PonyGE. To add in a new problem, you will need:

  1. a new grammar file specific to your problem,
  2. a new fitness function (if you don't want to use a previously existing one), and
  3. if you are doing supervised learning then you may also need to add some new datasets.

A template for new fitness functions is provided in fitness.base_ff_classes.ff_template. This template allows for new fitness functions to be easily and quickly created in an easily compatible and extensible manner.

NOTE that it may be beneficial to create a new parameters file for any new problem.

Editing Code to enable new problems

Finally, depending on the problem itself you may need to edit representation.individual.Individual.evaluate to fully integrate the new problem to PonyGE. individual.evaluate is where PonyGE specifies the inputs needed for fitness evaluation.

NOTE that it may not be necessary to edit individual.evaluate if you only pass in the individual to be evaluated, as the call argument for each fitness function only has one input by default.

Scripts


Besides the main PonyGE.py file that can be found in the source directory, a number of extra scripts are provided with PonyGE2. These are located in the scripts folder. These extra scripts have been designed to work either as standalone files, or to work in tandem with PonyGE2. Various functions from within these scripts can provide extra functionality to PonyGE2.

Basic Experiment Manager


A basic experiment manager is provided in the scripts folder. This experiment manager allows users to execute multiple evolutionary runs across multiple cores using python's multiprocessing library. Experiments are saved in results/[EXPERIMENT_NAME] where [EXPERIMENT_NAME] is a parameter which specifies the name of the experiment. This can be set with the argument:

--experiment_name [EXPERIMENT_NAME]

or by setting the parameter EXPERIMENT_NAME to [EXPERIMENT_NAME] in either a parameters file or in the params dictionary, where [EXPERIMENT_NAME] is a string which specifies the desired name of the experiment.

NOTE that the EXPERIMENT_NAME parameter must be set when using the experiment manager.

The number of evolutionary runs to be executed can be set with the arguemt:

--runs [INT]

or by setting the parameter RUNS to [INT] in either a parameters file or in the params dictionary, where [INT] is an integer which specifies the number of evolutionary runs to be completed. The experiment manager initialises each evolutionary run with a different unique random seed. The random seeds for a batch of evolutionary experiments are the indexes of the individual experiments (i.e. the first run will have seed 0, the second will have seed 1, and so on up to seed [INT] - 1.)

NOTE that the experiment manager uses pythons multiprocessing library to launch multiple runs simultaneously. As such, it is not possible to use the experiment manager with the MULTICORE parameter set to True. If the MULTICORE parameter is already set to True, it will be turned off automatically.

Since python uses the central algorithm.parameters.params and stats.stats.stats dictionaries to manage various aspects of individual runs, it is not currently possible to launch multiple simultaneous runs of PonyGE2 from within a Python environment as the central dictionaries would be overwritten by the concurrent processes. As such, the experiment manager calls individual PonyGE2 runs from the command line using Python's subprocess.call() function.

NOTE that all functionality available to the main PonyGE file is available to the experiment manager, i.e. all command line arguments can be used including the specification of parameters files.

To run the experiment manager, type:

$ python scripts/experiment_manager.py --experiment_name [EXPERIMENT_NAME] --runs [INT]

where [EXPERIMENT_NAME] is a string which specifies the desired name of the experiment and where [INT] is an integer which specifies the number of evolutionary runs to be completed.

NOTE that since the [MULTICORE] parameter in PonyGE2 does not work with Windows operating systems, at present the experiment manager will not work with Windows operating systems.

Post-run Analysis


A basic statistics parser is included in the scripts folder. This statistics parser can be used to generate summary .csv files and .pdf graphs for all stats generated by all runs saved in an [EXPERIMENT_NAME] folder.

The statistics parser extracts the stats.tsv files from all runs contained in the specified [EXPERIMENT_NAME] folder. For each stat, a unique .csv file is generated containing that statistic across all stats files. Average and standard deviations for each stat are calculated, and graphs displaying the average values (with standard deviations) across all generations are generated. Finally, the statistics parser saves a main full_stats.csv file containing all statistics across all runs in a single file. All .csv summary files can be used with any numerical statistics package, such as R.

While the experiment manager calls the statistics parser after all experiments have been completed, it is possible to call the statistics parser as a standalone program to generate these files for any given [EXPERIMENT_NAME] folder. This can be done from the command line by typing:

$ python scripts/parse_stats.py --experiment_name [EXPERIMENT_NAME]

where [EXPERIMENT_NAME] is a string which specifies the desired name of the experiment contained in the results folder.

GE LR Parser


A powerful script that has been included with PonyGE2 is the deterministic GE LR Parser. This script will parse a given target string using a specified .bnf grammar and will return a PonyGE2 individual that can be used in PonyGE2. Provided the target string can be fully and correctly represented by the specified grammar, the LR parser uncovers a derivation tree which matches the target string by building the overall tree from the terminals used in the solution. A repository of phenotypically correct sub-trees whose outputs match portions of the target string (termed 'snippets') is compiled. Deterministic concatenation operators are employed to build the desired solution. Provided the grammar remains unchanged, these reverse-engineered solutions can be saved and used in an evolutionary setting.

Since the GE LR Parser is fully deterministic, the same GE individual will be returned every time it is executed.

To run the GE LR Parser, only two parameters need to be specified:

--grammar_file [FILE_NAME.bnf]

--reverse_mapping_target [TARGET_STRING]

where [TARGET_STRING] is a string specifying the target string to be parsed by the GE LR Parser.

NOTE that the full file extension for the grammar file (e.g. ".bnf") must be specified, but the full file path for the grammar file (e.g. grammars/example_grammar.bnf) does not need to be specified.

Alternatively, both the GRAMMAR_FILE and REVERSE_MAPPING_TARGET can be specified in either the algorithm.parameters.params dictionary or in a separate parameters file. An example parameters file can be seen in the parameters folder. To run this example, type:

$ python scripts/GE_LR_parse.py --parameters GE_parse.txt

Seeding GE Runs with target solutions


Combining the GE LR Parser with the full PonyGE2 library, it is possible to parse a target string into a GE individual and then to seed an evolutionary run of PonyGE2 with that individual. Provision is made in PonyGE2 to allow for the seeding of as many target individuals as desired into an evolutionary run.

In order to run PonyGE2 with seeded individuals, a specialised script is provided in the scripts folder, titled seed_PonyGE2.py. To run this script from the command line, simply type:

$ python scripts/seed_PonyGE2.py

All command line parameters from standard PonyGE2 are compatible with seed_PonyGE2.

There are two ways to seed individuals into a PonyGE2 run using this script:

1. Seeding runs with a single target solution

If a single target phenotype string is to be included into the initial population, users can specify the argument:

--reverse_mapping_target [TARGET_STRING]  

or set the parameter REVERSE_MAPPING_TARGET to [TARGET_STRING] in either a parameters file or in the params dictionary, where [TARGET_STRING] is a phenoytpe string specifying the target string to be parsed by the GE LR Parser into a GE individual.

NOTE that as with the GE LR Parser described above, a compatible grammar file needs to be specified along with the target string. If the target string cannot be parsed using the specified grammar, an error will occur.

2. Seeding runs with one or more target solutions

Alternatively, if one or more target individuals are to be seeded into a GE population, a folder has been made available for saving populations of desired individuals for seeding. The root directory contains a seeds folder. Any number of desired target individuals for seeding can be saved in separate text files within a unique folder in the scripts directory.

PonyGE2 currently supports four formats for saving and re-loading of such individuals (examples of each are given in the seeds/example_pop folder):

  1. (example_1.txt in seeds/example_pop) PonyGE2 can re-load "best.txt" outputs from previous PonyGE2 runs. These files contain the saved genotypes and phenotypes of the best solution evolved over the course of an evolutionary run. Re-using these output files greatly improves the seeding process, as the genotypes can be quickly used to re-map the exact identical individual evolved by PonyGE2. If possible, this is the preferred option for seeding populations as the use of genomes to re-build previous individuals guarantees the same genetic information will be retained.
  2. (example_2.txt in seeds/example_pop) Target phenotypes can be saved as a simple text file with a single header of "Phenotype:", followed by the phenotype string itself on the following line. The phenotype will then be parsed into a PonyGE2 individual using the GE LR Parser.
  3. (example_3.txt in seeds/example_pop) Target genotypes can be saved as a simple text file with a single header of "Genotype:", followed by the genotype itself on the following line. The genotype will then be mapped into a PonyGE2 individual using the normal GE mapping process. As with option 1 above, this will result in an identical PonyGE2 individual being re-created from the specified genome.
  4. (example_4.txt in seeds/example_pop) Target phenotypes can be saved as a simple text file where the only content of the file is the phenotype string itself (i.e. no descriptive text, headers, comments, etc). The content of these files will then be parsed into PonyGE2 individuals using the GE LR Parser.

NOTE that the names of individual files contained in a specified target population folder in the seeds directory do not matter. These files can be named however so desired.

NOTE that as with the GE LR Parser described above, a compatible grammar file needs to be specified along with the target seeds folder. If the target string cannot be parsed using the specified grammar, an error will occur. If the target genotype results in a different phenotype to that specified, an error will occur.

NOTE that at present, phenotypes spanning multiple lines can only be parsed correctly using file format 4 above, i.e. the phenotype string constitutes the sole information in the file. If a genotype exists for such phenotypes, best practice is to use the genotype to seed the solution using file format 3 above, i.e. discard the phenotype string and allow the genotype to re-produce it.

Initialisation

All initialisation techniques existing in PonyGE2 are compatible for seeding evolutionary runs with target individuals. However, an additional initialisation option is included which may be of some use in the case of Genetic Improvement. An option is available to initialise the entire population with only identical copies of the specified seed individual (or individuals). If only one target seed is specified, the initial population will consist of POPULATION_SIZE copies of that individual. If multiple target seeds are specified, the initial population will consist of equal amounts of copies of each specified seed. This option can be specified with the argument:

--initialisation seed_individuals

or by setting the parameter INITIALISATION to seed_individuals in either a parameters file or in the params dictionary.

Re-creating PonyGE2 runs

It is possible to set the random seeds for the various random number generators (RNGs) used by PonyGE2 in order to exactly re-create any given evolutionary run. All PonyGE2 runs by default save their random seeds. By simply specifying the argument:

--random_seed [RANDOM_SEED]

or by setting the parameter RANDOM_SEED to [RANDOM_SEED] in either a parameters file or in the params dictionary, where [RANDOM_SEED] is an integer which specifies the desired random seed.

At present, the main branch of PonyGE2 only uses two RNGs:

  1. The core Python random module, and
  2. The numpy np.random module.

Both of these RNGs are set using the same seed. When the RANDOM_SEED parameter is set, and provided the grammar, fitness function, and all parameters remain unchanged, then PonyGE2 will produce identical results to any previous run executed using this random seed.

Examples

An example parameters file for seeding runs with a number of individuals has been included in the parameters folder under seed_run_target.txt. An example folder named example_pop with a range of compatible formatting types for seeding target solutions is included in the seeds directory.

References


Michael O'Neill and Conor Ryan, "Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language", Kluwer Academic Publishers, 2003.

Michael O'Neill, Erik Hemberg, Conor Gilligan, Elliott Bartley, and James McDermott, "GEVA: Grammatical Evolution in Java", ACM SIGEVOlution, 2008. http://portal.acm.org/citation.cfm?id=1527066. Get GEVA: http://ncra.ucd.ie/Site/GEVA.html

O'Neill, M., Ryan, C., Keijzer, M. and Cattolico, M., 2003. "Crossover in grammatical evolution." Genetic programming and evolvable machines, 4(1), pp.67-93. DOI: 10.1023/A:1021877127167

Ryan, C. and Azad, R.M.A., 2003. "Sensible initialisation in grammatical evolution." In Proceedings of the Bird of a Feather Workshops, Genetic and Evolutionary Computation Conference (GECCO 2003), pp. 142-145.

Fagan, D., Fenton, M. and O'Neill, M., 2016. "Exploring Position Independent Initialisation in Grammatical Evolution." IEEE Congress on Evolutionary Computation, Vancouver, Canada. IEEE Press.

Nicolau, M. and Fenton, M., 2016. "Managing Repetition in Grammar-based Genetic Programming." ACM GECCO 2016 Proceedings of the Genetic and Evolutionary Computation Conference, Denver, Colorado, USA.

Byrne, J., O'Neill, M. and Brabazon, A., 2009, "Structural and nodal mutation in grammatical evolution." In Proceedings of the 11th Annual conference on Genetic and evolutionary computation (pp. 1881-1882). ACM.

Koza, J.R., 1992. "Genetic programming: on the programming of computers by means of natural selection (Vol. 1)". MIT press.

Lewin B., 2000. "Genes VII". Oxford University Press.

About

Genetic Improvement for Regex Performance (wall-clock time)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 99.8%
  • Shell 0.2%