I recall a (loud, yelling over crowd noise) conversation at Arbor Brewing Company with @lspector and @thelmuth (and maybe some other folks) after a GPTP session, in which we chatted about genomes. I think this was back in the first few days you were using Plush genomes, because I recall we had a long rambling thread about handling parentheses.
As I move from the new Push interpreter (which is pretty much done, by the way) to the flexible representation-agnostic search infrastructure that motivated my extraction of the interpreter in the first place, I think it would be good to revisit all the many ways we've come up with for representing Push programs as genomes (and why). First, because I'd like to be able to support all or any of them in the search system I'm developing, but also because there seems to be a lot of interest in one or another being "better" for certain tasks, and there's interesting research to be done on that.
So please help me remember what I might have missed, as I start a list of Push (and Nudge, which may as well be like Push) genomes we've actually implemented, with some motivations:
just trees dammit
Almost certainly everybody's at least experimented with Push "genomes" that are simply Push programs stored and manipulated as Lisp-like nested trees. This enables simple comparisons with Koza-style expressions, but has some intellectual and algorithmic baggage when it comes to implementing search operators.
It also suffers somewhat from the biases in "uniform random trees" Jason Daida pointed out in his GPTP work.
As I understand it, this representation is essentially a sequence of tuples, in which branching structure is associated with program tokens (instructions and literals), and in which the Push program is expressed by maintaining a rolling "nesting level" counter. Is that about right?
This enables treatment of the genome as a linear sequence by search operators, while managing the syntactic consistency of parentheses, and also introduces important Push-specific biases into "random code" which other schemes seem to miss.
I don't know how the structural biases unfold in "uniform random code", but would be interested in hearing. But regardless of structural biases, it does seem this one permits construction of random trees of exactly the specified number of program points.
Nudge or "footnote-based" genomes
These tackle the "tree structure" problem using an algorithm inspired by the Messy GA literature, and consist of an abstract (linear) "type framework" and a separate list of arbitrary literals, including "
:code literals". In the process of translating from genome to program, the framework is "filled in" by consuming matching items from the literals, without replacement, in a depth-first order, until no unassigned values remain or the literals are exhausted.
The motivations for this one were twofold: First, a cleaner separation between algorithmic structure and constants when applying search operators; one can more easily "change all the integers" if they're in a big pile in the "footnotes section". But the presence of an arbitrary number of genes for literals permitted untranslated material to be carried along with partially-working programs.
The tricky part here is that one never knows ahead of time how large a Push program will be, given its genome, without actually doing the translation. But linear search operators can be used, though in the form I describe here they need to be applied to the "body" and "footnotes" sections separately.
This is what I've been using recently in my work with Push. The genome is a sequence of tuples, each consisting of a simple Push form and a positioning instruction. As each gene is translated, a notional "write head" in a Clojure
zipper structure is moved, and the indicated Push form is added after the new position. The positioning instructions are a bit more expressive than those built into
clojure.zip off the shelf, and include "wrapping" when the bounds of a tree or subtree are encountered, and also absolute indexing in the Push style—that is, an arbitrary integer is converted into a valid index automatically.
The "simple Push forms" are what you might expect: instructions, inputs, literals, and empty lists. I think I may add a
nil one, as well, just for completeness; that would move the "write head" without adding a new token.
This approach definitely supports linear search operators, and the development of semantic blocks. It also permits the construction of code of an exact size (specified in program points).
(I've seen this in old papers, and can't recall the proper name of the scheme)
This approach uses a linear genome of simple Push forms, plus arbitrary
) tokens. If the genome happens to form a valid tree, with balanced parentheses, the obvious tree results. If not, then deterministic rules are applied to balance the parentheses and form a valid tree. Typically these are something like: Alternately add a
( at the head and
) at the tail until the resulting string is a valid tree.
As you can probably already see, the resulting trees (for almost any deterministic scheme for adding new parentheses) is kind of weird, with more root branches and more nesting on the final branch than you would expect from a "balanced" approach. Alternative schemes with different biases might include deleting unmatched parentheses, or shifting the parentheses already present left and right until they work out. Needless to say, most of these feel like kludges.
Years back I spent some time poking at simple rewriting grammars for Push programs, but made no major headway. The idea is sound, and certainly the Push grammar is simple enough that this would be a reasonably simple approach to explore.
I'm sure I've missed some.
I've intentionally avoided talking about tree-generating algorithms, though I think many of the motivations for the genomes we've explored arise from concerns about that process.