## Regular Encoding of Binary Trees

A couple of months ago, a question on cstheory.SE asked whether there is a regular encoding of binary trees. My immediate gut feeling was: no way! Binary trees are equivalent to Dyck words and the language of those is not regular1.

I was able to back this feeling with an approach using generating functions. I showed that the number of binary trees with $$n$$ nodes does not have a rational generating function; as all regular languages have rational structure functions, I claimed that this shows that binary trees can not be encoded with a regular language. This argument turned out to be somewhat fishy or incomplete since an encoding does not have to map all trees of size $$n$$ to encodings of size $$f(n)$$; at this point, the counting argument breaks down.

In fact, bijections from binary trees to any infinite regular language exist as both countable. You can even argue that there have to be computable ones as both sets are recursively enumerable. Just enumerate both sets simultaneously one at a time; once you see the input in one set, the current element from the other set is your output. This approach does not guarantee a “nice” encoding, though.

Aryeh came forward and proposed a way to associate any string over some alphabet with a binary tree. The idea is to list the nodes in breadth-first traversal order and mark missing children by $$0$$. This establishes a well-defined2 regular encoding with “nice” properties, i.e. it is concise and efficiently computable in both directions. It is not bijective, though, as every tree has infinitely many representations3.

So, what is going on here? Does the computational complexity of encodings depend more on the properties you want them to have than on the structure of the encoded construct? Is bijectivity a necessary requirement for an encoding? It seems that dropping bijectivity enables you to find much easier representations for certain sets; is this a general phenomenon? Can you characterise the impact of readability on encoding complexity?

1. Easily shown with Pumping Lemma for regular languages.
2. If we map each tree to its shortest encoding.
3. After all branches are closed off with $$0$$, you can add as many additional symbols to the encoding as you wish without changing the encoded tree.