So we were chatting, over coffee, as one does,
and we got to talking about what a nuisance it is when Monte Carlo simulations have qualitatively different output branches, and how to trace the outcomes and the divergent branches that lead to different, or sometimes the same, final states.
The specific example was binary population synthesis, where you can get many qualitatively different intermediate steps depending on contact and timescales. But this should work for anything with qualitative state changes or branching physical processes.
Most simulations handle this by setting flags and building strings, either of letters or of integers, recording what happens. A priori it is hard to know which outcomes will be triggered, especially since later-stage channels are contingent on earlier stages.
It is important to know when things happen and in which order, and different intermediate branches can lead to degenerate outcomes.
So the problem comes when you want to quantify the branches and tabulate intermediate observable states. Once there are more than a few intermediate states, the strings become hard to parse and it is hard to get an overview: building tables of outcomes and observable intermediates is a pain.
So... has anyone used Gödel coding to do this?
Tabulate the possible branches and assign a unique small prime number to each state, transition, or intermediate stage. Then simply build the product integer by multiplying together the prime labels of the branches taken.
Most simulations could be tabulated with 3-digit primes (there are 168 primes smaller than 1,000), and even with tens of intermediate-stage branches per simulated event, the resulting integer is trivially small to carry as an arbitrary-precision integer flag.
Then, to parse the result, you just factor the integer flag for each outcome.
Computationally trivial, since the possible prime factors are known, small, and few, and very easy to parse.
A simple example, for stellar population synthesis: each star starts with "1"; if it is in a binary, multiply by "2", so the even integer flags are the binaries; then multiply in primes for Roche lobe overflow, stellar evolution stages, etc.
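A minimal sketch of the scheme in Python; the state names and prime assignments here are illustrative, not taken from any real population-synthesis code:

```python
# Hypothetical state labels -> small primes (illustrative assignments).
STATE_PRIMES = {
    "binary": 2,
    "roche_lobe_overflow": 3,
    "common_envelope": 5,
    "supernova": 7,
}

def encode(history):
    """Multiply the prime labels of every state the system passed through."""
    flag = 1
    for state in history:
        flag *= STATE_PRIMES[state]
    return flag

def decode(flag):
    """Recover the multiset of states by trial division over the known primes."""
    states = []
    for state, p in STATE_PRIMES.items():
        while flag % p == 0:
            states.append(state)
            flag //= p
    return states

flag = encode(["binary", "roche_lobe_overflow", "common_envelope"])
print(flag)            # 30
print(flag % 2 == 0)   # True: even flags are the binaries
print(decode(flag))
```

Note that the plain product records which states occurred, and how many times, but not their order; encoding order would need the full Gödel trick of raising position-indexed primes to state-valued exponents.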
Has anyone used this, in actual simulation sets?
Seems almost too easy. In fact, I wish I had thought of this for my thesis: I used integer flags for outcomes, but I had to try to figure out all possible outcomes in advance and match sequential integers to diagnostic subroutine output, and I always missed indeterminate outcomes, since the asymptotic final state was not guaranteed to exist.
Gödel coding would have been much easier to build in and to parse.
Since an arbitrary-precision integer is a variable-length string of 8-, 16-, or 32-bit digits, why not use a variable-length list of characters, or integers, or whatever? Why mix the bits of the elements together through multiplication when you just want to extract them at the end?
I suppose this is an indication that you're using a language that doesn't provide convenient data structures (but if so, where are you getting your arbitrary-precision integers?), but surely you have text strings? (If you have more than 256 distinct basic states, there's always unicode.) Using strings would even give you very efficient substring and regular expression searching.
If the problem is that you don't know which "letters" will be needed, just use the code that would have allocated primes on the fly to allocate "letters" on the fly. You will surely not run out of unicode characters in any reasonable program...
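The on-the-fly letter allocation suggested above can be sketched in a few lines (the class and state names here are made up for illustration):

```python
import itertools

class StateAlphabet:
    """Assign a fresh character to each new state the first time it is seen."""
    def __init__(self):
        self._codes = {}
        self._next = itertools.count(ord("A"))

    def letter(self, state):
        if state not in self._codes:
            self._codes[state] = chr(next(self._next))
        return self._codes[state]

alphabet = StateAlphabet()
history = "".join(alphabet.letter(s)
                  for s in ["binary", "RLOF", "common_envelope", "RLOF"])
print(history)   # "ABCB": order and repeats preserved, and grep-able
```

Unlike the prime product, the string keeps the order of events and repeated events for free, and it can be searched with ordinary regular expressions.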
That is exactly what people, including me, do, right now.
Heterogeneous character strings, or other byte assignments, carried as a "flag" array.
In practice, most people I know find these to be a pain: though they are grep-able, and writing tools to pull out the info is straightforward, they are messy, unreadable, and obscure the information.
This is language independent; there is no serious computing language incapable of handling string regexps.
My argument is simply that the Gödel coding is more compact, guaranteed to be unique, efficiently decomposable, and just more elegant.
But I can't think of a case where it is used.
Totally trivial:
(disclaimer: I'm a total layperson, having completed "High School" [actually UK Grammar school] in 1962)
Some fifteen years ago I was running a weakly[sic] quiz for my local pub. To keep an evenly balanced question set, I created a binary tree of facts/questions on my computer. Pretty soon I realised that I couldn't really select for a particular type of question, so I changed to a prime-number indexing system.
By arbitrarily allocating primes to various criteria and multiplying them together, I was able to run a fairly balanced quiz.
eg: "Which UK Prime Minister resigned after the Suez Crisis?"
Difficulty level 2: [5]
Politics: [19]
UK: [47]
Middle East: [941]
1950s: [2269]
Question index: 2269 * 941 * 47 * 19 * 5 = 9,533,350,985
(the [2] flag was for past questions)
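For what it's worth, the quiz index above can be reproduced and decoded in a few lines (sketched here in Python rather than the original C; the tag names are just the commenter's categories):

```python
# Category primes as listed by the commenter (the [2] flag for past
# questions is omitted since it was not part of this index).
TAGS = {5: "Difficulty 2", 19: "Politics", 47: "UK",
        941: "Middle East", 2269: "1950s"}

index = 2269 * 941 * 47 * 19 * 5
print(index)   # 9533350985, as quoted

# Decoding is just divisibility tests against the known primes.
recovered = [name for p, name in TAGS.items() if index % p == 0]
print(recovered)
```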
If I hadn't wanted to practice my "C" programming skills, I'd probably have used a database with sundry flags.
Dunno if that's any comparison to what you mean?
Anne is right. Your scheme is elegant in a clever, mathy sort of way, and has a cool name. But that is basically all it has going for it, except perhaps that it is the only way to support what you are trying to do in a very feature-poor language.
To summarize: you have a variable size datastructure (your arbitrary precision integer); you support an "insert" operation which takes an n-bit number as input and inserts it into the datastructure (by multiplying), and a "list" operation which gives you back all the inserted numbers (by factoring).
Recall that multiplying an n-bit number by an m-bit number gives you roughly n+m bits. Ergo, what you have is nothing more than a list of numbers, just dressed up with a fancy name and the extra expense of doing lots of arbitrary-precision integer arithmetic and factoring. If you really don't have a list datastructure but do have large integers, just do shift-by-n then add, instead of multiplying. Voila, a list.
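The shift-then-add alternative can be sketched as follows (8 bits per code is an arbitrary choice; with it, the big integer is just a byte string in disguise):

```python
BITS = 8  # field width per code; assumes every code fits in 8 bits

def push(packed, code):
    """Append one code to the packed integer (codes assumed nonzero,
    or a leading zero code would be lost on unpacking)."""
    return (packed << BITS) | code

def unpack(packed):
    """Recover the list of codes, most recently pushed last."""
    codes = []
    while packed:
        codes.append(packed & ((1 << BITS) - 1))
        packed >>= BITS
    return list(reversed(codes))

packed = 0
for code in [2, 3, 5]:
    packed = push(packed, code)
print(unpack(packed))   # [2, 3, 5]
```

Shifting costs exactly 8 bits per entry regardless of the value, and unpacking is masking rather than factoring.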
Let's also recall that characters are just integers too, as Anne pointed out.
More to the point, I guess I just don't get it. You are trying to say that 9533350985 is a more user-friendly representation than, say, "Level2,Politics,UK,MiddleEast,1950s"? Or if you need compactness, why not "L2;Po;UK;ME;50;", or even just [2269,941,47,19,5] if you are into integers (note that this is *less* compact than the compact string in many languages, where integers are 4 bytes each and characters are 1 byte each). And finally, if you really like the idea of a single integer, just write it this way: 22690941004700190005.
Hee! I never said it was right or particularly clever. It was just what I did (more as a way of practising C than anything else), but I think it's analogous to what Steinn was proposing. Now I'd definitely use a database, but that was then.
You really don't want to do this, if only for storage size: if you ever have a case where all conditions are set to "true", you have to multiply together all the primes < N, and the result is more or less exp(N), so the flag takes on the order of N bits to store, where a bit string needs only one bit per flag. Paying that kind of overhead for storage is a really bad idea.
The good way to do this is instead to use bit strings and the bitwise AND and OR operators to set, unset, and check individual bits. This has constant storage size, and those operations take essentially zero CPU time compared to integer multiplication, let alone compared to factorization (even with known primes, you have a modular reduction to do for each trial division, which is very expensive!)
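The bit-string approach sketched in Python (the flag names are illustrative; in C the same thing is done with `|`, `&`, and `~` on a fixed-width integer):

```python
# One bit per yes/no property.
BINARY          = 1 << 0
RLOF            = 1 << 1
COMMON_ENVELOPE = 1 << 2
SUPERNOVA       = 1 << 3

flags = 0
flags |= BINARY            # set a bit
flags |= COMMON_ENVELOPE

print(bool(flags & BINARY))      # True: test a bit
print(bool(flags & SUPERNOVA))   # False
flags &= ~COMMON_ENVELOPE        # unset a bit
```

Every operation here is a single machine instruction on a fixed-width integer, and the storage is one bit per property no matter how many are set.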
(Actually, this weird idea of using primes is one of the plot threads of the doorstopper _The Baroque Cycle_ by Neal Stephenson, who attributes the idea to none other than Leibniz...).
But you should keep your idea... for April Fools' Day next week!
Really, the reason Gödel coding exists is because people were essentially trying to do programming in an extremely feature-poor language, namely mathematics (of the time). Modern mathematicians recognize tuples, graphs, strings and weirder objects (including, in some sense, the entire ML language) as perfectly sound mathematical objects, so there's not really any need to code everything into integers. Only if you want to show properties of an absolutely minimal mathematics, and you want that mathematics to contain the integers, do you need Gödel coding.
There is a totally irrelevant but interesting question here, though: how should you store this sort of data so that it uses the minimal number of bits? (Never mind using a string and bzip2ing it, we're looking for a mathy solution.) Multiplying primes sort of doesn't waste any bits. I say sort of because if you only had true/false for each flag, you'd just use binary bits, so presumably each prime can occur multiple times. But the precise distribution of the elements is important: obviously multiplying by 17 costs more bits than multiplying by 2, so you'd want 2 to code the quantity most likely to occur many times. Really arithmetic coding is the way to go (if the patent has finally expired) and it is delightfully mathy to boot (numbers in a different base for every digit!).
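A back-of-the-envelope check on the bit accounting above (a sketch; the list is just the first ten primes, one occurrence of each):

```python
import math

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

# Multiplying by prime p costs about log2(p) bits of the product...
godel_bits = sum(math.log2(p) for p in primes)
# ...versus exactly one bit per flag in a plain bit string.
bitstring_bits = len(primes)

print(round(godel_bits, 1), bitstring_bits)
```

As the comment says, the cheap primes (2, 3) cost close to the one bit a binary flag would, while the later ones cost several bits each, so the most frequent event should get the smallest prime.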
In terms of storage, this really is horribly inefficient. Consider the following: with a 32-bit integer, you could only reliably store numbers with up to 9 prime factors. The first 10 prime numbers multiplied together:
2*3*5*7*11*17*19*23*29 = 6,469,693,230 > 2^32
By contrast, if you simply use each of the 32 available bits in a 32-bit integer as being related to a different property, then you have 32 different flags that can be stored in the same space. Using single-bit flags is also potentially rather nice because encoding and decoding is done through extremely fast binary operations.
Now, the benefit of Gödel encoding would be that instead of just a true-or-false flag, you could record multiple applications of the same flag in the same number. So Gödel encoding might be useful as a means of storing something whose various parameters tend to take on typically small integer values, and where the parameters themselves are rather sparse (only a few properties active at any given time).
However, as a means of merely storing binary yes/no properties, Gödel encoding really would be a poor use of storage.
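The multiplicity point above can be sketched directly: repeated events just raise the exponent of their prime, which a single bit cannot record (the event names and primes here are made up):

```python
PRIMES = {"rlof_episode": 3, "nova_outburst": 7}

flag = 1
for event in ["rlof_episode", "rlof_episode", "nova_outburst"]:
    flag *= PRIMES[event]
# flag = 3 * 3 * 7 = 63

def counts(flag):
    """Recover how many times each event occurred from the prime exponents."""
    out = {}
    for name, p in PRIMES.items():
        n = 0
        while flag % p == 0:
            flag //= p
            n += 1
        out[name] = n
    return out

print(counts(63))   # {'rlof_episode': 2, 'nova_outburst': 1}
```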
Whoops, managed to forget 13 in that list of the first few primes. However, when I computed the multiple, the 13 was included. So the conclusion is correct.
So long as you have a good front end, who cares how you pack the data? (Or are you really planning on grepping to find targets?) I'd have to agree bitwise makes the most sense if storage space matters, e.g. because of huge numbers of target objects; a simple grep plus an inline perl -e will grab the phenomena of interest. If it doesn't, and you need to future-proof your data format (i.e., in popsyn, new physics) while avoiding the need to write and maintain your own data extraction/formatting tools, either (i) use XML or (ii) just dump your db file as sqlite.
I think the bitwise route is the way to go as well. The benefit of character strings is that the programmer can read them rather easily; when it comes to factoring, I can easily recognize even numbers, and it is only slightly harder to tell whether a number is divisible by 3 or 5.
I think the best of both worlds is to have code that takes the bit strings and outputs something more human-readable and grep-able.
@FuturePostdoc: While this is a purely theoretical exercise, I'd say that the ability to use regular expressions to search for patterns is indeed valuable: for example, "find me everything that starts as an O star, undergoes a common-envelope phase, and ends as a millisecond pulsar" makes a single easy regular expression query. Building a query system yourself, unless you also build routines for manipulating finite automata, is just never going to be as flexible and as efficient as existing regular expression libraries.
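The kind of query described above, sketched with Python's `re` module; the letter assignments and histories here are made up for illustration:

```python
import re

# Hypothetical letter codes: O = O star, C = common-envelope phase,
# P = millisecond pulsar; other letters are intermediate stages.
pattern = re.compile(r"^O.*C.*P$")

histories = ["OBCXP", "OBXP", "BCP"]
matches = [h for h in histories if pattern.match(h)]
print(matches)   # ['OBCXP']
```

One compiled pattern scans millions of outcome strings efficiently, which is exactly the flexibility a hand-rolled query system over prime factorizations would have to rebuild.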