Putting Exodus into Words: The sed Bible Translation Project

So, a while ago, Ben Zvan was talking about doing something with the Bible, which would involve processing the text through some filters and recompiling it. This sort of thing has always interested me: not recompiling the Bible, but rather textual analysis in general, using the basic material stripped of its intended meaning by classifying and ordering it arbitrarily. What, for example, is the vocabulary of the Rosetta Stone, or the Kensington Rune Stone (a probable fake Viking missive on display in west-central Minnesota)? Does the rune stone sample the lexicon of one particular time period or another, or one group of Vikings or another? (I hasten to add that this study has been done, but it was inconclusive.)

[Image: TARDIS. Caption: A timely repost]

The bible is an excellent source of text for this sort of thing, in part because it is free and in part because it is wordy. There are two books of the Old Testament that to me are so similar that I'm pretty sure one is a draft of the other. I'd love to run those two through a rudimentary analysis and see whether they are more or less similar at the word-list level than at the discourse level.
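A word-list comparison of that kind could be roughed out with standard tools. What follows is only a sketch under my own assumptions: the helper names `vocab` and `overlap` are made up for illustration, the punctuation set is the small one used in the script further down, and "similarity" here means nothing fancier than counting shared distinct words.

```shell
#!/bin/bash

# Hypothetical helper: print the distinct lowercase words of a text file,
# one per line, sorted in byte order so that comm can consume the list.
vocab() {
    tr -d '.,:;()?' < "$1" |
    tr '[:upper:]' '[:lower:]' |
    tr -s '[:space:]' '\n' |
    sed '/^$/d' |
    LC_ALL=C sort -u
}

# Hypothetical helper: report how many distinct words two files share,
# and how many distinct words they contain between them.
overlap() {
    local a b shared total
    a=$(mktemp)
    b=$(mktemp)
    vocab "$1" > "$a"
    vocab "$2" > "$b"
    # comm -12 keeps only the lines that appear in both sorted lists.
    shared=$(LC_ALL=C comm -12 "$a" "$b" | wc -l)
    total=$(cat "$a" "$b" | LC_ALL=C sort -u | wc -l)
    rm -f "$a" "$b"
    # The arithmetic expansion strips any padding that wc adds.
    echo "shared=$((shared)) total=$((total))"
}

# Usage (file names are placeholders):
#   overlap book_a.txt book_b.txt
```

Dividing the shared count by the total gives a crude overlap score between 0 and 1. It says nothing about word order, which is exactly the point of the comparison.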

Anyway, I did some of my own fooling around using a script that ran something like this:


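#!/bin/bash
# Print a word-frequency list for the text file named in $1.

sed "s/.......... //" "$1" |      # drop a 10-character field and its trailing space (presumably the verse reference)
tr -d '.,:;()?' |                 # strip the same punctuation marks as before
tr '[:upper:]' '[:lower:]' |      # fold everything to lower case
tr -s '[:space:]' '\n' |          # one word per line, squeezing runs of whitespace
sed '/^$/d' |                     # discard any empty lines
sort |
uniq -c |                         # count each distinct word
sort -n                           # least frequent first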

Which, when a translation of Exodus was fed to it, yielded this file.

Here are my musings (and the key word here is "muse") on this data set:

There are 2,055 distinct "words" used in the King James translation of the book of Exodus. Some of these different words are inflected forms, so 2,000 is an outside estimate of the total number. I have arranged the list of the words in inverse order of their frequency, alphabetized within rank. So, 729 words occur once and appear at the top of the list, and as I look down that list I'm baffled by the meaning of a word right away: abiasaph. It's a personal name, it turns out, but with a meaning (the gatherer). A quick check shows that most of the words I don't know are, supposedly, personal names often of "minor characters" (like Amminadab, for instance). The bible does not seem to have a lot of words that I don't know, which is interesting since it was written thousands of years ago. Well, this is a translation after all.
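For anyone who wants to pull the same sort of numbers out of their own list, here is a rough sketch. It assumes the input is in `uniq -c` format (a count, some whitespace, then the word), and the function names `summarize` and `rank` are my own inventions for the example, not part of the script above.

```shell
#!/bin/bash

# Hypothetical helper: count the distinct words and the words that occur
# exactly once in a frequency list in `uniq -c` format.
summarize() {
    awk '{ n++; if ($1 == 1) once++ }
         END { printf "distinct=%d once=%d\n", n, once }' "$1"
}

# Hypothetical helper: order the list by ascending frequency, and
# alphabetically within each frequency rank.
rank() {
    LC_ALL=C sort -k1,1n -k2,2 "$1"
}

# Usage (file name is a placeholder):
#   summarize exodus_counts.txt
#   rank exodus_counts.txt | head
```

The two sort keys spell out the ordering described here: the count first, numerically, and then the word itself.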

It is interesting that the term "bedchamber" is part of this lexicon. I was under the impression that "bedchambers" were invented in the 18th century, and was unaware that there were dungeons in the Bronze Age as well. Well, this is a translation after all.

As I read down the list, again, ordered by frequency then, within each frequency rank, alphabetically, I see sentences. Since this is from a holy script, I can guess therefore that they have meanings of great import. But I can't fathom them. Here are a few:

"Beware binding birth blast," and "Cinnamon circumcision" are interesting and possibly related. A word about the environment: "Circumspect cities cleanse clear clearness." And this: "Condemn confection congealed!" ... clearly, an invective against flan.

And a comment on pedagogy: "Consecrated consecrations consider content continual." And I wonder if people in those days really "Digged diligently diminished dip," and if that is a reference to low fat food. Is it true that they found "foreigner foreskin forgiving?" and that they "girded glad goats god-ward?"

I'll reserve comment on "witch woman's womb wont worm ..." and I'm not sure what I would do if I was invited to a "Baalzephon backside bake." Is that like a clam bake?

The word "Canaanitish" amuses me. The sentence "Canaanitish candlesticks" makes me LOL.

And, if I have a look at the most used several dozen words, keeping them in order, and deleting and punctuating selectively (using divine intervention, of course) I get this paragraph:

Serve throughout, called hundred place purple! Scarlet shittim. Another blood down. Five servants speak. Thus surely Ephod hast brass pillars! These four neither spake against cubits. Eat, mount, pass. Even sockets brought us are Egyptians! By my tabernacle: you, me, pharaoh on two, one, go!

Who says the bible can't be fun!

I love your analysis! I remember, donkey's years ago, writing a program (in COBOL) to analyse writing style. It worked quite well, producing a histogram which could be used to compare different writers. The only snag was that the data had to be entered using punched cards ("What's a 'punched card', granddad?"), which meant re-typing the script under consideration.