Built on Facts

I’m a bibliophile. I read books at an inordinate rate and have a tendency to buy them at an even faster rate. Here at Texas A&M I’m fortunate to have access to a library of more than four million volumes, a fantastic interlibrary loan service, and a breathtaking special collections library that among other things houses one of the largest and most comprehensive science fiction collections in the nation.

I also very much love the aesthetics of the physical books themselves, and if/when electronic books finally displace the old paper copies it will be a sad day. But it could also be the dawn of the era that’s been predicted almost since computers were invented – the complete contents of the Library of Congress at your instant personal disposal. Two things stand in the way: copyright law and the limitations of technology.

For the moment let’s ignore copyright law. That’s a discussion for another day. Suffice it to say that copyright is fine and good conceptually, but the life of the author plus 70 years is simply an abomination. As a thought experiment we’ll assume that either the law has been reformed or the Library of Congress has been exempted at least for archival purposes if not public access purposes. The question becomes whether it’s possible to put the entire library in digital format.

The LoC’s collections include a bit over thirty million books. I don’t know how many pages there are per book, but why not say 300 as a rough average guess. In total, and rounding up, we can say there are about ten billion pages to deal with. That will require a lot of storage space. We have two options: store the text itself, or store images of the pages. Text is “cheap” in terms of space – a page of text might be a few kilobytes, whereas a photo of that page could conceivably be very much larger depending on the resolution. For most books storing the text would be fine, but there are a few problems. Plain text fails to capture illustrations, equations, photos, and many kinds of special formatting. On the other hand, plain text has some enormous advantages over photos; in particular, it’s possible to search text, while there’s no easy way to search for a particular phrase in a picture. The ideal is both – take the picture, and embed the text in the image itself via PDF or a similar format.
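As a sanity check, the page-count arithmetic fits in a few lines of Python (the thirty-million-book and 300-page figures are the rough guesses above, not official numbers):

```python
# Rough page-count estimate for the Library of Congress book collection.
books = 30_000_000        # a bit over thirty million books
pages_per_book = 300      # rough average guess
total_pages = books * pages_per_book
print(f"{total_pages:,} pages")  # prints 9,000,000,000 pages -- call it ten billion, rounding up
```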

This isn’t actually so hard. The process of turning the image of text into machine-readable text is at this point largely a solved problem. Adobe Reader and other programs can turn almost all normal document images into text with excellent precision. Scan the image of the page and the text comes out almost instantly. And if we can do this, we’re left with two major problems to solve: storing the resulting images/texts and actually getting the images in the first place.

While text generally doesn’t need to be scanned at a high resolution, let’s take something of a worst-case scenario and assume we need a megabyte per page. We could probably get away with a tenth of that on average, but we ought to calculate for the worst just in case. Ten billion pages is thus ten billion megabytes, or ten thousand terabytes. These days a terabyte of storage space can be had on the home user market for about $100. That’s a million bucks for our complete LoC images. We should assume the actual cost will be higher once we add redundancy and the associated hardware and personnel costs. Maybe five million total for the storage, for now not counting any kind of robust server system for disseminating the information to the general public. This sounds doable. The LoC budget is about $600 million yearly, so this capital investment is only about one percent of the yearly budget under my very rough estimate.
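The storage estimate above, spelled out (worst-case 1 MB per page, and the roughly $100-per-terabyte consumer pricing of the day):

```python
# Worst-case storage estimate for ten billion scanned pages.
total_pages = 10_000_000_000
bytes_per_page = 1_000_000        # 1 MB worst case; a tenth of that is more realistic
total_bytes = total_pages * bytes_per_page

terabytes = total_bytes / 1e12    # decimal terabytes
cost_per_tb = 100                 # consumer hard-drive pricing, circa 2009
raw_cost = terabytes * cost_per_tb
print(f"{terabytes:,.0f} TB, about ${raw_cost:,.0f} in raw drives")
# prints 10,000 TB, about $1,000,000 in raw drives
```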

Now of course you actually have to get the books scanned. Don’t tell anyone, but I recently scanned an obscure book of fewer than 100 pages while doing some personal genealogy research. It was a royal pain, and took a while. I’d estimate 4 pages a minute at the outside. At this rate it would take one person close to five thousand years to scan the entire library. Even with a small army of people scanning, the effort would take years. Now if someone could invent an automated book scanner…

Ah. Well that’s taken care of. At 1200+ pages an hour, the entire job comes down to about 950 machine-years. There’s no real rush, so with 100 machines or so we’re talking about a decade’s worth of work. The machines themselves run about $100,000 each, though with bulk pricing the figure would probably be lower. So $10 million worth of machines, plus another couple million per year to pay the library techs who’re feeding the machines, should do the trick. Prioritize by popularity and you’re well on pace to digitizing the most frequently used books in a year or two, with the rest to follow over the next decade, for a total initial investment of maybe the equivalent of 10% of the yearly budget. New materials won’t be a problem either. The library adds some 10,000 items per day, only some of which are books. Those that are don’t even necessarily have to be scanned: publishers could submit standardized pre-scanned copies or just the raw files that went to the printer, saving our scanners a lot of work.
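The throughput numbers check out, assuming (optimistically) round-the-clock operation:

```python
# How long does the scanning itself take, at the two rates discussed?
total_pages = 10_000_000_000

# By hand: roughly 4 pages a minute.
hand_rate = 4 * 60 * 24 * 365            # pages per person-year, scanning nonstop
print(f"{total_pages / hand_rate:,.0f} person-years")     # prints 4,756 person-years

# Automated book scanner: 1200+ pages an hour.
machine_rate = 1200 * 24 * 365           # pages per machine-year, running nonstop
machine_years = total_pages / machine_rate
print(f"{machine_years:,.0f} machine-years")              # prints 951 machine-years
print(f"{machine_years / 100:.1f} years with 100 machines")  # prints 9.5 years with 100 machines
```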

We’ve got more than seven hundred billion dollars’ worth of stimulus floating around. I don’t know how stimulating a library project would be, but considering the comparatively minimal cost to digitize a vast portion of the accumulated written knowledge of mankind, why not? Now if we can just get the copyright term bumped down to something reasonable, I think we’ll be in good shape.

Comments

  1. #1 Uncle Al
    July 31, 2009

    After the whole of documented human thought is digitized and the redundant hardcopy is recycled into feminine hygiene products, don’t lose the electronic index. Oh, wait… Google. Never mind.

    That is why we have Bing. Add a Blue Screen of Death to mentation.

  2. #2 plam
    July 31, 2009

    I guess that Google did these calculations first and that’s why it’s proceeding with book search. It is, however, more likely to go poof! than the Library of Congress.

    Scanning makes a lot of information available to users, but it seems that there is a lot of additional, non-textual data that people extract from physical documents that’s not always available in scans. For instance, one might want to do chemical analyses on the pages or on what pollens are stuck to the book. In the same vein, libraries used to cut the ads out of the magazines they archived. Oops! It turns out that ads are a valuable source of information about the society of their time.

  3. #3 Gray Gaffer
    July 31, 2009

    “Dark Ages” does not mean a time of ignorance and poverty and deprivation. It means a time from which we have little in the way of documentation or records. It was, IIRC, one of the Long Now founders (http://www.longnow.org/) who presented the idea, back in 1996 or so, that we are currently entering what is likely to be a “Dark Ages” for our g^5 grandchildren.

    The premise is:

    The technology of storage and the economics of obsolescence inexorably degrade our digital archives to the point of non-recoverability.

    How many here can still read a 5 1/4″ floppy? 8″? Radio Shack data cassette? I have at least three distinct sets of tape archival media at home which are essentially Write Only Media. All my college work is on one such set. The archives of a device I produced in 1991 are on another such set. I think. I cannot read the tapes to find out. And I want to – I still use the prototypes and I want to make some mods.

    Basically, it never occurred to us at the time that our backups were worthless, since at the time we could still recover stuff from them. Around 1995 I decided that I needed to archive the IDE drive, since that seemed the most stable media and held more than a CD. By 2000 I was archiving the entire machine. It still is not enough. Entropy rules, machine parts die, and the replacement parts vanish into history. So I already suffer from my own personal Dark Ages.

    The only solution to the recovery problem is to transfer all archives from their current media to the next Big Thing. But this introduces its own problem: each step takes us further and further away from being able to retrieve the data in a low tech environment.

    Libraries, with hard copy stacks, are our only protection. And even they are not really long term, nowhere near as long as what little we have found from the last Dark Ages, because paper acidity standards have relapsed and the books decompose so much earlier now.

    So, as much as I applaud the efforts to digitize as much as we can today, I hope the long term preservation and low tech recovery of the originals is also being addressed. I am not as sanguine as I used to be that we will enjoy our technological advantages for the indefinite future. Indeed, 100 years looks optimistic to me now. I’ll be OK – I’m getting on a bit – but the rest of you, well, watch out!

  4. #4 mtc
    August 1, 2009

    Not really buying it, Gray Gaffer. Nowadays you just copy everything over when you upgrade the server farm. And even so, bad as the newspaper industry is, there are still plenty of copies floating around, at least a few of which from any given day will make it to posterity. Also books.

    And speaking of books, seeing as you are a bit of a sci-fi guy, Matt: digitizing books is one of the central plot points in Vernor Vinge’s recent novel “Rainbows End”. Might wanna check it out.

  5. #5 Brian Vargas
    August 1, 2009

    And it’s easy to build a fusion reactor, too! All you need to do is restrict the plasma flow, and you’ve got it!

  6. #6 Paul Murray
    August 2, 2009

    Well, I’m transcribing all my college work into cuneiform on clay tablets. That seems to be the most durable form of hard copy.

    The images and diagrams, of course, shall be painted on cave walls.

  7. #7 Eric Lund
    August 3, 2009

    @mtc: Gray Gaffer is correct that this is a problem. A large fraction of NASA satellite data from the 1960s is in danger of being lost, or already has been lost, because it’s on fragile tapes which, if they can be read at all, can only be read on one or two machines still existing in the world. Yes, we can plan ahead for future migration of our archives. We can’t do that for anything archived on media that are already obsolete–at least, not without huge expenditures of time and manpower to preserve the handful of remaining machines that can read the media. For stuff written in obsolete software (think WordPerfect or ancient versions of Word, Excel, etc.), the problem is compounded because somebody has to maintain the software, too. At least my LaTeX files can be read with a text editor, and I can examine the file and figure out the equations and font changes.

    One potential problem with OCR is what you do about material written in foreign languages. Much of the stuff we want to archive is in languages other than English and uses characters that are not part of the 7-bit ASCII standard; some of that is in non-Latin alphabets (Cyrillic, Greek, Chinese, Japanese, Hebrew, etc.). Even for books written in English, typefaces have evolved over time; e.g., in the 17th and 18th centuries interior S’s looked like F’s without the crossbar. That’s not to mention cases where authors varied font/typeface in order to emphasize something, or for some other reason. I think you have to assume that every page would have to be stored as an image.

  8. #8 alecwh
    August 3, 2009

    You’re going with the worst-case scenario of 1 MB per page? That’s pretty ridiculous. Even taking a high-quality, compressed PHOTOGRAPH of the page would probably come in a little under that. Storing text, along with styling information (margins, text height, font size, etc.), plus any diagrams (as compressed images), would be a fraction of a megabyte.

    Not to mention, if we were actually going to store the entire Library of Congress, I’m pretty certain that they would hire some intelligent people to cut down on space usage as much as possible. That amount of books would necessitate some serious efficiency experts.

    @Eric: You could at least use OCR on all recent (since… the 19th century?) books with common languages, and then move onto storing images for those old works, with obscure font conventions, etc.

  9. #9 Gray Gaffer
    August 3, 2009

    mtc: perhaps you did not read my post closely enough. I am a typical example, not an outlier, most high tech folks are probably in the same pickle. We are already in a Dark Age. Have been for 30 or 40 years.

    The oldest technology I have for which I can still build software dates back only to 1995, and that only because I can still run XP, which still supports Win 3.11 programs. But not for much longer. Soon MS will stop responding to XP validations, and I am bound to hit that when the day comes I have to change out my hard drive. All my older stuff – and my personal list goes back to 1965 or so – is completely unrecoverable. And as for hardware, the designs were pencil on transparency, replicated by Diazotype, all faded beyond recognition if not actually junked as their corporations folded. Or in proprietary file formats for software which turns itself off after 1 year if the license is not renewed. Which can NOT be renewed now, because their manufacturers have folded.

    You seem to think just replicating the server farms is all it will take. The bits may survive, yes, but do you really think the software that makes them useful will too? Or that the actual owners of the bits will choose to, or even be able to choose to, upgrade the software and re-import and export those bits? And to date not a high percentage of the bits that might have historical value are on those servers. They are on floppies in somebody’s basement, on QIC40 tapes, on Data Cassettes, on SyQuest platters, on 9 track from 1965, on cheap CDs that are delaminating quietly to themselves, on archived hard drives for which no IO cards still exist or motherboards with compatible busses (MFM + ISA anyone? I have some…), or whatever. My old NuBus machines no longer boot up. The list just goes on and on. There are even still court cases that rely on obsolete equipment and formats.

    You also seem to assume the electrical power and organizational structures that maintain those server farms will exist for the indefinite future – by which I mean thousands of years, not mere decades. Trust me, they won’t. Today Google and S3 are probably the most stable, but would you really trust them to hold all your financial and legal and medical documents with no hard copy backup in your safe deposit box? Really? Good luck to you then. And that just covers the next thirty years or so. If we are lucky.

    And as for planning ahead for future migration: the storage is growing faster than the people or processing power needed to do that replication, and we still do not have a common forever-more data format for its representation, so the issue of the lifetime of compatible software is also pertinent. I can no longer render my Corel Draw 1.0 files from 1989. I have the files, but no reader or converter. Or format description were I to be able to find the time to write my own. Or SCSI system to read the associated stuff on the disks I archived. In Mac OS7 disk format. Or computer that can run the software. Now yes, those might be a little closer than gone, but not without some significant time and effort and searching for working compatible hardware. By 2020? No way. Like I said, my NuBus machines are already scrap. My MS college papers? Atari 800 floppies. And some of that work has become relevant recently to my job.

    Replicating server disks may be OK for that data only needed for Statute of Limitations needs – 7 years or so – until the last farms die, anyway. But no way are they suitable for historical archiving.

    I think I’ve made as much of my point as I can without too much redundancy.

  10. #10 Timothy Underwood
    August 5, 2009

    Gray, you are assuming that historians of the future will be stupid. They won’t be. There are two cases: the death of technology, in which case I honestly don’t care very much (if humankind survives they’ll still know more than we know about Rome), or a situation where a relatively high standard of living survives, and it is an issue of fairly wealthy and well-trained historians trying to find information. If the bits still exist, they will figure out a way to reconstruct the software and regain access to any information they are interested in. They will probably have some sort of extremely automated, powerful software which can hack anything from this time period easily, and hence will be able to analyze any data whose bits they can get onto their own storage devices.

    The real problem for future historians will be cases where the bits simply no longer exist, i.e. cases where the physical storage media degrade to the point where the data is inaccessible. Although if there are historians who really care about it, it may take longer to reach that point than you think.

    As for data that can only be read by a few machines in the world: if the storage media still exist in 2100, and somebody is willing to pay the equivalent of a couple of million dollars to gain access to the data, he probably will have no problem.

  11. #11 Sharron Bortz
    August 6, 2009

    The problem is that computers keep updating, and material recorded on earlier computers no longer plays. Some will respond that it can be migrated forward, but sometimes a change is so slight that migration doesn’t seem necessary, and then another change creates the problem.

    As a librarian, I worked for a time at Duke University Library. They have papyrus that is still accessible. They also had storerooms of floppy disks and other materials that they no longer have the computers to access. Within one archival collection there was a 45 RPM record that had once been distributed to gas station customers. There was no player anywhere on campus to play the thing.

    Personally, I have all the beautiful poetry and stories my daughters wrote in school (and out) recorded on floppy disk and can no longer access them. I created a searchable bibliography of almost 900 children’s literature books on a Mac Classic – what good is all that work now?

    What happens if the Library of Congress does digitize everything and then a saboteur destroys the computers, or nuclear war sets humanity back? Would all the literature of America be lost forever? With books, a few might survive or intentionally be hidden to preserve them. As a reader of science fiction, I’m sure you have read such a scenario.

    An archivist at University of North Carolina told me that because people email instead of writing letters, we have already created a terrible gap for future researchers. Letters are a major primary source for research, and who saves their emails?

  12. #12 Dave Burton
    August 6, 2009

    Re: reading old data

    I can’t read punched cards, but I can still read 1″ 8-channel paper tape. I can also still read 5.25″ diskettes.

    My hard-sectored 8″ diskette drive has failed, but I still have a working diskette drive that can read single-sided, soft-sectored 8″ diskettes.

    Unfortunately, I’ve found that most of my 8″ diskettes have degraded to the point of unreadability. Specifically, the adhesive which sticks the oxide to the substrate has failed on the Wabash and BASF 8″ diskettes. The oxide flakes off when I try to read these old diskettes on my old 8″ diskette drive.

    The 3M 8″ diskettes don’t have that problem, but I only have a few 3M 8″ diskettes, and the data which I most badly want to recover is on BASF diskettes.

    That problem has rendered my antique home-built 2MHz Z80 UCSD P-System with Friden Flexowriter I/O unusable.

    If anyone knows how to stabilize the oxide layer on an old diskette, or if anyone has a specially-gentle modified 8″ diskette drive, which could read the data from my old BASF and Wabash 8″ diskettes, I’d be most grateful to hear from you! http://www.burtonsys.com/email/

    Re: cost of storage for the LOC’s book collection

    Three years ago I digitized my grandfather’s PhD dissertation, “A study of S Sagittae.” (He wrote it in 1923, but I digitized an updated version that was published in 1932.) Here it is:
    http://www.burtonsys.com/johnaldrich/

    At 150 dpi (which seems to be plenty), the jpeg scans of each page averaged only about 240 KB, not the 1 MB which Mr. Springer used in his calculations. If 240 KB is typical, then that brings the raw storage cost of the LOC’s entire book collection, on consumer-grade hard disk drives, down to a mere $240,000.
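    That revised figure works out like so, using the article’s ten-billion-page count and $100-per-terabyte pricing:

    ```python
    # Revised storage estimate: 240 KB per page instead of 1 MB.
    total_pages = 10_000_000_000
    bytes_per_page = 240_000      # 150 dpi JPEG scans averaged ~240 KB each
    terabytes = total_pages * bytes_per_page / 1e12
    print(f"{terabytes:,.0f} TB -> ${terabytes * 100:,.0f} in consumer drives")
    # prints 2,400 TB -> $240,000 in consumer drives
    ```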

    Adding the cost of redundancy, server farms, etc., still probably adds up to a total expense only a couple of million dollars.

    Dave Burton
    Cary, NC

  13. #13 Dave Burton
    August 6, 2009

    I did notice one small error in this article. It says, “Adobe Reader and other programs can turn almost all normal document images into text with excellent precision.” Adobe Reader doesn’t do that. It just displays whatever is in the .pdf files.

    The .pdf files can contain text and/or images, and if Adobe Reader is showing selectable/copyable text it is because another tool (called OCR software) has already deciphered the printing in the image and converted it to text.

    Dave Burton
    Cary, NC