I’m a bibliophile. I read books at an inordinate rate and have a tendency to buy them at an even faster one. Here at Texas A&M I’m fortunate to have access to a library of more than four million volumes, a fantastic interlibrary loan service, and a breathtaking special collections library that, among other things, houses one of the largest and most comprehensive science fiction collections in the nation.
I also very much love the aesthetics of physical books themselves, and if and when electronic books finally displace the old paper copies it will be a sad day. But it could also be the dawn of an era that’s been predicted almost since computers were invented: the complete contents of the Library of Congress at your instant personal disposal. Two things stand in the way: copyright law and the limitations of technology.
For the moment let’s ignore copyright law; that’s a discussion for another day. Suffice it to say that copyright is fine and good conceptually, but the life of the author plus 70 years is simply an abomination. As a thought experiment we’ll assume that either the law has been reformed or the Library of Congress has been exempted, at least for archival purposes if not for public access. The question becomes whether it’s possible to put the entire library in digital format.
The LoC’s collections include a bit over thirty million books. I don’t know how many pages there are per book, but 300 seems a reasonable rough average guess. Rounding up, that gives about ten billion pages to deal with, and storing them will require a lot of space. We have two options. We can store plain text, which is “cheap” in terms of space: a page of text might be a few kilobytes. Or we can store a photo of each page, which could conceivably be very much larger depending on the resolution. For most books plain text would be fine, but it fails to capture illustrations, equations, photos, and many kinds of special formatting. On the other hand, text has an enormous advantage over photos: it’s searchable, and there’s no easy way to search for a particular phrase in a picture. The ideal is both: take the picture, and embed the recognized text in the image itself via PDF or a similar format.
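If you want to check my math, here it is as a quick Python sketch. The thirty-million-book count, 300-page average, and per-page sizes are my rough guesses from above, not official LoC figures:

```python
# Back-of-envelope: total page count, and plain text vs. page images.
# All inputs are rough guesses, not official Library of Congress figures.
books = 30_000_000                       # a bit over thirty million volumes
pages_per_book = 300                     # rough average guess
total_pages = books * pages_per_book     # 9 billion; call it ten billion

kb_per_text_page = 3                     # a few kilobytes of plain text
mb_per_image_page = 1                    # a scanned image of the same page

text_total_tb = total_pages * kb_per_text_page / 1e9    # KB -> TB
image_total_tb = total_pages * mb_per_image_page / 1e6  # MB -> TB

print(f"{total_pages:,} pages")              # 9,000,000,000 pages
print(f"text:   ~{text_total_tb:,.0f} TB")   # ~27 TB
print(f"images: ~{image_total_tb:,.0f} TB")  # ~9,000 TB
```

The gap between a few dozen terabytes of text and several thousand terabytes of images is exactly why plain text is tempting, and why the hybrid image-plus-text format is the compromise worth paying for.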
This isn’t actually so hard. Turning an image of text into machine-readable text, optical character recognition (OCR), is at this point largely a solved problem. Adobe Acrobat and other programs can convert almost all normal document images into text with excellent accuracy: scan the page, and the text comes out almost instantly. With that settled, we’re left with two major problems to solve: storing the resulting images and text, and actually getting the images in the first place.
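For the curious, here’s roughly what that looks like in practice, as a minimal sketch using the open-source Tesseract OCR engine through its pytesseract Python wrapper (one tool among many; the file names are made up for illustration):

```python
# Minimal OCR sketch using the open-source Tesseract engine.
# Requires: pip install pillow pytesseract, plus Tesseract itself installed.
# The file names here are hypothetical.
from PIL import Image
import pytesseract

page = Image.open("page_scan.png")        # a scanned page image
text = pytesseract.image_to_string(page)  # extract machine-readable text
print(text[:200])

# The "ideal" format described above: the page image with an invisible,
# searchable text layer embedded in a PDF.
pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, extension="pdf")
with open("page_scan.pdf", "wb") as f:
    f.write(pdf_bytes)
```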
While text generally doesn’t need to be scanned at high resolution, let’s take something of a worst-case scenario and assume we need a megabyte per page. We could probably get away with a tenth of that on average, but we ought to calculate for the worst just in case. Ten billion pages is thus ten billion megabytes, or ten thousand terabytes (ten petabytes). These days a terabyte of storage can be had on the home user market for about $100, which puts our complete set of LoC images at about a million bucks. The actual cost will be higher once we add redundancy and the associated hardware and personnel, so call it five million total for storage, for now not counting any kind of robust server system for disseminating the information to the general public. This sounds doable: the LoC’s budget is about $600 million a year, so under my very rough estimate this capital investment is less than one percent of a single year’s budget.
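The storage arithmetic, spelled out (the fivefold multiplier for redundancy, hardware, and personnel is my own guess):

```python
# Storage cost at the worst-case 1 MB/page assumption.
total_pages = 10_000_000_000
total_tb = total_pages * 1 / 1e6        # 1 MB/page -> 10,000 TB (10 PB)

dollars_per_tb = 100                    # rough consumer pricing
raw_cost = total_tb * dollars_per_tb    # $1,000,000 of bare drives
overhead = 5                            # redundancy, hardware, staff (guess)
yearly_budget = 600_000_000             # approximate LoC budget

print(f"raw drives: ${raw_cost:,.0f}")
print(f"with overhead: ${overhead * raw_cost:,.0f}")
print(f"share of one year's budget: {overhead * raw_cost / yearly_budget:.1%}")  # ~0.8%
```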
Now of course you actually have to get the books scanned. Don’t tell anyone, but I recently scanned an obscure book of fewer than 100 pages while doing some personal genealogy research. It was a royal pain and took a while; I’d estimate four pages a minute at the outside. At that rate, one person scanning around the clock would need close to five thousand years to get through the entire library. Even with a small army of people scanning, the effort would take years. Now if someone could invent an automated book scanner…
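The math behind that five-thousand-year figure, assuming one person scanning nonstop:

```python
# Hand-scanning the whole library at 4 pages per minute, around the clock.
total_pages = 10_000_000_000
pages_per_minute = 4
minutes = total_pages / pages_per_minute
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years")   # ~4,756 years
```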
Ah. Well that’s taken care of. At 1,200+ pages an hour, the entire job comes down to about 950 machine-years. There’s no real rush, so with 100 machines or so we’re talking about a decade of work. The machines themselves run about $100,000 each, though bulk pricing would probably bring that down: call it $10 million worth of machines, plus another couple million per year to pay the library techs who feed them. Prioritize by popularity and you’re on pace to digitize the most frequently used books within a year or two, with the rest following over the next decade, for a total initial investment equivalent to maybe 10% of one year’s budget. New materials won’t be a problem either. The library adds some 10,000 items per day, only some of which are books, and those don’t necessarily have to be scanned at all: publishers could submit standardized pre-scanned copies, or just the raw files that went to the printer, saving our scanners a lot of work.
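And the machine math:

```python
# Automated scanners at 1,200 pages per hour, running around the clock.
total_pages = 10_000_000_000
pages_per_hour = 1_200
machine_years = total_pages / pages_per_hour / (24 * 365)  # ~951
machines = 100
print(f"{machine_years:,.0f} machine-years")
print(f"~{machine_years / machines:.1f} years with {machines} machines")  # ~9.5
```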
We’ve got more than seven hundred billion dollars’ worth of stimulus floating around. I don’t know how stimulating a library project would be, but considering the comparatively minimal cost of digitizing a vast portion of the accumulated written knowledge of mankind, why not? Now if we can just get copyright terms bumped down to something reasonable, I think we’ll be in good shape.