03 February 2010

Is Google Good for History? revisited: a case study in Pepys

Doing a million little tasks at once these days -- a kind of academic death by a thousand cuts that we're all familiar with -- one of which is securing image permissions for an essay I co-wrote with Martin Foys on the re-, dis- and unmediations that frame, shape and even softly determine our readings of two literary classics, Beowulf (Martin's portion of the essay) and Samuel Pepys' Diary (of course, mine).

Part of my argument rests on a re-reading of the now legendary story of John Smith, a poor beleaguered student who (the tale goes) spent years decoding the Diary's shorthand, never knowing that the "crack" to the "code" was sitting on the library's shelf all along, "just steps away from where he worked." The scare quotes do intend to scare: as I found in my research, a number of buried references to the Diary crop up throughout the eighteenth century, indicating the work was not entirely unknown until Smith's transcription. In fact, a now all-but-forgotten biography of William Weller Pepys, A Later Pepys by Alice Gaussen (1904), includes facsimile scans of a plan to transcribe and publish the Diary that presumably pre-dates Smith's work, thereby (I argue) exploding the typical origin story. The need for textual provenance is retroactive, applied only after a seventeenth-century manuscript is circumscribed by print scholarship -- "edition-ized" for academic consumption, as it were.

How did I find these traces of an eighteenth-century Pepys which have puzzled scholars for a century -- including the two editors who devoted a large chunk of their lives to studying this text? It has nothing to do with intelligence, and I (sadly) have no tales from the archival crypts. You can chalk it all up to Google Books.

In the introduction to their edition of the Diary (the only unbowdlerized edition ever published, and therefore, for the purposes of contemporary scholarship, the only edition), Latham and Matthews note finding a single reference to Pepys' Diary pre-transcription: a puzzling fact, they say, since (the narrative goes) the Diary mouldered on the stacks for 150 years before finally being "discovered." How could this individual have had 1) access to the Diary to quote it, and 2) knowledge of Pepys' shorthand to read the text?

I googled the quote and unearthed a few more references in early nineteenth-century periodicals, indicating that a particular entry on "tea" was somewhat known among the late-Enlightenment literati. Tracking the beast as far as Google Books would let me, I finally stumbled over the biography mentioned above, I think through the word "transcription," and found the two facsimiles of a pre-Smith plan to transcribe the Diary.*

This research -- an exciting alternative history of a canonical story -- would not have been possible without Google Books or a comparable search engine and database of OCRed texts. So is Google good for history? Uh, hell yeah. That should be a given by now, folks.

Here's where the story gets sticky, though. Thinking my work was done, I finished up the essay without ever consulting the physical book (don't judge me, we all do it), even took screenshots of the facsimiles from the biography, now out of print, and dropped them in as figures for the essay. The time for permissions rolls around, and we realize the scans are too low resolution for publication. So I order that the dusty 1904 tome be dragged up from Duke's storage facilities; open it up to scan the figures myself; and find this:

What I thought were scratches from the scanner, or -- honestly, I don't know what I thought they were; my intuitive curiosity as a literary historian and digital humanist failed me -- turned out to be full pages. The dunce who scanned the text for Google Books didn't bother to unfold the paper; and, since Google Books doesn't have any mechanism for indicating moving parts and fold-outs on its flattened scans, whatever was tucked between the folds was lost to the database.

I've talked about interactivity in the digital archive here before; this incident brought the issue home for me. Like all media, tools like Google Books inevitably (re-)frame our research, opening exciting new possibilities; but in doing so, they foreclose other potentials. Beyond the dampening effect on research into the codex as a form, the digital archive's absences produce an image of "print culture" that slides frustratingly toward the very reductive models that many book historians have challenged in recent years. We need to start thinking seriously about what aspects of the book are elided by the screen; how a text's materiality is mediated by scans; and how the structure of databases keeps us from documenting these bookish anomalies.

Databases are themselves media structures, and our readings of the historical artifacts we encounter in and through them have to take this into account. Perhaps more importantly (and I'm saying this to myself, as much as anyone else), we need to learn to be better skeptics of our own resources, finding new methods for verifying our research when using digital scans. While ultimately this incident didn't put a dent in my argument, it will make me think twice next time I see an odd little scratch on Google Books.

*If you want the whole argument, you're going to have to read the book when it comes out this summer.


Amanda French said...

Great story, Whitney! Thanks for writing it up. I wrote something similar for MLA 2008 (though more on the positive side) recounting what Google Book Search made possible in my own literary-historical research: it's at http://amandafrench.net/presentations/ if you're interested.

Whitney said...

Wonderful, thanks for sharing, Amanda. I'm interested in this kind of documentation, if only as a way of sharing war stories, figuring out the best way to deal with the potentials *and* problems of these new tools.

Anonymous said...

This is a great post; and I look forward to reading your article, too. I have posted a few musings about the drawbacks of the way the screen mediates our reading of sources recently, in this case the impact that the Early English Books Online website has.


Palimpsest said...

Great post! I long for the day when we can actually *handle* documents via some virtual reality gadget, or remote control tactile device.

Anonymous said...

Whitney, this is an excellent post. I have been thinking about many of these issues, and am preparing a grant proposal with a research group to try and deal with some of them. We need more openness about our experiences, especially when things go wrong (even if it doesn't change our outcomes). And re: Palimpsest, every virtual reproduction is itself an act of interpretation and translation. Something is always left out. Furthermore, if everyone is using the "same" virtual copy of an original, there may in fact be many different versions related to that original (as with printed books)--this is a big drawback with EEBO, ECCO, and Google Books if everyone is looking at the same digitized copy and not getting a sense of the range of variants. Digital "sameness" is another issue. (Ad infinitum.)

Whitney said...

John, your point about sameness is very well taken, and one we tend to forget. The "one document" model negates much of the interesting work being done on the materiality of the book. The Shakespeare Quartos project (quartos.org) is perhaps a counterpoint, in that it attempts to bring together multiple copies of a single text and provides mechanisms (such as layering and an opacity filter) for finding variants.

In the end, though, libraries, Google Books, etc., just aren't going to shell out money to scan multiple copies of the same edition of a text (the Shakespeare exception aside). I'd love to hear more about the grant you're working on.