20 October 2008

mining (con)texts

(Just another cross-post; this time from HASTAC. Also, I realize that this is my second pessimistic digital humanities post in one week. As a former Governor once said to me, "You're just a contrarian for the sake of being a contrarian." Guilty as charged. I promise tomorrow I'll unpack my pom-poms and go back to cheering that old MIT cheer. Goooooo Technology!)

John Unsworth gave a talk at Harvard tonight teasingly titled "How Not to Read a Million Books: Text Mining, and Reading the Unreadable." He spoke mostly about the MONK project, a Mellon-funded collaboration which applies text mining techniques and visualizations to discover new dimensions to literary and historical texts.

Unsworth described the work of several scholars already using the MONK toolkit. For instance, Tanya Clement, a PhD candidate in English Literature and Digital Studies, has successfully applied MONK to her research on Gertrude Stein's The Making of Americans, as she described in a recent article for Literary and Linguistic Computing:
The particular reading difficulties engendered by the complicated patterns of repetition in The Making of Americans mirror those a reader might face attempting to read a large collection of like texts at once without getting lost—likewise, it is almost impossible to read this text in a traditional, linear manner. However, by visualizing certain patterns and looking at the text ‘from a distance’ through textual analytics and visualizations, we are enabled to make readings that were formerly inhibited. Franco Moretti has argued that the solution to truly incorporating a more global perspective in our critical literary practices is not to read more of the vast amounts of literature available to us, but to read it differently by employing 'distant reading'. 'We know how to read texts', he writes, ‘now let's learn how not to read them' (Moretti, 2000, p. 57). Similarly, by learning to read texts that have been misread 'at a distance', we are reading differently and we value different readings.
Sara Steger, a PhD candidate in English at the University of Georgia, is similarly using MONK in her study of sentimentalism in nineteenth-century novels. Not only could she train the program to recognize sentimental scenes, she was then able to mine a collection of texts for over-represented words in, for instance, Victorian deathbed scenes:

And, then, under-represented words in those same scenes:

Her results invite new research into the absence of formal expressions of mourning ("holy," "country," "lord"), and the presence of physical and emotional closeness ("pillow," "cheek," "breath").
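(For the curious, here is a rough sketch of what "over-represented" and "under-represented" can mean in practice. This is not MONK's actual method, which I haven't seen under the hood; it's just a smoothed log-ratio of word frequencies between a small target set and a reference set. The function names and the toy sentences are my own inventions for illustration, not Steger's data.)

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def representation_scores(target_texts, reference_texts, smoothing=1.0):
    """Return {word: log-ratio} comparing target and reference frequencies.

    Positive scores mean a word is relatively more frequent in the target
    texts (over-represented); negative scores mean it is relatively less
    frequent there (under-represented). Add-one style smoothing keeps words
    missing from one side from producing a division by zero.
    """
    target = Counter(tok for t in target_texts for tok in tokenize(t))
    reference = Counter(tok for t in reference_texts for tok in tokenize(t))
    vocab = set(target) | set(reference)

    target_total = sum(target.values()) + smoothing * len(vocab)
    reference_total = sum(reference.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p_target = (target[word] + smoothing) / target_total
        p_reference = (reference[word] + smoothing) / reference_total
        scores[word] = math.log(p_target / p_reference)
    return scores


# Invented toy sentences, standing in for "deathbed scenes" and "everything else".
toy_target = ["She laid her cheek upon the pillow and drew a faint breath."]
toy_reference = [
    "The lord of the country kept a holy day.",
    "The country road ran past the church.",
]

scores = representation_scores(toy_target, toy_reference)

# Rank words from most over-represented to most under-represented.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for word, score in ranked:
    print(f"{word:10s} {score:+.2f}")
```

Note what goes into the function: lists of plain strings, and nothing else.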

I want to underscore that I think these tools do offer incredible, never-before-possible ways of looking at texts. But: I wonder about how slippery the word "text" becomes in the phrase "text mining." MONK and similar projects focus narrowly on "text" as a string of letters that can be plucked from any material context, plopped into another and manipulated, "mined," for meaning. Let's assume for a second that the OCR software always works perfectly (it doesn't), and that the scans of our target book have picked up all the paratexts, including the copyright page, advertisements, promotional blurbs, even page numbers. Then take that nice, neat group of letters and drop it into a text file. What are you left with? What so-called "accidentals," what context, has been lost in translation?

I'm reminded of Kenneth Goldsmith's book Day, in which he re-typed one day's New York Times word for word, from the upper left hand corner to the lower right hand corner, including page numbers and any text in advertisements. The resulting book -- a thick tome that essentially levels the dynamic space of the newspaper -- might have been the newspaper . . . but definitely was not the newspaper.

These questions become particularly relevant for Victorian novels, many of them stuffed with advertising and illustrations, or published serially in magazines alongside political cartoons or recipes. The Wordles created from deathbed scenes are fascinating and very exciting to me; but unless they're paired with some old-school bibliographic analysis, I worry that more has been elided from the text than it's worth. I also wonder (given my own interests) how text mining would work for early modern books, many of which may ascribe meaning and significance to "accidentals" like italics, capitalization and typographic variation. Unsworth acknowledged that text mining should only be one tool in the researcher's toolkit. What, then, would a combination of MONK-like text mining and bibliography look like? How can we apply "distant reading" to texts-as-strings-of-letters, while simultaneously doing a "close reading" of texts-as-material-objects?


Anonymous said...

Wow. Way to write a blog that makes me go, "Holy crap! Someone has one of them weblog thingies on the intertube that makes me truly want to read it."

Yay for young scholars of the history of the book; yay for optimistic pragmatism about what digital technologies can('t) do; yay for lucid prose; yay for Stephen Greenblatt making penis jokes. Just yay.

And thanks.

Whitney said...

Thanks so much! You've got a pretty swank blog yourself -- I'm surprised I hadn't found it before, but am quite happy to have it in my Reader now.