04 February 2014

Layouts, Patterns, Networks

The next phase in building my prototype digital edition of a Little Gidding Harmony has led me into document layout analysis and digitization processes. As I move from research to writing, I'd like to use this post to untangle a few preliminary ideas – not really findings so much as speculations.

If you've ever used Adobe Acrobat to process historical documents, you're probably familiar with some form of layout analysis. It's the task of identifying different regions of interest on a scanned image of a document (text, image, graphic), then further labeling their different roles (caption, page number, title), usually in an XML file that's linked to the page image. To give you an example, I uploaded this page from a 1635 edition of the Sternhold and Hopkins psalter to the SCRIBO module, a neat little online tool for layout analysis. Here's the image it spit out:

This book has a somewhat complicated layout by modern standards, and the microfilm scan from EEBO is not great; still, SCRIBO did a moderate job of identifying text blocks and other elements. It does better with more standard layouts, like these facing dedicatory epistles in Robert Greene's Penelopes web (1587).

Extracting this kind of textual description from an image is useful for all sorts of purposes. It helps automate the process of transforming scanned books into clean digital text by removing paratexts that might gunk up the flow of the document, things like running heads or page numbers. It can also facilitate identifying all images within a magazine, or all article titles within a newspaper. Because layout analysis links description to coordinates on the digital image, it provides a way of mediating between the photographic facsimile and the extracted text when working with digitized books and manuscripts.
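To make that mediation a little more concrete, here is a minimal sketch in Python. It assumes a made-up, simplified XML description of one page; the element names, attributes, role labels, and file name are all hypothetical and are not the output format of SCRIBO or any other tool. It reads the regions, drops paratexts such as running heads and page numbers, and keeps each remaining text block tied to its coordinates on the page image.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout description of one page.
# Not the schema of SCRIBO or any real layout-analysis tool.
SAMPLE = """
<page image="example_page.tif" width="1200" height="1800">
  <region id="r1" role="running-head" x="100" y="40" w="1000" h="60">running head</region>
  <region id="r2" role="paragraph" x="100" y="140" w="1000" h="700">first text block</region>
  <region id="r3" role="image" x="100" y="860" w="400" h="300"/>
  <region id="r4" role="paragraph" x="100" y="1180" w="1000" h="500">second text block</region>
  <region id="r5" role="page-number" x="1050" y="1720" w="50" h="40">12</region>
</page>
"""

# Roles treated as paratext in this sketch (an assumption, not a standard list).
PARATEXT_ROLES = {"running-head", "page-number"}


def read_regions(xml_text):
    """Parse every <region> into a dict with its role, bounding box, and text."""
    page = ET.fromstring(xml_text)
    regions = []
    for el in page.iter("region"):
        regions.append({
            "id": el.get("id"),
            "role": el.get("role"),
            "box": tuple(int(el.get(k)) for k in ("x", "y", "w", "h")),
            "text": (el.text or "").strip(),
        })
    return regions


def body_text(regions):
    """Return (text, box) pairs for non-paratext regions, top to bottom."""
    kept = [r for r in regions if r["role"] not in PARATEXT_ROLES and r["text"]]
    kept.sort(key=lambda r: (r["box"][1], r["box"][0]))
    return [(r["text"], r["box"]) for r in kept]


def regions_with_role(regions, role):
    """Collect every region carrying a given role, e.g. all images."""
    return [r for r in regions if r["role"] == role]


if __name__ == "__main__":
    regions = read_regions(SAMPLE)
    for text, box in body_text(regions):
        print(box, text)
    print(regions_with_role(regions, "image"))
```

Because each block keeps its bounding box, the extracted text can always be traced back to a place on the facsimile; sorting by vertical position is only a crude stand-in for reading order, and would fail on a multi-column or cut-and-paste page.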

One of the things that fascinates me about this is its potential to bridge different kinds of book historical research. I tend to take a "material texts" approach to my digital work, grounded in, for instance, Randall McLeod's playful investigations into the deep materialities of printed books, or Johanna Drucker's attention to design, or more distantly D. F. McKenzie's sociology of texts. But there's also of course a rich tradition of mining, abstracting, and visualizing large corpora of digitized texts, a tradition that might be traced back to the early quantitative book historical work of the Annales school. By drawing attention to the importance of the material text within a macroscale approach, pattern analysis has the potential to bring these two divergent branches – material book and immaterial texts – back together. Here, projects like VisualPage are leading the way by analyzing layout, form, and the use of space in poems printed in the late nineteenth century.

Yet I'm also interested in the limits of these tools, in finding their breaking points; for the point at which something stops working tells us much about the assumptions or aims of its design, as well as the structures (material, ideological) that circumscribe its possible uses. This task is particularly enlightening with tools that digitize books, because of course, as hardware, printed books and manuscripts are very different types of things from digital texts. Asking a machine – especially one that we, they, someone has designed – to "analyze" print's layout helps pinpoint the gap between our (modern) expectations of a book and its (historically specific) material reality.

Here, as in so many things, the Little Gidding Harmonies offer a perfect test case. Here's a page from the King's Harmony (1635), as interpreted by SCRIBO:

Not so great, but honestly, not so terrible either, given this is a web-based tool with a 20MB upload limit, and this page is a complicated mash-up of a variety of printed texts and images. (The archive-quality TIFF files I have for another, simpler Harmony are each around 42MB.) What interests me here is that large text blocks are being identified as images, outlined in orange – possibly because of discoloration, or because the pasting of the cut-up bits and pieces is ever-so-slightly uneven, and we would expect printed lines of text to be perfectly straight. This happens repeatedly when I upload other pages:

So, what does this matter? Well, there are a few obvious points to be made. Even as digitization – defined simply as taking a photograph of a book and disseminating it over digital networks – increases access to rare materials, here we see how the mechanisms that make digitized books legible (searchable, manipulable, visualizable) to researchers continue to reproduce both print biases and modern attitudes toward what a "book" or a "text" is. Though the difference this makes is subtle, it does mildly qualify the point that digitization helps bring rare, inaccessible, and otherwise non-canonical works – objects like the Harmonies – to a wider audience. Taking and posting photographic facsimiles online is one task within a broader array of practices that we might call "digitization"; if our tools for mining and analyzing these books can't read these facsimiles, then the Harmonies and other texts that are not easily machine-readable remain in the position of the unusual, the quirky, the idiosyncratic, unplugged and disengaged from the networks that enable us to study broader cultural trends. In other words, they hold more or less the same marginalized position that they do under scholarly regimes of print, where, unable to be easily anthologized or reproduced, they remain outside the systems through which knowledge circulates, accumulating cultural capital. This small difference may have big consequences for the kinds of stories we can tell about history at scale.

As I mentioned in my recent MLA paper, we see the same issue in image matching tools. Machines are good at – that is, we've designed machines to be good at (what determines what is slippery here) – identifying sameness, matching strings of characters or visual patterns. It's more difficult to trace subtle acts of remediation, whereby a woodcut pattern is copied in Thomas Trevelyon's 1608 manuscript miscellany, then embroidered in blackwork, or used in plasterwork. This is not a point against image matching; rather, it's a simple reminder that scale is determined by not only the capacity but the affordances of the network, by what the computer is capable of seeing and reading, such that more does not eventually lead to "culture" as such.

(The top image is from Geoffrey Whitney's A choice of emblemes (1586), STC 25438; the second is from Thomas Fella's manuscript miscellany, now at the Folger.)

I've (accidentally) described this problem as a criticism, but it's more interesting in the form of a question. Namely, what type of tool would be suited to pattern matching in the Harmonies, or across networks of prints and textiles? How would we design it? What are the points of friction, material or structural, in the process of digitization?

As I've been playing around with different programs, I keep returning to the recent discussions I've been having with my colleagues Mary Caton Lingold and Darren Mueller. We're in the midst of collaboratively writing an introduction for an edited collection on digital sound studies, and we keep rubbing up against this problem of medium specificity. The digital experience, however you interpret that phrase, is remarkably textual, not just because we're emailing and tweeting and blogging and texting more, but also because we parse nearly everything in strings of characters. Vestiges of the command-line interface appear in the ubiquitous search box of the web; the metadata that responds to that search is encoded in text. If you want to perform similar search operations on sound, you have to translate it into another medium, either visual (the waveform, the spectrogram) or textual (user tagging). Though the Little Gidding Harmonies are books to be read as text, the task of making them machine readable is more like that of mining sound for semantic content. That is, it's an intermedial process of mapping character strings to their position within a page spread, and then matching these patterns within and across the whole. It requires abstracting from the book without losing the book.
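One very rough way to picture that mapping-and-matching step is the Python sketch below. Everything in it is an assumption made for illustration: the input is an invented list of text fragments with pixel coordinates and page dimensions, and the names `FRAGMENTS`, `build_index`, and `recurring` are hypothetical. It records where each string sits on its page, as a fraction of the page's width and height, and then reports strings that turn up on more than one page.

```python
from collections import defaultdict

# Hypothetical input: (page id, string, x, y, page width, page height).
# Placeholder strings stand in for transcribed fragments.
FRAGMENTS = [
    ("p1", "Alpha  beta", 120, 180, 1200, 1800),
    ("p2", "alpha beta", 600, 450, 1200, 1800),
    ("p2", "gamma", 100, 100, 1200, 1800),
]


def build_index(fragments):
    """Map each normalized string to the (page, rel_x, rel_y) places it occurs."""
    index = defaultdict(list)
    for page, text, x, y, width, height in fragments:
        # Normalize case and whitespace so trivially different copies match.
        key = " ".join(text.lower().split())
        # Store position relative to page size, so pages of different
        # scan resolutions remain comparable.
        index[key].append((page, round(x / width, 3), round(y / height, 3)))
    return index


def recurring(index):
    """Keep only the strings that appear on more than one distinct page."""
    return {
        key: places
        for key, places in index.items()
        if len({page for page, _, _ in places}) > 1
    }


if __name__ == "__main__":
    for key, places in recurring(build_index(FRAGMENTS)).items():
        print(key, places)
```

Note that this matches exact strings only, which is precisely the bias toward sameness described above; anything closer to tracing remediation would need fuzzier comparison of both text and position.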

Pattern is a word rich with meaning in the early modern period, especially for Little Gidding. Nicholas Ferrar, founder of the community, describes his friend George Herbert as a "pattern or more for the age he lived in," a phrase Ferrar applied to Little Gidding's lifestyle, too. The Harmonies themselves (note the sonic resonance of the word) are pattern books that marry form and content in the same way as Herbert's "Easter Wings" – a poem whose material history has been traced beautifully by Random Cloud (Randall McLeod) in "FIAT fLUX." The Harmonies also contain echoes of the pattern books used to embroider and knot networks of significance, "networks" of course originally referring to lace webbing. We (I?) can't seem to escape this dense interweaving of text and textiles when talking about transmutation from print to digital books. It's fun to dance in the history of etymologies – but more than that, these webs of signification continue to do cultural work. I'm drawn to the idea of a digital humanities invested in pattern rather than identity.

(From Flickr user crabchick.)

If you want to learn more about layout analysis for historical documents, you can dive into the HisDoc research project or PRImA. LLC recently published an interesting article on using pattern redundancy analysis in historical printed books, based on work at the Laboratoire d'Informatique de Tours. There are also a few neat open source tools for OCRing text (Optical Character Recognition – the process of pulling text from a scanned image) that come packaged with document analysis, like OCRopus. I would of course love to hear about more digital humanities projects using layout analysis.
