Text-Mining the Middle Ages

In the always lovely Austin, TX, where MLA 2016 was held, I presented a paper on the challenges for medievalists who want to do quantitative textual analysis. To spoil the conclusion a bit: the sub-title might have been "Why TEI is so great and you should use it." Here's the paper and some of the more relevant slides (I'm not including the sign-posting slides).

One of the most prominent areas of research and practice in the digital humanities is text-mining, also known as quantitative textual analysis or Moretti’s well-known “distant reading,” all of which I use interchangeably. Especially at a conference like the MLA, most of the digital humanities sessions will have at least one, if not multiple, papers discussing text-mining... and this one is no different. Rather than research results, however, I will discuss three things: the challenges medieval texts pose to the common methods of text-mining; the assumptions behind the tools and algorithms that these texts expose; and the changes we need to make to how we value the labor that is a prerequisite to successful text-mining work, if medievalists in particular are to benefit most fully from the digital tools and methods available for distant reading.

I’m going to assume very little knowledge about text-mining. If you’re already well-versed in this field, please bear with me as I describe a typical project. The first step after developing a pertinent research question is to acquire a corpus of digitized texts that is large enough that it demands quantitative methods. This corpus may be as small as a few hundred thousand words, but that’s still more than enough to exhaust a human’s patience and accuracy. For digital humanists working in the 18th and 19th centuries, this poses little challenge. Many thousands of works have been digitized and are not copyrighted. Although medievalists do not have to worry so much about copyright, digitization remains a large hurdle. Oftentimes, print books are scanned and then processed by optical character recognition (OCR) software to convert the page images into computer-recognizable text. Medieval manuscripts, however, are far more difficult for OCR. To begin, we must distinguish text from non-text.1 Illustrated manuscripts often have a large amount of the page given over to non-text elements. There is fascinating and valuable digital research on manuscripts for this and related problems.

Particularly noteworthy has been the Mellon-funded work at Yale that includes projects by Alastair Minnis, Jessica Brantley, Anders Winroth, and Holly Rushmeier.2 Also noteworthy is a new NEH-funded project, “Global Currents: Literary Networks, c. 1090-1900,”3 a collaboration among Stanford, McGill, and Groningen. It includes work by Lambert Schomaker, an AI researcher, on a program called Monk, which has identified a lexicon of hundreds of thousands of words to make it possible to do keyword searches of manuscripts.4

Brandon Hawk has also experimented with OCR on medieval manuscripts and notes that even good OCR:

would not eliminate issues like post-processing correction of OCR extractions, or editorial decisions about modernizing forms like abbreviations, punctuation, and capitalization. The goal is not to eliminate human editorial work with computers, but creating accurate OCR for manuscripts has the potential to limit the time of editing and increase the efficiency of dealing with large numbers of witnesses.5

In other words, as work on this problem progresses, we may suddenly have access to the digitized textual content of many thousands more manuscripts, but a great deal of editorial work will remain before these texts are usable as corpora for quantitative textual analysis.

The basis of nearly every distant reading is, quite simply, counting words. Here we medievalists run into our next problem. Computers will count variant spellings as separate words because they perform a naive comparison of strings. If the strings differ by so much as a single letter, they are not identical and therefore “count” as different words. The lack of orthographic regularity in our source materials is thus a substantial problem for nearly all the techniques common in text-mining. Scott Kleinman writes:

A single word in an Early Middle English text may have as many as thirty different spellings, and another text from the same time period may have a completely different set of spellings for the same word. Dialectal variations mean that grammatical variants of words may be entirely different in texts from different parts of the country.6
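The counting problem can be shown in a few lines of Python. This is only a minimal sketch: the token list is invented for illustration, the spelling "whan" is added alongside the "quhen" and "schal" variants mentioned later in this paper, and the `VARIANTS` table is a hypothetical hand-built mapping, not an existing resource.

```python
from collections import Counter

# Invented sample: three spellings of "when" and two of "shall".
tokens = "when quhen whan shall schal when".split()

# Naive counting compares strings exactly, so every spelling
# is tallied as its own "word".
naive = Counter(tokens)

# Hypothetical hand-built table mapping each variant to one form.
VARIANTS = {"quhen": "when", "whan": "when", "schal": "shall"}

# Look each token up in the table; unknown tokens pass through unchanged.
normalized = Counter(VARIANTS.get(token, token) for token in tokens)

print(len(naive), "distinct strings before normalization")
print(len(normalized), "distinct strings after normalization")
```

The table-based approach only works for variants someone has already listed, which is exactly the labor this paper is concerned with.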

To test my own concerns about how medieval orthography might frustrate text-mining, I extracted text from the TEAMS Middle English Text Series—an incredibly valuable collection of digitized works—and performed some basic topic modelling (the algorithmic details of which I will skip, except to mention that topics are basically “bags of words” that, when extracted, appear as a list of the most common words in a given topic; it’s up to the researcher to interpret them), first without any processing, then with some of the most common Middle English words (i.e., stop words) removed.

The first batch of topics were almost entirely useless, except as proof of the importance of spelling and stop words. For example, three topics were nearly identical:

I suspect the differences among these topics result from orthographic variations of a few words. After removing some stop words, the results were noticeably better, but also showed just how many variant spellings we have to account for. I didn’t catch them all in this next trial:

These topics tell me that my corpus is a mix of Middle English in several dialects, some Latin, and some French. I also need to add more variant spellings to my stop words list. Rather than just adding “when”, I need “quhen”; instead of just “shall”, I need “schal”. And so on...
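One way to keep such a list manageable is to group the variants under a single headword. The sketch below assumes a hypothetical `STOP_VARIANTS` table with only three entries and an invented sample line; a real stop list would be far longer and would have to be compiled by hand or drawn from a dictionary.

```python
# Hypothetical stop list: each modern headword with its known spellings.
STOP_VARIANTS = {
    "when": {"when", "quhen"},
    "shall": {"shall", "schal"},
    "the": {"the", "þe"},
}

# Flatten the groups into one set for fast membership tests.
STOP_WORDS = set().union(*STOP_VARIANTS.values())


def remove_stop_words(tokens):
    """Return the tokens whose lowercased form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


# Invented line, used only to exercise the filter.
sample = "Quhen þe knyght schal ryde".split()
print(remove_stop_words(sample))
```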

After normalizing the texts and stop words lists a great deal more, I might get results more like what Anna Waymack, a graduate student at Cornell, has managed. She graciously shared some of her results with me and writes of her corpus:

It is passus 1-7 of Piers Plowman, the first book of the alliterative Morte, Wynnere and Wastoure, Parliament of the Three Ages, Three Dead Kings, and the Pearl-poems. I divided them into chunks of 20 lines each. I removed capitalization and punctuation, turned 'u's into 'v's, 'gh' and yogh into 'y', thorn into 'th', and 'j' into 'i'. I then ran these 20-line chunks eight times, with 1000 iterations, changing the following: 10 topics vs 25, stoplists of approx 300 vs 500, and removing all word-final 'e's from the corpus.
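The steps she describes could be approximated as follows. This is a sketch under stated assumptions, not her code: the function names, the order of the substitutions, and the choice to leave one-letter words alone when stripping final 'e' are all guesses made for illustration.

```python
import re


def normalize_line(line, strip_final_e=False):
    """Apply the substitutions described above and return a list of tokens."""
    text = line.lower()                                # remove capitalization
    text = text.replace("þ", "th")                     # thorn -> th
    text = text.replace("ȝ", "y").replace("gh", "y")   # yogh and gh -> y
    text = text.replace("u", "v").replace("j", "i")    # u -> v, j -> i
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    tokens = text.split()
    if strip_final_e:
        # Assumption: keep one-letter words so that "e" is not emptied out.
        tokens = [t[:-1] if len(t) > 1 and t.endswith("e") else t
                  for t in tokens]
    return tokens


def chunk_lines(lines, size=20):
    """Split a list of lines into consecutive chunks of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# Invented test string, not a quotation from any of the poems.
print(normalize_line("Þe knyȝt, Juge!"))
print(normalize_line("Þe knyȝt, Juge!", strip_final_e=True))
```

Note that every one of these substitutions discards information, a point taken up below.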

A sample of her topics shows considerable improvement over my first results:

I won’t belabor the point further, but it’s clear that a great deal of work must be done on a medieval corpus before one can achieve usable results. Many tools make this pre-processing nearly automatic for modern languages, but they are, unfortunately, unavailable to medievalists. For example, there’s no software that I’ve found that can lemmatize a Middle English corpus. There are many, many other similar challenges, but I won’t detail them now.

What, then, happens to all this work preparing a medieval corpus for the brave souls who attempt it? As part of the deformation of a text often required by algorithmic criticism, to paraphrase Stephen Ramsay7, the corpus loses information that we often find invaluable: spelling variants that might be signs of dialect or origin have been removed, manuscript variants have been elided, etc. The digital corpus that remains is typically plain text and nearly unreadable by humans. And the many hours of labor preparing such a corpus will not accrue any credit unto the scholars who performed it nor will it generally benefit any others.

If, however, we adjust our idea of what constitutes valuable scholarship, then we have a way to harness and encourage work building corpora in a way that would serve the field as a whole. I propose a solution that is part technological, part cultural. Consider the medieval scribes whose copying of manuscripts preserved them. Our source materials wouldn’t exist without their labor. Consider, too, the transcriptions and scholarly editions produced in the early twentieth century, also without which medieval literary scholarship would be greatly impoverished, if not nearly impossible. The MLA’s Committee on Scholarly Editions, however, laments the marginalization of the practice of editing in the academy: “editorial work has often seemed to reside outside the bounds of ‘true’ scholarship.”8

If a print-based scholarly edition is difficult for faculty to receive credit for as scholarship, the creation of a digital edition is more fraught still. Guidelines for the evaluation of digital scholarship exist, but they often amount to little more than a call for committees to recognize digital work as equally valid as traditional monographs. There is a long way to go, yet, in establishing strong incentives for scholars, especially junior ones, to spend their valuable time on digital projects. Moreover, it seems unlikely that someone might receive tenure or promotion based on the creation of a digital scholarly edition. Yet this work is desperately needed. Many observers or newcomers to the digital humanities assume that computers can solve all our problems easily, but the greatest challenge is still the development of freely available, high-quality corpora.

Although there’s little I can do about these issues other than bring attention to them, I can propose a robust, easily learned technology that was developed by literary scholars and that allows researchers to create digital texts useful for human use, accessible to machine analysis, and from which we can generate digital editions. The solution is the Text Encoding Initiative (TEI) Guidelines, “which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation.”9 TEI is often viewed as little more than a mechanism for preservation used mostly by librarians, but it is far more powerful than that. Because the guidelines were developed by scholars, they cover the gamut of research uses and include critical apparatus, manuscript description, text alignment, and linguistic analysis for prose, verse, performances, and other forms.

What I propose, then, is simple: scholars who need to build a corpus of digital texts should begin with TEI-encoded ones that mark up the salient features. This work can be done concurrently with transcription and proof-reading, during active research as a form of digital note-taking, or after the fact. Also, TEI-encoded texts can be automatically transformed into web pages like the TEAMS METS site, which people can read either for research or teaching (the Walt Whitman Archive10 is an excellent example of this use case). TEI-encoded texts can also be used for quantitative textual analysis of a far more sophisticated and accurate type than is otherwise possible. Instead of relying on natural language processing methods, which mostly fail for medieval texts anyway, the scholar-created digital edition can easily deal with variant spellings and variant manuscripts. Such digital editions could be shared freely and progressively enriched by each new scholar who works on them. Collaboration is key to successful digital humanities projects; the use of robust standards is one crucial element of any successful collaboration. TEI is not only an excellent, robust standard; it’s effectively the only game in town for marking up digital texts.
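To make the handling of variant spellings concrete, here is a small sketch in Python. It uses the TEI elements `<choice>`, `<orig>`, and `<reg>`, which pair an original spelling with a regularized one. The encoded line is an invented fragment rather than a complete TEI document, and the helper `text_of` is a hypothetical name introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented verse line: each <choice> keeps the manuscript spelling (<orig>)
# next to a regularized form (<reg>).
SAMPLE = (
    '<l xmlns="http://www.tei-c.org/ns/1.0">'
    "<choice><orig>Quhen</orig><reg>when</reg></choice> he "
    "<choice><orig>schal</orig><reg>shall</reg></choice> come"
    "</l>"
)


def text_of(elem, prefer):
    """Return the text of elem, reading only the `prefer` child of each <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == f"{{{TEI_NS}}}choice":
            chosen = child.find(f"{{{TEI_NS}}}{prefer}")
            if chosen is not None:
                parts.append(text_of(chosen, prefer))
        else:
            parts.append(text_of(child, prefer))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


line = ET.fromstring(SAMPLE)
print(text_of(line, "reg"))   # regularized reading, suitable for counting
print(text_of(line, "orig"))  # manuscript reading, preserved for other uses
```

The same file serves both purposes: the regularized reading feeds the word counts, while the original spellings stay available as evidence of dialect or origin.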

To reiterate and close: text-mining the Middle Ages is rare compared to studies done on 18th- and 19th-century literature for several reasons. The words themselves resist analysis through the usual means because those means are biased towards information retrieval tasks tested upon modern languages. Because of this fact, the development and preparation of a digital corpus is more time-consuming for medievalists than for many other digital humanists. Unless we as a community turn to reusable, flexible, and extensible standards, these corpora will typically be used only once and then discarded. But until universities recognize the necessity of the preparation of digital editions and begin to value this labor as important scholarship, and so create an incentive for scholars to take it on, medieval text-mining projects will remain rare and demand large teams and resources.


1 See Ying Yang, Ruggero Pintus, Enrico Gobbetti, and Holly Rushmeier. “Automated Color Clustering for Medieval Manuscript Analysis.” http://graphics.cs.yale.edu/site/sites/files/DH2015_ID193_final_1.pdf

2 “Digitally Enabled Scholarship with Medieval Manuscripts.” http://ydc2.yale.edu/research-support/digitally-enabled-scholarship-medieval-manuscripts

3 “Global Currents: About Us.” https://globalcurrents.stanford.edu/about/about-us

4 http://www.ai.rug.nl/~lambert/Monk-collections-english.html

5 Hawk, Brandon. “OCR and Medieval Manuscripts: Establishing a Baseline.” http://brandonwhawk.net/2015/04/20/ocr-and-medieval-manuscripts-establishing-a-baseline/ April 20, 2015.

6 Kleinman, Scott. “Topic Models and Spelling Variation: The Case of Early Middle English.” http://scottkleinman.net/blog/2013/01/20/1568/ Jan 20, 2013.

7 Ramsay, Stephen. Reading Machines: Toward an Algorithmic Criticism. Champaign IL: University of Illinois Press. 2011.

8 “What We Do and Why You (Editors) Should Care.” https://scholarlyeditions.commons.mla.org/2015/10/28/the-committee-on-scholarly-editions-what-we-do-and-why-you-editors-should-care/

9 TEI: Text Encoding Initiative. http://www.tei-c.org/index.xml

10 http://www.whitmanarchive.org/


Last modified Mon, 14 Mar, 2016 at 9:45