Data and Doing the Reading

by Sean Gilleran


An early model of a relational database, from https://www.ibm.com/ibm/history/ibm100/us/en/icons/reldb/.

I was recently a teaching assistant for a course on the history of public policy—a broad topic, and one that undergraduates had a particularly hard time adjusting to. Not only was the course thematically challenging, it also featured a bewildering array of case studies spanning the globe over more than two thousand years. Our first week’s reading was from Thucydides, on the Athenian demos and its disastrous collective decision to go to war in Sicily on behalf of Egesta. Our last touched on the mythology of Silicon Valley and the cult of innovation around information technology-adjacent businessmen like Elon Musk. Lectures, meanwhile, were almost all delivered by guests, with new faces and new voices every week. It was difficult to maintain any sense of consistency. The students were overwhelmed, and so was I.

When students ask me for help with reading, I tell them to look at the introduction and conclusion, to scan the topic sentences, and to focus first on the overall meaning or importance of a piece in the abstract. There are no spoiler alerts in academia—they need to know what they’re going to read before they read it, so they can see how the intention of the author and the experience of reading relate to the larger thematic concerns of the course. Otherwise it’s too easy to catch yourself reading the same paragraph over and over without really absorbing it, or to get lost and confused by details that don’t matter.

Over the last few years, as I’ve been making my way through graduate school as a teaching assistant, I’ve also been working with Alan Liu, Jeremy Douglass, Scott Kleinman, Lindsay Thomas, and others on the WhatEvery1Says, or WE1S, project. This is an effort to use “digital humanities methods to study public discourse about the humanities at large data scales.”1 I don’t have a background in the digital humanities, and I don’t have a background in working with “large data scales,” but I do have a background in computing and, of course, in doing research, and I found, as I got to know the tools, the processes, and the people, that there was a lot of overlap. Essentially, the problem faced by WE1S is this: how do you read a million newspapers or a million blogs? The answer, of course, is that you don’t. Past a certain threshold, these texts become data. We begin to think about them differently, because at scales like these the conceptual framework we use as humanists to consider sources begins to fail.

One of the other projects Scott has worked on is Lexos, a tool designed to “help you explore your favorite corpus of digitized texts” by way of “computational and statistical probes.”2 Lexos, like the software “observatory” under development at WE1S, lets us take a collection of texts and turn it into data. Unlike the WE1S tools, though (at least for the time being), Lexos is quick, simple, and fun. It can excise words, find lemmas, produce word clouds, surface patterns and outliers, and so on. Working with Lexos is a fascinating back-and-forth process. You can adjust the scope of your results to zoom in on a particular document or series of documents, or look at a handful of words or hundreds at a time. A word may show up in your results that you did not expect, and you can decide on the fly to leave it in or take it out, and watch the presentation of the data change as you do.
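To make that back-and-forth a little more concrete, here is a minimal sketch in Python of the kind of counting a tool like Lexos does behind the scenes: excising stop words, tallying term frequencies, and producing the numbers a word cloud visualizes. This is not Lexos’s actual code; the corpus/ folder and the stop-word list are placeholders of my own.

```python
import re
from collections import Counter
from pathlib import Path

# A placeholder stop-word list; Lexos lets you supply and edit your own.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "that", "is", "it", "for"}

def term_frequencies(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count
    everything that is not a stop word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Tally terms across every plain-text file in a hypothetical corpus/ folder.
totals = Counter()
for path in Path("corpus").glob("*.txt"):
    totals += term_frequencies(path.read_text(encoding="utf-8"))

# The twenty most frequent terms: the raw material of a word cloud.
for word, count in totals.most_common(20):
    print(f"{word:12} {count}")
```

Everything interpretive (which words count as noise, what counts as a word at all) is already present in those few lines, which is rather the point.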

I decided, then, to use Lexos to help my students develop the background understanding they would need to pull the course materials together. I assembled the readings as a plain-text corpus, offering extra credit to students who helped me with material that was not yet digitized or that could not be cleanly extracted by optical character recognition. Then I had them load all of this into Lexos and explore the results, asking them to come to class the following week with a “souvenir” of what they found—nothing more than a screenshot and a paragraph summarizing its significance.
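The cleanup itself was unglamorous. As a rough sketch of the sort of normalization involved, assuming a hypothetical ocr/ folder of raw scanner output feeding the corpus/ folder mentioned above (the paths and rules here are illustrative, not a prescribed pipeline):

```python
import re
from pathlib import Path

def normalize(raw: str) -> str:
    """Flatten common OCR artifacts: words hyphenated across
    line breaks and runs of whitespace."""
    text = raw.replace("-\n", "")     # re-join words split across lines
    text = re.sub(r"\s+", " ", text)  # collapse newlines and extra spaces
    return text.strip()

# Hypothetical layout: raw OCR output in ocr/, cleaned plain text in corpus/.
Path("corpus").mkdir(exist_ok=True)
for path in Path("ocr").glob("*.txt"):
    cleaned = normalize(path.read_text(encoding="utf-8"))
    (Path("corpus") / path.name).write_text(cleaned, encoding="utf-8")
```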

When students examined the documents through the lens of data, they discovered patterns they had not been able to find before. Many of them noticed, for instance, that the word “man” appeared far more often than the word “woman.” “After comparing” the documents, wrote one student, “women really were left out… who is respected? Who is deemed fit to rule? How does gender influence politics?” Others latched on to the frequent appearance of the words “king” and “god(s).” One student speculated that this might be because the two are representative of forces “seen as something that tie[s people] together.” Another student argued that Lexos revealed “a general theme… of identity,” since “the most frequently appearing words” were “identifying words: man, king, god, soldier, officer, Christian, etc… words [that] are used to classify people and create a hierarchy in society.”

A Lexos word cloud generated by a student as part of the assignment.

Intriguingly, the students who reported that they found the assignment most useful were also the ones who were most critical of Lexos and of the curious alchemy it was doing. “If one were to analyze the texts using only the key words that are provided,” wrote one, “then the reader might miss the tone of the article.” Another student, although they “found it incredibly interesting to analyze documents in this manner,” also noted that “many of the same things the Lexos word clouds tell us can be gleaned through close reading” and that, of course, “throughout time and in different circumstances, the same words can be used drastically differently, and it is important that we keep this in mind.”

That this last student seemed to identify the assignment as superfluous indicates to me that it achieved its purpose. We have a sense that when we deal with data we are dealing in absolutes; that we have ascended to the realm of the objective. But reading a text and working with data are essentially the same thing: both are acts of interpretation with a human being at the center.

If I’m lucky, my students will have come away from their work not just with a better understanding of the course content, but with a larger metaphor for the data-centric discourse that awaits them in the world beyond—and with a reminder that, unfortunately, you still have to do the reading.