Graphing Topics in DH Interviews

Stéfan Sinclair tweeted in passing that it would be interesting to do some text-mining on interviews with DH practitioners posted on the Digital History website.

This seemed like a chance for me to play around with MALLET and Gephi, two tools I've been learning lately. The interviews are all done by students in Leslie Madsen-Brooks's course "Digital History" at Boise State University. The course "is about methods, controversies, ideas and ideologies, and the way U.S. history gets deployed in a digital age" (Syllabus). All are relatively short pieces that summarize interviews with a wide range of DH folk. Given the big tent of DH, we might expect to see little overlap in the topics, which is almost precisely what I found--with some notable exceptions. As you'll see from the overall weights of the topics below the fold, only 2 topics appear regularly among all the interviews. 

It would be great to have transcriptions of the actual interviews, since these posts introduce another layer--that of students writing a summary--that comes through in the topics.

Method

I used wget to download all the posts from the course website, then separated out only those that are interviews. Because I didn't want to leave wget mirroring the site for too long, I cut it off once it seemed I had them all. After I finished the topic models and the graph visualization, I realized that I had missed 2-3 interviews. I very much doubt the results would differ radically were those included, but let's just consider that a potential avenue of further investigation (I smell an assignment).

Once I stripped out the HTML using the clean_html() function from Python's nltk library, I hand-edited the files to remove the header and footer information in each file, leaving only the text of the interview. If the corpus were larger, this method would not be sustainable, but developing a script to strip this data out would have taken more time than doing it by hand in this case.

After running the texts through MALLET a few times with 10, 20, and 30 topics (stopwords removed), I settled on 30 topics as giving the most meaningful-seeming results with a good amount of diversity. You can see a table of the topics and their weights at the bottom of this post.

Once I had this data, I put it into an Excel spreadsheet with "Source", "Target", and "Weight" columns (following this suggestion from Shawn Graham, one of the interview subjects), saved it as a CSV, and imported it into Gephi to create a bimodal graph. I used the force-directed ForceAtlas 2 layout, adjusted some colors (including those of the edges), and increased the edge sizes to make the relationships among nodes clear. While another option would have been to make graphs of only people, then use edges to show shared topics, the relatively small size of the data set and the relationships among them made a bimodal graph an appealing option.
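For anyone who would rather script the spreadsheet step, here is a minimal sketch in Python of building the "Source", "Target", "Weight" edge list. It is not what I actually did (I used Excel by hand). The function name, the min_weight cutoff, and the sample proportions are all assumptions made up for illustration, and the sketch assumes the per-document topic proportions have already been read into a dictionary.

```python
import csv
import io


def write_edge_list(doc_topics, out, min_weight=0.05):
    """Write one Source,Target,Weight row per (interview, topic) pair.

    doc_topics maps an interview label to {topic_id: proportion}.
    Pairs below min_weight are dropped so the graph stays readable
    (the cutoff is an illustrative assumption, not a value from the post).
    Returns the number of edges written.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    rows = 0
    for doc, topics in sorted(doc_topics.items()):
        for topic_id, weight in sorted(topics.items()):
            if weight >= min_weight:
                writer.writerow([doc, f"topic_{topic_id}", weight])
                rows += 1
    return rows


# Hypothetical proportions, for illustration only (not the post's data).
example = {
    "interview_a": {22: 0.61, 14: 0.20, 18: 0.15, 3: 0.01},
    "interview_b": {22: 0.55, 14: 0.25, 10: 0.18},
}

buffer = io.StringIO()
n_edges = write_edge_list(example, buffer)
edge_csv = buffer.getvalue()
```

Writing to an open file instead of the in-memory buffer gives a CSV that can be imported into Gephi as an edges table. Because interviews are always the sources and topics always the targets, the resulting graph is bimodal by construction.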

Topic model of people and topics in DH Interviews from Digital History

Analysis

The first thing that stood out to me when looking at the topics was how much each interview seemed to be a topic unto itself. Many of the topics include the interviewee's name and other proper nouns that no other interviews contain. As such, they serve as handy distant readings of each interview that give a very good sense of the content. We might also read them as shorthand for the diversity of subjects and methods that make up the DH field.

Topics 22 and 14, however, appear across all the interviews with a much higher weight than any of the others, 22 especially, which centers the entire network. Topic 22 arises from the source of these interviews: an assignment written by students for a course. Nearly all the students wrote their posts in the first person and mention that they interviewed somebody in the digital humanities. In other words, echoes of the assignment prompt made their way into the texts analyzed and end up being, by a very large margin, the topic that ties all the interviews together. I suspect an analysis of interview transcriptions would, as a result, look radically different (or maybe not). Still, some interesting words appear. I'm particularly struck by "work", "project", "research", and "content". These words reflect the project-based, maker-oriented, content-driven aspects of DH practice.

The next most commonly attested topic (14) reflects the context of the course. "History" and variations of that word dominate. Looking across the topics, we can pick up some other themes: libraries, museums, many different technologies, and several other semantically related words crop up repeatedly.

If you draw other conclusions from this data, have problems with my methodology, or do something else with the texts, I'd love to hear about it in the comments.

Further Questions

How would including the other posted interviews I missed change the results?

What would topic models run on each interview individually return?

How different would transcriptions of the interviews be from the summaries?

What would a unimodal network graph of people as nodes and shared topics as edges look like?
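As a starting point for that last question, here is a rough Python sketch of collapsing a bimodal edge list into a people-only one. Two people are linked once for every topic they share, and the edge weight is the count of shared topics. The names, the sample edges, and the skip_topics option are hypothetical. I added the option because a topic that shows up in every interview (as topic 22 does here) would otherwise connect everyone to everyone.

```python
from collections import defaultdict
from itertools import combinations


def project_to_people(edges, skip_topics=()):
    """Collapse (person, topic, weight) edges into person-person edges.

    The returned dict maps a sorted (person, person) pair to the number of
    topics the two share. Topics listed in skip_topics are ignored.
    """
    people_by_topic = defaultdict(set)
    for person, topic, _weight in edges:
        if topic not in skip_topics:
            people_by_topic[topic].add(person)

    shared = defaultdict(int)
    for people in people_by_topic.values():
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical edges, for illustration only (not the post's data).
edges = [
    ("person_a", "topic_22", 0.61),
    ("person_b", "topic_22", 0.55),
    ("person_c", "topic_22", 0.40),
    ("person_a", "topic_14", 0.20),
    ("person_b", "topic_14", 0.25),
    ("person_c", "topic_10", 0.18),
]

all_links = project_to_people(edges)
without_hub = project_to_people(edges, skip_topics={"topic_22"})
```

Counting shared topics throws away the original weights. Summing or multiplying the two people's proportions for each shared topic would be another reasonable choice, and I have not tested which gives the more readable graph.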

Data

Attached at the end of this post, you can find all the texts, the MALLET file, the spreadsheets generated, and the network graph in a compressed file.

Here's a table of the topics, their weight across the corpus, and the words in each topic:

 
Topic  Weight   Topic Words
0      0.0407   collections knowledge publishing spiro express answers emerging org interesting bios nitle industry guide biggest michigan change simply continue fact
1      0.02434  frischer dr projects conversation world skype ucla understanding noted radio received suggested conferences life picture ancient lab pc lighting
2      0.08813  brandi humanist boise city historian important state department historical skill skills technological amp idaho history arts website research center
3      0.06631  ceiling site google googledocs realize small training build bibliographic hacked plan errors networking impact action read liven maintaining devoted
4      0.05148  museum blog cataloging online curators types collections career interactive provide access point skill state website exhibit feature technician presence
5      0.06741  model leading diligent completing reduce states integrator soldiers living utilizes school ability web flowing assistance organizations start relies interaction
6      0.06027  louisiana standards mlis war national helped oral library helpful metadata barnes called opening degree art artifacts grants career weapons
7      0.04679  collaborate ended historying org campus building standpoint methodology order land wanted summer system postal nineteenth academic visit positive openness
8      0.06834  encrypting hopeful journals entertained ohio molecules complex ocean museums obtain privileges excitement informing mcchesney simply align confident ness set
9      0.04119  hajo dr funding humanists resource opinion relevancy changing thing nyu documents make inquiry case easier difficult publications argument range
10     0.04253  native american dr powell museum tribes artifacts university explain digitalize program americans fellowship dad cherokee grad anthropology pennsylvania culture
11     0.02629  sics computer busy archaeological modeling groups great virtual professor touch clued discussion feel derailed turned retrospect interesting resume
12     0.06517  social growth feel position zotero support studying general involving solving cohen factor affects solved matter pace rapid ago warns
13     0.06205  kelly mr teaching learn technologies helping social teachers funding webpage proud global society unsettling begin ways biggest addition related
14     0.4278   history projects people historical tools historians source learned primary http good sources historian work making students quickly kind media
15     0.05801  owens national encouraging methods historians scrappiness ndiipp trevor problems preserve problem science watching users successful connections pragmatic approaches instrumental
16     0.08734  gave collection studies title obstacles reinvented alluded pressure ways notions developments imagine mailed peninsula martin librarianship insights staff consistently
17     0.06879  students hangen dr worcester online classroom state entry wikipedia access blogs thing order utilize learning world teacher found omeka
18     0.04919  shawn young academic share scholars means archeology offer academia academics beginning concept forward interest canada innovative related presented initial
19     0.04632  extent cable apply pages posts maps throw full love technological significant
20     0.03325  preservation interdisciplinary commented preserving games group set field spend managed professions disciplines bent technologists facing enables maintenance focused alliance
21     0.03281  cameron ve realized transactions real century skill recommend aspect graduated background life dome grew art twitter www affiliated seek
22     1.13777  digital humanities work project technology field time working research world interview things university humanist career information years find content
23     0.05917  humanist mies http php participate suggest situation play good cultural world happy coming apposed possessing christorpher lisa publishers ties
24     0.07412  young mr internet explore informed theater visualization berkeley talking online ten sort allowing microsoft exploits colleagues improve started voice
25     0.05815  scholarship forester compelling overwhelmed talking felt experience suggested phenomena budgets slashed tenure dismal appeared futureofthebook link employable door code
26     0.07692  share west open advice upcoming standing dan boss couple mentions carr innovations terrible rapidly mindset accomplishments philadelphia renaissance fascinating
27     0.08466  exhibits job passion people college flexible work clean archiving progressive physical curatorial decide due senior keynotes british participated museum
28     0.03934  melanie schlosser definition librarianship interested librarians publishing arguments resources projects key intersectional introducing chris intriguing assigned omeka community component
29     0.03296  women flanders stated traditional field dr discipline text serving editing entered mention side aspect kind interesting conversation required inspiration
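If you want to work with the table programmatically, something like the following Python sketch could read it in and rank the topics by weight. It assumes the file has one topic per line with the topic number, the weight, and the words separated by tabs, which is how I understand MALLET's topic-keys output to be laid out; check the attached file before relying on that. The function name is hypothetical, and the sample rows use weights from the table above with the word lists cut to the first three words.

```python
def parse_topic_keys(text):
    """Parse lines of the form: topic id, TAB, weight, TAB, space-separated words.

    The tab-separated layout is an assumption about the input file.
    Returns a list of (topic_id, weight, words) tuples in file order.
    """
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


# Four rows from the table, word lists truncated for brevity.
sample = (
    "0\t0.0407\tcollections knowledge publishing\n"
    "1\t0.02434\tfrischer dr projects\n"
    "14\t0.4278\thistory projects people\n"
    "22\t1.13777\tdigital humanities work\n"
)

topics = parse_topic_keys(sample)

# Rank topics from heaviest to lightest.
ranked = sorted(topics, key=lambda topic: topic[1], reverse=True)
top_two_ids = [topic_id for topic_id, _weight, _words in ranked[:2]]
```

Run over the full table, a ranking like this puts topics 22 and 14 well ahead of the rest, which is the pattern discussed in the Analysis section.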

Comments

Thanks so much for this, Mike!  Fascinating stuff.

It definitely makes me think that next time I teach the class I need to add two more requirements to the assignment: audio recording the interviews if done by phone or video chat (perhaps a technical hurdle too high for my tech-anxious students as their first assignment of the semester?) and providing clean transcriptions of the interviews.  

That sounds like a great tweak. You could end up talking about methodology as a result. 


Last modified Tue, 25 Sep, 2012 at 15:50