Graphing Topics in DH Interviews
Stéfan Sinclair tweeted in passing that it would be interesting to do some text-mining on interviews with DH practitioners posted on the Digital History website:
It would be great to have transcriptions of the actual interviews as these posts introduce another layer--that of students writing a summary--that comes through in the topics.
I used wget to download all the posts from the course website, then separated out only those that are interviews. Because I didn't want to leave wget mirroring the site for too long, I cut it off once it seemed I had them all. After I finished the topic models and the graph visualization, I realized that I had missed 2-3 interviews. I very much doubt the results would differ radically were those included, but let's just consider that a potential avenue of further investigation (I smell an assignment). Once I had stripped out the HTML using Python's nltk clean_html() function, I hand-edited the files to remove the header and footer information in each one, leaving only the text of the interview. If the corpus were larger, this method would not be sustainable, but developing a script to strip this data out would have taken more time than doing it by hand in this case.
After running the texts through MALLET a few times with 10, 20, and 30 topics (stopwords removed), I settled on 30 topics as giving the most seemingly meaningful results with a good amount of diversity. You can see a table of the topics and their weights at the bottom of this post.
Once I had this data, I put it into an Excel spreadsheet with "Source", "Target", and "Weight" columns (following this suggestion from Shawn Graham, one of the interview subjects), saved it as a CSV, and imported it into Gephi to create a bimodal graph. I used the ForceAtlas 2 layout, adjusted the node and edge colors, and increased the edge sizes to make the relationships among nodes clear. While another option would have been to make a graph of only people and use edges to show shared topics, the relatively small size of the data set and the relationships within it made a bimodal graph an appealing option.
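The spreadsheet step could also be scripted. The sketch below is not what I actually ran; it assumes a dense, tab-separated doc-topic file (document index, document name, then one proportion per topic), which is an assumption about the MALLET output and may not match every MALLET version. The function names, the `topic_N` labels, the sample file names, and the `min_weight` cutoff are all hypothetical.

```python
import csv
import io


def doc_topics_to_edges(lines, min_weight=0.05):
    """Turn dense doc-topic rows into (source, target, weight) edges.

    Assumed row layout (tab-separated): index, document name, then one
    proportion per topic. Lines starting with '#' are treated as headers.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        doc_name = fields[1]
        for topic_id, value in enumerate(fields[2:]):
            weight = float(value)
            # Drop very weak document-topic links to keep the graph readable.
            if weight >= min_weight:
                edges.append((doc_name, f"topic_{topic_id}", weight))
    return edges


def write_edge_csv(edges, handle):
    """Write edges with the Source/Target/Weight header used for Gephi."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Made-up sample rows, not values from the interview corpus.
sample = [
    "#doc\tname\tproportions",
    "0\tinterview_a.txt\t0.70\t0.02\t0.28",
    "1\tinterview_b.txt\t0.10\t0.85\t0.05",
]
edges = doc_topics_to_edges(sample)
buffer = io.StringIO()
write_edge_csv(edges, buffer)
print(buffer.getvalue())
```

To write a real file, replace the `StringIO` buffer with `open("edges.csv", "w", newline="")`. The cutoff is a judgment call: with 30 topics, keeping every document-topic pair would draw an edge from every interview to every topic.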
The first thing that stood out to me when looking at the topics was how much each interview seemed to be a topic unto itself. Many of the topics include the interviewee's name and other proper nouns that no other interviews contain. As such, they serve as handy distant readings of each interview that give a very good sense of its content. We might also read them as shorthand for the diversity of subjects and methods that make up the DH field.
Topics 22 and 14, however, appear across all the interviews with a much higher weight than any of the others. Topic 22 especially stands out, since it centers the entire network. It arises from the source of these interviews: an assignment written by students for a course. Nearly all the students wrote their posts in the first person and mention that they interviewed somebody in the digital humanities. In other words, echoes of the assignment prompt made their way into the texts analyzed and ended up being, by a very large margin, the topic that ties all the interviews together. I suspect an analysis of interview transcriptions would, as a result, look radically different (or maybe not). Still, some interesting words appear. I'm particularly struck by "work", "project", "research", and "content". These words reflect the project-based, maker-oriented, content-driven aspects of DH practice.
The next most heavily weighted topic (14) reflects the context of the course: "history" and variations of that word dominate. Looking across the topics, we can pick up some other themes: libraries, museums, many different technologies, and several other semantically related words crop up repeatedly.
If you draw other conclusions from this data, have problems with my methodology, or do something else with the texts, I'd love to hear about it in the comments.
How would including the other posted interviews I missed change the results?
What would topic models run on each interview individually return?
How different would transcriptions of the interviews be from the summaries?
What would a unimodal network graph of people as nodes and shared topics as edges look like?
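The last question can be explored from the same edge list. Here is a minimal sketch, assuming the bimodal edges are available as (person, topic, weight) tuples; the function name, the sample names and weights, and the choice to weight person-to-person edges by the number of shared topics are all illustrative assumptions, not something I ran on the interviews.

```python
from collections import defaultdict
from itertools import combinations


def project_to_people(edges, min_weight=0.05):
    """Collapse person-topic edges into person-person edges.

    Two people are linked once for every topic they both have at or
    above min_weight; the edge weight is the count of shared topics.
    """
    by_topic = defaultdict(set)
    for person, topic, weight in edges:
        if weight >= min_weight:
            by_topic[topic].add(person)

    shared = defaultdict(int)
    for people in by_topic.values():
        # Sorting gives each pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Made-up bimodal edges, not values from the interview corpus.
bimodal = [
    ("person_a", "topic_22", 0.40),
    ("person_b", "topic_22", 0.35),
    ("person_c", "topic_22", 0.30),
    ("person_a", "topic_14", 0.20),
    ("person_b", "topic_14", 0.25),
    ("person_c", "topic_3", 0.45),
]
people_edges = project_to_people(bimodal)
for (a, b), count in sorted(people_edges.items()):
    print(a, b, count)
```

Because topics 22 and 14 touch every interview, a projection like this would likely connect everyone to everyone unless those two topics are excluded or the cutoff is raised.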
Attached at the end of this post is a compressed file containing all the texts, the MALLET file, the spreadsheets I generated, and the network graph.
Here's a table of the topics, their weight across the corpus, and the words in each topic:
|0||0.0407||collections knowledge publishing spiro express answers emerging org interesting bios nitle industry guide biggest michigan change simply continue fact|
|1||||frischer dr projects conversation world skype ucla understanding noted radio received suggested conferences life picture ancient lab pc lighting|
|2||||brandi humanist boise city historian important state department historical skill skills technological amp idaho history arts website research center|
|3||||ceiling site google googledocs realize small training build bibliographic hacked plan errors networking impact action read liven maintaining devoted|
|4||||museum blog cataloging online curators types collections career interactive provide access point skill state website exhibit feature technician presence|
|5||||model leading diligent completing reduce states integrator soldiers living utilizes school ability web flowing assistance organizations start relies interaction|
|6||||louisiana standards mlis war national helped oral library helpful metadata barnes called opening degree art artifacts grants career weapons|
|7||||collaborate ended historying org campus building standpoint methodology order land wanted summer system postal nineteenth academic visit positive openness|
|8||||encrypting hopeful journals entertained ohio molecules complex ocean museums obtain privileges excitement informing mcchesney simply align confident ness set|
|9||||hajo dr funding humanists resource opinion relevancy changing thing nyu documents make inquiry case easier difficult publications argument range|
|10||||native american dr powell museum tribes artifacts university explain digitalize program americans fellowship dad cherokee grad anthropology pennsylvania culture|
|11||||sics computer busy archaeological modeling groups great virtual professor touch clued discussion feel derailed turned retrospect interesting resume|
|12||||social growth feel position zotero support studying general involving solving cohen factor affects solved matter pace rapid ago warns|
|13||||kelly mr teaching learn technologies helping social teachers funding webpage proud global society unsettling begin ways biggest addition related|
|14||||history projects people historical tools historians source learned primary http good sources historian work making students quickly kind media|
|15||||owens national encouraging methods historians scrappiness ndiipp trevor problems preserve problem science watching users successful connections pragmatic approaches instrumental|
|16||||gave collection studies title obstacles reinvented alluded pressure ways notions developments imagine mailed peninsula martin librarianship insights staff consistently|
|17||||students hangen dr worcester online classroom state entry wikipedia access blogs thing order utilize learning world teacher found omeka|
|18||||shawn young academic share scholars means archeology offer academia academics beginning concept forward interest canada innovative related presented initial|
|19||||extent cable apply pages posts maps throw full love technological significant|
|20||||preservation interdisciplinary commented preserving games group set field spend managed professions disciplines bent technologists facing enables maintenance focused alliance|
|21||||cameron ve realized transactions real century skill recommend aspect graduated background life dome grew art twitter www affiliated seek|
|22||||digital humanities work project technology field time working research world interview things university humanist career information years find content|
|23||||humanist mies http php participate suggest situation play good cultural world happy coming apposed possessing christorpher lisa publishers ties|
|24||||young mr internet explore informed theater visualization berkeley talking online ten sort allowing microsoft exploits colleagues improve started voice|
|25||||scholarship forester compelling overwhelmed talking felt experience suggested phenomena budgets slashed tenure dismal appeared futureofthebook link employable door code|
|26||||share west open advice upcoming standing dan boss couple mentions carr innovations terrible rapidly mindset accomplishments philadelphia renaissance fascinating|
|27||||exhibits job passion people college flexible work clean archiving progressive physical curatorial decide due senior keynotes british participated museum|
|28||||melanie schlosser definition librarianship interested librarians publishing arguments resources projects key intersectional introducing chris intriguing assigned omeka community component|
|29||||women flanders stated traditional field dr discipline text serving editing entered mention side aspect kind interesting conversation required inspiration|
Last modified Tue, 25 Sep, 2012 at 15:50