(Text) Mining in West Virginia: Extracting Resources from Our Digital Texts

The Historical Medical Library, as part of its role with the Medical Heritage Library (MHL), is working on a consortium-wide digitization effort, in conjunction with the Internet Archive, to provide scholarly access to the entirety of the State Medical Society Journals published in the 20th century. For an introduction to this project, you can read my previous blog post.

In this post, I would like to explore what I began to discuss at the end of my last post: the application of computer-aided text analysis techniques, also referred to as “text mining.” In this second post in a series about the MHL project and the possibilities for digital scholarship, I will offer an introduction to some of the core concepts of text mining, as well as some easy-to-use, browser-based tools for getting started without the need for a high level of expertise or specialized software. There is a link to some more in-depth resources at the end of this article for people interested in exploring these concepts and processes more fully.

Word cloud created by Voyant Tools from the text of v.1-v.84 of the West Virginia Medical Journal

Voyant Tools

While most rigorous digital humanities projects involve custom-built tools and procedures, often coded in R or Python (computer languages well suited to statistical analysis), there are some out-of-the-box, browser-based tools that any curious person can use. Voyant Tools is a “web-based reading and analysis environment for digital texts” that allows users to analyze their own text(s) by uploading them directly, pasting them in, or providing a URL to them.

Let us look at some simple visualizations based on one volume from one of the recently digitized journals in the Historical Medical Library collections. In order to do this, I downloaded the full-text version of the journal from the Internet Archive and cleaned it up a bit, a necessary part of the process when working with documents created by optical character recognition (OCR). However, if you just want to dive in and play, there is no need to worry about that part of the process; you can simply copy the URL (see image below) of the full-text version and provide it to Voyant. You may see some anomalies in your results, but just understand that they are artifacts of the OCR process. A link to the entirety of the journals that have been digitized so far is here: https://archive.org/details/statemedicalsocietyjournals.
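
For readers curious about what "cleaning it up a bit" can involve, here is a minimal Python sketch of some common OCR cleanup steps. It is not the procedure I used on the journal; the function name clean_ocr_text and the three rules it applies are assumptions chosen for illustration, and real OCR output usually needs rules tailored to the document.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Apply a few generic cleanup rules to OCR output (illustrative only)."""
    # Rejoin words split by a hyphen at a line break: "medi-\ncal" -> "medical".
    # Caveat: this also joins genuine hyphenated compounds that break at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that contain nothing but digits (most likely page numbers).
    lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", " ".join(lines)).strip()

sample = "The medi-\ncal journal\n12\nof   West Virginia"
print(clean_ocr_text(sample))
```

Running the cleanup before analysis keeps page numbers and broken words from showing up as spurious "terms" in the results.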


Let’s get started! Word trends, also referred to as “term frequency,” are an easy way to start examining a text. By default, Voyant charts the most frequent words in a document against one another, though sometimes the most common words in a document end up being the least interesting. As such, you can change the words displayed on the graph by typing them into the field on the bottom left of the graph. Since term frequency is calculated in segments of a text, you can also change the number of segments to change the granularity of the graph, providing more or fewer points of comparison. Try out some combinations. Did you find any interesting (expected or unexpected) results?
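
To make the idea of term frequency by segment concrete, here is a rough Python sketch. It does not reproduce Voyant's own implementation; the function names tokenize and segment_frequencies, the simple letters-only tokenizer, and the default of ten segments are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def segment_frequencies(text, terms, segments=10):
    """Return the relative frequency of each term within each segment.

    The text is cut into at most `segments` consecutive chunks of roughly
    equal length; each result is a dict mapping term -> share of that chunk.
    """
    tokens = tokenize(text)
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    results = []
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        counts = Counter(chunk)
        results.append({term: counts[term] / len(chunk) for term in terms})
    return results

text = "fever cough fever rest cough cough rest rest"
for row in segment_frequencies(text, ["fever", "cough"], segments=2):
    print(row)
```

Raising the number of segments gives more points along the graph but fewer words in each one, so the line becomes noisier; lowering it smooths the trend at the cost of detail.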

Voyant also offers visualizations that are less analytical and more demonstrative. An interesting one you can see working below is called Bubbles, which shows in real time what the term frequency is for words in a document as it works through the document word by word. This may not be immediately revealing, but it helps us to conceptualize the way in which this type of analysis takes place on the computer’s end.

Voyant has many ways in which it can analyze and visualize texts. You can view documentation about the available features here: http://voyant-tools.org/docs/#!/guide/tools. What are some features of Voyant that seem particularly suited to examining the State Medical Society Journals?

Corpora, the core of text analysis

For the sake of a simple demonstration, I have shown you somewhat meaningless results that are derived from one document. However, things begin to get more interesting when comparisons are made across a body of documents, or even between multiple bodies of documents. These bodies are referred to as corpora (corpus is the singular). Corpora need to have bounding ideas that relate the texts to each other in order for the results of analysis to have a clear meaning. As such, the creation and selection of corpora is one of the most essential tasks in performing this type of research. Examples might include all of the works of one type by one author, such as Shakespeare’s sonnets, or 19th-century German histories, or something more complicated, such as 1960s science fiction written by women under a male pseudonym.

Selecting corpora guides the types of questions one might ask. For example, the science fiction corpus above might tell us something about attitudes toward gender in a particular genre of writing when compared against such corpora as 1960s literary fiction by women or 1960s science fiction by male writers.

The State Medical Society Journals currently being digitized by the MHL are ripe for this kind of cross-corpus analysis, as they have naturally occurring geographic and temporal boundaries. For example, one might want to compare how often a particular term for a pathology appears over time against the actual rates of occurrence of that pathology. Additionally, since each state is represented in the collection, it could be possible to look at how coverage of a particular pathology in the text is related to the regional occurrence of that pathology. Like any good research, computer-aided research requires a solid thesis in order to provide one with meaningful insights.
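
As a rough sketch of what such a comparison might look like in Python, the function below counts how often a term appears per 10,000 words within each group of documents, where a group could be a year or a state. The function name term_rate_by_group, the (label, text) input format, and the sample sentences are hypothetical and invented for illustration; they are not drawn from the journals.

```python
import re
from collections import defaultdict

def term_rate_by_group(documents, term, per=10_000):
    """Return occurrences of `term` per `per` words for each group label.

    `documents` is an iterable of (group_label, text) pairs; the label can
    be a year, a state, or any other boundary that defines a corpus.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        hits[label] += tokens.count(term.lower())
        totals[label] += len(tokens)
    # Normalizing by corpus size keeps a long volume from dominating a short one.
    return {label: per * hits[label] / totals[label]
            for label in totals if totals[label]}

# Invented sample text, for illustration only.
docs = [
    (1906, "influenza cases rose"),
    (1906, "no report"),
    (1918, "influenza influenza everywhere today"),
]
print(term_rate_by_group(docs, "influenza"))
```

The resulting rates could then be set beside public health statistics for the same years or regions, which is where the research question, rather than the code, does the real work.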

Miriam Posner from the UCLA Digital Humanities department has provided instructions here for using the command line to bulk download journals from MHL to gather your own text corpora.

A document summary for the West Virginia Medical Journal corpus from Voyant Tools.

That’s all well and good, but what does it mean?

Like any tool, these analyses are only as strong as the person who is interpreting them. The rudimentary examples shown above might not actually tell us much about the text at hand. However, the following are examples from the broader field of digital humanities that demonstrate the power of text analysis to elucidate our texts and to provide innovative means of publishing the findings through interactive data visualizations.

An interesting project by Ben Schmidt, assistant professor of history at Northeastern University and core faculty at the NuLab for Texts, Maps, and Networks, uses D3 data visualization and Bookworm to allow users to examine for themselves the use of language as it relates to gender on Rate My Professors. (It’s worth checking out his many other projects as well for interesting examples of data analysis and data visualization.)


Another example looks at vocabulary size in hip-hop based on the first 35,000 words in a rapper’s corpus (bounded to account for differences in the size of the artist’s output, an important consideration when drawing conclusions).
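
The bounding idea is simple enough to sketch in a few lines of Python. This is not the code behind the hip-hop project, whose tokenization and counting rules I do not know; the function name vocabulary_size and the letters-only tokenizer are assumptions for illustration.

```python
import re

def vocabulary_size(text, limit=35_000):
    """Count the distinct words among the first `limit` words of a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Truncating every corpus to the same length makes the counts comparable,
    # since a longer text would otherwise have more chances to use new words.
    return len(set(tokens[:limit]))

print(vocabulary_size("a b a c d", limit=3))
```

Without the limit, a prolific author would appear to have a larger vocabulary simply by having written more.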


I chose to highlight those two projects because they both garnered attention on social media at the time of their publication, and to show how digital research can have popular appeal.

Deeper into the mine . . .

With this post, I hope I have piqued your interest in some of the ways we can text mine the State Medical Society Journals. For those interested in text mining using a scripting language and building custom tools, a Jupyter notebook page with the raw code for using Python, along with explanations of the processes, can be found here. In coming blog posts I would like to explore in more depth how to work across full corpora, as well as to introduce more complicated concepts such as sentiment analysis, topic modeling, document similarity using vector space modeling, and data visualization.

What are some ideas you might have for ways that the State Medical Society Journals collection could benefit from these digital tools? What projects would you like to see? Please feel free to leave your suggestions and any questions in the comments below, or to e-mail me directly at tdahn@collegeofphysicians.org.