Monday, August 21, 2006

Easy Pieces in Python: Keyword in Context

Yesterday, I showed that it is possible to extract useful information (word frequencies) from a historical source with a few lines of a high-level programming language like Python. Today we continue with another simple demo, keyword in context (KWIC). The basic problem is to split a text into a long list of words, slide a fixed window over the list to find n-grams, and then put each n-gram into a dictionary so we can find the contexts for any given word in the text.

As before, we are going to be working with Charles William Colby's The Fighting Governor: A Chronicle of Frontenac (1915) from Project Gutenberg. We start by reading the text file into a long string and then splitting it into a list of words:

wordlist = open('cca0710-trimmed.txt', 'r').read().split()

Next we run a sliding window over the word list to create a list of n-grams. In this case we are going to be using a window of five words, which will give us two words of context on either side of our keyword.

ngrams = [wordlist[i:i+5] for i in range(len(wordlist)-4)]
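If we want a window of some other size, the same list comprehension can be wrapped in a small function. This is only a sketch: the name sliding_ngrams and the sample sentence are made up for illustration and are not part of the original demo.

```python
def sliding_ngrams(words, n=5):
    """Return every run of n consecutive words as a list of lists."""
    # When the text has fewer than n words, range() is empty and we get [].
    return [words[i:i + n] for i in range(len(words) - n + 1)]

# A made-up sample sentence, just to show the window moving.
sample = 'the quick brown fox jumps over the lazy dog'.split()
for gram in sliding_ngrams(sample, 5):
    print(' '.join(gram))
```

With n=5 this produces the same n-grams as the one-liner above, since len(words) - n + 1 is len(words) - 4.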

We then need to put each n-gram into a dictionary, indexed by the middle word. Since we are using 5-grams, and since Python sequences are numbered starting from zero, we want to use 2 for the index.

kwicdict = {}
for n in ngrams:
    if n[2] not in kwicdict:
        kwicdict[n[2]] = [n]
    else:
        kwicdict[n[2]].append(n)
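One thing to keep in mind is that split() only breaks on whitespace, so 'Iroquois' and 'Iroquois,' end up under different keys. A possible variation, sketched below, strips leading and trailing punctuation from the key and uses collections.defaultdict to avoid the if/else. The function name build_kwic and its strip_punctuation flag are assumptions for illustration, not something from the original demo.

```python
from collections import defaultdict
import string

def build_kwic(ngrams, strip_punctuation=True):
    """Index each n-gram by its middle word."""
    index = defaultdict(list)
    for gram in ngrams:
        middle = len(gram) // 2          # 2 for a 5-gram
        key = gram[middle]
        if strip_punctuation:
            # Only the dictionary key is cleaned; the n-gram itself is kept as-is.
            key = key.strip(string.punctuation)
        index[key].append(gram)
    return index

# Two hand-typed 5-grams to show both forms landing under one key.
grams = [
    ['to', 'the', 'Iroquois,', 'early', 'in'],
    ['of', 'the', 'Iroquois', 'a', 'fort'],
]
index = build_kwic(grams)
print(len(index['Iroquois']))
```

Whether to fold punctuation (or capitalization) into the key depends on the question being asked of the text, so it is left as an option here.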

Finally, we will want to do a bit of formatting so that our results are printed in a way that is easy to read. The code below gets all of the contexts for the keyword 'Iroquois'.

for n in kwicdict['Iroquois']:
    outstring = ' '.join(n[:2]).rjust(20)
    outstring += str(n[2]).center(len(n[2])+6)
    outstring += ' '.join(n[3:])
    print(outstring)

This gives us the following results.

          bears, and   Iroquois   knew that
              of the   Iroquois   villages. At
            with the   Iroquois   at Cataraqui
              to the   Iroquois   early in
              to the   Iroquois   chiefs, Frontenac
         shelter the   Iroquois   from the
          wished the   Iroquois   to see
              of the   Iroquois   a fort
                       ...
       that captured   Iroquois   were burned

This kind of analysis can be useful for historiographical argumentation. If we look at the contexts in which the Iroquois appear in Colby's text, we find that they are usually the objects of verbs or prepositions rather than the subjects of verbs. That is to say that we find a lot of phrases like "to the Iroquois," "make the Iroquois," "overawe the Iroquois," "invite the Iroquois," "with the Iroquois," "smiting the Iroquois," and so on. We find far fewer phrases of the form "[the] Iroquois knew," "the Iroquois rejoiced," or "six hundred Iroquois invaded." This could be taken to suggest that Colby wasn't thinking of the Iroquois as historical agents (which is how most historians see them now) but rather as background characters, as foils for the settlers of New France.
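One rough way to put numbers behind an impression like this is to tally the words that come immediately before and after the keyword. The sketch below is an assumption about how one might do that, not part of the original demo: neighbour_counts is a made-up name, and the four contexts are hand-typed from the sample output rather than read from the full text, so the counts say nothing about the book as a whole.

```python
from collections import Counter

def neighbour_counts(contexts, middle=2):
    """Count the words directly before and directly after the keyword."""
    before = Counter(gram[middle - 1].lower() for gram in contexts)
    after = Counter(gram[middle + 1].lower() for gram in contexts)
    return before, after

# A handful of 5-grams copied by hand from the sample output.
contexts = [
    'bears, and Iroquois knew that'.split(),
    'with the Iroquois at Cataraqui'.split(),
    'to the Iroquois early in'.split(),
    'that captured Iroquois were burned'.split(),
]
before, after = neighbour_counts(contexts)
print(before.most_common(3))
print(after.most_common(3))
```

A tally like this cannot tell subjects from objects on its own, but it can point us to the contexts that are worth reading closely.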
