Sunday, December 16, 2012

The Eye of the World as word cloud

The Eye of the World is the first book in Robert Jordan's Wheel of Time series. With the last book of the series due to be published soon, I wondered whether natural language processing could be used to examine books like these. For this purpose I downloaded a copy from somewhere undisclosed and analyzed it.

During my experiments with this file I found that the wordcloud package was actually a good way to look at the text. My first attempts, using correspondence analysis, did not give anything useful: with everything plotted on top of each other, the plot is not interesting. Clustering of chapters did not reveal anything nice either. Wordcloud has comparison clouds, which can be used to differentiate between chapters.
I am sure readers can make their own interpretation of the result. Myself, I am surprised by the sheer number of names of places and persons in this first book, even though I know the number of persons in the series is large.
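
To make the idea of a comparison cloud a bit more concrete: as far as I understand the wordcloud documentation, comparison.cloud turns the counts of each document into rates, subtracts the average rate over all documents, and places each term with the document in which it is most over-represented. Below is a minimal base R sketch of that calculation. The words and counts are invented for illustration and are not taken from the book.

```r
# Toy term-by-chapter count matrix; words and counts are invented.
counts <- matrix(c(10, 2, 1,
                    1, 8, 1,
                    5, 5, 5,
                    0, 1, 9),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("rand", "trolloc", "said", "tinker"),
                                 c("ch1", "ch2", "ch3")))

# Rate of each term within its chapter (columns sum to 1)
rates <- sweep(counts, 2, colSums(counts), "/")
# Deviation of each rate from the term's average rate over all chapters
deviation <- rates - rowMeans(rates)
# Chapter in which each term is most over-represented, and by how much
best_chapter <- colnames(counts)[apply(deviation, 1, which.max)]
names(best_chapter) <- rownames(counts)
excess <- apply(deviation, 1, max)
print(data.frame(chapter = best_chapter, excess = round(excess, 3)))
```

A word that is used at the same rate everywhere, like "said" in this toy matrix, has no excess in any chapter, which is why a comparison cloud tends to bring out names and places that are specific to a chapter.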

R code
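library(tm)         # Corpus, tm_map, TermDocumentMatrix (stemDocument needs SnowballC installed)
library(wordcloud)  # comparison.cloud
r1 <- readLines("Robert Jordan - Wheel Of Time 01 - The Eye Of The World.txt")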
#remove text page xxx
pagina <- grep('^Page [[:digit:]]+$',r1)
r1 <- r1[-pagina]
r1 <- sub('Page [[:digit:]]+$','',r1)
# remove empty lines
r1 <- r1[r1!='']
#extract chapter headers
# anchor both alternatives, so only whole header lines match
chapterrow <- grep('^(CHAPTER [[:digit:]]+|PROLOGUE)$',r1)
chapterrow <- c(chapterrow,length(r1)+1)
#extract chapters
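# skip the header line and the title line; collapse joins the lines into one string per chapter
chapters <- sapply(seq_len(length(chapterrow)-1),function(i) 
      paste(r1[(chapterrow[i]+2):(chapterrow[i+1]-1)],collapse=' '))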
chapterrow <- chapterrow[-length(chapterrow)]
#name the chapters
chapternames <- paste(sub('CHAPTER ','',r1[chapterrow]),r1[chapterrow+1])
names(chapters) <- chapternames

# use example processing from tm
EotW <- Corpus(VectorSource(chapters))
EotW <- tm_map(EotW,stripWhitespace)
# recent versions of tm need content_transformer() around base functions
EotW <- tm_map(EotW,content_transformer(tolower))
EotW <- tm_map(EotW,removeWords,stopwords("english"))
EotW <- tm_map(EotW,stemDocument)
EotW <- tm_map(EotW,removePunctuation)

tdmEotW <- TermDocumentMatrix(EotW)

# hclust to put related chapters together
# (method 'ward' is called 'ward.D' since R 3.1.0)
h1 <- hclust(dist(t(sqrt(as.matrix(tdmEotW)))),method='ward.D')

# and make a cloud
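tdmEotW2 <- as.matrix(tdmEotW)[,h1$order]
# the original call is cut off after title.size; any column selection or further arguments are unknown
comparison.cloud(tdmEotW2,random.order=FALSE,scale=c(1.4,.6),title.size=.7)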
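
The hclust step above is only used to reorder the columns, so that similar chapters end up next to each other in the cloud. A small base R sketch of that reordering on an invented count matrix follows. The matrix, the column names A to D and the variable names are made up for illustration. The square root is presumably there to dampen very frequent words, and 'ward.D' assumes R 3.1.0 or later.

```r
# Toy term-by-chapter counts; invented. A and B are similar, as are C and D,
# but the columns are deliberately interleaved.
tdm <- matrix(c(9, 0, 8, 1,
                4, 0, 5, 0,
                0, 9, 1, 7,
                1, 4, 0, 6),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("w1", "w2", "w3", "w4"),
                              c("A", "C", "B", "D")))

# Cluster the chapters (columns) on square-root counts
h <- hclust(dist(t(sqrt(tdm))), method = "ward.D")
# h$order is the left-to-right order of the leaves in the dendrogram
reordered <- tdm[, h$order]
print(colnames(reordered))
```

After the reordering, the columns of similar chapters sit side by side, which is all the post uses the clustering for.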

