Data
The training data are some of the books by Sanderson and Jordan. They form three categories: Robert Jordan's Wheel of Time, other books by Robert Jordan, and various books by Brandon Sanderson.
- The Eye of the World (Wheel of Time) by Robert Jordan
- The Fires of Heaven (Wheel of Time) by Robert Jordan
- Elantris by Brandon Sanderson
- Warbreaker by Brandon Sanderson
- Prince of the Blood (other) by Robert Jordan
- Conan the Defender (other) by Robert Jordan
The test set consists of three books:
- Knife of Dreams (Wheel of Time) by Robert Jordan
- Mistborn by Brandon Sanderson
- The Gathering Storm (Wheel of Time) by Brandon Sanderson and Robert Jordan
All books were acquired via darknet and read into R as a vector with one element per chapter. The prologue and epilogue count as separate chapters. For each chapter, the relative frequency of common words is computed, that is, the count of each word divided by the approximate number of words in the chapter. Here, common words are defined as the stop words from the tm package. For example:
tm::stopwords("English")[1:5]
[1] "a" "about" "above" "across" "after"
Two functions were devised to count the relative occurrence of these words per chapter:
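# Count occurrences of the word `what` in the string `where`. The word must be
# surrounded by blanks, optionally with punctuation attached to it.
numwords <- function(what, where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*', what, '[[:punct:]]*[[:blank:]]+', sep=''),
                 where, perl=TRUE, ignore.case=TRUE)
  # gregexpr returns -1 when there is no match
  if (g1[[1]][1] == -1) 0L
  else length(g1[[1]])
}
# For each chapter of `book`, divide the count of every stop word by the
# number of blank runs (a proxy for the number of words in the chapter).
# Result: one row per chapter, one column per stop word.
countwords <- function(book) {
  sw <- tm::stopwords("English")
  la <- lapply(book, function(where) {
    sa <- sapply(sw, function(what) numwords(what, where))
    ntot <- length(gregexpr('[[:blank:]]+',
                            where, perl=TRUE, ignore.case=TRUE)[[1]])
    sa/ntot
  })
  t(do.call(cbind, la))
}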
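As a rough illustration of what "relative occurrence" means here, the sketch below computes it for a made-up two-chapter book. It is not the function above: it splits each chapter into tokens instead of matching regular expressions, and toy_book, toy_words and relfreq are hypothetical names introduced only for this example.

```r
# Toy "book": one element per chapter (made-up text, not from any novel)
toy_book <- c("He did not go, and she didn't stay.",
              "The wheel turns and the ages come and pass.")
# Small hand-picked word list standing in for the tm stop word list
toy_words <- c("and", "not", "didn't", "the")

relfreq <- function(book, words) {
  t(sapply(book, function(chapter) {
    # lower-case, then split on anything that is not a letter or apostrophe
    tokens <- strsplit(tolower(chapter), "[^a-z']+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    # count each word of interest and divide by the chapter's token total
    sapply(words, function(w) sum(tokens == w)) / length(tokens)
  }, USE.NAMES = FALSE))
}

# one row per chapter, one column per word
toy_freq <- relfreq(toy_book, toy_words)
toy_freq
```

The first chapter has 8 tokens and one 'and', so its entry for 'and' is 1/8. The second has 9 tokens and two occurrences of 'the', giving 2/9.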
# words are counted
wtEotW <- countwords(tEotW)
wElantris <- countwords(Elantris)
wtFoH <- countwords(tFoH)
wWarbreaker <- countwords(Warbreaker)
wPotB <- countwords(PotB)
wConan <- countwords(Conan)
wtGS <- countwords(tGS)
wMistborn <- countwords(Mistborn)
wKoD <- countwords(KoD)
Model
A random forest is used because the number of variables (one per stop word) is larger than the number of objects (chapters).
library(randomForest)
# combine the counts and make predictions
all <- rbind(wElantris,wWarbreaker,wtEotW,wtFoH,wPotB,wConan)
# one category label per chapter, in the same order as the rows of all
cats <- factor(c(
    rep('BS',nrow(wElantris)),
    rep('BS',nrow(wWarbreaker)),
    rep('WoT',nrow(wtEotW)),
    rep('WoT',nrow(wtFoH)),
    rep('RJ',nrow(wPotB)),
    rep('RJ',nrow(wConan))
    ),levels=c('BS','WoT','RJ'))
rf1 <- randomForest(y=cats,x=all,importance=TRUE)
rf1
Call:
 randomForest(x = all, y = cats, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 22
        OOB estimate of  error rate: 3.93%
Confusion matrix:
     BS WoT RJ class.error
BS  124   0  1 0.008000000
WoT   0 110  1 0.009009009
RJ    5   4 35 0.204545455
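The error rates in this output follow directly from the confusion matrix. As a small check, the sketch below retypes the counts by hand (conf_mat is a name introduced here, not an object from the analysis) and recomputes the overall out-of-bag error and the per-class errors.

```r
# Confusion matrix copied from the printed output: rows are true categories,
# columns are out-of-bag predictions
lev <- c('BS', 'WoT', 'RJ')
conf_mat <- matrix(c(124,   0,  1,
                       0, 110,  1,
                       5,   4, 35),
                   nrow = 3, byrow = TRUE, dimnames = list(lev, lev))

# overall error: chapters off the diagonal divided by all chapters
oob_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
# per-class error: misclassified chapters divided by the row total
class_error <- 1 - diag(conf_mat) / rowSums(conf_mat)

round(100 * oob_error, 2)
class_error
```

Eleven of the 280 chapters are misclassified, which gives the 3.93% shown above. Nine of those eleven are in the RJ category, the smallest of the three with 44 chapters.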
varImpPlot(rf1)
Words that discriminate between the three categories include 'and', 'didn't' and 'not'. The next figure shows the typical usage of the nine most important words. Note that the data have been scaled at this point to make the display easier to read.
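library(lattice)
# the nine variables with the largest mean decrease in Gini index
im <- importance(rf1)
toshow <- rownames(im)[order(-im[,'MeanDecreaseGini'])][1:9]
# scale each selected column to mean 0 and standard deviation 1
tall <- as.data.frame(scale(all[,toshow]))
tall$chapters <- rownames(tall)
tall$cats <- cats
rownames(tall) <- 1:nrow(tall)
# reshape to long format: one row per chapter and word
propshow <- reshape(tall,direction='long',
    timevar='Word',
    v.names='ScaledScore',
    times=toshow,
    varying=list(toshow))
bwplot( cats ~ScaledScore | Word,data=propshow)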
Based on this, it seems Sanderson uses contractions such as 'didn't', which Jordan did not. Jordan used 'not', 'and' and 'or' more often. 'However' is very much Sanderson.
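The scaling used for the figure is what scale() does by default: each column is centred on its mean and divided by its standard deviation, so frequent and rare words end up on a comparable axis. A minimal sketch with made-up frequencies (toy_x and its values are hypothetical):

```r
# Made-up relative frequencies for two words in three chapters
toy_x <- cbind(and   = c(0.02, 0.03, 0.04),
               didnt = c(0.000, 0.001, 0.002))

# centre each column on 0 and rescale it to standard deviation 1
toy_z <- scale(toy_x)

colMeans(toy_z)       # both columns now have mean 0
apply(toy_z, 2, sd)   # and standard deviation 1
```

Although 'and' is about twenty times more frequent than the contraction in this toy data, both columns become -1, 0, 1 after scaling, so only the differences between chapters remain.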
Predictions
For predictions I took the proportion of trees voting for each category, as this shows some of the uncertainty in the categorization, which I find of interest. Density plots are used to display the predictions. Each panel shows the strength of the association between a book and a category: the higher the values, the stronger the association. Each row represents a book, each column a category.
ptGS <- predict(rf1,wtGS,type='prob')
pMistborn <- predict(rf1,wMistborn,type='prob')
pKoD <- predict(rf1,wKoD,type='prob')
preds <- as.data.frame(rbind(ptGS,pMistborn,pKoD))
preds$Book <- c(rep('the Gathering Storm',nrow(ptGS)),
    rep('Mistborn',nrow(pMistborn)),rep('Knife of Dreams',nrow(pKoD)))
predshow <- reshape(preds,direction='long',
    timevar='Prediction',v.names='Score',times=c('BS','WoT','RJ'),
    varying=list(w=c('BS','WoT','RJ')))
densityplot(~Score | Prediction + Book,data=predshow)
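To make the "proportion of trees" idea concrete, the sketch below uses invented vote counts for one chapter in a 500-tree forest, then averages invented proportions over three chapters of one book. All numbers and names (toy_votes, toy_prob, book_mean) are hypothetical, and the per-book average is an extra summary for illustration; the post itself shows the per-chapter scores as density plots.

```r
lev <- c('BS', 'WoT', 'RJ')

# Hypothetical votes of 500 trees for a single chapter
toy_votes <- factor(c(rep('BS', 320), rep('WoT', 150), rep('RJ', 30)),
                    levels = lev)
# proportion of trees voting for each category
vote_prop <- table(toy_votes) / length(toy_votes)
vote_prop

# Hypothetical vote proportions for three chapters of one book
toy_prob <- rbind(c(0.64, 0.30, 0.06),
                  c(0.70, 0.25, 0.05),
                  c(0.55, 0.40, 0.05))
colnames(toy_prob) <- lev
# average association of the book with each category
book_mean <- colMeans(toy_prob)
book_mean
```

Keeping the proportions rather than only the winning category is what lets a book sit mostly in one category while still leaning towards another.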
Interpretation
Knife of Dreams is correctly categorized as Wheel of Time, and Mistborn is correctly categorized as Sanderson. This shows the predictions perform well, so the item of interest, The Gathering Storm, can be examined. It sits solidly in the Sanderson category. Interestingly, it sits a little less in Sanderson than Mistborn does, and a bit more in Wheel of Time.
This was a super cool analysis! Can you post a link to the data you used to do it? Thanks!
Hi Inkhorn,
The data (books) are obviously copyrighted. That means I cannot. Sorry about that.