Algorithm
To quote Wikipedia again:
- Pick an element, called a pivot, from the array.
- Reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
- Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.
Or, roughly, in R:
wfqs1 <- function(x) {
  if (length(x) < 2) return(x)
  pivot <- x[sample(length(x), 1)]
  c(wfqs1(x[x < pivot]), x[x == pivot], wfqs1(x[x > pivot]))
}
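A quick sanity check: wfqs1() should give the same result as base::sort() on random data.
set.seed(1)
x <- runif(20)
identical(wfqs1(x), sort(x))  # expected TRUE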
The implementation by the Julia team does the reordering in place:
qsort = function(a) {
  qsort_kernel = function(lo, hi) {
    i = lo
    j = hi
    while (i < hi) {
      pivot = a[floor((lo+hi)/2)]
      while (i <= j) {
        while (a[i] < pivot) i = i + 1
        while (a[j] > pivot) j = j - 1
        if (i <= j) {
          t = a[i]
          a[i] <<- a[j]
          a[j] <<- t
          i = i + 1;
          j = j - 1;
        }
      }
      if (lo < j) qsort_kernel(lo, j)
      lo = i
      j = hi
    }
  }
  qsort_kernel(1, length(a))
  return(a)
}
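Note that the <<- assignments only modify the copy of a that lives in qsort()'s own call frame; since R passes arguments by value, the caller's vector is untouched and the sorted result has to be taken from the return value. A small illustration:
v <- c(3, 1, 2)
s <- qsort(v)
s  # 1 2 3: the sorted copy returned by qsort()
v  # 3 1 2: the caller's vector is unchanged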
Variations in implementation
After coding the first version, some faster variations were tried. The first three, wfqs2() to wfqs4(), are intended to use fewer comparisons than wfqs1(). The last one, wfqsx(), is intended to build a bit more vectorization into qsort().
wfqs2 <- function(x) {
  if (length(x) < 2) return(x)
  ipivot <- sample(length(x), 1)
  pivot <- x[ipivot]
  c(wfqs2(x[x < pivot]), pivot, wfqs2(x[-ipivot][x[-ipivot] >= pivot]))
}
wfqs3 <- function(x) {
  if (length(x) < 2) return(x)
  ipivot <- sample(length(x), 1)
  pivot <- x[ipivot]
  split <- x < pivot
  split2 <- !split
  split2[ipivot] <- FALSE
  c(wfqs3(x[split]), pivot, wfqs3(x[split2]))
}
wfqs4 <- function(x) {
  if (length(x) < 2) return(x)
  split <- x < x[1]
  c(wfqs4(x[split]), x[1], wfqs4(x[!split][-1]))
}
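A quick check that these variants handle ties the same way base::sort() does:
set.seed(2)
x <- sample(rep(1:5, times = 3))  # data with ties
identical(wfqs2(x), sort(x))      # expected TRUE
identical(wfqs3(x), sort(x))      # expected TRUE
identical(wfqs4(x), sort(x))      # expected TRUE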
wfqsx = function(a) {
  qsort_kernel = function(lo, hi) {
    if (lo >= hi) return()
    if (hi - lo == 1) {
      # only two elements left: swap if needed
      if (a[lo] > a[hi]) {
        t <- a[lo]
        a[lo] <<- a[hi]
        a[hi] <<- t
      }
      return()
    }
    # a[lo] is the pivot; goUp marks the elements that belong to its right
    goUp <- a[(lo+1):hi] > a[lo]
    up <- which(goUp)
    do <- which(!goUp)
    pivottarget <- lo + length(do)
    # keep only the misplaced elements: 'up' in the left block, 'do' in the
    # right block; both vectors have the same length, so they can be swapped
    # pairwise in one vectorized step
    up <- up[up <= length(do)] + lo
    do <- do[do > length(do)] + lo
    t <- a[do]
    a[do] <<- a[up]
    a[up] <<- t
    # move the pivot to its final position
    t <- a[pivottarget]
    a[pivottarget] <<- a[lo]
    a[lo] <<- t
    qsort_kernel(lo, pivottarget - 1)
    qsort_kernel(pivottarget + 1, hi)
  }
  qsort_kernel(1, length(a))
  return(a)
}
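And the same check for the two in-place versions:
set.seed(3)
x <- runif(1000)
identical(qsort(x), sort(x))  # expected TRUE
identical(wfqsx(x), sort(x))  # expected TRUE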
Analysis
Speed without JIT
The first figure shows that sorting time is roughly proportional to the number of items to be sorted. The exception is base::sort(), which shows no size effect up to around 100 items. Even though the differences are not very pronounced on this scale, qsort() is the slowest and wfqs4() is the fastest after base::sort().
Profiling
Profiling seems to show that time is spent simply because it is spent: most of it is not spent in any function which Rprof() registers separately, but as self time in the recursive kernels. Regarding memory usage, if mem.total is anything to go by, qsort() uses quite a bit more than wfqs4().
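The summaries below were produced with Rprof() and summaryRprof(); the full calls are in the Additional code section. The mem.total numbers of the two runs can also be compared directly, roughly like this (assuming the output files of that section):
pq <- summaryRprof(filename = "Rprofqs.out", memory = 'both')
pw <- summaryRprof(filename = "Rprofw4.out", memory = 'both')
# ratio of total memory reported for the two top-level calls
pq$by.total['"qsort"', 'mem.total'] / pw$by.total['"wfqs4"', 'mem.total']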
##### qsort #############################################
$by.self
self.time self.pct total.time total.pct mem.total
"qsort_kernel" 6.72 86.15 7.80 100.00 1671.4
"<GC>" 0.68 8.72 0.68 8.72 112.2
"+" 0.16 2.05 0.16 2.05 31.5
"<" 0.12 1.54 0.12 1.54 21.5
"-" 0.06 0.77 0.06 0.77 5.4
"floor" 0.04 0.51 0.04 0.51 10.2
"<=" 0.02 0.26 0.02 0.26 5.3
$by.total
total.time total.pct mem.total self.time self.pct
"qsort_kernel" 7.80 100.00 1671.4 6.72 86.15
"<Anonymous>" 7.80 100.00 1671.4 0.00 0.00
"eval" 7.80 100.00 1671.4 0.00 0.00
"qsort" 7.80 100.00 1671.4 0.00 0.00
"<GC>" 0.68 8.72 112.2 0.68 8.72
"+" 0.16 2.05 31.5 0.16 2.05
"<" 0.12 1.54 21.5 0.12 1.54
"-" 0.06 0.77 5.4 0.06 0.77
"floor" 0.04 0.51 10.2 0.04 0.51
"<=" 0.02 0.26 5.3 0.02 0.26
$sample.interval
[1] 0.02
$sampling.time
[1] 7.8
##### wfqs4 ###########################################
$by.self
self.time self.pct total.time total.pct mem.total
"wfqs4" 0.98 75.38 1.30 100.00 258.0
"<" 0.10 7.69 0.12 9.23 26.2
"<GC>" 0.08 6.15 0.08 6.15 11.8
"c" 0.06 4.62 0.08 6.15 12.0
"!" 0.04 3.08 0.04 3.08 9.0
"-" 0.02 1.54 0.02 1.54 4.6
".External" 0.02 1.54 0.02 1.54 0.0
$by.total
total.time total.pct mem.total self.time self.pct
"wfqs4" 1.30 100.00 258.0 0.98 75.38
"<Anonymous>" 1.30 100.00 258.0 0.00 0.00
"eval" 1.30 100.00 258.0 0.00 0.00
"<" 0.12 9.23 26.2 0.10 7.69
"<GC>" 0.08 6.15 11.8 0.08 6.15
"c" 0.08 6.15 12.0 0.06 4.62
"!" 0.04 3.08 9.0 0.04 3.08
"-" 0.02 1.54 4.6 0.02 1.54
".External" 0.02 1.54 0.0 0.02 1.54
"runif" 0.02 1.54 0.0 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 1.3
JIT compiling
The JIT is a great equalizer. Every implementation profits, with the exception of base::sort(). It is kind of obvious that base::sort() does not profit from the JIT: the part that does the work is already machine code. The various R implementations end up much closer to each other; small refinements are apparently swamped by the JIT. Nevertheless, wfqs4() is still the fastest after base::sort(), taking half the time of qsort().
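Instead of switching on the global JIT with enableJIT(3), individual functions can also be byte-compiled once with compiler::cmpfun(). A minimal sketch; rebinding the same name matters because wfqs4() calls itself by name:
require(compiler)
library(microbenchmark)
wfqs4 <- cmpfun(wfqs4)  # recursive calls now also reach the byte-compiled version
microbenchmark(wfqs4(runif(1e4)), times = 10)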
Discussion
It cannot be stated enough: when programming in R, vectorize. Vectors make for compact code, and vectors are efficient; a vectorized calculation that does more actual computations can beat complex loops that avoid computations. The speed effect is reduced when the JIT is used, so enabling the JIT should be the first step when an R algorithm is too slow. The second step is starting the profiler and looking for (vectorized) optimizations. When these two fail, it is time to roll out the big guns: (inline) C code, a faster computer, or switching to a faster language altogether.
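To illustrate the vectorization point with a toy example: selecting the elements below a pivot with a logical index in one vectorized step easily beats a loop that grows the same result element by element (loop_filter() below exists only for this illustration).
library(microbenchmark)
x <- runif(1e4)
pivot <- 0.5
loop_filter <- function(x, pivot) {
  # illustrative only: builds the result one element at a time
  out <- numeric(0)
  for (xi in x) if (xi < pivot) out <- c(out, xi)
  out
}
microbenchmark(loop_filter(x, pivot), x[x < pivot], times = 10)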
Additional code
library(microbenchmark)
library(ggplot2)
la <- lapply(1:15, function(i) {
  n <- 10 * 2^i
  rr <- runif(n)
  mb <- microbenchmark(
    wfqs1(rr),
    wfqs2(rr),
    wfqs3(rr),
    wfqs4(rr),
    wfqsx(rr),
    sort(rr),
    qsort(rr),
    times = 10)
  ag <- aggregate(mb$time, list(method = mb$expr), median)
  ag$x <- ag$x / 1e6   # microbenchmark reports nanoseconds; convert to milliseconds
  ag$n <- n
  ag
})
res <- do.call(rbind,la)
png('qs1.nc.png')
p <- ggplot(res, aes(y=x, x=n,col=method))
p +geom_line() +
scale_y_log10('Median sorting time (ms)') +
scale_x_log10() +
labs(x='Number of items',title='No compiling' )
dev.off()
resref <- subset(res,res$method=='sort(rr)',-method)
names(resref)[names(resref)=='x'] <- 'tsort'
res2 <- merge(res,resref)
res2$reltime <- res2$x/res2$tsort
png('qs2.nc.png')
p <- ggplot(res2, aes(y=reltime, x=n,col=method))
p +geom_line() +
scale_x_log10() +
labs(x='Number of items',
y='Median relative sorting time',
title='No compiling' )
dev.off()
Rprof("Rprofqs.out",memory.profiling = TRUE, gc.profiling = TRUE)
y = qsort(runif(100000))
Rprof(NULL)
Rprof("Rprofw4.out",memory.profiling = TRUE, gc.profiling = TRUE)
y = wfqs4(runif(100000))
Rprof(NULL)
summaryRprof(filename = "Rprofqs.out",memory='both')
summaryRprof(filename = "Rprofw4.out",memory='both')
require(compiler)
enableJIT(3)
la <- lapply(1:15, function(i) {
  n <- 10 * 2^i
  rr <- runif(n)
  mb <- microbenchmark(
    wfqs1(rr),
    wfqs2(rr),
    wfqs3(rr),
    wfqs4(rr),
    wfqsx(rr),
    sort(rr),
    qsort(rr),
    times = 10)
  ag <- aggregate(mb$time, list(method = mb$expr), median)
  ag$x <- ag$x / 1e6   # nanoseconds to milliseconds
  ag$n <- n
  ag
})
res <- do.call(rbind,la)
enableJIT(0)
png('qs1.jit3.png')
p <- ggplot(res, aes(y=x, x=n,col=method))
p +geom_line() +
scale_y_log10('Median sorting time (ms)') +
scale_x_log10() +
labs(x='Number of items',title='Enable JIT 3' )
dev.off()
resref <- subset(res,res$method=='sort(rr)',-method)
names(resref)[names(resref)=='x'] <- 'tsort'
res2 <- merge(res,resref)
res2$reltime <- res2$x/res2$tsort
png('qs2.jit3.png')
p <- ggplot(res2, aes(y=reltime, x=n,col=method))
p +geom_line() +
scale_x_log10() +
labs(x='Number of items',
y='Median relative sorting time',
title='Enable JIT 3' )
dev.off()
There was a debate about this on Stack Overflow:
https://stackoverflow.com/questions/9968578/speeding-up-julias-poorly-written-r-examples
The algorithms were deliberately chosen by the Julia team to highlight areas where Julia performs better than R.
tl;dr: One side of the argument is that it is disingenuous to pick an algorithm that reflects badly on R, when smallish fixes can improve the performance substantially. The other side of the argument is that in Julia, you don't need to worry as much about how you construct your algorithm, since it doesn't have the same performance blind spots as R.