Sunday, February 24, 2013

Earthquakes in Netherlands

In the Netherlands we have Natural Gas. Unfortunately winning this gas seems to cause some quakes. As quakes go, they are not strong. However, our buildings are not made to resist quakes, before 1986 they were unheard of, so there is some damage. It is now predicted they could get stronger and more frequent. This caused a bit of a stir in the Netherlands and I wondered if I could display the data so I can understand the quakes a bit better.

Reading the data

The source of the data is knmi.nl. I found a spreadsheet with data upto 28 January 2013 called geinduceerde-bevingen-nl.pdf. The data would not convert very well in calibre, but after zooming out, I could ctrl-A, ctrl-C it from Adobe into notepad. As so often, a bit of post processing is needed. These days I tend to do that in R.
library(ggplot2)
# read as lines of text for processing
Datain <- readLines('aardbeving.txt')
head(Datain)
[1] "Ge├»nduceerde aardbevingen in Nederland"                             
[2] "YYMMDD TIME LOCATION LAT LON X_RD Y_RD INT MAG.DEPTH"               
[3] "19861226 074751.00 Assen 52.992 6.548 232,924 556,587 4.5 2.8 1.0"  
[4] "19871214 204951.05 Hooghalen 52.928 6.552 233,266 549,537 4 2.5 1.5"
[5] "19891201 200914.35 Purmerend 52.529 4.971 126,697 504,593 5 2.7 1.2"
[6] "19910215 021116.54 Emmen 52.771 6.914 257,992 532,491 3.5 2.2 3.0"  
Although not clear from this head, it seems there are a number of problems running this through read.table(). The worst are; each page of the .pdf contains a header, LOCATION can contain a space, INT may be missing, MAG.DEPTH is actually two variables.
#print notes, then remove them
grep('^YYMMDD|Ge|[[:digit:]]{8}',Datain,invert=TRUE,value=TRUE)
[1] "www.knmi.nl"  
[2] "Any use of this material [text, illustrations, graphics] is only permitted when the source www.knmi.nl is acknowledged."
Datain <- grep('^(YYMMDD|[[:digit:]]{8})',Datain,value=TRUE)
#column header is repeated. remove all but first
header <- which(Datain ==Datain[1])
Datain <- Datain[-header[-1]]
# Quote around location names
Datain <- gsub('([[:digit:]]) ([[:alpha:]\'])','\\1 "\\2',Datain,fixed=FALSE)
Datain <- gsub('([[:alpha:]() ]) ([[:digit:]])','\\1" \\2',Datain,fixed=FALSE)
#quote around INT, so it can be read even if it is absent
Datain <- gsub('([[:digit:]]{2,3},[[:digit:]]{3} [[:digit:]]{3},[[:digit:]]{3} )','\\1" ',Datain,fixed=FALSE)
Datain <- gsub('( -?[[:digit:]]*\\.[[:digit:]]* [[:digit:]]*\\.[[:digit:]]*$)',' "\\1',Datain,fixed=FALSE)
# remove one occurence 'not yet known'
Datain <- grep('Nog niet bekend',Datain,value=TRUE,invert=TRUE)
# and split the last two words
Datain <- gsub('MAG.DEPTH','Mag Depth',Datain,fixed=TRUE )
# now read into a dataframe
r1 <- read.table(textConnection(Datain),header=TRUE,
colClasses=c(c('character','character'),rep(NA,8)))
To use the data, I need a proper data-time variable. My interest is limited to Groningen and Drente, so some data is removed. The projection into a map I stole from another blog (rpsychologist.com) and updated for a newer version of ggplot2.
r1$DTC <- as.POSIXct(paste(r1$YYMMDD,r1$TIME),format='%Y%m%d %H%M%S')
r1 <- r1[order(r1$DTC),]
# calc time between Quakes and limit region
r1$diftime <- c(NA,diff(r1$DTC))/3600/24
ylim <- c(52.825,53.5)
xlim <- c(6.3,7.2)
# prepare for plot in map
# source: http://rpsychologist.com/parsing-data-from-a-text-file-and-plotting-where-people-live-using-ggplot2-and-openstreetmaps/
Sys.setenv(NOAWT=1) #call this before loading OpenStreetMap
library(OpenStreetMap)
r1 <- cbind(r1,projectMercator(r1$LAT,r1$LON))
r2 <- r1[r1$LAT>min(ylim) & r1$LON>min(xlim),]

How often?

The first question is: how often are these quakes? I have been using time between quakes as a measure. Since there are loads of quakes, a loess smoothed line is added. I did not use the ggplot2 function for this as I wanted to restrict my axes and that affects the smoother. As a bonus the colour represents magnitude. Nothing much happens in magnitude. KNMI writes (in Dutch) There is an increase in quakes since 2002. I have the feeling it is increasing more and more, but they are restricting in strength categories, so that may explain the difference. I do feel sorry for people having these quakes ever so often and apparently this will continue.
# how often?
lo1 <- loess(diftime ~ as.numeric(DTC),data=r2[-1,] )
lo2 <- loess(Mag ~ as.numeric(DTC),data=r2[-1,] )

topred <- data.frame(DTC=seq(r2$DTC[2],max(r1$DTC),length.out=40))
topred$diftime <- predict(lo1,topred)
topred$Mag <- predict(lo2,topred)

png('earthquake time.png')
ggplot(r2, aes(DTC,diftime,col=Mag)) +
geom_point() +
scale_x_datetime('Date',limits=as.POSIXct(c('1995-01-01','2013-02-01'))) +
scale_colour_gradient(low="blue",high="red") +
scale_y_log10('Time between quakes (days)',limits=c(1e-1,50)) +
geom_line(size=2,data=topred)
dev.off()

Where?

I wanted a good reason to plot some data in a map for ages. I am very happy this is the good occasion. And it is actually very simple. The only difficulty was using colour for progression. ggplot2 didn't like the POSIXct variable DTC, so I had to make a second variable, mydate, which was numeric and mimics years. The red is more recent. Most quakes are north-east of Groningen city, which is the region I associate them with. There are definitely 'hot spots' and 'hot regions'. I was surprised to learn of quakes in Ter Apel (bottom right), and just south of Assen, that is (relatively) far away.
#where is the quake?
mp <- openmap(upperLeft=c(max(ylim),min(xlim)),lowerRight=c(min(ylim),max(xlim)))
r2$mydate <- 1995+(2013-1995)*(as.numeric(r2$DTC-as.POSIXct('1995-01-01')))/
as.numeric(as.POSIXct('2013-01-01')-as.POSIXct('1995-01-01'))
png('earthquake map.png')
autoplot(mp) +
geom_point(aes(x,y, color=mydate,guide='legend'),
alpha = I(7/10), data=r2) +
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank()) +
scale_colour_gradient(low="blue",high="red",guide=guide_legend('Date')) +
theme(legend.position = "bottom")
dev.off()

Sunday, February 17, 2013

A look at strucchange and segmented

After last week's post it was commented that strucchange and segmented would be more suitable for my purpose. I had a look at both. Strucchange can find a jump in a time series, which was what I was looking for. In contrast segmented is more suitable for occasions where rates of effect. This made me think of a post on climate change I made earlier, so I applied segmented to those data

Strucchange and Nile data

Strucchange is a package to detect jumps in data. They have an example of the effect of the Aswan dam on Nile water. Those data are suitable to compare the methods available. For strucchange this is just following the example. The result is an effect on water flow around 1898. There is just one catch. It needs a time series, so regular data over time.
library(strucchange)
library(segmented)
library(R2jags)
library(tree)

data("Nile")
plot(Nile)
bp.nile <- breakpoints(Nile ~ 1)
ci.nile <- confint(bp.nile, breaks = 1)
lines(ci.nile)
Interestingly tree gives 1898.5, about the same answer. Segmented does not work. It is not devised for this kind of step changes.
NileDF <- data.frame(Nile=as.numeric(Nile),year=1871:1970)
tree(Nile ~ year,data=NileDF)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 100 2835000  919.4  

   2) year < 1898.5 28  492000 1098.0  
     4) year < 1889.5 19  388200 1067.0  
       8) year < 1880.5 10  205200 1133.0 *
       9) year > 1880.5 9   92680  994.6 *
     5) year > 1889.5 9   48760 1162.0 *
   3) year > 1898.5 72 1105000  850.0  
     6) year < 1953.5 55  802300  836.1 *
     7) year > 1953.5 17  258600  894.7  
      14) year < 1965.5 12  114300  947.8 *
      15) year > 1965.5 5   29480  767.4 *


segmented(lm(Nile ~1,data=NileDF),~ year,psi=list(year=c(1890)))

Error in segmented.lm(lm(Nile ~ 1, data = NileDF), ~year, 
    psi = list(year = c(1900))) :
only 1 datum in an interval: breakpoint(s) at the boundary or too close each other


The Bayes model is working the Nile data, but the result is not quite the same as strucchange. The interval for the dates is a bit smaller.
model1 <- function() {
  for ( i in 1:N ) {
    category[i] <- (xx[i]>limit) + 1
    yy[i] ~ dnorm( mu[category[i]] , tau ) 
  }
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  mu[1] ~ dnorm(0,1E-6)
  mu[2] ~ dnorm(0,1E-6)
  limit ~ dbeta(1,1)
}
datain <- list(yy=c(NileDF$Nile),xx=seq(0,1,length.out=length(Nile)),N=length(Nile))
params <- c('mu','sigma','limit')
inits <- function() {
  list(mu = rnorm(2,0,1),
      sigma = runif(1,0,1),
      limit=runif(1,0,1))
}
jagsfit <- jags(datain, model=model1, inits=inits, parameters=params,
    progress.bar="gui",n.iter=10000)
limit = as.numeric(jagsfit$BUGSoutput$sims.array[,,'limit'])*100+1871
png('16feb13.2.png')
plot(density(limit),main='Nile flooding',xlab='Year')

Segmented and the climate data

As written above, I used these climate data before. I now realize they are updated every month, 2012 is now complete. As I want to use these data again, a few lines to process the data prior to read.table are added. My choice was to use three breaks, approximately near 1900, 1940 and 1970, to initiate the algorithm. The data shows four phases. A decrease till 1910, an increase till 1940, a flat till 1970 then an increase.
# data from http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt
Datain <- readLines('GLB.Ts+dSST.txt')
#remove empty lines, print notes, then remove them
Datain <- Datain[grep('^ *$',Datain,invert=TRUE)]
grep('^(Year|[[:digit:]]{4})',Datain,value=TRUE,invert=TRUE)
[1] " GLOBAL Land-Ocean Temperature Index in 0.01 degrees Celsius base period: 1951-1980"
[2] " sources: GHCN-v3 1880-12/2012 + SST: ERSST 1880-12/2012" 
[3] " using elimination of outliers and homogeneity adjustment" 
[4] " Notes: 1950 DJF = Dec 1949 - Feb 1950 ; ***** = missing" 
[5] " AnnMean" 
[6] "Divide by 100 to get changes in degrees Celsius (deg-C)." 
[7] "Multiply that result by 1.8(=9/5) to get changes in degrees Fahrenheit (deg-F)." 
[8] "Best estimate for absolute global mean for 1951-1980 is 14.0 deg-C or 57.2 deg-F," 
[9] "so add that to the temperature change if you want to use an absolute scale" 
[10] "(this note applies to global annual means only, J-D and D-N !)" 
[11] "Example -- Table Value : 40" 
[12] " change : 0.40 deg-C or 0.72 deg-F" 
[13] "abs. scale if global annual mean : 14.40 deg-C or 57.92 deg-F" 

Datain <- Datain[grep('^(Year|[[:digit:]]{4})',Datain)]
#column header is repeated. remove all but first
header <- which(Datain ==Datain[1])
Datain <- Datain[-header[-1]]
# and fix *** for missing - the ' ' separator is missing D-N 1880
Datain <- gsub('\\*+',' NA',Datain)
r1 <- read.table(textConnection(Datain),header=TRUE)
plot(J.D ~ Year,data=r1,type='l',,main='GLOBAL Land-Ocean Temperature Index',ylab='0.01 degrees C base period: 1951-1980')
seg <- segmented(lm(J.D~Year,data=r1),~ Year, list(Year=c(1900,1940,1970)))
lines(seg,col='red')

slope(seg)
$Year
           Est. St.Err. t value CI(95%).l CI(95%).u
slope1 -0.68530  0.1647 -4.1600   -1.0110   -0.3593
slope2  1.29300  0.2013  6.4220    0.8944    1.6910
slope3 -0.04426  0.1728 -0.2562   -0.3862    0.2977
slope4  1.67400  0.1095 15.2900    1.4580    1.8910

Note

Given the sensitivity of climate I feel obliged to mention I am not a climate specialist and hence cannot vouch for the appropriateness of the segmented model for these climate data.

Saturday, February 9, 2013

Finding a jump in data

The other day I saw a plot with some data with a clear jump in it. So I wondered if it were possible to determine the position of that jump. Data with a jump is easy enough made:
library(ggplot2)
n <- 100
mydata <- data.frame(x=runif(n))
mydata$y <- rnorm(n,0,1) +ifelse(mydata$x>.5,0,1)
qplot(x=mydata$x,y=mydata$y)

Now to find where that jump was. After some consideration, the most simple way seemed appropriate. Just take any possible point and examine the corrected sums of squares of the parts left and right of the break.
mydata <- mydata[order(mydata$x),]
ss <- function(x) sum((x-mean(x))^2)
mids <- (mydata$x[-1]+mydata$x[-nrow(mydata)])/2
sa <- sapply(mids,function(x) 
      ss(mydata$y[mydata$x<x])+ss(mydata$y[mydata$x>x]))
qplot(x=mids,y=sa, ylab='Sum of squares',xlab='Position of break')
That was very simple. But still too complex. I realized I was recreating a tree with only one node. There is a way to get that first node more quickly:
library(tree)
tree(y~x,data=mydata,)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 100 109.700  0.76210  
   2) x < 0.515681 58  50.430  1.23700  
     4) x < 0.0865511 7   8.534  0.56330 *
     5) x > 0.0865511 51  38.280  1.33000  
      10) x < 0.144596 7   8.355  1.72600 *
      11) x > 0.144596 44  28.650  1.26700  
        22) x < 0.276795 10   3.767  0.71330 *
        23) x > 0.276795 34  20.920  1.43000  
          46) x < 0.380858 16  13.150  1.68100 *
          47) x > 0.380858 18   5.856  1.20600 *
   3) x > 0.515681 42  28.100  0.10580  
     6) x < 0.853806 29  13.950  0.28370 *
     7) x > 0.853806 13  11.190 -0.29090  
      14) x < 0.938267 6   5.031 -0.70380 *
      15) x > 0.938267 7   4.255  0.06312 *
So, the first break is at 0.51. That was easy.

In JAGS

I like JAGS far too much. So, I wanted to do that in JAGS. And it was simple too. Just make two means and (assume) a common standard deviation. The break (called limit when I programmed this, but break is an inconvenient name for a variable in R) has a beta prior.
library(R2jags)
model1 <- function() {
  for ( i in 1:N ) {
    category[i] <- (xx[i]>limit) + 1
    yy[i] ~ dnorm( mu[category[i]] , tau ) 
  }
  tau <- 1/pow(sigma,2)
  sigma ~ dunif(0,100)
  mu[1] ~ dnorm(0,1E-6)
  mu[2] ~ dnorm(0,1E-6)
  limit ~ dbeta(1,1)
}
datain <- list(yy=mydata$y,xx=mydata$x,N=nrow(mydata))
params <- c('mu','sigma','limit')
inits <- function() {
  list(mu = rnorm(2,0,1),
      sigma = runif(1,0,1),
      limit=runif(1,0,1))
}
jagsfit <- jags(datain, model=model1, inits=inits, parameters=params,
    progress.bar="gui",n.iter=10000)
jagsfit

Inference for Bugs model at "C:/Users/.../Temp/RtmpuosigK/model176025bc6712.txt", fit using jags,
 3 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 3000 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
limit      0.528   0.042   0.474   0.504   0.516   0.541   0.629 1.001  3000
mu[1]      1.212   0.128   0.959   1.128   1.213   1.296   1.465 1.002  1400
mu[2]      0.122   0.152  -0.183   0.024   0.121   0.226   0.417 1.001  3000
sigma      0.922   0.069   0.795   0.872   0.917   0.967   1.070 1.002  1100
deviance 265.762   3.751 260.131 262.937 265.553 267.918 274.100 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 7.0 and DIC = 272.8
DIC is an estimate of expected predictive error (lower deviance is better).
So, the posterior median for the break is 0.516 with a slightly higher mean of 0.52. To make things more sweet, how about a nice combined plot? Extract the samples, take a hint from stackoverflow how to arrange the parts and a hint from google groups ggplot2 on the removal of the x=axis label. Mission accomplished.

limit = as.numeric(jagsfit$BUGSoutput$sims.array[,,'limit'])

q1 <- qplot(x=mydata$x,y=mydata$y,xlim=c(0,1),ylab='y') + 
    theme(axis.text.x = element_blank(),axis.title.x=element_blank(),
        axis.ticks.x=element_blank())
d1 <- ggplot(,aes(x=limit)) + geom_density() +
    scale_x_continuous(limits=c(0,1),'x') +
    scale_y_continuous('Density of break position' )

library(gridExtra)
grid.arrange(q1,d1, nrow=2)


Sunday, February 3, 2013

For descriptive statistics, values below LLOQ set to ...

That is what I read the other day. For calculation of descriptive statistics, values below the LLOQ (lower limit of quantification)  were set to.... Then I wondered, wasn't there a trick in JAGS to incorporate the presence of missing data while estimating parameters of a distribution. How would that compare with standard methods such as imputing LLOQ at 0, half LLOQ, LLOQ or just ignoring the missing data? A simulation seems to be called for.

Bayes estimation

A bit of searching gave me a way function to estimate the mean and standard deviation, here wrapped in a function so resulting data has a nice shape.
library(R2jags)
library(plyr)
library(ggplot2)

Bsensnorm <- function(yy,limit) {
  isCensored <- is.na(yy)
  model1 <- function() {
    for ( i in 1:N ) {
      isCensored[i] ~ dinterval( yy[i] , limit )
      yy[i] ~ dnorm( mu , tau ) 
    }
    tau <- 1/pow(sigma,2)
    sigma ~ dunif(0,100)
    mu ~ dnorm(0,1E-6)
  }
    datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)
  params <- c('mu','sigma')
  inits <- function() {
    list(mu = rnorm(0,1),
        sigma = runif(0,1))
  }
  
  jagsfit <- jags(datain, model=model1, inits=inits, 
      parameters=params,progress.bar="gui",
      n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)
  data.frame( system='Bayes Uninf',
      mu_out=jagsfit$BUGSoutput$mean$mu,
      sigma_out=jagsfit$BUGSoutput$mean$sigma)
}

Somewhat informative prior

As the original model did not perform very well at some points, I created an informative prior on the standard deviation. It is not the best prior, but just to give me an idea. To defend having a prior, once a method has a LLOQ, there should be some idea about precision. Otherwise, how is LLOQ determined? Also, in a real world there may be data points with complete data, which allows estimation of the standard deviation in a more hierarchical model.
BsensInf <- function(yy,limit) {
  isCensored <- is.na(yy)
  model2 <- function() {
    for ( i in 1:N ) {
      isCensored[i] ~ dinterval( yy[i] , limit )
      yy[i] ~ dnorm( mu , tau ) 
    }
    tau <- 1/pow(sigma,2)
    sigma ~ dnorm(1,1) %_% T(0.001,)
    mu ~ dnorm(0,1E-6) 
  }
  datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)
  params <- c('mu','sigma')
  inits <- function() {
    list(mu = abs(rnorm(0,1))+.01,
        sigma = runif(.1,1))
  }
  
  jagsfit <- jags(datain, model=model2, inits=inits, parameters=params,progress.bar="gui",
      n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)
  data.frame( system='Bayes Inf',
      mu_out=jagsfit$BUGSoutput$mean$mu,
      sigma_out=jagsfit$BUGSoutput$mean$sigma)
}

Simple imputation

These are the simple imputations each in a separate function.
L0sensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- 0
  data.frame(system='as 0',mu_out=mean(yy),sigma_out=sd(yy))
}
Lhsensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- limit/2
data.frame(system='as half LOQ',mu_out=mean(yy),sigma_out=sd(yy))
}
Llsensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- limit
  data.frame(system='as LOQ',mu_out=mean(yy),sigma_out=sd(yy))
}
LNAsensnorm <- function(yy,limit) {
  yy <- yy[!is.na(yy)]
  data.frame(system='as missing',mu_out=mean(yy),sigma_out=sd(yy))
}

Simulations

The script below is used to the methods with some settings I chose. Basically, the LLOQ is set at 1. The SD is 1. The mean ranges fro 1 to 3.5. The data is cut a a level of 1. The data is simulated from a normal distribution. If more than half of the data is below LLOQ then new samples are drawn until more than half of the data is above LLOQ. 
limitest <- function(yy,limit) {
  yy[yy<limit] <- NA
  do.call(rbind,list(
    Bsensnorm(yy,limit),
    BsensInf(yy,limit),
    L0sensnorm(yy,limit),
    Lhsensnorm(yy,limit),
    Llsensnorm(yy,limit), 
    LNAsensnorm(yy,limit)))
}
simtest <- function(n,muin,sdin,limit) {
  numnotna<- 0
  while(numnotna<n/2) {
    yy <- rnorm(n,muin,sdin)
    numnotna <- sum(yy>limit)
  }
  ll <- limitest(yy,limit)
  ll$n_na <- sum(yy<limit)
  ll
}
datain <- expand.grid(n=c(5,6,8,10,12,15,20,50,80),muin=seq(1,3.5,.05),sdin=1,limit=1)
dataout <- mdply(datain,simtest)

Results for means

After all simulations, a difference is calculated between the simulation mean and the estimated means. This made interpreting the plots much more easy.
dataout$dif <- dataout$mu_out-dataout$muin

Uninformed prior did not perform with very little data

With very little data, the standard deviation gets large and the mean is low. This may just be because there is no information stating they are not large and small respectively, and the chain just wanders of. Anyway, reason enough to find that with only few samples it is not a feasible method. Adding information about the standard deviation much improves this.  

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system %in% c('Bayes Uninf','Bayes Inf'),],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At low number of observations variation of the mean swamps bias.  

The variation in estimates is rather large. Even so, there is a trend visible that removing missing data and imputing LLOQ give too high estimates when mu is low and hence the number of missings is high. When mu is high then all methods obtain the same result.

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system!='Bayes Uninf',],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At intermediate number of observations method choice gets more important 

Missing and imputation of LLOQ are clearly biased. Half LLOQ and 0 do much better. At 15 observations the informed prior is fairly similar to uninformed prior. 
ggplot(dataout[dataout$n %in% c(10,12,15) & dataout$system!='Bayes xUninf',],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At large number of observations bias gets far too large

At this point the missing option is not plotted any more, it is just not competitive. Setting at LLOQ gets a big positive bias, while 0 gets a negative bias, especially for 80 observations. There is no point in setting up an informed prior. Half LLOQ seems least biased out of the simple methods.

ggplot(dataout[dataout$n  %in% c(20,50,80) & dataout$system!='as missing',],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

Standard deviation is negative biased for simple imputation

The uninformed prior estimation is too bad to plot and retain reasonable axes (and setting limits on y means the data outside those limits get eliminated from the smoother too, so that is not an option). Especially setting missing as LLOQ or ignoring them gets too low estimates of standard deviation. The informed prior gets a bit too large standard deviation.
ggplot(dataout[dataout$n%in% c(5,6,8) & dataout$system!='Bayes Uninf',],aes(x=muin,y=sigma_out))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 

ggplot(dataout[dataout$n%in% c(10,12,15) & dataout$system!='Bayes Uninf',],aes(x=muin,y=sigma_out))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 
With a large number of observations half LLOQ results in a negative bias on the standard deviation.

ggplot(dataout[dataout$n %in% c(20,50,80) & dataout$system!='as missing',],aes(x=muin,y=sigma_out))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 

Conclusion

At low number of observations using a Bayesian model and a good prior on the standard deviation is the best way to obtain descriptive statistics. Lacking those setting <LLOQ values to 1/2 LLOQ for estimation of means and to 0 for estimation of standard deviations seems best. But some care and more extensive simulations for the problem a hand are needed.

Finally, setting data to <LLOQ is something that has implications. As a statistician, I think using <LLOQ to display data in listings and tables is a good thing. However, when subsequent calculations by statisticians are needed, they may cause more trouble than they resolve. I have seen simple PCA go down the drain by <LLOQ, which is, truth to say, a shame given the effort in getting data.