Project

General

Profile

« Previous | Next » 

Revision 2523d01d

Added by Adam M. Wilson almost 11 years ago

Adding a new option 2

View differences:

climate/research/LST_evaluation.Rmd
14 14

  
15 15
```{r,echo=FALSE,results='hide',message=F}
16 16
## some setup
17
uploadimages=T  # upload all images to imgur.com for easy viewing on github
17
 uploadimages=T  # upload all images to imgur.com for easy viewing on github
18 18
opts_knit$set(progress = TRUE, verbose = TRUE,root.dir="~/Downloads",base.url = NULL)
19 19
if(uploadimages) opts_knit$set(upload.fun =imgur_upload)
20 20

  
......
25 25
library(reshape)
26 26
library(plyr)
27 27
library(latticeExtra)
28
library(knitr)
28 29

  
29 30
## Import the data
30 31
setwd("~")
......
94 95

  
95 96
Let's see how the seasonal cycle is represented by these proxies for a few randomly selected stations.  Here the red line is the observed TMax data, the heavy black line represents the mean LST in that month, and the green line is a three-month moving window, while the purple line is the 2-month window (not including the month of interest).
96 97

  
97
```{r}
98
```{rm,echo=FALSE}
99
## subset to 10 stations
98 100
dlst2l=melt(dlst2,id.vars=c("station","month"),measure.vars=c("TMax","lst","lst3m","lst2m"))
99
ss=sample(dlst2l$station,10)
101
dlst2ls=dlst2l[dlst2l$station%in%sample(unique(dlst2l$station),10),]
100 102
```
101 103

  
102 104
```{r,echo=FALSE}
103 105
trellis.par.set(superpose.line = list(col = c("red","black","green","blue"),lwd=c(2,2,1,1)),
104 106
                superpose.symbol = list(col = c("red","black","green","blue"),lwd=c(2,2,1,1)))
105
xyplot (value~month|station, data=dlst2l[dlst2l$station%in%ss,], groups=variable,type="l",
107
xyplot (value~month|station, data=dlst2ls, groups=variable,type="l",
106 108
        main="LST at station locations by month",
107 109
        ylab="LST (monthly)", xlab="LST (2-month Mean)",cex=.5,pch=16,auto.key=list(space="right"),scales=list(x=list(rot=45)),
108 110
        layout=c(2,5))
......
130 132

  
131 133
In this case a mean would be 'pulled' towards the mean value of month-1.  Using a fixed number of observations in t-1 and t+1 and a linear interpolation that accounts for the time of those observations would prevent 
132 134

  
133
## Solution 2 (complex - based on Kalman filter and ancillary model)
135
## New Solution 2 (also relatively easy)
136

  
137
1. Create 12 monthly LST climatologies if at least 5 or 10 observations are available for each pixel-month.
138
2. Fill in missing months using temporal interpolation:
139
   * Fit function through all available months (rather than just +/- 1 month) to capture seasonal variabilty.
140
   * Use some function that can capture the cyclical nature (sin, spline, polynomial, etc.)
141

  
142

  
143
Remember from above that the LST curves are fairly smooth (caveat: we've only looked at locations in the CONUS dataset).  So let's try fitting a simple sin function to estimate missing months:
144
```{r,echo=T}
145
### Curve fitting function
146
### Set parameter starting values and limits
147
parms=list(phi=5,A=.05)  #starting parameters
148
low=list(phi=-999,A=0)  #lower bounds - constrain s3 to be positive 
149
high=list(phi=999,A=999)  #lower bounds - constrain s3 to be positive
150

  
151
## define a simple model estimating a sin function with variable shift and amplitude
152
fit.model<-function(parms,x) {
153
      fit=sin(parms[["phi"]]+(x*2*pi))*abs(parms[["A"]])
154
      fit[fit==Inf]=.Machine$double.xmax  #if Inf, replace with maximum value
155
      fit[fit==-Inf]=.Machine$double.xmin  #if Inf, replace with maximum value
156
    list(fit=fit)
157
  } 
158
## function to minimize the RMSE of predictions
159
 minimize.model<-function(parms,x,y) {
160
    ss=sum((y-fit.model(parms,x)$fit)^2)
161
    ss=min(ss,.Machine$double.xmax)  #if Inf, replace with maximum value
162
    ss=max(ss,.Machine$double.xmin)  #if -Inf, replace with maximum value
163
    list(ss=ss)
164
  }
165

  
166
## Process station data and make predictions for each month
167
makepreds=function(dat,drop=NULL){
168
  dat=na.omit(dat)
169
  dat=dat[dat$variable=="lst",]
170
  ## drop
171
  if(nrow(dat)==0){ return(NULL)}
172
  ## drop some proportion of data if drop is specified
173
  if(!is.null(drop)) dat=dat[sample(1:nrow(dat),round(nrow(dat)*(1-drop))),]
174
  dat=dat[order(dat$month),]
175
  dat$time=(1:nrow(dat))/nrow(dat)
176
  xbar=mean(dat$value,na.rm=T)
177
  fit=optim(unlist(parms),minimize.model,
178
            x=dat$time,y=dat$value-xbar,
179
            control=list(trace=0,maxit=10000),method="L-BFGS-B",lower=low,upper=high)
180

  
181
  dat$pred=xbar+sin(fit$par[["phi"]]+(dat$time*2*pi))*fit$par[["A"]]
182
  dat$A=fit$par[["A"]]
183
  dat$phi=fit$par[["phi"]]
184
  dat
185
}
186
## apply the function
187
d4=ddply(dlst2ls,.(station),.fun=makepreds)
188
d4l=melt(d4,id.vars=c("station","month"),measure.vars=c("value","pred"))
189
```
134 190

  
135
We could also re-imagine the interpolation of daily missing pixel values based on the temporal trend of previous and future observation driven by an ancillary modelled data [http://dx.doi.org/10.1016/j.rse.2007.07.007].  The modelled data can be obtained from external climate models such as NCEP Reanalysis (at >0.5 degree of resolution). For example:
191
Let's see what that looks  for the 10 example stations above. The blue line is the observed LST value and the pink line is the predicted LST from the sinusoidal function.
192
```{r,echo=FALSE}
193
xyplot (value~month|station, data=d4l, groups=variable,type="l",
194
        main="LST at station locations by month",
195
        ylab="LST (monthly)", xlab="LST (2-month Mean)",cex=.5,pch=16,auto.key=list(space="right"),scales=list(x=list(rot=45)))
196
```
197
And a histogram of the residuals for these stations:
198
```{r,echo=FALSE,fig.height=3}
199
d4$anomoly=d4$value-d4$pred
200
hist(d4$anomoly,col="grey",main="Histogram of residuals for predictions from sinusoidal model",xlab="Degrees C")
201
```
202

  
203
Not bad... Now let's drop 25% of the observations from each station and do it again:
204
```{r,echo=F}
205
## apply the function
206
d4a=ddply(dlst2ls,.(station),.fun=makepreds,drop=.25)
207
d4a$anomoly=d4a$value-d4a$pred
208
d4al=melt(d4a,id.vars=c("station","month"),measure.vars=c("value","pred"))
136 209

  
137
![title](http://i.imgur.com/qZs0KZn.png)
210
xyplot (value~month|station, data=d4al, groups=variable,type="l",
211
        main="LST at station locations by month",
212
        ylab="LST (monthly)", xlab="LST (2-month Mean)",cex=.5,pch=16,auto.key=list(space="right"),scales=list(x=list(rot=45)))
213
```
214

  
215
And the  of residuals for these 10 stations with 25% of the observations removed:
216
```{r,echo=FALSE,fig.height=3}
217
hist(d4a$anomoly,col="grey",main="Histogram of residuals for predictions from sinusoidal model",xlab="Degrees C")
218
```
219

  
220
### Apply this function to the full CONUS dataset described above.
221
Now look at the distributions of the residuals for the full conus dataset.
222
```{r,echo=FALSE,fig.height=3}
223
## apply the function
224
d5=ddply(dlst2l,.(station),.fun=makepreds,drop=.25)
225
d5$anomoly=d5$value-d5$pred
226
hist(d5$anomoly,col="grey",main="Histogram of residuals for predictions from sinusoidal model",xlab="Degrees C")
227
```
228
And summarize the residuals into RMSE's by station:
229
```{r, echo=FALSE,fig.height=3}
230
d5_rmse=ddply(d5,.(station),.fun=function(dat){
231
  sqrt(mean( (dat$pred - dat$value)^2, na.rm = TRUE))
232
})
233
hist(d5_rmse$V1,col="grey",main="Distribution of station RMSE's for predictions from sinusoidal model",xlab="RMSE")
234
```
235
So the vast majority of stations will have a RMSE of <5 using this simple method even if 25% of the data are missing.
138 236

  
139
However, this would require a major reorganization of how we are approaching the problem.
237
# Next steps  
238
We should pick 3-4 problematic tiles and 1 tile with little missing data and explore how these various options differ in infilling LST.  It would also be ideal if we could complete the interpolation using these different datasets and compare the resulting estimates of TMax over these tiles. 
140 239

  
141 240
#  Remaining Questions  
142 241

  
climate/research/LST_evaluation.html
259 259
<h1>TMax~LST relationship seasonal variability</h1>
260 260

  
261 261
<p>First let&#39;s explore the variability of the TMax~LST relationship by month, grouped by tile.  In this plot grey lines indicate a Tmax~LST regression within each tile (stations may be present in multiple tiles). Variability among grey lines represents spatial variability in the fitted lapse rate and the heavy black line is overall mean relationship (by month).
262
<img src="http://i.imgur.com/V5b3GSo.png" alt="plot of chunk unnamed-chunk-4"/> </p>
262
<img src="http://i.imgur.com/RZbujnI.png" alt="plot of chunk unnamed-chunk-4"/> </p>
263 263

  
264 264
<h1>Comparison of monthly means with 3-month moving window</h1>
265 265

  
266 266
<p>Here we compare the monthly means with a three month moving window (e.g. January mean LST with December-February mean LST).  Note that the relationship is very good (R<sup>2</sup> &gt;0.95) but slightly weaker in winter months, likely due primarily to seasonal minimums.  Heavy black line is 1:1, and thin line is the fitted regression.</p>
267 267

  
268
<p><img src="http://i.imgur.com/A0BKGfp.png" alt="plot of chunk unnamed-chunk-5"/> </p>
268
<p><img src="http://i.imgur.com/aAmbMaz.png" alt="plot of chunk unnamed-chunk-5"/> </p>
269 269

  
270 270
<h1>Comparison of monthly means with 2-month moving window that does not include the month</h1>
271 271

  
272 272
<p>Here we compare the monthly means with a two month moving window that does not include the month of interest (e.g. January mean LST with December and February mean LST, but not including the January LST).  Heavy black line is 1:1, and thin line is the fitted regression.</p>
273 273

  
274
<p><img src="http://i.imgur.com/ykO8AJI.png" alt="plot of chunk unnamed-chunk-6"/> </p>
274
<p><img src="http://i.imgur.com/AeJ0AFq.png" alt="plot of chunk unnamed-chunk-6"/> </p>
275 275

  
276 276
<p>Now let&#39;s look at the differences between the 1-month and 2-month LST values.  These represent a measure of how wrong we would be if we only had data from the two surrounding months and not the month in question.  </p>
277 277

  
278 278
<pre><code class="r">histogram(dlst2$lst - dlst2$lst2m, col = &quot;grey&quot;, xlab = &quot;Anomolies (1 month - 2 month means)&quot;)
279 279
</code></pre>
280 280

  
281
<p><img src="http://i.imgur.com/N8R2Iln.png" alt="plot of chunk unnamed-chunk-7"/> </p>
281
<p><img src="http://i.imgur.com/qIlsdWo.png" alt="plot of chunk unnamed-chunk-7"/> </p>
282 282

  
283 283
<p>The 95th quantile of the absolute value is only 4.2, so the differences are quite small. From this analysis, it appears that broadening the temporal window will maintain a relatively consistent estimate of LST (R<sup>2</sup> ranged from 0.88-0.9) even when using only data from the surrounding months.</p>
284 284

  
285 285
<p>Let&#39;s see how the seasonal cycle is represented by these proxies for a few randomly selected stations.  Here the red line is the observed TMax data, the heavy black line represents the mean LST in that month, and the green line is a three-month moving window, while the purple line is the 2-month window (not including the month of interest).</p>
286 286

  
287
<pre><code class="r">dlst2l = melt(dlst2, id.vars = c(&quot;station&quot;, &quot;month&quot;), measure.vars = c(&quot;TMax&quot;, 
288
    &quot;lst&quot;, &quot;lst3m&quot;, &quot;lst2m&quot;))
289
ss = sample(dlst2l$station, 10)
290
</code></pre>
291

  
292
<p><img src="http://i.imgur.com/Zax7J5C.png" alt="plot of chunk unnamed-chunk-9"/> </p>
287
<p><img src="http://i.imgur.com/Ve0mC1C.png" alt="plot of chunk unnamed-chunk-8"/> </p>
293 288

  
294 289
<h1>Processing Options</h1>
295 290

  
......
320 315

  
321 316
<p>In this case a mean would be &#39;pulled&#39; towards the mean value of month-1.  Using a fixed number of observations in t-1 and t+1 and a linear interpolation that accounts for the time of those observations would prevent </p>
322 317

  
323
<h2>Solution 2 (complex - based on Kalman filter and ancillary model)</h2>
318
<h2>New Solution 2 (also relatively easy)</h2>
319

  
320
<ol>
321
<li>Create 12 monthly LST climatologies if at least 5 or 10 observations are available for each pixel-month.</li>
322
<li>Fill in missing months using temporal interpolation:
323

  
324
<ul>
325
<li>Fit function through all available months (rather than just +/- 1 month) to capture seasonal variabilty.</li>
326
<li>Use some function that can capture the cyclical nature (sin, spline, polynomial, etc.)</li>
327
</ul></li>
328
</ol>
329

  
330
<p>Remember from above that the LST curves are fairly smooth (caveat: we&#39;ve only looked at locations in the CONUS dataset).  So let&#39;s try fitting a simple sin function to estimate missing months:</p>
331

  
332
<pre><code class="r">### Curve fitting function Set parameter starting values and limits
333
parms = list(phi = 5, A = 0.05)  #starting parameters
334
low = list(phi = -999, A = 0)  #lower bounds - constrain s3 to be positive 
335
high = list(phi = 999, A = 999)  #lower bounds - constrain s3 to be positive
336

  
337
## define a simple model estimating a sin function with variable shift and
338
## amplitude
339
fit.model &lt;- function(parms, x) {
340
    fit = sin(parms[[&quot;phi&quot;]] + (x * 2 * pi)) * abs(parms[[&quot;A&quot;]])
341
    fit[fit == Inf] = .Machine$double.xmax  #if Inf, replace with maximum value
342
    fit[fit == -Inf] = .Machine$double.xmin  #if Inf, replace with maximum value
343
    list(fit = fit)
344
}
345
## function to minimize the RMSE of predictions
346
minimize.model &lt;- function(parms, x, y) {
347
    ss = sum((y - fit.model(parms, x)$fit)^2)
348
    ss = min(ss, .Machine$double.xmax)  #if Inf, replace with maximum value
349
    ss = max(ss, .Machine$double.xmin)  #if -Inf, replace with maximum value
350
    list(ss = ss)
351
}
352

  
353
## Process station data and make predictions for each month
354
makepreds = function(dat, drop = NULL) {
355
    dat = na.omit(dat)
356
    dat = dat[dat$variable == &quot;lst&quot;, ]
357
    ## drop
358
    if (nrow(dat) == 0) {
359
        return(NULL)
360
    }
361
    ## drop some proportion of data if drop is specified
362
    if (!is.null(drop)) 
363
        dat = dat[sample(1:nrow(dat), round(nrow(dat) * (1 - drop))), ]
364
    dat = dat[order(dat$month), ]
365
    dat$time = (1:nrow(dat))/nrow(dat)
366
    xbar = mean(dat$value, na.rm = T)
367
    fit = optim(unlist(parms), minimize.model, x = dat$time, y = dat$value - 
368
        xbar, control = list(trace = 0, maxit = 10000), method = &quot;L-BFGS-B&quot;, 
369
        lower = low, upper = high)
370

  
371
    dat$pred = xbar + sin(fit$par[[&quot;phi&quot;]] + (dat$time * 2 * pi)) * fit$par[[&quot;A&quot;]]
372
    dat$A = fit$par[[&quot;A&quot;]]
373
    dat$phi = fit$par[[&quot;phi&quot;]]
374
    dat
375
}
376
## apply the function
377
d4 = ddply(dlst2ls, .(station), .fun = makepreds)
378
d4l = melt(d4, id.vars = c(&quot;station&quot;, &quot;month&quot;), measure.vars = c(&quot;value&quot;, &quot;pred&quot;))
379
</code></pre>
380

  
381
<p>Let&#39;s see what that looks  for the 10 example stations above. The blue line is the observed LST value and the pink line is the predicted LST from the sinusoidal function.
382
<img src="http://i.imgur.com/oNteXlf.png" alt="plot of chunk unnamed-chunk-10"/> </p>
383

  
384
<p>And a histogram of the residuals for these stations:
385
<img src="http://i.imgur.com/1z4IHQs.png" alt="plot of chunk unnamed-chunk-11"/> </p>
386

  
387
<p>Not bad&hellip; Now let&#39;s drop 25% of the observations from each station and do it again:
388
<img src="http://i.imgur.com/6zkd5CB.png" alt="plot of chunk unnamed-chunk-12"/> </p>
389

  
390
<p>And the  of residuals for these 10 stations with 25% of the observations removed:
391
<img src="http://i.imgur.com/c1Wyh9J.png" alt="plot of chunk unnamed-chunk-13"/> </p>
392

  
393
<h3>Apply this function to the full CONUS dataset described above.</h3>
394

  
395
<p>Now look at the distributions of the residuals for the full conus dataset.
396
<img src="http://i.imgur.com/od4iR2S.png" alt="plot of chunk unnamed-chunk-14"/> </p>
397

  
398
<p>And summarize the residuals into RMSE&#39;s by station:
399
<img src="http://i.imgur.com/KIV2rBk.png" alt="plot of chunk unnamed-chunk-15"/> </p>
324 400

  
325
<p>We could also re-imagine the interpolation of daily missing pixel values based on the temporal trend of previous and future observation driven by an ancillary modelled data [<a href="http://dx.doi.org/10.1016/j.rse.2007.07.007">http://dx.doi.org/10.1016/j.rse.2007.07.007</a>].  The modelled data can be obtained from external climate models such as NCEP Reanalysis (at &gt;0.5 degree of resolution). For example:</p>
401
<p>So the vast majority of stations will have a RMSE of &lt;5 using this simple method even if 25% of the data are missing.</p>
326 402

  
327
<p><img src="http://i.imgur.com/qZs0KZn.png" alt="title"/></p>
403
<h1>Next steps</h1>
328 404

  
329
<p>However, this would require a major reorganization of how we are approaching the problem.</p>
405
<p>We should pick 3-4 problematic tiles and 1 tile with little missing data and explore how these various options differ in infilling LST.  It would also be ideal if we could complete the interpolation using these different datasets and compare the resulting estimates of TMax over these tiles. </p>
330 406

  
331 407
<h1>Remaining Questions</h1>
332 408

  
climate/research/LST_evaluation.md
28 28
# TMax~LST relationship seasonal variability 
29 29

  
30 30
First let's explore the variability of the TMax~LST relationship by month, grouped by tile.  In this plot grey lines indicate a Tmax~LST regression within each tile (stations may be present in multiple tiles). Variability among grey lines represents spatial variability in the fitted lapse rate and the heavy black line is overall mean relationship (by month).
31
![plot of chunk unnamed-chunk-4](http://i.imgur.com/V5b3GSo.png) 
31
![plot of chunk unnamed-chunk-4](http://i.imgur.com/RZbujnI.png) 
32 32

  
33 33

  
34 34

  
35 35
# Comparison of monthly means with 3-month moving window
36 36
Here we compare the monthly means with a three month moving window (e.g. January mean LST with December-February mean LST).  Note that the relationship is very good (R^2 >0.95) but slightly weaker in winter months, likely due primarily to seasonal minimums.  Heavy black line is 1:1, and thin line is the fitted regression.
37 37

  
38
![plot of chunk unnamed-chunk-5](http://i.imgur.com/A0BKGfp.png) 
38
![plot of chunk unnamed-chunk-5](http://i.imgur.com/aAmbMaz.png) 
39 39

  
40 40

  
41 41
# Comparison of monthly means with 2-month moving window that does not include the month
42 42
Here we compare the monthly means with a two month moving window that does not include the month of interest (e.g. January mean LST with December and February mean LST, but not including the January LST).  Heavy black line is 1:1, and thin line is the fitted regression.
43 43

  
44
![plot of chunk unnamed-chunk-6](http://i.imgur.com/ykO8AJI.png) 
44
![plot of chunk unnamed-chunk-6](http://i.imgur.com/AeJ0AFq.png) 
45 45

  
46 46

  
47 47
Now let's look at the differences between the 1-month and 2-month LST values.  These represent a measure of how wrong we would be if we only had data from the two surrounding months and not the month in question.  
......
51 51
histogram(dlst2$lst - dlst2$lst2m, col = "grey", xlab = "Anomolies (1 month - 2 month means)")
52 52
```
53 53

  
54
![plot of chunk unnamed-chunk-7](http://i.imgur.com/N8R2Iln.png) 
54
![plot of chunk unnamed-chunk-7](http://i.imgur.com/qIlsdWo.png) 
55 55

  
56 56
The 95th quantile of the absolute value is only 4.2, so the differences are quite small. From this analysis, it appears that broadening the temporal window will maintain a relatively consistent estimate of LST (R^2 ranged from 0.88-0.9) even when using only data from the surrounding months.
57 57

  
58 58
Let's see how the seasonal cycle is represented by these proxies for a few randomly selected stations.  Here the red line is the observed TMax data, the heavy black line represents the mean LST in that month, and the green line is a three-month moving window, while the purple line is the 2-month window (not including the month of interest).
59 59

  
60 60

  
61
```r
62
dlst2l = melt(dlst2, id.vars = c("station", "month"), measure.vars = c("TMax", 
63
    "lst", "lst3m", "lst2m"))
64
ss = sample(dlst2l$station, 10)
65
```
66 61

  
67 62

  
68
![plot of chunk unnamed-chunk-9](http://i.imgur.com/Zax7J5C.png) 
63
![plot of chunk unnamed-chunk-8](http://i.imgur.com/Ve0mC1C.png) 
69 64

  
70 65

  
71 66
# Processing Options
......
89 84

  
90 85
In this case a mean would be 'pulled' towards the mean value of month-1.  Using a fixed number of observations in t-1 and t+1 and a linear interpolation that accounts for the time of those observations would prevent 
91 86

  
92
## Solution 2 (complex - based on Kalman filter and ancillary model)
87
## New Solution 2 (also relatively easy)
88

  
89
1. Create 12 monthly LST climatologies if at least 5 or 10 observations are available for each pixel-month.
90
2. Fill in missing months using temporal interpolation:
91
   * Fit function through all available months (rather than just +/- 1 month) to capture seasonal variabilty.
92
   * Use some function that can capture the cyclical nature (sin, spline, polynomial, etc.)
93

  
94

  
95
Remember from above that the LST curves are fairly smooth (caveat: we've only looked at locations in the CONUS dataset).  So let's try fitting a simple sin function to estimate missing months:
96

  
97
```r
98
### Curve fitting function Set parameter starting values and limits
99
parms = list(phi = 5, A = 0.05)  #starting parameters
100
low = list(phi = -999, A = 0)  #lower bounds - constrain s3 to be positive 
101
high = list(phi = 999, A = 999)  #lower bounds - constrain s3 to be positive
102

  
103
## define a simple model estimating a sin function with variable shift and
104
## amplitude
105
fit.model <- function(parms, x) {
106
    fit = sin(parms[["phi"]] + (x * 2 * pi)) * abs(parms[["A"]])
107
    fit[fit == Inf] = .Machine$double.xmax  #if Inf, replace with maximum value
108
    fit[fit == -Inf] = .Machine$double.xmin  #if Inf, replace with maximum value
109
    list(fit = fit)
110
}
111
## function to minimize the RMSE of predictions
112
minimize.model <- function(parms, x, y) {
113
    ss = sum((y - fit.model(parms, x)$fit)^2)
114
    ss = min(ss, .Machine$double.xmax)  #if Inf, replace with maximum value
115
    ss = max(ss, .Machine$double.xmin)  #if -Inf, replace with maximum value
116
    list(ss = ss)
117
}
118

  
119
## Process station data and make predictions for each month
120
makepreds = function(dat, drop = NULL) {
121
    dat = na.omit(dat)
122
    dat = dat[dat$variable == "lst", ]
123
    ## drop
124
    if (nrow(dat) == 0) {
125
        return(NULL)
126
    }
127
    ## drop some proportion of data if drop is specified
128
    if (!is.null(drop)) 
129
        dat = dat[sample(1:nrow(dat), round(nrow(dat) * (1 - drop))), ]
130
    dat = dat[order(dat$month), ]
131
    dat$time = (1:nrow(dat))/nrow(dat)
132
    xbar = mean(dat$value, na.rm = T)
133
    fit = optim(unlist(parms), minimize.model, x = dat$time, y = dat$value - 
134
        xbar, control = list(trace = 0, maxit = 10000), method = "L-BFGS-B", 
135
        lower = low, upper = high)
136
    
137
    dat$pred = xbar + sin(fit$par[["phi"]] + (dat$time * 2 * pi)) * fit$par[["A"]]
138
    dat$A = fit$par[["A"]]
139
    dat$phi = fit$par[["phi"]]
140
    dat
141
}
142
## apply the function
143
d4 = ddply(dlst2ls, .(station), .fun = makepreds)
144
d4l = melt(d4, id.vars = c("station", "month"), measure.vars = c("value", "pred"))
145
```
146

  
147

  
148
Let's see what that looks  for the 10 example stations above. The blue line is the observed LST value and the pink line is the predicted LST from the sinusoidal function.
149
![plot of chunk unnamed-chunk-10](http://i.imgur.com/oNteXlf.png) 
150

  
151
And a histogram of the residuals for these stations:
152
![plot of chunk unnamed-chunk-11](http://i.imgur.com/1z4IHQs.png) 
153

  
154

  
155
Not bad... Now let's drop 25% of the observations from each station and do it again:
156
![plot of chunk unnamed-chunk-12](http://i.imgur.com/6zkd5CB.png) 
157

  
158

  
159
And the  of residuals for these 10 stations with 25% of the observations removed:
160
![plot of chunk unnamed-chunk-13](http://i.imgur.com/c1Wyh9J.png) 
161

  
162

  
163
### Apply this function to the full CONUS dataset described above.
164
Now look at the distributions of the residuals for the full conus dataset.
165
![plot of chunk unnamed-chunk-14](http://i.imgur.com/od4iR2S.png) 
93 166

  
94
We could also re-imagine the interpolation of daily missing pixel values based on the temporal trend of previous and future observation driven by an ancillary modelled data [http://dx.doi.org/10.1016/j.rse.2007.07.007].  The modelled data can be obtained from external climate models such as NCEP Reanalysis (at >0.5 degree of resolution). For example:
167
And summarize the residuals into RMSE's by station:
168
![plot of chunk unnamed-chunk-15](http://i.imgur.com/KIV2rBk.png) 
95 169

  
96
![title](http://i.imgur.com/qZs0KZn.png)
170
So the vast majority of stations will have a RMSE of <5 using this simple method even if 25% of the data are missing.
97 171

  
98
However, this would require a major reorganization of how we are approaching the problem.
172
# Next steps  
173
We should pick 3-4 problematic tiles and 1 tile with little missing data and explore how these various options differ in infilling LST.  It would also be ideal if we could complete the interpolation using these different datasets and compare the resulting estimates of TMax over these tiles. 
99 174

  
100 175
#  Remaining Questions  
101 176

  

Also available in: Unified diff