Revision 2523d01d
Added by Adam M. Wilson almost 11 years ago
climate/research/LST_evaluation.Rmd | ||
---|---|---|
14 | 14 |
|
15 | 15 |
```{r,echo=FALSE,results='hide',message=F} |
16 | 16 |
## some setup |
17 |
uploadimages=T # upload all images to imgur.com for easy viewing on github |
|
17 |
uploadimages=T # upload all images to imgur.com for easy viewing on github
|
|
18 | 18 |
opts_knit$set(progress = TRUE, verbose = TRUE,root.dir="~/Downloads",base.url = NULL) |
19 | 19 |
if(uploadimages) opts_knit$set(upload.fun =imgur_upload) |
20 | 20 |
|
... | ... | |
25 | 25 |
library(reshape) |
26 | 26 |
library(plyr) |
27 | 27 |
library(latticeExtra) |
28 |
library(knitr) |
|
28 | 29 |
|
29 | 30 |
## Import the data |
30 | 31 |
setwd("~") |
... | ... | |
94 | 95 |
|
95 | 96 |
Let's see how the seasonal cycle is represented by these proxies for a few randomly selected stations. Here the red line is the observed TMax data, the heavy black line represents the mean LST in that month, and the green line is a three-month moving window, while the purple line is the 2-month window (not including the month of interest). |
96 | 97 |
|
97 |
```{r} |
|
98 |
```{rm,echo=FALSE} |
|
99 |
## subset to 10 stations |
|
98 | 100 |
dlst2l=melt(dlst2,id.vars=c("station","month"),measure.vars=c("TMax","lst","lst3m","lst2m")) |
99 |
ss=sample(dlst2l$station,10)
|
|
101 |
dlst2ls=dlst2l[dlst2l$station%in%sample(unique(dlst2l$station),10),]
|
|
100 | 102 |
``` |
101 | 103 |
|
102 | 104 |
```{r,echo=FALSE} |
103 | 105 |
trellis.par.set(superpose.line = list(col = c("red","black","green","blue"),lwd=c(2,2,1,1)), |
104 | 106 |
superpose.symbol = list(col = c("red","black","green","blue"),lwd=c(2,2,1,1))) |
105 |
xyplot (value~month|station, data=dlst2l[dlst2l$station%in%ss,], groups=variable,type="l",
|
|
107 |
xyplot (value~month|station, data=dlst2ls, groups=variable,type="l",
|
|
106 | 108 |
main="LST at station locations by month", |
107 | 109 |
ylab="LST (monthly)", xlab="LST (2-month Mean)",cex=.5,pch=16,auto.key=list(space="right"),scales=list(x=list(rot=45)), |
108 | 110 |
layout=c(2,5)) |
... | ... | |
130 | 132 |
|
131 | 133 |
In this case a mean would be 'pulled' towards the mean value of month-1. Using a fixed number of observations in t-1 and t+1 and a linear interpolation that accounts for the time of those observations would prevent |
132 | 134 |
|
133 |
## Solution 2 (complex - based on Kalman filter and ancillary model) |
|
135 |
## New Solution 2 (also relatively easy) |
|
136 |
|
|
137 |
1. Create 12 monthly LST climatologies if at least 5 or 10 observations are available for each pixel-month. |
|
138 |
2. Fill in missing months using temporal interpolation: |
|
139 |
* Fit function through all available months (rather than just +/- 1 month) to capture seasonal variabilty. |
|
140 |
* Use some function that can capture the cyclical nature (sin, spline, polynomial, etc.) |
|
141 |
|
|
142 |
|
|
143 |
Remember from above that the LST curves are fairly smooth (caveat: we've only looked at locations in the CONUS dataset). So let's try fitting a simple sin function to estimate missing months: |
|
144 |
```{r,echo=T} |
|
145 |
### Curve fitting function |
|
146 |
### Set parameter starting values and limits |
|
147 |
parms=list(phi=5,A=.05) #starting parameters |
|
148 |
low=list(phi=-999,A=0) #lower bounds - constrain s3 to be positive |
|
149 |
high=list(phi=999,A=999) #lower bounds - constrain s3 to be positive |
|
150 |
|
|
151 |
## define a simple model estimating a sin function with variable shift and amplitude |
|
152 |
fit.model<-function(parms,x) { |
|
153 |
fit=sin(parms[["phi"]]+(x*2*pi))*abs(parms[["A"]]) |
|
154 |
fit[fit==Inf]=.Machine$double.xmax #if Inf, replace with maximum value |
|
155 |
fit[fit==-Inf]=.Machine$double.xmin #if Inf, replace with maximum value |
|
156 |
list(fit=fit) |
|
157 |
} |
|
158 |
## function to minimize the RMSE of predictions |
|
159 |
minimize.model<-function(parms,x,y) { |
|
160 |
ss=sum((y-fit.model(parms,x)$fit)^2) |
|
161 |
ss=min(ss,.Machine$double.xmax) #if Inf, replace with maximum value |
|
162 |
ss=max(ss,.Machine$double.xmin) #if -Inf, replace with maximum value |
|
163 |
list(ss=ss) |
|
164 |
} |
|
165 |
|
|
166 |
## Process station data and make predictions for each month |
|
167 |
makepreds=function(dat,drop=NULL){ |
|
168 |
dat=na.omit(dat) |
|
169 |
dat=dat[dat$variable=="lst",] |
|
170 |
## drop |
|
171 |
if(nrow(dat)==0){ return(NULL)} |
|
172 |
## drop some proportion of data if drop is specified |
|
173 |
if(!is.null(drop)) dat=dat[sample(1:nrow(dat),round(nrow(dat)*(1-drop))),] |
|
174 |
dat=dat[order(dat$month),] |
|
175 |
dat$time=(1:nrow(dat))/nrow(dat) |
|
176 |
xbar=mean(dat$value,na.rm=T) |
|
177 |
fit=optim(unlist(parms),minimize.model, |
|
178 |
x=dat$time,y=dat$value-xbar, |
|
179 |
control=list(trace=0,maxit=10000),method="L-BFGS-B",lower=low,upper=high) |
|
180 |
|
|
181 |
dat$pred=xbar+sin(fit$par[["phi"]]+(dat$time*2*pi))*fit$par[["A"]] |
|
182 |
dat$A=fit$par[["A"]] |
|
183 |
dat$phi=fit$par[["phi"]] |
|
184 |
dat |
|
185 |
} |
|
186 |
## apply the function |
|
187 |
d4=ddply(dlst2ls,.(station),.fun=makepreds) |
|
188 |
d4l=melt(d4,id.vars=c("station","month"),measure.vars=c("value","pred")) |
|
189 |
``` |
|
134 | 190 |
|
135 |
We could also re-imagine the interpolation of daily missing pixel values based on the temporal trend of previous and future observation driven by an ancillary modelled data [http://dx.doi.org/10.1016/j.rse.2007.07.007]. The modelled data can be obtained from external climate models such as NCEP Reanalysis (at >0.5 degree of resolution). For example: |
|
191 |
Let's see what that looks for the 10 example stations above. The blue line is the observed LST value and the pink line is the predicted LST from the sinusoidal function. |
|
192 |
```{r,echo=FALSE} |
|
193 |
xyplot (value~month|station, data=d4l, groups=variable,type="l", |
|
194 |
main="LST at station locations by month", |
|
195 |
ylab="LST (monthly)", xlab="LST (2-month Mean)",cex=.5,pch=16,auto.key=list(space="right"),scales=list(x=list(rot=45))) |
|
196 |
``` |
|
197 |
And a histogram of the residuals for these stations: |
|
198 |
```{r,echo=FALSE,fig.height=3} |
|
199 |
d4$anomoly=d4$value-d4$pred |
|
200 |
hist(d4$anomoly,col="grey",main="Histogram of residuals for predictions from sinusoidal model",xlab="Degrees C") |
|
201 |
``` |
|
202 |
|
|
203 |
Not bad... Now let's drop 25% of the observations from each station and do it again: |
|
204 |
```{r,echo=F} |
|
205 |
## apply the function |
|
206 |
d4a=ddply(dlst2ls,.(station),.fun=makepreds,drop=.25) |
|
207 |
d4a$anomoly=d4a$value-d4a$pred |
|
208 |
d4al=melt(d4a,id.vars=c("station","month"),measure.vars=c("value","pred")) |
|
136 | 209 |
|
137 |
![title](http://i.imgur.com/qZs0KZn.png) |
|
210 |
xyplot (value~month|station, data=d4al, groups=variable,type="l", |
|
211 |
main="LST at station locations by month", |
|
212 |
ylab="LST (monthly)", xlab="LST (2-month Mean)",cex=.5,pch=16,auto.key=list(space="right"),scales=list(x=list(rot=45))) |
|
213 |
``` |
|
214 |
|
|
215 |
And the of residuals for these 10 stations with 25% of the observations removed: |
|
216 |
```{r,echo=FALSE,fig.height=3} |
|
217 |
hist(d4a$anomoly,col="grey",main="Histogram of residuals for predictions from sinusoidal model",xlab="Degrees C") |
|
218 |
``` |
|
219 |
|
|
220 |
### Apply this function to the full CONUS dataset described above. |
|
221 |
Now look at the distributions of the residuals for the full conus dataset. |
|
222 |
```{r,echo=FALSE,fig.height=3} |
|
223 |
## apply the function |
|
224 |
d5=ddply(dlst2l,.(station),.fun=makepreds,drop=.25) |
|
225 |
d5$anomoly=d5$value-d5$pred |
|
226 |
hist(d5$anomoly,col="grey",main="Histogram of residuals for predictions from sinusoidal model",xlab="Degrees C") |
|
227 |
``` |
|
228 |
And summarize the residuals into RMSE's by station: |
|
229 |
```{r, echo=FALSE,fig.height=3} |
|
230 |
d5_rmse=ddply(d5,.(station),.fun=function(dat){ |
|
231 |
sqrt(mean( (dat$pred - dat$value)^2, na.rm = TRUE)) |
|
232 |
}) |
|
233 |
hist(d5_rmse$V1,col="grey",main="Distribution of station RMSE's for predictions from sinusoidal model",xlab="RMSE") |
|
234 |
``` |
|
235 |
So the vast majority of stations will have a RMSE of <5 using this simple method even if 25% of the data are missing. |
|
138 | 236 |
|
139 |
However, this would require a major reorganization of how we are approaching the problem. |
|
237 |
# Next steps |
|
238 |
We should pick 3-4 problematic tiles and 1 tile with little missing data and explore how these various options differ in infilling LST. It would also be ideal if we could complete the interpolation using these different datasets and compare the resulting estimates of TMax over these tiles. |
|
140 | 239 |
|
141 | 240 |
# Remaining Questions |
142 | 241 |
|
climate/research/LST_evaluation.html | ||
---|---|---|
259 | 259 |
<h1>TMax~LST relationship seasonal variability</h1> |
260 | 260 |
|
261 | 261 |
<p>First let's explore the variability of the TMax~LST relationship by month, grouped by tile. In this plot grey lines indicate a Tmax~LST regression within each tile (stations may be present in multiple tiles). Variability among grey lines represents spatial variability in the fitted lapse rate and the heavy black line is overall mean relationship (by month). |
262 |
<img src="http://i.imgur.com/V5b3GSo.png" alt="plot of chunk unnamed-chunk-4"/> </p>
|
|
262 |
<img src="http://i.imgur.com/RZbujnI.png" alt="plot of chunk unnamed-chunk-4"/> </p>
|
|
263 | 263 |
|
264 | 264 |
<h1>Comparison of monthly means with 3-month moving window</h1> |
265 | 265 |
|
266 | 266 |
<p>Here we compare the monthly means with a three month moving window (e.g. January mean LST with December-February mean LST). Note that the relationship is very good (R<sup>2</sup> >0.95) but slightly weaker in winter months, likely due primarily to seasonal minimums. Heavy black line is 1:1, and thin line is the fitted regression.</p> |
267 | 267 |
|
268 |
<p><img src="http://i.imgur.com/A0BKGfp.png" alt="plot of chunk unnamed-chunk-5"/> </p>
|
|
268 |
<p><img src="http://i.imgur.com/aAmbMaz.png" alt="plot of chunk unnamed-chunk-5"/> </p>
|
|
269 | 269 |
|
270 | 270 |
<h1>Comparison of monthly means with 2-month moving window that does not include the month</h1> |
271 | 271 |
|
272 | 272 |
<p>Here we compare the monthly means with a two month moving window that does not include the month of interest (e.g. January mean LST with December and February mean LST, but not including the January LST). Heavy black line is 1:1, and thin line is the fitted regression.</p> |
273 | 273 |
|
274 |
<p><img src="http://i.imgur.com/ykO8AJI.png" alt="plot of chunk unnamed-chunk-6"/> </p>
|
|
274 |
<p><img src="http://i.imgur.com/AeJ0AFq.png" alt="plot of chunk unnamed-chunk-6"/> </p>
|
|
275 | 275 |
|
276 | 276 |
<p>Now let's look at the differences between the 1-month and 2-month LST values. These represent a measure of how wrong we would be if we only had data from the two surrounding months and not the month in question. </p> |
277 | 277 |
|
278 | 278 |
<pre><code class="r">histogram(dlst2$lst - dlst2$lst2m, col = "grey", xlab = "Anomolies (1 month - 2 month means)") |
279 | 279 |
</code></pre> |
280 | 280 |
|
281 |
<p><img src="http://i.imgur.com/N8R2Iln.png" alt="plot of chunk unnamed-chunk-7"/> </p>
|
|
281 |
<p><img src="http://i.imgur.com/qIlsdWo.png" alt="plot of chunk unnamed-chunk-7"/> </p>
|
|
282 | 282 |
|
283 | 283 |
<p>The 95th quantile of the absolute value is only 4.2, so the differences are quite small. From this analysis, it appears that broadening the temporal window will maintain a relatively consistent estimate of LST (R<sup>2</sup> ranged from 0.88-0.9) even when using only data from the surrounding months.</p> |
284 | 284 |
|
285 | 285 |
<p>Let's see how the seasonal cycle is represented by these proxies for a few randomly selected stations. Here the red line is the observed TMax data, the heavy black line represents the mean LST in that month, and the green line is a three-month moving window, while the purple line is the 2-month window (not including the month of interest).</p> |
286 | 286 |
|
287 |
<pre><code class="r">dlst2l = melt(dlst2, id.vars = c("station", "month"), measure.vars = c("TMax", |
|
288 |
"lst", "lst3m", "lst2m")) |
|
289 |
ss = sample(dlst2l$station, 10) |
|
290 |
</code></pre> |
|
291 |
|
|
292 |
<p><img src="http://i.imgur.com/Zax7J5C.png" alt="plot of chunk unnamed-chunk-9"/> </p> |
|
287 |
<p><img src="http://i.imgur.com/Ve0mC1C.png" alt="plot of chunk unnamed-chunk-8"/> </p> |
|
293 | 288 |
|
294 | 289 |
<h1>Processing Options</h1> |
295 | 290 |
|
... | ... | |
320 | 315 |
|
321 | 316 |
<p>In this case a mean would be 'pulled' towards the mean value of month-1. Using a fixed number of observations in t-1 and t+1 and a linear interpolation that accounts for the time of those observations would prevent </p> |
322 | 317 |
|
323 |
<h2>Solution 2 (complex - based on Kalman filter and ancillary model)</h2> |
|
318 |
<h2>New Solution 2 (also relatively easy)</h2> |
|
319 |
|
|
320 |
<ol> |
|
321 |
<li>Create 12 monthly LST climatologies if at least 5 or 10 observations are available for each pixel-month.</li> |
|
322 |
<li>Fill in missing months using temporal interpolation: |
|
323 |
|
|
324 |
<ul> |
|
325 |
<li>Fit function through all available months (rather than just +/- 1 month) to capture seasonal variabilty.</li> |
|
326 |
<li>Use some function that can capture the cyclical nature (sin, spline, polynomial, etc.)</li> |
|
327 |
</ul></li> |
|
328 |
</ol> |
|
329 |
|
|
330 |
<p>Remember from above that the LST curves are fairly smooth (caveat: we've only looked at locations in the CONUS dataset). So let's try fitting a simple sin function to estimate missing months:</p> |
|
331 |
|
|
332 |
<pre><code class="r">### Curve fitting function Set parameter starting values and limits |
|
333 |
parms = list(phi = 5, A = 0.05) #starting parameters |
|
334 |
low = list(phi = -999, A = 0) #lower bounds - constrain s3 to be positive |
|
335 |
high = list(phi = 999, A = 999) #lower bounds - constrain s3 to be positive |
|
336 |
|
|
337 |
## define a simple model estimating a sin function with variable shift and |
|
338 |
## amplitude |
|
339 |
fit.model <- function(parms, x) { |
|
340 |
fit = sin(parms[["phi"]] + (x * 2 * pi)) * abs(parms[["A"]]) |
|
341 |
fit[fit == Inf] = .Machine$double.xmax #if Inf, replace with maximum value |
|
342 |
fit[fit == -Inf] = .Machine$double.xmin #if Inf, replace with maximum value |
|
343 |
list(fit = fit) |
|
344 |
} |
|
345 |
## function to minimize the RMSE of predictions |
|
346 |
minimize.model <- function(parms, x, y) { |
|
347 |
ss = sum((y - fit.model(parms, x)$fit)^2) |
|
348 |
ss = min(ss, .Machine$double.xmax) #if Inf, replace with maximum value |
|
349 |
ss = max(ss, .Machine$double.xmin) #if -Inf, replace with maximum value |
|
350 |
list(ss = ss) |
|
351 |
} |
|
352 |
|
|
353 |
## Process station data and make predictions for each month |
|
354 |
makepreds = function(dat, drop = NULL) { |
|
355 |
dat = na.omit(dat) |
|
356 |
dat = dat[dat$variable == "lst", ] |
|
357 |
## drop |
|
358 |
if (nrow(dat) == 0) { |
|
359 |
return(NULL) |
|
360 |
} |
|
361 |
## drop some proportion of data if drop is specified |
|
362 |
if (!is.null(drop)) |
|
363 |
dat = dat[sample(1:nrow(dat), round(nrow(dat) * (1 - drop))), ] |
|
364 |
dat = dat[order(dat$month), ] |
|
365 |
dat$time = (1:nrow(dat))/nrow(dat) |
|
366 |
xbar = mean(dat$value, na.rm = T) |
|
367 |
fit = optim(unlist(parms), minimize.model, x = dat$time, y = dat$value - |
|
368 |
xbar, control = list(trace = 0, maxit = 10000), method = "L-BFGS-B", |
|
369 |
lower = low, upper = high) |
|
370 |
|
|
371 |
dat$pred = xbar + sin(fit$par[["phi"]] + (dat$time * 2 * pi)) * fit$par[["A"]] |
|
372 |
dat$A = fit$par[["A"]] |
|
373 |
dat$phi = fit$par[["phi"]] |
|
374 |
dat |
|
375 |
} |
|
376 |
## apply the function |
|
377 |
d4 = ddply(dlst2ls, .(station), .fun = makepreds) |
|
378 |
d4l = melt(d4, id.vars = c("station", "month"), measure.vars = c("value", "pred")) |
|
379 |
</code></pre> |
|
380 |
|
|
381 |
<p>Let's see what that looks for the 10 example stations above. The blue line is the observed LST value and the pink line is the predicted LST from the sinusoidal function. |
|
382 |
<img src="http://i.imgur.com/oNteXlf.png" alt="plot of chunk unnamed-chunk-10"/> </p> |
|
383 |
|
|
384 |
<p>And a histogram of the residuals for these stations: |
|
385 |
<img src="http://i.imgur.com/1z4IHQs.png" alt="plot of chunk unnamed-chunk-11"/> </p> |
|
386 |
|
|
387 |
<p>Not bad… Now let's drop 25% of the observations from each station and do it again: |
|
388 |
<img src="http://i.imgur.com/6zkd5CB.png" alt="plot of chunk unnamed-chunk-12"/> </p> |
|
389 |
|
|
390 |
<p>And the of residuals for these 10 stations with 25% of the observations removed: |
|
391 |
<img src="http://i.imgur.com/c1Wyh9J.png" alt="plot of chunk unnamed-chunk-13"/> </p> |
|
392 |
|
|
393 |
<h3>Apply this function to the full CONUS dataset described above.</h3> |
|
394 |
|
|
395 |
<p>Now look at the distributions of the residuals for the full conus dataset. |
|
396 |
<img src="http://i.imgur.com/od4iR2S.png" alt="plot of chunk unnamed-chunk-14"/> </p> |
|
397 |
|
|
398 |
<p>And summarize the residuals into RMSE's by station: |
|
399 |
<img src="http://i.imgur.com/KIV2rBk.png" alt="plot of chunk unnamed-chunk-15"/> </p> |
|
324 | 400 |
|
325 |
<p>We could also re-imagine the interpolation of daily missing pixel values based on the temporal trend of previous and future observation driven by an ancillary modelled data [<a href="http://dx.doi.org/10.1016/j.rse.2007.07.007">http://dx.doi.org/10.1016/j.rse.2007.07.007</a>]. The modelled data can be obtained from external climate models such as NCEP Reanalysis (at >0.5 degree of resolution). For example:</p>
|
|
401 |
<p>So the vast majority of stations will have a RMSE of <5 using this simple method even if 25% of the data are missing.</p>
|
|
326 | 402 |
|
327 |
<p><img src="http://i.imgur.com/qZs0KZn.png" alt="title"/></p>
|
|
403 |
<h1>Next steps</h1>
|
|
328 | 404 |
|
329 |
<p>However, this would require a major reorganization of how we are approaching the problem.</p>
|
|
405 |
<p>We should pick 3-4 problematic tiles and 1 tile with little missing data and explore how these various options differ in infilling LST. It would also be ideal if we could complete the interpolation using these different datasets and compare the resulting estimates of TMax over these tiles. </p>
|
|
330 | 406 |
|
331 | 407 |
<h1>Remaining Questions</h1> |
332 | 408 |
|
climate/research/LST_evaluation.md | ||
---|---|---|
28 | 28 |
# TMax~LST relationship seasonal variability |
29 | 29 |
|
30 | 30 |
First let's explore the variability of the TMax~LST relationship by month, grouped by tile. In this plot grey lines indicate a Tmax~LST regression within each tile (stations may be present in multiple tiles). Variability among grey lines represents spatial variability in the fitted lapse rate and the heavy black line is overall mean relationship (by month). |
31 |
![plot of chunk unnamed-chunk-4](http://i.imgur.com/V5b3GSo.png)
|
|
31 |
![plot of chunk unnamed-chunk-4](http://i.imgur.com/RZbujnI.png)
|
|
32 | 32 |
|
33 | 33 |
|
34 | 34 |
|
35 | 35 |
# Comparison of monthly means with 3-month moving window |
36 | 36 |
Here we compare the monthly means with a three month moving window (e.g. January mean LST with December-February mean LST). Note that the relationship is very good (R^2 >0.95) but slightly weaker in winter months, likely due primarily to seasonal minimums. Heavy black line is 1:1, and thin line is the fitted regression. |
37 | 37 |
|
38 |
![plot of chunk unnamed-chunk-5](http://i.imgur.com/A0BKGfp.png)
|
|
38 |
![plot of chunk unnamed-chunk-5](http://i.imgur.com/aAmbMaz.png)
|
|
39 | 39 |
|
40 | 40 |
|
41 | 41 |
# Comparison of monthly means with 2-month moving window that does not include the month |
42 | 42 |
Here we compare the monthly means with a two month moving window that does not include the month of interest (e.g. January mean LST with December and February mean LST, but not including the January LST). Heavy black line is 1:1, and thin line is the fitted regression. |
43 | 43 |
|
44 |
![plot of chunk unnamed-chunk-6](http://i.imgur.com/ykO8AJI.png)
|
|
44 |
![plot of chunk unnamed-chunk-6](http://i.imgur.com/AeJ0AFq.png)
|
|
45 | 45 |
|
46 | 46 |
|
47 | 47 |
Now let's look at the differences between the 1-month and 2-month LST values. These represent a measure of how wrong we would be if we only had data from the two surrounding months and not the month in question. |
... | ... | |
51 | 51 |
histogram(dlst2$lst - dlst2$lst2m, col = "grey", xlab = "Anomolies (1 month - 2 month means)") |
52 | 52 |
``` |
53 | 53 |
|
54 |
![plot of chunk unnamed-chunk-7](http://i.imgur.com/N8R2Iln.png)
|
|
54 |
![plot of chunk unnamed-chunk-7](http://i.imgur.com/qIlsdWo.png)
|
|
55 | 55 |
|
56 | 56 |
The 95th quantile of the absolute value is only 4.2, so the differences are quite small. From this analysis, it appears that broadening the temporal window will maintain a relatively consistent estimate of LST (R^2 ranged from 0.88-0.9) even when using only data from the surrounding months. |
57 | 57 |
|
58 | 58 |
Let's see how the seasonal cycle is represented by these proxies for a few randomly selected stations. Here the red line is the observed TMax data, the heavy black line represents the mean LST in that month, and the green line is a three-month moving window, while the purple line is the 2-month window (not including the month of interest). |
59 | 59 |
|
60 | 60 |
|
61 |
```r |
|
62 |
dlst2l = melt(dlst2, id.vars = c("station", "month"), measure.vars = c("TMax", |
|
63 |
"lst", "lst3m", "lst2m")) |
|
64 |
ss = sample(dlst2l$station, 10) |
|
65 |
``` |
|
66 | 61 |
|
67 | 62 |
|
68 |
![plot of chunk unnamed-chunk-9](http://i.imgur.com/Zax7J5C.png)
|
|
63 |
![plot of chunk unnamed-chunk-8](http://i.imgur.com/Ve0mC1C.png)
|
|
69 | 64 |
|
70 | 65 |
|
71 | 66 |
# Processing Options |
... | ... | |
89 | 84 |
|
90 | 85 |
In this case a mean would be 'pulled' towards the mean value of month-1. Using a fixed number of observations in t-1 and t+1 and a linear interpolation that accounts for the time of those observations would prevent |
91 | 86 |
|
92 |
## Solution 2 (complex - based on Kalman filter and ancillary model) |
|
87 |
## New Solution 2 (also relatively easy) |
|
88 |
|
|
89 |
1. Create 12 monthly LST climatologies if at least 5 or 10 observations are available for each pixel-month. |
|
90 |
2. Fill in missing months using temporal interpolation: |
|
91 |
* Fit function through all available months (rather than just +/- 1 month) to capture seasonal variabilty. |
|
92 |
* Use some function that can capture the cyclical nature (sin, spline, polynomial, etc.) |
|
93 |
|
|
94 |
|
|
95 |
Remember from above that the LST curves are fairly smooth (caveat: we've only looked at locations in the CONUS dataset). So let's try fitting a simple sin function to estimate missing months: |
|
96 |
|
|
97 |
```r |
|
98 |
### Curve fitting function Set parameter starting values and limits |
|
99 |
parms = list(phi = 5, A = 0.05) #starting parameters |
|
100 |
low = list(phi = -999, A = 0) #lower bounds - constrain s3 to be positive |
|
101 |
high = list(phi = 999, A = 999) #lower bounds - constrain s3 to be positive |
|
102 |
|
|
103 |
## define a simple model estimating a sin function with variable shift and |
|
104 |
## amplitude |
|
105 |
fit.model <- function(parms, x) { |
|
106 |
fit = sin(parms[["phi"]] + (x * 2 * pi)) * abs(parms[["A"]]) |
|
107 |
fit[fit == Inf] = .Machine$double.xmax #if Inf, replace with maximum value |
|
108 |
fit[fit == -Inf] = .Machine$double.xmin #if Inf, replace with maximum value |
|
109 |
list(fit = fit) |
|
110 |
} |
|
111 |
## function to minimize the RMSE of predictions |
|
112 |
minimize.model <- function(parms, x, y) { |
|
113 |
ss = sum((y - fit.model(parms, x)$fit)^2) |
|
114 |
ss = min(ss, .Machine$double.xmax) #if Inf, replace with maximum value |
|
115 |
ss = max(ss, .Machine$double.xmin) #if -Inf, replace with maximum value |
|
116 |
list(ss = ss) |
|
117 |
} |
|
118 |
|
|
119 |
## Process station data and make predictions for each month |
|
120 |
makepreds = function(dat, drop = NULL) { |
|
121 |
dat = na.omit(dat) |
|
122 |
dat = dat[dat$variable == "lst", ] |
|
123 |
## drop |
|
124 |
if (nrow(dat) == 0) { |
|
125 |
return(NULL) |
|
126 |
} |
|
127 |
## drop some proportion of data if drop is specified |
|
128 |
if (!is.null(drop)) |
|
129 |
dat = dat[sample(1:nrow(dat), round(nrow(dat) * (1 - drop))), ] |
|
130 |
dat = dat[order(dat$month), ] |
|
131 |
dat$time = (1:nrow(dat))/nrow(dat) |
|
132 |
xbar = mean(dat$value, na.rm = T) |
|
133 |
fit = optim(unlist(parms), minimize.model, x = dat$time, y = dat$value - |
|
134 |
xbar, control = list(trace = 0, maxit = 10000), method = "L-BFGS-B", |
|
135 |
lower = low, upper = high) |
|
136 |
|
|
137 |
dat$pred = xbar + sin(fit$par[["phi"]] + (dat$time * 2 * pi)) * fit$par[["A"]] |
|
138 |
dat$A = fit$par[["A"]] |
|
139 |
dat$phi = fit$par[["phi"]] |
|
140 |
dat |
|
141 |
} |
|
142 |
## apply the function |
|
143 |
d4 = ddply(dlst2ls, .(station), .fun = makepreds) |
|
144 |
d4l = melt(d4, id.vars = c("station", "month"), measure.vars = c("value", "pred")) |
|
145 |
``` |
|
146 |
|
|
147 |
|
|
148 |
Let's see what that looks for the 10 example stations above. The blue line is the observed LST value and the pink line is the predicted LST from the sinusoidal function. |
|
149 |
![plot of chunk unnamed-chunk-10](http://i.imgur.com/oNteXlf.png) |
|
150 |
|
|
151 |
And a histogram of the residuals for these stations: |
|
152 |
![plot of chunk unnamed-chunk-11](http://i.imgur.com/1z4IHQs.png) |
|
153 |
|
|
154 |
|
|
155 |
Not bad... Now let's drop 25% of the observations from each station and do it again: |
|
156 |
![plot of chunk unnamed-chunk-12](http://i.imgur.com/6zkd5CB.png) |
|
157 |
|
|
158 |
|
|
159 |
And the of residuals for these 10 stations with 25% of the observations removed: |
|
160 |
![plot of chunk unnamed-chunk-13](http://i.imgur.com/c1Wyh9J.png) |
|
161 |
|
|
162 |
|
|
163 |
### Apply this function to the full CONUS dataset described above. |
|
164 |
Now look at the distributions of the residuals for the full conus dataset. |
|
165 |
![plot of chunk unnamed-chunk-14](http://i.imgur.com/od4iR2S.png) |
|
93 | 166 |
|
94 |
We could also re-imagine the interpolation of daily missing pixel values based on the temporal trend of previous and future observation driven by an ancillary modelled data [http://dx.doi.org/10.1016/j.rse.2007.07.007]. The modelled data can be obtained from external climate models such as NCEP Reanalysis (at >0.5 degree of resolution). For example: |
|
167 |
And summarize the residuals into RMSE's by station: |
|
168 |
![plot of chunk unnamed-chunk-15](http://i.imgur.com/KIV2rBk.png) |
|
95 | 169 |
|
96 |
![title](http://i.imgur.com/qZs0KZn.png)
|
|
170 |
So the vast majority of stations will have a RMSE of <5 using this simple method even if 25% of the data are missing.
|
|
97 | 171 |
|
98 |
However, this would require a major reorganization of how we are approaching the problem. |
|
172 |
# Next steps |
|
173 |
We should pick 3-4 problematic tiles and 1 tile with little missing data and explore how these various options differ in infilling LST. It would also be ideal if we could complete the interpolation using these different datasets and compare the resulting estimates of TMax over these tiles. |
|
99 | 174 |
|
100 | 175 |
# Remaining Questions |
101 | 176 |
|
Also available in: Unified diff
Adding a new option 2