- Table of contents
- VegBank validation
VegBank validation¶
6 feature requests
completed¶
"I observe that some, but not all, plots in VegBank dump do not have the plot name populated" (CVS validation: #10)this is issue #965
2014-1-x¶
- PRIORITY FEATURE REQUEST: remove embargoed columns from staging tables
extract¶
*VegBank.2014-1-x.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)
subset import command¶
time yes|(export log= version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/{Source,'^taxon_observation.**.sample'}/import_temp by_col=1 n=; make inputs/VegBank/{projectcontributor_,observationcontributor_,observation__community}/import_temp by_col=1 n=; make inputs/VegBank/scrub; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 3.5 min ("3m23.895s") @vegbiendev
query¶
SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT *
FROM "VegBank"."^taxon_observation.**.sample"
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
analytical_plot."datasource" = 'VegBank'
AND analytical_plot."taxonOccurrenceID" = "^taxon_observation.**.sample"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "^taxon_observation.**.sample"."aggregateOrganismObservationID"
ORDER BY "^taxon_observation.**.sample"."locationName", "^taxon_observation.**.sample"."identificationID", "^taxon_observation.**.sample"."aggregateOrganismObservationID"
2013-11-14¶
*VegBank.2013-11-14.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)
add project_name
Bob's conference call feedback¶
OK with no more validation extracts [for VegBank/CVS]
Bob's e-mail feedback:[Bien-db]+Fwd:+VegBank+validation+extract+¶
Yes this is all valid and we can move along
If an option at this point, I would be happier if we managed to retain the VegBank variable project_id_name, at least in addition to the nearly meaningless projectID shown in the BIEN datafile. Indeed, the projectID number is nowhere in the file defined and one needs to return to VegBank to lookup its meaning.
Is adding project_name necessary for you to sign off on VegBank, or is this a feature request?
A highly desirable feature request. I consider that I have already signed off.
Mike Lee's e-mail feedback¶
VALID! I'm happy to sign off on VegBank.
Mike Lee's e-mail feedback on attribution:[Bien-db]+Primary+plot+data+providers+and+projects+in+BIEN3+¶
project name is needed for attributionnow in the normalized DB, though not yet in the validation viewmap projectContributor table for attribution
If the VegBank project name could be captured, that would be ideal.The person responsible for the data is not always clear in VegBank, but it should be either in theobservationContributor tableand/or the projectContributor table.
So we will need to map the projectContributor table as well? Is this also true for CVS?
For VegBank, that would be great.
2013-11-10¶
*VegBank.2013-11-10.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)
e-mail feedback issues¶
individualCount should be NULL instead of defaulting to 1 when not specified
Mike Lee's e-mail feedback:[Bien-db]+Fwd:+VegBank+validation+extract+¶
I have completed the QA of the VegBank to VegBIEN export. It's completely correct
except for one error, and that error is unfortunately systematic. One of the new fields, column BA, (individualCount) is defaulting to 1 instead of returning nulls when stemCount.stemCount in VegBank is null.This underreports the real number of individuals in most cases.For example, on INW18913, Danthonia intermedia has 40% cover over 400 sq. meters. This is a grass, and it's really hard to imagine that it can cover 40% of this space with one plant. It is amusing to try, though, perhaps with incredibly aggressive cloning!
Otherwise, all the new fields look correct, and your adjustments seemed to have worked correctly.
Bob's e-mail feedback:[Bien-db]+Fwd:+VegBank+validation+extract+¶
Yes, it is looking good.
Mike found only one error, which has a default to 1 instead of null where there are no stems.I will try to take a look myself in the very near future, though I expect Mike looked very closely.
2013-10-31¶
*VegBank.2013-10-31.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)
subset import command:
time (export log= version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/{Source,'^taxon_observation.**.sample'}/import_temp by_col=1 n=; make inputs/VegBank/{observationcontributor_,observation__community}/import_temp by_col=1 n=; make inputs/VegBank/scrub; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 3 min ("2m48.125s"), most of it on observationcontributor_, observation__community (together "2m1.005s" vs. "0m25.173s" for ^taxon_observation.**.sample) @starscream; 5 min ("5m18.533s") @vegbiendev
query:
SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT *
FROM "VegBank"."^taxon_observation.**.sample"
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
analytical_plot."datasource" = 'VegBank'
AND analytical_plot."taxonOccurrenceID" = "^taxon_observation.**.sample"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "^taxon_observation.**.sample"."aggregateOrganismObservationID"
WHERE "^taxon_observation.**.sample"."locationID" IN (/*Heritage*/27581, /*CVS*/80476, /*NPS*/25572, /*GAP*/46073, /*Jennings*/11862, /*soil*/27234)
ORDER BY "^taxon_observation.**.sample"."identificationID", "^taxon_observation.**.sample"."aggregateOrganismObservationID"
2013-10-10¶
See *VegBank.2013-10-10.6_plots.xls*
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
2013-10-25 conference call feedback¶
map VegBank taxoninterpretation.currentinterpretation -> taxondetermination.iscurrent- Bob said there shouldn't be multiple taxoninterpretations with currentinterpretation=true, but Mike Lee mentioned earlier that there were
- FEATURE REQUEST (Mike Lee): do not use taxoninterpretations with interpretationtype = "simplification for analysis" as the current interpretation
remove CVS records from VegBank to avoid duplication- identify duplicates by looking for CVS-style (SSN-format) authorplotcodes no, not all plots use SSN format
- all VegBank CVS data is also in CVS extract (Mike Lee), so OK to delete all CVS plots from VegBank extract
Mike Lee's 2013-10-17 conference call feedback¶
- no records missing or garbled
mapseems to have fixed itself since the extract was generatedcoordinateUncertaintyInMeters
addslopeAspect
,slopeGradient
to denormalized view- FEATURE REQUEST (Mike Lee: syntax issue): rename the input scientificName to scientificName_verbatim
omitauthorplantname
because it is not specific to the taxoninterpretation row (this is in a separate taxoninterpretation for the original determination instead)omitint_*plantsci*
fields because they are not specific to the taxoninterpretation row (they are autogenerated from other fields, so are not needed)maptaxoninterpretation.party_id
toidentifiedBy
addwe actually call thisstemCount
to denormalized viewindividualCount
, because for us,stemCount
is the count of stems within the individual, not the size class- FEATURE REQUEST (Mike Lee: needed if doing concept-based taxonomy; Bob: would hate to lose that info if we have it, but OK not to validate this): map
plantconcept.reference_id
toaccordingTo
Mike Lee's e-mail feedback¶
- FEATURE REQUEST (Mike Lee; Bob: OK with validating just the most recent interpretation for each taxonobservation; Brad: might have to live with validating just one taxoninterpretation per taxonobservation in the interest of moving on):
include all taxonInterpretations in validation extractfor now, include latest taxoninterpretation by date (Mike Lee)
Global comment:
We don't seem to have VegBank's accession codes stored in the output stream. This means it will be nearly impossible to update and sort out what's the same vs. what's already been added. Not sure how big of a deal that is, with this being a one-off operation vs. one we'd like to repeat. I'd prefer to have the accessionCode stored in VegBien from not only VegBank, but any outside database with GUIDs.Plot-level information: Everything that ported matched well,
3 fields that I'd include normally were missing in VegBIEN. Perhaps not absolutely CRITICAL, but certainly best-practice to include, especiallylocationAccuracy.coordinateUncertaintyInMeters (important)did not come through, (e.g., 150 m on Peet-Rockies-2), stored in VegBank's plot.LocationAccuracySlopeAspect and SlopeGradient(both in VegBank's plot table) also didn't come throughCover and Taxon Occurrence data: All rows ported, but the
verbatim name didn't come throughandmultiple taxonInterpretations (always problematic) aren't being ported[they are to VegBIEN, just not to the validation view], just the original (unless you got the VegBank data before Oct 3, 2013 [no, we got it on 2013-10-8])It appears that scientificName_verbatim is not taking exactly what is in the taxonObservation.authorPlantName field, but rather the scientific name with authority. I don't see where that field is going to, if anywhere. It's not critically important, but it is the author's original name used for the plant, which is a good idea to retain.
The
taxonInterpretation record that is chosen for each taxonObservation also appears not to be correct, or at least not always. VegBank doesn't police which version is "the only right answer" but some taxonInterpretations are marked as "current" and there can be multiple per taxonObservation, e.g., http://vegbank.org/get/detail/taxonobservation/927046For this record, someone (me), has marked Ludwigia suffruticosa as the updated taxon. I did this on 03-Oct-2012. If your data for pulling into VegBank are older than this date, then the import proceeded correctly. If not, it missed that new interpretation. This happens a couple of times in the dataset that I QA'd. The other is this: http://vegbank.org/get/detail/taxonobservation/927031
One right answer would be to port all interpretations over.
I did not address the issues of duplication, as the query output was duplicated by multiple stems, etc.
Stem Data: Important ERROR in that the
stemCount field didn't port to VegBien. All rows moved though, look good otherwise.
I do not see the number of individuals for each stem. This is technically not necessary, as there was one stem of the size mentioned, but there were also in some cases 60 additional ones. It depends on what the data will be used for.For example, on this plot:
http://vegbank.org/cite/VB.ob.27169.PEETROCKIES2
there are many stems for each stem class record:
http://vegbank.org/get/summary/stemcount/27169?where=where_observation_pkThis missing field is (VegBank's stemCount.stemCount).
plots to include¶
There several different types of plots in VegBank:
Heritage program, e.g., http://vegbank.org/cite/VB.ob.27516.BURNS1
CVS: http://vegbank.org/cite/urn:lsid:cvs.bio.unc.edu:observation:6317-%7BD4888010-F606-4FCE-AD08-ED8E1FC1BB9C%7D
NPS: http://vegbank.org/get/comprehensive/observation/25507
GAP: http://vegbank.org/get/comprehensive/observation/46008
Jennings: http://vegbank.org/cite/VB.Ob.11801.INW18913If all those transition well, I think we are in great shape. I have included plots with stems wherever possible.
One more: This one has userdefined soil variables. Are you even attempting to catch this (it's complicated and perhaps not central to the purpose of what you are doing)? This may be a good candidate for "feature request."
http://vegbank.org/get/std/observation/27169
Scroll down to where the soil variable are. Humic matter present, total N, etc.
tables to include¶
I think that one table id fine for now. We will let you know if there is a problem.
input columns to include¶
The only field I think it important (and even then we would survive without it) is this:
*observation_--dateaccuracyThis is for situations when the exact date of collection is not known, and the date given is approximate. Dateaccuracy says how approximate.
If area is important, i.e. the bounds in which taxa were collected, stems measured, etc., if anyone is to determine stem density, basal area, or work with species richness over area, we need these:
*observation_--taxonobservationarea
*observation_--stemobservationarea
if a particular class of stems are sampling in an area different from the plot, i.e., subsampling via stratum or stems:
*stemcount_--inferencearea
*stemcount_--stemtaxonareaYou seem to have captured plot.area, which is generally the same as the rest of these, but can differ depending on the method of sampling.
But BIEN seems to be an observation database, not a plot database, so perhaps these don't matter. If the plot was really large (e.g., a hectare) then that will affect the location accuracy, as the plot location is accurate only to part of the plot. If the plot extends 100m in one direction, the plant observed could be 100m further away from the geocoordinates than is indicated in the lat/long fields.
We might need some of these, but they might be embedded elsewhere. StratumHeight and stratumBase define where the cover value for that particular stratum starts and stops.
*stemcount_--stratumbase
*stemcount_--stratumheight
*stemcount_--stemdiameteraccuracy
*stemcount_--stemheightaccuracy
*stratum--stratumheight
*stratum--stratumbase
*stratum--stratumcover
*stratumtype--stratumindex
Since these fields are not currently mapped to VegBIEN, do you want me to add them before generating the extract, or would you be OK to validate an extract with the columns we have now?
Since time is of the essence, I am fine to validate without them.
subset import command:
time (export version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/{Source,'^taxon_observation.**.sample'}/import_scrub by_col= n=; make inputs/VegBank/{observationcontributor_,observation__community}/import by_col=1 n=; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 30 min ("27m45.464s"), most of it on ^taxon_observation.**.sample ("0:24:48.082284 sec")
query:
SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT *
FROM "VegBank"."^taxon_observation.**.sample"
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
analytical_plot."datasource" = 'VegBank'
AND analytical_plot."taxonOccurrenceID" = "^taxon_observation.**.sample"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "^taxon_observation.**.sample"."aggregateOrganismObservationID"
WHERE "^taxon_observation.**.sample"."locationID" IN (/*Heritage*/27581, /*CVS*/80476, /*NPS*/25572, /*GAP*/46073, /*Jennings*/11862, /*soil*/27234)
ORDER BY "^taxon_observation.**.sample"."identificationID", "^taxon_observation.**.sample"."aggregateOrganismObservationID"
2013-9-19¶
query:
SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT analytical_plot.*, *
FROM "VegBank"."taxon_observation.**"(100)
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
analytical_plot."datasource" = 'VegBank'
AND analytical_plot."taxonOccurrenceID" = "taxon_observation.**"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "taxon_observation.**"."aggregateOrganismObservationID"
quick import command:
time (export version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/import_scrub by_col= n=100; make inputs/VegBank/{observationcontributor_,observation__community}/import by_col=1 n=; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 3.5 min ("3m19.342s")
full import command: (needed in order to include all the selected plots rather than just those in the first 100 occurrences)
time (export version=VegBank_VegBIEN; make schemas/$version/reinstall; unset n; make inputs/VegBank/import_scrub by_col=1; bin/make_analytical_db) # need all rows in order to get all spot-checked plots # takes 1 hr ("01:01:45.938899" from `select 'Aug 9 14:59:46 PDT 2013'::timestamp with time zone - 'Aug 9 14:06:34 PDT 2013'::timestamp with time zone + /*TNRS*/'6m7.515s' + /*analytical_stem*/'146423.899 ms'`)
2013-9-5¶
records Mark requested:
- 3 plots from 3 places
- records with each of the following fields NOT NULL:
stemdiameter, stemcount, stemheight, cover, basalarea, biomass, stratum (stratumname)
2013-8-7¶
query:
SELECT analytical_plot.*, *
FROM "VegBank".project
JOIN "VegBank".observation USING (project_id)
JOIN "VegBank".plot USING (plot_id)
JOIN "VegBank".taxonobservation USING (observation_id)
LEFT JOIN "VegBank".taxoninterpretation USING (taxonobservation_id)
LEFT JOIN "VegBank".plantconcept_ USING (plantconcept_id)
LEFT JOIN "VegBank".taxonimportance USING (taxonobservation_id)
LEFT JOIN "VegBank".stemcount USING (taxonimportance_id)
LEFT JOIN r10667.analytical_plot ON
analytical_plot."datasource" = 'VegBank'
AND analytical_plot."taxonOccurrenceID" = taxonobservation.taxonobservation_id::text
WHERE project_id IN ('10191', '928', '9297', '10504')
ORDER BY project_id
--LIMIT 1
import command:
(export version="$(make -s version)"; make schemas/$version/install; unset n; make inputs/VegBank/import_scrub by_col=1; bin/make_analytical_db) # need all rows in order to get all spot-checked plots # takes 1 hr ("01:01:45.938899" from `select 'Aug 9 14:59:46 PDT 2013'::timestamp with time zone - 'Aug 9 14:06:34 PDT 2013'::timestamp with time zone + /*TNRS*/'6m7.515s' + /*analytical_stem*/'146423.899 ms'`)
quick import command (to test a subset of each table, without the rows necessarily matching up between tables):
(n=1000; export version="$(make -s version)_n$n"; make schemas/$version/reinstall; make inputs/VegBank/import_scrub by_col=1; bin/make_analytical_db) # need all rows in order to get all spot-checked plots
2012-12-5¶
Each subheading is a project.
Bob Peet's comments: (e-mail on 2012-12-9)
For our validation of plots I think you will need to summarize the data in not one but three tables.
- Plot information, which contains variables specific to the plot, like location and the persons who observed the plot and total basal area and soils variables
- Species in the plot with their strata and cover and names
- Individual tree stems in the plot with their individual attributes
Of course you could denormalize these down to one, but it will be almost impossible to interpret
Vegetation of the east slope of the Colorado Front Range¶
See *VegBank.10191.50_rows.csv*
query:
SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '10191' LIMIT 50;
Great Smoky Mountains National Park¶
Bob Peet's comments: (e-mail on 2012-12-9)
Your data come from the plot summarized in vegbank at http://vegbank.org/vegbank/views/observation_comprehensive.jsp?view=comprehensive&wparam=3536&entity=observation¶ms=3536
items labeled FIXED should be reconfirmed in the next import
The unique plot ID is missingThis is now in the locationIDCommunity classification is missingThe person collecting the plot is not shownthis would be inobservationContributor
.DecimalLatitude and decimalLongiture lose a digit of significancecomparing attachment:VegBank.928.50_rows.csv and VegBank plot 3536 @ project 928, the values are actually equal. might Excel be displaying only the first 10 digits of the value?State is wrong, not Wyoming, but TennesseeNeed location name translated form FKThis is now in the locationName- DO NOT FIX (Brad): I would rather have the plot area in m2 than ha
ScientificName has only the genus and the species gets added only in the speciesBinomialwithmorphospecies – the scientificNamewith morphospeices is the styrange concatenation of author, genus speciesThe first species (Vaccinium staminium line 2) also shows up in lines 12 and 13. The latter 2 have a cover of .55, which is the correct cover for the species in the shortshrub stratum, which is the only stratum it is reported in, so why 3 values?
this is because there is one row per size class (stemCount
), not per taxon (taxonImportance
) .
query:
SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '928' LIMIT 50;
Carolina Vegetation Survey 27¶
See *VegBank.9297.50_rows.csv*
query:
SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '9297' LIMIT 50;
Cape Fear Riparian¶
See *VegBank.10504.50_rows.csv*
Bob Peet's comments: (e-mail on 2012-12-9)
Your data come from the plot summarized in vegbank at http://vegbank.org/vegbank/views/observation_comprehensive.jsp?view=comprehensive&wparam=28256&entity=observation¶ms=28256
items labeled FIXED should be reconfirmed in the next import
County is incorrect (not Powell, but Orange)The unique plot ID is missingThis is now in the locationIDCommunity classification is missingThe person collecting the plot is not shownthis would be inobservationContributor
.DecimalLatitude and decimalLongiture lose a digit of significancecomparing attachment:VegBank.10504.50_rows.csv and VegBank plot 28256 @ project 10504, the values are actually equal. might Excel be displaying only the first 10 digits of the value?CoordinateUncertaintyInMeters is missing (should be 50 m)fixed in r10684ScientificName (Why No suitable Matches for any of the records??Why is percentCover blank?fixed in r10695, r10696For some woody species there is a diameterBreastHeight given, whereas there are many diameters in the datasetthis is because there is one row per size class (stemCount
), not per taxon (taxonImportance
) .Species occur in various strata, but the strata are not indicated- FEATURE REQUEST (POSTPONE until after Oct) (Brad): Soil data are not shown
query:
SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '10504' LIMIT 50;