Project

General

Profile

VegBank validation

6 feature requests

completed

  1. "I observe that some, but not all, plots in VegBank dump do not have the plot name populated" (CVS validation: #10) this is issue #965

2014-1-x

  • PRIORITY FEATURE REQUEST: remove embargoed columns from staging tables

extract

*VegBank.2014-1-x.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)

subset import command

time yes|(export log= version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/{Source,'^taxon_observation.**.sample'}/import_temp by_col=1 n=; make inputs/VegBank/{projectcontributor_,observationcontributor_,observation__community}/import_temp by_col=1 n=; make inputs/VegBank/scrub; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 3.5 min ("3m23.895s") @vegbiendev

query

SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT *
FROM      "VegBank"."^taxon_observation.**.sample" 
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
    analytical_plot."datasource"                     = 'VegBank'
AND analytical_plot."taxonOccurrenceID"              = "^taxon_observation.**.sample"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "^taxon_observation.**.sample"."aggregateOrganismObservationID" 
ORDER BY "^taxon_observation.**.sample"."locationName", "^taxon_observation.**.sample"."identificationID", "^taxon_observation.**.sample"."aggregateOrganismObservationID" 

2013-11-14

*VegBank.2013-11-14.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)

  • add project_name

Bob's conference call feedback

OK with no more validation extracts [for VegBank/CVS]

Bob's e-mail feedback:[Bien-db]+Fwd:+VegBank+validation+extract+

Yes this is all valid and we can move along

If an option at this point, I would be happier if we managed to retain the VegBank variable project_id_name, at least in addition to the nearly meaningless projectID shown in the BIEN datafile. Indeed, the projectID number is nowhere in the file defined and one needs to return to VegBank to lookup its meaning.

Is adding project_name necessary for you to sign off on VegBank, or is this a feature request?

A highly desirable feature request. I consider that I have already signed off.

Mike Lee's e-mail feedback

VALID! I'm happy to sign off on VegBank.

Mike Lee's e-mail feedback on attribution:[Bien-db]+Primary+plot+data+providers+and+projects+in+BIEN3+

  • project name is needed for attribution now in the normalized DB, though not yet in the validation view
  • map projectContributor table for attribution

If the VegBank project name could be captured, that would be ideal. The person responsible for the data is not always clear in VegBank, but it should be either in the observationContributor table and/or the projectContributor table.

So we will need to map the projectContributor table as well? Is this also true for CVS?

For VegBank, that would be great.

2013-11-10

*VegBank.2013-11-10.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)

e-mail feedback issues

  • individualCount should be NULL instead of defaulting to 1 when not specified

Mike Lee's e-mail feedback:[Bien-db]+Fwd:+VegBank+validation+extract+

I have completed the QA of the VegBank to VegBIEN export. It's completely correct except for one error, and that error is unfortunately systematic. One of the new fields, column BA, (individualCount) is defaulting to 1 instead of returning nulls when stemCount.stemCount in VegBank is null. This underreports the real number of individuals in most cases.

For example, on INW18913, Danthonia intermedia has 40% cover over 400 sq. meters. This is a grass, and it's really hard to imagine that it can cover 40% of this space with one plant. It is amusing to try, though, perhaps with incredibly aggressive cloning!

Otherwise, all the new fields look correct, and your adjustments seemed to have worked correctly.

Bob's e-mail feedback:[Bien-db]+Fwd:+VegBank+validation+extract+

Yes, it is looking good. Mike found only one error, which has a default to 1 instead of null where there are no stems. I will try to take a look myself in the very near future, though I expect Mike looked very closely.

2013-10-31

*VegBank.2013-10-31.6_plots.xls*
(input and output data are in separate tabs. refer to the VegCore data dictionary for column definitions.)

subset import command:

time (export log= version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/{Source,'^taxon_observation.**.sample'}/import_temp by_col=1 n=; make inputs/VegBank/{observationcontributor_,observation__community}/import_temp by_col=1 n=; make inputs/VegBank/scrub; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 3 min ("2m48.125s"), most of it on observationcontributor_, observation__community (together "2m1.005s" vs. "0m25.173s" for ^taxon_observation.**.sample) @starscream; 5 min ("5m18.533s") @vegbiendev

query:

SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT *
FROM      "VegBank"."^taxon_observation.**.sample" 
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
    analytical_plot."datasource"                     = 'VegBank'
AND analytical_plot."taxonOccurrenceID"              = "^taxon_observation.**.sample"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "^taxon_observation.**.sample"."aggregateOrganismObservationID" 
WHERE "^taxon_observation.**.sample"."locationID" IN (/*Heritage*/27581, /*CVS*/80476, /*NPS*/25572, /*GAP*/46073, /*Jennings*/11862, /*soil*/27234)
ORDER BY "^taxon_observation.**.sample"."identificationID", "^taxon_observation.**.sample"."aggregateOrganismObservationID" 

2013-10-10

See *VegBank.2013-10-10.6_plots.xls*
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

2013-10-25 conference call feedback

  • map VegBank taxoninterpretation.currentinterpretation -> taxondetermination.iscurrent
    • Bob said there shouldn't be multiple taxoninterpretations with currentinterpretation=true, but Mike Lee mentioned earlier that there were
  • FEATURE REQUEST (Mike Lee): do not use taxoninterpretations with interpretationtype = "simplification for analysis" as the current interpretation
  • remove CVS records from VegBank to avoid duplication
    • identify duplicates by looking for CVS-style (SSN-format) authorplotcodes no, not all plots use SSN format
    • all VegBank CVS data is also in CVS extract (Mike Lee), so OK to delete all CVS plots from VegBank extract

Mike Lee's 2013-10-17 conference call feedback

  • no records missing or garbled
  • map coordinateUncertaintyInMeters seems to have fixed itself since the extract was generated
  • add slopeAspect, slopeGradient to denormalized view
  • FEATURE REQUEST (Mike Lee: syntax issue): rename the input scientificName to scientificName_verbatim
  • omit authorplantname because it is not specific to the taxoninterpretation row (this is in a separate taxoninterpretation for the original determination instead)
  • omit int_*plantsci* fields because they are not specific to the taxoninterpretation row (they are autogenerated from other fields, so are not needed)
  • map taxoninterpretation.party_id to identifiedBy
  • add stemCount to denormalized view we actually call this individualCount, because for us, stemCount is the count of stems within the individual, not the size class
  • FEATURE REQUEST (Mike Lee: needed if doing concept-based taxonomy; Bob: would hate to lose that info if we have it, but OK not to validate this): map plantconcept.reference_id to accordingTo

Mike Lee's e-mail feedback

  • FEATURE REQUEST (Mike Lee; Bob: OK with validating just the most recent interpretation for each taxonobservation; Brad: might have to live with validating just one taxoninterpretation per taxonobservation in the interest of moving on):
    include all taxonInterpretations in validation extract
    • for now, include latest taxoninterpretation by date (Mike Lee)

Global comment:
We don't seem to have VegBank's accession codes stored in the output stream. This means it will be nearly impossible to update and sort out what's the same vs. what's already been added. Not sure how big of a deal that is, with this being a one-off operation vs. one we'd like to repeat. I'd prefer to have the accessionCode stored in VegBien from not only VegBank, but any outside database with GUIDs.

Plot-level information: Everything that ported matched well, 3 fields that I'd include normally were missing in VegBIEN. Perhaps not absolutely CRITICAL, but certainly best-practice to include, especially locationAccuracy.
coordinateUncertaintyInMeters (important)
did not come through, (e.g., 150 m on Peet-Rockies-2), stored in VegBank's plot.LocationAccuracy
SlopeAspect and SlopeGradient (both in VegBank's plot table) also didn't come through

Cover and Taxon Occurrence data: All rows ported, but the verbatim name didn't come through and multiple taxonInterpretations (always problematic) aren't being ported [they are to VegBIEN, just not to the validation view], just the original (unless you got the VegBank data before Oct 3, 2013 [no, we got it on 2013-10-8])

It appears that scientificName_verbatim is not taking exactly what is in the taxonObservation.authorPlantName field, but rather the scientific name with authority. I don't see where that field is going to, if anywhere. It's not critically important, but it is the author's original name used for the plant, which is a good idea to retain.

The taxonInterpretation record that is chosen for each taxonObservation also appears not to be correct, or at least not always. VegBank doesn't police which version is "the only right answer" but some taxonInterpretations are marked as "current" and there can be multiple per taxonObservation, e.g., http://vegbank.org/get/detail/taxonobservation/927046

For this record, someone (me), has marked Ludwigia suffruticosa as the updated taxon. I did this on 03-Oct-2012. If your data for pulling into VegBank are older than this date, then the import proceeded correctly. If not, it missed that new interpretation. This happens a couple of times in the dataset that I QA'd. The other is this: http://vegbank.org/get/detail/taxonobservation/927031

One right answer would be to port all interpretations over.

I did not address the issues of duplication, as the query output was duplicated by multiple stems, etc.

Stem Data: Important ERROR in that the stemCount field didn't port to VegBien. All rows moved though, look good otherwise.
I do not see the number of individuals for each stem. This is technically not necessary, as there was one stem of the size mentioned, but there were also in some cases 60 additional ones. It depends on what the data will be used for.

For example, on this plot:
http://vegbank.org/cite/VB.ob.27169.PEETROCKIES2
there are many stems for each stem class record:
http://vegbank.org/get/summary/stemcount/27169?where=where_observation_pk

This missing field is (VegBank's stemCount.stemCount).

plots to include

_(from Mike Lee)_

There several different types of plots in VegBank:

Heritage program, e.g., http://vegbank.org/cite/VB.ob.27516.BURNS1
CVS: http://vegbank.org/cite/urn:lsid:cvs.bio.unc.edu:observation:6317-%7BD4888010-F606-4FCE-AD08-ED8E1FC1BB9C%7D
NPS: http://vegbank.org/get/comprehensive/observation/25507
GAP: http://vegbank.org/get/comprehensive/observation/46008
Jennings: http://vegbank.org/cite/VB.Ob.11801.INW18913

If all those transition well, I think we are in great shape. I have included plots with stems wherever possible.

One more: This one has userdefined soil variables. Are you even attempting to catch this (it's complicated and perhaps not central to the purpose of what you are doing)? This may be a good candidate for "feature request."

http://vegbank.org/get/std/observation/27169

Scroll down to where the soil variable are. Humic matter present, total N, etc.

tables to include

_(from Bob Peet)_

I think that one table id fine for now. We will let you know if there is a problem.

input columns to include

_(from Mike Lee)_

The only field I think it important (and even then we would survive without it) is this:
*observation_--dateaccuracy

This is for situations when the exact date of collection is not known, and the date given is approximate. Dateaccuracy says how approximate.

If area is important, i.e. the bounds in which taxa were collected, stems measured, etc., if anyone is to determine stem density, basal area, or work with species richness over area, we need these:
*observation_--taxonobservationarea
*observation_--stemobservationarea
if a particular class of stems are sampling in an area different from the plot, i.e., subsampling via stratum or stems:
*stemcount_--inferencearea
*stemcount_--stemtaxonarea

You seem to have captured plot.area, which is generally the same as the rest of these, but can differ depending on the method of sampling.

But BIEN seems to be an observation database, not a plot database, so perhaps these don't matter. If the plot was really large (e.g., a hectare) then that will affect the location accuracy, as the plot location is accurate only to part of the plot. If the plot extends 100m in one direction, the plant observed could be 100m further away from the geocoordinates than is indicated in the lat/long fields.

We might need some of these, but they might be embedded elsewhere. StratumHeight and stratumBase define where the cover value for that particular stratum starts and stops.
*stemcount_--stratumbase
*stemcount_--stratumheight
*stemcount_--stemdiameteraccuracy
*stemcount_--stemheightaccuracy
*stratum--stratumheight
*stratum--stratumbase
*stratum--stratumcover
*stratumtype--stratumindex

Since these fields are not currently mapped to VegBIEN, do you want me to add them before generating the extract, or would you be OK to validate an extract with the columns we have now?

Since time is of the essence, I am fine to validate without them.

subset import command:

time (export version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/{Source,'^taxon_observation.**.sample'}/import_scrub by_col= n=; make inputs/VegBank/{observationcontributor_,observation__community}/import by_col=1 n=; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 30 min ("27m45.464s"), most of it on ^taxon_observation.**.sample ("0:24:48.082284 sec")

query:

SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT *
FROM      "VegBank"."^taxon_observation.**.sample" 
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
    analytical_plot."datasource"                     = 'VegBank'
AND analytical_plot."taxonOccurrenceID"              = "^taxon_observation.**.sample"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "^taxon_observation.**.sample"."aggregateOrganismObservationID" 
WHERE "^taxon_observation.**.sample"."locationID" IN (/*Heritage*/27581, /*CVS*/80476, /*NPS*/25572, /*GAP*/46073, /*Jennings*/11862, /*soil*/27234)
ORDER BY "^taxon_observation.**.sample"."identificationID", "^taxon_observation.**.sample"."aggregateOrganismObservationID" 

2013-9-19

query:

SET search_path TO "VegBank_VegBIEN"; -- needed for locationevent__contributors(), locationevent__communities() for now
SELECT analytical_plot.*, *
FROM      "VegBank"."taxon_observation.**"(100)
LEFT JOIN "VegBank_VegBIEN".analytical_plot ON
    analytical_plot."datasource"                     = 'VegBank'
AND analytical_plot."taxonOccurrenceID"              = "taxon_observation.**"."taxonOccurrenceID"::text
AND analytical_plot."aggregateOrganismObservationID" = "taxon_observation.**"."aggregateOrganismObservationID" 

quick import command:

time (export version=VegBank_VegBIEN; make schemas/$version/reinstall; make inputs/VegBank/import_scrub by_col= n=100; make inputs/VegBank/{observationcontributor_,observation__community}/import by_col=1 n=; make inputs/VegBank/publish; bin/make_analytical_db) # runtime: 3.5 min ("3m19.342s")

full import command: (needed in order to include all the selected plots rather than just those in the first 100 occurrences)

time (export version=VegBank_VegBIEN; make schemas/$version/reinstall; unset n; make inputs/VegBank/import_scrub by_col=1; bin/make_analytical_db) # need all rows in order to get all spot-checked plots
    # takes 1 hr ("01:01:45.938899" from `select 'Aug  9 14:59:46 PDT 2013'::timestamp with time zone - 'Aug  9 14:06:34 PDT 2013'::timestamp with time zone + /*TNRS*/'6m7.515s' + /*analytical_stem*/'146423.899 ms'`)

2013-9-5

records Mark requested:

  • 3 plots from 3 places
  • records with each of the following fields NOT NULL:
    stemdiameter, stemcount, stemheight, cover, basalarea, biomass, stratum (stratumname)

2013-8-7

query:

SELECT analytical_plot.*, *
FROM      "VegBank".project
JOIN      "VegBank".observation         USING (project_id)
JOIN      "VegBank".plot                USING (plot_id)
JOIN      "VegBank".taxonobservation    USING (observation_id)
LEFT JOIN "VegBank".taxoninterpretation USING (taxonobservation_id)
LEFT JOIN "VegBank".plantconcept_       USING (plantconcept_id)
LEFT JOIN "VegBank".taxonimportance     USING (taxonobservation_id)
LEFT JOIN "VegBank".stemcount           USING (taxonimportance_id)
LEFT JOIN r10667.analytical_plot    ON
    analytical_plot."datasource"        = 'VegBank'
AND analytical_plot."taxonOccurrenceID" = taxonobservation.taxonobservation_id::text
WHERE project_id IN ('10191', '928', '9297', '10504')
ORDER BY project_id
--LIMIT 1

import command:

(export version="$(make -s version)"; make schemas/$version/install; unset n; make inputs/VegBank/import_scrub by_col=1; bin/make_analytical_db) # need all rows in order to get all spot-checked plots
    # takes 1 hr ("01:01:45.938899" from `select 'Aug  9 14:59:46 PDT 2013'::timestamp with time zone - 'Aug  9 14:06:34 PDT 2013'::timestamp with time zone + /*TNRS*/'6m7.515s' + /*analytical_stem*/'146423.899 ms'`)

quick import command (to test a subset of each table, without the rows necessarily matching up between tables):

(n=1000; export version="$(make -s version)_n$n"; make schemas/$version/reinstall; make inputs/VegBank/import_scrub by_col=1; bin/make_analytical_db) # need all rows in order to get all spot-checked plots

2012-12-5

Each subheading is a project.

Bob Peet's comments: (e-mail on 2012-12-9)

For our validation of plots I think you will need to summarize the data in not one but three tables.

  1. Plot information, which contains variables specific to the plot, like location and the persons who observed the plot and total basal area and soils variables
  2. Species in the plot with their strata and cover and names
  3. Individual tree stems in the plot with their individual attributes

Of course you could denormalize these down to one, but it will be almost impossible to interpret

Vegetation of the east slope of the Colorado Front Range

See *VegBank.10191.50_rows.csv*

query:

SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '10191' LIMIT 50;

Great Smoky Mountains National Park

See *VegBank.928.50_rows.csv*

Bob Peet's comments: (e-mail on 2012-12-9)

Your data come from the plot summarized in vegbank at http://vegbank.org/vegbank/views/observation_comprehensive.jsp?view=comprehensive&wparam=3536&entity=observation&params=3536

items labeled FIXED should be reconfirmed in the next import

  1. The unique plot ID is missing This is now in the locationID
  2. Community classification is missing
  3. The person collecting the plot is not shown this would be in observationContributor .
  4. DecimalLatitude and decimalLongiture lose a digit of significance comparing attachment:VegBank.928.50_rows.csv and VegBank plot 3536 @ project 928, the values are actually equal. might Excel be displaying only the first 10 digits of the value?
  5. State is wrong, not Wyoming, but Tennessee
  6. Need location name translated form FK This is now in the locationName
  7. DO NOT FIX (Brad): I would rather have the plot area in m2 than ha
  8. ScientificName has only the genus and the species gets added only in the speciesBinomialwithmorphospecies – the scientificNamewith morphospeices is the styrange concatenation of author, genus species
  9. The first species (Vaccinium staminium line 2) also shows up in lines 12 and 13. The latter 2 have a cover of .55, which is the correct cover for the species in the shortshrub stratum, which is the only stratum it is reported in, so why 3 values?
    this is because there is one row per size class (stemCount), not per taxon (taxonImportance) .

query:

SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '928' LIMIT 50;

Carolina Vegetation Survey 27

See *VegBank.9297.50_rows.csv*

query:

SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '9297' LIMIT 50;

Cape Fear Riparian

See *VegBank.10504.50_rows.csv*

Bob Peet's comments: (e-mail on 2012-12-9)

Your data come from the plot summarized in vegbank at http://vegbank.org/vegbank/views/observation_comprehensive.jsp?view=comprehensive&wparam=28256&entity=observation&params=28256

items labeled FIXED should be reconfirmed in the next import

  1. County is incorrect (not Powell, but Orange)
  2. The unique plot ID is missing This is now in the locationID
  3. Community classification is missing
  4. The person collecting the plot is not shown this would be in observationContributor .
  5. DecimalLatitude and decimalLongiture lose a digit of significance comparing attachment:VegBank.10504.50_rows.csv and VegBank plot 28256 @ project 10504, the values are actually equal. might Excel be displaying only the first 10 digits of the value?
  6. CoordinateUncertaintyInMeters is missing (should be 50 m) fixed in r10684
  7. ScientificName (Why No suitable Matches for any of the records??
  8. Why is percentCover blank? fixed in r10695, r10696
  9. For some woody species there is a diameterBreastHeight given, whereas there are many diameters in the dataset this is because there is one row per size class (stemCount), not per taxon (taxonImportance) .
  10. Species occur in various strata, but the strata are not indicated
  11. FEATURE REQUEST (POSTPONE until after Oct) (Brad): Soil data are not shown

query:

SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'VegBank' AND "projectID" = '10504' LIMIT 50;