spot-checking validation details¶
- BIEN2 traits validation
- CVS validation
- FIA validation
- GBIF validation
- SALVIAS validation
- Spot-checking workflow issues
- TEX validation
- VegBank validation
See also overall status
Note that these are validations of analytical_stem_view
(the denormalized view) rather than the raw VegBIEN data, because this was believed to be the output of VegBIEN at the time of the validation.
(Also, the denormalized view is much simpler, and it would be very time-consuming for validators to check several dozen VegBIEN tables against their data.)
This means that it was up to the validator to request the addition of columns to analytical_stem_view
, if they required this to sign off on their validation.
Thus, there may be mis-mappings in the non-core columns that were not part of analytical_stem_view
, in spite of a successful validation.
Note: analytical_stem.institutionCode
has been renamed to datasource
as of r6795 (2012-12-11). institutionCode
now refers to the subprovider instead.
Queries¶
To get rows for a datasource, use one of the following queries.
- If
analytical_stem
has not yet been created, it's OK to useanalytical_stem_view
instead ofanalytical_specimen/plot
. - If you change
analytical_stem_view
to fix a bug, you must also reloadanalytical_stem
becauseanalytical_specimen/plot
use this materialized table instead of the view.
Specimens¶
Input and output data side-by-side:
After running the query, split the input and output data into separate spreadsheet tabs
SELECT *
FROM "<datasource>"."Specimen"
LEFT JOIN /*r#.*/analytical_specimen ON
analytical_specimen."datasource" = '<datasource>'
AND analytical_specimen."occurrenceID" = "Specimen"."occurrenceID"
ORDER BY "Specimen".row_num
LIMIT 100;
Plots¶
Input data must be retrieved separately.
Output data:
SELECT *
FROM /*r#.*/analytical_plot
WHERE "datasource" = '<datasource>'
LIMIT 100;
- Table of contents
- spot-checking validation details
- ACAD
- ARIZ
- BRIT
- CTFS
- CVS
- HVAA
- Madidi
- MO
- NCU
- NVS
- NY
- REMIB
- SALVIAS
- SpeciesLink
- TEAM
- TEX
- U
- UNCC
- VegBank
- Herbaria used
ACAD¶
Source data used is on vegbiendev in /home/bien/inputs/ACAD/Specimen/occurrence.txt
2013-3-6¶
See ACAD.2013-3-6.100_rows.xls and ACAD.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-3-7)
Excellent. No issues. Mark all these sources [NY, BRIT, ACAD] as done.
query:
SELECT *
FROM "ACAD"."Specimen"
LEFT JOIN /*r7855.*/analytical_specimen ON
analytical_specimen."datasource" = 'ACAD'
AND analytical_specimen."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/ACAD/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-26¶
See ACAD.2013-2-26.100_rows.xls and ACAD.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27, ACAD.2013-2-26.100_rows_bb.xls)
- institutionCode:
Should be herbarium acronym (ACAD) [not valueAcadia University
]. See: http://rs.tdwg.org/dwc/terms/#institutionCode
query:
SELECT *
FROM "ACAD"."Specimen"
LEFT JOIN /*r7721.*/analytical_specimen ON
analytical_specimen."datasource" = 'ACAD'
AND analytical_specimen."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/ACAD/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-20¶
See ACAD.2013-2-20.100_rows.xls and ACAD.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-21, ACAD.2013-2-20.100_rows_bb.xls)
- FIXED:
collectionCode:
See spreadsheet "vegbien_identifiers". For collectionCode, note "In general, use whatever the data provider provides, unless badly formed". They have provided a properly formed, globally unique urn from the "Biodiversity Collections Index". Just use it as provided (urn:lsid:biocol.org:col:13294) - FIXED:
scientificName_verbatim:
Still incorrect Include family only if no lower ranked name is provided.
query:
SELECT *
FROM "ACAD"."Specimen"
LEFT JOIN /*r7620.*/analytical_specimen ON
analytical_specimen."datasource" = 'ACAD'
AND analytical_specimen."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/ACAD/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-1-25¶
See ACAD.2013-1-25.100_rows.xls and ACAD.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-1-31, ACAD.2013-1-25.100_rows_bb.xls)
Validation complete. I found not problems with content. Great! This means we are done validating all of Canadensys.
I have suggested some changes to the layout of the specimen extract. Please apply these changes to all future validation extracts for all specimens from all sources, not just ACAD.
column name | comments |
datasource | |
institutionCode | |
collectionCode | |
catalogNumber | |
occurrenceID | |
country | |
stateProvince | |
county | |
locality | |
decimalLatitude | |
decimalLongitude | |
coordinateUncertaintyInMeters | |
coordinateSource_bien | |
georeferenceProtocol_bien | |
geovalid_bien | |
isNewWorld_bien | |
dateCollected | |
higherPlantGroup_bien | |
family | |
genus | |
speciesBinomial | |
taxonName | |
scientificNameAuthorship | |
scientificNameWithMorphospecies | omit for all specimen validation extracts; this pertains only to plots |
threatened_bien | move to after `growthForm` |
identifiedBy | omit; this pertains to TNRS accepted name |
dateIdentified | omit; this pertains to TNRS accepted name |
identificationRemarks | omit; this pertains to TNRS accepted name |
growthForm | |
cultivated_bien | |
cultivatedBasis_bien | |
recordedBy | move these two columns to before `dateCollected` |
recordNumber | move these two columns to before `dateCollected` |
Add the following columns before `family`: | |
family_verbatim | As provided by author |
taxonName_verbatim | As provided by author |
identifiedBy_verbatim | As provided by author |
dateIdentified_verbatim | As provided by author |
identificationRemarks_verbatim | As provided by author |
family_matched_tnrs | As matched by TNRS |
taxonName_matched_tnrs | As matched by TNRS |
Add the following column after `growthForm` | |
reproductiveCondition |
query:
SELECT *
FROM "ACAD"."Specimen"
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
analytical_stem."datasource" = 'ACAD'
AND analytical_stem."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;
2013-1-15¶
See ACAD.50_rows.xls and ACAD.50_rows.csv
Input data is on the left, and output data is on the right, starting with the datasource column
query:
SELECT *
FROM "ACAD"."Specimen"
LEFT JOIN /*r7023.*/analytical_stem ON
analytical_stem."datasource" = 'ACAD'
AND analytical_stem."occurrenceID" = "Specimen".id
LIMIT 50;
ARIZ¶
Source data is on vegbiendev in /home/bien/inputs/ARIZ/_MySQL/MySQL.{schema,data}.sql
2012-12-11¶
Source data used is on vegbiendev in /home/bien/inputs/ARIZ/ARIZ_DiGIR_21012010.csv.tar.gz
See ARIZ.50_rows.xls and ARIZ.50_rows.csv
Brad Boyle's comments: (ariz_validation_report.xlsx)
Fixed? | ||
coordinateUncertaintyInMeters | Populate with the values in column 'CoordinatePrecision' in ariz_raw (see below). | CoordinatePrecision does not exist in the raw data, so CoordinateUncertaintyInMeters is used instead |
elevationInMeters | This should equal (MinimumElevation+MaximumElevation)/2 if (a) both columns provide valid elevation values and MaximumElevation>=MinimumElevation). Currently on the value in MinimumElevation is being used. | PRIORITY FEATURE REQUEST |
plotArea_ha | Omit this column from specimen data extracts | FIXED |
samplingProtocol | Set to "Dried herbarium specimen" for all herbarium specimen records. | FEATURE REQUEST |
dateCollected | For valid date ranges, is it possible to use the middle value of the dates provided, assuming dates are valid? See CatalogueNumber 10021. Use later date if dates split into two even groups. Thus, mid-date for this record would be 1916-08-29. | FEATURE REQUEST |
speciesBinomialWithMorphospecies | This is correctly formed, but I recommend we call the column simply 'morphospecies' | FIXED |
scientificNameWithMorphospecies | Incorrectly formed, but we don't need it. Omit this column | FIXED |
threatened | This is not part of the original data providers data. In any view of data to reviewed by the data provider for validation, etc., I recommend suffixing the names of columns containing data added by BIEN with '_bien'. Thus, this field would become 'threatened_bien' | FIXED |
growthForm | Rename to growthForm_bien. See 'threatened', above. | The growthForm field actually contains data provided by the data provider, which is NULL for ARIZ because their DwC export for BIEN2 doesn't contain a growthForm column. |
cultivated | Rename to cultivated_bien. See 'threatened', above. | FIXED (the suffix might not always apply because it's sometimes from the data provider and sometimes parsed from the locality description) |
cultivatedBasis | Rename to cultivatedBasis_bien. See 'threatened', above. | FIXED |
coverPercent | Omit | FIXED |
diameterBreastHeight_cm | Omit | FIXED |
height_m | Omit | FIXED |
tag | Omit | FIXED |
organismX_m | Omit | FIXED |
organismY_m | Omit | FIXED |
authorTaxonCode | Omit | FIXED |
individualID | Omit | FIXED |
authorStemCode | Omit | FIXED |
Missing columns that should be included in validation view (names shown are from original data dump ariz_raw):
Fixed? | ||
Locality | The free-text locality description | FIXED |
FieldNotes | This is the detailed specimen description (of the plant, not the locality). Not sure what field this corresponds to in VegCore. | FIXED |
HorizontalDatum | =geodeticdatum in dwc | FEATURE REQUEST |
OriginalCoordinateSystem | =verbatimcoordinatesystem in dwc | FEATURE REQUEST |
GeoRefMethod | =georeferenceprotocol in dwc | FEATURE REQUEST |
CoordinatePrecision | This column incorrectly used by ARIZ. It appears they are actually entering undertainty in meters; therefore values should be transferred to coordinateUncertaintyInMeters. | The column is actually called CoordinateUncertaintyInMeters in the original data (nimoy:/home/bien_shared/raw_data/ariz/ARIZ_DiGIR_21012010.csv.tar.gz ), so there is no need to remap it. However, NYBG does use coordinatePrecision incorrectly, and I remapped it to coordinateUncertaintyInMeters. |
LatLongComments | =georeferenceremarks in dwc | FEATURE REQUEST |
IdentifiedDate | =identificationmodifier in dwc I think you mean dateIdentified | FIXED |
IdentificationModifier | Ubducates undertainty of identifications. In general, I rather than returning verbatim values, I recomomend populating with contents of column 'Annotations' from the TNRS output. This field should come after 'dateIdentified'. | PRIORITY FEATURE REQUEST |
query:
SELECT * FROM "public.2012-12-06-23-19-52.r6672".analytical_stem WHERE "institutionCode" = 'ARIZ' LIMIT 50;
BRIT¶
Source data used is on vegbiendev in /home/bien/inputs/BRIT/MySQL.*.sql
2013-3-6¶
See BRIT.2013-3-6.100_rows.xls and BRIT.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-3-7)
Excellent. No issues. Mark all these sources [NY, BRIT, ACAD] as done.
query:
SELECT *
FROM "BRIT".specimen_flat
LEFT JOIN /*r7855.*/analytical_specimen ON
analytical_specimen."datasource" = 'BRIT'
AND analytical_specimen."occurrenceID" = specimen_flat.specimen_id::text
WHERE specimen_flat.specimen_id >= 1660
ORDER BY specimen_flat.specimen_id
LIMIT 100;
import command:
make inputs/BRIT/import_scrub by_col=1 n=100 start=952; bin/make_analytical_db
2013-2-26¶
See BRIT.2013-2-26.100_rows.xls and BRIT.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27, BRIT.2013-2-26.100_rows_bb.xls)
- FIXED:
occurrenceRemarks:
Weird. Looks like they imported full label information and mixed together locality, habitat, specimen description and even the intitutional footer, all in one field. This will be too dangerous to keep as is. Let's discuss what to do about it.
if no Locality_Description, now ignore Notes_Plant because it's likely malformed - FIXED:
decimalLatitude/decimalLongitude:
BRIT looks like it's going to be a big headache to fix. Could you do me a favor and check to see if any of the BRIT records have non-NULL coordinates? I couldn't see any in the example extract.
Yes, about 15% of their rows have a valid lat/long. These rows are not in the first 100, however.
query:
SELECT *
FROM "BRIT".specimen_flat
LEFT JOIN /*r7724.*/analytical_specimen ON
analytical_specimen."datasource" = 'BRIT'
AND analytical_specimen."occurrenceID" = specimen_flat.specimen_id::text
ORDER BY specimen_flat.specimen_id
LIMIT 100;
import command:
make inputs/BRIT/import_scrub by_col=1 n=100; bin/make_analytical_db
CTFS¶
Source data used is in http://capstone.bradley.edu/~asingh2/files/bci_01April2011.zip
2012-12-17¶
See CTFS.2012-12-17.50_rows.xls and CTFS.2012-12-17.50_rows.csv
query:
SELECT
datasource AS "BIEN:datasource"
-- Country
, country AS "Country.CountryName"
-- Site
, "locationName" AS "Site.PlotID; Quadrat.QuadratID"
, "decimalLatitude" AS "Site.Latitude"
, "decimalLongitude" AS "Site.Longitude"
, "elevationInMeters" AS "Site.Elevation"
, "plotArea_ha" AS "Site.SizeOfSite"
-- Census
, "dateCollected" AS "Census.StartDate"
-- Tree
, "organismX_m" AS "Tree.x"
, "organismY_m" AS "Tree.y"
-- Stem
, tag AS "Stem.StemTag or Tree.Tag"
, "authorStemCode" AS "Stem.StemID"
, "diameterBreastHeight_cm" AS "Stem.DBH"
, "individualID" AS "Measurement.MeasureID"
-- Family
, "family" AS "Family.Family"
, "genus" AS "Genus.Genus"
, "speciesBinomial" AS "Genus.Genus SpecificEpithet.SpeciesName"
, "scientificNameAuthorship" AS "SpecificEpithet.SpeciesAuthority"
-- BIEN
, "higherPlantGroup" AS "BIEN:higherPlantGroup"
, "isNewWorld" AS "BIEN:isNewWorld"
, geovalid AS "BIEN:geovalid"
, cultivated AS "BIEN:cultivated"
, threatened AS "BIEN:threatened"
FROM "public.2012-12-14-15-22-55.r6870".analytical_stem
WHERE "datasource" = 'CTFS' LIMIT 50;
2012-12-12¶
See CTFS.50_rows.xls and CTFS.50_rows.csv
*Rick Condit* wanted the data in a form similar to CTFS itself, which has been provided (above)
query:
SELECT * FROM "public.2012-12-11-11-46-55.r6795".analytical_stem WHERE "datasource" = 'CTFS' LIMIT 50;
CVS¶
see CVS validation
HVAA¶
Source data used is on vegbiendev in /home/bien/inputs/HVAA/Specimen/Herbario_occur_1360871068.csv
2013-2-26¶
See HVAA.2013-2-26.100_rows.xls and HVAA.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27)
The following sources passed validation and are complete:
REMIB
HVAA
SpeciesLink
SALVIAS
TEAM
query:
SELECT *
FROM "HVAA"."Specimen"
LEFT JOIN /*r7722.*/analytical_specimen ON
analytical_specimen."datasource" = 'HVAA'
AND analytical_specimen."occurrenceID" = "Specimen"."symbiotaId"
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/HVAA/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-20¶
See HVAA.2013-2-20.100_rows.xls and HVAA.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-21, HVAA.2013-2-20.100_rows_bb.xls)
- FIXED:
scientificName_verbatim:
Include family only when no other lower ranked name is provided - FIXED:
ocurrenceRemarks:
Missing. please include in all specimen validation extracts- FIXED:
mapping this source only:=IF(input.occurrenceRemarks IS NULL, input.dynamicProperties,input.occurrenceRemarks)
- FIXED:
query:
SELECT *
FROM "HVAA"."Specimen"
LEFT JOIN /*r7620.*/analytical_specimen ON
analytical_specimen."datasource" = 'HVAA'
AND analytical_specimen."occurrenceID" = "Specimen"."symbiotaId"
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/HVAA/import_scrub by_col=1 n=100; bin/make_analytical_db
Madidi¶
See the *Madidi validation spreadsheet*: (Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.)
Source data used is on vegbiendev in /home/bien/inputs/Madidi/*/*.csv
2013-3-7¶
query:
SELECT *
FROM "Madidi"."IndividualObservation"
RIGHT JOIN "Madidi"."LocationObservation" USING ("PlotID", "PlotInventoryName")
RIGHT JOIN "Madidi"."Location" USING ("PlotID")
LEFT JOIN /*r7870.*/analytical_plot ON
analytical_plot."datasource" = 'Madidi'
AND analytical_plot."locationName" = "IndividualObservation"."PlotName"
AND analytical_plot."subplot" = "IndividualObservation"."Sequence"
AND analytical_plot."individualCode" = "IndividualObservation"."Tree"
ORDER BY "IndividualObservation".row_num
LIMIT 100;
import command:
make inputs/Madidi/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-3-6¶
See Madidi.2013-3-6.100_rows.xls and Madidi.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
query:
SELECT *
FROM "Madidi"."IndividualObservation"
RIGHT JOIN "Madidi"."LocationObservation" USING ("PlotID", "PlotInventoryName")
RIGHT JOIN "Madidi"."Location" USING ("PlotID")
LEFT JOIN /*r7870.*/analytical_plot ON
analytical_plot."datasource" = 'Madidi'
AND analytical_plot."institutionCode" = "IndividualObservation"."Institutions"
AND analytical_plot."collectionCode" IS NULL
AND analytical_plot."catalogNumber" = "IndividualObservation"."SpecimenID"
ORDER BY "IndividualObservation".row_num
LIMIT 100;
import command:
make inputs/Madidi/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-3-5¶
See Madidi.2013-3-5.100_rows.xls and Madidi.2013-3-5.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-3-6)
Started working on this, but realized you sent me a specimen-data validation extract. This is plot data. Please send me a plot-format validation extract, containing plot codes, subplots, morphospecies names, etc.
query:
SELECT *
FROM "Madidi"."IndividualObservation"
RIGHT JOIN "Madidi"."LocationObservation" USING ("PlotID", "PlotInventoryName")
RIGHT JOIN "Madidi"."Location" USING ("PlotID")
LEFT JOIN /*r7848.*/analytical_specimen ON
analytical_specimen."datasource" = 'Madidi'
AND analytical_specimen."institutionCode" = "IndividualObservation"."Institutions"
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "IndividualObservation"."SpecimenID"
ORDER BY "IndividualObservation".row_num
LIMIT 100;
import command:
make inputs/Madidi/import_scrub by_col=1 n=100; bin/make_analytical_db
2012-12-5¶
See Madidi.50_rows.xls and Madidi.50_rows.csv
*Peter Jorgensen* wants the data refreshed first.
query:
SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'Madidi' AND "organismX_m" IS NOT NULL LIMIT 50;
MO¶
Source data used is on vegbiendev in /home/bien/inputs/MO/Specimen/digirexport*.txt
2012-11-30¶
See MO.50_rows.xls and MO.50_rows.csv
Peter Jorgensen's comments: MO.50_rows.annotated.xlsx
query:
SELECT * FROM "public.2012-11-28-18-18-51.r6487".analytical_stem WHERE "institutionCode" = 'MO' ORDER BY "recordNumber" LIMIT 50;
NCU¶
Source data used is on vegbiendev in /home/bien/inputs/NCU/_archive/NCU-NCSC_2010-02-12.csv
2012-12-7¶
See NCU-NCSC.50_rows.xls and NCU-NCSC.50_rows.csv
Bob Peet's comments: (e-mail on 2012-12-9)
FIXED: The collection dates often read something like 2065 rather than 1965. The refresh uses 4-digit years, so no longer has this problem. 2-digit years can be supported for other datasources using the _fix_date()
filter.
FIXED: Perhaps I misunderstand, but I also fail to see the verbatim name from the original record. Otherwise looks fine. scientificName_verbatim has been added to the analytical DB
query:
SELECT * FROM "public.2012-12-05-12-51-00.r6646".analytical_stem WHERE "institutionCode" = 'NCU-NCSC' LIMIT 50;
NVS¶
Source data used is on vegbiendev in /home/bien/inputs/NVS/_src/AKATARAWA 2008 .xls
2013-1-22¶
See NVS.2013-1-22.50_rows.xls and NVS.2013-1-22.50_rows.csv
Input and output data are in separate tabs
query:
SELECT *
FROM "NVS"."StemObservation"
LEFT JOIN r7317.analytical_stem analytical_stem ON
analytical_stem."datasource" = 'NVS'
AND analytical_stem."individualObservationID" = "StemObservation"."Item Obs ID"::text
ORDER BY "StemObservation".row_num
LIMIT 50;
2012-12-21¶
See NVS.2012-12-21.50_rows.xls and NVS.2012-12-21.50_rows.csv. This includes just the columns with NVS data and omits taxonomic scrubbing for easier verification with the source data.
query: (using a test-only import)
SELECT
datasource
, "projectID"
, "locationID"
, "plotArea_ha"
, "scientificName"
, "growthForm"
, "diameterBreastHeight_cm"
, tag
, "taxonOccurrenceID"
, "authorTaxonCode"
, "individualObservationID"
FROM analytical_stem
WHERE datasource = 'NVS'
AND "individualObservationID" IS NOT NULL
ORDER BY "locationID" DESC
LIMIT 50;
2012-12-6¶
See NVS.50_rows.xls and NVS.50_rows.csv
Nick Spencer's comments: (e-mail on 2012-12-18)
Mostly in the right places, but I did note the following issues.
FIXED: Suggested mapping corrections
NVS export element | -> | VegCSV element | Fixed? |
Identifier | -> | tag | Y |
itemID | -> | individualID | Y |
Item Obs ID | -> | taxonOccurrenceID or occurrenceID (depending on definition)* | Y |
species code | -> | authorTaxonCode | Y |
Verbatim Code | -> | authorStemCode | Yes, but actually corresponds to authorTaxonCode |
1. Coordinates
The requested plot coordinate output for a spread sheet export of NVS data is a map coordinate (as opposed to lat long). Obviously this doesn’t suit a multi data provider data repository such as BIEN. I will discuss with Shirley about altering our output to include a lat long. However, I wondered if you require a standard WGS84 value or in fact have to deal with various projections during your import process? How do you tell from the import data you have been given, which coordinate system was used?
2. *taxonOccurrenceID or occurrenceID
What is the definition or practical difference between these two elements? Our item Obs ID is a unique ID per stem observation, not a unique ID of the occurrence of a taxon (regardless of how many individuals occurrences there were, i.e. 4 tree of species Pseudowintera axillaris and a cover estimate of the same species by strata)
There is obviously a more rich set of attributes in the NVS export cf with what you mapped including other sources of occurrence data as well as cover estimates by strata. Have you tested these other data types in other data sets?
query:
SELECT * FROM "public.2012-12-05-12-51-00.r6646".analytical_stem WHERE "institutionCode" = 'NVS' LIMIT 50;
NY¶
Source data used is on vegbiendev in /home/bien/inputs/NY/Specimen/NYSpecimenDataAmericas.csv
2013-3-6¶
See NY.2013-3-6.100_rows.xls and NY.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-3-7)
Excellent. No issues. Mark all these sources [NY, BRIT, ACAD] as done.
query:
SELECT *
FROM "NY"."Ecatalog_all"
LEFT JOIN /*r7854.*/analytical_specimen ON
analytical_specimen."datasource" = 'NY'
AND analytical_specimen."institutionCode" = "Ecatalog_all"."InstitutionCode"
AND analytical_specimen."collectionCode" = "Ecatalog_all"."CollectionCode"
AND analytical_specimen."catalogNumber" = "Ecatalog_all"."CatalogNumber"
ORDER BY "Ecatalog_all".pkey
LIMIT 100;
import command:
make inputs/NY/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-26¶
See NY.2013-2-26.100_rows.xls and NY.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27, NY.2013-2-26.100_rows_bb.xls)
- FIXED:
occurrenceRemarks:
Do not preprend "plant fungus: "
query:
SELECT *
FROM "NY"."Ecatalog_all"
LEFT JOIN /*r7722.*/analytical_specimen ON
analytical_specimen."datasource" = 'NY'
AND analytical_specimen."institutionCode" = "Ecatalog_all"."InstitutionCode"
AND analytical_specimen."collectionCode" = "Ecatalog_all"."CollectionCode"
AND analytical_specimen."catalogNumber" = "Ecatalog_all"."CatalogNumber"
ORDER BY "Ecatalog_all".pkey
LIMIT 100;
import command:
make inputs/NY/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-20¶
See NY.2013-2-20.100_rows.xls and NY.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-21, NY.2013-2-20.100_rows_bb.xls)
- FIXED:
scientificName_verbatim:
Include family only when no other lower ranked name is provided - FIXED:
ocurrenceRemarks:
Missing. please include in all specimen validation extracts- FIXED:
mapping this source only:=input.PlantFungDescription
- FIXED:
query:
SELECT *
FROM "NY"."Ecatalog_all"
LEFT JOIN /*r7623.*/analytical_specimen ON
analytical_specimen."datasource" = 'NY'
AND analytical_specimen."institutionCode" = "Ecatalog_all"."InstitutionCode"
AND analytical_specimen."collectionCode" = "Ecatalog_all"."CollectionCode"
AND analytical_specimen."catalogNumber" = "Ecatalog_all"."CatalogNumber"
ORDER BY "Ecatalog_all".pkey
LIMIT 100;
import command:
make inputs/NY/import_scrub by_col=1 n=100; bin/make_analytical_db
2012-12-17¶
See NY.2012-12-17.50_rows.xls and NY.2012-12-17.50_rows.csv
query:
SELECT
datasource AS "BIEN:datasource"
, "institutionCode" AS "InstitutionCode"
, "collectionCode" AS "CollectionCode"
, "catalogNumber" AS "CatalogNumber"
, "scientificName" AS "ScientificName"
, "family" AS "Family"
, "genus" AS "Genus"
, "speciesBinomial" AS "Genus Species"
, "scientificNameAuthorship" AS "ScientificNameAuthor"
, "identifiedBy" AS "IdentifiedBy"
, "recordNumber" AS "FieldNumber"
, "recordedBy" AS "Collector"
, "dateCollected" AS "YearCollected-MonthCollected-DayCollected"
, "country" AS "Country"
, "stateProvince" AS "StateProvince"
, "county" AS "County"
, "decimalLongitude" AS "Longitude"
, "decimalLatitude" AS "Latitude"
, "coordinateUncertaintyInMeters" AS "CoordinatePrecision"
, "elevationInMeters" AS "MinimumElevation"
-- BIEN
, "higherPlantGroup" AS "BIEN:higherPlantGroup"
, "isNewWorld" AS "BIEN:isNewWorld"
, geovalid AS "BIEN:geovalid"
, cultivated AS "BIEN:cultivated"
, threatened AS "BIEN:threatened"
FROM "public.2012-12-14-15-22-55.r6870".analytical_stem
WHERE "datasource" = 'NY'
AND "institutionCode" = 'NY'
AND "collectionCode" = 'Herbarium'
LIMIT 50;
With the analytical DB's schema: NY.2012-12-17.orig.50_rows.xls and NY.2012-12-17.orig.50_rows.csv
query:
SELECT *
FROM "public.2012-12-14-15-22-55.r6870".analytical_stem
WHERE "datasource" = 'NY'
AND "institutionCode" = 'NY'
AND "collectionCode" = 'Herbarium'
LIMIT 50;
2012-11-28¶
See NY.csv on vegbiendev.nceas.ucsb.edu at /home/bien/inputs/NY/verify/NY.csv
(access via SSH/SFTP)
Barbara Thiers' comments: (e-mail on 2012-11-28)
I tested a sample set of 10 records, comparing the NY data extracted from BIEN 3.0, the NY dataset that was input into BIEN 3.0, and the original database at NY.
The data are correctly represented in BIEN 3.0, with one exception FIXED (see below). We use our Barcode number as our unique specimen identifier. In the input file, these data are stored in the column labelled "CatalogNumber." We also have an internal record number, which we don't use, but is included in the input file under the column "UniqueNYInternalRecordNumber."
In BIEN 3.0, the "UniqueNYInternalRecordNumber" (henceforth IRN) was used as the unique identifier (stored under "PlotName, and also concatenated with collector's number in RecordNumber.
FIXED: Instead, the barcode number (stored under "CatalogNumber" in the input file) should be used instead of the IRN. The barcode numbers should be unique, but if if is desirable to maintain the concatenated RecordNumber field, then it is the CatalogNumber that should be combined with the FieldNumber value from the input table.
however, some rows are missing a CatalogNumber1, so CatalogNumber is not a unique key for all rows.
1 1060 of 941,107 rowsSELECT * FROM "NY"."Ecatalog_all" WHERE "accessionNumber" IS NULL
Total query runtime: 942 ms.
FIXED: The FieldNumber is now in analytical_stem.recordNumber
. It would be very helpful to also maintain the FieldNumber as a separate column in the analytical database, so that the combination of Collector + Collector number could be easily queried (this is the 4th most used type of query for our database).
The attached spreadsheet [ Mapping changes needed for NYBG data in BIEN 3.0.xlsx ] illustrates the mapping changes needed
query:
SELECT * FROM "public.2012-11-24-14-51-43.r6446".analytical_stem WHERE "institutionCode" = 'NY';
REMIB¶
Source data used is on vegbiendev in /home/bien/inputs/REMIB/Specimen.src/node.*.csv
2013-2-26¶
See REMIB.2013-2-26.100_rows.xls and REMIB.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27)
The following sources passed validation and are complete:
REMIB
HVAA
SpeciesLink
SALVIAS
TEAM
query:
SELECT *
FROM "REMIB"."Specimen"
LEFT JOIN /*r7723.*/analytical_specimen ON
analytical_specimen."datasource" = 'REMIB'
AND analytical_specimen."institutionCode" = "Specimen".acronym
AND analytical_specimen."catalogNumber" = "Specimen".accession_number
ORDER BY "Specimen"."Specimen.src.row_num"
LIMIT 100;
import command:
make inputs/REMIB/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-20¶
See REMIB.2013-2-20.100_rows.xls and REMIB.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-21, REMIB.2013-2-20.100_rows_bb.xls)
- FIXED:
county:
For this source only, "ND" = "no disponible = "not available". Just leave blank when county = "ND" - FIXED:
locality:
For this source only, "ND" = "no disponible = "not available". Just leave this field blank when both locality and habitat = "ND" - FIXED:
decimalLatitude:
This source uses 99 to indicate a missing value. Set both decimalLatitude and decimalLongitude to NULL if decimalLatitude=99 - FIXED:
decimalLongitude:
This source uses -999 to indicate a missing value. Set both decimalLatitude and decimalLongitude to NULL if decimalLongitude=-999 - FIXED:
scientificName_verbatim:
Include family only when no other lower ranked name is provided
query:
SELECT *
FROM "REMIB"."Specimen"
LEFT JOIN /*r7620.*/analytical_specimen ON
analytical_specimen."datasource" = 'REMIB'
AND analytical_specimen."institutionCode" = "Specimen".acronym
AND analytical_specimen."catalogNumber" = "Specimen".accession_number
ORDER BY "Specimen"."Specimen.src.row_num"
LIMIT 100;
import command:
make inputs/REMIB/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-1-25¶
See REMIB.2013-1-25.100_rows.xls and REMIB.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
query:
SELECT *
FROM "REMIB"."Specimen"
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
analytical_stem."datasource" = 'REMIB'
AND analytical_stem."institutionCode" = "Specimen".acronym
AND analytical_stem."catalogNumber" = "Specimen".accession_number
ORDER BY "Specimen"."Specimen.src.row_num"
LIMIT 100;
2013-1-15¶
See REMIB.50_rows.xls and REMIB.50_rows.csv
Input data is on the left, and output data is on the right, starting with the datasource column
query:
SELECT *
FROM "REMIB"."Specimen"
LEFT JOIN /*r7023.*/analytical_stem ON
analytical_stem."datasource" = 'REMIB'
AND analytical_stem."institutionCode" = "Specimen".acronym
AND analytical_stem."catalogNumber" = "Specimen".accession_number
LIMIT 50;
SALVIAS¶
SpeciesLink¶
Source data used is on vegbiendev in /home/bien/inputs/SpeciesLink/Specimen/specieslink.csv
2013-2-26¶
See SpeciesLink.2013-2-26.100_rows.xls and SpeciesLink.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27)
The following sources passed validation and are complete:
REMIB
HVAA
SpeciesLink
SALVIAS
TEAM
query:
SELECT *
FROM "SpeciesLink"."Specimen"
LEFT JOIN /*r7723.*/analytical_specimen ON
analytical_specimen."datasource" = 'SpeciesLink'
AND analytical_specimen."occurrenceID" = "Specimen"."dwc_dwcore_GlobalUniqueIdentifier"
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/SpeciesLink/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-20¶
See SpeciesLink.2013-2-20.100_rows.xls and SpeciesLink.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-22, SpeciesLink.2013-2-20.100_rows_bb.xls)
- FIXED:
scientificName_verbatim:
Include family only when no other lower ranked name is provided - FIXED:
identificationRemarks:
Incorrectly formed. The values shown are the field description of the specimen, and should be in columnn occurrenceRemarks (see below). As far as I can tell, identificationRemarks were not supplied for this source. - FIXED:
ocurrenceRemarks:
Missing. please include in all specimen validation extracts- FIXED:
mapping this source only:=input.dwc_terms_occurrenceRemarks
- FIXED:
query:
SELECT *
FROM "SpeciesLink"."Specimen"
LEFT JOIN /*r7632.*/analytical_specimen ON
analytical_specimen."datasource" = 'SpeciesLink'
AND analytical_specimen."occurrenceID" = "Specimen"."dwc_dwcore_GlobalUniqueIdentifier"
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/SpeciesLink/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-1-25¶
See SpeciesLink.2013-1-25.100_rows.xls and SpeciesLink.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-1-31, SpeciesLink.2013-1-25.100_rows_bb.xls)
Content passes validation. I have the same format comments as for other specimen sources.
I did wonder why there were NULL values in the two analytical db fields geovalid_bien and isNewWorld_bien.
Rows that don't have both coordinates and a placename can't be geovalidated, so geovalid_bien is null. isNewWorld_bien is null whenever there isn't a scrubbed name match.
See attached [SpeciesLink.2013-1-25.100_rows_bb.xls].
column name | comments |
datasource | |
institutionCode | |
collectionCode | |
catalogNumber | |
occurrenceID | |
country | |
stateProvince | |
county | |
locality | |
decimalLatitude | |
decimalLongitude | |
coordinateUncertaintyInMeters | |
coordinateSource_bien | |
georeferenceProtocol_bien | |
geovalid_bien | Should be 0,1 only. Why are some values NULL? |
isNewWorld_bien | Should be 0,1 only. Why are some values NULL? |
dateCollected | |
higherPlantGroup_bien | |
family | |
genus | |
speciesBinomial | |
taxonName | |
scientificNameAuthorship | |
scientificNameWithMorphospecies | omit for all specimen validation extracts; this pertains only to plots |
threatened_bien | move to after `growthForm` |
identifiedBy | omit; this pertains to TNRS accepted name |
dateIdentified | omit; this pertains to TNRS accepted name |
identificationRemarks | omit; this pertains to TNRS accepted name |
growthForm | |
cultivated_bien | |
cultivatedBasis_bien | |
recordedBy | move these two columns to before `dateCollected` |
recordNumber | move these two columns to before `dateCollected` |
Add the following columns before `family`: | |
family_verbatim | As provided by author |
taxonName_verbatim | As provided by author |
identifiedBy_verbatim | As provided by author |
dateIdentified_verbatim | As provided by author |
identificationRemarks_verbatim | As provided by author |
family_matched_tnrs | As matched by TNRS |
taxonName_matched_tnrs | As matched by TNRS |
Add the following column after `growthForm` | |
reproductiveCondition |
query:
SELECT *
FROM "SpeciesLink"."Specimen"
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
analytical_stem."datasource" = 'SpeciesLink'
AND analytical_stem."occurrenceID" = "Specimen"."dwc_dwcore_GlobalUniqueIdentifier"
ORDER BY "Specimen".row_num
LIMIT 100;
TEAM¶
Source data used is on vegbiendev in /home/bien/inputs/TEAM/Specimen.src/node.*.csv
2013-2-20¶
See TEAM.2013-2-20.100_rows.xls and TEAM.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-26)
[Re Plot data validations for SALVIAS and TEAM:]
Both look good. The only problem I could find was the incorrectly formed scientificName_verbatim, but I believe you have fixed that now, correct? [yes] If that issue is fixed then these are both good and I do not need to see them again. Consider them done.
query:
SELECT *
FROM "TEAM"."TEAM_Sites"
RIGHT JOIN "TEAM"."VT" ON "VT"."1ha Plot Number" = "TEAM_Sites"."Name"
LEFT JOIN /*r7629.*/analytical_plot ON
analytical_plot."datasource" = 'TEAM'
AND analytical_plot."individualObservationID" = "VT"."Id"::text
ORDER BY "VT".row_num
LIMIT 100;
import command:
make inputs/TEAM/import_scrub by_col=1 n=100; bin/make_analytical_db
TEX¶
see TEX validation
U¶
Source data used is on vegbiendev in /home/bien/inputs/U/Specimen/UtrechtHerbarium.csv
2013-2-26¶
See U.2013-2-26.100_rows.xls and U.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-27, U.2013-2-26.100_rows_bb.xls)
- FIXED:
institutionCode:
Please populate. Should be "U" taxonName_verbatim:no, the family must be included in what's sent to TNRS
Still incorrectly formed. Include family only when no other lower ranked name is provided.
query:
SELECT *
FROM "U"."Specimen"
LEFT JOIN /*r7723.*/analytical_specimen ON
analytical_specimen."datasource" = 'U'
AND analytical_specimen."occurrenceID" = "Specimen"."ID"
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/U/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-20¶
See U.2013-2-20.100_rows.xls and U.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-2-21, U.2013-2-20.100_rows_bb.xls)
- FIXED:
locality:Was already fixed
This source uses input column "Ecology" as "habitat". Please append contents of input column "Ecology" - FIXED:
scientificName_verbatim:
Include family only when no other lower ranked name is provided - FIXED:
ocurrenceRemarks:
Missing. please include in all specimen validation extracts.Input field "Notes" maps to occurrenceRemarks for this source.
query:
SELECT *
FROM "U"."Specimen"
LEFT JOIN /*r7620.*/analytical_specimen ON
analytical_specimen."datasource" = 'U'
AND analytical_specimen."occurrenceID" = "Specimen"."ID"
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/U/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-1-25¶
See U.2013-1-25.100_rows.xls and U.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Brad Boyle's comments: (e-mail on 2013-1-31, U.2013-1-25.100_rows_bb.xls)
[...] consider the validation of U to be complete. See attached spreadsheet [U.2013-1-25.100_rows_bb.xls], with the usual format comments.
Content for U is fine. Although, as mentioned in my earlier email, I have some concerns about how we are forming DwC fields such as occurrenceID. But if we make any changes, it will apply to the entire database, so let's discuss this issue as a group.
column name | comments |
datasource | |
catalogNumber | |
occurrenceID | |
country | |
stateProvince | |
county | |
locality | |
decimalLatitude | |
decimalLongitude | |
coordinateUncertaintyInMeters | |
coordinateSource_bien | |
georeferenceProtocol_bien | |
geovalid_bien | |
isNewWorld_bien | |
dateCollected | |
higherPlantGroup_bien | |
family | |
genus | |
speciesBinomial | |
taxonName | |
scientificNameAuthorship | |
scientificNameWithMorphospecies | omit for all specimen validation extracts; this pertains only to plots |
threatened_bien | move to after `growthForm` |
identifiedBy | omit; this pertains to TNRS accepted name |
dateIdentified | omit; this pertains to TNRS accepted name |
identificationRemarks | omit; this pertains to TNRS accepted name |
growthForm | |
cultivated_bien | |
cultivatedBasis_bien | |
recordedBy | move these two columns to before `dateCollected` |
recordNumber | move these two columns to before `dateCollected` |
Add the following columns before `family`: | |
family_verbatim | As provided by author |
taxonName_verbatim | As provided by author |
identifiedBy_verbatim | As provided by author |
dateIdentified_verbatim | As provided by author |
identificationRemarks_verbatim | As provided by author |
family_matched_tnrs | As matched by TNRS |
taxonName_matched_tnrs | As matched by TNRS |
Add the following column after `growthForm` | |
reproductiveCondition | |
Add the following columns after datasource (even if null): | |
institutionCode | |
collectionCode |
query:
SELECT *
FROM "U"."Specimen"
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
analytical_stem."datasource" = 'U'
AND analytical_stem."occurrenceID" = "Specimen"."ID"
ORDER BY "Specimen".row_num
LIMIT 100;
2013-1-15¶
See U.50_rows.xls and U.50_rows.csv
Input data is on the left, and output data is on the right, starting with the datasource column
query:
SELECT *
FROM "U"."Specimen"
LEFT JOIN /*r7023.*/analytical_stem ON
analytical_stem."datasource" = 'U'
AND analytical_stem."occurrenceID" = "Specimen"."ID"
LIMIT 50;
UNCC¶
Source data used is on vegbiendev in /home/bien/inputs/UNCC/Specimen/UNCC.csv
2013-2-26¶
See UNCC.2013-2-26.100_rows.xls and UNCC.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
query:
SELECT *
FROM "UNCC"."Specimen"
LEFT JOIN /*r7870.*/analytical_specimen ON
analytical_specimen."datasource" = 'UNCC'
AND (analytical_specimen."institutionCode" = "Specimen".herbarium
OR analytical_specimen."institutionCode" IS NULL AND "Specimen".herbarium IS NULL
OR analytical_specimen."institutionCode" = 'UNCC' AND "Specimen".herbarium = 'UNCCD'
)
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/UNCC/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-2-26¶
See UNCC.2013-2-26.100_rows.xls and UNCC.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
Bob Peet's and Brad Boyle's comments: (e-mails on 2013-3-5, UNCC-validate_rkp_bb_rkp.xlsx)
Input | Output | Note | Brad's comments | Bob's comments on Brad's comments | Brad's final comments | Fixed? |
row_num | Might be good to save this | If this is a database ID, then I suggest Aaron save it under datasetRecordID | ok | |||
datasource | All records currently = UNCC; probably should be where you got the data = NCU | Bob: Index Herbariorum does not recognize UNC, only UNCC. These are UNCC collections, so I suggest we use UNCC as both the datasource (an internal BIEN field) and institutionCode (a DwC field). | I suppose UNCC is less confusing than NCU, but technically I got the data from NCU which got it from UNCC. | Prefer to use UNCC | FIXED | |
accession | Catalog Number | OK, I think. Need to chek with Brad regardss our final decisions on this | This looks good to me. If accessionNumber is provided by the data provider, is unique and used for every observation, it should be used at the catalogNumber | ok | ||
herbarium | institutionCode | Wrong. Herbarium is where UNCC got the collection. For institution code you want UNCC for all records | I believe Bob is right.` institutionCode` refers to the institution where the specimen is stored, which in this case is UNCC for all specimens. I believe `herbarium` is populated with a value other than UNCC when it was received as a duplicate from another herbarium. I don't think there is any point in storing the other values as there is not guarantee that another duplicate is stored at that institution. They may simply be fowarding a specimen they don't want; we have no way of knowing. So yes, make all values "UNCC" | omit herbarium | OK | FIXED: We don't currently have separate fields for this in VegBIEN or the analytical DB, so I just mapped it to a placeholder VegCore name. |
occurrenceID | ?? What is this? | This should be populated for all sources. Aaron, please use my recommended formula is in the spreadsheet "vegbien_identifiers", based on the DwC specification. | ||||
family | family_verbatim; AND ScientificName_verbatim (part 1 of 3) | OK | OK | |||
genus | OMITTED | OK | ||||
species | OMITTED | OK | ||||
usdarank | OMITTED | OK | ||||
infrarank | OMITTED | OK | ||||
SciName | ScientificName_verbatim (part 2 of 3) | ScientificName_verbatim = family + SciName + authors; Note sure why family was concatenated with SciName, and then why were the authors concatenated and these listed a second time as authors. The true verbatim scientificName is to my thinking just SciName | I think Aaron's formulation is correct, except for one record ("ZANNICHELLIACEAE (KEUTZ.) CORRELL" should not include the authority). Bob, you weren't part of the entire discussion, but the problem is that DwC defines scientificName as the taxon name PLUS authority. I don't much like the DwC definition either, but we decided to follow DwC. Because we didn't want to keep the authority separate from the name, we created a new field, taxonName_verbatim, which contain the lowest-level taxon name applied to the observation, MINUS the authority. That is why family sometimes appears in this column. If the specimen is only identified to order, the field will contain an ordinal name. Etc. Make sense? | Makes no sense to me to always concatenate family + genus + species to render ScientificName_verbatim as a trinomial. Aaron populated this as family + SciName + authors, whereas it should be SciName + authors except for those cases where there is no SciName and there is a family name. | Bob, I think you may be looking at an earlier incorrect version of the validation extract. Originally, Aaron was dumping to this field the value we send to the TNRS, which includes family pre-pended. I have since asked him to place family (or a higher taxon) in this field only if no determination below the rank of genus is provided. So you should never see family + genus + species. In version I am looking at, the name is always a correctly formed taxon name according to the ICBN. Aaron: your current version is correct | This happens when there is no input taxonName, causing the ranks and author to be concatenated together |
authors | scientificNameAuthorship_verbatim; AND ScientificName_verbatim (part 3 of 3) | Note sure why this is saved in two places | I only see it in one place, scientificNameAuthorship_verbatim. We also display scientificNameAuthorship_matched, but that is different: it's what the TNRS found after matching the name to Tropicos. | Perhaps you answered this above, but the field 'authors' in the input is copied into TWO places in the output, first as the final third of ScientificName_verbatim and then as a separate scientificNameAuthorship_verbatum | OK, this must definitely be a misunderstanding due to you using at an older version of the validation. Sorry about that.The current validation only contains only taxonName_verbatim (without authors) and scientificNameAuthorship_verbatum (the authors only). I would have preferred taxonNameAuthorship_verbatim, but as there was already a DwC name for the latter entity, we had to use it. Basically, DwC's ill-thought-out naming convention has screwed us up, but there's nothing we can do but use a new name if DwC has no unambiguous name for the taxon name without author. Aaron: your current version is correct. | |
collector | recordedBy (part 1 of 4) | OK | ||||
collector1 | recordedBy (part 2 of 4) | OK | ||||
collector2 | recordedBy (part 3 of 4) | OK | ||||
collector3 | recordedBy (part 4 of 4) | OK | ||||
collectno | recordNumber | OK | ||||
collmonth | dateCollected (part 2 of 3) | OK | ||||
collday | dateCollected (part 3 of 3) | OK | ||||
collyear | dateCollected (part 1 of 3) | OK | ||||
country | country | OK, but should probably convert into a standard character string | Bob: Aaron's presentation is correct for now. We haven't yet implemented geovalidation for BIEN3. When we do, he will return both the verbatim and scrubbed political division names. | ok | ||
state | stateProvince | OK, but should probably convert into a standard character string | Bob: Aaron's presentation is correct for now. We haven't yet implemented geovalidation for BIEN3. When we do, he will return both the verbatim and scrubbed political division names. | ok | ||
County | county | OK | ||||
Campus | Omit | Bob, what does this column mean? If it has anything to do with plants being collected on a campus (and therefore cultivated) we should definitely use it. | Yes, this refers to occurrence on the UNCC campus, but I do not know the meaning of the codes. Presumably C = cultivated | OK, this is important information in that case. Here's what I suggest. Aaron, if the column contains a non-null value, please put text "cultivated, collected on campus" in the column `cultivated_verbatim`. | FIXED | |
leaves | Omit | Agree | ok | |||
flower | reproductiveCOndition | Just a code that will be meaningless; if we were to keep this we would also want fruit. I suggest we omit. | Aaron, I suggest you get a formula from Bob for how to translate their codes for columns `flower` and `fruit` into the plain English terms "flower", "fruit", and "fertile" (the latter in the case of ferns, gymnosperms). This is definitely important information which we should not discard. And yes, I agree that we do not need to keep the information in the columns `leaves` and `root`. | Flower = AFIJMN, Fruit = 2AHIMOQS, definitions unknown | Bob: should we interpret flowers (or buds) present if flower is not null? And should we interpret fruits present if "fruit" is not null? I would like to capture this phenology information if at all possible. | FIXED: Mapped using translation |
fruit | Omit | No. See above. | ok | |||
root | Omit | Agree. | ||||
locality | locality (in part) | OK? | If we could figure out a way to separate the specimen description, I would prefer to append everything from all the `comment` columns pertaining to the locality description in this field. But can't think of a way to do that. Bob, suggestions? | Too inconsistent to for a simple answer | ||
habitat | locality (in part) | OK??? I might place habitat as yet another comment and not part of locality | We have in general been appending the habitat description to the verbatim locality description. We do this because of the inconsistent practices of herbaria. Some keep them in separate fields, and some (like UNCC and ARIZ) keep them in separate fields. | |||
comment1 | Missing -- need to preserve; not sure variable name | I agree that this information is important, but the four denormalized columns unpredictable mix locality description with habitat description with the specimen description. It is crucially important for both trait data mining and for detecting cultivated specimens, that the specimen description be preserved. But I'm at a loss as to what Aaron should do here. Bob, can you suggest an algorithm whereby Aaron could separate the locality description from the specimen description? | Hopelessly inconsistent. I would just place all of habitat and comments1-4 in a general notes field | OK Aaron, dump everything in these fields to another field, and I will modify my script to check that field for the word cultivated but nothing else. Is there another DwC field we can use other than occurrenceRemarks (we need to keep locality descriptions out of that field). | These are concatenated in occurrenceRemarks | |
comment2 | Missing -- need to preserve; not sure variable name | |||||
comment3 | Missing -- need to preserve; not sure variable name | |||||
comment4 | Missing -- need to preserve; not sure variable name | |||||
loanto | OMITTED | OK | ||||
inorout | OMITTED | OK | ||||
sheetno | OMITTED | OK | ||||
cultivated | cultivated_bien | Translation seems inconsistent. Where do the ocassional seros come from? | Aaron, please see my validation for source "TEX". I suggested we add the column `cultivatedVerbatim` to vegCore and vegBIEN, and display this column in all specimen validation views. As with TEX, the contents of UNCC's column `cultivated` should be dumped to `cultivatedVerbatim`.This means that for the validation extract you display, cultivatedVerbatim would be blank for all records except Accession Numbers 39691 and 19561. It wold be helpful to do some minimal translation. Bob, do the meanings of the codes "X" and "Y" differ in your field `CULTIVATED`? `cultivated_bien` is the results of post-import validation by BIEN; it is the results of combining parsing of the specimen description as well as using any cultivated flags supplied by the data provider, if any. Aaron, could you remind me why some values are blank (NULL) whereas others are zero? And why isn't Accession number 19561 flagged with a 1? | Aaron, if this column contains a non-null value, please put text "cultivated" in column `cultivated_verbatim`, unless you have already populated `cultivated_verbatim` due to non-null value in column `Campus`. | FIXED: all non-NULL values will be assumed to be cultivated . "Where do the ocassional seros come from?": FIXED: The zeros were being populated as the result of locality parsing, and should have been NULL |
|
filler | OMITTED | OK | ||||
decimalLatitude | NA | |||||
decimalLongitude | NA | |||||
coordinateUncertaintyInMeters | NA | |||||
elevationInMeters | NA | |||||
identifiedBy | NA | |||||
dateIdentified | NA | |||||
identificationRemarks | NA | |||||
identificationRemarks | Header error. This is familyName_matched. | ??? I don't understand Bob. Looks corrected to me. All values for this column are blank in the extract I am looking at. | Looks ok now. Previously the header was showing as a duplicate of identificationRemarks, but perhaps it was just my excel misbehaving. | FIXED | ||
taxonName_matched | OK (should families be excluded here? I noticed a case where the cultivar names was dropped upon matching) | Aaron's formulation is correct. taxonMatched contains the lowest taxon name matched, minus the authority. The TNRS doesn't process cultivar names, so they are dumped to `unmatchedTerms` (not displayed in specimen validation extracts) | ok | |||
scientificNameAuthorship_matched | OK | |||||
higherPlantGroup_bien | ||||||
family | (I noticed a case wherefamilyName_matched was blank and a genus name was captured from the family_verbatim; suggests an error) | Yes, I see that too. Accession number 39691. Genus "ELEAGNUS" was incorrectly placed in field `FAMILY` in the original data. I'm pretty sure there's no way the TNRS would have returned "Eleagnus" as a family. Aaron, is this a parsing error? | ! | Genera in the family column are not passed to TNRS | ||
genus | OK | |||||
taxonName | OK | |||||
scientificNameAuthorship | OK | |||||
growthForm | NA?? | Aaron: omit this from specimen validations. | FIXED | |||
threatened_bien | source? | Aaron: omit this from specimen validations.Sorry, I should have told you that earlier.Bob: not yet implemented, but we will us IUCN. | FIXED | |||
cultivatedBasis_bien | NA?? | Bob: for specimens flagged as cultivated (isCultivatedBien=1) this will contain a standard bit of text describing the reason it was flagged as cultivated (e.g., "Key words found in specimens description", etc.) |
query:
SELECT *
FROM "UNCC"."Specimen"
LEFT JOIN /*r7724.*/analytical_specimen ON
analytical_specimen."datasource" = 'UNCC'
AND (analytical_specimen."institutionCode" = "Specimen".herbarium
OR analytical_specimen."institutionCode" IS NULL AND "Specimen".herbarium IS NULL
OR analytical_specimen."institutionCode" = 'UNCC' AND "Specimen".herbarium = 'UNCCD'
)
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 100;
2013-2-20¶
See UNCC.2013-2-20.100_rows.xls and UNCC.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.
query:
SELECT *
FROM "UNCC"."Specimen"
LEFT JOIN /*r7634.*/analytical_specimen ON
analytical_specimen."datasource" = 'UNCC'
AND (analytical_specimen."institutionCode" = "Specimen".herbarium OR analytical_specimen."institutionCode" IS NULL AND "Specimen".herbarium IS NULL)
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 100;
import command:
make inputs/UNCC/import_scrub by_col=1 n=100; bin/make_analytical_db
2013-1-24¶
See UNCC.2013-1-24.50_rows.xls and UNCC.2013-1-24.50_rows.csv
Input and output data are in separate tabs
query: (using one-time import)
SELECT *
FROM "UNCC"."Specimen"
LEFT JOIN /*r7354.*/analytical_stem_view ON
analytical_stem_view."datasource" = 'UNCC'
AND (analytical_stem_view."institutionCode" = "Specimen".herbarium OR analytical_stem_view."institutionCode" IS NULL AND "Specimen".herbarium IS NULL)
AND analytical_stem_view."collectionCode" IS NULL
AND analytical_stem_view."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 50;
2013-1-22¶
See UNCC.50_rows.xls and UNCC.50_rows.csv
Input and output data are in separate tabs
Bob Peet's comments: (UNCC-validation-Peet.xlsx from e-mail on 2013-1-23)
BIEN output for validation | Problem ? | Field on UNCC dataset reported | What is reported | What I think this BIEN field should be | Response |
datasource | !! | none | Index Herbariorum code for this collection = UNCC | Where we obtained the information. Perhaps in all cases here "Robert Peet, NCU" or some code for me. | This is actually the short name of the organization that the data is from |
institutionCode | !! | herbarium | The name or acronym in use by the institution that donated the collection to UNCC | Dc "The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record." For all records here this should be "UNCC" | This is the herbarium field as indicated in the source data. Different rows use different acronyms for this field. |
occurrenceID | !! | accession | Dc catalogNumber "An identifier (preferably unique) for the record within the data set or collection." | Dc occurrenceID "An identifier for the Occurrence (as opposed to a particular digital record of the occurrence)…", Examples: "urn:lsid:nhm.ku.edu:Herps:32", "urn:catalog:FMNH:Mammal:145732". | FIXED: I'm assuming you meant to switch the mapping from occurrenceID to catalogNumber rather than the other way around, as this field was previously mapped to occurrenceID |
country | country | UNCC provides their own idiosyncratic country codes and this is the best place. Ideally these would be converted into ISO 3166 country codes and placed in Dc "CountryCode" | |||
stateProvince | state | Idiosyncratic UNCC abbreviations for state/province | |||
county | county | County | |||
locationID | !! | accession | Identical to OccurrenceID; Dc catalogNumber "An identifier (preferably unique) for the record within the data set or collection." | Dc "An identifier for the set of location information (data associated with dcterms:Location). May be a global unique identifier or an identifier specific to the data set." | This is primarily for plots data and will be the same as occurrenceID for specimens data |
locationName | !! | accession | Identical to OccurrenceID; Dc catalogNumber "An identifier (preferably unique) for the record within the data set or collection." | Omit | Some input fields are mapped to multiple places in VegBIEN to be used as unique keys for the various tables |
individualCode | !! | collectno | Collection number of the collector: Dc = "recordNumber" | Omit | |
dateCollected | collyear + collmonth + collday | Dc eventDate | |||
higherPlantGroup | |||||
family | family | Dc family | |||
genus | genus | Dc genus | |||
speciesBinomial | genus + species | Dc genus + Dc specificEpithet | |||
taxonName | ? | genus + species | I cannot tell from the records if 'infrarank' and 'infraname' are concatenated here as they should be | This is the scrubbed name provided by TNRS | |
scientificNameAuthorship | authors | Dc scientificNameAuithorship | |||
scientificNameWithMorphospecies | ? | genus + species | In the example dataset 34950 has a morphospecies listed as a infraname and this is lost in the cleaning (perhaps not possible to parse) | This is because Brad's formula for scientificNameWithMorphospecies uses the specificEpithet instead of the morphospecies when available, and this name also contains a specificEpithet (manihot) | |
threatened | ? | ? | Source is not clear. | This field is populated by the import process | |
cultivated | ? | cultivated ? | I cannot check this without more data | This field is populated by the import process | |
cultivatedBasis | ? | ? | Source not clear | This field is populated by the import process | |
RecordedBy | collector + collector1 + collector2 + collector3 | Dc RecordedBy | |||
recordNumber | collectno | Dc recordNumber; duplicates individualCode, which should be omitted | |||
TaxonOccurrenceID | !! | accession | Duplicates occurrenceID; not sure what this is supposed to be | This is primarily for plots data and will be the same as occurrenceID for specimens data | |
MISSING FROM BIEN VALIDATION | |||||
!! | locality | Dc verbatimLocality | FIXED: These are now concatenated in the locality field | ||
!! | habitat | Dc locationRemarks | |||
!! | comment1 | Dc locationRemarks | |||
!! | comment2 | Dc locationRemarks | |||
!! | comment3 | Dc locationRemarks |
query:
SELECT *
FROM "UNCC"."Specimen"
LEFT JOIN r7317.analytical_stem_view ON
analytical_stem_view."datasource" = 'UNCC'
AND analytical_stem_view."occurrenceID" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 50;
VegBank¶
Herbaria used¶
741 names
query:
SELECT name from "public.2012-11-28-18-18-51.r6487".sourcename order BY name;