Project

General

Profile

spot-checking validation details

See also overall status

Note that these are validations of analytical_stem_view (the denormalized view) rather than the raw VegBIEN data, because this was believed to be the output of VegBIEN at the time of the validation.
(Also, the denormalized view is much simpler, and it would be very time-consuming for validators to check several dozen VegBIEN tables against their data.)
This means that it was up to the validator to request the addition of columns to analytical_stem_view, if they required this to sign off on their validation.
Thus, there may be mis-mappings in the non-core columns that were not part of analytical_stem_view, in spite of a successful validation.

Note: analytical_stem.institutionCode has been renamed to datasource as of r6795 (2012-12-11). institutionCode now refers to the subprovider instead.

Queries

To get rows for a datasource, use one of the following queries.

  • If analytical_stem has not yet been created, it's OK to use analytical_stem_view instead of analytical_specimen/plot.
  • If you change analytical_stem_view to fix a bug, you must also reload analytical_stem because analytical_specimen/plot use this materialized table instead of the view.

Specimens

Input and output data side-by-side:

After running the query, split the input and output data into separate spreadsheet tabs

SELECT *
FROM "<datasource>"."Specimen" 
LEFT JOIN /*r#.*/analytical_specimen ON
    analytical_specimen."datasource" = '<datasource>'
AND analytical_specimen."occurrenceID" = "Specimen"."occurrenceID" 
ORDER BY "Specimen".row_num
LIMIT 100;

Plots

Input data must be retrieved separately.

Output data:

SELECT *
FROM /*r#.*/analytical_plot
WHERE "datasource" = '<datasource>'
LIMIT 100;

ACAD

Source data used is on vegbiendev in /home/bien/inputs/ACAD/Specimen/occurrence.txt

2013-3-6

See ACAD.2013-3-6.100_rows.xls and ACAD.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-3-7)

Excellent. No issues. Mark all these sources [NY, BRIT, ACAD] as done.

query:

SELECT *
FROM "ACAD"."Specimen" 
LEFT JOIN /*r7855.*/analytical_specimen ON
    analytical_specimen."datasource" = 'ACAD'
AND analytical_specimen."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/ACAD/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-26

See ACAD.2013-2-26.100_rows.xls and ACAD.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27, ACAD.2013-2-26.100_rows_bb.xls)

query:

SELECT *
FROM "ACAD"."Specimen" 
LEFT JOIN /*r7721.*/analytical_specimen ON
    analytical_specimen."datasource" = 'ACAD'
AND analytical_specimen."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/ACAD/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-20

See ACAD.2013-2-20.100_rows.xls and ACAD.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-21, ACAD.2013-2-20.100_rows_bb.xls)

  • FIXED: collectionCode:
    See spreadsheet "vegbien_identifiers". For collectionCode, note "In general, use whatever the data provider provides, unless badly formed". They have provided a properly formed, globally unique urn from the "Biodiversity Collections Index". Just use it as provided (urn:lsid:biocol.org:col:13294)
  • FIXED: scientificName_verbatim:
    Still incorrect Include family only if no lower ranked name is provided.

query:

SELECT *
FROM "ACAD"."Specimen" 
LEFT JOIN /*r7620.*/analytical_specimen ON
    analytical_specimen."datasource" = 'ACAD'
AND analytical_specimen."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/ACAD/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-1-25

See ACAD.2013-1-25.100_rows.xls and ACAD.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-1-31, ACAD.2013-1-25.100_rows_bb.xls)

Validation complete. I found not problems with content. Great! This means we are done validating all of Canadensys.

I have suggested some changes to the layout of the specimen extract. Please apply these changes to all future validation extracts for all specimens from all sources, not just ACAD.

column name comments
datasource
institutionCode
collectionCode
catalogNumber
occurrenceID
country
stateProvince
county
locality
decimalLatitude
decimalLongitude
coordinateUncertaintyInMeters
coordinateSource_bien
georeferenceProtocol_bien
geovalid_bien
isNewWorld_bien
dateCollected
higherPlantGroup_bien
family
genus
speciesBinomial
taxonName
scientificNameAuthorship
scientificNameWithMorphospecies omit for all specimen validation extracts; this pertains only to plots
threatened_bien move to after `growthForm`
identifiedBy omit; this pertains to TNRS accepted name
dateIdentified omit; this pertains to TNRS accepted name
identificationRemarks omit; this pertains to TNRS accepted name
growthForm
cultivated_bien
cultivatedBasis_bien
recordedBy move these two columns to before `dateCollected`
recordNumber move these two columns to before `dateCollected`
Add the following columns before `family`:
family_verbatim As provided by author
taxonName_verbatim As provided by author
identifiedBy_verbatim As provided by author
dateIdentified_verbatim As provided by author
identificationRemarks_verbatim As provided by author
family_matched_tnrs As matched by TNRS
taxonName_matched_tnrs As matched by TNRS
Add the following column after `growthForm`
reproductiveCondition

query:

SELECT *
FROM "ACAD"."Specimen" 
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
    analytical_stem."datasource" = 'ACAD'
AND analytical_stem."occurrenceID" = "Specimen".id
ORDER BY "Specimen".row_num
LIMIT 100;

2013-1-15

See ACAD.50_rows.xls and ACAD.50_rows.csv
Input data is on the left, and output data is on the right, starting with the datasource column

query:

SELECT *
FROM "ACAD"."Specimen" 
LEFT JOIN /*r7023.*/analytical_stem ON
    analytical_stem."datasource" = 'ACAD'
AND analytical_stem."occurrenceID" = "Specimen".id
LIMIT 50;

ARIZ

Source data is on vegbiendev in /home/bien/inputs/ARIZ/_MySQL/MySQL.{schema,data}.sql

2012-12-11

Source data used is on vegbiendev in /home/bien/inputs/ARIZ/ARIZ_DiGIR_21012010.csv.tar.gz

See ARIZ.50_rows.xls and ARIZ.50_rows.csv

Brad Boyle's comments: (ariz_validation_report.xlsx)

Fixed?
coordinateUncertaintyInMeters Populate with the values in column 'CoordinatePrecision' in ariz_raw (see below). CoordinatePrecision does not exist in the raw data, so CoordinateUncertaintyInMeters is used instead
elevationInMeters This should equal (MinimumElevation+MaximumElevation)/2 if (a) both columns provide valid elevation values and MaximumElevation>=MinimumElevation). Currently on the value in MinimumElevation is being used. PRIORITY FEATURE REQUEST
plotArea_ha Omit this column from specimen data extracts FIXED
samplingProtocol Set to "Dried herbarium specimen" for all herbarium specimen records. FEATURE REQUEST
dateCollected For valid date ranges, is it possible to use the middle value of the dates provided, assuming dates are valid? See CatalogueNumber 10021. Use later date if dates split into two even groups. Thus, mid-date for this record would be 1916-08-29. FEATURE REQUEST
speciesBinomialWithMorphospecies This is correctly formed, but I recommend we call the column simply 'morphospecies' FIXED
scientificNameWithMorphospecies Incorrectly formed, but we don't need it. Omit this column FIXED
threatened This is not part of the original data providers data. In any view of data to reviewed by the data provider for validation, etc., I recommend suffixing the names of columns containing data added by BIEN with '_bien'. Thus, this field would become 'threatened_bien' FIXED
growthForm Rename to growthForm_bien. See 'threatened', above. The growthForm field actually contains data provided by the data provider, which is NULL for ARIZ because their DwC export for BIEN2 doesn't contain a growthForm column.
cultivated Rename to cultivated_bien. See 'threatened', above. FIXED (the suffix might not always apply because it's sometimes from the data provider and sometimes parsed from the locality description)
cultivatedBasis Rename to cultivatedBasis_bien. See 'threatened', above. FIXED
coverPercent Omit FIXED
diameterBreastHeight_cm Omit FIXED
height_m Omit FIXED
tag Omit FIXED
organismX_m Omit FIXED
organismY_m Omit FIXED
authorTaxonCode Omit FIXED
individualID Omit FIXED
authorStemCode Omit FIXED

Missing columns that should be included in validation view (names shown are from original data dump ariz_raw):

Fixed?
Locality The free-text locality description FIXED
FieldNotes This is the detailed specimen description (of the plant, not the locality). Not sure what field this corresponds to in VegCore. FIXED
HorizontalDatum =geodeticdatum in dwc FEATURE REQUEST
OriginalCoordinateSystem =verbatimcoordinatesystem in dwc FEATURE REQUEST
GeoRefMethod =georeferenceprotocol in dwc FEATURE REQUEST
CoordinatePrecision This column incorrectly used by ARIZ. It appears they are actually entering undertainty in meters; therefore values should be transferred to coordinateUncertaintyInMeters. The column is actually called CoordinateUncertaintyInMeters in the original data (nimoy:/home/bien_shared/raw_data/ariz/ARIZ_DiGIR_21012010.csv.tar.gz), so there is no need to remap it. However, NYBG does use coordinatePrecision incorrectly, and I remapped it to coordinateUncertaintyInMeters.
LatLongComments =georeferenceremarks in dwc FEATURE REQUEST
IdentifiedDate =identificationmodifier in dwc I think you mean dateIdentified FIXED
IdentificationModifier Ubducates undertainty of identifications. In general, I rather than returning verbatim values, I recomomend populating with contents of column 'Annotations' from the TNRS output. This field should come after 'dateIdentified'. PRIORITY FEATURE REQUEST

query:

SELECT * FROM "public.2012-12-06-23-19-52.r6672".analytical_stem WHERE "institutionCode" = 'ARIZ' LIMIT 50;

BRIT

Source data used is on vegbiendev in /home/bien/inputs/BRIT/MySQL.*.sql

2013-3-6

See BRIT.2013-3-6.100_rows.xls and BRIT.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-3-7)

Excellent. No issues. Mark all these sources [NY, BRIT, ACAD] as done.

query:

SELECT *
FROM "BRIT".specimen_flat
LEFT JOIN /*r7855.*/analytical_specimen ON
    analytical_specimen."datasource" = 'BRIT'
AND analytical_specimen."occurrenceID" = specimen_flat.specimen_id::text
WHERE specimen_flat.specimen_id >= 1660
ORDER BY specimen_flat.specimen_id
LIMIT 100;

import command:

make inputs/BRIT/import_scrub by_col=1 n=100 start=952; bin/make_analytical_db

2013-2-26

See BRIT.2013-2-26.100_rows.xls and BRIT.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27, BRIT.2013-2-26.100_rows_bb.xls)

  • FIXED: occurrenceRemarks:
    Weird. Looks like they imported full label information and mixed together locality, habitat, specimen description and even the intitutional footer, all in one field. This will be too dangerous to keep as is. Let's discuss what to do about it.
    if no Locality_Description, now ignore Notes_Plant because it's likely malformed
  • FIXED: decimalLatitude/decimalLongitude:
    BRIT looks like it's going to be a big headache to fix. Could you do me a favor and check to see if any of the BRIT records have non-NULL coordinates? I couldn't see any in the example extract.
    Yes, about 15% of their rows have a valid lat/long. These rows are not in the first 100, however.

query:

SELECT *
FROM "BRIT".specimen_flat
LEFT JOIN /*r7724.*/analytical_specimen ON
    analytical_specimen."datasource" = 'BRIT'
AND analytical_specimen."occurrenceID" = specimen_flat.specimen_id::text
ORDER BY specimen_flat.specimen_id
LIMIT 100;

import command:

make inputs/BRIT/import_scrub by_col=1 n=100; bin/make_analytical_db

CTFS

Source data used is in http://capstone.bradley.edu/~asingh2/files/bci_01April2011.zip

2012-12-17

See CTFS.2012-12-17.50_rows.xls and CTFS.2012-12-17.50_rows.csv

query:

SELECT
datasource AS "BIEN:datasource" 
-- Country
, country AS "Country.CountryName" 
-- Site
, "locationName" AS "Site.PlotID; Quadrat.QuadratID" 
, "decimalLatitude" AS "Site.Latitude" 
, "decimalLongitude" AS "Site.Longitude" 
, "elevationInMeters" AS "Site.Elevation" 
, "plotArea_ha" AS "Site.SizeOfSite" 
-- Census
, "dateCollected" AS "Census.StartDate" 
-- Tree
, "organismX_m" AS "Tree.x" 
, "organismY_m" AS "Tree.y" 
-- Stem
, tag AS "Stem.StemTag or Tree.Tag" 
, "authorStemCode" AS "Stem.StemID" 
, "diameterBreastHeight_cm" AS "Stem.DBH" 
, "individualID" AS "Measurement.MeasureID" 
-- Family
, "family" AS "Family.Family" 
, "genus" AS "Genus.Genus" 
, "speciesBinomial" AS "Genus.Genus SpecificEpithet.SpeciesName" 
, "scientificNameAuthorship" AS "SpecificEpithet.SpeciesAuthority" 
-- BIEN
, "higherPlantGroup" AS "BIEN:higherPlantGroup" 
, "isNewWorld" AS "BIEN:isNewWorld" 
, geovalid AS "BIEN:geovalid" 
, cultivated AS "BIEN:cultivated" 
, threatened AS "BIEN:threatened" 
FROM "public.2012-12-14-15-22-55.r6870".analytical_stem
WHERE "datasource" = 'CTFS' LIMIT 50;

2012-12-12

See CTFS.50_rows.xls and CTFS.50_rows.csv

*Rick Condit* wanted the data in a form similar to CTFS itself, which has been provided (above)

query:

SELECT * FROM "public.2012-12-11-11-46-55.r6795".analytical_stem WHERE "datasource" = 'CTFS' LIMIT 50;

CVS

see CVS validation

HVAA

Source data used is on vegbiendev in /home/bien/inputs/HVAA/Specimen/Herbario_occur_1360871068.csv

2013-2-26

See HVAA.2013-2-26.100_rows.xls and HVAA.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27)

The following sources passed validation and are complete:

REMIB
HVAA
SpeciesLink
SALVIAS
TEAM

query:

SELECT *
FROM "HVAA"."Specimen" 
LEFT JOIN /*r7722.*/analytical_specimen ON
    analytical_specimen."datasource" = 'HVAA'
AND analytical_specimen."occurrenceID" = "Specimen"."symbiotaId" 
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/HVAA/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-20

See HVAA.2013-2-20.100_rows.xls and HVAA.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-21, HVAA.2013-2-20.100_rows_bb.xls)

  • FIXED: scientificName_verbatim:
    Include family only when no other lower ranked name is provided
  • FIXED: ocurrenceRemarks:
    Missing. please include in all specimen validation extracts
    • FIXED: mapping this source only:
      =IF(input.occurrenceRemarks IS NULL, input.dynamicProperties,input.occurrenceRemarks)
      

query:

SELECT *
FROM "HVAA"."Specimen" 
LEFT JOIN /*r7620.*/analytical_specimen ON
    analytical_specimen."datasource" = 'HVAA'
AND analytical_specimen."occurrenceID" = "Specimen"."symbiotaId" 
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/HVAA/import_scrub by_col=1 n=100; bin/make_analytical_db

Madidi

See the *Madidi validation spreadsheet*: (Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.)

Loading Google Spreadsheet...

Source data used is on vegbiendev in /home/bien/inputs/Madidi/*/*.csv

2013-3-7

query:

SELECT *
FROM "Madidi"."IndividualObservation" 
RIGHT JOIN "Madidi"."LocationObservation" USING ("PlotID", "PlotInventoryName")
RIGHT JOIN "Madidi"."Location" USING ("PlotID")
LEFT JOIN /*r7870.*/analytical_plot ON
    analytical_plot."datasource" = 'Madidi'
AND analytical_plot."locationName" = "IndividualObservation"."PlotName" 
AND analytical_plot."subplot" = "IndividualObservation"."Sequence" 
AND analytical_plot."individualCode" = "IndividualObservation"."Tree" 
ORDER BY "IndividualObservation".row_num
LIMIT 100;

import command:

make inputs/Madidi/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-3-6

See Madidi.2013-3-6.100_rows.xls and Madidi.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

query:

SELECT *
FROM "Madidi"."IndividualObservation" 
RIGHT JOIN "Madidi"."LocationObservation" USING ("PlotID", "PlotInventoryName")
RIGHT JOIN "Madidi"."Location" USING ("PlotID")
LEFT JOIN /*r7870.*/analytical_plot ON
    analytical_plot."datasource" = 'Madidi'
AND analytical_plot."institutionCode" = "IndividualObservation"."Institutions" 
AND analytical_plot."collectionCode" IS NULL
AND analytical_plot."catalogNumber" = "IndividualObservation"."SpecimenID" 
ORDER BY "IndividualObservation".row_num
LIMIT 100;

import command:

make inputs/Madidi/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-3-5

See Madidi.2013-3-5.100_rows.xls and Madidi.2013-3-5.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-3-6)

Started working on this, but realized you sent me a specimen-data validation extract. This is plot data. Please send me a plot-format validation extract, containing plot codes, subplots, morphospecies names, etc.

query:

SELECT *
FROM "Madidi"."IndividualObservation" 
RIGHT JOIN "Madidi"."LocationObservation" USING ("PlotID", "PlotInventoryName")
RIGHT JOIN "Madidi"."Location" USING ("PlotID")
LEFT JOIN /*r7848.*/analytical_specimen ON
    analytical_specimen."datasource" = 'Madidi'
AND analytical_specimen."institutionCode" = "IndividualObservation"."Institutions" 
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "IndividualObservation"."SpecimenID" 
ORDER BY "IndividualObservation".row_num
LIMIT 100;

import command:

make inputs/Madidi/import_scrub by_col=1 n=100; bin/make_analytical_db

2012-12-5

See Madidi.50_rows.xls and Madidi.50_rows.csv

*Peter Jorgensen* wants the data refreshed first.

query:

SELECT * FROM "public.2012-12-04-03-46-35.r6569".analytical_stem WHERE "institutionCode" = 'Madidi' AND "organismX_m" IS NOT NULL LIMIT 50;

MO

Source data used is on vegbiendev in /home/bien/inputs/MO/Specimen/digirexport*.txt

2012-11-30

See MO.50_rows.xls and MO.50_rows.csv

Peter Jorgensen's comments: MO.50_rows.annotated.xlsx

query:

SELECT * FROM "public.2012-11-28-18-18-51.r6487".analytical_stem WHERE "institutionCode" = 'MO' ORDER BY "recordNumber" LIMIT 50;

NCU

Source data used is on vegbiendev in /home/bien/inputs/NCU/_archive/NCU-NCSC_2010-02-12.csv

2012-12-7

See NCU-NCSC.50_rows.xls and NCU-NCSC.50_rows.csv

Bob Peet's comments: (e-mail on 2012-12-9)

FIXED: The collection dates often read something like 2065 rather than 1965. The refresh uses 4-digit years, so no longer has this problem. 2-digit years can be supported for other datasources using the _fix_date() filter.
FIXED: Perhaps I misunderstand, but I also fail to see the verbatim name from the original record. Otherwise looks fine. scientificName_verbatim has been added to the analytical DB

query:

SELECT * FROM "public.2012-12-05-12-51-00.r6646".analytical_stem WHERE "institutionCode" = 'NCU-NCSC' LIMIT 50;

NVS

Source data used is on vegbiendev in /home/bien/inputs/NVS/_src/AKATARAWA 2008 .xls

2013-1-22

See NVS.2013-1-22.50_rows.xls and NVS.2013-1-22.50_rows.csv
Input and output data are in separate tabs

query:

SELECT *
FROM "NVS"."StemObservation" 
LEFT JOIN r7317.analytical_stem analytical_stem ON
    analytical_stem."datasource" = 'NVS'
AND analytical_stem."individualObservationID" = "StemObservation"."Item Obs ID"::text
ORDER BY "StemObservation".row_num 
LIMIT 50;

2012-12-21

See NVS.2012-12-21.50_rows.xls and NVS.2012-12-21.50_rows.csv. This includes just the columns with NVS data and omits taxonomic scrubbing for easier verification with the source data.

query: (using a test-only import)

SELECT
datasource
, "projectID" 
, "locationID" 
, "plotArea_ha" 
, "scientificName" 
, "growthForm" 
, "diameterBreastHeight_cm" 
, tag
, "taxonOccurrenceID" 
, "authorTaxonCode" 
, "individualObservationID" 
FROM analytical_stem
WHERE datasource = 'NVS'
AND "individualObservationID" IS NOT NULL
ORDER BY "locationID" DESC
LIMIT 50;

2012-12-6

See NVS.50_rows.xls and NVS.50_rows.csv

Nick Spencer's comments: (e-mail on 2012-12-18)

Mostly in the right places, but I did note the following issues.

FIXED: Suggested mapping corrections

NVS export element -> VegCSV element Fixed?
Identifier -> tag Y
itemID -> individualID Y
Item Obs ID -> taxonOccurrenceID or occurrenceID (depending on definition)* Y
species code -> authorTaxonCode Y
Verbatim Code -> authorStemCode Yes, but actually corresponds to authorTaxonCode

1. Coordinates

The requested plot coordinate output for a spread sheet export of NVS data is a map coordinate (as opposed to lat long). Obviously this doesn’t suit a multi data provider data repository such as BIEN. I will discuss with Shirley about altering our output to include a lat long. However, I wondered if you require a standard WGS84 value or in fact have to deal with various projections during your import process? How do you tell from the import data you have been given, which coordinate system was used?

2. *taxonOccurrenceID or occurrenceID

What is the definition or practical difference between these two elements? Our item Obs ID is a unique ID per stem observation, not a unique ID of the occurrence of a taxon (regardless of how many individuals occurrences there were, i.e. 4 tree of species Pseudowintera axillaris and a cover estimate of the same species by strata)

There is obviously a more rich set of attributes in the NVS export cf with what you mapped including other sources of occurrence data as well as cover estimates by strata. Have you tested these other data types in other data sets?

query:

SELECT * FROM "public.2012-12-05-12-51-00.r6646".analytical_stem WHERE "institutionCode" = 'NVS' LIMIT 50;

NY

Source data used is on vegbiendev in /home/bien/inputs/NY/Specimen/NYSpecimenDataAmericas.csv

2013-3-6

See NY.2013-3-6.100_rows.xls and NY.2013-3-6.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-3-7)

Excellent. No issues. Mark all these sources [NY, BRIT, ACAD] as done.

query:

SELECT *
FROM "NY"."Ecatalog_all" 
LEFT JOIN /*r7854.*/analytical_specimen ON
    analytical_specimen."datasource" = 'NY'
AND analytical_specimen."institutionCode" = "Ecatalog_all"."InstitutionCode" 
AND analytical_specimen."collectionCode" = "Ecatalog_all"."CollectionCode" 
AND analytical_specimen."catalogNumber" = "Ecatalog_all"."CatalogNumber" 
ORDER BY "Ecatalog_all".pkey
LIMIT 100;

import command:

make inputs/NY/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-26

See NY.2013-2-26.100_rows.xls and NY.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27, NY.2013-2-26.100_rows_bb.xls)

  • FIXED: occurrenceRemarks:
    Do not preprend "plant fungus: "

query:

SELECT *
FROM "NY"."Ecatalog_all" 
LEFT JOIN /*r7722.*/analytical_specimen ON
    analytical_specimen."datasource" = 'NY'
AND analytical_specimen."institutionCode" = "Ecatalog_all"."InstitutionCode" 
AND analytical_specimen."collectionCode" = "Ecatalog_all"."CollectionCode" 
AND analytical_specimen."catalogNumber" = "Ecatalog_all"."CatalogNumber" 
ORDER BY "Ecatalog_all".pkey
LIMIT 100;

import command:

make inputs/NY/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-20

See NY.2013-2-20.100_rows.xls and NY.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-21, NY.2013-2-20.100_rows_bb.xls)

  • FIXED: scientificName_verbatim:
    Include family only when no other lower ranked name is provided
  • FIXED: ocurrenceRemarks:
    Missing. please include in all specimen validation extracts
    • FIXED: mapping this source only:
      =input.PlantFungDescription
      

query:

SELECT *
FROM "NY"."Ecatalog_all" 
LEFT JOIN /*r7623.*/analytical_specimen ON
    analytical_specimen."datasource" = 'NY'
AND analytical_specimen."institutionCode" = "Ecatalog_all"."InstitutionCode" 
AND analytical_specimen."collectionCode" = "Ecatalog_all"."CollectionCode" 
AND analytical_specimen."catalogNumber" = "Ecatalog_all"."CatalogNumber" 
ORDER BY "Ecatalog_all".pkey
LIMIT 100;

import command:

make inputs/NY/import_scrub by_col=1 n=100; bin/make_analytical_db

2012-12-17

See NY.2012-12-17.50_rows.xls and NY.2012-12-17.50_rows.csv

query:

SELECT
datasource AS "BIEN:datasource" 
, "institutionCode" AS "InstitutionCode" 
, "collectionCode" AS "CollectionCode" 
, "catalogNumber" AS "CatalogNumber" 
, "scientificName" AS "ScientificName" 
, "family" AS "Family" 
, "genus" AS "Genus" 
, "speciesBinomial" AS "Genus Species" 
, "scientificNameAuthorship" AS "ScientificNameAuthor" 
, "identifiedBy" AS "IdentifiedBy" 
, "recordNumber" AS "FieldNumber" 
, "recordedBy" AS "Collector" 
, "dateCollected" AS "YearCollected-MonthCollected-DayCollected" 
, "country" AS "Country" 
, "stateProvince" AS "StateProvince" 
, "county" AS "County" 
, "decimalLongitude" AS "Longitude" 
, "decimalLatitude" AS "Latitude" 
, "coordinateUncertaintyInMeters" AS "CoordinatePrecision" 
, "elevationInMeters" AS "MinimumElevation" 
-- BIEN
, "higherPlantGroup" AS "BIEN:higherPlantGroup" 
, "isNewWorld" AS "BIEN:isNewWorld" 
, geovalid AS "BIEN:geovalid" 
, cultivated AS "BIEN:cultivated" 
, threatened AS "BIEN:threatened" 
FROM "public.2012-12-14-15-22-55.r6870".analytical_stem
WHERE "datasource" = 'NY'
AND "institutionCode" = 'NY'
AND "collectionCode" = 'Herbarium'
LIMIT 50;

With the analytical DB's schema: NY.2012-12-17.orig.50_rows.xls and NY.2012-12-17.orig.50_rows.csv

query:

SELECT *
FROM "public.2012-12-14-15-22-55.r6870".analytical_stem
WHERE "datasource" = 'NY'
AND "institutionCode" = 'NY'
AND "collectionCode" = 'Herbarium'
LIMIT 50;

2012-11-28

See NY.csv on vegbiendev.nceas.ucsb.edu at /home/bien/inputs/NY/verify/NY.csv (access via SSH/SFTP)

Barbara Thiers' comments: (e-mail on 2012-11-28)

I tested a sample set of 10 records, comparing the NY data extracted from BIEN 3.0, the NY dataset that was input into BIEN 3.0, and the original database at NY.

The data are correctly represented in BIEN 3.0, with one exception FIXED (see below). We use our Barcode number as our unique specimen identifier. In the input file, these data are stored in the column labelled "CatalogNumber." We also have an internal record number, which we don't use, but is included in the input file under the column "UniqueNYInternalRecordNumber."

In BIEN 3.0, the "UniqueNYInternalRecordNumber" (henceforth IRN) was used as the unique identifier (stored under "PlotName, and also concatenated with collector's number in RecordNumber.

FIXED: Instead, the barcode number (stored under "CatalogNumber" in the input file) should be used instead of the IRN. The barcode numbers should be unique, but if if is desirable to maintain the concatenated RecordNumber field, then it is the CatalogNumber that should be combined with the FieldNumber value from the input table.
however, some rows are missing a CatalogNumber1, so CatalogNumber is not a unique key for all rows.

1 1060 of 941,107 rows
SELECT * FROM "NY"."Ecatalog_all" WHERE "accessionNumber" IS NULL
Total query runtime: 942 ms.

FIXED: The FieldNumber is now in analytical_stem.recordNumber. It would be very helpful to also maintain the FieldNumber as a separate column in the analytical database, so that the combination of Collector + Collector number could be easily queried (this is the 4th most used type of query for our database).

The attached spreadsheet [ Mapping changes needed for NYBG data in BIEN 3.0.xlsx ] illustrates the mapping changes needed

query:

SELECT * FROM "public.2012-11-24-14-51-43.r6446".analytical_stem WHERE "institutionCode" = 'NY';

REMIB

Source data used is on vegbiendev in /home/bien/inputs/REMIB/Specimen.src/node.*.csv

2013-2-26

See REMIB.2013-2-26.100_rows.xls and REMIB.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27)

The following sources passed validation and are complete:

REMIB
HVAA
SpeciesLink
SALVIAS
TEAM

query:

SELECT *
FROM "REMIB"."Specimen" 
LEFT JOIN /*r7723.*/analytical_specimen ON
    analytical_specimen."datasource" = 'REMIB'
AND analytical_specimen."institutionCode" = "Specimen".acronym
AND analytical_specimen."catalogNumber" = "Specimen".accession_number
ORDER BY "Specimen"."Specimen.src.row_num" 
LIMIT 100;

import command:

make inputs/REMIB/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-20

See REMIB.2013-2-20.100_rows.xls and REMIB.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-21, REMIB.2013-2-20.100_rows_bb.xls)

  • FIXED: county:
    For this source only, "ND" = "no disponible = "not available". Just leave blank when county = "ND"
  • FIXED: locality:
    For this source only, "ND" = "no disponible = "not available". Just leave this field blank when both locality and habitat = "ND"
  • FIXED: decimalLatitude:
    This source uses 99 to indicate a missing value. Set both decimalLatitude and decimalLongitude to NULL if decimalLatitude=99
  • FIXED: decimalLongitude:
    This source uses -999 to indicate a missing value. Set both decimalLatitude and decimalLongitude to NULL if decimalLongitude=-999
  • FIXED: scientificName_verbatim:
    Include family only when no other lower ranked name is provided

query:

SELECT *
FROM "REMIB"."Specimen" 
LEFT JOIN /*r7620.*/analytical_specimen ON
    analytical_specimen."datasource" = 'REMIB'
AND analytical_specimen."institutionCode" = "Specimen".acronym
AND analytical_specimen."catalogNumber" = "Specimen".accession_number
ORDER BY "Specimen"."Specimen.src.row_num" 
LIMIT 100;

import command:

make inputs/REMIB/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-1-25

See REMIB.2013-1-25.100_rows.xls and REMIB.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

query:

SELECT *
FROM "REMIB"."Specimen" 
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
    analytical_stem."datasource" = 'REMIB'
AND analytical_stem."institutionCode" = "Specimen".acronym
AND analytical_stem."catalogNumber" = "Specimen".accession_number
ORDER BY "Specimen"."Specimen.src.row_num" 
LIMIT 100;

2013-1-15

See REMIB.50_rows.xls and REMIB.50_rows.csv
Input data is on the left, and output data is on the right, starting with the datasource column

query:

SELECT *
FROM "REMIB"."Specimen" 
LEFT JOIN /*r7023.*/analytical_stem ON
    analytical_stem."datasource" = 'REMIB'
AND analytical_stem."institutionCode" = "Specimen".acronym
AND analytical_stem."catalogNumber" = "Specimen".accession_number
LIMIT 50;

SALVIAS

see SALVIAS validation

SpeciesLink

Source data used is on vegbiendev in /home/bien/inputs/SpeciesLink/Specimen/specieslink.csv

2013-2-26

See SpeciesLink.2013-2-26.100_rows.xls and SpeciesLink.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27)

The following sources passed validation and are complete:

REMIB
HVAA
SpeciesLink
SALVIAS
TEAM

query:

SELECT *
FROM "SpeciesLink"."Specimen" 
LEFT JOIN /*r7723.*/analytical_specimen ON
    analytical_specimen."datasource" = 'SpeciesLink'
AND analytical_specimen."occurrenceID" = "Specimen"."dwc_dwcore_GlobalUniqueIdentifier" 
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/SpeciesLink/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-20

See SpeciesLink.2013-2-20.100_rows.xls and SpeciesLink.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-22, SpeciesLink.2013-2-20.100_rows_bb.xls)

  • FIXED: scientificName_verbatim:
    Include family only when no other lower ranked name is provided
  • FIXED: identificationRemarks:
    Incorrectly formed. The values shown are the field description of the specimen, and should be in columnn occurrenceRemarks (see below). As far as I can tell, identificationRemarks were not supplied for this source.
  • FIXED: ocurrenceRemarks:
    Missing. please include in all specimen validation extracts
    • FIXED: mapping this source only: =input.dwc_terms_occurrenceRemarks

query:

SELECT *
FROM "SpeciesLink"."Specimen" 
LEFT JOIN /*r7632.*/analytical_specimen ON
    analytical_specimen."datasource" = 'SpeciesLink'
AND analytical_specimen."occurrenceID" = "Specimen"."dwc_dwcore_GlobalUniqueIdentifier" 
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/SpeciesLink/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-1-25

See SpeciesLink.2013-1-25.100_rows.xls and SpeciesLink.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-1-31, SpeciesLink.2013-1-25.100_rows_bb.xls)

Content passes validation. I have the same format comments as for other specimen sources.

I did wonder why there were NULL values in the two analytical db fields geovalid_bien and isNewWorld_bien.
Rows that don't have both coordinates and a placename can't be geovalidated, so geovalid_bien is null. isNewWorld_bien is null whenever there isn't a scrubbed name match.

See attached [SpeciesLink.2013-1-25.100_rows_bb.xls].

column name comments
datasource
institutionCode
collectionCode
catalogNumber
occurrenceID
country
stateProvince
county
locality
decimalLatitude
decimalLongitude
coordinateUncertaintyInMeters
coordinateSource_bien
georeferenceProtocol_bien
geovalid_bien Should be 0,1 only. Why are some values NULL?
isNewWorld_bien Should be 0,1 only. Why are some values NULL?
dateCollected
higherPlantGroup_bien
family
genus
speciesBinomial
taxonName
scientificNameAuthorship
scientificNameWithMorphospecies omit for all specimen validation extracts; this pertains only to plots
threatened_bien move to after `growthForm`
identifiedBy omit; this pertains to TNRS accepted name
dateIdentified omit; this pertains to TNRS accepted name
identificationRemarks omit; this pertains to TNRS accepted name
growthForm
cultivated_bien
cultivatedBasis_bien
recordedBy move these two columns to before `dateCollected`
recordNumber move these two columns to before `dateCollected`
Add the following columns before `family`:
family_verbatim As provided by author
taxonName_verbatim As provided by author
identifiedBy_verbatim As provided by author
dateIdentified_verbatim As provided by author
identificationRemarks_verbatim As provided by author
family_matched_tnrs As matched by TNRS
taxonName_matched_tnrs As matched by TNRS
Add the following column after `growthForm`
reproductiveCondition

query:

SELECT *
FROM "SpeciesLink"."Specimen" 
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
    analytical_stem."datasource" = 'SpeciesLink'
AND analytical_stem."occurrenceID" = "Specimen"."dwc_dwcore_GlobalUniqueIdentifier" 
ORDER BY "Specimen".row_num
LIMIT 100;

TEAM

Source data used is on vegbiendev in /home/bien/inputs/TEAM/Specimen.src/node.*.csv

2013-2-20

See TEAM.2013-2-20.100_rows.xls and TEAM.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-26)

[Re Plot data validations for SALVIAS and TEAM:]
Both look good. The only problem I could find was the incorrectly formed scientificName_verbatim, but I believe you have fixed that now, correct? [yes] If that issue is fixed then these are both good and I do not need to see them again. Consider them done.

query:

SELECT *
FROM "TEAM"."TEAM_Sites" 
RIGHT JOIN "TEAM"."VT" ON "VT"."1ha Plot Number" = "TEAM_Sites"."Name" 
LEFT JOIN /*r7629.*/analytical_plot ON
    analytical_plot."datasource" = 'TEAM'
AND analytical_plot."individualObservationID" = "VT"."Id"::text
ORDER BY "VT".row_num
LIMIT 100;

import command:

make inputs/TEAM/import_scrub by_col=1 n=100; bin/make_analytical_db

TEX

see TEX validation

U

Source data used is on vegbiendev in /home/bien/inputs/U/Specimen/UtrechtHerbarium.csv

2013-2-26

See U.2013-2-26.100_rows.xls and U.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-27, U.2013-2-26.100_rows_bb.xls)

  • FIXED: institutionCode:
    Please populate. Should be "U"
  • taxonName_verbatim:
    Still incorrectly formed. Include family only when no other lower ranked name is provided.
    no, the family must be included in what's sent to TNRS

query:

SELECT *
FROM "U"."Specimen" 
LEFT JOIN /*r7723.*/analytical_specimen ON
    analytical_specimen."datasource" = 'U'
AND analytical_specimen."occurrenceID" = "Specimen"."ID" 
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/U/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-20

See U.2013-2-20.100_rows.xls and U.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-2-21, U.2013-2-20.100_rows_bb.xls)

  • FIXED: locality: Was already fixed
    This source uses input column "Ecology" as "habitat". Please append contents of input column "Ecology"
  • FIXED: scientificName_verbatim:
    Include family only when no other lower ranked name is provided
  • FIXED: ocurrenceRemarks:
    Missing. please include in all specimen validation extracts. Input field "Notes" maps to occurrenceRemarks for this source.

query:

SELECT *
FROM "U"."Specimen" 
LEFT JOIN /*r7620.*/analytical_specimen ON
    analytical_specimen."datasource" = 'U'
AND analytical_specimen."occurrenceID" = "Specimen"."ID" 
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/U/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-1-25

See U.2013-1-25.100_rows.xls and U.2013-1-25.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Brad Boyle's comments: (e-mail on 2013-1-31, U.2013-1-25.100_rows_bb.xls)

[...] consider the validation of U to be complete. See attached spreadsheet [U.2013-1-25.100_rows_bb.xls], with the usual format comments.

Content for U is fine. Although, as mentioned in my earlier email, I have some concerns about how we are forming DwC fields such as occurrenceID. But if we make any changes, it will apply to the entire database, so let's discuss this issue as a group.

column name comments
datasource
catalogNumber
occurrenceID
country
stateProvince
county
locality
decimalLatitude
decimalLongitude
coordinateUncertaintyInMeters
coordinateSource_bien
georeferenceProtocol_bien
geovalid_bien
isNewWorld_bien
dateCollected
higherPlantGroup_bien
family
genus
speciesBinomial
taxonName
scientificNameAuthorship
scientificNameWithMorphospecies omit for all specimen validation extracts; this pertains only to plots
threatened_bien move to after `growthForm`
identifiedBy omit; this pertains to TNRS accepted name
dateIdentified omit; this pertains to TNRS accepted name
identificationRemarks omit; this pertains to TNRS accepted name
growthForm
cultivated_bien
cultivatedBasis_bien
recordedBy move these two columns to before `dateCollected`
recordNumber move these two columns to before `dateCollected`
Add the following columns before `family`:
family_verbatim As provided by author
taxonName_verbatim As provided by author
identifiedBy_verbatim As provided by author
dateIdentified_verbatim As provided by author
identificationRemarks_verbatim As provided by author
family_matched_tnrs As matched by TNRS
taxonName_matched_tnrs As matched by TNRS
Add the following column after `growthForm`
reproductiveCondition
Add the following columns after datasource (even if null):
institutionCode
collectionCode

query:

SELECT *
FROM "U"."Specimen" 
LEFT JOIN /*r7317.*/analytical_stem_view analytical_stem ON
    analytical_stem."datasource" = 'U'
AND analytical_stem."occurrenceID" = "Specimen"."ID" 
ORDER BY "Specimen".row_num
LIMIT 100;

2013-1-15

See U.50_rows.xls and U.50_rows.csv
Input data is on the left, and output data is on the right, starting with the datasource column

query:

SELECT *
FROM "U"."Specimen" 
LEFT JOIN /*r7023.*/analytical_stem ON
    analytical_stem."datasource" = 'U'
AND analytical_stem."occurrenceID" = "Specimen"."ID" 
LIMIT 50;

UNCC

Source data used is on vegbiendev in /home/bien/inputs/UNCC/Specimen/UNCC.csv

2013-2-26

See UNCC.2013-2-26.100_rows.xls and UNCC.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

query:

SELECT *
FROM "UNCC"."Specimen" 
LEFT JOIN /*r7870.*/analytical_specimen ON
    analytical_specimen."datasource" = 'UNCC'
AND (analytical_specimen."institutionCode" = "Specimen".herbarium
    OR analytical_specimen."institutionCode" IS NULL AND "Specimen".herbarium IS NULL
    OR analytical_specimen."institutionCode" = 'UNCC' AND "Specimen".herbarium = 'UNCCD'
)
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/UNCC/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-2-26

See UNCC.2013-2-26.100_rows.xls and UNCC.2013-2-26.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

Bob Peet's and Brad Boyle's comments: (e-mails on 2013-3-5, UNCC-validate_rkp_bb_rkp.xlsx)

Input Output Note Brad's comments Bob's comments on Brad's comments Brad's final comments Fixed?
row_num Might be good to save this If this is a database ID, then I suggest Aaron save it under datasetRecordID ok
datasource All records currently = UNCC; probably should be where you got the data = NCU Bob: Index Herbariorum does not recognize UNC, only UNCC. These are UNCC collections, so I suggest we use UNCC as both the datasource (an internal BIEN field) and institutionCode (a DwC field). I suppose UNCC is less confusing than NCU, but technically I got the data from NCU which got it from UNCC. Prefer to use UNCC FIXED
accession Catalog Number OK, I think. Need to chek with Brad regardss our final decisions on this This looks good to me. If accessionNumber is provided by the data provider, is unique and used for every observation, it should be used at the catalogNumber ok
herbarium institutionCode Wrong. Herbarium is where UNCC got the collection. For institution code you want UNCC for all records I believe Bob is right.` institutionCode` refers to the institution where the specimen is stored, which in this case is UNCC for all specimens. I believe `herbarium` is populated with a value other than UNCC when it was received as a duplicate from another herbarium. I don't think there is any point in storing the other values as there is not guarantee that another duplicate is stored at that institution. They may simply be fowarding a specimen they don't want; we have no way of knowing. So yes, make all values "UNCC" omit herbarium OK FIXED: We don't currently have separate fields for this in VegBIEN or the analytical DB, so I just mapped it to a placeholder VegCore name.
occurrenceID ?? What is this? This should be populated for all sources. Aaron, please use my recommended formula is in the spreadsheet "vegbien_identifiers", based on the DwC specification.
family family_verbatim; AND ScientificName_verbatim (part 1 of 3) OK OK
genus OMITTED OK
species OMITTED OK
usdarank OMITTED OK
infrarank OMITTED OK
SciName ScientificName_verbatim (part 2 of 3) ScientificName_verbatim = family + SciName + authors; Note sure why family was concatenated with SciName, and then why were the authors concatenated and these listed a second time as authors. The true verbatim scientificName is to my thinking just SciName I think Aaron's formulation is correct, except for one record ("ZANNICHELLIACEAE (KEUTZ.) CORRELL" should not include the authority). Bob, you weren't part of the entire discussion, but the problem is that DwC defines scientificName as the taxon name PLUS authority. I don't much like the DwC definition either, but we decided to follow DwC. Because we didn't want to keep the authority separate from the name, we created a new field, taxonName_verbatim, which contain the lowest-level taxon name applied to the observation, MINUS the authority. That is why family sometimes appears in this column. If the specimen is only identified to order, the field will contain an ordinal name. Etc. Make sense? Makes no sense to me to always concatenate family + genus + species to render ScientificName_verbatim as a trinomial. Aaron populated this as family + SciName + authors, whereas it should be SciName + authors except for those cases where there is no SciName and there is a family name. Bob, I think you may be looking at an earlier incorrect version of the validation extract. Originally, Aaron was dumping to this field the value we send to the TNRS, which includes family pre-pended. I have since asked him to place family (or a higher taxon) in this field only if no determination below the rank of genus is provided. So you should never see family + genus + species. In version I am looking at, the name is always a correctly formed taxon name according to the ICBN. Aaron: your current version is correct This happens when there is no input taxonName, causing the ranks and author to be concatenated together
authors scientificNameAuthorship_verbatim; AND ScientificName_verbatim (part 3 of 3) Note sure why this is saved in two places I only see it in one place, scientificNameAuthorship_verbatim. We also display scientificNameAuthorship_matched, but that is different: it's what the TNRS found after matching the name to Tropicos. Perhaps you answered this above, but the field 'authors' in the input is copied into TWO places in the output, first as the final third of ScientificName_verbatim and then as a separate scientificNameAuthorship_verbatum OK, this must definitely be a misunderstanding due to you using at an older version of the validation. Sorry about that.The current validation only contains only taxonName_verbatim (without authors) and scientificNameAuthorship_verbatum (the authors only). I would have preferred taxonNameAuthorship_verbatim, but as there was already a DwC name for the latter entity, we had to use it. Basically, DwC's ill-thought-out naming convention has screwed us up, but there's nothing we can do but use a new name if DwC has no unambiguous name for the taxon name without author. Aaron: your current version is correct.
collector recordedBy (part 1 of 4) OK
collector1 recordedBy (part 2 of 4) OK
collector2 recordedBy (part 3 of 4) OK
collector3 recordedBy (part 4 of 4) OK
collectno recordNumber OK
collmonth dateCollected (part 2 of 3) OK
collday dateCollected (part 3 of 3) OK
collyear dateCollected (part 1 of 3) OK
country country OK, but should probably convert into a standard character string Bob: Aaron's presentation is correct for now. We haven't yet implemented geovalidation for BIEN3. When we do, he will return both the verbatim and scrubbed political division names. ok
state stateProvince OK, but should probably convert into a standard character string Bob: Aaron's presentation is correct for now. We haven't yet implemented geovalidation for BIEN3. When we do, he will return both the verbatim and scrubbed political division names. ok
County county OK
Campus Omit Bob, what does this column mean? If it has anything to do with plants being collected on a campus (and therefore cultivated) we should definitely use it. Yes, this refers to occurrence on the UNCC campus, but I do not know the meaning of the codes. Presumably C = cultivated OK, this is important information in that case. Here's what I suggest. Aaron, if the column contains a non-null value, please put text "cultivated, collected on campus" in the column `cultivated_verbatim`. FIXED
leaves Omit Agree ok
flower reproductiveCOndition Just a code that will be meaningless; if we were to keep this we would also want fruit. I suggest we omit. Aaron, I suggest you get a formula from Bob for how to translate their codes for columns `flower` and `fruit` into the plain English terms "flower", "fruit", and "fertile" (the latter in the case of ferns, gymnosperms). This is definitely important information which we should not discard. And yes, I agree that we do not need to keep the information in the columns `leaves` and `root`. Flower = AFIJMN, Fruit = 2AHIMOQS, definitions unknown Bob: should we interpret flowers (or buds) present if flower is not null? And should we interpret fruits present if "fruit" is not null? I would like to capture this phenology information if at all possible. FIXED: Mapped using translation
fruit Omit No. See above. ok
root Omit Agree.
locality locality (in part) OK? If we could figure out a way to separate the specimen description, I would prefer to append everything from all the `comment` columns pertaining to the locality description in this field. But can't think of a way to do that. Bob, suggestions? Too inconsistent to for a simple answer
habitat locality (in part) OK??? I might place habitat as yet another comment and not part of locality We have in general been appending the habitat description to the verbatim locality description. We do this because of the inconsistent practices of herbaria. Some keep them in separate fields, and some (like UNCC and ARIZ) keep them in separate fields.
comment1 Missing -- need to preserve; not sure variable name I agree that this information is important, but the four denormalized columns unpredictable mix locality description with habitat description with the specimen description. It is crucially important for both trait data mining and for detecting cultivated specimens, that the specimen description be preserved. But I'm at a loss as to what Aaron should do here. Bob, can you suggest an algorithm whereby Aaron could separate the locality description from the specimen description? Hopelessly inconsistent. I would just place all of habitat and comments1-4 in a general notes field OK Aaron, dump everything in these fields to another field, and I will modify my script to check that field for the word cultivated but nothing else. Is there another DwC field we can use other than occurrenceRemarks (we need to keep locality descriptions out of that field). These are concatenated in occurrenceRemarks
comment2 Missing -- need to preserve; not sure variable name
comment3 Missing -- need to preserve; not sure variable name
comment4 Missing -- need to preserve; not sure variable name
loanto OMITTED OK
inorout OMITTED OK
sheetno OMITTED OK
cultivated cultivated_bien Translation seems inconsistent. Where do the ocassional seros come from? Aaron, please see my validation for source "TEX". I suggested we add the column `cultivatedVerbatim` to vegCore and vegBIEN, and display this column in all specimen validation views. As with TEX, the contents of UNCC's column `cultivated` should be dumped to `cultivatedVerbatim`.This means that for the validation extract you display, cultivatedVerbatim would be blank for all records except Accession Numbers 39691 and 19561. It wold be helpful to do some minimal translation. Bob, do the meanings of the codes "X" and "Y" differ in your field `CULTIVATED`? `cultivated_bien` is the results of post-import validation by BIEN; it is the results of combining parsing of the specimen description as well as using any cultivated flags supplied by the data provider, if any. Aaron, could you remind me why some values are blank (NULL) whereas others are zero? And why isn't Accession number 19561 flagged with a 1? Aaron, if this column contains a non-null value, please put text "cultivated" in column `cultivated_verbatim`, unless you have already populated `cultivated_verbatim` due to non-null value in column `Campus`. FIXED: all non-NULL values will be assumed to be cultivated
.
"Where do the ocassional seros come from?": FIXED: The zeros were being populated as the result of locality parsing, and should have been NULL
filler OMITTED OK
decimalLatitude NA
decimalLongitude NA
coordinateUncertaintyInMeters NA
elevationInMeters NA
identifiedBy NA
dateIdentified NA
identificationRemarks NA
identificationRemarks Header error. This is familyName_matched. ??? I don't understand Bob. Looks corrected to me. All values for this column are blank in the extract I am looking at. Looks ok now. Previously the header was showing as a duplicate of identificationRemarks, but perhaps it was just my excel misbehaving. FIXED
taxonName_matched OK (should families be excluded here? I noticed a case where the cultivar names was dropped upon matching) Aaron's formulation is correct. taxonMatched contains the lowest taxon name matched, minus the authority. The TNRS doesn't process cultivar names, so they are dumped to `unmatchedTerms` (not displayed in specimen validation extracts) ok
scientificNameAuthorship_matched OK
higherPlantGroup_bien
family (I noticed a case wherefamilyName_matched was blank and a genus name was captured from the family_verbatim; suggests an error) Yes, I see that too. Accession number 39691. Genus "ELEAGNUS" was incorrectly placed in field `FAMILY` in the original data. I'm pretty sure there's no way the TNRS would have returned "Eleagnus" as a family. Aaron, is this a parsing error? ! Genera in the family column are not passed to TNRS
genus OK
taxonName OK
scientificNameAuthorship OK
growthForm NA?? Aaron: omit this from specimen validations. FIXED
threatened_bien source? Aaron: omit this from specimen validations.Sorry, I should have told you that earlier.Bob: not yet implemented, but we will us IUCN. FIXED
cultivatedBasis_bien NA?? Bob: for specimens flagged as cultivated (isCultivatedBien=1) this will contain a standard bit of text describing the reason it was flagged as cultivated (e.g., "Key words found in specimens description", etc.)

query:

SELECT *
FROM "UNCC"."Specimen" 
LEFT JOIN /*r7724.*/analytical_specimen ON
    analytical_specimen."datasource" = 'UNCC'
AND (analytical_specimen."institutionCode" = "Specimen".herbarium
    OR analytical_specimen."institutionCode" IS NULL AND "Specimen".herbarium IS NULL
    OR analytical_specimen."institutionCode" = 'UNCC' AND "Specimen".herbarium = 'UNCCD'
)
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 100;

2013-2-20

See UNCC.2013-2-20.100_rows.xls and UNCC.2013-2-20.100_rows.csv
Input and output data are in separate tabs. Refer to the VegCore data dictionary for column definitions.

query:

SELECT *
FROM "UNCC"."Specimen" 
LEFT JOIN /*r7634.*/analytical_specimen ON
    analytical_specimen."datasource" = 'UNCC'
AND (analytical_specimen."institutionCode" = "Specimen".herbarium OR analytical_specimen."institutionCode" IS NULL AND "Specimen".herbarium IS NULL)
AND analytical_specimen."collectionCode" IS NULL
AND analytical_specimen."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 100;

import command:

make inputs/UNCC/import_scrub by_col=1 n=100; bin/make_analytical_db

2013-1-24

See UNCC.2013-1-24.50_rows.xls and UNCC.2013-1-24.50_rows.csv
Input and output data are in separate tabs

query: (using one-time import)

SELECT *
FROM "UNCC"."Specimen" 
LEFT JOIN /*r7354.*/analytical_stem_view ON
    analytical_stem_view."datasource" = 'UNCC'
AND (analytical_stem_view."institutionCode" = "Specimen".herbarium OR analytical_stem_view."institutionCode" IS NULL AND "Specimen".herbarium IS NULL)
AND analytical_stem_view."collectionCode" IS NULL
AND analytical_stem_view."catalogNumber" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 50;

2013-1-22

See UNCC.50_rows.xls and UNCC.50_rows.csv
Input and output data are in separate tabs

Bob Peet's comments: (UNCC-validation-Peet.xlsx from e-mail on 2013-1-23)

BIEN output for validation Problem ? Field on UNCC dataset reported What is reported What I think this BIEN field should be Response
datasource !! none Index Herbariorum code for this collection = UNCC Where we obtained the information. Perhaps in all cases here "Robert Peet, NCU" or some code for me. This is actually the short name of the organization that the data is from
institutionCode !! herbarium The name or acronym in use by the institution that donated the collection to UNCC Dc "The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record." For all records here this should be "UNCC" This is the herbarium field as indicated in the source data. Different rows use different acronyms for this field.
occurrenceID !! accession Dc catalogNumber "An identifier (preferably unique) for the record within the data set or collection." Dc occurrenceID "An identifier for the Occurrence (as opposed to a particular digital record of the occurrence)…", Examples: "urn:lsid:nhm.ku.edu:Herps:32", "urn:catalog:FMNH:Mammal:145732". FIXED: I'm assuming you meant to switch the mapping from occurrenceID to catalogNumber rather than the other way around, as this field was previously mapped to occurrenceID
country country UNCC provides their own idiosyncratic country codes and this is the best place. Ideally these would be converted into ISO 3166 country codes and placed in Dc "CountryCode"
stateProvince state Idiosyncratic UNCC abbreviations for state/province
county county County
locationID !! accession Identical to OccurrenceID; Dc catalogNumber "An identifier (preferably unique) for the record within the data set or collection." Dc "An identifier for the set of location information (data associated with dcterms:Location). May be a global unique identifier or an identifier specific to the data set." This is primarily for plots data and will be the same as occurrenceID for specimens data
locationName !! accession Identical to OccurrenceID; Dc catalogNumber "An identifier (preferably unique) for the record within the data set or collection." Omit Some input fields are mapped to multiple places in VegBIEN to be used as unique keys for the various tables
individualCode !! collectno Collection number of the collector: Dc = "recordNumber" Omit
dateCollected collyear + collmonth + collday Dc eventDate
higherPlantGroup
family family Dc family
genus genus Dc genus
speciesBinomial genus + species Dc genus + Dc specificEpithet
taxonName ? genus + species I cannot tell from the records if 'infrarank' and 'infraname' are concatenated here as they should be This is the scrubbed name provided by TNRS
scientificNameAuthorship authors Dc scientificNameAuithorship
scientificNameWithMorphospecies ? genus + species In the example dataset 34950 has a morphospecies listed as a infraname and this is lost in the cleaning (perhaps not possible to parse) This is because Brad's formula for scientificNameWithMorphospecies uses the specificEpithet instead of the morphospecies when available, and this name also contains a specificEpithet (manihot)
threatened ? ? Source is not clear. This field is populated by the import process
cultivated ? cultivated ? I cannot check this without more data This field is populated by the import process
cultivatedBasis ? ? Source not clear This field is populated by the import process
RecordedBy collector + collector1 + collector2 + collector3 Dc RecordedBy
recordNumber collectno Dc recordNumber; duplicates individualCode, which should be omitted
TaxonOccurrenceID !! accession Duplicates occurrenceID; not sure what this is supposed to be This is primarily for plots data and will be the same as occurrenceID for specimens data
MISSING FROM BIEN VALIDATION
!! locality Dc verbatimLocality FIXED: These are now concatenated in the locality field
!! habitat Dc locationRemarks
!! comment1 Dc locationRemarks
!! comment2 Dc locationRemarks
!! comment3 Dc locationRemarks

query:

SELECT *
FROM "UNCC"."Specimen" 
LEFT JOIN r7317.analytical_stem_view ON
    analytical_stem_view."datasource" = 'UNCC'
AND analytical_stem_view."occurrenceID" = "Specimen".accession
ORDER BY "Specimen".row_num
LIMIT 50;

VegBank

see VegBank validation

Herbaria used

741 names

query:

SELECT name from "public.2012-11-28-18-18-51.r6487".sourcename order BY name;