Spot-checking workflow issues¶
input/output row matchup¶
The reason we need a custom validation query for each datasource is because each datasource has a different set of input columns that uniquely identify a row:
specimens | ACAD | occurrenceID |
NY | specimenHolderInstitutions + collection + accessionNumber | |
REMIB | specimenHolderInstitutions + accessionNumber (no collection) | |
plots | SALVIAS | individualObservationID |
Madidi | locationName + subplot + individualCode | |
VegBank | taxonOccurrenceID + aggregateOrganismObservationID |
e.g. NY uses the following join:
"NY"."Ecatalog_all"
LEFT JOIN analytical_specimen ON
analytical_specimen."datasource" = 'NY'
AND analytical_specimen."specimenHolderInstitutions" = "Ecatalog_all"."specimenHolderInstitutions"
AND analytical_specimen."collection" = "Ecatalog_all"."collection"
AND analytical_specimen."accessionNumber" = "Ecatalog_all"."accessionNumber"
Workflow improvement¶
I would suggest making edits directly to a wiki, and using a wiki platform with a built-in graphical editor. We could:
- use Google spreadsheets (just for validation tables)
- install our own instance of Redmine which has the necessary graphical editor plugin
- Redmine is easier to use and more capable than TWiki, but is missing a sorely-needed graphical editor in the default install
- install TWiki (for any format of comments)
- give everyone doing validation an account on iPlant's wiki (which uses Confluence, below)
- buy Atlassian Confluence, which is $10 for up to 10 users (but $600 for more users than that with an academic license)
Brad Boyle's comments¶
E-mail on 2013-1-24¶
Having now participated in spot-checking validation (SCV) for my own data, as well as having observed the process for other data providers from the sidelines, I have a few general suggestions that might improve the SCV process. Bob, I particularly welcome your comments are you have been through this process as a data provider.
Remember that the purpose of the SVC is to allow the data provider to confirm via visual checks that the data they provided was imported faithfully and completely into the database. I've notice that data providers are often confused by the format of the data export. This confusion wastes your time and theirs. To reduce this confusion, I suggest the following changes:
1) FIXED: Use different, custom exports (views) for checking specimens versus plots, omitting any irrelevant fields.
The specimen view should contain only those columns relevant only to specimens; plot views only those columns relevant to plots. For example, `individualCode` is meaningless for specimens and should be omitted from specimen SVC exports. Aaron, based on Bob's and my suggestions about what field to including and what fields to omit, you should now have basic validation schemas for plots and specimens (but also see suggestion #2).
2) FIXED: If possible, further customize the view for a particular data provider to display ALL fields originally provided.
Example: if the original data dump included a localityDescription, include that field in the validation export.
This is already done by putting the original fields in a separate tab, whose rows match up with the output data. As the analytical DB includes only a subset of the core DB and the input data, it will not contain all the fields originally provided.
3) FIXED: Rename columns to distinguish new BIEN validations from original data. I suggest using the suffix "bien_". Place all bien validation fields together, rightmost in the spreadsheet. This will avoid a lot of confusion for the data provider.
Example: `isCultivated_bien` is a new field populated algorithmically by BIEN; it is NOT the original cultivated field (if any) supplied by the original data provider.
4) If necessary, rename columns for clarity.
For example, it would be helpful to return the original scientific name (from the data provider), along with the matched scientific name and the accepted scientific name (inferred by BIEN using the TNRS). Bearing in mind suggestion number 3, I would rename these as follows: verbatimScientificName, scientificNameMatched_bien, scientificNameAccepted_bien.
This will require adding joins for each type of taxondetermination, as these names are now stored in separate taxondeterminations.
5) FIXED: Provide the original data together with the validation export in separate files, or as different worksheets in the same Excel spreadsheet.
Aaron, I believe you are now doing this for all SCV exports, correct?
Yes, the corresponding input data is included in a separate tab.
6) FIXED: Provide sufficient records to allow the user to inspect "edge cases" in their data set.
For example, it is important to allow the user to check that names of infraspecific taxa are being composed correctly. I'm not sure that there is an "ideal" number, but would suggest perhaps 100 records. Another option, once we have data provider pages on our website, would be to give the data provider the option of viewing the complete set of records from the raw and imported SCV dumps.
E-mail on 2013-1-16¶
In general these [REMIB, ACAD, U] look good. However, some shortcomings of the validation export prevent me from doing a complete validation.
Please see my suggested changes below, which apply to all specimen validations. Once you have made these changes, send me new dumps for these sources and I will complete the validation.
1. With regard to the new format that combines the original data dump and the imported records into one file,
> I am using a new validation format that works well for single-table specimens data, where input rows are LEFT JOINed with the corresponding output rows so that data can be compared within a row. Note that input data is on the left, and output data is on the right, starting with the datasource column.FIXED: This should work as long as you remind the data provider every time that the original records are on the left of column "datasource". But I can almost guarantee that someone will get mixed up. I think it would be safer and more transparent to send separate files for the pre and post-import records in separate files. They are now in separate tabs of the spreadsheet
2. Querying analytical_stem in its current form does not produce a result which is adequate for validating specimen data. You will need to make a new view appropriate for specimen data. Please modify analytical_stem to satisfy conditions 2.1, 2.2, and 2.3, below. Then create a new view which queries analytical_stem to produce a final table which also satisfies conditions 2.4 and 2.5.
2.1. FIXED: The following column is incorrect:
scientificName - The family is not part of the scientific name. Do not include family, unless the specimen is determined to family only and there are no other lower taxa.
2.2. The following columns from the pre-import side of the table are missing from the post-import side. Please include them.
dateIdentifiedidentificationRemarks
reproductiveCondition (I thought we had a corresponding column `phenology` in the vegbien schema, but I do not see it): There is not yet a field for this in the core DBhabit (I'm not sure where this lives in vegbien. Are you appending it to location.locationnarrative?) This is in growthFormlocation.locationnarrative [added later] See locality
2.3. Please add the following column, internal to BIEN but needed for validation:
scientificNameMatched - scientific name plus author matched by the TNRS. You will need to give it a custom name as this is not a DwC element. Taxon name and author are provided in separate columns instead, since the taxon name is now formed using a specific formula instead of passing through the TNRS name
2.4. FIXED: Omit the following columns. They are irrelevant to specimen records: Empty fields are now omitted, causing most of these fields to be omitted for specimens data
projectID
locationID - this is an internal BIEN database identifier and will be meaningless to the data provider. Instead include the full locality description (I assume this is location.locationnarrative?)
plotArea_ha
samplingProtocol
speciesBinomialWithMorphospecies: merged with scientificNameWithMorphospecies
scientificNameWithMorphospecies: keeping instead of scientificName
coverPercent
diameterBreastHeight_cm
height_m
tag
organismX_m
organismY_m
authorTaxonCode
individualObservationID
authorStemCode
2.5. FIXED: Column order. On the vegbien side of the table (after datasource) please reorder the following columns to make them easier to read by the data provider:
identifiedBy, dateIdentified, identificationRemarks - keep these columns together in this order, place them after scientificNameAuthorship