Validation queries against NYBG specimen data - Table `nybg_raw` in mysql db `bien2_staging` - a verbatim import of NYSpecimenDataAmericas.csv, a DwC format dump of specimen records from NYBG specimen database, except for the three columns key, IdentifiedDate and CollectedDate, which where added and populated by me after importing raw data. I will ignores those columns in the following tests. 1. Count unique families, genera and species - the following is both NULL- and empty-string safe, as the the table has no empty strings, just NULLs SELECT ( SELECT COUNT(DISTINCT Family) FROM nybg_raw WHERE Family IS NOT NULL ) AS families, ( SELECT COUNT(DISTINCT Genus) FROM nybg_raw WHERE Genus IS NOT NULL ) AS genera, ( SELECT COUNT(DISTINCT Genus, Species) FROM nybg_raw WHERE Genus IS NOT NULL AND Species IS NOT NULL ) AS species; +----------+--------+---------+ | families | genera | species | +----------+--------+---------+ | 741 | 6379 | 40145 | +----------+--------+---------+ 1 row in set (2.78 sec) 2. Confirm that species match between original and VegBIEN. When joined to the equivalent query from VegBIEN, the following query should return 40145 records. SELECT DISTINCT CONCAT_WS(' ', Genus, Species) AS Species FROM nybg_raw WHERE Genus IS NOT NULL AND Species IS NOT NULL; 3. Confirm that genera match between original and VegBIEN. When joined to the equivalent query from VegBIEN, the following query should return 6379 records. SELECT DISTINCT Genus AS Genera FROM nybg_raw WHERE Genus IS NOT NULL; 4. Confirm that families match between original and VegBIEN. When joined to the equivalent query from VegBIEN, the following query should return 741 records. SELECT DISTINCT Family as Families FROM nybg_raw WHERE Family IS NOT NULL; 5. Visually inspect to make sure the scientific name components look correct. - ScientificName should = Genus + Species + Subspecies concatenated with intervening spaces (and a rank indicator such as 'var.', 'subsp.', 'f.', etc preceding the subspecies, all followed by the ScientificNameAuthor. It should be NULL if Genus is NULL. SELECT ScientificName, Genus, Species, Subspecies, ScientificNameAuthor FROM nybg_raw LIMIT 25; +---------------------------------------------------+---------------+------------------+------------+-------------------------+ | ScientificName | Genus | Species | Subspecies | ScientificNameAuthor | +---------------------------------------------------+---------------+------------------+------------+-------------------------+ | Cariniana estrellensis (Raddi) Kuntze | Cariniana | estrellensis | NULL | (Raddi) Kuntze | | Scrophulariaceae | NULL | NULL | NULL | NULL | | Myrsinaceae | NULL | NULL | NULL | NULL | | Urticaceae | NULL | NULL | NULL | NULL | | Myrsinaceae | NULL | NULL | NULL | NULL | | Myrsinaceae | NULL | NULL | NULL | NULL | | Myrsinaceae | NULL | NULL | NULL | NULL | | Aizoaceae | NULL | NULL | NULL | NULL | | Elaphoglossum ciliatum (C. Presl) T. Moore | Elaphoglossum | ciliatum | NULL | (C. Presl) T. Moore | | Hippocrateaceae | NULL | NULL | NULL | NULL | | Olacaceae | NULL | NULL | NULL | NULL | | Taxiphyllum taxirameum (Mitt.) M. Fleisch. | Taxiphyllum | taxirameum | NULL | (Mitt.) M. Fleisch. | | Monimiaceae | NULL | NULL | NULL | NULL | | Myristicaceae | NULL | NULL | NULL | NULL | | Myristicaceae | NULL | NULL | NULL | NULL | | Polygonaceae | NULL | NULL | NULL | NULL | | Porpidia albocaerulescens (Wulfen) Hertel & Knoph | Porpidia | albocaerulescens | NULL | (Wulfen) Hertel & Knoph | | Polygonaceae | NULL | NULL | NULL | NULL | | Polygonaceae | NULL | NULL | NULL | NULL | | Nyctaginaceae | NULL | NULL | NULL | NULL | | Caesalpiniaceae | NULL | NULL | NULL | NULL | | Polygonaceae | NULL | NULL | NULL | NULL | | Nyctaginaceae | NULL | NULL | NULL | NULL | | Miconia cornifolia (Desr.) Naudin | Miconia | cornifolia | NULL | (Desr.) Naudin | | Cryptantha flavoculata (A. Nelson) Payson | Cryptantha | flavoculata | NULL | (A. Nelson) Payson | +---------------------------------------------------+---------------+------------------+------------+-------------------------+ 25 rows in set (0.00 sec) Looks good. Let's check a case where Subspecies is not null: SELECT ScientificName, Genus, Species, Subspecies, ScientificNameAuthor FROM nybg_raw WHERE Subspecies IS NOT NULL LIMIT 15; +---------------------------------------------------------------------+--------------+--------------+-------------+-----------------------+ | ScientificName | Genus | Species | Subspecies | ScientificNameAuthor | +---------------------------------------------------------------------+--------------+--------------+-------------+-----------------------+ | Phyllachora graminis (Pers. & Fr.) Fuckel var. graminis | Phyllachora | graminis | graminis | NULL | | Gnaphalium attenuatum DC. var. sylvicola McVaugh | Gnaphalium | attenuatum | sylvicola | McVaugh | | Lepidium alyssoides A. Gray var. alyssoides | Lepidium | alyssoides | alyssoides | NULL | | Costus productus Gleason var. productus | Costus | productus | productus | NULL | | Sorocea muriculata Miq. subsp. muriculata | Sorocea | muriculata | muriculata | NULL | | Rollinia laurifolia Schltdl. var. reflexa R. E. Fr. | Rollinia | laurifolia | reflexa | R. E. Fr. | | Esenbeckia grandiflora Mart. subsp. grandiflora | Esenbeckia | grandiflora | grandiflora | NULL | | Esenbeckia grandiflora Mart. subsp. grandiflora | Esenbeckia | grandiflora | grandiflora | NULL | | Esenbeckia grandiflora Mart. subsp. grandiflora | Esenbeckia | grandiflora | grandiflora | NULL | | Renealmia thyrsoidea (Ruiz & Pav.) Poepp. & Endl. subsp. thyrsoidea | Renealmia | thyrsoidea | thyrsoidea | NULL | | Renealmia thyrsoidea (Ruiz & Pav.) Poepp. & Endl. subsp. thyrsoidea | Renealmia | thyrsoidea | thyrsoidea | NULL | | Costus spiralis (Jacq.) Roscoe var. spiralis | Costus | spiralis | spiralis | NULL | | Dimerocostus strobilaceus Kuntze subsp. gutierrezii (Kuntze) Maas | Dimerocostus | strobilaceus | gutierrezii | (Kuntze) Maas | | Brosimum acutifolium Huber subsp. obovatum (Ducke) C. C. Berg | Brosimum | acutifolium | obovatum | (Ducke) C. C. Berg | | Delphinium scaposum Greene var. andersonii (A. Gray) S. L. Welsh | Delphinium | scaposum | andersonii | (A. Gray) S. L. Welsh | +---------------------------------------------------------------------+--------------+--------------+-------------+-----------------------+ 15 rows in set (0.00 sec) Huh, well, close enough. They follow the format of including the species author. In the case of nominate subspecies (subspecific epithet the same as specific epithet) the subspecies author is by definition the same, so does not need to be repeated (complicated story I won't go into right now). The main problem here is that is the species author is not found in any field of these records, only in the concatednated ScientificName field. This shouldn't affect import of the taxonomic data, but something to bear in mind, especially when parsing these names with the TNRS. 5. Count number of records by institution The majority should be from 'NY'; these are standard herbarium acronyms SELECT InstitutionCode, COUNT(*) AS records FROM nybg_raw GROUP BY InstitutionCode; +-----------------+---------+ | InstitutionCode | records | +-----------------+---------+ | NY | 327995 | | RBR | 4 | | TEX | 1 | | UB | 7 | | US | 32 | | Z | 1 | +-----------------+---------+ 6 rows in set (1.98 sec) When joined to the equivalent query from VegBIEN should return 5 records. 6. Check CollectionCode - If records are from a single institution (such as this one) there should be only one value in this field. It will usually be "Herbarium" but not always. The main point is only one record. - Note: This test will not be valid if data are from an aggregator such as GBIF. SELECT DISTINCT CollectionCode FROM nybg_raw; +----------------+ | CollectionCode | +----------------+ | Herbarium | +----------------+ 1 row in set (0.94 sec) 7. Check that standard political divisions make sense - This is a visual inspection, but worth doing SELECT DISTINCT Country, StateProvince, County FROM nybg_raw LIMIT 25; +--------------------------+-----------------+----------------+ | Country | StateProvince | County | +--------------------------+-----------------+----------------+ | Peru | Madre de Dios | Man? | | Belize | Belize District | NULL | | Brazil | Bahia | Uru?uca | | Ecuador | Morona-Santiago | Sucua | | Brazil | Acre | Porto Valter | | Brazil | Bahia | Barreiras | | Brazil | Pernambuco | NULL | | Brazil | Bahia | Candeal | | Colombia | Risaralda | NULL | | Brazil | Acre | Porto Acre | | Brazil | Acre | Brasil?ia | | United States of America | Kentucky | Rowan Co. | | Brazil | Acre | Sena Madureira | | United States of America | Arkansas | Crawford Co. | | Brazil | Bahia | Una | | Brazil | Mato Grosso | NULL | | Ecuador | Morona-Santiago | NULL | | Grenada | Saint David | NULL | | United States of America | Nevada | Lander Co. | | Colombia | Caldas | NULL | | Bolivia | La Paz | NULL | | United States of America | Nevada | Douglas Co. | | Costa Rica | San Jos? | P?rez Zeled?n | | Peru | Jun?n | NULL | | Brazil | S?o Paulo | NULL | +--------------------------+-----------------+----------------+ 25 rows in set (0.00 sec) OK, looks reasonable. Note the character set issues though. Without inspecting in more detail, I'm not sure if this is from the original, an import-to-MySQL problem, or something else. 8. Check that CatalogNumber is unique - This DwC element is supposed to uniquely identify each record. In most herbarium DwC extracts, this *should* be the accession number given to the specimen. - If unique within the institution, a GUID for a given specimen can be formed by concatenating InstitutionCode+CollectionCode+CatalogNumber - However, for a variety of reasons, accession numbers do not always end up being unique (sometimes the same number accidentally gets stamped on two different specimens - Nonetheless, it is useful to check if CatalogNumber could be used as the identifier for specimens. First count & compare to total records: SELECT COUNT(*) as totalRecords, COUNT(DISTINCT CatalogNumber) as uniqueCatalogNumbers FROM nybg_raw; +--------------+----------------------+ | totalRecords | uniqueCatalogNumbers | +--------------+----------------------+ | 328040 | 327637 | +--------------+----------------------+ 1 row in set (1.86 sec) Hmmm, not looking good. SELECT CatalogNumber, COUNT(*) AS records FROM nybg_raw GROUP BY CatalogNumber HAVING records>1; +---------------+---------+ | CatalogNumber | records | +---------------+---------+ | NULL | 401 | | 788791 | 2 | | 799350 | 2 | +---------------+---------+ 3 rows in set (0.00 sec) Interesting. We could just exclude the 401 specimens which have not been assigned CatalogNumbers, and for the two numbers with duplicate entries, use only the most-recently added record. But first, lets look at non-null duplicates. SELECT b.CatalogNumber, DateLastModified, ScientificName, Collector, FieldNumber, CollectedDate FROM nybg_raw AS a JOIN ( SELECT CatalogNumber, COUNT(*) AS records FROM nybg_raw WHERE CatalogNumber IS NOT NULL GROUP BY CatalogNumber HAVING records>1 ) AS b ON a.CatalogNumber=b.CatalogNumber ORDER BY b.CatalogNumber ASC, DateLastModified DESC; +---------------+---------------------------+----------------------------------------------+----------------+-------------+---------------+ | CatalogNumber | DateLastModified | ScientificName | Collector | FieldNumber | CollectedDate | +---------------+---------------------------+----------------------------------------------+----------------+-------------+---------------+ | 788791 | 2009-3-12T12:11:32.000EST | Psilotum nudum (L.) P. Beauv. | W. J. Burchell | 3147 | 1826-09-24 | | 788791 | 2008-4-2T13:09:20.000EST | Lycopodium jussiaei Desv. ex Poir. | L. R. Landrum | 4523 | 1982-04-19 | | 799350 | 2008-4-8T14:31:41.000EST | Posoqueria latifolia (Rudge) Roem. & Schult. | P. G. Delprete | 6988 | 1998-12-05 | | 799350 | 2008-4-8T14:07:19.000EST | Posoqueria latifolia (Rudge) Roem. & Schult. | P. G. Delprete | 6988 | 1998-12-05 | +---------------+---------------------------+----------------------------------------------+----------------+-------------+---------------+ 4 rows in set (4.65 sec) Unfortunately, this case is not simple. CatalogNumber 799350 is indeed a duplicate entry of the same specimen (as identified by identical Collector+FieldNumber. In this case, the earlier record could be discarded. CatalogNumber 788791, however, refers to two different species (note different Collector+FieldNumberCollector+FieldNumber). The solution in this case is either to assign one of them an artificial CatalogNumber (e.g., 788791 and 788791b) or use some entirely different ID, such as 'UniqueNYInternalRecordNumber' (not a DwC element, apparently an artificial primary key generate by their database). Now let's check the records with NULL CatalogNumber: SELECT DateLastModified, ScientificName, Collector, FieldNumber, CollectedDate FROM nybg_raw WHERE CatalogNumber IS NULL ORDER BY CatalogNumber ASC, DateLastModified DESC LIMIT 12; +---------------------------+------------------------------------------------------------------------------+---------------+-------------+---------------+ | DateLastModified | ScientificName | Collector | FieldNumber | CollectedDate | +---------------------------+------------------------------------------------------------------------------+---------------+-------------+---------------+ | 2009-4-22T14:56:10.000EST | Dennstaedtia punctilobula (Michx.) T. Moore | D. E. Atha | 6820 | 2008-10-05 | | 2009-4-22T14:56:08.000EST | Hedeoma pulegioides (L.) Pers. | D. E. Atha | 6623 | 2008-08-09 | | 2009-4-14T15:47:40.000EST | Arisaema triphyllum (L.) Schott | D. E. Atha | 6063 | 2007-10-21 | | 2009-3-12T15:57:21.000EST | Otidea | R. E. Halling | 7355 | 1994-10-14 | | 2009-3-12T15:56:37.000EST | Phaeocollybia oligoporpa Singer | R. E. Halling | 7452 | 1995-06-22 | | 2009-3-12T15:56:36.000EST | Lactarius gerardii var. subrubescens (A. H. Sm. & Hesler) Hesler & A. H. Sm. | R. E. Halling | 7318 | 1994-06-21 | | 2009-3-12T15:56:31.000EST | Boletus sensibilis Peck | R. E. Halling | 7224 | 1994-06-03 | | 2009-3-12T15:56:28.000EST | Lentinellus ursinus (Fr.) K?hner | R. E. Halling | 7196 | 1993-11-24 | | 2009-3-12T15:55:59.000EST | Lactarius gerardii var. subrubescens (A. H. Sm. & Hesler) Hesler & A. H. Sm. | R. E. Halling | 7313 | 1994-06-20 | | 2009-3-12T15:55:59.000EST | Phaeocollybia oligoporpa Singer | R. E. Halling | 7423 | 1995-06-16 | | 2009-3-12T15:55:47.000EST | Lentinellus ursinus (Fr.) K?hner | R. E. Halling | 7188 | 1993-11-23 | | 2009-3-12T15:55:40.000EST | Xeromphalina kauffmanii A. H. Sm. | R. E. Halling | 7186 | 1993-11-23 | +---------------------------+------------------------------------------------------------------------------+---------------+-------------+---------------+ 12 rows in set (0.53 sec) These all look like good specimens. In light of this, I would favor using UniqueNYInternalRecordNumber as the identifier instead of CatalogNumber. By the way, these examples of anomalous CatalogNumbers (=accession numbers) are valuable information which could be returned to the data provider to help them clean up their collection and database. 9. Check that Latitude and Longitude are reasonable values - Strictly speak, the DwC Geospatial extension requires that these fields be labeled DecimalLatitude and DecimalLongitude. - Check that they are within the range of reasonable decimal values SELECT ( SELECT COUNT(*) FROM nybg_raw WHERE Latitude IS NOT NULL ) AS allLats, ( SELECT COUNT(*) FROM nybg_raw WHERE Latitude REGEXP '^(-|\\+){0,1}([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+|[0-9]+)$' ) AS decimalLats, ( SELECT COUNT(*) FROM nybg_raw WHERE Longitude IS NOT NULL ) AS allLongs, ( SELECT COUNT(*) FROM nybg_raw WHERE Longitude REGEXP '^(-|\\+){0,1}([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+|[0-9]+)$' ) AS decimalLongs; Pass, they are all numbers. Now let's look at some actual values: SELECT Latitude, Longitude FROM nybg_raw WHERE Latitude IS NOT NULL AND Longitude IS NOT NULL LIMIT 10; +--------------+--------------+ | Latitude | Longitude | +--------------+--------------+ | -11.80000000 | -71.40000000 | | 17.40000000 | -88.50000000 | | -14.40000000 | -39.00000000 | | -2.50000000 | -78.10000000 | | -8.10000000 | -72.80000000 | | -11.87000000 | -45.45000000 | | -7.60000000 | -35.50000000 | | -11.90000000 | -39.10000000 | | 4.70000000 | -75.60000000 | | -9.80000000 | -67.60000000 | +--------------+--------------+ 10 rows in set (0.00 sec) Looks good. Now for the acceptable range test. Each should return 0 records: SELECT ( SELECT COUNT(*) FROM nybg_raw WHERE Latitude>90 OR Latitude<-90 ) AS badLats, ( SELECT COUNT(*) FROM nybg_raw WHERE Longitude>180 OR Longitude<-180 ) AS badLongs; +---------+----------+ | badLats | badLongs | +---------+----------+ | 0 | 3 | +---------+----------+ 1 row in set (1.16 sec) Not bad, I usually find more. Those 3 bad longitudes will need to be fixed, but that's data-cleaning. 10. Check date values valid - there are two sets of dates that matter: collection date, and identification date - the live in separate day, month, year fields - collection date will usually (but not always) have a value, identification often does not First, just look at some values: SELECT DayCollected, MonthCollected, YearCollected FROM nybg_raw WHERE DayCollected IS NOT NULL OR MonthCollected IS NOT NULL OR YearCollected IS NOT NULL LIMIT 10; +--------------+----------------+---------------+ | DayCollected | MonthCollected | YearCollected | +--------------+----------------+---------------+ | 20 | 8 | 1984 | | 17 | 1 | 1994 | | 1 | 7 | 1991 | | 20 | 1 | 1989 | | 31 | 5 | 1994 | | 3 | 11 | 1987 | | 29 | 1 | 1999 | | 15 | 1 | 1997 | | 25 | 5 | 1989 | | 26 | 1 | 1995 | +--------------+----------------+---------------+ 10 rows in set (0.05 sec) Looks reasonable. Do the same for DayIdentified, MonthIdentified, YearIdentified (not shown). Now, test for non-integers: SELECT ( SELECT COUNT(*) FROM nybg_raw WHERE NOT(DayCollected REGEXP '^-?[0-9]+$') OR NOT(MonthCollected REGEXP '^-?[0-9]+$') OR NOT(YearCollected REGEXP '^-?[0-9]+$') ) AS badDateCollected, ( SELECT COUNT(*) FROM nybg_raw WHERE NOT(DayIdentified REGEXP '^-?[0-9]+$') OR NOT(MonthIdentified REGEXP '^-?[0-9]+$') OR NOT(YearIdentified REGEXP '^-?[0-9]+$') ) AS badDateIdentified; +------------------+-------------------+ | badDateCollected | badDateIdentified | +------------------+-------------------+ | 0 | 0 | +------------------+-------------------+ 1 row in set (1.88 sec) Awesome. Good to go. Some last check; these should all be done by eye; sorry, I have no idea how to automate this. 11. Check Locality - This should look like a verbal description of some location or vegetation SELECT Locality FROM nybg_raw WHERE Locality IS NOT NULL LIMIT 10; +--------------------------------------------------------------------------------------------------+ | Locality | +--------------------------------------------------------------------------------------------------+ | Parque Nacional del Manu. R?o Manu: Cocha Casha Station | | Belize Zoo, in savanna plot immediately behind zoo. Mile 31 on Western Highway | | 7.3 km N of Serra Grande on rd to Itacar?. Fazenda Lagoa do Conjunto Fazenda Santa Cruz | | Centro Shuar Yukutais, 8km SW of Sucua | | Bacia do Alto Juru?, Rio Juru?-mirim, margem esquerda, seringal Lucanha, colocacao Vista Alegre. | | Area do entorno da Cachoeira do Acaba-vida | | Sao Vicente F?rrer, Mata do Estado | | 8 km al N de Tanquinho, camino a Ichu | | Mun. Pereira, Parque Regional Ucumar?, ca. 22 km ESE of Pereira | | Reserva Florestal de Humait?. Travess?o da direita. | +--------------------------------------------------------------------------------------------------+ 10 rows in set (0.00 sec) Looks reasonable; 14. Check PlantFungusDescription - Should contain a free-text description of the plant in the specimen - Can be null - Just make sure values make sense if not null SELECT PlantFungusDescription FROM nybg_raw WHERE PlantFungusDescription IS NOT NULL LIMIT 10; +------------------------------------------------------------------------------------------------------------------------------------------+ | PlantFungusDescription | +------------------------------------------------------------------------------------------------------------------------------------------+ | with rough brown bark, almost no buttressing, 1 m diam; fallen fruit and leaves | | Tree height: 12.0 m, bole, 7.5 m | | Shrub to 2.5m, cauliflorous | | Arvore 8 m; infloresc?ncia pendente, raque vin?cea e c?lice persistente; fruto globoso verde com listas vin?ceas casca interna vermelha. | | Arbusto de ca. 1,6 m alt., com folhas semisuculentas, discolores. Botoes branco-esverdeados | | Arvore ca. de 7,0 m de altura. Bot?es e p?tals r?seos. Estames amarelos | | Postrada, hojas crasas, flores rosadas, ovario s?pero | | Arbusto com 3 m; frutos imaturos verdes com manchas brancas, globosos. | | Liana, frutos imaturos vermelhos. | | Arvore, botoes e flores. Corola e ?rgaos reprodutivos amarelos. | +------------------------------------------------------------------------------------------------------------------------------------------+ 10 rows in set (0.00 sec) Looks good. 13. Check Habitat, Vegetation - Check that these look like verbal descriptors of habitat or vegatation - In many herbarium DwC dumps, these fields will be null, depending on herbarium policy for what goes in them, or how they record data from labels. - Here you are most checking that, if not null, the fields contain reasonable-looking values. SELECT Habitat, Vegetation FROM nybg_raw WHERE Habitat IS NOT NULL OR Vegetation IS NOT NULL LIMIT 10; +------------------------------------------------------+-----------------------------------+ | Habitat | Vegetation | +------------------------------------------------------+-----------------------------------+ | Floodplain forest | NULL | | Southern Bahian moist forest | NULL | | Floresta de v?rzea. | NULL | | NULL | Cerrado | | Interior da mata | NULL | | NULL | Caatinga | | Terra firme. | NULL | | Floresta prim?ria de terra firme | Floresta prim?ria de terra firme. | | mixed hardwoods over exposed sandstone and limestone | NULL | | Floresta prim?ria de terra firme. | Floresta Prim?ria de Terra Firme. | +------------------------------------------------------+-----------------------------------+ 10 rows in set (0.00 sec) Both look reasonable.