Project

General

Profile

NY aggregating validations ยป NYBG_validations.txt

Aaron Marcuse-Kubitza, 12/13/2011 12:07 PM

 
1
Validation queries against NYBG specimen data
2

    
3
- Table `nybg_raw` in mysql db `bien2_staging`
4
- a verbatim import of NYSpecimenDataAmericas.csv, a DwC format dump of 
5
  specimen records from NYBG specimen database, except for the three
6
  columns key, IdentifiedDate and CollectedDate, which where added and 
7
  populated by me after importing raw data. I will ignores those columns 
8
  in the following tests.
9
 
10
1. Count unique families, genera and species
11
- the following is both NULL- and empty-string safe, as the the table has
12
no empty strings, just NULLs
13

    
14
SELECT 
15
(
16
SELECT COUNT(DISTINCT Family) FROM nybg_raw WHERE Family IS NOT NULL
17
) AS families, 
18
(
19
SELECT COUNT(DISTINCT Genus) FROM nybg_raw WHERE Genus IS NOT NULL
20
) AS genera,
21
(
22
SELECT COUNT(DISTINCT Genus, Species) FROM nybg_raw 
23
WHERE Genus IS NOT NULL AND Species IS NOT NULL
24
) AS species;
25
+----------+--------+---------+
26
| families | genera | species |
27
+----------+--------+---------+
28
|      741 |   6379 |   40145 | 
29
+----------+--------+---------+
30
1 row in set (2.78 sec)
31

    
32
2. Confirm that species match between original and VegBIEN.
33

    
34
When joined to the equivalent query from VegBIEN, the following query
35
should return 40145 records.
36

    
37
SELECT DISTINCT CONCAT_WS(' ', Genus, Species) AS Species
38
FROM nybg_raw 
39
WHERE Genus IS NOT NULL AND Species IS NOT NULL;
40

    
41
3. Confirm that genera match between original and VegBIEN.
42

    
43
When joined to the equivalent query from VegBIEN, the following query
44
should return 6379 records.
45

    
46
SELECT DISTINCT Genus AS Genera
47
FROM nybg_raw 
48
WHERE Genus IS NOT NULL;
49

    
50
4. Confirm that families match between original and VegBIEN.
51

    
52
When joined to the equivalent query from VegBIEN, the following query
53
should return 741 records.
54

    
55
SELECT DISTINCT Family as Families
56
FROM nybg_raw 
57
WHERE Family IS NOT NULL;
58

    
59
5. Visually inspect to make sure the scientific name components look correct. 
60
- ScientificName should = Genus + Species + Subspecies concatenated with intervening spaces (and
61
a rank indicator such as 'var.', 'subsp.', 'f.', etc preceding the subspecies, all followed by the 
62
ScientificNameAuthor. It should be NULL if Genus is NULL.
63

    
64
SELECT ScientificName, Genus, Species, Subspecies, ScientificNameAuthor
65
FROM nybg_raw
66
LIMIT 25;
67
+---------------------------------------------------+---------------+------------------+------------+-------------------------+
68
| ScientificName                                    | Genus         | Species          | Subspecies | ScientificNameAuthor    |
69
+---------------------------------------------------+---------------+------------------+------------+-------------------------+
70
| Cariniana estrellensis (Raddi) Kuntze             | Cariniana     | estrellensis     | NULL       | (Raddi) Kuntze          | 
71
| Scrophulariaceae                                  | NULL          | NULL             | NULL       | NULL                    | 
72
| Myrsinaceae                                       | NULL          | NULL             | NULL       | NULL                    | 
73
| Urticaceae                                        | NULL          | NULL             | NULL       | NULL                    | 
74
| Myrsinaceae                                       | NULL          | NULL             | NULL       | NULL                    | 
75
| Myrsinaceae                                       | NULL          | NULL             | NULL       | NULL                    | 
76
| Myrsinaceae                                       | NULL          | NULL             | NULL       | NULL                    | 
77
| Aizoaceae                                         | NULL          | NULL             | NULL       | NULL                    | 
78
| Elaphoglossum ciliatum (C. Presl) T. Moore        | Elaphoglossum | ciliatum         | NULL       | (C. Presl) T. Moore     | 
79
| Hippocrateaceae                                   | NULL          | NULL             | NULL       | NULL                    | 
80
| Olacaceae                                         | NULL          | NULL             | NULL       | NULL                    | 
81
| Taxiphyllum taxirameum (Mitt.) M. Fleisch.        | Taxiphyllum   | taxirameum       | NULL       | (Mitt.) M. Fleisch.     | 
82
| Monimiaceae                                       | NULL          | NULL             | NULL       | NULL                    | 
83
| Myristicaceae                                     | NULL          | NULL             | NULL       | NULL                    | 
84
| Myristicaceae                                     | NULL          | NULL             | NULL       | NULL                    | 
85
| Polygonaceae                                      | NULL          | NULL             | NULL       | NULL                    | 
86
| Porpidia albocaerulescens (Wulfen) Hertel & Knoph | Porpidia      | albocaerulescens | NULL       | (Wulfen) Hertel & Knoph | 
87
| Polygonaceae                                      | NULL          | NULL             | NULL       | NULL                    | 
88
| Polygonaceae                                      | NULL          | NULL             | NULL       | NULL                    | 
89
| Nyctaginaceae                                     | NULL          | NULL             | NULL       | NULL                    | 
90
| Caesalpiniaceae                                   | NULL          | NULL             | NULL       | NULL                    | 
91
| Polygonaceae                                      | NULL          | NULL             | NULL       | NULL                    | 
92
| Nyctaginaceae                                     | NULL          | NULL             | NULL       | NULL                    | 
93
| Miconia cornifolia (Desr.) Naudin                 | Miconia       | cornifolia       | NULL       | (Desr.) Naudin          | 
94
| Cryptantha flavoculata (A. Nelson) Payson         | Cryptantha    | flavoculata      | NULL       | (A. Nelson) Payson      | 
95
+---------------------------------------------------+---------------+------------------+------------+-------------------------+
96
25 rows in set (0.00 sec)
97

    
98
Looks good. Let's check a case where Subspecies is not null:
99

    
100
SELECT ScientificName, Genus, Species, Subspecies, ScientificNameAuthor
101
FROM nybg_raw
102
WHERE Subspecies IS NOT NULL
103
LIMIT 15;
104
+---------------------------------------------------------------------+--------------+--------------+-------------+-----------------------+
105
| ScientificName                                                      | Genus        | Species      | Subspecies  | ScientificNameAuthor  |
106
+---------------------------------------------------------------------+--------------+--------------+-------------+-----------------------+
107
| Phyllachora graminis (Pers. & Fr.) Fuckel var. graminis             | Phyllachora  | graminis     | graminis    | NULL                  | 
108
| Gnaphalium attenuatum DC. var. sylvicola McVaugh                    | Gnaphalium   | attenuatum   | sylvicola   | McVaugh               | 
109
| Lepidium alyssoides A. Gray var. alyssoides                         | Lepidium     | alyssoides   | alyssoides  | NULL                  | 
110
| Costus productus Gleason var. productus                             | Costus       | productus    | productus   | NULL                  | 
111
| Sorocea muriculata Miq. subsp. muriculata                           | Sorocea      | muriculata   | muriculata  | NULL                  | 
112
| Rollinia laurifolia Schltdl. var. reflexa R. E. Fr.                 | Rollinia     | laurifolia   | reflexa     | R. E. Fr.             | 
113
| Esenbeckia grandiflora Mart. subsp. grandiflora                     | Esenbeckia   | grandiflora  | grandiflora | NULL                  | 
114
| Esenbeckia grandiflora Mart. subsp. grandiflora                     | Esenbeckia   | grandiflora  | grandiflora | NULL                  | 
115
| Esenbeckia grandiflora Mart. subsp. grandiflora                     | Esenbeckia   | grandiflora  | grandiflora | NULL                  | 
116
| Renealmia thyrsoidea (Ruiz & Pav.) Poepp. & Endl. subsp. thyrsoidea | Renealmia    | thyrsoidea   | thyrsoidea  | NULL                  | 
117
| Renealmia thyrsoidea (Ruiz & Pav.) Poepp. & Endl. subsp. thyrsoidea | Renealmia    | thyrsoidea   | thyrsoidea  | NULL                  | 
118
| Costus spiralis (Jacq.) Roscoe var. spiralis                        | Costus       | spiralis     | spiralis    | NULL                  | 
119
| Dimerocostus strobilaceus Kuntze subsp. gutierrezii (Kuntze) Maas   | Dimerocostus | strobilaceus | gutierrezii | (Kuntze) Maas         | 
120
| Brosimum acutifolium Huber subsp. obovatum (Ducke) C. C. Berg       | Brosimum     | acutifolium  | obovatum    | (Ducke) C. C. Berg    | 
121
| Delphinium scaposum Greene var. andersonii (A. Gray) S. L. Welsh    | Delphinium   | scaposum     | andersonii  | (A. Gray) S. L. Welsh | 
122
+---------------------------------------------------------------------+--------------+--------------+-------------+-----------------------+
123
15 rows in set (0.00 sec)
124

    
125
Huh, well, close enough. They follow the format of including the species author. In the case
126
of nominate subspecies (subspecific epithet the same as specific epithet) the subspecies 
127
author is by definition the same, so does not need to be repeated (complicated story I won't
128
go into right now). The main problem here is that is the species author is not found in any field 
129
of these records, only in the concatednated ScientificName field. This shouldn't affect import 
130
of the taxonomic data, but something to bear in mind, especially when parsing these names with
131
the TNRS.
132

    
133
5. Count number of records by institution
134
The majority should be from 'NY'; these are standard herbarium acronyms
135

    
136
SELECT InstitutionCode, COUNT(*) AS records
137
FROM nybg_raw
138
GROUP BY InstitutionCode;
139
+-----------------+---------+
140
| InstitutionCode | records |
141
+-----------------+---------+
142
| NY              |  327995 | 
143
| RBR             |       4 | 
144
| TEX             |       1 | 
145
| UB              |       7 | 
146
| US              |      32 | 
147
| Z               |       1 | 
148
+-----------------+---------+
149
6 rows in set (1.98 sec)
150

    
151
When joined to the equivalent query from VegBIEN should return 5 records.
152

    
153
6. Check CollectionCode
154
- If records are from a single institution (such as this one) there should be 
155
only one value in this field. It will usually be "Herbarium" but not always.
156
The main point is only one record.
157
- Note: This test will not be valid if data are from an aggregator such as GBIF.
158

    
159
SELECT DISTINCT CollectionCode 
160
FROM nybg_raw;
161
+----------------+
162
| CollectionCode |
163
+----------------+
164
| Herbarium      | 
165
+----------------+
166
1 row in set (0.94 sec)
167

    
168
7. Check that standard political divisions make sense
169
- This is a visual inspection, but worth doing
170

    
171
SELECT DISTINCT Country, StateProvince, County 
172
FROM nybg_raw
173
LIMIT 25;
174
+--------------------------+-----------------+----------------+
175
| Country                  | StateProvince   | County         |
176
+--------------------------+-----------------+----------------+
177
| Peru                     | Madre de Dios   | Man?           | 
178
| Belize                   | Belize District | NULL           | 
179
| Brazil                   | Bahia           | Uru?uca        | 
180
| Ecuador                  | Morona-Santiago | Sucua          | 
181
| Brazil                   | Acre            | Porto Valter   | 
182
| Brazil                   | Bahia           | Barreiras      | 
183
| Brazil                   | Pernambuco      | NULL           | 
184
| Brazil                   | Bahia           | Candeal        | 
185
| Colombia                 | Risaralda       | NULL           | 
186
| Brazil                   | Acre            | Porto Acre     | 
187
| Brazil                   | Acre            | Brasil?ia      | 
188
| United States of America | Kentucky        | Rowan Co.      | 
189
| Brazil                   | Acre            | Sena Madureira | 
190
| United States of America | Arkansas        | Crawford Co.   | 
191
| Brazil                   | Bahia           | Una            | 
192
| Brazil                   | Mato Grosso     | NULL           | 
193
| Ecuador                  | Morona-Santiago | NULL           | 
194
| Grenada                  | Saint David     | NULL           | 
195
| United States of America | Nevada          | Lander Co.     | 
196
| Colombia                 | Caldas          | NULL           | 
197
| Bolivia                  | La Paz          | NULL           | 
198
| United States of America | Nevada          | Douglas Co.    | 
199
| Costa Rica               | San Jos?        | P?rez Zeled?n  | 
200
| Peru                     | Jun?n           | NULL           | 
201
| Brazil                   | S?o Paulo       | NULL           | 
202
+--------------------------+-----------------+----------------+
203
25 rows in set (0.00 sec)
204

    
205
OK, looks reasonable. Note the character set issues though. Without inspecting
206
in more detail, I'm not sure if this is from the original, an import-to-MySQL 
207
problem, or something else.
208

    
209
8. Check that CatalogNumber is unique
210

    
211
- This DwC element is supposed to uniquely identify each record. In most herbarium
212
DwC extracts, this *should* be the accession number given to the specimen.
213
- If unique within the institution, a GUID for a given specimen can be formed by
214
concatenating InstitutionCode+CollectionCode+CatalogNumber
215
- However, for a variety of reasons, accession numbers do not always end up being 
216
unique (sometimes the same number accidentally gets stamped on two different 
217
specimens
218
- Nonetheless, it is useful to check if CatalogNumber could be used as 
219
the identifier for specimens.
220

    
221
First count & compare to total records:
222

    
223
SELECT COUNT(*) as totalRecords, COUNT(DISTINCT CatalogNumber) as uniqueCatalogNumbers
224
FROM nybg_raw;
225
+--------------+----------------------+
226
| totalRecords | uniqueCatalogNumbers |
227
+--------------+----------------------+
228
|       328040 |               327637 | 
229
+--------------+----------------------+
230
1 row in set (1.86 sec)
231

    
232
Hmmm, not looking good.
233

    
234
SELECT CatalogNumber, COUNT(*) AS records
235
FROM nybg_raw
236
GROUP BY CatalogNumber
237
HAVING records>1;
238
+---------------+---------+
239
| CatalogNumber | records |
240
+---------------+---------+
241
| NULL          |     401 | 
242
| 788791        |       2 | 
243
| 799350        |       2 | 
244
+---------------+---------+
245
3 rows in set (0.00 sec)
246

    
247
Interesting. We could just exclude the 401 specimens which have not been
248
assigned CatalogNumbers, and for the two numbers with duplicate entries, use only the
249
most-recently added record. But first, lets look at non-null duplicates.
250

    
251
SELECT b.CatalogNumber, DateLastModified, ScientificName, Collector, FieldNumber, 
252
CollectedDate
253
FROM nybg_raw AS a JOIN (
254
SELECT CatalogNumber, COUNT(*) AS records
255
FROM nybg_raw
256
WHERE CatalogNumber IS NOT NULL
257
GROUP BY CatalogNumber
258
HAVING records>1
259
) AS b
260
ON a.CatalogNumber=b.CatalogNumber
261
ORDER BY b.CatalogNumber ASC, DateLastModified DESC;
262
+---------------+---------------------------+----------------------------------------------+----------------+-------------+---------------+
263
| CatalogNumber | DateLastModified          | ScientificName                               | Collector      | FieldNumber | CollectedDate |
264
+---------------+---------------------------+----------------------------------------------+----------------+-------------+---------------+
265
| 788791        | 2009-3-12T12:11:32.000EST | Psilotum nudum (L.) P. Beauv.                | W. J. Burchell | 3147        | 1826-09-24    | 
266
| 788791        | 2008-4-2T13:09:20.000EST  | Lycopodium jussiaei Desv. ex Poir.           | L. R. Landrum  | 4523        | 1982-04-19    | 
267
| 799350        | 2008-4-8T14:31:41.000EST  | Posoqueria latifolia (Rudge) Roem. & Schult. | P. G. Delprete | 6988        | 1998-12-05    | 
268
| 799350        | 2008-4-8T14:07:19.000EST  | Posoqueria latifolia (Rudge) Roem. & Schult. | P. G. Delprete | 6988        | 1998-12-05    | 
269
+---------------+---------------------------+----------------------------------------------+----------------+-------------+---------------+
270
4 rows in set (4.65 sec)
271

    
272
Unfortunately, this case is not simple. 
273

    
274
CatalogNumber 799350 is indeed a duplicate entry of the same specimen (as identified by 
275
identical Collector+FieldNumber. In this case, the earlier record could be discarded.
276

    
277
CatalogNumber 788791, however, refers to two different species (note different 
278
Collector+FieldNumberCollector+FieldNumber). The solution in this case is either to assign one
279
of them an artificial CatalogNumber (e.g., 788791 and 788791b) or use some entirely different
280
ID, such as 'UniqueNYInternalRecordNumber' (not a DwC element, apparently an artificial primary 
281
key generate by their database).
282

    
283
Now let's check the records with NULL CatalogNumber:
284

    
285
SELECT DateLastModified, ScientificName, Collector, FieldNumber, 
286
CollectedDate
287
FROM nybg_raw 
288
WHERE CatalogNumber IS NULL
289
ORDER BY CatalogNumber ASC, DateLastModified DESC
290
LIMIT 12;
291
+---------------------------+------------------------------------------------------------------------------+---------------+-------------+---------------+
292
| DateLastModified          | ScientificName                                                               | Collector     | FieldNumber | CollectedDate |
293
+---------------------------+------------------------------------------------------------------------------+---------------+-------------+---------------+
294
| 2009-4-22T14:56:10.000EST | Dennstaedtia punctilobula (Michx.) T. Moore                                  | D. E. Atha    | 6820        | 2008-10-05    | 
295
| 2009-4-22T14:56:08.000EST | Hedeoma pulegioides (L.) Pers.                                               | D. E. Atha    | 6623        | 2008-08-09    | 
296
| 2009-4-14T15:47:40.000EST | Arisaema triphyllum (L.) Schott                                              | D. E. Atha    | 6063        | 2007-10-21    | 
297
| 2009-3-12T15:57:21.000EST | Otidea                                                                       | R. E. Halling | 7355        | 1994-10-14    | 
298
| 2009-3-12T15:56:37.000EST | Phaeocollybia oligoporpa Singer                                              | R. E. Halling | 7452        | 1995-06-22    | 
299
| 2009-3-12T15:56:36.000EST | Lactarius gerardii var. subrubescens (A. H. Sm. & Hesler) Hesler & A. H. Sm. | R. E. Halling | 7318        | 1994-06-21    | 
300
| 2009-3-12T15:56:31.000EST | Boletus sensibilis Peck                                                      | R. E. Halling | 7224        | 1994-06-03    | 
301
| 2009-3-12T15:56:28.000EST | Lentinellus ursinus (Fr.) K?hner                                             | R. E. Halling | 7196        | 1993-11-24    | 
302
| 2009-3-12T15:55:59.000EST | Lactarius gerardii var. subrubescens (A. H. Sm. & Hesler) Hesler & A. H. Sm. | R. E. Halling | 7313        | 1994-06-20    | 
303
| 2009-3-12T15:55:59.000EST | Phaeocollybia oligoporpa Singer                                              | R. E. Halling | 7423        | 1995-06-16    | 
304
| 2009-3-12T15:55:47.000EST | Lentinellus ursinus (Fr.) K?hner                                             | R. E. Halling | 7188        | 1993-11-23    | 
305
| 2009-3-12T15:55:40.000EST | Xeromphalina kauffmanii A. H. Sm.                                            | R. E. Halling | 7186        | 1993-11-23    | 
306
+---------------------------+------------------------------------------------------------------------------+---------------+-------------+---------------+
307
12 rows in set (0.53 sec)
308

    
309
These all look like good specimens. In light of this, I would favor using
310
UniqueNYInternalRecordNumber as the identifier instead of CatalogNumber.
311

    
312
By the way, these examples of anomalous CatalogNumbers (=accession numbers) are valuable 
313
information which could be returned to the data provider to help them clean up their collection
314
and database.
315

    
316
9. Check that Latitude and Longitude are reasonable values
317
- Strictly speak, the DwC Geospatial extension requires that these fields be labeled DecimalLatitude
318
and DecimalLongitude. 
319
- Check that they are within the range of reasonable decimal values
320

    
321
SELECT 
322
(
323
SELECT COUNT(*)
324
FROM nybg_raw
325
WHERE Latitude IS NOT NULL
326
) AS allLats,
327
(
328
SELECT COUNT(*)
329
FROM nybg_raw
330
WHERE Latitude REGEXP '^(-|\\+){0,1}([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+|[0-9]+)$'
331
) AS decimalLats,
332
(
333
SELECT COUNT(*)
334
FROM nybg_raw
335
WHERE Longitude IS NOT NULL
336
) AS allLongs,
337
(
338
SELECT COUNT(*)
339
FROM nybg_raw
340
WHERE Longitude REGEXP '^(-|\\+){0,1}([0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+|[0-9]+)$'
341
) AS decimalLongs;
342

    
343
Pass, they are all numbers. Now let's look at some actual values:
344

    
345
SELECT Latitude, Longitude 
346
FROM nybg_raw
347
WHERE  Latitude IS NOT NULL AND  Longitude IS NOT NULL
348
LIMIT 10;
349
+--------------+--------------+
350
| Latitude     | Longitude    |
351
+--------------+--------------+
352
| -11.80000000 | -71.40000000 | 
353
|  17.40000000 | -88.50000000 | 
354
| -14.40000000 | -39.00000000 | 
355
|  -2.50000000 | -78.10000000 | 
356
|  -8.10000000 | -72.80000000 | 
357
| -11.87000000 | -45.45000000 | 
358
|  -7.60000000 | -35.50000000 | 
359
| -11.90000000 | -39.10000000 | 
360
|   4.70000000 | -75.60000000 | 
361
|  -9.80000000 | -67.60000000 | 
362
+--------------+--------------+
363
10 rows in set (0.00 sec)
364

    
365
Looks good. Now for the acceptable range test. Each should return 0 records:
366

    
367
SELECT 
368
(
369
SELECT COUNT(*)
370
FROM nybg_raw
371
WHERE Latitude>90 OR Latitude<-90
372
) AS badLats,
373
(
374
SELECT COUNT(*)
375
FROM nybg_raw
376
WHERE Longitude>180 OR Longitude<-180
377
) AS badLongs;
378
+---------+----------+
379
| badLats | badLongs |
380
+---------+----------+
381
|       0 |        3 | 
382
+---------+----------+
383
1 row in set (1.16 sec)
384

    
385
Not bad, I usually find more. Those 3 bad longitudes will need to be fixed, but that's 
386
data-cleaning.
387

    
388
10. Check date values valid
389
- there are two sets of dates that matter: collection date, and identification date
390
- the live in separate day, month, year fields
391
- collection date will usually (but not always) have a value, identification often does not
392

    
393
First, just look at some values:
394

    
395
SELECT DayCollected, MonthCollected, YearCollected 
396
FROM nybg_raw
397
WHERE DayCollected IS NOT NULL OR MonthCollected IS NOT NULL OR YearCollected IS NOT NULL
398
LIMIT 10;
399
+--------------+----------------+---------------+
400
| DayCollected | MonthCollected | YearCollected |
401
+--------------+----------------+---------------+
402
|           20 |              8 |          1984 | 
403
|           17 |              1 |          1994 | 
404
|            1 |              7 |          1991 | 
405
|           20 |              1 |          1989 | 
406
|           31 |              5 |          1994 | 
407
|            3 |             11 |          1987 | 
408
|           29 |              1 |          1999 | 
409
|           15 |              1 |          1997 | 
410
|           25 |              5 |          1989 | 
411
|           26 |              1 |          1995 | 
412
+--------------+----------------+---------------+
413
10 rows in set (0.05 sec)
414

    
415
Looks reasonable. Do the same for DayIdentified, MonthIdentified, YearIdentified (not shown).
416

    
417
Now, test for non-integers:
418

    
419
SELECT
420
(
421
SELECT COUNT(*)
422
FROM nybg_raw
423
WHERE NOT(DayCollected REGEXP '^-?[0-9]+$')
424
OR NOT(MonthCollected REGEXP '^-?[0-9]+$')
425
OR NOT(YearCollected REGEXP '^-?[0-9]+$')
426
) AS badDateCollected,
427
(
428
SELECT COUNT(*)
429
FROM nybg_raw
430
WHERE NOT(DayIdentified REGEXP '^-?[0-9]+$')
431
OR NOT(MonthIdentified REGEXP '^-?[0-9]+$')
432
OR NOT(YearIdentified REGEXP '^-?[0-9]+$')
433
) AS badDateIdentified;
434
+------------------+-------------------+
435
| badDateCollected | badDateIdentified |
436
+------------------+-------------------+
437
|                0 |                 0 | 
438
+------------------+-------------------+
439
1 row in set (1.88 sec)
440

    
441
Awesome. Good to go.
442

    
443
Some last check; these should all be done by eye; sorry, I have no idea how to automate this.
444

    
445
11. Check Locality
446
- This should look like a verbal description of some location or vegetation
447

    
448
SELECT Locality
449
FROM nybg_raw
450
WHERE Locality IS NOT NULL
451
LIMIT 10;
452
+--------------------------------------------------------------------------------------------------+
453
| Locality                                                                                         |
454
+--------------------------------------------------------------------------------------------------+
455
| Parque Nacional del Manu. R?o Manu: Cocha Casha Station                                          | 
456
| Belize Zoo, in savanna plot immediately behind zoo. Mile 31 on Western Highway                   | 
457
| 7.3 km N of Serra Grande on rd to Itacar?.  Fazenda Lagoa do Conjunto Fazenda Santa Cruz         | 
458
| Centro Shuar Yukutais, 8km SW of Sucua                                                           | 
459
| Bacia do Alto Juru?, Rio Juru?-mirim, margem esquerda, seringal Lucanha, colocacao Vista Alegre. | 
460
| Area do entorno da Cachoeira do Acaba-vida                                                       | 
461
| Sao Vicente F?rrer, Mata do Estado                                                               | 
462
| 8 km al N de Tanquinho, camino a Ichu                                                            | 
463
| Mun. Pereira, Parque Regional Ucumar?, ca. 22 km ESE of Pereira                                  | 
464
| Reserva Florestal de Humait?. Travess?o da direita.                                              | 
465
+--------------------------------------------------------------------------------------------------+
466
10 rows in set (0.00 sec)
467

    
468
Looks reasonable;
469

    
470
14. Check PlantFungusDescription
471
- Should contain a free-text description of the plant in the specimen
472
- Can be null
473
- Just make sure values make sense if not null
474

    
475
SELECT PlantFungusDescription
476
FROM nybg_raw
477
WHERE PlantFungusDescription IS NOT NULL
478
LIMIT 10;
479

    
480
+------------------------------------------------------------------------------------------------------------------------------------------+
481
| PlantFungusDescription                                                                                                                   |
482
+------------------------------------------------------------------------------------------------------------------------------------------+
483
| with rough brown bark, almost no buttressing, 1 m diam; fallen fruit and leaves                                                          | 
484
| Tree height: 12.0 m, bole, 7.5 m                                                                                                         | 
485
| Shrub to 2.5m, cauliflorous                                                                                                              | 
486
| Arvore 8 m; infloresc?ncia pendente, raque vin?cea e c?lice persistente; fruto globoso verde com listas vin?ceas casca interna vermelha. | 
487
| Arbusto de ca. 1,6 m alt., com folhas semisuculentas, discolores.  Botoes branco-esverdeados                                             | 
488
| Arvore ca. de 7,0 m de altura.  Bot?es e p?tals r?seos. Estames amarelos                                                                 | 
489
| Postrada, hojas crasas, flores rosadas, ovario s?pero                                                                                    | 
490
| Arbusto com 3 m; frutos imaturos verdes com manchas brancas, globosos.                                                                   | 
491
| Liana, frutos imaturos vermelhos.                                                                                                        | 
492
| Arvore, botoes e flores. Corola e ?rgaos reprodutivos amarelos.                                                                          | 
493
+------------------------------------------------------------------------------------------------------------------------------------------+
494
10 rows in set (0.00 sec)
495

    
496
Looks good.
497

    
498
13. Check Habitat, Vegetation
499
- Check that these look like verbal descriptors of habitat or vegatation
500
- In many herbarium DwC dumps, these fields will be null, depending on herbarium
501
policy for what goes in them, or how they record data from labels.
502
- Here you are most checking that, if not null, the fields contain reasonable-looking values.
503

    
504
SELECT Habitat, Vegetation
505
FROM nybg_raw
506
WHERE Habitat IS NOT NULL OR Vegetation IS NOT NULL
507
LIMIT 10;
508
+------------------------------------------------------+-----------------------------------+
509
| Habitat                                              | Vegetation                        |
510
+------------------------------------------------------+-----------------------------------+
511
| Floodplain forest                                    | NULL                              | 
512
| Southern Bahian moist forest                         | NULL                              | 
513
| Floresta de v?rzea.                                  | NULL                              | 
514
| NULL                                                 | Cerrado                           | 
515
| Interior da mata                                     | NULL                              | 
516
| NULL                                                 | Caatinga                          | 
517
| Terra firme.                                         | NULL                              | 
518
| Floresta prim?ria de terra firme                     | Floresta prim?ria de terra firme. | 
519
| mixed hardwoods over exposed sandstone and limestone | NULL                              | 
520
| Floresta prim?ria de terra firme.                    | Floresta Prim?ria de Terra Firme. | 
521
+------------------------------------------------------+-----------------------------------+
522
10 rows in set (0.00 sec)
523

    
524
Both look reasonable.
525

    
526

    
527

    
528

    
529

    
530

    
531

    
532

    
533

    
534

    
    (1-1/1)