e-mail from Brad Boyle on 2013-2-28:
-----
[Re filtering non-plants from the GBIF extract:]
    
I suggest using a white list / black list approach to filter by institution rather than filtering individual records. I know of no cases of botanical specimens mixed with zoological specimens in the same collection. To implement this approach, you will need information on source institutions, the number of observations from each institution ("totalObs"), and the number of observations satisfying the following condition:
    
WHERE family LIKE '%aceae%' OR family IN ('Compositae','Gramineae','Palmae','Guttiferae','Cruciferae','Labiatae','Umbelliferae','Leguminosae')
    
(let's call this result "plantObs"). You will also need the list of herbarium acronyms from Index Herbariorum. I'll leave it up to you to decide whether to do this after importing to a database, or before.
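
As a rough illustration of the two counts, here is a minimal Python sketch that loads records into an in-memory SQLite database and computes totalObs and plantObs per institution. The table name `gbif_raw`, the function name `institution_counts`, and the use of SQLite are assumptions made for the example only; the real extract may be laid out differently. Records with a NULL family count toward totalObs but not plantObs.

```python
import sqlite3

# Alternative family names that do not end in "-aceae" (list taken from the e-mail).
ALT_FAMILIES = ("Compositae", "Gramineae", "Palmae", "Guttiferae",
                "Cruciferae", "Labiatae", "Umbelliferae", "Leguminosae")


def institution_counts(rows):
    """Return {institutionCode: (totalObs, plantObs)}.

    `rows` is an iterable of (institutionCode, family) pairs.
    The table name gbif_raw is hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gbif_raw (institutionCode TEXT, family TEXT)")
    conn.executemany("INSERT INTO gbif_raw VALUES (?, ?)", rows)

    placeholders = ", ".join("?" for _ in ALT_FAMILIES)
    query = f"""
        SELECT institutionCode,
               COUNT(*) AS totalObs,
               -- A NULL family makes the condition NULL, so it falls to ELSE 0.
               SUM(CASE WHEN family LIKE '%aceae%'
                          OR family IN ({placeholders})
                        THEN 1 ELSE 0 END) AS plantObs
        FROM gbif_raw
        GROUP BY institutionCode
    """
    result = {code: (total, plants)
              for code, total, plants in conn.execute(query, ALT_FAMILIES)}
    conn.close()
    return result
```

The same SELECT could be run directly in the target database after import; the Python wrapper is only there to make the sketch self-contained.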
    
1. Keep all records from institutions with an acronym that joins to Index Herbariorum (white list)
2. Of the remainder, keep all records from any institution where plantObs / totalObs > .80 (white list)
3. Of the remainder, reject all records from any institution where plantObs / totalObs < .20 (black list)
4. I would suggest also rejecting all records from any institution where .20 <= plantObs / totalObs <= .80 (black list), but if any institutions fall into this category, please send me a list of these institutions, along with the counts of plantObs and totalObs for each one.
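
The four steps can be sketched as a single decision function. This is only an illustration under stated assumptions: the function name `classify_institution` and the return labels are made up for the example, step 4 is read as covering the middle band (.20 <= plantObs / totalObs <= .80), and `herbarium_acronyms` stands for whatever set of acronyms is loaded from Index Herbariorum.

```python
def classify_institution(code, total_obs, plant_obs, herbarium_acronyms):
    """Apply steps 1-4 to one institution and return a decision label.

    The labels ("keep", "reject", "reject_and_report") are hypothetical.
    """
    if total_obs <= 0:
        raise ValueError("total_obs must be positive")

    # Step 1: the acronym joins to Index Herbariorum (white list).
    if code in herbarium_acronyms:
        return "keep"

    ratio = plant_obs / total_obs

    # Step 2: overwhelmingly plant records (white list).
    if ratio > 0.80:
        return "keep"

    # Step 3: overwhelmingly non-plant records (black list).
    if ratio < 0.20:
        return "reject"

    # Step 4: ambiguous middle band; reject, but list the institution
    # with its plantObs and totalObs counts for manual review.
    return "reject_and_report"
```

For example, `classify_institution("XYZ", 100, 50, {"MO", "NY"})` falls through to step 4, while any code found in the acronym set is kept regardless of its ratio.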
    
My guess is that the distribution will be basically bimodal, with the majority of institutions falling into or very close to plantObs / totalObs = 1 or plantObs / totalObs = 0. I'll be interested (and surprised) if any institutions fall in the middle.
    
The only hitch I can imagine would be if a herbarium data provider doesn't use its acronym in institutionCode. But steps 2-4 should mop up any herbaria not caught by step 1. NULL family would obviously be a problem, but I'd be surprised if any institutions do not include family in their databases.
-----