Revision 4648
Added by Aaron Marcuse-Kubitza almost 12 years ago
| Old | New | filter_out_ci |
|---|---|---|
| 1 | 1 | `#!/usr/bin/env python` |
| 2 | 2 | `# Finds spreadsheet rows where a column is not in a vocabulary.` |
| | 3 | `# The vocabulary should not have a header. CSVs without a header are supported.` |
| 3 | 4 | `# Case- and punctuation-insensitive.` |
| 4 | 5 | |
| 5 | 6 | `import csv` |
| ... | ... | |
| 18 | 19 | `vocab = set()` |
| 19 | 20 | `stream = open(vocab_path, 'rb')` |
| 20 | 21 | `reader = csv.reader(stream)` |
| 21 | | `reader.next() # skip header` |
| 22 | 22 | `for row in reader: vocab.add(simplify(row[0]))` |
| 23 | 23 | `stream.close()` |
| 24 | 24 | |
| 25 | 25 | `# Filter input` |
| 26 | 26 | `reader = csv.reader(sys.stdin)` |
| 27 | 27 | `writer = csv.writer(sys.stdout)` |
| 28 | | `writer.writerow(reader.next()) # pass through header` |
| 29 | 28 | `for row in reader:` |
| 30 | 29 | `    term = simplify(row[col_num])` |
| 31 | 30 | `    if term not in vocab: writer.writerow(row)` |
filter_out_ci: Filter header instead of passing it through, in order to properly support CSVs without a header, such as the unmapped_terms.csv and new_terms.csv files. For CSVs with a header, the header of the vocabulary should be removed before passing it to filter_out_ci.
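The behavior after this change can be sketched as follows. This is a minimal Python 3 illustration, not the script itself: the original targets Python 2 and reads from files and stdin, whereas this sketch uses in-memory streams. The body of `simplify` is not shown in the diff, so the version here (lowercase, strip non-alphanumerics) is an assumption based on the "case- and punctuation-insensitive" comment. The function name `filter_rows` and the sample data are hypothetical.

```python
import csv
import io
import re


def simplify(term):
    # Assumed normalization (the real simplify() is elided from the diff):
    # lowercase the term and drop every non-alphanumeric character.
    return re.sub(r'[^a-z0-9]', '', term.lower())


def filter_rows(rows, vocab_rows, col_num=0):
    """Yield rows whose value in column col_num is not in the vocabulary.

    No row is treated as a header: the first vocabulary row is a term,
    and the first input row is filtered like every other row.
    """
    vocab = {simplify(row[0]) for row in vocab_rows if row}
    for row in rows:
        if simplify(row[col_num]) not in vocab:
            yield row


# Headerless vocabulary, as the revised script expects.
vocab_csv = io.StringIO("Quercus alba\nAcer rubrum\n")

# Input with a header row; the header is filtered, not passed through.
input_csv = io.StringIO(
    "name,count\n"
    "quercus-alba,1\n"
    "Pinus strobus,2\n"
    "ACER RUBRUM,3\n"
)

kept = list(filter_rows(csv.reader(input_csv), csv.reader(vocab_csv)))
print(kept)  # [['name', 'count'], ['Pinus strobus', '2']]
```

In this sketch the input header survives only because `name` is not a vocabulary term. If the vocabulary still carried a matching header line, that line would be read as a term and the input header would be filtered out, which is presumably why the vocabulary header should be removed first.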