Diff - BIEN 3 - NCEAS Projects

Revision 9518

Added by Aaron Marcuse-Kubitza over 11 years ago

bin/tnrs_db: removed the no-longer-used $wait flag (which caused tnrs_db to wait up to max_pause for new rows to be added), because tnrs_db is now invoked automatically after each import by the import_scrub target (in inputs/input.Makefile) and does not need to run as a daemon. note that when scrub is invoked, a previous datasource's import may already have scrubbed the names for this import, because tnrs_db runs until all rows in tnrs_input_name are scrubbed.

this also removes clutter in tnrs_db, making it clearer what operations it performs that the library function tnrs.repeated_tnrs_request() does not (namely, interfacing with the DB and profiling the TNRS request).
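
The "runs until all rows are scrubbed" behavior can be sketched as follows. This is a minimal illustration, not the tnrs_db code: a plain list stands in for tnrs_input_name, and scrub_all, scrub_batch and MAX_NAMES are hypothetical names (the real batch limit is tnrs.max_names).

```python
# Hypothetical stand-in for the simplified loop; no DB or TNRS calls are made.
MAX_NAMES = 3  # assumed batch size for illustration only


def scrub_all(input_names, scrub_batch, max_names=MAX_NAMES):
    """Scrub batches until no input rows remain; return how many were scrubbed."""
    scrubbed = 0
    while True:
        # Fetch next set
        batch = input_names[:max_names]
        if not batch:
            break  # nothing left: exit instead of waiting for more rows
        scrub_batch(batch)
        # A scrubbed name no longer counts as pending input
        del input_names[:len(batch)]
        scrubbed += len(batch)
    return scrubbed


queue = ['Poa annua', 'Quercus robur', 'Acer rubrum', 'Pinus taeda']
results = []
first = scrub_all(queue, results.extend)   # scrub after one import drains everything
second = scrub_all(queue, results.extend)  # scrub after the next import finds no rows
print(first, second)
```

Under these assumptions the second call returns immediately, which is the situation the note above describes: an earlier import's scrub may already have handled this import's names.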

inputs/.TNRS/tnrs/tnrs.make
     #!/bin/bash
     # Runs tnrs_db on VegBIEN
-    # Usage: make inputs/.TNRS/tnrs/tnrs-remake [log=] [wait=1]
+    # Usage: make inputs/.TNRS/tnrs/tnrs-remake [log=]

     # Handle being run as a .make script
     exec >&2

bin/tnrs_db
     #!/usr/bin/env python
     # Scrubs the taxonlabels in VegBIEN using TNRS.
-    # Runs continuously until no new rows are added after max_pause.
     import os.path
     import StringIO
     ...
     import strings
     import tnrs
     # Config
-    pause = 2*60*60 # sec; = 2 hr
-    max_pause = 9*60*60 # sec; = 9 hr; must be >= max partition import time (1.5 hr)
-    assert pause <= max_pause
     tnrs_input = sql_gen.Table('tnrs_input_name')
     tnrs_data = sql_gen.Table('tnrs')
     ...
         env_names = []
         db_config = opts.get_env_vars(sql.db_config_names, None, env_names)
         verbosity = float(opts.get_env_var('verbosity', 3, env_names))
-        wait = opts.env_flag('wait', False, env_names)
         if not 'engine' in db_config: raise SystemExit('Usage: '
             +opts.env_usage(env_names)+' '+sys.argv[0]+' 2>>log')
     ...
         tnrs_profiler = profiling.ItersProfiler(iter_text='name')
         # Iterate over unscrubbed verbatim taxonlabels
-        total_pause = 0
         while True:
             # Fetch next set
             cur = sql.select(db, tnrs_input, limit=tnrs.max_names, cacheable=False)
             this_ct = cur.rowcount
             log('Processing '+str(this_ct)+' taxonlabels')
-            if this_ct == 0:
-                if not wait: break
-                log('Waited '+str(total_pause)+' sec total')
-                total_pause += pause
-                if total_pause > max_pause: break
-                log('Waiting '+str(pause)+' sec...')
-                time.sleep(pause) # wait for more rows
-                continue # try again
+            if this_ct == 0: break
             # otherwise, rows found
-            total_pause = 0
             names = list(sql.values(cur))
             def process():
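
For reference, the waiting logic that the $wait flag enabled can be worked through in isolation. The sketch below is an illustration rather than code from tnrs_db: old_wait_budget is a hypothetical helper that replays the pause bookkeeping with the sleep replaced by a counter, assuming no new rows ever arrive.

```python
def old_wait_budget(pause, max_pause):
    """Return (sleeps, seconds slept) before the old loop would give up."""
    assert pause <= max_pause
    total_pause = 0
    sleeps = 0
    while True:
        total_pause += pause
        if total_pause > max_pause:
            break  # the old loop exited here without sleeping again
        sleeps += 1  # stands in for time.sleep(pause)
    return sleeps, sleeps * pause


# Config values from the old script: pause = 2 hr, max_pause = 9 hr
sleeps, waited = old_wait_budget(pause=2*60*60, max_pause=9*60*60)
print(sleeps, waited)  # 4 sleeps, 28800 sec (8 hr)
```

With those values the old daemon mode slept four times (8 hr in total) after the last row was scrubbed before exiting, because the fifth increment would have pushed total_pause past max_pause.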
