csvs.py: stream_info(): Added parse_header option. reader_and_header(): Use stream_info()'s new parse_header option.
csv2ddl: Renamed schema name env var from datasrc to schema to reflect what it is, and to make the script general beyond importing inputs
input.Makefile: Moved Installation, Staging tables after Existing maps discovery because they depend on it. Staging tables: Create a staging table for each table a map spreadsheet is available for. Put double quotes around the schema name so its case is preserved.
Added csv2ddl to make a PostgreSQL CREATE TABLE statement from a CSV header
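A minimal sketch of what a CSV-header-to-DDL step could look like, in modern Python. This is not the actual csv2ddl script: the function name csv_header_to_ddl(), its parameters, and the choice of a text type for every column are assumptions made for illustration. Identifiers are double-quoted so that their case is preserved.

```python
import csv
import io


def csv_header_to_ddl(stream, table, schema=None):
    """Build a CREATE TABLE statement with one text column per CSV header field.

    Hypothetical helper: the real csv2ddl may choose names and types differently.
    """
    header = next(csv.reader(stream))  # only the first row is needed

    def quote(name):
        # Double any embedded double quotes, then wrap the name in double
        # quotes so PostgreSQL preserves its case.
        return '"' + name.replace('"', '""') + '"'

    qualified = quote(table) if schema is None else quote(schema) + '.' + quote(table)
    cols = ',\n'.join('    ' + quote(col) + ' text' for col in header)
    return 'CREATE TABLE ' + qualified + ' (\n' + cols + '\n);'


ddl = csv_header_to_ddl(
    io.StringIO('plotID,Species Name\n1,Quercus\n'), 'plots', schema='CTFS')
print(ddl)
```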
sql.py: Input validation: Moved section after Database connections because some of its functions require a connection. Added esc_name_by_module() and esc_name_by_engine(), and use esc_name_by_module() in esc_name().
input.Makefile: Installation: Create a schema for the datasource in VegBIEN as part of the installation process. This will be used to hold staging tables.
input.Makefile: Changed install, uninstall to depend on src/install, src/uninstall targets, which in turn depend on db, rm_db. This will allow us to add additional install actions for all input types.
sql.py: DbConn: Cache the constructed CacheCursor itself, rather than the dict that's used to create it
sql.py: pkey(): Changed to use the connection-wide caching mechanism rather than its own custom cache. DbConn.__getstate__(): Don't pickle the debug callback.
sql.py: DbConn: Added is_cached(). run_query(): Use new DbConn.is_cached() to avoid creating a savepoint if the query is cached.
sql.py: DbConn: Also cache cursor.description
sql.py: DbConn: Cache query results as a dict subset of the cursor's key attributes, so that additional attributes can easily be cached by adding them to the subset list
dicts.py: Added AttrsDictView
util.py: NamedTuple.__iter__(): Removed unnecessary **attrs param
sql.py: _query_lookup(): Fixed bug where params was cast to a tuple, even though it could also be a dict. index_cols(): Changed to use the connection-wide caching mechanism rather than its own custom cache.
util.py: NamedTuple: Made it usable as a hashable dict (with string keys) by adding iter() and getitem()
dicts.py: Added make_hashable()
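A sketch of why a make_hashable() helper is useful for query caching: query params may be a list, a tuple, or a dict, and a dict cannot be part of a dictionary key directly. The implementation below and the cache_key() helper are assumptions, not the code in dicts.py or sql.py.

```python
def make_hashable(value):
    """Return a hashable equivalent of value (assumed behavior, for illustration)."""
    if isinstance(value, dict):
        # Order-independent and hashable as long as the values are
        return frozenset((k, make_hashable(v)) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable(v) for v in value)
    return value


def cache_key(query, params=None):
    # Key on the raw query plus its params rather than on a driver-specific
    # rendered query string.
    return (query, make_hashable(params))


cache = {}
query = 'SELECT * FROM plot WHERE id = %(id)s'
cache[cache_key(query, {'id': 5})] = [(5, 'BCI')]
hit = cache.get(cache_key(query, {'id': 5}))
miss = cache.get(cache_key(query, {'id': 6}))
```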
sql.py: DbConn: Only cache exceptions for inserts since they are not idempotent, but an invalid insert will always be invalid. If a cached result is an exception, re-raise it in a separate method other than the constructor to ensure that the cursor object is still created, and that its query instance var is set.
sql.py: insert(): Cache insert queries by default. This works because any DuplicateKeyException, etc. would be cached as well. This saves many inserts for rows that we already know are in the database.
sql.py: DbConn.run_query(): Cache exceptions raised by queries as well
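The exception caching described in the entries above can be sketched roughly as follows. CachingRunner, DuplicateKeyError, and fake_insert() are hypothetical stand-ins, not the DbConn API; the point is only that a failed insert is remembered and re-raised without running the query a second time.

```python
class DuplicateKeyError(Exception):
    """Stand-in for a driver's duplicate-key error (hypothetical name)."""


class CachingRunner:
    def __init__(self, execute):
        self._execute = execute  # callable(query, params) that runs the query
        self._cache = {}         # (query, params) -> ('ok' | 'error', payload)
        self.executions = 0      # how many times the query really ran

    def run(self, query, params=()):
        key = (query, tuple(params))
        if key not in self._cache:
            self.executions += 1
            try:
                self._cache[key] = ('ok', self._execute(query, params))
            except Exception as e:
                self._cache[key] = ('error', e)  # remember the failure too
        status, payload = self._cache[key]
        if status == 'error':
            raise payload  # re-raise the cached exception
        return payload


existing_ids = {1}  # pretend row 1 is already in the database


def fake_insert(query, params):
    if params[0] in existing_ids:
        raise DuplicateKeyError(params[0])
    existing_ids.add(params[0])
    return 1  # rows affected


runner = CachingRunner(fake_insert)
outcomes = []
for _ in range(2):
    try:
        runner.run('INSERT INTO plot (id) VALUES (%s)', (1,))
        outcomes.append('inserted')
    except DuplicateKeyError:
        outcomes.append('duplicate')
```

Both attempts report a duplicate, but only the first one reaches fake_insert().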
sql.py: DbConn.run_query(): When debug logging, label queries with their cache status (hit/miss/non-cacheable)
sql.py: DbConn.run_query(): Also debug-log queries that produce exceptions
sql.py: DbConn: Allow creator to provide a log function to call on debug messages, instead of using stderr directly
bin/map: Pass debug mode to DbConn so that SQL query debugging works again
sql.py: DbConn: DbCursor: Fixed bug where caching was always turned on, by passing the cacheable setting to it from run_query(). Turned caching back on (uncommented it) since it's now working.
bin/map: map_rows()/map_table(): Pass kw_args to process_rows() so rows_start can be specified when using them. DB inputs: Skip the pre-start rows in the SQL query itself, so that they don't need to be iterated over by the cursor in the main loop.
bin/map: Fixed bug introduced in r1718 where the row # would not be incremented if i < start, causing a semi-infinite loop that only ended when the input rows were exhausted. process_rows(): Added optional rows_start parameter to use if the input rows already have the pre-start rows skipped.
input.Makefile: Sources: cat: Changed Usage message to use "--silent" make option
input.Makefile: Sources: cat: Added Usage message with instructions for removing echoed make commands
run_*query(): Fixed bug where INSERTs, etc. were cached by making callers (such as select()) explicitly turn on caching. DbConn.run_query(): Fixed bug where cur.mogrify() was not supported under MySQL by making the cache key a tuple of the unmogrified query and its params instead of the mogrified string query. CacheCursor: Store attributes of the original cursor that we use, such as query and rowcount.
sql.py: Made row() and value() cache the result by fetching all rows before returning the first row
iters.py: Added func_iter() and consume_iter()
sql.py: Cache the results of queries (when all rows are read)
Proxy.py: Fixed infinite recursion bug by removing setattr() (which prevents the class and subclasses from storing instance variables using "self." syntax)
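A sketch of a forwarding proxy that only overrides __getattr__, assuming that is roughly what Proxy.py looks like after the fix (the attribute name inner is hypothetical). One way such recursion can arise: if __setattr__ also forwarded to self.inner, then the assignment in __init__ would itself go through __setattr__, which would look up self.inner before it exists and fall into __getattr__ again.

```python
class Proxy:
    """Forward attribute reads to a wrapped object (illustrative sketch)."""

    def __init__(self, inner):
        # A plain instance variable; this works because __setattr__ is not overridden.
        self.inner = inner

    def __getattr__(self, name):
        # Called only when normal lookup fails. Guard against looking up
        # 'inner' itself before __init__ has set it.
        if name == 'inner':
            raise AttributeError(name)
        return getattr(self.inner, name)


p = Proxy([1, 2])
p.append(3)        # forwarded to the wrapped list
p.label = 'rows'   # stored on the proxy itself
```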
sql.py: DbConn: Added run_query(). run_raw_query(): Use new DbConn.run_query().
Added Proxy.py
parallel.py: MultiProducerPool: Added code to create a shared Namespace object, commented out. Updated share() doc comment to reflect that it will writably share the values as well.
bin/map: Share locals() with the pool at various times to try to get as many unpicklable values into the shared vars as possible
dicts.py: Turned id_dict() factory function into IdDict class. parallel.py: MultiProducerPool: Added share_vars(). main_loop(): Only consider the program to be done if the queue is empty and there are no running tasks.
collection.py: rmap(): Treat only built-in sequences specially instead of iterables. Pass whether the value is a leaf to the func. Added option to only recurse up to a certain # of levels.
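A sketch of a recursive map with the three behaviors named above: only built-in sequences are recursed into, the function is told whether it received a leaf, and recursion can be capped. The signature, the levels parameter name, and the decision to treat a sequence at the depth limit as a leaf are assumptions, not the actual collection.py code.

```python
def rmap(func, value, levels=None):
    """Recursively map func over nested lists and tuples.

    func is called as func(value, is_leaf). levels=None means no depth limit.
    """
    is_seq = isinstance(value, (list, tuple))  # strings, dicts, etc. are leaves
    if is_seq and (levels is None or levels > 0):
        next_levels = None if levels is None else levels - 1
        mapped = type(value)(rmap(func, v, next_levels) for v in value)
        return func(mapped, False)
    return func(value, True)


def stringify_leaves(value, is_leaf):
    return str(value) if is_leaf else value


nested = [1, (2, [3])]
full = rmap(stringify_leaves, nested)
shallow = rmap(stringify_leaves, nested, levels=1)
```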
Added lists.py
collection.py: rmap(): Fixed bugs: Made it recursive. Use iters.is_iterable() instead of isinstance(value, list) to work on all iterables. Use value and not nonexistent var list_.
iters.py: Added is_iterable()
parallel.py: prepickle(): Pickle all objects in vars_id_dict_ by ID, not just unpicklable ones. This ensures that a DB connection created in the main process will be shared with subprocesses by reference (id()) instead of by value, so that each process can take advantage of e.g. shared caches in the connection object. Note that this may require some synchronization.
parallel.py: MultiProducerPool.main_loop(): Got rid of no longer correct doc comment
bin/map: Share on_error with the pool
parallel.py: MultiProducerPool: Pickle objects by ID if they're accessible to the main_loop process. This should allow e.g. DB connections and pools to be pickled, if they were defined in the main process.
Added dicts.py with id_dict() and MergeDict
Added collection.py with rmap()
db_xml.py: put(): Moved pool.apply_async() from put_child() to put_(), and don't use lambdas because they can't be pickled
parallel.py: MultiProducerPool.apply_async(): Prepickle all function args. Try pickling the args before the queue pickles them, to get better debugging output.
sql.py: with_savepoint(): Use new rand.rand_int()
rand.py: rand_int(): Fixed bug where newly-created objects did not have unique IDs because they were on the stack. So, we have to use random.randint() anyway.
Added rand.py
sql.py: DbConn: Made it picklable by establishing a connection on demand
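A sketch of the connect-on-demand idea, using sqlite3 as a stand-in for the real database driver. LazyConn and its attribute names are hypothetical; the actual DbConn may differ. The live connection is never pickled, and an unpickled copy reconnects the first time it is used.

```python
import pickle
import sqlite3


class LazyConn:
    def __init__(self, db_path):
        self.db_path = db_path  # picklable connection settings
        self._conn = None       # opened lazily

    @property
    def conn(self):
        if self._conn is None:
            self._conn = sqlite3.connect(self.db_path)
        return self._conn

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_conn'] = None   # drop the unpicklable live connection
        return state


db = LazyConn(':memory:')
db.conn.execute('SELECT 1')               # forces a connection in this process
clone = pickle.loads(pickle.dumps(db))    # works even though db is connected
clone_starts_disconnected = clone._conn is None
clone_row = clone.conn.execute('SELECT 1').fetchone()  # reconnects on demand
```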
bin/map: Also consume asynchronous tasks before closing the DB connection (this is where most if not all tasks will be consumed)
Runnable.py: Made it picklable
Added eval_.py
Added Runnable
db_xml.py: put(): Added parallel processing support for inserting children with fkeys to parent asynchronously
parallel.py: Fixed bugs: Added self param to instance methods and inner classes where needed
parallel.py: Changed to use multi-producer pool, which requires calling pool.main_loop()
parallel.py: Pool: Added doc comment
parallel.py: Pool: apply_async(): Return a result object like multiprocessing.Pool.apply_async()
bin/map: Use new parallel.py for parallel processing
Added parallel.py for parallel processing
bin/map: Use dummy synchronous Pool implementation if not using parallel processing
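A sketch of a synchronous stand-in pool, assuming it only needs to mimic the apply_async() and result-object parts of the multiprocessing.Pool interface. SyncPool and SyncResult are hypothetical names; the task simply runs in the caller's process and the outcome is wrapped so that calling code does not need to know whether parallelism is on.

```python
class SyncResult:
    """Mimics the multiprocessing result object for an already-finished task."""

    def __init__(self, value=None, error=None):
        self._value = value
        self._error = error

    def get(self, timeout=None):
        if self._error is not None:
            raise self._error  # surface the task's exception, like a real pool
        return self._value

    def wait(self, timeout=None):
        pass  # nothing to wait for: the task has already run

    def ready(self):
        return True

    def successful(self):
        return self._error is None


class SyncPool:
    def apply_async(self, func, args=(), kwds=None, callback=None):
        try:
            value = func(*args, **(kwds or {}))
        except Exception as e:
            return SyncResult(error=e)
        if callback is not None:
            callback(value)
        return SyncResult(value=value)


pool = SyncPool()
ok = pool.apply_async(pow, (2, 10))
bad = pool.apply_async(int, ('not a number',))
```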
bin/map: Use multiprocessing instead of pp for parallel processing because it's easier to use (it uses the Python threading API and doesn't require providing all the functions a task calls). Allow the user to set the cpus option to use all system CPUs (needed because in test mode, the default is 0 CPUs to turn off parallel processing).
disown_all, stop_imports: Use /bin/bash instead of /bin/sh because array subscripting is used
input.Makefile: Editing import: Use $(datasrc) instead of $(db) since $(db) is only set for DB-source inputs
input.Makefile: Import: If profile is on and test mode is on, output formatted profile stats to stdout
sql.py: index_cols(): Cache return values in db.index_cols
bin/map: Don't import pp unless cpus != 0 because it's slow and doesn't need to happen if we're not using parallelization. cpus option defaults to 0 in test mode so tests run faster.
sql.py: pkey(): Use pkeys cache from db object instead of parameter
sql.py: Wrapped db connection inside an object that can also store the cache of the pkeys and index_cols
bin/map: If cpus is 0, run without Parallel Python
bin/map: Set up Parallel Python with an env-var-customizable # CPUs
root Makefile: python-Linux: Added `sudo pip install pp`
root Makefile: python-Linux: Added python-parallel to installs
mappings: Build VegX-VegBIEN.organisms.csv from VegX-VegBIEN.stems.csv instead of vice versa. This entails switching the roots around so stem points to organism instead of the other way around, which is a complex operation. Re-rooted VegX-VegBIEN.organisms.csv at /plantobservation instead of /taxonoccurrence to avoid traveling up the hierarchy to taxonoccurrence and back down again to plantobservation, etc. as would otherwise have been the case.
bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it
xpath.py: get(): forward (parent-to-child) pointers: If last target object exists but doesn't have an ID attr (which indicates a bug), recover gracefully by just assuming the ID is 0. (Any bug will be noticeable in the output, which needs to be generated through workarounds like this in order to be able to debug.)
VegX mappings: Updated stemParent mapping for VegX 1.5.3
VegX mappings: Changed taxonDetermination of role identifier to instead have explicitly no role, because data providers' VegX files generally do not provide role information and we don't want the default taxonDetermination XPaths to require this
inputs/CTFS/maps/VegX.organisms.csv: Connected plot to plotObservation by using new support for backward (child-to-parent) pointers whose target is a text element containing an ID
xml_dom.py: get_id(): If the node doesn't have an ID, assume the node itself is the ID. This enables backward (child-to-parent) pointers whose target is a text element containing an ID, rather than a regular element with an ID attribute.
VegX mappings: Map locationevent.sourceaccessioncode to plotUniqueIdentifier since this field is no longer being used by authorlocationcode
VegX mappings: Map the authorlocationcode to plotName instead of plotUniqueIdentifier because it's a better fit
inputs/CTFS/maps/VegX.organisms.csv: Fixed bug in Species taxonConcept mapping where the role was computer instead of identifier
xml_dom.py: value(): Skip comment nodes. This fixes a bug where comments inside text elements would prevent the value from being retrieved.
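A sketch of the comment-skipping behavior using xml.dom.minidom. The value() function below is an assumed simplification (it returns the first text child and ignores comments), not the actual xml_dom.py implementation.

```python
from xml.dom import Node, minidom


def value(node):
    """Return the text of the first text child, skipping comment nodes."""
    for child in node.childNodes:
        if child.nodeType == Node.COMMENT_NODE:
            continue  # a leading comment must not hide the text after it
        if child.nodeType == Node.TEXT_NODE:
            return child.data
        return None  # first real child is an element, so there is no text value
    return None


doc = minidom.parseString('<plotName><!-- verified -->BCI</plotName>')
plot_name = value(doc.documentElement)
```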
inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>
inputs/CTFS/maps/VegX.organisms.csv: Added taxonConcept mappings
mappings/VegX-VegBIEN.organisms.csv: Added species taxonConcept mapping for identifier role
Added expand_xpath to expand XPath abbreviations
VegX mappings: Renamed taxonNameUsageConceptsID to taxonNameUsageConceptID (no plural) to match VegX 1.5.3
inputs/CTFS/maps/VegX.organisms.csv: Corrected CensusNumber input mapping
mappings/Makefile: Generate self maps for all core maps