/ - Changes - BIEN 3 - NCEAS Projects

root @ 1927

#	Date	Author	Comment
1927	04/21/2012 01:22 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: For non-cacheable queries, use a plain cursor() instead of a DbCursor to avoid the overhead of saving the result and wrapping the cursor
1926	04/20/2012 05:20 PM	Aaron Marcuse-Kubitza	Moved db_config_names from bin/map to sql.py so it can be used by other scripts as well
1925	04/20/2012 04:52 PM	Aaron Marcuse-Kubitza	csv2ddl: Also print a COPY FROM statement
1924	04/20/2012 04:47 PM	Aaron Marcuse-Kubitza	input.Makefile: Fixed bug where input type was considered to be different things if both $(inputFiles) and $(dbExport) are non-empty. Now, $(inputFiles) takes precedence so that the presence of any input files will cause a DB dump to be ignored. This ensures that a (slower) input DB is not used over a (faster) flat file.
1923	04/20/2012 04:21 PM	Aaron Marcuse-Kubitza	csvs.py: stream_info(): Added parse_header option. reader_and_header(): Use stream_info()'s new parse_header option.
1922	04/20/2012 03:53 PM	Aaron Marcuse-Kubitza	csv2ddl: Renamed schema name env var from datasrc to schema to reflect what it is, and to make the script general beyond importing inputs
1921	04/20/2012 03:32 PM	Aaron Marcuse-Kubitza	input.Makefile: Moved Installation, Staging tables after Existing maps discovery because they depend on it. Staging tables: Create a staging table for each table a map spreadsheet is available for. Put double quotes around the schema name so its case is preserved.
1920	04/20/2012 03:29 PM	Aaron Marcuse-Kubitza	Added csv2ddl to make a PostgreSQL CREATE TABLE statement from a CSV header
1919	04/20/2012 03:28 PM	Aaron Marcuse-Kubitza	sql.py: Input validation: Moved section after Database connections because some of its functions require a connection. Added esc_name_by_module() and esc_name_by_engine(), and use esc_name_by_module() in esc_name().
1918	04/20/2012 02:18 PM	Aaron Marcuse-Kubitza	input.Makefile: Installation: Create a schema for the datasource in VegBIEN as part of the installation process. This will be used to hold staging tables.
1917	04/20/2012 01:57 PM	Aaron Marcuse-Kubitza	input.Makefile: Changed install, uninstall to depend on src/install, src/uninstall targets, which in turn depend on db, rm_db. This will allow us to add additional install actions for all input types.
1916	04/19/2012 07:17 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Cache the constructed CacheCursor itself, rather than the dict that's used to create it
1915	04/19/2012 07:06 PM	Aaron Marcuse-Kubitza	sql.py: pkey(): Changed to use the connection-wide caching mechanism rather than its own custom cache. DbConn.__getstate__(): Don't pickle the debug callback.
1914	04/19/2012 07:00 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added is_cached(). run_query(): Use new DbConn.is_cached() to avoid creating a savepoint if the query is cached.
1913	04/19/2012 06:52 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Also cache cursor.description
1912	04/19/2012 06:50 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Cache query results as a dict subset of the cursor's key attributes, so that additional attributes can easily be cached by adding them to the subset list
1911	04/19/2012 06:48 PM	Aaron Marcuse-Kubitza	dicts.py: Added AttrsDictView
1910	04/19/2012 06:47 PM	Aaron Marcuse-Kubitza	util.py: NamedTuple.__iter__(): Removed unnecessary **attrs param
1909	04/19/2012 06:30 PM	Aaron Marcuse-Kubitza	sql.py: _query_lookup(): Fixed bug where params was cast to a tuple, even though it could also be a dict. index_cols(): Changed to use the connection-wide caching mechanism rather than its own custom cache.
1908	04/19/2012 06:28 PM	Aaron Marcuse-Kubitza	util.py: NamedTuple: Made it usable as a hashable dict (with string keys) by adding __iter__() and __getitem__()
1907	04/19/2012 06:27 PM	Aaron Marcuse-Kubitza	dicts.py: Added make_hashable()
1906	04/17/2012 09:59 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Only cache exceptions for inserts since they are not idempotent, but an invalid insert will always be invalid. If a cached result is an exception, re-raise it in a separate method other than the constructor to ensure that the cursor object is still created, and that its query instance var is set.
1905	04/17/2012 09:11 PM	Aaron Marcuse-Kubitza	sql.py: insert(): Cache insert queries by default. This works because any DuplicateKeyException, etc. would be cached as well. This saves many inserts for rows that we already know are in the database.
1904	04/17/2012 09:06 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Cache exceptions raised by queries as well
1903	04/17/2012 08:48 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): When debug logging, label queries with their cache status (hit/miss/non-cacheable)
1902	04/17/2012 08:25 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Also debug-log queries that produce exceptions
1901	04/17/2012 08:18 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Allow creator to provide a log function to call on debug messages, instead of using stderr directly
1900	04/17/2012 08:01 PM	Aaron Marcuse-Kubitza	bin/map: Pass debug mode to DbConn so that SQL query debugging works again
1899	04/17/2012 07:49 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: DbCursor: Fixed bug where caching was always turned on, by passing the cacheable setting to it from run_query(). Turned caching back on (uncommented it) since it's now working.
1898	04/17/2012 07:21 PM	Aaron Marcuse-Kubitza	bin/map: map_rows()/map_table(): Pass kw_args to process_rows() so rows_start can be specified when using them. DB inputs: Skip the pre-start rows in the SQL query itself, so that they don't need to be iterated over by the cursor in the main loop.
1897	04/17/2012 07:07 PM	Aaron Marcuse-Kubitza	bin/map: Fixed bug introduced in r1718 where the row # would not be incremented if i < start, causing a semi-infinite loop that only ended when the input rows were exhausted. process_rows(): Added optional rows_start parameter to use if the input rows already have the pre-start rows skipped.
1896	04/17/2012 05:49 PM	Aaron Marcuse-Kubitza	input.Makefile: Sources: cat: Changed Usage message to use "--silent" make option
1895	04/17/2012 05:45 PM	Aaron Marcuse-Kubitza	input.Makefile: Sources: cat: Added Usage message with instructions for removing echoed make commands
1894	04/17/2012 05:17 PM	Aaron Marcuse-Kubitza	run_*query(): Fixed bug where INSERTs, etc. were cached: callers (such as select()) must now explicitly turn on caching. DbConn.run_query(): Fixed bug where cur.mogrify() was not supported under MySQL by making the cache key a tuple of the unmogrified query and its params instead of the mogrified string query. CacheCursor: Store attributes of the original cursor that we use, such as query and rowcount.
1893	04/17/2012 04:38 PM	Aaron Marcuse-Kubitza	sql.py: Made row() and value() cache the result by fetching all rows before returning the first row
1892	04/17/2012 04:37 PM	Aaron Marcuse-Kubitza	iters.py: Added func_iter() and consume_iter()
1891	04/17/2012 04:11 PM	Aaron Marcuse-Kubitza	sql.py: Cache the results of queries (when all rows are read)
1890	04/17/2012 03:48 PM	Aaron Marcuse-Kubitza	Proxy.py: Fixed infinite recursion bug by removing __setattr__() (which prevents the class and subclasses from storing instance variables using "self." syntax)
1889	04/16/2012 10:19 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added run_query(). run_raw_query(): Use new DbConn.run_query().
1888	04/16/2012 10:18 PM	Aaron Marcuse-Kubitza	Added Proxy.py
1887	04/16/2012 09:32 PM	Aaron Marcuse-Kubitza	parallel.py: MultiProducerPool: Added code to create a shared Namespace object, commented out. Updated share() doc comment to reflect that it will writably share the values as well.
1886	04/16/2012 08:49 PM	Aaron Marcuse-Kubitza	bin/map: Share locals() with the pool at various times to try to get as many unpicklable values into the shared vars as possible
1885	04/16/2012 08:45 PM	Aaron Marcuse-Kubitza	dicts.py: Turned id_dict() factory function into IdDict class. parallel.py: MultiProducerPool: Added share_vars(). main_loop(): Only consider the program to be done if the queue is empty and there are no running tasks.
1884	04/16/2012 08:00 PM	Aaron Marcuse-Kubitza	collection.py: rmap(): Treat only built-in sequences specially instead of iterables. Pass whether the value is a leaf to the func. Added option to only recurse up to a certain # of levels.
1883	04/16/2012 07:10 PM	Aaron Marcuse-Kubitza	Added lists.py
1882	04/16/2012 04:40 PM	Aaron Marcuse-Kubitza	collection.py: rmap(): Fixed bugs: Made it recursive. Use iters.is_iterable() instead of isinstance(value, list) to work on all iterables. Use value and not nonexistent var list_.
1881	04/16/2012 04:38 PM	Aaron Marcuse-Kubitza	iters.py: Added is_iterable()
1880	04/16/2012 04:11 PM	Aaron Marcuse-Kubitza	parallel.py: prepickle(): Pickle all objects in vars_id_dict_ by ID, not just unpicklable ones. This ensures that a DB connection created in the main process will be shared with subprocesses by reference (id()) instead of by value, so that each process can take advantage of e.g. shared caches in the connection object. Note that this may require some synchronization.
1879	04/16/2012 04:06 PM	Aaron Marcuse-Kubitza	parallel.py: MultiProducerPool.main_loop(): Got rid of no longer correct doc comment
1878	04/16/2012 04:05 PM	Aaron Marcuse-Kubitza	bin/map: Share on_error with the pool
1877	04/16/2012 04:05 PM	Aaron Marcuse-Kubitza	parallel.py: MultiProducerPool: Pickle objects by ID if they're accessible to the main_loop process. This should allow e.g. DB connections and pools to be pickled, if they were defined in the main process.
1876	04/14/2012 09:31 PM	Aaron Marcuse-Kubitza	Added dicts.py with id_dict() and MergeDict
1875	04/14/2012 09:30 PM	Aaron Marcuse-Kubitza	Added collection.py with rmap()
1874	04/14/2012 07:38 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): Moved pool.apply_async() from put_child() to put_(), and don't use lambdas because they can't be pickled
1873	04/14/2012 07:35 PM	Aaron Marcuse-Kubitza	parallel.py: MultiProducerPool.apply_async(): Prepickle all function args. Try pickling the args before the queue pickles them, to get better debugging output.
1872	04/14/2012 07:33 PM	Aaron Marcuse-Kubitza	sql.py: with_savepoint(): Use new rand.rand_int()
1871	04/14/2012 07:33 PM	Aaron Marcuse-Kubitza	rand.py: rand_int(): Fixed bug where newly-created objects did not have unique IDs because they were on the stack. So, we have to use random.randint() anyway.
1870	04/14/2012 07:27 PM	Aaron Marcuse-Kubitza	Added rand.py
1869	04/14/2012 06:56 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Made it picklable by establishing a connection on demand
1868	04/14/2012 06:54 PM	Aaron Marcuse-Kubitza	bin/map: Also consume asynchronous tasks before closing the DB connection (this is where most if not all tasks will be consumed)
1867	04/14/2012 06:44 PM	Aaron Marcuse-Kubitza	Runnable.py: Made it picklable
1866	04/14/2012 06:44 PM	Aaron Marcuse-Kubitza	Added eval_.py
1865	04/14/2012 05:35 PM	Aaron Marcuse-Kubitza	Added Runnable
1864	04/14/2012 03:05 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): Added parallel processing support for inserting children with fkeys to parent asynchronously
1863	04/14/2012 03:03 PM	Aaron Marcuse-Kubitza	parallel.py: Fixed bugs: Added self param to instance methods and inner classes where needed
1862	04/14/2012 02:32 PM	Aaron Marcuse-Kubitza	parallel.py: Changed to use multi-producer pool, which requires calling pool.main_loop()
1861	04/14/2012 01:04 PM	Aaron Marcuse-Kubitza	parallel.py: Pool: Added doc comment
1860	04/14/2012 01:03 PM	Aaron Marcuse-Kubitza	parallel.py: Pool: apply_async(): Return a result object like multiprocessing.Pool.apply_async()
1859	04/14/2012 12:53 PM	Aaron Marcuse-Kubitza	bin/map: Use new parallel.py for parallel processing
1858	04/14/2012 12:51 PM	Aaron Marcuse-Kubitza	Added parallel.py for parallel processing
1857	04/14/2012 12:37 PM	Aaron Marcuse-Kubitza	bin/map: Use dummy synchronous Pool implementation if not using parallel processing
1856	04/14/2012 12:18 PM	Aaron Marcuse-Kubitza	bin/map: Use multiprocessing instead of pp for parallel processing because it's easier to use (it uses the Python threading API and doesn't require providing all the functions a task calls). Allow the user to set the cpus option to use all system CPUs (needed because in test mode, the default is 0 CPUs to turn off parallel processing).
1855	04/13/2012 04:41 PM	Aaron Marcuse-Kubitza	disown_all, stop_imports: Use /bin/bash instead of /bin/sh because array subscripting is used
1854	04/13/2012 04:38 PM	Aaron Marcuse-Kubitza	input.Makefile: Editing import: Use $(datasrc) instead of $(db) since $(db) is only set for DB-source inputs
1853	04/13/2012 04:31 PM	Aaron Marcuse-Kubitza	input.Makefile: Import: If profile is on and test mode is on, output formatted profile stats to stdout
1852	04/13/2012 03:00 PM	Aaron Marcuse-Kubitza	sql.py: index_cols(): Cache return values in db.index_cols
1851	04/13/2012 02:56 PM	Aaron Marcuse-Kubitza	bin/map: Don't import pp unless cpus != 0 because it's slow and doesn't need to happen if we're not using parallelization. cpus option defaults to 0 in test mode so tests run faster.
1850	04/13/2012 02:52 PM	Aaron Marcuse-Kubitza	sql.py: pkey(): Use pkeys cache from db object instead of parameter
1849	04/13/2012 02:44 PM	Aaron Marcuse-Kubitza	sql.py: Wrapped db connection inside an object that can also store the cache of the pkeys and index_cols
1848	04/13/2012 02:27 PM	Aaron Marcuse-Kubitza	bin/map: If cpus is 0, run without Parallel Python
1847	04/13/2012 02:19 PM	Aaron Marcuse-Kubitza	bin/map: Set up Parallel Python with an env-var-customizable # CPUs
1846	04/13/2012 02:18 PM	Aaron Marcuse-Kubitza	bin/map: Set up Parallel Python with an env-var-customizable # CPUs
1845	04/13/2012 12:58 PM	Aaron Marcuse-Kubitza	root Makefile: python-Linux: Added `sudo pip install pp`
1844	04/13/2012 12:47 PM	Aaron Marcuse-Kubitza	root Makefile: python-Linux: Added python-parallel to installs
1843	04/13/2012 12:19 PM	Aaron Marcuse-Kubitza	mappings: Build VegX-VegBIEN.organisms.csv from VegX-VegBIEN.stems.csv instead of vice versa. This entails switching the roots around so stem points to organism instead of the other way around, which is a complex operation. Re-rooted VegX-VegBIEN.organisms.csv at /plantobservation instead of /taxonoccurrence to avoid traveling up the hierarchy to taxonoccurrence and back down again to plantobservation, etc. as would otherwise have been the case.
1842	04/13/2012 11:43 AM	Aaron Marcuse-Kubitza	bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it
1841	04/13/2012 10:45 AM	Aaron Marcuse-Kubitza	bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it
1840	04/13/2012 10:44 AM	Aaron Marcuse-Kubitza	xpath.py: get(): forward (parent-to-child) pointers: If last target object exists but doesn't have an ID attr (which indicates a bug), recover gracefully by just assuming the ID is 0. (Any bug will be noticeable in the output, which needs to be generated through workarounds like this in order to be able to debug.)
1839	04/10/2012 05:18 PM	Aaron Marcuse-Kubitza	VegX mappings: Updated stemParent mapping for VegX 1.5.3
1838	04/10/2012 04:54 PM	Aaron Marcuse-Kubitza	VegX mappings: Changed taxonDetermination of role identifier to instead have explicitly no role, because data providers' VegX files generally do not provide role information and we don't want the default taxonDetermination XPaths to require this
1837	04/10/2012 04:34 PM	Aaron Marcuse-Kubitza	inputs/CTFS/maps/VegX.organisms.csv: Connected plot to plotObservation by using new support for backward (child-to-parent) pointers whose target is a text element containing an ID
1836	04/10/2012 04:33 PM	Aaron Marcuse-Kubitza	xml_dom.py: get_id(): If the node doesn't have an ID, assumes the node itself is the ID. This enables backward (child-to-parent) pointers whose target is a text element containing an ID, rather than a regular element with an ID attribute.
1835	04/10/2012 04:04 PM	Aaron Marcuse-Kubitza	VegX mappings: Map locationevent.sourceaccessioncode to plotUniqueIdentifier since this field is no longer being used by authorlocationcode
1834	04/10/2012 03:48 PM	Aaron Marcuse-Kubitza	VegX mappings: Map the authorlocationcode to plotName instead of plotUniqueIdentifier because it's a better fit
1833	04/10/2012 03:13 PM	Aaron Marcuse-Kubitza	inputs/CTFS/maps/VegX.organisms.csv: Fixed bug in Species taxonConcept mapping where the role was computer instead of identifier
1832	04/10/2012 03:11 PM	Aaron Marcuse-Kubitza	xml_dom.py: value(): Skip comment nodes. This fixes a bug where comments inside text elements would prevent the value from being retrieved.
1831	04/10/2012 03:02 PM	Aaron Marcuse-Kubitza	inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>
1830	04/10/2012 02:16 PM	Aaron Marcuse-Kubitza	inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>
1829	04/10/2012 01:59 PM	Aaron Marcuse-Kubitza	inputs/CTFS/maps/VegX.organisms.csv: Added taxonConcept mappings
1828	04/10/2012 01:59 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.organisms.csv: Added species taxonConcept mapping for identifier role
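
Revisions 1891 through 1906 describe a connection-wide query cache: the key is a tuple of the unmogrified query and its params, exceptions are cached alongside results, and debug logging labels each query as hit, miss, or non-cacheable. The sketch below illustrates that idea only. It is not the project's sql.py: `CachingConn`, this `make_hashable()`, and the use of `sqlite3` are all assumptions made so the example is self-contained, and it caches exceptions for every cacheable query rather than only for inserts.

```python
import sqlite3


def make_hashable(value):
    """Hypothetical helper: recursively turn dicts and lists into tuples."""
    if isinstance(value, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable(v) for v in value)
    return value


class CachingConn:
    """Illustrative wrapper around a DB-API connection (not the real DbConn)."""

    def __init__(self, conn, log=None):
        self.conn = conn
        self.log = log or (lambda msg: None)  # caller-supplied debug logger
        self.cache = {}  # (query, params) -> ('rows', list) or ('error', exc)

    def run_query(self, query, params=(), cacheable=False):
        if not cacheable:
            # Skip the cache entirely, so nothing is stored for this query.
            self.log('non-cacheable: ' + query)
            return self.conn.execute(query, params).fetchall()

        # Key on the raw query plus params; params may be a tuple or a dict.
        key = (query, make_hashable(params))
        if key in self.cache:
            self.log('hit: ' + query)
            kind, value = self.cache[key]
        else:
            self.log('miss: ' + query)
            try:
                kind, value = 'rows', self.conn.execute(query, params).fetchall()
            except sqlite3.Error as e:
                kind, value = 'error', e  # remember the failure as well
            self.cache[key] = (kind, value)

        if kind == 'error':
            raise value  # re-raise the cached exception
        return value


messages = []
db = CachingConn(sqlite3.connect(':memory:'), log=messages.append)
db.run_query('CREATE TABLE t (id INTEGER PRIMARY KEY)')
db.run_query('INSERT INTO t VALUES (?)', (1,), cacheable=True)
# The identical insert is answered from the cache without touching the DB.
db.run_query('INSERT INTO t VALUES (?)', (1,), cacheable=True)
```

Caching an insert is only safe under the reasoning given in r1905 and r1906: repeating an insert that already succeeded or already failed would not change the database, so the stored outcome can stand in for it.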
