# Date Author Comment
1927 04/21/2012 01:22 PM Aaron Marcuse-Kubitza

sql.py: DbConn: For non-cacheable queries, use a plain cursor() instead of a DbCursor to avoid the overhead of saving the result and wrapping the cursor

1926 04/20/2012 05:20 PM Aaron Marcuse-Kubitza

Moved db_config_names from bin/map to sql.py so it can be used by other scripts as well

1925 04/20/2012 04:52 PM Aaron Marcuse-Kubitza

csv2ddl: Also print a COPY FROM statement

1924 04/20/2012 04:47 PM Aaron Marcuse-Kubitza

input.Makefile: Fixed bug where input type was considered to be different things if both $(inputFiles) and $(dbExport) are non-empty. Now, $(inputFiles) takes precedence so that the presence of any input files will cause a DB dump to be ignored. This ensures that a (slower) input DB is not used over a (faster) flat file.

1923 04/20/2012 04:21 PM Aaron Marcuse-Kubitza

csvs.py: stream_info(): Added parse_header option. reader_and_header(): Use stream_info()'s new parse_header option.

1922 04/20/2012 03:53 PM Aaron Marcuse-Kubitza

csv2ddl: Renamed schema name env var from datasrc to schema to reflect what it is, and to make the script general beyond importing inputs

1921 04/20/2012 03:32 PM Aaron Marcuse-Kubitza

input.Makefile: Moved Installation, Staging tables after Existing maps discovery because they depend on it. Staging tables: Create a staging table for each table a map spreadsheet is available for. Put double quotes around the schema name so its case is preserved.

1920 04/20/2012 03:29 PM Aaron Marcuse-Kubitza

Added csv2ddl to make a PostgreSQL CREATE TABLE statement from a CSV header
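
A minimal sketch of the idea behind csv2ddl, not the script itself. The names `quote_ident` and `create_table_ddl` are hypothetical, and typing every column as text is an assumption; the log does not say which column types the real script emits.

```python
import csv
import io

def quote_ident(name):
    """Double-quote an SQL identifier so that its case is preserved."""
    return '"' + name.replace('"', '""') + '"'

def create_table_ddl(stream, table, schema=None):
    """Build a CREATE TABLE statement from the header row of a CSV stream."""
    header = next(csv.reader(stream))
    qual_table = quote_ident(table)
    if schema is not None:
        qual_table = quote_ident(schema) + '.' + qual_table
    # Assumption: every column is created as text
    cols = ',\n'.join('    ' + quote_ident(col) + ' text' for col in header)
    return 'CREATE TABLE ' + qual_table + ' (\n' + cols + '\n);'

sample = io.StringIO('plotName,Species\nA1,Quercus alba\n')
ddl = create_table_ddl(sample, 'organisms', schema='CTFS')
print(ddl)
```

Only the first CSV row is read, so the sketch works on a stream of any length.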

1919 04/20/2012 03:28 PM Aaron Marcuse-Kubitza

sql.py: Input validation: Moved section after Database connections because some of its functions require a connection. Added esc_name_by_module() and esc_name_by_engine(), and use esc_name_by_module() in esc_name().

1918 04/20/2012 02:18 PM Aaron Marcuse-Kubitza

input.Makefile: Installation: Create a schema for the datasource in VegBIEN as part of the installation process. This will be used to hold staging tables.

1917 04/20/2012 01:57 PM Aaron Marcuse-Kubitza

input.Makefile: Changed install, uninstall to depend on src/install, src/uninstall targets, which in turn depend on db, rm_db. This will allow us to add additional install actions for all input types.

1916 04/19/2012 07:17 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Cache the constructed CacheCursor itself, rather than the dict that's used to create it

1915 04/19/2012 07:06 PM Aaron Marcuse-Kubitza

sql.py: pkey(): Changed to use the connection-wide caching mechanism rather than its own custom cache. DbConn.__getstate__(): Don't pickle the debug callback.

1914 04/19/2012 07:00 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Added is_cached(). run_query(): Use new DbConn.is_cached() to avoid creating a savepoint if the query is cached.

1913 04/19/2012 06:52 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Also cache cursor.description

1912 04/19/2012 06:50 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Cache query results as a dict subset of the cursor's key attributes, so that additional attributes can easily be cached by adding them to the subset list

1911 04/19/2012 06:48 PM Aaron Marcuse-Kubitza

dicts.py: Added AttrsDictView

1910 04/19/2012 06:47 PM Aaron Marcuse-Kubitza

util.py: NamedTuple.__iter__(): Removed unnecessary **attrs param

1909 04/19/2012 06:30 PM Aaron Marcuse-Kubitza

sql.py: _query_lookup(): Fixed bug where params was cast to a tuple, even though it could also be a dict. index_cols(): Changed to use the connection-wide caching mechanism rather than its own custom cache.

1908 04/19/2012 06:28 PM Aaron Marcuse-Kubitza

util.py: NamedTuple: Made it usable as a hashable dict (with string keys) by adding iter() and getitem()

1907 04/19/2012 06:27 PM Aaron Marcuse-Kubitza

dicts.py: Added make_hashable()
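
One way such a helper could look, as a sketch only; the body of the real make_hashable() is not shown in this log. It assumes the goal is to let dict or list query params be used as part of a cache key.

```python
def make_hashable(value):
    """Recursively convert dicts and sequences into hashable equivalents."""
    if isinstance(value, dict):
        return frozenset((key, make_hashable(val)) for key, val in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable(val) for val in value)
    return value  # assumed to be hashable already (str, int, None, ...)

# Usage: dict params become part of a dict key (the query is illustrative)
cache = {}
query = 'SELECT * FROM plot WHERE id = %(id)s'
cache[(query, make_hashable({'id': 5}))] = [(5, 'A1')]
```

Two dicts with equal contents produce equal keys regardless of insertion order, because a frozenset is unordered.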

1906 04/17/2012 09:59 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Only cache exceptions for inserts since they are not idempotent, but an invalid insert will always be invalid. If a cached result is an exception, re-raise it in a separate method rather than in the constructor to ensure that the cursor object is still created, and that its query instance var is set.
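
The exception-only caching described above can be sketched as follows. This is an illustration under assumptions: `ExcCachingConn` is a hypothetical class, sqlite3 stands in for the real database driver, and `sqlite3.IntegrityError` stands in for the duplicate-key exception.

```python
import sqlite3

class ExcCachingConn:
    """Hypothetical wrapper that caches only the failures of inserts."""

    def __init__(self, conn):
        self.conn = conn
        self.exc_cache = {}
        self.executed = 0  # number of statements actually sent to the DB

    def insert(self, query, params):
        key = (query, tuple(params))
        if key in self.exc_cache:
            raise self.exc_cache[key]  # known-invalid insert: skip the DB
        self.executed += 1
        try:
            return self.conn.execute(query, params).lastrowid
        except sqlite3.IntegrityError as e:
            self.exc_cache[key] = e  # an invalid insert stays invalid
            raise
        # Successful inserts are not cached because they are not idempotent

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE plot (id INTEGER PRIMARY KEY)')
db = ExcCachingConn(conn)
query = 'INSERT INTO plot (id) VALUES (?)'
db.insert(query, (1,))  # succeeds
failures = 0
for _ in range(2):
    try:
        db.insert(query, (1,))  # duplicate key
    except sqlite3.IntegrityError:
        failures += 1
```

The second duplicate insert fails from the cache without reaching the database.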

1905 04/17/2012 09:11 PM Aaron Marcuse-Kubitza

sql.py: insert(): Cache insert queries by default. This works because any DuplicateKeyException, etc. would be cached as well. This saves many inserts for rows that we already know are in the database.

1904 04/17/2012 09:06 PM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Cache exceptions raised by queries as well

1903 04/17/2012 08:48 PM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): When debug logging, label queries with their cache status (hit/miss/non-cacheable)

1902 04/17/2012 08:25 PM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Also debug-log queries that produce exceptions

1901 04/17/2012 08:18 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Allow creator to provide a log function to call on debug messages, instead of using stderr directly

1900 04/17/2012 08:01 PM Aaron Marcuse-Kubitza

bin/map: Pass debug mode to DbConn so that SQL query debugging works again

1899 04/17/2012 07:49 PM Aaron Marcuse-Kubitza

sql.py: DbConn: DbCursor: Fixed bug where caching was always turned on, by passing the cacheable setting to it from run_query(). Turned caching back on (uncommented it) since it's now working.

1898 04/17/2012 07:21 PM Aaron Marcuse-Kubitza

bin/map: map_rows()/map_table(): Pass kw_args to process_rows() so rows_start can be specified when using them. DB inputs: Skip the pre-start rows in the SQL query itself, so that they don't need to be iterated over by the cursor in the main loop.

1897 04/17/2012 07:07 PM Aaron Marcuse-Kubitza

bin/map: Fixed bug introduced in r1718 where the row # would not be incremented if i < start, causing a semi-infinite loop that only ended when the input rows were exhausted. process_rows(): Added optional rows_start parameter to use if the input rows already have the pre-start rows skipped.

1896 04/17/2012 05:49 PM Aaron Marcuse-Kubitza

input.Makefile: Sources: cat: Changed Usage message to use "--silent" make option

1895 04/17/2012 05:45 PM Aaron Marcuse-Kubitza

input.Makefile: Sources: cat: Added Usage message with instructions for removing echoed make commands

1894 04/17/2012 05:17 PM Aaron Marcuse-Kubitza

run_*query(): Fixed bug where INSERTs, etc. were cached by making callers (such as select()) explicitly turn on caching. DbConn.run_query(): Fixed bug where cur.mogrify() was not supported under MySQL by making the cache key a tuple of the unmogrified query and its params instead of the mogrified string query. CacheCursor: Store attributes of the original cursor that we use, such as query and rowcount.
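
A sketch of keying the cache on the query and its params rather than on a mogrified string, with caching turned on explicitly by the caller. `CachingConn` is a hypothetical stand-in for DbConn, and sqlite3 is used only so the example is self-contained.

```python
import sqlite3

class CachingConn:
    """Hypothetical wrapper that caches results of cacheable queries."""

    def __init__(self, conn):
        self.conn = conn
        self.query_results = {}
        self.hits = 0

    def run_query(self, query, params=(), cacheable=False):
        # The key needs no driver-specific cur.mogrify() call
        key = (query, tuple(params))
        if cacheable and key in self.query_results:
            self.hits += 1
            return self.query_results[key]
        rows = self.conn.execute(query, params).fetchall()
        if cacheable:  # callers such as a select() opt in; inserts do not
            self.query_results[key] = rows
        return rows

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE plot (id INTEGER PRIMARY KEY, name TEXT)')
conn.execute("INSERT INTO plot VALUES (1, 'A1')")
db = CachingConn(conn)
select = 'SELECT name FROM plot WHERE id = ?'
first = db.run_query(select, (1,), cacheable=True)
second = db.run_query(select, (1,), cacheable=True)
```

This sketch assumes positional params; dict params would first have to be converted to something hashable.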

1893 04/17/2012 04:38 PM Aaron Marcuse-Kubitza

sql.py: Made row() and value() cache the result by fetching all rows before returning the first row

1892 04/17/2012 04:37 PM Aaron Marcuse-Kubitza

iters.py: Added func_iter() and consume_iter()

1891 04/17/2012 04:11 PM Aaron Marcuse-Kubitza

sql.py: Cache the results of queries (when all rows are read)
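
Caching only when all rows are read could be done with an iterator wrapper along these lines. This is a sketch, not the code from sql.py; `CachingRows` and `store` are hypothetical names.

```python
class CachingRows:
    """Iterate over rows, reporting the complete list once exhausted."""

    def __init__(self, rows, on_complete):
        self._rows = iter(rows)
        self._seen = []
        self._on_complete = on_complete
        self._done = False

    def __iter__(self):
        return self

    def __next__(self):
        try:
            row = next(self._rows)
        except StopIteration:
            if not self._done:  # report the full result only once
                self._done = True
                self._on_complete(self._seen)
            raise
        self._seen.append(row)
        return row

cache = {}
query = 'SELECT name FROM plot'

def store(rows):
    cache[query] = rows

partial = CachingRows([('A1',), ('B2',)], store)
first_row = next(partial)  # stops early, so nothing is cached
cached_after_partial = query in cache

full = CachingRows([('A1',), ('B2',)], store)
all_rows = list(full)  # reads everything, so the result is cached
```

A partially read result is never stored, which is why a function that returns only the first row has to fetch all rows first if it wants its result cached.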

1890 04/17/2012 03:48 PM Aaron Marcuse-Kubitza

Proxy.py: Fixed infinite recursion bug by removing setattr() (which prevents the class and subclasses from storing instance variables using "self." syntax)
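
A minimal sketch of a proxy that overrides only attribute reads, so that subclasses can still store their own instance variables. This is an illustration rather than the contents of Proxy.py, and `CountingList` is a hypothetical subclass. If attribute writes were forwarded as well, `self.inner = inner` would itself need `self.inner`, which is presumably the recursion the fix removed.

```python
class Proxy:
    """Forward attribute reads not found on self to an inner object."""

    def __init__(self, inner):
        self.inner = inner  # plain instance variable: no __setattr__ override

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        if name == 'inner':  # guard against recursion before inner is set
            raise AttributeError(name)
        return getattr(self.inner, name)

class CountingList(Proxy):
    """Hypothetical subclass that keeps its own state using self. syntax."""

    def __init__(self, inner):
        Proxy.__init__(self, inner)
        self.appends = 0

    def append(self, item):
        self.appends += 1
        self.inner.append(item)

wrapped = CountingList([])
wrapped.append('row')
occurrences = wrapped.count('row')  # forwarded to list.count()
```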

1889 04/16/2012 10:19 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Added run_query(). run_raw_query(): Use new DbConn.run_query().

1888 04/16/2012 10:18 PM Aaron Marcuse-Kubitza

Added Proxy.py

1887 04/16/2012 09:32 PM Aaron Marcuse-Kubitza

parallel.py: MultiProducerPool: Added code to create a shared Namespace object, commented out. Updated share() doc comment to reflect that it will writably share the values as well.

1886 04/16/2012 08:49 PM Aaron Marcuse-Kubitza

bin/map: Share locals() with the pool at various times to try to get as many unpicklable values into the shared vars as possible

1885 04/16/2012 08:45 PM Aaron Marcuse-Kubitza

dicts.py: Turned id_dict() factory function into IdDict class. parallel.py: MultiProducerPool: Added share_vars(). main_loop(): Only consider the program to be done if the queue is empty and there are no running tasks.

1884 04/16/2012 08:00 PM Aaron Marcuse-Kubitza

collection.py: rmap(): Treat only built-in sequences specially instead of iterables. Pass whether the value is a leaf to the func. Added option to only recurse up to a certain # of levels.
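
The behavior described above might look like the following sketch. The parameter name `max_depth` and the exact calling convention of `func` are assumptions; only the three listed features come from the log.

```python
def rmap(func, value, max_depth=None):
    """Recursively map func over nested built-in sequences.

    func is called as func(value, is_leaf). Strings and other iterables
    are treated as leaves; only lists and tuples are recursed into.
    """
    is_seq = isinstance(value, (list, tuple))
    if is_seq and (max_depth is None or max_depth > 0):
        child_depth = None if max_depth is None else max_depth - 1
        mapped = type(value)(rmap(func, item, child_depth) for item in value)
        return func(mapped, False)
    return func(value, True)  # leaf, or depth limit reached

def double_leaves(value, is_leaf):
    return value * 2 if is_leaf else value

nested = rmap(double_leaves, [1, (2, [3])])
with_str = rmap(double_leaves, ['ab', [1]])
limited = rmap(double_leaves, [1, [2]], max_depth=1)
```

With `max_depth=1`, the inner list `[2]` is passed to `func` as a leaf instead of being recursed into.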

1883 04/16/2012 07:10 PM Aaron Marcuse-Kubitza

Added lists.py

1882 04/16/2012 04:40 PM Aaron Marcuse-Kubitza

collection.py: rmap(): Fixed bugs: Made it recursive. Use iters.is_iterable() instead of isinstance(value, list) to work on all iterables. Use value and not nonexistent var list_.

1881 04/16/2012 04:38 PM Aaron Marcuse-Kubitza

iters.py: Added is_iterable()

1880 04/16/2012 04:11 PM Aaron Marcuse-Kubitza

parallel.py: prepickle(): Pickle all objects in vars_id_dict_ by ID, not just unpicklable ones. This ensures that a DB connection created in the main process will be shared with subprocesses by reference (id()) instead of by value, so that each process can take advantage of e.g. shared caches in the connection object. Note that this may require some synchronization.
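
Pickling by ID can be illustrated with the standard library's persistent-ID hooks. This is a swapped-in technique for illustration, not necessarily how prepickle() works, and the names `shared`, `share`, `dumps`, and `loads` are hypothetical. It is only meaningful where both sides see the same registry, for example within a single process as here.

```python
import io
import pickle

shared = {}  # maps id(obj) -> obj; also keeps the objects alive

def share(obj):
    """Register an object so it is pickled by reference instead of by value."""
    shared[id(obj)] = obj
    return obj

class ByIdPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Return an ID for shared objects, or None to pickle normally
        obj_id = id(obj)
        return obj_id if obj_id in shared else None

class ByIdUnpickler(pickle.Unpickler):
    def persistent_load(self, pid):
        return shared[pid]

def dumps(obj):
    buf = io.BytesIO()
    ByIdPickler(buf).dump(obj)
    return buf.getvalue()

def loads(data):
    return ByIdUnpickler(io.BytesIO(data)).load()

on_error = share(lambda e: None)  # lambdas cannot be pickled by value
task_args = ['row 1', on_error]
restored = loads(dumps(task_args))
```

The restored list holds the very same `on_error` object, so any state attached to a shared object (such as a cache) is not duplicated.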

1879 04/16/2012 04:06 PM Aaron Marcuse-Kubitza

parallel.py: MultiProducerPool.main_loop(): Got rid of no longer correct doc comment

1878 04/16/2012 04:05 PM Aaron Marcuse-Kubitza

bin/map: Share on_error with the pool

1877 04/16/2012 04:05 PM Aaron Marcuse-Kubitza

parallel.py: MultiProducerPool: Pickle objects by ID if they're accessible to the main_loop process. This should allow e.g. DB connections and pools to be pickled, if they were defined in the main process.

1876 04/14/2012 09:31 PM Aaron Marcuse-Kubitza

Added dicts.py with id_dict() and MergeDict

1875 04/14/2012 09:30 PM Aaron Marcuse-Kubitza

Added collection.py with rmap()

1874 04/14/2012 07:38 PM Aaron Marcuse-Kubitza

db_xml.py: put(): Moved pool.apply_async() from put_child() to put_(), and don't use lambdas because they can't be pickled

1873 04/14/2012 07:35 PM Aaron Marcuse-Kubitza

parallel.py: MultiProducerPool.apply_async(): Prepickle all function args. Try pickling the args before the queue pickles them, to get better debugging output.

1872 04/14/2012 07:33 PM Aaron Marcuse-Kubitza

sql.py: with_savepoint(): Use new rand.rand_int()

1871 04/14/2012 07:33 PM Aaron Marcuse-Kubitza

rand.py: rand_int(): Fixed bug where newly-created objects did not have unique IDs because they were on the stack. So, we have to use random.randint() anyway.

1870 04/14/2012 07:27 PM Aaron Marcuse-Kubitza

Added rand.py

1869 04/14/2012 06:56 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Made it picklable by establishing a connection on demand
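
Connecting on demand could be sketched as below. `LazyConn` is a hypothetical stand-in for DbConn, and sqlite3 (whose connection objects cannot be pickled) is used only to keep the example self-contained.

```python
import pickle
import sqlite3

class LazyConn:
    """Hypothetical wrapper that opens its connection on first use."""

    def __init__(self, db_path):
        self.db_path = db_path  # picklable settings only
        self._conn = None

    @property
    def conn(self):
        if self._conn is None:  # connect on demand
            self._conn = sqlite3.connect(self.db_path)
        return self._conn

    def __getstate__(self):
        state = self.__dict__.copy()
        state['_conn'] = None  # never pickle the live connection
        return state

db = LazyConn(':memory:')
db.conn.execute('SELECT 1')  # forces a connection to be opened
clone = pickle.loads(pickle.dumps(db))
clone_unconnected = clone._conn is None
clone_row = clone.conn.execute('SELECT 1').fetchone()  # reconnects lazily
```

The unpickled copy carries only the settings and opens its own connection the first time one is needed.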

1868 04/14/2012 06:54 PM Aaron Marcuse-Kubitza

bin/map: Also consume asynchronous tasks before closing the DB connection (this is where most if not all tasks will be consumed)

1867 04/14/2012 06:44 PM Aaron Marcuse-Kubitza

Runnable.py: Made it picklable

1866 04/14/2012 06:44 PM Aaron Marcuse-Kubitza

Added eval_.py

1865 04/14/2012 05:35 PM Aaron Marcuse-Kubitza

Added Runnable

1864 04/14/2012 03:05 PM Aaron Marcuse-Kubitza

db_xml.py: put(): Added parallel processing support for inserting children with fkeys to parent asynchronously

1863 04/14/2012 03:03 PM Aaron Marcuse-Kubitza

parallel.py: Fixed bugs: Added self param to instance methods and inner classes where needed

1862 04/14/2012 02:32 PM Aaron Marcuse-Kubitza

parallel.py: Changed to use multi-producer pool, which requires calling pool.main_loop()

1861 04/14/2012 01:04 PM Aaron Marcuse-Kubitza

parallel.py: Pool: Added doc comment

1860 04/14/2012 01:03 PM Aaron Marcuse-Kubitza

parallel.py: Pool: apply_async(): Return a result object like multiprocessing.Pool.apply_async()

1859 04/14/2012 12:53 PM Aaron Marcuse-Kubitza

bin/map: Use new parallel.py for parallel processing

1858 04/14/2012 12:51 PM Aaron Marcuse-Kubitza

Added parallel.py for parallel processing

1857 04/14/2012 12:37 PM Aaron Marcuse-Kubitza

bin/map: Use dummy synchronous Pool implementation if not using parallel processing
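
A dummy synchronous pool might look like this sketch, which mirrors the `apply_async()` and result-object interface of `multiprocessing.Pool`. The class names `SyncPool` and `SyncResult` are hypothetical.

```python
class SyncResult:
    """Already-completed result, like multiprocessing's AsyncResult."""

    def __init__(self, value):
        self._value = value

    def get(self, timeout=None):
        return self._value

    def wait(self, timeout=None):
        pass  # nothing to wait for

    def ready(self):
        return True

    def successful(self):
        return True

class SyncPool:
    """Runs each task immediately in the calling process."""

    def apply_async(self, func, args=(), kwds=None, callback=None):
        value = func(*args, **(kwds or {}))
        if callback is not None:
            callback(value)
        return SyncResult(value)

pool = SyncPool()
seen = []
result = pool.apply_async(pow, (2, 5), callback=seen.append)
```

Calling code can use the same interface whether or not parallel processing is turned on. In this sketch an exception raised by a task propagates from `apply_async()` itself rather than from `get()`.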

1856 04/14/2012 12:18 PM Aaron Marcuse-Kubitza

bin/map: Use multiprocessing instead of pp for parallel processing because it's easier to use (it uses the Python threading API and doesn't require providing all the functions a task calls). Allow the user to set the cpus option to use all system CPUs (needed because in test mode, the default is 0 CPUs to turn off parallel processing).

1855 04/13/2012 04:41 PM Aaron Marcuse-Kubitza

disown_all, stop_imports: Use /bin/bash instead of /bin/sh because array subscripting is used

1854 04/13/2012 04:38 PM Aaron Marcuse-Kubitza

input.Makefile: Editing import: Use $(datasrc) instead of $(db) since $(db) is only set for DB-source inputs

1853 04/13/2012 04:31 PM Aaron Marcuse-Kubitza

input.Makefile: Import: If profile is on and test mode is on, output formatted profile stats to stdout

1852 04/13/2012 03:00 PM Aaron Marcuse-Kubitza

sql.py: index_cols(): Cache return values in db.index_cols

1851 04/13/2012 02:56 PM Aaron Marcuse-Kubitza

bin/map: Don't import pp unless cpus != 0 because it's slow and doesn't need to happen if we're not using parallelization. cpus option defaults to 0 in test mode so tests run faster.

1850 04/13/2012 02:52 PM Aaron Marcuse-Kubitza

sql.py: pkey(): Use pkeys cache from db object instead of parameter

1849 04/13/2012 02:44 PM Aaron Marcuse-Kubitza

sql.py: Wrapped db connection inside an object that can also store the cache of the pkeys and index_cols

1848 04/13/2012 02:27 PM Aaron Marcuse-Kubitza

bin/map: If cpus is 0, run without Parallel Python

1847 04/13/2012 02:19 PM Aaron Marcuse-Kubitza

bin/map: Set up Parallel Python with an env-var-customizable # CPUs

1846 04/13/2012 02:18 PM Aaron Marcuse-Kubitza

bin/map: Set up Parallel Python with an env-var-customizable # CPUs

1845 04/13/2012 12:58 PM Aaron Marcuse-Kubitza

root Makefile: python-Linux: Added `sudo pip install pp`

1844 04/13/2012 12:47 PM Aaron Marcuse-Kubitza

root Makefile: python-Linux: Added python-parallel to installs

1843 04/13/2012 12:19 PM Aaron Marcuse-Kubitza

mappings: Build VegX-VegBIEN.organisms.csv from VegX-VegBIEN.stems.csv instead of vice versa. This entails switching the roots around so stem points to organism instead of the other way around, which is a complex operation. Re-rooted VegX-VegBIEN.organisms.csv at /plantobservation instead of /taxonoccurrence to avoid traveling up the hierarchy to taxonoccurrence and back down again to plantobservation, etc. as would otherwise have been the case.

1842 04/13/2012 11:43 AM Aaron Marcuse-Kubitza

bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it

1841 04/13/2012 10:45 AM Aaron Marcuse-Kubitza

bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it

1840 04/13/2012 10:44 AM Aaron Marcuse-Kubitza

xpath.py: get(): forward (parent-to-child) pointers: If last target object exists but doesn't have an ID attr (which indicates a bug), recover gracefully by just assuming the ID is 0. (Any bug will be noticeable in the output, which needs to be generated through workarounds like this in order to be able to debug.)

1839 04/10/2012 05:18 PM Aaron Marcuse-Kubitza

VegX mappings: Updated stemParent mapping for VegX 1.5.3

1838 04/10/2012 04:54 PM Aaron Marcuse-Kubitza

VegX mappings: Changed taxonDetermination of role identifier to instead have explicitly no role, because data providers' VegX files generally do not provide role information and we don't want the default taxonDetermination XPaths to require this

1837 04/10/2012 04:34 PM Aaron Marcuse-Kubitza

inputs/CTFS/maps/VegX.organisms.csv: Connected plot to plotObservation by using new support for backward (child-to-parent) pointers whose target is a text element containing an ID

1836 04/10/2012 04:33 PM Aaron Marcuse-Kubitza

xml_dom.py: get_id(): If the node doesn't have an ID, assume the node itself is the ID. This enables backward (child-to-parent) pointers whose target is a text element containing an ID, rather than a regular element with an ID attribute.

1835 04/10/2012 04:04 PM Aaron Marcuse-Kubitza

VegX mappings: Map locationevent.sourceaccessioncode to plotUniqueIdentifier since this field is no longer being used by authorlocationcode

1834 04/10/2012 03:48 PM Aaron Marcuse-Kubitza

VegX mappings: Map the authorlocationcode to plotName instead of plotUniqueIdentifier because it's a better fit

1833 04/10/2012 03:13 PM Aaron Marcuse-Kubitza

inputs/CTFS/maps/VegX.organisms.csv: Fixed bug in Species taxonConcept mapping where the role was computer instead of identifier

1832 04/10/2012 03:11 PM Aaron Marcuse-Kubitza

xml_dom.py: value(): Skip comment nodes. This fixes a bug where comments inside text elements would prevent the value from being retrieved.
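
Skipping comment nodes when reading a text element's value can be sketched with `xml.dom.minidom`. This is an assumption about the approach, not the code in xml_dom.py, and the sample element is illustrative.

```python
from xml.dom import Node, minidom

def value(node):
    """Return the text of an element, ignoring any comment children."""
    parts = [child.data for child in node.childNodes
             if child.nodeType == Node.TEXT_NODE]
    return ''.join(parts) if parts else None

doc = minidom.parseString('<plotName><!-- verified -->A1</plotName>')
text = value(doc.documentElement)

empty = minidom.parseString('<plotName><!-- no value --></plotName>')
no_text = value(empty.documentElement)
```

Code that looks only at the first child would see the comment node in the first document and miss the text that follows it.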

1831 04/10/2012 03:02 PM Aaron Marcuse-Kubitza

inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>

1830 04/10/2012 02:16 PM Aaron Marcuse-Kubitza

inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>

1829 04/10/2012 01:59 PM Aaron Marcuse-Kubitza

inputs/CTFS/maps/VegX.organisms.csv: Added taxonConcept mappings

1828 04/10/2012 01:59 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.organisms.csv: Added species taxonConcept mapping for identifier role