Activity

From 03/26/2012 to 04/24/2012

04/24/2012

06:32 PM Revision 1988: input.Makefile: Testing: Don't abort tester if only staging test fails, in case staging table missing
Aaron Marcuse-Kubitza
06:25 PM Revision 1987: input.Makefile: Testing: When cleaning up test outputs, remove everything that doesn't end in .ref
Aaron Marcuse-Kubitza
06:11 PM Revision 1986: input.Makefile: Testing: Added test/import.%.staging.out test to test the staging tables. Sources: cat: Updated Usage comment to include the "inputs/<datasrc>/" prefix the user would need to add when running make.
Aaron Marcuse-Kubitza
05:33 PM Revision 1985: bin/map: Fixed bug where mapping to same DB wouldn't work because by-column optimization wasn't implemented yet, by turning it off by default and allowing it to be enabled with an env var
Aaron Marcuse-Kubitza
05:25 PM Revision 1984: bin/map: DB inputs: Use by-column optimization if mapping to same DB (with skeleton code for optimization's implementation)
Aaron Marcuse-Kubitza
05:12 PM Revision 1983: input.Makefile: Mapping: Use the staging tables instead of any flat files if use_staged is specified
Aaron Marcuse-Kubitza
05:10 PM Revision 1982: bin/map: Support custom schema name. Support input table/schema override via env vars, in case the map spreadsheet was written for a different input format.
Aaron Marcuse-Kubitza
05:01 PM Revision 1981: sql.py: qual_name(): Fixed bugs where esc_name() nested func couldn't have same name as outer func, and esc_name() needed to be invoked without the module name because it's in the same module. select(): Support already-escaped table names.
Aaron Marcuse-Kubitza
04:16 PM Revision 1980: main Makefile: $(psqlAsAdmin): Tell sudo to preserve env vars so PGOPTIONS is passed to psql
Aaron Marcuse-Kubitza
03:33 PM Revision 1979: root map: Fill in defaults for inputs from VegBIEN, as well as outputs to it
Aaron Marcuse-Kubitza
02:59 PM Revision 1978: disown_all: Updated to use main function, local vars, $self, etc. like other bash scripts run using "."
Aaron Marcuse-Kubitza
02:55 PM Revision 1977: vegbien_dest: Fixed bug where it would give a usage error if run from a makefile rule, because the BASH_LINENO would be 0, by also checking if ${BASH_ARGV[0]} is ${BASH_SOURCE[0]}
Aaron Marcuse-Kubitza
02:28 PM Revision 1976: postgres_vegbien: Fixed bug where interpreter did not match vegbien_dest's new required interpreter of /bin/bash
Aaron Marcuse-Kubitza
02:23 PM Revision 1975: vegbien_dest: Changed interpreter to /bin/bash. Removed comment that it requires var bien_password.
Aaron Marcuse-Kubitza
02:20 PM Revision 1974: postgres_vegbien: Removed no longer needed retrieval of bien_password
Aaron Marcuse-Kubitza
02:20 PM Revision 1973: vegbien_dest: Get bien_password by searching relative to $self, which we now have a way to get in a bash script (${BASH_SOURCE[0]}), rather than requiring the caller to set it. Provide usage error if run without initial ".".
Aaron Marcuse-Kubitza
02:12 PM Revision 1972: input.Makefile: Staging tables: import/install-%: Use new quiet option to determine whether to tee output to terminal. Don't use log option because that's always set to true except in test mode, which doesn't apply to installs.
Aaron Marcuse-Kubitza
02:12 PM Revision 1971: input.Makefile: Staging tables: import/install-%: Use new quiet option to determine whether to tee output to terminal. Don't use log option because that's always set to true except in test mode, which doesn't apply to installs.
Aaron Marcuse-Kubitza
01:56 PM Revision 1970: main Makefile: PostgreSQL: Edit /etc/phppgadmin/apache.conf to replace "deny from all" with "allow from all", instead of uncommenting an "allow from all" that may not be there
Aaron Marcuse-Kubitza
01:35 PM Revision 1969: input.Makefile: Sources: Fixed bug where cat was defined before $(tables), by moving Sources after Existing maps discovery and putting just $(inputFiles) and $(dbExport) from Sources at the beginning of Existing maps discovery
Aaron Marcuse-Kubitza
01:05 PM Revision 1968: sql.py: Made truncate(), tables(), empty_db() schema-aware. Added qual_name(). tables(): Added option to filter tables by a LIKE pattern.
Aaron Marcuse-Kubitza
12:34 PM Revision 1967: main Makefile: VegBIEN DB: Install public schema in a separate step, so that it can be dropped without dropping the entire DB (which also contains staging tables that shouldn't be dropped when there is a schema change). Added schemas/install, schemas/uninstall, implicit schemas/reinstall to manage the public schema separately from the rest of the DB. Moved Subdir forwarding to the bottom so overridden targets are not forwarded. README.TXT: Since `make reinstall_db` would drop the entire DB, tell user to run new `make schemas/reinstall` instead to reinstall (main) DB from schema.
Aaron Marcuse-Kubitza
12:30 PM Revision 1966: schemas/postgresql.Mac.conf: Set unix_socket_directory to the new dir it seems to be using, which is now /tmp
Aaron Marcuse-Kubitza
11:43 AM Revision 1965: csv2db: Fixed bug where extra columns were not truncated in INSERT mode. Replace empty column names with the column # to avoid errors with CSVs that have trailing ","s, etc.
Aaron Marcuse-Kubitza
11:41 AM Revision 1964: streams.py: StreamIter: Define readline() as a separate method so it can be overridden, and all calls to self.next() will use the overridden readline(). This fixes a bug in ProgressInputStream where incremental counts would not be displayed and it would end with "not all input read" if the StreamIter interface was used instead of readline().
Aaron Marcuse-Kubitza
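
The r1964 change above (routing iteration through an overridable readline()) can be sketched as follows. This is a minimal illustration, not the actual streams.py code: the class bodies are assumptions, `CountingStream` is a hypothetical stand-in for ProgressInputStream, and Python 3's `__next__()` is used where the original refers to `next()`.

```python
class StreamIter:
    """Iterator wrapper whose iteration always goes through readline()."""

    def __init__(self, stream):
        self.stream = stream

    def readline(self):
        # Separate method so subclasses can override it
        return self.stream.readline()

    def __iter__(self):
        return self

    def __next__(self):
        # Calls self.readline(), so an overridden readline() is used here too
        line = self.readline()
        if line == '':
            raise StopIteration
        return line


class CountingStream(StreamIter):
    """Hypothetical subclass that counts lines, whichever interface is used."""

    def __init__(self, stream):
        super().__init__(stream)
        self.line_count = 0

    def readline(self):
        line = super().readline()
        if line != '':
            self.line_count += 1
        return line
```

With this layout, both `for line in stream` and direct readline() calls update the count, which is the behavior the commit message describes.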

04/23/2012

09:57 PM Revision 1963: csv2db: Fall back to manually inserting each row (autodetecting the encoding for each field) if COPY FROM doesn't work
Aaron Marcuse-Kubitza
09:56 PM Revision 1962: streams.py: FilterStream: Inherit from StreamIter so that all descendants automatically have StreamIter functionality
Aaron Marcuse-Kubitza
09:42 PM Revision 1961: sql.py: insert(): Support using the default value for columns designated with the special value sql.default
Aaron Marcuse-Kubitza
09:21 PM Revision 1960: sql.py: insert(): Support rows that are just a list of values, with no columns. Support already-escaped table names.
Aaron Marcuse-Kubitza
08:54 PM Revision 1959: strings.py: Added contains_any()
Aaron Marcuse-Kubitza
08:54 PM Revision 1958: csvs.py: reader_and_header(): Use make_reader()
Aaron Marcuse-Kubitza
08:07 PM Revision 1957: Added reinstall_all to reinstall all inputs at once
Aaron Marcuse-Kubitza
08:06 PM Revision 1956: with_all: Documented that it must be run from the root svn directory
Aaron Marcuse-Kubitza
08:05 PM Revision 1955: input.Makefile: Staging tables: import/install-%: Only install staging table if input contains only CSV sources. Changed $(isXml) to $(isCsv) (negated) everywhere because rules almost always only run something if input contains only CSV sources, rather than if input contains XML sources.
Aaron Marcuse-Kubitza
07:21 PM Revision 1954: input.Makefile: Staging tables: import/install-%: Output load status to log file if log option is set
Aaron Marcuse-Kubitza
07:00 PM Revision 1953: Scripts that are meant to be run in the calling shell: Fixed bug where running the script inside another script would make the script think it was being run as a program, and abort with a usage error
Aaron Marcuse-Kubitza
06:56 PM Revision 1952: Scripts that are meant to be run in the calling shell: Fixed bug where running the script as a program (without initial ".") wouldn't be able to call return in something that was not a function. Converted all code to a <script_name>_main method so that return would work properly again. Converted all variables to local variables.
Aaron Marcuse-Kubitza
06:38 PM Revision 1951: env_password: return instead of exit if password not yet stored, in case user is running it from a shell without the initial "-" argument. (This would be the case if the user is just testing out the script, instead of using a command that env_password directs them to run.)
Aaron Marcuse-Kubitza
05:43 PM Revision 1950: env_password: Use ${BASH_SOURCE[0]} for $self and $self for $0. return instead of exit on usage error in case user is running it from a shell.
Aaron Marcuse-Kubitza
05:36 PM Revision 1949: stop_imports: Use ${BASH_SOURCE[0]} for $self and $self for $0
Aaron Marcuse-Kubitza
05:36 PM Revision 1948: import_all: Use new with_all. Use ${BASH_SOURCE[0]} for $self and $self for $0.
Aaron Marcuse-Kubitza
05:34 PM Revision 1947: Added with_all to run a make target on all inputs at once
Aaron Marcuse-Kubitza
05:05 PM Revision 1946: Made row #s 1-based to the user to match up with the staging table row #s
Aaron Marcuse-Kubitza
04:59 PM Revision 1945: bin/map: Fixed bug where limit passed to sql.select() was end instead of the # rows, causing extra rows to be fetched when start > 0. Documented that row #s start with 0.
Aaron Marcuse-Kubitza
04:19 PM Revision 1944: Removed no longer needed csv2ddl
Aaron Marcuse-Kubitza
04:19 PM Revision 1943: input.Makefile: Staging tables: import/install-%: Use new csv2db instead of csv2ddl/$(psqlAsBien), because it handles translating encodings properly
Aaron Marcuse-Kubitza
04:14 PM Revision 1942: Added csv2db to load a command's CSV output stream into a PostgreSQL table
Aaron Marcuse-Kubitza
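
The limit bug fixed in r1945 above can be illustrated with a small sketch. It assumes a 0-based, half-open row range [start, end); the helper name `limit_offset` is hypothetical and does not come from bin/map or sql.py.

```python
def limit_offset(start, end):
    """Return (limit, offset) for fetching rows start..end-1 (0-based).

    The limit is the number of rows wanted, not the end index.
    Passing end as the limit would fetch `start` extra rows
    whenever start > 0.
    """
    if end is None:
        return None, start  # no upper bound: no LIMIT clause needed
    return end - start, start
```

For example, rows 10 through 14 need a limit of 5 with an offset of 10, whereas using the end index (15) as the limit would return 15 rows.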

04/21/2012

09:32 PM Revision 1941: schemas/postgresql.Mac.conf: Set unix_socket_directory to the appropriate Mac OS X dir, since otherwise, the socket is apparently not created and `make reinstall_db` doesn't work
Aaron Marcuse-Kubitza
09:30 PM Revision 1940: main Makefile: VegBIEN DB: db: Set LC_COLLATE and LC_CTYPE explicitly, to make it easier to change them
Aaron Marcuse-Kubitza
09:29 PM Revision 1939: Added ProgressInputStream
Aaron Marcuse-Kubitza
09:28 PM Revision 1938: exc.py: print_ex(): Added plain option to leave out traceback
Aaron Marcuse-Kubitza
06:48 PM Revision 1937: main Makefile: VegBIEN DB: db: Use template0 to allow encodings other than UTF-8. Because template0 doesn't have plpgsql on PostgreSQL before 9.x, add "CREATE PROCEDURAL LANGUAGE plpgsql;" manually in schemas/vegbien.sql.make, and filter it back out on PostgreSQL after 9.x using db_dump_localize.
Aaron Marcuse-Kubitza
06:39 PM Revision 1936: PostgreSQL-MySQL.csv: Remove "CREATE PROCEDURAL LANGUAGE" statements
Aaron Marcuse-Kubitza
06:36 PM Revision 1935: Added db_dump_localize to translate a PostgreSQL DB dump for the local server's version
Aaron Marcuse-Kubitza
06:32 PM Revision 1934: Added db_dump_localize to translate a PostgreSQL DB dump for the local server's version
Aaron Marcuse-Kubitza
03:42 PM Revision 1933: vegbien_dest: Added option to override the prefix of the created vars
Aaron Marcuse-Kubitza
03:35 PM Revision 1932: schemas/vegbien.sql.make: Fixed bug where data sources' schemas were also exported by exporting only the public schema. Note that this also removes the "CREATE OR REPLACE PROCEDURAL LANGUAGE plpgsql" statement, so that it doesn't have to be filtered out with `grep -v`.
Aaron Marcuse-Kubitza
03:19 PM Revision 1931: input.Makefile: Use `$(catSrcs)|` instead of $(withCatSrcs) where possible
Aaron Marcuse-Kubitza
03:00 PM Revision 1930: sql.py: pkey(): Fixed bug where results were not being cached because the rows hadn't been explicitly fetched, by having DbConn.DbCursor.execute() fetch all rows if the rowcount is 0 and it's not an insert statement. DbConn.DbCursor: Made _is_insert an attribute rather than a method, which is set as soon as the query is known. Added consume_rows(). Moved Result retrieval section above Database connections because it's used by DbConn.
Aaron Marcuse-Kubitza
02:28 PM Revision 1929: sql.py: pkey(): Fixed bug where queries were not being cached. Use select() instead of run_query() so that caching is automatically turned on and table names are automatically escaped.
Aaron Marcuse-Kubitza
01:37 PM Revision 1928: streams.py: Added LineCountInputStream, which is faster than LineCountStream for input streams. Added InputStreamsOnlyException and raise it in all *InputStream classes' write() methods.
Aaron Marcuse-Kubitza
01:22 PM Revision 1927: sql.py: DbConn: For non-cacheable queries, use a plain cursor() instead of a DbCursor to avoid the overhead of saving the result and wrapping the cursor
Aaron Marcuse-Kubitza

04/20/2012

05:20 PM Revision 1926: Moved db_config_names from bin/map to sql.py so it can be used by other scripts as well
Aaron Marcuse-Kubitza
04:52 PM Revision 1925: csv2ddl: Also print a COPY FROM statement
Aaron Marcuse-Kubitza
04:47 PM Revision 1924: input.Makefile: Fixed bug where input type was considered to be different things if both $(inputFiles) and $(dbExport) are non-empty. Now, $(inputFiles) takes precedence so that the presence of any input files will cause a DB dump to be ignored. This ensures that a (slower) input DB is not used over a (faster) flat file.
Aaron Marcuse-Kubitza
04:21 PM Revision 1923: csvs.py: stream_info(): Added parse_header option. reader_and_header(): Use stream_info()'s new parse_header option.
Aaron Marcuse-Kubitza
03:53 PM Revision 1922: csv2ddl: Renamed schema name env var from datasrc to schema to reflect what it is, and to make the script general beyond importing inputs
Aaron Marcuse-Kubitza
03:32 PM Revision 1921: input.Makefile: Moved Installation, Staging tables after Existing maps discovery because they depend on it. Staging tables: Create a staging table for each table a map spreadsheet is available for. Put double quotes around the schema name so its case is preserved.
Aaron Marcuse-Kubitza
03:29 PM Revision 1920: Added csv2ddl to make a PostgreSQL CREATE TABLE statement from a CSV header
Aaron Marcuse-Kubitza
03:28 PM Revision 1919: sql.py: Input validation: Moved section after Database connections because some of its functions require a connection. Added esc_name_by_module() and esc_name_by_engine(), and use esc_name_by_module() in esc_name().
Aaron Marcuse-Kubitza
02:18 PM Revision 1918: input.Makefile: Installation: Create a schema for the datasource in VegBIEN as part of the installation process. This will be used to hold staging tables.
Aaron Marcuse-Kubitza
01:57 PM Revision 1917: input.Makefile: Changed install, uninstall to depend on src/install, src/uninstall targets, which in turn depend on db, rm_db. This will allow us to add additional install actions for all input types.
Aaron Marcuse-Kubitza

04/19/2012

07:17 PM Revision 1916: sql.py: DbConn: Cache the constructed CacheCursor itself, rather than the dict that's used to create it
Aaron Marcuse-Kubitza
07:06 PM Revision 1915: sql.py: pkey(): Changed to use the connection-wide caching mechanism rather than its own custom cache. DbConn.__getstate__(): Don't pickle the debug callback.
Aaron Marcuse-Kubitza
07:00 PM Revision 1914: sql.py: DbConn: Added is_cached(). run_query(): Use new DbConn.is_cached() to avoid creating a savepoint if the query is cached.
Aaron Marcuse-Kubitza
06:52 PM Revision 1913: sql.py: DbConn: Also cache cursor.description
Aaron Marcuse-Kubitza
06:50 PM Revision 1912: sql.py: DbConn: Cache query results as a dict subset of the cursor's key attributes, so that additional attributes can easily be cached by adding them to the subset list
Aaron Marcuse-Kubitza
06:48 PM Revision 1911: dicts.py: Added AttrsDictView
Aaron Marcuse-Kubitza
06:47 PM Revision 1910: util.py: NamedTuple.__iter__(): Removed unnecessary **attrs param
Aaron Marcuse-Kubitza
06:30 PM Revision 1909: sql.py: _query_lookup(): Fixed bug where params was cast to a tuple, even though it could also be a dict. index_cols(): Changed to use the connection-wide caching mechanism rather than its own custom cache.
Aaron Marcuse-Kubitza
06:28 PM Revision 1908: util.py: NamedTuple: Made it usable as a hashable dict (with string keys) by adding __iter__() and __getitem__()
Aaron Marcuse-Kubitza
06:27 PM Revision 1907: dicts.py: Added make_hashable()
Aaron Marcuse-Kubitza

04/17/2012

09:59 PM Revision 1906: sql.py: DbConn: Only cache exceptions for inserts since they are not idempotent, but an invalid insert will always be invalid. If a cached result is an exception, re-raise it in a separate method other than the constructor to ensure that the cursor object is still created, and that its query instance var is set.
Aaron Marcuse-Kubitza
09:11 PM Revision 1905: sql.py: insert(): Cache insert queries by default. This works because any DuplicateKeyException, etc. would be cached as well. This saves many inserts for rows that we already know are in the database.
Aaron Marcuse-Kubitza
09:06 PM Revision 1904: sql.py: DbConn.run_query(): Cache exceptions raised by queries as well
Aaron Marcuse-Kubitza
08:48 PM Revision 1903: sql.py: DbConn.run_query(): When debug logging, label queries with their cache status (hit/miss/non-cacheable)
Aaron Marcuse-Kubitza
08:25 PM Revision 1902: sql.py: DbConn.run_query(): Also debug-log queries that produce exceptions
Aaron Marcuse-Kubitza
08:18 PM Revision 1901: sql.py: DbConn: Allow creator to provide a log function to call on debug messages, instead of using stderr directly
Aaron Marcuse-Kubitza
08:01 PM Revision 1900: bin/map: Pass debug mode to DbConn so that SQL query debugging works again
Aaron Marcuse-Kubitza
07:49 PM Revision 1899: sql.py: DbConn: DbCursor: Fixed bug where caching was always turned on, by passing the cacheable setting to it from run_query(). Turned caching back on (uncommented it) since it's now working.
Aaron Marcuse-Kubitza
07:21 PM Revision 1898: bin/map: map_rows()/map_table(): Pass kw_args to process_rows() so rows_start can be specified when using them. DB inputs: Skip the pre-start rows in the SQL query itself, so that they don't need to be iterated over by the cursor in the main loop.
Aaron Marcuse-Kubitza
07:07 PM Revision 1897: bin/map: Fixed bug introduced in r1718 where the row # would not be incremented if i < start, causing a semi-infinite loop that only ended when the input rows were exhausted. process_rows(): Added optional rows_start parameter to use if the input rows already have the pre-start rows skipped.
Aaron Marcuse-Kubitza
05:49 PM Revision 1896: input.Makefile: Sources: cat: Changed Usage message to use "--silent" make option
Aaron Marcuse-Kubitza
05:45 PM Revision 1895: input.Makefile: Sources: cat: Added Usage message with instructions for removing echoed make commands
Aaron Marcuse-Kubitza
05:17 PM Revision 1894: run_*query(): Fixed bug where INSERTs, etc. were cached by making callers (such as select()) explicitly turn on caching. DbConn.run_query(): Fixed bug where cur.mogrify() was not supported under MySQL by making the cache key a tuple of the unmogrified query and its params instead of the mogrified string query. CacheCursor: Store attributes of the original cursor that we use, such as query and rowcount.
Aaron Marcuse-Kubitza
04:38 PM Revision 1893: sql.py: Made row() and value() cache the result by fetching all rows before returning the first row
Aaron Marcuse-Kubitza
04:37 PM Revision 1892: iters.py: Added func_iter() and consume_iter()
Aaron Marcuse-Kubitza
04:11 PM Revision 1891: sql.py: Cache the results of queries (when all rows are read)
Aaron Marcuse-Kubitza
03:48 PM Revision 1890: Proxy.py: Fixed infinite recursion bug by removing __setattr__() (which prevents the class and subclasses from storing instance variables using "self." syntax)
Aaron Marcuse-Kubitza
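
The caching approach described in r1891 and r1894 above (opt-in caching, keyed on the unmogrified query plus its params) might look roughly like this. It is a sketch under stated assumptions, not the sql.py implementation: `CachingConn` is a hypothetical class, sqlite3 stands in for the real database drivers, and the body of `make_hashable()` is a guess at what a helper of that name would do.

```python
import sqlite3


def make_hashable(value):
    """Convert dicts and lists to nested tuples so they can be dict keys."""
    if isinstance(value, dict):
        return tuple(sorted((k, make_hashable(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable(v) for v in value)
    return value


class CachingConn:
    """Hypothetical connection wrapper with an opt-in query result cache."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0

    def run_query(self, query, params=(), cacheable=False):
        # Callers must turn caching on explicitly, so INSERTs etc.
        # are not cached by accident
        if not cacheable:
            return self.conn.execute(query, params).fetchall()
        # Key on the raw query and its params rather than a
        # driver-specific interpolated string
        key = (query, make_hashable(params))
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        rows = self.conn.execute(query, params).fetchall()
        self.cache[key] = rows
        return rows
```

Keying on the (query, params) tuple avoids depending on a driver feature such as cur.mogrify(), which the commit message notes is not available under MySQL.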

04/16/2012

10:19 PM Revision 1889: sql.py: DbConn: Added run_query(). run_raw_query(): Use new DbConn.run_query().
Aaron Marcuse-Kubitza
10:18 PM Revision 1888: Added Proxy.py
Aaron Marcuse-Kubitza
09:32 PM Revision 1887: parallel.py: MultiProducerPool: Added code to create a shared Namespace object, commented out. Updated share() doc comment to reflect that it will writably share the values as well.
Aaron Marcuse-Kubitza
08:49 PM Revision 1886: bin/map: Share locals() with the pool at various times to try to get as many unpicklable values into the shared vars as possible
Aaron Marcuse-Kubitza
08:45 PM Revision 1885: dicts.py: Turned id_dict() factory function into IdDict class. parallel.py: MultiProducerPool: Added share_vars(). main_loop(): Only consider the program to be done if the queue is empty *and* there are no running tasks.
Aaron Marcuse-Kubitza
08:00 PM Revision 1884: collection.py: rmap(): Treat only built-in sequences specially instead of iterables. Pass whether the value is a leaf to the func. Added option to only recurse up to a certain # of levels.
Aaron Marcuse-Kubitza
07:10 PM Revision 1883: Added lists.py
Aaron Marcuse-Kubitza
04:40 PM Revision 1882: collection.py: rmap(): Fixed bugs: Made it recursive. Use iters.is_iterable() instead of isinstance(value, list) to work on all iterables. Use value and not nonexistent var list_.
Aaron Marcuse-Kubitza
04:38 PM Revision 1881: iters.py: Added is_iterable()
Aaron Marcuse-Kubitza
04:11 PM Revision 1880: parallel.py: prepickle(): Pickle *all* objects in vars_id_dict_ by ID, not just unpicklable ones. This ensures that a DB connection created in the main process will be shared with subprocesses by reference (id()) instead of by value, so that each process can take advantage of e.g. shared caches in the connection object. Note that this may require some synchronization.
Aaron Marcuse-Kubitza
04:06 PM Revision 1879: parallel.py: MultiProducerPool.main_loop(): Got rid of no longer correct doc comment
Aaron Marcuse-Kubitza
04:05 PM Revision 1878: bin/map: Share on_error with the pool
Aaron Marcuse-Kubitza
04:05 PM Revision 1877: parallel.py: MultiProducerPool: Pickle objects by ID if they're accessible to the main_loop process. This should allow e.g. DB connections and pools to be pickled, if they were defined in the main process.
Aaron Marcuse-Kubitza

04/14/2012

09:31 PM Revision 1876: Added dicts.py with id_dict() and MergeDict
Aaron Marcuse-Kubitza
09:30 PM Revision 1875: Added collection.py with rmap()
Aaron Marcuse-Kubitza
07:38 PM Revision 1874: db_xml.py: put(): Moved pool.apply_async() from put_child() to put_(), and don't use lambdas because they can't be pickled
Aaron Marcuse-Kubitza
07:35 PM Revision 1873: parallel.py: MultiProducerPool.apply_async(): Prepickle all function args. Try pickling the args before the queue pickles them, to get better debugging output.
Aaron Marcuse-Kubitza
07:33 PM Revision 1872: sql.py: with_savepoint(): Use new rand.rand_int()
Aaron Marcuse-Kubitza
07:33 PM Revision 1871: rand.py: rand_int(): Fixed bug where newly-created objects did not have unique IDs because they were on the stack. So, we have to use random.randint() anyway.
Aaron Marcuse-Kubitza
07:27 PM Revision 1870: Added rand.py
Aaron Marcuse-Kubitza
06:56 PM Revision 1869: sql.py: DbConn: Made it picklable by establishing a connection on demand
Aaron Marcuse-Kubitza
06:54 PM Revision 1868: bin/map: Also consume asynchronous tasks before closing the DB connection (this is where most if not all tasks will be consumed)
Aaron Marcuse-Kubitza
06:44 PM Revision 1867: Runnable.py: Made it picklable
Aaron Marcuse-Kubitza
06:44 PM Revision 1866: Added eval_.py
Aaron Marcuse-Kubitza
05:35 PM Revision 1865: Added Runnable
Aaron Marcuse-Kubitza
03:05 PM Revision 1864: db_xml.py: put(): Added parallel processing support for inserting children with fkeys to parent asynchronously
Aaron Marcuse-Kubitza
03:03 PM Revision 1863: parallel.py: Fixed bugs: Added self param to instance methods and inner classes where needed
Aaron Marcuse-Kubitza
02:32 PM Revision 1862: parallel.py: Changed to use multi-producer pool, which requires calling pool.main_loop()
Aaron Marcuse-Kubitza
01:04 PM Revision 1861: parallel.py: Pool: Added doc comment
Aaron Marcuse-Kubitza
01:03 PM Revision 1860: parallel.py: Pool: apply_async(): Return a result object like multiprocessing.Pool.apply_async()
Aaron Marcuse-Kubitza
12:53 PM Revision 1859: bin/map: Use new parallel.py for parallel processing
Aaron Marcuse-Kubitza
12:51 PM Revision 1858: Added parallel.py for parallel processing
Aaron Marcuse-Kubitza
12:37 PM Revision 1857: bin/map: Use dummy synchronous Pool implementation if not using parallel processing
Aaron Marcuse-Kubitza
12:18 PM Revision 1856: bin/map: Use multiprocessing instead of pp for parallel processing because it's easier to use (it uses the Python threading API and doesn't require providing all the functions a task calls). Allow the user to set the cpus option to use all system CPUs (needed because in test mode, the default is 0 CPUs to turn off parallel processing).
Aaron Marcuse-Kubitza

04/13/2012

04:41 PM Revision 1855: disown_all, stop_imports: Use /bin/bash instead of /bin/sh because array subscripting is used
Aaron Marcuse-Kubitza
04:38 PM Revision 1854: input.Makefile: Editing import: Use $(datasrc) instead of $(db) since $(db) is only set for DB-source inputs
Aaron Marcuse-Kubitza
04:31 PM Revision 1853: input.Makefile: Import: If profile is on and test mode is on, output formatted profile stats to stdout
Aaron Marcuse-Kubitza
03:00 PM Revision 1852: sql.py: index_cols(): Cache return values in db.index_cols
Aaron Marcuse-Kubitza
02:56 PM Revision 1851: bin/map: Don't import pp unless cpus != 0 because it's slow and doesn't need to happen if we're not using parallelization. cpus option defaults to 0 in test mode so tests run faster.
Aaron Marcuse-Kubitza
02:52 PM Revision 1850: sql.py: pkey(): Use pkeys cache from db object instead of parameter
Aaron Marcuse-Kubitza
02:44 PM Revision 1849: sql.py: Wrapped db connection inside an object that can also store the cache of the pkeys and index_cols
Aaron Marcuse-Kubitza
02:27 PM Revision 1848: bin/map: If cpus is 0, run without Parallel Python
Aaron Marcuse-Kubitza
02:19 PM Revision 1847: bin/map: Set up Parallel Python with an env-var-customizable # CPUs
Aaron Marcuse-Kubitza
02:18 PM Revision 1846: bin/map: Set up Parallel Python with an env-var-customizable # CPUs
Aaron Marcuse-Kubitza
12:58 PM Revision 1845: root Makefile: python-Linux: Added `sudo pip install pp`
Aaron Marcuse-Kubitza
12:47 PM Revision 1844: root Makefile: python-Linux: Added python-parallel to installs
Aaron Marcuse-Kubitza
12:19 PM Revision 1843: mappings: Build VegX-VegBIEN.organisms.csv from VegX-VegBIEN.stems.csv instead of vice versa. This entails switching the roots around so stem points to organism instead of the other way around, which is a complex operation. Re-rooted VegX-VegBIEN.organisms.csv at /plantobservation instead of /taxonoccurrence to avoid traveling up the hierarchy to taxonoccurrence and back down again to plantobservation, etc. as would otherwise have been the case.
Aaron Marcuse-Kubitza
11:43 AM Revision 1842: bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it
Aaron Marcuse-Kubitza
10:45 AM Revision 1841: bin/map: When determining if outer elements are types, look for /*s/ anywhere in the string instead of just at the beginning, because there might be root attrs (namespaces), etc. before it
Aaron Marcuse-Kubitza
10:44 AM Revision 1840: xpath.py: get(): forward (parent-to-child) pointers: If last target object exists but doesn't have an ID attr (which indicates a bug), recover gracefully by just assuming the ID is 0. (Any bug will be noticeable in the output, which needs to be generated through workarounds like this in order to be able to debug.)
Aaron Marcuse-Kubitza

04/10/2012

05:18 PM Revision 1839: VegX mappings: Updated stemParent mapping for VegX 1.5.3
Aaron Marcuse-Kubitza
04:54 PM Revision 1838: VegX mappings: Changed taxonDetermination of role identifier to instead have explicitly no role, because data providers' VegX files generally do not provide role information and we don't want the default taxonDetermination XPaths to require this
Aaron Marcuse-Kubitza
04:34 PM Revision 1837: inputs/CTFS/maps/VegX.organisms.csv: Connected plot to plotObservation by using new support for backward (child-to-parent) pointers whose target is a text element containing an ID
Aaron Marcuse-Kubitza
04:33 PM Revision 1836: xml_dom.py: get_id(): If the node doesn't have an ID, assume the node itself is the ID. This enables backward (child-to-parent) pointers whose target is a text element containing an ID, rather than a regular element with an ID attribute.
Aaron Marcuse-Kubitza
04:04 PM Revision 1835: VegX mappings: Map locationevent.sourceaccessioncode to plotUniqueIdentifier since this field is no longer being used by authorlocationcode
Aaron Marcuse-Kubitza
03:48 PM Revision 1834: VegX mappings: Map the authorlocationcode to plotName instead of plotUniqueIdentifier because it's a better fit
Aaron Marcuse-Kubitza
03:13 PM Revision 1833: inputs/CTFS/maps/VegX.organisms.csv: Fixed bug in Species taxonConcept mapping where the role was computer instead of identifier
Aaron Marcuse-Kubitza
03:11 PM Revision 1832: xml_dom.py: value(): Skip comment nodes. This fixes a bug where comments inside text elements would prevent the value from being retrieved.
Aaron Marcuse-Kubitza
03:02 PM Revision 1831: inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>
Aaron Marcuse-Kubitza
02:16 PM Revision 1830: inputs/CTFS/test: Accepted test outputs for new VegX_CTFS_row_120000_bci.0.test.organisms.xml instead of VegX_CTFS_row_180000.0.test.organisms.xml, which didn't have <taxonNameUsageConcepts> that match up with <individualOrganisms>
Aaron Marcuse-Kubitza
01:59 PM Revision 1829: inputs/CTFS/maps/VegX.organisms.csv: Added taxonConcept mappings
Aaron Marcuse-Kubitza
01:59 PM Revision 1828: mappings/VegX-VegBIEN.organisms.csv: Added species taxonConcept mapping for identifier role
Aaron Marcuse-Kubitza
01:33 PM Revision 1827: Added expand_xpath to expand XPath abbreviations
Aaron Marcuse-Kubitza
12:43 PM Revision 1826: VegX mappings: Renamed taxonNameUsageConceptsID to taxonNameUsageConceptID (no plural) to match VegX 1.5.3
Aaron Marcuse-Kubitza
12:33 PM Revision 1825: inputs/CTFS/maps/VegX.organisms.csv: Corrected CensusNumber input mapping
Aaron Marcuse-Kubitza
12:24 PM Revision 1824: mappings/Makefile: Generate self maps for all core maps
Aaron Marcuse-Kubitza
12:19 PM Revision 1823: mappings/Makefile: VegX-VegBIEN.stems.csv: Removed $(rootAttrs) from out root because stems don't use tcs namespace elements (stems don't have taxonDeterminations separate from the main organism)
Aaron Marcuse-Kubitza
12:13 PM Revision 1822: VegX mappings: taxonConcept mappings: Added "tcs:" namespace prefix to appropriate elements. This will make the taxonConcept XPaths compatible with CTFS VegX.
Aaron Marcuse-Kubitza

04/09/2012

06:52 PM Revision 1821: input.Makefile: Vars/functions: Make: $(subMake): When forwarding to another dir based off of $(root), forward to $(root) rather than directly to the dir of the target. This ensures that any special targets that are only defined in the root Makefile still get run, even when the target is in a subdir with its own Makefile.
Aaron Marcuse-Kubitza
06:41 PM Revision 1820: inputs/CTFS/test: Accepted initial test outputs. A lot of leaves are still unmapped with the default mappings.
Aaron Marcuse-Kubitza
06:40 PM Revision 1819: inputs/CTFS/maps: Added initial maps
Aaron Marcuse-Kubitza
06:39 PM Revision 1818: VegX mappings: taxonConcept mappings: Added "tcs:" namespace prefix to appropriate elements. This will make the taxonConcept XPaths compatible with CTFS VegX.
Aaron Marcuse-Kubitza
06:13 PM Revision 1817: input.Makefile: Maps building: full via maps (maps/$(via).%.full.csv): $(makeFullCsv): Sort all maps so that rows are re-ordered whether or not a core self map exists. This way, if a core self map is created, it will not cause the sort order of the generated via-format XMLs to change. This makes it easier to accept any changes to test outputs that result from adding a core self map.
Aaron Marcuse-Kubitza
05:53 PM Revision 1816: mappings/Makefile: VegX: Added VegX.self.organisms.csv. Added root attrs to chRoot maps, commented out since it's not ready to be checked in yet.
Aaron Marcuse-Kubitza
05:34 PM Revision 1815: xpath.py: get(): Run xml_dom.by_tag_name() with ignore_namespace=False (possibly later set to True)
Aaron Marcuse-Kubitza
05:32 PM Revision 1814: xml_dom.py: Comments: Added clean_comment() and mk_comment(). Searching child nodes: by_tag_name(): Added ignore_namespace option to ignore namespace of node name.
Aaron Marcuse-Kubitza
05:26 PM Revision 1813: root Makefile: Added %-remake target
Aaron Marcuse-Kubitza
04:53 PM Revision 1812: mappings/Makefile: Renamed joinMaps to dwcMaps and chrootMaps to vegxMaps. Added commented-out code to create VegX.self.organisms.csv (not ready to check in yet because it affects many dependent maps).
Aaron Marcuse-Kubitza
02:52 PM Revision 1811: input.Makefile: Removed no longer needed $(noEmptyMap)
Aaron Marcuse-Kubitza
12:40 PM Revision 1810: xml_func.py: process(): Use new xml_dom.mk_comment()
Aaron Marcuse-Kubitza
12:40 PM Revision 1809: xml_dom.py: Added clean_comment() and mk_comment() to properly sanitize comment contents (comments can't contain '--')
Aaron Marcuse-Kubitza
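Note on r1809: a minimal sketch of the kind of sanitizing described, not the actual xml_dom.py code. The function bodies below are assumptions; only the names clean_comment() and mk_comment() come from the log. XML does not allow "--" inside a comment, and a comment also may not end in "-".

```python
import re

def clean_comment(text):
    """Make text safe to place inside an XML comment (hypothetical body)."""
    # Collapse any run of two or more hyphens into a single hyphen
    text = re.sub(r'-{2,}', '-', text)
    # A trailing hyphen would merge with the closing "-->", so pad it
    if text.endswith('-'):
        text += ' '
    return text

def mk_comment(text):
    """Return a comment as a string (the real helper may build a DOM node)."""
    return '<!--' + clean_comment(text) + '-->'
```

For example, mk_comment('a--b') yields a comment whose body is 'a-b'.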
12:14 PM Revision 1808: Added inputs/TRTE
Aaron Marcuse-Kubitza

04/03/2012

08:26 PM Revision 1807: inputs/QMOR/test: Added initial accepted test outputs
Aaron Marcuse-Kubitza
08:26 PM Revision 1806: inputs/QMOR/maps: Added maps
Aaron Marcuse-Kubitza
08:20 PM Revision 1805: Added inputs/QMOR
Aaron Marcuse-Kubitza
08:14 PM Revision 1804: inputs/MT/test: Added initial accepted test outputs
Aaron Marcuse-Kubitza
08:14 PM Revision 1803: inputs/MT/maps: Added maps
Aaron Marcuse-Kubitza
08:13 PM Revision 1802: mappings/Makefile: DwC-VegBIEN.specimens.csv: Don't call remove_empty to produce it, because join now deals with empty mappings correctly by still raising a warning. Removed no longer needed intermediate DwC.ci-VegBIEN.specimens.csv.
Aaron Marcuse-Kubitza
08:09 PM Revision 1801: join: Also print "No join mapping" warning if a join mapping was found but it was empty. The warning in that case is actually "No non-empty join mapping" to distinguish it from a mapping that's missing entirely. input.Makefile: missing_mappings: Support new "No join mapping" error message.
Aaron Marcuse-Kubitza
08:08 PM Revision 1800: join: Also print "No join mapping" warning if a join mapping was found but it was empty. The warning in that case is actually "No non-empty join mapping" to distinguish it from a mapping that's missing entirely. input.Makefile: missing_mappings: Support new "No join mapping" error message.
Aaron Marcuse-Kubitza
07:33 PM Revision 1799: Added inputs/MT
Aaron Marcuse-Kubitza
07:26 PM Revision 1798: Added disown_all to disown all running jobs
Aaron Marcuse-Kubitza
07:26 PM Revision 1797: stop_imports: Call jobspecs relative to $selfDir, rather than assuming it will be run from the svn root dir
Aaron Marcuse-Kubitza
07:18 PM Revision 1796: union: Call maps.merge_headers() using **dict(prefer=header_num) instead of just prefer=header_num in order to work on Python 2.5.2 (which nimoy is running)
Aaron Marcuse-Kubitza
07:00 PM Revision 1795: inputs/ACAD/test: Accepted initial test outputs
Aaron Marcuse-Kubitza
07:00 PM Revision 1794: Added inputs/ACAD/maps/ maps
Aaron Marcuse-Kubitza
06:59 PM Revision 1793: Accepted new test outputs resulting from the addition of the id -> occurrenceID mapping in mappings/DwC1-DwC2.specimens.csv
Aaron Marcuse-Kubitza
06:57 PM Revision 1792: inputs/SALVIAS*/maps: Cleaned up maps for the first time since all via maps became subject to cleanup
Aaron Marcuse-Kubitza
06:55 PM Revision 1791: input.Makefile: Removed no longer needed default "maps/.$(via).%.csv.last_cleanup" rule
Aaron Marcuse-Kubitza
06:54 PM Revision 1790: input.Makefile: Maps building: Via maps cleanup: Added `env ignore=1` since with the switch to subtracting $(coreMap), all inputs will attempt to subtract some map, even if it's not subtractable
Aaron Marcuse-Kubitza
06:47 PM Revision 1789: input.Makefile: Don't clean src maps, only build them
Aaron Marcuse-Kubitza
06:45 PM Revision 1788: inputs/ARIZ/maps/DwC.specimens.csv: Re-cleaned up to take advantage of additional entries now removed by subtract
Aaron Marcuse-Kubitza
06:36 PM Revision 1787: input.Makefile: Maps building: Via maps cleanup: Subtract $(coreMap) instead of $(coreSelfMap) so that entries whose input and output maps to the same place are subtracted as well
Aaron Marcuse-Kubitza
06:35 PM Revision 1786: subtract: Also remove mappings whose input and output maps to the same non-empty value in map_1
Aaron Marcuse-Kubitza
06:32 PM Revision 1785: util.py: Added all_equal(), all_equal_ignore_none(), have_same_value()
Aaron Marcuse-Kubitza
05:45 PM Revision 1784: mappings/DwC1-DwC2.specimens.csv: Added id -> occurrenceID mapping
Aaron Marcuse-Kubitza
05:43 PM Revision 1783: inputs/SALVIAS-CSV/maps/VegX.%.full.csv: Regenerated using new src maps
Aaron Marcuse-Kubitza
05:41 PM Revision 1782: mappings/DwC1-DwC2.specimens.csv: Added mappings from dcterms elements without namespace to with namespace
Aaron Marcuse-Kubitza
05:40 PM Revision 1781: inputs/SALVIAS-CSV: Built maps/src.%.csv
Aaron Marcuse-Kubitza
05:24 PM Revision 1780: Added inputs/ACAD/maps/src.specimens.csv
Aaron Marcuse-Kubitza
05:23 PM Revision 1779: input.Makefile: Maps building: Autogen src maps with known table names. Sources: $(withCatSrcs): Fixed bug where substitution pattern did not contain %.
Aaron Marcuse-Kubitza
05:22 PM Revision 1778: Added src_map to make a source map spreadsheet from a CSV header
Aaron Marcuse-Kubitza
04:32 PM Revision 1777: input.Makefile: Split Maps section into "Existing maps discovery" and "Maps building" sections. Sources: Added cat, cat-% to cat out sources.
Aaron Marcuse-Kubitza
04:17 PM Revision 1776: input.Makefile: Factored out sources-related code to new Sources section
Aaron Marcuse-Kubitza
04:08 PM Revision 1775: input.Makefile: $(srcMaps): Removed `$(filter-out maps/src.join.%.csv,...)` because maps/src.join.%.csv are no longer created
Aaron Marcuse-Kubitza
03:47 PM Revision 1774: README.TXT: Schema changes: Split updating graphical ERD exports into separate section. Update graphical ERD exports: Added schemas/vegbien.ERD.core.pdf .
Aaron Marcuse-Kubitza
03:42 PM Revision 1773: README.TXT: Added Datasource setup section with instructions to add a new datasource
Aaron Marcuse-Kubitza
03:38 PM Revision 1772: Added inputs/ACAD
Aaron Marcuse-Kubitza
03:37 PM Revision 1771: input.Makefile: Only setSvnIgnore the input dir, since it already exists and doesn't need to be added (inputs/Makefile adds it)
Aaron Marcuse-Kubitza
03:23 PM Revision 1770: inputs/*/maps/DwC.specimens.csv: Removed extraneous XML meta info from DwC column root, since it now just needs to be present in the core via map mappings/DwC-VegBIEN.specimens.csv
Aaron Marcuse-Kubitza
03:22 PM Revision 1769: union: Use new maps.merge_headers() to write properly combined header
Aaron Marcuse-Kubitza
03:21 PM Revision 1768: maps.py: join_combinable(): Fixed roots_combinable() to run on col names instead of roots, which were passed in. merge_mappings(): Factored out mapping column combining into merge_mapping_cols(), which handles an optional prefer param as well to take the header_num env var. Added merge_headers().
Aaron Marcuse-Kubitza
03:17 PM Revision 1767: util.py: Added sort_by_len(), shortest(), longest()
Aaron Marcuse-Kubitza
02:12 PM Revision 1766: join: Use new maps.join_combinable() to check if column names match
Aaron Marcuse-Kubitza
02:11 PM Revision 1765: maps.py: Added cols_combinable() and use it in combinable(). Added join_combinable() and associated helper functions. Added documentation labels to each section.
Aaron Marcuse-Kubitza
01:13 PM Revision 1764: xml_parse.py: ConsecXmlInputStream: Removed read() because that's now defined in streams.FilterStream
Aaron Marcuse-Kubitza
01:11 PM Revision 1763: xml_parse.py: parse_next(): Strip control characters from input stream because they mess up the parser
Aaron Marcuse-Kubitza
01:10 PM Revision 1762: streams.py: FilterStream: Forward all reads to readline()
Aaron Marcuse-Kubitza
01:08 PM Revision 1761: strings.py: Added is_ctrl() and strip_ctrl()
Aaron Marcuse-Kubitza
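Note on r1761/r1763: a sketch of what control-character stripping might look like, assuming that tab, newline, and carriage return are kept (these are the only characters below 0x20 that XML 1.0 permits). The bodies are hypothetical; only the names is_ctrl() and strip_ctrl() come from the log.

```python
def is_ctrl(char):
    """True for control chars that XML 1.0 rejects (assumed definition)."""
    return ord(char) < 32 and char not in '\t\n\r'

def strip_ctrl(text):
    """Remove every character for which is_ctrl() is true."""
    return ''.join(c for c in text if not is_ctrl(c))
```

With this definition, a stray backspace (0x08) in the input is dropped while line endings are preserved.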
08:34 AM Revision 1760: xml_parse.py: parse_next(): On parser error, advance to next XML document since the rest of the current document is corrupted
Aaron Marcuse-Kubitza
08:33 AM Revision 1759: streams.py: Added consume(). Added documentation labels to each section.
Aaron Marcuse-Kubitza
08:23 AM Revision 1758: bin/map: For XML inputs, wrap sys.stdin in a LineCountStream and use new xml_parse.docs_iter() on_error() to add input line # to XML parsing exceptions
Aaron Marcuse-Kubitza
08:21 AM Revision 1757: xml_parse.py: Added on_error() handler to parse_next() (passed through by docs_iter()), so that the caller can add useful info like the input line # to the exception message, and decide whether to suppress the exception rather than re-raise it
Aaron Marcuse-Kubitza
07:19 AM Revision 1756: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined field identificationLabel2 to identificationLabel. Distinguish what are now two identificationLabel fields of the same name by tagging each one with [@id=2] or [@id=1]. inputs/SALVIAS-CSV/maps/VegX.organisms.csv: Merge tag1/stem_tag1 and tag2/stem_tag2 using _alt, since they are never set to different values when both are not NULL (although sometimes just one or just the other is not NULL).
Aaron Marcuse-Kubitza

04/02/2012

05:37 PM Revision 1755: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined field tag2 to identificationLabel2 to reflect that it will become a second instance of identificationLabel
Aaron Marcuse-Kubitza
05:31 PM Revision 1754: VegX-VegBIEN.organisms.csv: Re-mapped individualOrganismObservation user-defined field lineCover to already existing volumeCanopy
Aaron Marcuse-Kubitza
05:29 PM Revision 1753: VegX-VegBIEN.organisms.csv: Re-mapped individualOrganismObservation user-defined field cover to already existing attribute.coverPercent
Aaron Marcuse-Kubitza
05:13 PM Revision 1752: VegX-VegBIEN.organisms.csv: Re-mapped individualOrganismObservation user-defined field count to already existing aggregateOrganismObservation.aggregateValue
Aaron Marcuse-Kubitza
04:44 PM Revision 1751: vegbien.ERD.mwb: Fixed lines
Aaron Marcuse-Kubitza
01:50 PM Revision 1750: README.TXT: Documented that `make reinstall_db` will delete your VegBIEN DB
Aaron Marcuse-Kubitza
01:48 PM Revision 1749: README.TXT: Documented that `make empty_db` will delete your VegBIEN DB
Aaron Marcuse-Kubitza
01:44 PM Revision 1748: root Makefile: empty_db: Confirm deletion just like for rm_db. rm_db: put $(confirmRmDb) on a separate line and move the $(error) call to the main $(confirm) macro since you always want to abort make if the user cancels (not just not run that command).
Aaron Marcuse-Kubitza
01:34 PM Revision 1747: root Makefile: rm_db: If user cancels, abort in case target was reinstall_db to prevent installing
Aaron Marcuse-Kubitza
01:28 PM Revision 1746: root Makefile: core, rm_core: Fixed bug where no longer existing prerequisites postgres_user, rm_postgres_user were not removed
Aaron Marcuse-Kubitza
01:25 PM Revision 1745: root Makefile: rm_db: Confirm deletion with user. Merged postgres_user, rm_postgres_user into db, rm_db so that deletion confirmation applies to user deletion as well (which would indirectly cause the DB to be deleted).
Aaron Marcuse-Kubitza
01:04 PM Revision 1744: README.TXT: Testing: Updated to add missing mappings
Aaron Marcuse-Kubitza
01:03 PM Revision 1743: root Makefile: test-all: Added missing_mappings
Aaron Marcuse-Kubitza
01:00 PM Revision 1742: Moved maps validation targets from main Makefile to input.Makefile. main Makefile: maps validation: Summarize the output of the inputs' maps validations.
Aaron Marcuse-Kubitza
12:22 PM Revision 1741: Makefile: Also find missing input mappings, in addition to missing join mappings
Aaron Marcuse-Kubitza
12:21 PM Revision 1740: join: Also produce warnings for no input mapping (if no comment explaining why no input mapping), in addition to no join mapping
Aaron Marcuse-Kubitza
12:21 PM Revision 1739: join: Also produce warnings for no input mapping (if no comment explaining why no input mapping), in addition to no join mapping
Aaron Marcuse-Kubitza
12:20 PM Revision 1738: inputs/NY/maps/DwC.specimens.csv: Documented why there is no input mapping for key
Aaron Marcuse-Kubitza
11:29 AM Revision 1737: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined fields stem* to remove the stem* prefix to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:23 AM Revision 1736: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation/plotObservation user-defined fields sourceaccessioncode to sourceAccessionCode to be consistent with VegX case sensitivity
Aaron Marcuse-Kubitza
11:19 AM Revision 1735: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined field interceptCm to lineCover to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:18 AM Revision 1734: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined field individualCode to authorPlantCode to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:17 AM Revision 1733: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined field htFirstBranchM to heightFirstBranch to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:15 AM Revision 1732: VegX-VegBIEN.organisms.csv: Renamed individualOrganismObservation user-defined field coverPercent to cover to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:12 AM Revision 1731: VegX-VegBIEN.organisms.csv: Renamed abioticObservation user-defined field siltPercent to silt to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:11 AM Revision 1730: VegX-VegBIEN.organisms.csv: Renamed abioticObservation user-defined field sandPercent to sand to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:10 AM Revision 1729: VegX-VegBIEN.organisms.csv: Renamed abioticObservation user-defined field pottasium to potassium to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:08 AM Revision 1728: VegX-VegBIEN.organisms.csv: Renamed abioticObservation user-defined field organicPercent to organic to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:07 AM Revision 1727: VegX-VegBIEN.organisms.csv: Renamed abioticObservation user-defined field clayPercent to clay to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:06 AM Revision 1726: VegX-VegBIEN.organisms.csv: Renamed abioticObservation user-defined field cationCap to cationExchangeCapacity to be consistent with VegBIEN
Aaron Marcuse-Kubitza
11:02 AM Revision 1725: VegX-VegBIEN.organisms.csv: Renamed plotObservation user-defined field precipMm to precipitation to be consistent with VegBIEN
Aaron Marcuse-Kubitza
10:56 AM Revision 1724: VegX-VegBIEN.organisms.csv: Changed plotObservation user-defined field plotMethodology to /simpleUserdefined[name=method]/*ID/method/name
Aaron Marcuse-Kubitza
10:46 AM Task #304 (Resolved): Complete full dataset imports to VegBIEN via VegX of NYBG and SALVIAS
Aaron Marcuse-Kubitza
10:45 AM Task #319 (Resolved): Update statistics/lists of user-defined fields in use in VegX and VegBIEN
* *[[VegX]]*: "Convert user-defined fields to first-class fields"
* *[[VegBIEN schema]]*: "Remaining user-defined fi...
Aaron Marcuse-Kubitza
10:43 AM Task #320: Convert user-defined VegX fields to first-class fields
user-defined fields to convert: *[[VegX]]*: "Convert user-defined fields to first-class fields"
Aaron Marcuse-Kubitza
10:42 AM Task #321 (Resolved): Convert user-defined VegBIEN fields to first-class fields
Aaron Marcuse-Kubitza
10:42 AM Task #373 (Resolved): map all specimens data in raw_data
Aaron Marcuse-Kubitza
09:47 AM Revision 1723: schemas/postgresql.nimoy.conf: Increased default_statistics_target to 8.4 default value to improve query execution plans
Aaron Marcuse-Kubitza
09:43 AM Revision 1722: Added schemas/postgresql.Mac.conf (for tuning developers' local testing DBs)
Aaron Marcuse-Kubitza
09:42 AM Revision 1721: schemas/postgresql*.conf: Increased checkpoint_segments and checkpoint_completion_target so that checkpoints (performance intensive) are written less often and load-balanced better
Aaron Marcuse-Kubitza
08:55 AM Task #289 (Resolved): look for formal mapping mechanism
Aaron Marcuse-Kubitza
08:35 AM Revision 1720: xml_dom.py: Don't print whitespace from parsed XML document when pretty-printing XML. minidom modifications section: Added subsection labels for the class each modification applies to.
Aaron Marcuse-Kubitza
08:20 AM Revision 1719: Parser.py: Renamed SyntaxException to SyntaxError because it's an unexpected condition that should exit the program, a.k.a. an error
Aaron Marcuse-Kubitza
08:05 AM Revision 1718: bin/map: process_rows(): When iterating over each row, only retrieve the next row if the end (limit of # of rows) has not been reached. This prevents the next row from being fetched, possibly causing an entire additional consecutive XML document to be parsed, if the limit has already been reached. This is primarily useful for XML inputs with a ".0.top" segment prepended before the other documents, which contains just the first two nodes for fast parsing of this smaller XML document when only the first two nodes are needed for testing. Without this fix, the ".0.top" segment would have needed to contain the first three nodes instead.
Aaron Marcuse-Kubitza
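Note on r1718: the fix described above can be sketched as a loop that checks the limit before asking the iterator for another row. This is an illustration only; the signature and the process callback are assumptions, not the actual bin/map code.

```python
def process_rows(rows, limit, process):
    """Process at most `limit` rows; never fetch a row past the limit.

    rows: any iterable of rows; limit: int or None for no limit;
    process: hypothetical per-row callback.
    """
    rows = iter(rows)
    count = 0
    # Test the limit first, so next() is not called once it is reached
    while limit is None or count < limit:
        try:
            row = next(rows)
        except StopIteration:
            break
        process(row)
        count += 1
    return count
```

Because the limit is tested before next() is called, a lazily parsed source is never asked for a row that would be discarded.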
07:55 AM Revision 1717: inputs/XAL: Accepted initial test outputs
Aaron Marcuse-Kubitza
07:54 AM Revision 1716: inputs/XAL: Added maps
Aaron Marcuse-Kubitza
07:52 AM Revision 1715: bin/map: Extended consecutive XML document support to direct-XML inputs (without a map spreadsheet). Factored out consecutive XML document row-iteration code into helper method get_rows() which does the iters.flatten() and itertools.imap() calls.
Aaron Marcuse-Kubitza
07:37 AM Revision 1714: bin/map: Fixed bug in iteration over consecutive XML documents where only the first element of the first document was processed. Use of iters.flatten() and itertools.imap() fixes this problem so that the consecutive XML documents are regarded as a continuous stream of rows.
Aaron Marcuse-Kubitza
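Note on r1714: a sketch of treating consecutive documents as one continuous stream of rows. The log names iters.flatten() and itertools.imap() (Python 2); the version below uses the Python 3 equivalents, and doc_rows is a hypothetical function returning the rows of one document.

```python
import itertools

def flatten(iterables):
    """Lazily chain an iterable of iterables into one iterator."""
    return itertools.chain.from_iterable(iterables)

def get_rows(docs, doc_rows):
    """Yield the rows of every document in order, one document at a time."""
    # map() is lazy in Python 3, like itertools.imap() in Python 2
    return flatten(map(doc_rows, docs))
```

Each document is only expanded into rows when the previous document's rows are exhausted.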
07:16 AM Revision 1713: bin/map: Use new xml_parse.docs_iter() to iterate over each consecutive XML document in stdin
Aaron Marcuse-Kubitza
07:16 AM Revision 1712: xml_parse.py: Added support for parsing consecutive XML documents in a stream
Aaron Marcuse-Kubitza
07:01 AM Revision 1711: Added iters.py
Aaron Marcuse-Kubitza

03/29/2012

10:33 PM Revision 1710: streams.py: Added FilterStream. Changed TracedStream to use FilterStream.
Aaron Marcuse-Kubitza
10:24 PM Revision 1709: Moved parse_str() from xml_dom.py to xml_parse.py
Aaron Marcuse-Kubitza
10:24 PM Revision 1708: Added xml_parse.py
Aaron Marcuse-Kubitza
10:21 PM Revision 1707: streams.py: CaptureStream: Ignore start_str when recording and end_str when not recording
Aaron Marcuse-Kubitza
10:13 PM Revision 1706: streams.py: CaptureStream: Get each match as a separate array elem instead of concatenated together
Aaron Marcuse-Kubitza
09:59 PM Revision 1705: ch_root, repl, map: Use new maps.col_info() instead of parsing col name manually. This allows maps with prefixes containing ":" to be supported, without the ":" being misinterpreted as the label-root separator.
Aaron Marcuse-Kubitza
09:57 PM Revision 1704: maps.py: Added col_info() to get label, root, prefixes from col_name. Added col_formats() for use by combinable(). Use new col_formats() in combinable(). Removed no longer needed col_label().
Aaron Marcuse-Kubitza
09:55 PM Revision 1703: input.Makefile: Use with_cat instead of with_cat_csv for XML sources
Aaron Marcuse-Kubitza
09:54 PM Revision 1702: Renamed inputs/XAL/src/digir.xml.make to digir.specimens.xml.make so it would generate an output file with the proper table name
Aaron Marcuse-Kubitza
08:53 PM Revision 1701: bin/map: Support concatenated XML documents for XML inputs
Aaron Marcuse-Kubitza
08:46 PM Revision 1700: bin/map: Merged XML inputs with and without a map into the in_is_xml section
Aaron Marcuse-Kubitza
08:33 PM Revision 1699: digir_client: Output profiling information
Aaron Marcuse-Kubitza
08:21 PM Revision 1698: Added inputs/XAL/src/digir.xml.make
Aaron Marcuse-Kubitza
08:21 PM Revision 1697: digir_client: Import http to take advantage of httplib modifications to deal with IncompleteRead errors
Aaron Marcuse-Kubitza
08:20 PM Revision 1696: Added http.py with httplib modifications to deal with IncompleteRead errors
Aaron Marcuse-Kubitza
07:46 PM Revision 1695: digir_client: Fixed bug where chunk size was being adjusted even if count == None (indicating no determinable last chunk), causing a type mismatch between None and the integer total
Aaron Marcuse-Kubitza
07:28 PM Revision 1694: input.Makefile: Removed no longer needed "ifneq ($(wildcard test/),)" guard around Testing section because all inputs now have a test subdir
Aaron Marcuse-Kubitza
07:25 PM Revision 1693: Added inputs/XAL
Aaron Marcuse-Kubitza
07:22 PM Revision 1692: digir_client: Made chunk_size a configurable env var. Removed schema env var because schema is always the same for DiGIR (can be different for TAPIR). Make sure output ends in a newline so that consecutive XML documents are on different lines.
Aaron Marcuse-Kubitza
07:13 PM Revision 1691: digir_client: Fixed bug where chunk_size records would always be retrieved even in the last chunk, which ignored any manual count the user might have set via the "n" option
Aaron Marcuse-Kubitza
07:07 PM Revision 1690: digir_client: Repeatedly retrieve data in chunks. Provide match count. Added section comments.
Aaron Marcuse-Kubitza
06:52 PM Revision 1689: xpath.py: Added get_value() to run get_1() and return the value of any result node
Aaron Marcuse-Kubitza
06:51 PM Revision 1688: xml_dom.py: Added parse_str()
Aaron Marcuse-Kubitza
06:13 PM Revision 1687: digir_client: Use new streams.copy() to copy returned data to stdout
Aaron Marcuse-Kubitza
06:13 PM Revision 1686: streams.py: Added copy(). Added section comment for traced streams.
Aaron Marcuse-Kubitza
06:06 PM Revision 1685: digir_client: Label debugging output
Aaron Marcuse-Kubitza
05:54 PM Revision 1684: streams.py: Renamed LineCountOutputStream to LineCountStream since TracedStream now works on both input and output streams
Aaron Marcuse-Kubitza
05:52 PM Revision 1683: digir_client: Capture diagnostics for later use in determining next start/count values
Aaron Marcuse-Kubitza
05:51 PM Revision 1682: streams.py: Added CaptureStream to wrap a stream, capturing matching text. Renamed TracedOutputStream to TracedStream and made it work on both input and output streams. Made TracedStream inherit from WrapStream so that close() would be forwarded properly.
Aaron Marcuse-Kubitza
05:23 PM Revision 1681: bin/map: Changed XML input prefix handling to prepend prefix directly to XPath instead of separating it from the XPath with a "/". Changed get_with_prefix() to use new strings.with_prefixes().
Aaron Marcuse-Kubitza
05:21 PM Revision 1680: strings.py: Added with_prefixes()
Aaron Marcuse-Kubitza
04:56 PM Revision 1679: digir_client: Made schema customizable
Aaron Marcuse-Kubitza
04:35 PM Revision 1678: digir_client: Set header sendTime, source dynamically. In debug mode, print the request XML.
Aaron Marcuse-Kubitza
04:03 PM Revision 1677: Added local_ip to get local IP address
Aaron Marcuse-Kubitza
03:48 PM Revision 1676: bin/map: Added prefixes support for XML inputs
Aaron Marcuse-Kubitza

03/28/2012

11:12 PM Revision 1675: digir_client: Filter by darwin:Kingdom=PLANTAE because presumably all records will have this. Don't debug-print URL.
Aaron Marcuse-Kubitza
11:07 PM Revision 1674: Added initial bin/digir_client
Aaron Marcuse-Kubitza
07:58 PM Revision 1673: Renamed timeout.py to timeouts.py. Renamed timeout_ vars to timeout.
Aaron Marcuse-Kubitza
07:52 PM Revision 1672: opts.py: get_env_var(): default defaults to None
Aaron Marcuse-Kubitza
06:35 PM Revision 1671: inputs/SpeciesLink: Accepted test outputs for new TAPIR download
Aaron Marcuse-Kubitza
06:03 PM Revision 1670: bin/tapir/tapir2flat.php: Output to specieslink.specimens.csv instead of specieslink.txt so that the output file can be used right away without renaming
Aaron Marcuse-Kubitza
05:52 PM Revision 1669: inputs/REMIB/src/nodes.make: Stop after a configurable # of empty responses (indicating no more nodes), instead of at a preset node ID, because there seem to be many more nodes than are listed on the web form
Aaron Marcuse-Kubitza

03/27/2012

11:10 PM Revision 1668: input.Makefile: import/rotate: Add "." before the date
Aaron Marcuse-Kubitza
11:08 PM Revision 1667: input.Makefile: Added targets for editing import: import/rotate, import/rm
Aaron Marcuse-Kubitza
09:41 PM Revision 1666: bin/tapir/tapir2flat.php: Fixed XML parsing to strip control chars so DOMDocument::loadXML() wouldn't complain about "PCDATA invalid Char value 8 in Entity", etc.
Aaron Marcuse-Kubitza
09:07 PM Revision 1665: main Makefile: php-Darwin: Added instruction to set PHPRC if needed
Aaron Marcuse-Kubitza
09:03 PM Revision 1664: Added inputs/SpeciesLink/src/tapir.make
Aaron Marcuse-Kubitza
09:03 PM Revision 1663: input.Makefile: `src/%: src/%.make`: Don't tee recipe's stderr to make's stderr, because long-running make_scripts usually will be tracked using `tail -f`
Aaron Marcuse-Kubitza
09:00 PM Revision 1662: input.Makefile: `src/%: src/%.make`: Name the log file using the make_script name instead of the output file name
Aaron Marcuse-Kubitza
08:31 PM Revision 1661: cat_csv: If dialect == None, ignore that file because it's empty
Aaron Marcuse-Kubitza
08:30 PM Revision 1660: csvs.py: stream_info(): If header_line == '', set dialect to None rather than trying (and failing) to auto-detect it
Aaron Marcuse-Kubitza
08:19 PM Revision 1659: input.Makefile: Use new sort_filenames to put multiple numbered sources in the correct order, dealing correctly with embedded numbers that aren't padded with leading zeros
Aaron Marcuse-Kubitza
08:18 PM Revision 1658: Added sort_filenames to sort a list of filenames, comparing embedded numbers numerically instead of lexicographically
Aaron Marcuse-Kubitza
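Note on r1658: numeric comparison of embedded numbers is commonly done with a "natural sort" key. The sketch below is one such approach and is not taken from the actual sort_filenames script; natural_key() is a hypothetical helper.

```python
import re

def natural_key(name):
    """Split a name into text and number parts; numbers compare as ints."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', name)]

def sort_filenames(names):
    """Sort filenames so that e.g. node.2 comes before node.10."""
    return sorted(names, key=natural_key)
```

A plain lexicographic sort would put node.10.csv before node.2.csv; this key avoids that without requiring zero-padded numbers.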
07:18 PM Revision 1657: schemas/postgresql.conf: Decreased shared_buffers again because 4000MB wasn't far enough below the 4GB SHMMAX
Aaron Marcuse-Kubitza
07:16 PM Revision 1656: schemas/postgresql.conf: Expressed shared_buffers in MB, since decimal GB doesn't seem to work anymore on 9.1
Aaron Marcuse-Kubitza
07:14 PM Revision 1655: schemas/postgresql.conf: Decreased shared_buffers to 3.9GB, slightly less than SHMMAX
Aaron Marcuse-Kubitza
07:11 PM Revision 1654: schemas/postgresql.conf: Optimized again using same changes as were applied to 8.4 version
Aaron Marcuse-Kubitza
07:10 PM Revision 1653: schemas/postgresql.conf: Replaced with original 9.1 version
Aaron Marcuse-Kubitza
07:03 PM Revision 1652: schemas/postgresql.conf: Optimized using settings analogous to those in postgresql.nimoy.conf
Aaron Marcuse-Kubitza
06:43 PM Revision 1651: inputs/REMIB/src/nodes.make: Don't abort entire import on empty response, because an empty response is also returned for nodes that are temporarily down, not just nodes that don't exist (assumed to be after the highest numbered node). Instead, stop import after 150 nodes if user did not specify an explicit # nodes.
Aaron Marcuse-Kubitza
05:50 PM Revision 1650: inputs/REMIB/src/nodes.make: Abort prefix on empty response using break, rather than just done = True, to avoid running any more code except the finally block. Moved metadata row validation outside metadata row retrieval try-except block.
Aaron Marcuse-Kubitza
05:41 PM Revision 1649: inputs/REMIB/src/nodes.make: If a read times out, abort the entire node rather than just the prefix to avoid waiting 20 sec for each of 26*26 prefixes
Aaron Marcuse-Kubitza
05:40 PM Revision 1648: profiling.py ItersProfiler, exc.py ExPercentTracker: Only output fraction of rows with errors if self.iter_ct > 0, to avoid divide-by-zero error
Aaron Marcuse-Kubitza
04:55 PM Revision 1647: inputs/REMIB/src/nodes.make: Fixed bug where row count was output in the middle of the row processing code, instead of after the first row is processed and the row count incremented. This removes "Processed 0 row(s)" messages at the beginning of every prefix.
Aaron Marcuse-Kubitza
04:40 PM Revision 1646: inputs/REMIB/src/nodes.make: Support custom starting node ID and # nodes processed via env vars
Aaron Marcuse-Kubitza
04:29 PM Revision 1645: Renamed inputs/REMIB/src/nodes.all.0.header.specimens.csv to node.0.header.specimens.csv so it would sort correctly with the new output file names
Aaron Marcuse-Kubitza
04:27 PM Revision 1644: Renamed inputs/REMIB/src/nodes.all.specimens.csv.make to inputs/REMIB/src/nodes.make since it will not be used to generate nodes.all.specimens.csv. However, it can still be used with the `src/%.make` make target, but will generate a dummy empty output file "nodes".
Aaron Marcuse-Kubitza
04:21 PM Revision 1643: inputs/REMIB/src/nodes.all.specimens.csv.make: Write each node to a separate output file
Aaron Marcuse-Kubitza
04:00 PM Revision 1642: inputs/REMIB/src/nodes.all.specimens.csv.make: Raise InputException instead of AssertionError if invalid metadata row, so that it will be caught and printed instead of aborting the program
Aaron Marcuse-Kubitza
03:56 PM Revision 1641: inputs/REMIB/src/nodes.all.specimens.csv.make: Moved header reading code inside TimeoutException try-except block since read sometimes times out before the header is even read
Aaron Marcuse-Kubitza
03:55 PM Revision 1640: schemas/postgresql.nimoy.conf: Increased shared_buffers to 1.5GB since kernel.shmmax has been increased to 2GB
Aaron Marcuse-Kubitza

03/26/2012

11:07 PM Revision 1639: Renamed inputs/REMIB/src/remib_raw.0.header.specimens.txt to nodes.all.0.header.specimens.csv
Aaron Marcuse-Kubitza
10:57 PM Revision 1638: inputs/REMIB/src/nodes.all.specimens.csv.make: Increased read timeout
Aaron Marcuse-Kubitza
10:55 PM Revision 1637: inputs/REMIB/src/nodes.all.specimens.csv.make: Timeout stuck reads because sometimes nodes are offline, etc.
Aaron Marcuse-Kubitza
10:53 PM Revision 1636: exc.py: str_(): Strip trailing whitespace. print_ex(): Since str_() now strips trailing whitespace, strings.ensure_newl() is no longer necessary.
Aaron Marcuse-Kubitza
10:43 PM Revision 1635: streams.py: Added TimeoutInputStream and WrapStream. Changed StreamIter to use new WrapStream.
Aaron Marcuse-Kubitza
10:42 PM Revision 1634: Added timeout.py
Aaron Marcuse-Kubitza
10:25 PM Revision 1633: inputs/REMIB/src/nodes.all.specimens.csv.make: Download from all prefixes of all nodes. Stop when a node produces an empty response (not even an error), which indicates no more nodes. Changed status messages.
Aaron Marcuse-Kubitza
10:17 PM Revision 1632: input.Makefile: `src/%: src/%.make`: Append stderr to log file
Aaron Marcuse-Kubitza
09:21 PM Revision 1631: Added inputs/REMIB/src/nodes.all.specimens.csv.make to download REMIB data for all nodes
Aaron Marcuse-Kubitza
09:20 PM Revision 1630: Added streams.py for I/O, which contains StreamIter, TracedOutputStream, and LineCountOutputStream
Aaron Marcuse-Kubitza
09:20 PM Revision 1629: term.py: Added clear_line. Corrected file comment.
Aaron Marcuse-Kubitza
08:06 PM Revision 1628: Makefiles: Let subdir's Makefile decide whether to delete on error
Aaron Marcuse-Kubitza
08:05 PM Revision 1627: input.Makefile: Save partial outputs of aborted src make scripts
Aaron Marcuse-Kubitza
06:44 PM Revision 1626: input.Makefile: Fixed bug in `%: %.make` rule to use $< instead of $*
Aaron Marcuse-Kubitza
06:20 PM Revision 1625: mappings/DwC2-VegBIEN.specimens.csv: minimumElevationInMeters: Remove any "ca." prefix
Aaron Marcuse-Kubitza
06:19 PM Revision 1624: xml_func.py: _replace: Strip whitespace from the returned string
Aaron Marcuse-Kubitza
06:09 PM Revision 1623: csvs.py: Added TsvReader to support TSV quirks. Added reader_class(). reader_and_header(): Use reader_class() to automatically use TsvReader instead of csv.reader for TSVs. Added is_tsv() and use it where `dialect.delimiter == '\t'` was used.
Aaron Marcuse-Kubitza
06:06 PM Revision 1622: strings.py: Added extract_line_ending() and remove_line_ending(). ensure_newl(): Use new remove_line_ending(). Moved Parsing section to top since it is used by the other sections.
Aaron Marcuse-Kubitza
04:40 PM Revision 1621: csvs.py: stream_info(): Set dialect.quoting = csv.QUOTE_NONE for TSVs because they usually don't quote fields. Factored dialect detecting code into new function sniff().
Aaron Marcuse-Kubitza
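Note on r1621: a sketch of reading a TSV with quoting disabled, using the standard csv module. read_tsv() is a hypothetical helper, not part of csvs.py; the point is that with csv.QUOTE_NONE a stray double quote is kept as ordinary data instead of starting a quoted field.

```python
import csv
import io

def read_tsv(stream):
    """Read tab-separated rows, treating quote characters as plain text."""
    reader = csv.reader(stream, delimiter='\t', quoting=csv.QUOTE_NONE)
    return list(reader)

# Example: the unbalanced quote does not swallow the following tab
rows = read_tsv(io.StringIO('a\t"b\tc\n'))
```

Without QUOTE_NONE, the reader would treat the quote as the start of a quoted field and merge the remaining columns.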
03:45 PM Revision 1620: input.Makefile: verify: Added reverify option, which can be turned off to prevent regenerating the verify/%.out file from the DB (which can be time-consuming), and instead just diff verify/%.out with verify/%.ref
Aaron Marcuse-Kubitza
 
